Summary
A Reddit user announced the completion of a 103-billion-token Usenet corpus spanning 1980 to 2013. The dataset comprises 408 million posts from 18,347 newsgroups across 9 hierarchies and was fully deduplicated, with binary content removed.
What happened
A user on Reddit's r/MachineLearning announced that they have finished assembling and documenting a large Usenet corpus, which they had been building and processing over several years. The creator describes it as one of the larger privately held pretraining corpora available.
Key details
- **Size:** 103.1 billion tokens (cl100k_base encoding).
- **Content:** 408 million posts.
- **Coverage:** Spans 33 years, from 1980 to 2013.
- **Scope:** Covers 18,347 newsgroups across 9 newsgroup hierarchies.
- **Processing:** Included full deduplication and removal of binary content (e.g., `alt.binaries.*` hierarchies were excluded).
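The announcement as summarized here does not describe how the processing was implemented. As a rough illustration only, the sketch below shows one common way to drop binary newsgroups and remove exact duplicate posts. The post fields (`newsgroup`, `body`), the function names, and the choice of SHA-256 exact-match deduplication are all assumptions made for this example, not the creator's pipeline. The last helper simply divides the reported totals to estimate the average post length.

```python
import hashlib
from fnmatch import fnmatch

# Hypothetical exclusion list, modeled on the alt.binaries.* example above.
BINARY_PATTERNS = ("alt.binaries.*",)


def is_binary_group(newsgroup: str) -> bool:
    """Return True if the newsgroup name matches an excluded pattern."""
    return any(fnmatch(newsgroup, pattern) for pattern in BINARY_PATTERNS)


def body_fingerprint(body: str) -> str:
    """Hash a whitespace-normalized body so exact repeats share a key."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_posts(posts):
    """Yield posts outside binary groups, keeping the first copy of each body."""
    seen = set()
    for post in posts:
        if is_binary_group(post["newsgroup"]):
            continue
        fingerprint = body_fingerprint(post["body"])
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        yield post


def mean_tokens_per_post(total_tokens: float, total_posts: float) -> float:
    """Average tokens per post from corpus-level totals."""
    return total_tokens / total_posts


if __name__ == "__main__":
    # Made-up sample posts, for illustration only.
    sample = [
        {"newsgroup": "comp.lang.c", "body": "Hello,  world"},
        {"newsgroup": "comp.lang.c", "body": "Hello, world"},
        {"newsgroup": "alt.binaries.pictures", "body": "begin 644 file"},
        {"newsgroup": "rec.arts.sf", "body": "A different post"},
    ]
    for post in clean_posts(sample):
        print(post["newsgroup"], "|", post["body"])
    # Reported totals: 103.1 billion tokens over 408 million posts.
    print(round(mean_tokens_per_post(103.1e9, 408e6), 1))
```

Using the reported figures, the average works out to roughly 253 tokens per post. Exact-match hashing only catches identical bodies; a real pipeline aiming at "full deduplication" might also need near-duplicate detection for quoted replies and cross-posts, which this sketch does not attempt.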
What to watch
The corpus is currently privately held, but a dataset this large and historically rich could be significant for future research. Its time frame and content could offer distinct advantages for training specialized language models focused on historical internet communication, cultural shifts, or domain knowledge prevalent in early online communities.
Editorial note
AI Dose summarizes public reporting and links to original sources when they are available. Review the Editorial Policy, Disclaimer, or Contact page if you need to flag a correction or understand how this site handles sources.