
AI update explained

[r/ML] I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]


Impact: 7/10


Why it matters

A dataset of this size and age is rare: it captures three decades of informal, threaded discussion from the pre-web and early-web internet. That makes it a distinctive resource both as potential pretraining data for large language models and for studying how online communication and culture evolved over the period.


Summary

A Reddit user announced the completion of a 103-billion-token Usenet corpus spanning 1980–2013. The dataset comprises 408 million posts from 18,347 newsgroups across 9 hierarchies and went through extensive processing, including full deduplication and removal of binary content.
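
The post names its cleanup steps but does not detail the pipeline. As a rough illustration of what "full deduplication and binary removal" can look like, here is a minimal Python sketch; the post schema, field names, and exact-hash strategy are assumptions for illustration, not the author's method.

```python
import hashlib

def clean(posts):
    """Drop binary-group posts and exactly duplicated bodies.

    `posts` is assumed to be an iterable of dicts with 'newsgroup' and
    'body' keys; this schema is illustrative, not the author's format.
    """
    seen = set()
    for post in posts:
        # Binary hierarchies (e.g. alt.binaries.*) carry encoded files,
        # not prose, so the post says they were excluded outright.
        if ".binaries." in "." + post["newsgroup"] + ".":
            continue
        # Hash a whitespace-normalised body so trivially reposted
        # messages collapse to a single copy.
        digest = hashlib.sha256(
            " ".join(post["body"].split()).encode("utf-8")
        ).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield post
```

Production pipelines often go further, using near-duplicate detection such as MinHash rather than exact body hashes, since quoted replies and signature blocks make verbatim matches rare.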

What happened

An individual user on Reddit's r/MachineLearning subreddit announced the completion and documentation of a large Usenet corpus, which they have been assembling and processing over several years. The creator describes it as one of the larger privately held pretraining corpora available.

Key details

  • **Size:** 103.1 billion tokens in the cl100k_base encoding (see the counting sketch after this list).
  • **Content:** 408 million posts.
  • **Coverage:** Spans 33 years, from 1980 to 2013.
  • **Scope:** 18,347 newsgroups across 9 newsgroup hierarchies.
  • **Processing:** Full deduplication and removal of binary content (e.g., the `alt.binaries.*` hierarchies were excluded).
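
For scale, 103.1 billion tokens across 408 million posts averages roughly 253 tokens per post. The post does not say how the count was produced; a minimal sketch of counting tokens in the cl100k_base encoding with the openly available `tiktoken` library, using a hypothetical input file, might look like this:

```python
import tiktoken

# cl100k_base is the encoding the post's 103.1B figure is stated in.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(path: str) -> int:
    total = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            # disallowed_special=() treats strings like <|endoftext|>
            # as ordinary text instead of raising an error.
            total += len(enc.encode(line, disallowed_special=()))
    return total

if __name__ == "__main__":
    print(count_tokens("usenet_sample.txt"))  # hypothetical sample file
```

Run against a text dump this prints a single total; at corpus scale one would presumably shard the files and sum per-shard counts.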

What to watch

While the corpus is currently privately held, a dataset this large and historically rich could be significant for future research. Its unusual time frame and content could offer distinct advantages for training specialized language models focused on historical internet communication, cultural shifts, or domain knowledge prevalent in early online communities.

Editorial note

AI Dose summarizes public reporting and links to original sources when they are available. Review the Editorial Policy, Disclaimer, or Contact page if you need to flag a correction or understand how this site handles sources.

Continue Reading

Explore related coverage about community news and adjacent AI developments: [r/ML] Why ML conference reviews sometimes feel like a “lottery” [D], [HN] Show HN: Hackamaps – A global hackathon map I build after hitting Lovable Limits, [r/ML] A Hackable ML Compiler Stack in 5,000 Lines of Python [P], [r/ML] Phosphene local video and audio generation for Apple Silicon open source (LTX 2.3) [P].
