
AI update explained

[r/ML] I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]


Impact: 7/10


Why it matters

A dataset of this size and age is rare: it captures three decades of informal, threaded discussion from the pre-web and early-web internet. That makes it a distinctive resource both as potential pretraining data for large language models and for studying how online communication and culture evolved over the period.


Summary

A Reddit user announced the completion of a 103-billion-token Usenet corpus spanning 1980–2013. The dataset comprises 408 million posts from 18,347 newsgroups across 9 hierarchies and went through extensive processing, including full deduplication and removal of binary content.
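
The post names its cleanup steps but does not detail the pipeline. As a rough illustration of what "full deduplication and binary removal" can look like, here is a minimal Python sketch; the post schema, field names, and exact-hash strategy are assumptions for illustration, not the author's method.

```python
import hashlib

def clean(posts):
    """Drop binary-group posts and exactly duplicated bodies.

    `posts` is assumed to be an iterable of dicts with 'newsgroup' and
    'body' keys; this schema is illustrative, not the author's format.
    """
    seen = set()
    for post in posts:
        # Binary hierarchies (e.g. alt.binaries.*) carry encoded files,
        # not prose, so the post says they were excluded outright.
        if ".binaries." in "." + post["newsgroup"] + ".":
            continue
        # Hash a whitespace-normalised body so trivially reposted
        # messages collapse to a single copy.
        digest = hashlib.sha256(
            " ".join(post["body"].split()).encode("utf-8")
        ).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield post
```

Production pipelines often go further, using near-duplicate detection such as MinHash rather than exact body hashes, since quoted replies and signature blocks make verbatim matches rare.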

What happened

An individual user on Reddit's r/MachineLearning subreddit announced the completion and documentation of a large Usenet corpus, which they have been assembling and processing over several years. The creator describes it as one of the larger privately held pretraining corpora available.

Key details

  • **Size:** 103.1 billion tokens in the cl100k_base encoding (see the counting sketch after this list).
  • **Content:** 408 million posts.
  • **Coverage:** Spans 33 years, from 1980 to 2013.
  • **Scope:** 18,347 newsgroups across 9 newsgroup hierarchies.
  • **Processing:** Full deduplication and removal of binary content (e.g., the `alt.binaries.*` hierarchies were excluded).
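
For scale, 103.1 billion tokens across 408 million posts averages roughly 253 tokens per post. The post does not say how the count was produced; a minimal sketch of counting tokens in the cl100k_base encoding with the openly available `tiktoken` library, using a hypothetical input file, might look like this:

```python
import tiktoken

# cl100k_base is the encoding the post's 103.1B figure is stated in.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(path: str) -> int:
    total = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            # disallowed_special=() treats strings like <|endoftext|>
            # as ordinary text instead of raising an error.
            total += len(enc.encode(line, disallowed_special=()))
    return total

if __name__ == "__main__":
    print(count_tokens("usenet_sample.txt"))  # hypothetical sample file
```

Run against a text dump this prints a single total; at corpus scale one would presumably shard the files and sum per-shard counts.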

What to watch

While the corpus is currently privately held, a dataset this large and historically rich could be significant for future research. Its unusual time frame and content could offer distinct advantages for training specialized language models focused on historical internet communication, cultural shifts, or domain knowledge prevalent in early online communities.

Editorial note

AI Dose summarizes public reporting and links to original sources when they are available. Review the Editorial Policy, Disclaimer, or Contact page if you need to flag a correction or understand how this site handles sources.

Continue Reading

Explore related coverage about community news and adjacent AI developments: [r/ML] Why ML conference reviews sometimes feel like a “lottery” [D], [HN] Show HN: Hackamaps – A global hackathon map I build after hitting Lovable Limits, [r/ML] A Hackable ML Compiler Stack in 5,000 Lines of Python [P], [r/ML] Phosphene local video and audio generation for Apple Silicon open source (LTX 2.3) [P].
