Of Usenet News & LLM Datasets

by Shelt Garner
@sheltgarner

Last I checked, Google had a nearly complete archive of Usenet, from its founding until at least around 2000, when everything went to shit there because of porn spam. It would be dumb for Google not to include Usenet in any Large Language Model dataset.

You would have to tweak it some, of course, but there are about 20 years of high-quality words in any Usenet archive that you could use to train your LLM. A lot of it is outdated and full of vitriol, but there is also a lot of human interaction and humanity to be found there, as well.
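For what it's worth, here is a rough idea of what "tweak it some" might look like in practice. This is only a minimal sketch, not anything Google or anyone else is known to do. It assumes each archived post is a plain-text article with headers, a blank line, and then the body. The function name `clean_usenet_post` is my own invention. It uses Python's standard `email` parser to drop the headers, then throws away quoted lines (the ones starting with `>`) and anything after the conventional `-- ` signature line, so you keep mostly the poster's own words.

```python
from email import message_from_string


def clean_usenet_post(raw_article: str) -> str:
    """Return mostly the poster's own words from one raw article.

    Hypothetical helper: assumes a single-part, plain-text article.
    """
    msg = message_from_string(raw_article)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart articles are skipped in this sketch.
        return ""

    kept = []
    for line in body.splitlines():
        if line == "-- ":
            # Conventional signature delimiter: stop here.
            break
        if line.lstrip().startswith(">"):
            # Quoted text from an earlier post: skip it.
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


sample = (
    "From: alice@example.com\n"
    "Newsgroups: rec.test\n"
    "Subject: Re: hello\n"
    "\n"
    "Bob wrote:\n"
    "> old text\n"
    "I disagree.\n"
    "-- \n"
    "Alice\n"
)
cleaned = clean_usenet_post(sample)
```

Attribution lines like "Bob wrote:" survive this pass, and a real cleanup would also need to deal with spam, duplicates and odd character encodings, none of which is attempted here.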

This is so much the case that if you were to include Usenet archives in your LLM training dataset, you would probably end up with a very human-like LLM. I don’t know, maybe Google is already using its Usenet archive. Usenet was very popular back in the day.

Given how many Usenet servers there were at one point, I’m sure that if you were working on an open source LLM, you could find a few million words to train it on by scooping up all the archived Usenet posts you could find.

Or not. What do I know? But it is an intriguing use for all those words that are now just forgotten Internet history. Forgotten by everyone except me, of course. 🙂

Author: Shelton Bumgarner

I am the Editor & Publisher of The Trumplandia Report
