We are so cooked

BoxedFenders [any, comrade/them]@hexbear.net · 1 month ago

We are so cooked

FlakesBongler [they/them]@hexbear.net · 1 month ago

reddit

This explains why it’s so confidently wrong so often

LaGG_3 [he/him, comrade/them]@hexbear.net · 1 month ago

SchillMenaker [he/him]@hexbear.net · 1 month ago

Even at sub-5% Quora is still doing some work here

FlakesBongler [they/them]@hexbear.net · 1 month ago

Quora explains why it’s so horny

Especially since half of Quora is just weird erotica

AOCapitulator [they/them, she/her]@hexbear.net · 1 month ago

Wait really? I thought it was a questions and answers site why are people posting fanfic smut there lol

FlakesBongler [they/them]@hexbear.net · 1 month ago

You should listen to the Quorators podcast

They go over all this sort of shit

axont [she/her, comrade/them]@hexbear.net · edit-2 1 month ago

AI acolytes tell me their preferred AI has the advantage of access to all the world’s data, the full knowledge of mankind and yet 9.3% of its knowledge comes from walmart.com

if 9.3% of a hypothetical humans’s knowledge came from walmart.com that person would be rightfully put in the pillory in the town square for the crime of demonic possession

Mindfury [he/him]@hexbear.net · 1 month ago

Walmart hosts the Codex Astartes in the backend, hard to access using the website manually but you can crawl it

axont [she/her, comrade/them]@hexbear.net · 1 month ago

the omnissiah manifesting physically in our universe through the machinations of retail backends. hail the motive force

LocalMaxima [she/her]@hexbear.net · 1 month ago

The chart title is a bit misleading. This isn’t the source of training data, but the sites that are linked to in responses. Google AI overview was included in the results, which kind of explains why this is list is just the sites you would expect to be at the top of a Google search

The_hypnic_jerk [he/him]@hexbear.net · 1 month ago

They automated putting “reddit” at the end of a Google search and called it agi

LeeeroooyJeeenkiiins [none/use name]@hexbear.net · 1 month ago

The llm itself admitted this!

FnordPrefect [comrade/them, he/him]@hexbear.net · 1 month ago

State Department like, “Yeah, look at all of those distinct and independent sources of information ”

but at least with yahoo on there we can be confident that grok will have lots of quality details about pregnartcy

InevitableSwing [none/use name]@hexbear.net · 1 month ago

I love that vid.

“Dangerops prangent sex? will it hurt baby top of its head?” still the best one

I don’t know if it’s best but def in the top three.

NephewAlphaBravo [he/him]@hexbear.net · edit-2 1 month ago

“gregnant” and “pregnart” live in my brain rent free forever

HexReplyBot [none/use name]@hexbear.net · 1 month ago

I found a YouTube link in your comment. Here are links to the same video on alternative frontends that protect your privacy:

BeanisBrain [he/him, they/them]@hexbear.net · 1 month ago

Allow me to propose an alternative input set:

60% marxists.org (for historical theory)
30% redsails.org (for contemporary criticism)
5% youtube.com (only transcripts of Hakim and Luna Oi videos)
5% hexbear.net (for flavor)

alexei_1917 [mirror/your pronouns]@hexbear.net · edit-2 1 month ago

I think a chatbot trained only on ML theory would certainly be fun to play with. Ask a political or economic question, get something that sounds just like Lenin and makes about as much sense as some particularly dense parts of Capital.

(And even though it’s a robot, I do feel a weird perverse thrill at the idea of taking a completely politically unconscious and blank slate mind and providing it only the Marxist-Leninist perspective, and never exposing it to any other political viewpoint until a strong ideological foundation is built. That’s kinda neat.)

BountifulEggnog [she/her]@hexbear.net · 1 month ago

You need a big dataset to train a model, unfortunately Marxist-Leninists are too short spoken.

alexei_1917 [mirror/your pronouns]@hexbear.net · 1 month ago

Short spoken? Some of our theory seems pretty damn long.

BountifulEggnog [she/her]@hexbear.net · edit-2 1 month ago

That bit was a joke, although I would expect all theory to be much less then the amount of data needed to pretrain a model big enough to produce anything- coherent.

Actually, here’s some math. SmolLM was trained on 600b tokens. Das Kapital is roughly 288k words, about 218k tokens. We’ll round to 250,000 tokens. Divided into 600,000,000,000 and we would need 2.4 million Das Kapitals worth of text to train SmolLM. V2 uses 2t tokens, 8 million Das Kapitals. There’s obviously a lot more theory then that, and you could probably throw forums like ours in, prolewiki, maybe some youtube subtitles. Synthetic data from theory. LLMs just need to eat a lot of text unfortunately. Qwen3 trained on 36 trillion tokens, 144 million Kapitals.

hotcouchguy [he/him]@hexbear.net · 1 month ago

I believe there are methods to train on a large, general dataset, and then re-train on a small, focused dataset, but I’m not sure of any specifics

BountifulEggnog [she/her]@hexbear.net · 1 month ago

Yes, lots of ways, and definitely the approach for something like this. You would still have to be picky about data though, pre training still effects its biases a lot. Especially if the hope is a blank slate that’s only seen ML thinking.

alexei_1917 [mirror/your pronouns]@hexbear.net · 1 month ago

Yeah, absolutely. Creating a thing capable of at least appearing to think, that is literally unable to understand Western liberal nonsense because it’s been fed only ML aligned material to read and process, might not be possible. I just thought the concept was kinda neat.

alexei_1917 [mirror/your pronouns]@hexbear.net · 1 month ago

Yeah, when you put it that way, one can see the issue. I was kind of joking myself, we have a lot of theory, and while it might be a drop in the bucket for a machine that needs to basically eat boatloads of text, when it comes to humans reading it, even just what a lot of orgs agree on as the core texts, is a lot of reading to do. And the theory itself is often… not short spoken or concise in any sense. Some of it can really feel like it’s long and complicated on purpose.

Saymaz@lemmygrad.ml · 1 month ago

deleted by creator

emdash [comrade/them]@hexbear.net · 1 month ago

Why did they need to pirate every book on Anna’s Archive if they were just going to cite social media and product advertisements?

ElChapoDeChapo [he/him, comrade/them]@hexbear.net · 1 month ago

Well they had to do it quick before the FBI took them down on accout of these tech demons reporting them to the FBI after the API training

take_five_moments [any]@hexbear.net · 1 month ago

target.com

lmao

FlakesBongler [they/them]@hexbear.net · 1 month ago

Home of some of the worst wannabe police-cop LP guys ever

dannoffs [he/him]@hexbear.net · 1 month ago

How on earth do they get almost 5% from home depot?

LaGG_3 [he/him, comrade/them]@hexbear.net · 1 month ago

InevitableSwing [none/use name]@hexbear.net · 1 month ago

The trombone garden hose was invented in 1782 on sale now!

segfault11 [any]@hexbear.net · 1 month ago

woodenghost [comrade/them]@hexbear.net · edit-2 1 month ago

Fucking Amazon? Why? For badly translated product descriptions and fake reviews? Those already were the closest thing to AI texts, before AI even existed.

Walmart? Really? Can it get any worse?

LinkedIn

Noooo!

varmint [he/him]@hexbear.net · 1 month ago

Why does this add up to way more than 100%?

roux [they/them, xe/xem]@hexbear.net · edit-2 1 month ago

They used AI to generate the chart.

XxFemboy_Stalin_420_69xX [none/use name]@hexbear.net · 1 month ago

presumably bc the same prompt can generate citations from multiple sites

Rod_Blagojevic [none/use name]@hexbear.net · 1 month ago

GrouchyGrouse [he/him]@hexbear.net · 1 month ago

Time to edit all 400,000 of my Reddit comments to be about the 1997 point-and-click videogame Star Wars: Yoda Stories

BoxedFenders [any, comrade/them]@hexbear.net · 1 month ago

And I’d like to add that one of the reasons why Reddit is so high on this list is that they have positioned themselves as a source for easily scrapable data for the big AI players. It is now the highest priority of the company to appease the tech giants starved for cheap user content to feed its AI monsters. Reddit’s stock has also gone up 300% in the last year just for these partnerships alone.

Des [she/her, they/them]@hexbear.net · 1 month ago

i really wish I had taken that IPO offer fucking reddit

coolusername [none/use name]@hexbear.net · 1 month ago

it has a PE of over 300

MF_COOM [he/him]@hexbear.net · 1 month ago

How does it scrape YouTube? Like the comments? Or the videos that have transcripts? Or the output from closed captioning?

BountifulEggnog [she/her]@hexbear.net · 1 month ago

Yes, everything.

The_Filthy_Commie@lemmygrad.ml · 1 month ago

A scatological Ourobouros.

ShimmeringKoi [comrade/them]@hexbear.net · 1 month ago

I guess that explains the writing style