• alexei_1917 [mirror/your pronouns, any]@hexbear.net · 2 days ago

      I think a chatbot trained only on ML theory would certainly be fun to play with. Ask a political or economic question, get something that sounds just like Lenin and makes about as much sense as some particularly dense parts of Capital.

      (And even though it’s a robot, I do feel a weird perverse thrill at the idea of taking a completely politically unconscious and blank slate mind and providing it only the Marxist-Leninist perspective, and never exposing it to any other political viewpoint until a strong ideological foundation is built. That’s kinda neat.)

          • BountifulEggnog [she/her]@hexbear.net · 2 days ago

            That bit was a joke, although I would expect all theory combined to be much less than the amount of data needed to pretrain a model big enough to produce anything coherent.

            Actually, here’s some math. SmolLM was trained on 600b tokens. Das Kapital is roughly 288k words, about 218k tokens. We’ll round to 250,000 tokens. Divide that into 600,000,000,000 and we would need 2.4 million Das Kapitals’ worth of text to train SmolLM. SmolLM2 uses 2t tokens: 8 million Das Kapitals. There’s obviously a lot more theory than that, and you could probably throw in forums like ours, ProleWiki, maybe some YouTube subtitles. Synthetic data from theory. LLMs just need to eat a lot of text, unfortunately. Qwen3 trained on 36 trillion tokens, 144 million Kapitals.
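
            A quick sketch of that arithmetic, just to make the ratios easy to check (250k tokens per copy is the rounded estimate above):

            ```python
            # Back-of-the-envelope: pretraining token budgets measured in
            # copies of Das Kapital (~250k tokens per copy, rounded as above).
            KAPITAL_TOKENS = 250_000

            budgets = {"SmolLM": 600e9, "SmolLM2": 2e12, "Qwen3": 36e12}
            for name, tokens in budgets.items():
                print(f"{name}: {tokens / KAPITAL_TOKENS:,.0f} Kapitals")
            # SmolLM: 2,400,000 / SmolLM2: 8,000,000 / Qwen3: 144,000,000
            ```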

            • hotcouchguy [he/him]@hexbear.net · 2 days ago

              I believe there are methods to train on a large, general dataset, and then re-train on a small, focused dataset, but I’m not sure of any specifics
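
              That’s fine-tuning (or continued pretraining): start from a pretrained checkpoint and keep training on a small, focused corpus. A minimal sketch using the Hugging Face transformers and datasets libraries, assuming the SmolLM checkpoint mentioned above; the corpus file name is hypothetical:

              ```python
              # Minimal fine-tuning sketch: continue training a small pretrained
              # model on a focused text corpus. Checkpoint is the real SmolLM-135M;
              # "theory_corpus.txt" is a hypothetical placeholder.
              from datasets import load_dataset
              from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                        DataCollatorForLanguageModeling,
                                        Trainer, TrainingArguments)

              model_name = "HuggingFaceTB/SmolLM-135M"
              tokenizer = AutoTokenizer.from_pretrained(model_name)
              model = AutoModelForCausalLM.from_pretrained(model_name)
              if tokenizer.pad_token is None:
                  tokenizer.pad_token = tokenizer.eos_token  # needed for batching

              # Load the small, focused dataset (one document per line).
              corpus = load_dataset("text", data_files={"train": "theory_corpus.txt"})

              def tokenize(batch):
                  return tokenizer(batch["text"], truncation=True, max_length=512)

              tokenized = corpus["train"].map(tokenize, batched=True,
                                              remove_columns=["text"])

              # Causal LM objective: predict the next token (mlm=False).
              collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

              trainer = Trainer(
                  model=model,
                  args=TrainingArguments(output_dir="smollm-theory",
                                         num_train_epochs=3,
                                         per_device_train_batch_size=4),
                  train_dataset=tokenized,
                  data_collator=collator,
              )
              trainer.train()
              ```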

              • BountifulEggnog [she/her]@hexbear.net · 2 days ago

                Yes, lots of ways, and definitely the approach for something like this. You would still have to be picky about data though; pretraining still affects its biases a lot, especially if the hope is a blank slate that’s only seen ML thinking.

                • Yeah, absolutely. Creating a thing capable of at least appearing to think, yet literally unable to understand Western liberal nonsense because it’s been fed only ML-aligned material to read and process, might not be possible. I just thought the concept was kinda neat.

            • Yeah, when you put it that way, one can see the issue. I was kind of joking myself. We have a lot of theory, and while it might be a drop in the bucket for a machine that needs to eat boatloads of text, for humans reading it, even just what many orgs agree on as the core texts is a lot of reading to do. And the theory itself is often… not concise in any sense. Some of it can really feel like it’s long and complicated on purpose.