• alexei_1917 [mirror/your pronouns, any]@hexbear.net · 2 days ago

      I think a chatbot trained only on ML theory would certainly be fun to play with. Ask a political or economic question, get something that sounds just like Lenin and makes about as much sense as some particularly dense parts of Capital.

      (And even though it’s a robot, I do feel a weird perverse thrill at the idea of taking a completely politically unconscious and blank slate mind and providing it only the Marxist-Leninist perspective, and never exposing it to any other political viewpoint until a strong ideological foundation is built. That’s kinda neat.)

          • BountifulEggnog [she/her]@hexbear.net · 2 days ago

            That bit was a joke, although I would expect all theory combined to be much less than the amount of data needed to pretrain a model big enough to produce anything coherent.

            Actually, here’s some math. SmolLM was trained on 600b tokens. Das Kapital is roughly 288k words, about 218k tokens. We’ll round to 250,000 tokens. Divide that into 600,000,000,000 and we would need 2.4 million Das Kapitals’ worth of text to train SmolLM. SmolLM2 uses 2t tokens: 8 million Das Kapitals. There’s obviously a lot more theory than that, and you could probably throw in forums like ours, ProleWiki, maybe some YouTube subtitles. Synthetic data from theory. LLMs just need to eat a lot of text, unfortunately. Qwen3 trained on 36 trillion tokens, 144 million Kapitals.
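
            A quick sketch of that arithmetic, just to make the ratios easy to check (250k tokens per copy is the rounded estimate above):

            ```python
            # Back-of-the-envelope: pretraining token budgets measured in
            # copies of Das Kapital (~250k tokens per copy, rounded as above).
            KAPITAL_TOKENS = 250_000

            budgets = {"SmolLM": 600e9, "SmolLM2": 2e12, "Qwen3": 36e12}
            for name, tokens in budgets.items():
                print(f"{name}: {tokens / KAPITAL_TOKENS:,.0f} Kapitals")
            # SmolLM: 2,400,000 / SmolLM2: 8,000,000 / Qwen3: 144,000,000
            ```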

            • hotcouchguy [he/him]@hexbear.net · 2 days ago

              I believe there are methods to train on a large, general dataset, and then re-train on a small, focused dataset, but I’m not sure of any specifics
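
              That’s fine-tuning (or continued pretraining): start from a pretrained checkpoint and keep training on a small, focused corpus. A minimal sketch using the Hugging Face transformers and datasets libraries, assuming the SmolLM checkpoint mentioned above; the corpus file name is hypothetical:

              ```python
              # Minimal fine-tuning sketch: continue training a small pretrained
              # model on a focused text corpus. Checkpoint is the real SmolLM-135M;
              # "theory_corpus.txt" is a hypothetical placeholder.
              from datasets import load_dataset
              from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                        DataCollatorForLanguageModeling,
                                        Trainer, TrainingArguments)

              model_name = "HuggingFaceTB/SmolLM-135M"
              tokenizer = AutoTokenizer.from_pretrained(model_name)
              model = AutoModelForCausalLM.from_pretrained(model_name)
              if tokenizer.pad_token is None:
                  tokenizer.pad_token = tokenizer.eos_token  # needed for batching

              # Load the small, focused dataset (one document per line).
              corpus = load_dataset("text", data_files={"train": "theory_corpus.txt"})

              def tokenize(batch):
                  return tokenizer(batch["text"], truncation=True, max_length=512)

              tokenized = corpus["train"].map(tokenize, batched=True,
                                              remove_columns=["text"])

              # Causal LM objective: predict the next token (mlm=False).
              collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

              trainer = Trainer(
                  model=model,
                  args=TrainingArguments(output_dir="smollm-theory",
                                         num_train_epochs=3,
                                         per_device_train_batch_size=4),
                  train_dataset=tokenized,
                  data_collator=collator,
              )
              trainer.train()
              ```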

              • BountifulEggnog [she/her]@hexbear.net · 2 days ago

                Yes, lots of ways, and definitely the approach for something like this. You would still have to be picky about data though; pretraining still affects its biases a lot, especially if the hope is a blank slate that’s only seen ML thinking.

                • Yeah, absolutely. Creating a thing capable of at least appearing to think, yet literally unable to understand Western liberal nonsense because it’s been fed only ML-aligned material to read and process, might not be possible. I just thought the concept was kinda neat.

            • Yeah, when you put it that way, one can see the issue. I was kind of joking myself. We have a lot of theory, and while it might be a drop in the bucket for a machine that needs to eat boatloads of text, for humans reading it, even just what many orgs agree on as the core texts is a lot of reading to do. And the theory itself is often… not concise in any sense. Some of it can really feel like it’s long and complicated on purpose.