• BountifulEggnog [she/her]@hexbear.net · 8 points · 3 days ago (edited)

      That bit was a joke, although I would expect all theory combined to be much less than the amount of data needed to pretrain a model big enough to produce anything coherent.

      Actually, here’s some math. SmolLM was trained on 600B tokens. Das Kapital is roughly 288k words, about 218k tokens; we’ll round to 250,000 tokens. Divide that into 600,000,000,000 and we would need 2.4 million Das Kapitals’ worth of text to train SmolLM. V2 uses 2T tokens, 8 million Das Kapitals. There’s obviously a lot more theory than that, and you could probably throw in forums like ours, ProleWiki, maybe some YouTube subtitles, and synthetic data generated from theory. LLMs just need to eat a lot of text, unfortunately. Qwen3 trained on 36 trillion tokens, 144 million Kapitals.
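      If anyone wants to check the arithmetic, here it is as a few lines of Python, taking the token counts above at face value:

      ```python
      # Back-of-the-envelope: how many Das Kapitals of text each model ate.
      KAPITAL_TOKENS = 250_000  # rounded token estimate for Das Kapital, from above

      pretraining_tokens = {
          "SmolLM": 600e9,    # 600 billion tokens
          "SmolLM v2": 2e12,  # 2 trillion tokens
          "Qwen3": 36e12,     # 36 trillion tokens
      }

      for model, tokens in pretraining_tokens.items():
          print(f"{model}: {tokens / KAPITAL_TOKENS:,.0f} Kapitals")
      # SmolLM: 2,400,000 Kapitals
      # SmolLM v2: 8,000,000 Kapitals
      # Qwen3: 144,000,000 Kapitals
      ```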

      • hotcouchguy [he/him]@hexbear.net · 6 points · 3 days ago

        I believe there are methods to train on a large, general dataset and then re-train on a small, focused dataset, but I’m not sure of any specifics.

        • BountifulEggnog [she/her]@hexbear.net · 6 points · 3 days ago

          Yes, lots of ways, and definitely the approach for something like this. You would still have to be picky about the data, though; pretraining still affects its biases a lot, especially if the hope is a blank slate that’s only ever seen ML thinking.
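          The recipe hotcouchguy is describing is pretraining followed by fine-tuning. A rough sketch of the fine-tuning half with the Hugging Face transformers library, assuming SmolLM2-135M as the small base model and a hypothetical theory.txt as the focused corpus:

          ```python
          # Rough sketch: continue training a pretrained causal LM on a small,
          # focused text corpus. "theory.txt" is a made-up placeholder file.
          from datasets import load_dataset
          from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                    DataCollatorForLanguageModeling, Trainer,
                                    TrainingArguments)

          model_name = "HuggingFaceTB/SmolLM2-135M"
          tokenizer = AutoTokenizer.from_pretrained(model_name)
          model = AutoModelForCausalLM.from_pretrained(model_name)

          # Some tokenizers ship without a pad token; reuse EOS so batching works.
          if tokenizer.pad_token is None:
              tokenizer.pad_token = tokenizer.eos_token

          # Load the focused corpus and chop it into model-sized chunks.
          dataset = load_dataset("text", data_files={"train": "theory.txt"})["train"]

          def tokenize(batch):
              return tokenizer(batch["text"], truncation=True, max_length=512)

          tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

          # Causal LM objective: predict the next token (mlm=False).
          collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

          trainer = Trainer(
              model=model,
              args=TrainingArguments(output_dir="smollm2-theory",
                                     num_train_epochs=1,
                                     per_device_train_batch_size=4),
              train_dataset=tokenized,
              data_collator=collator,
          )
          trainer.train()
          ```

          That’s the shape of it; in practice you’d also fuss over the learning rate and watch for the model forgetting its general abilities, which is exactly why the pretraining mix still matters so much.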

          • alexei_1917 [mirror/your pronouns]@hexbear.net · 3 points · 2 days ago

            Yeah, absolutely. Creating a thing capable of at least appearing to think, one that’s literally unable to understand Western liberal nonsense because it’s been fed only ML-aligned material to read and process, might not be possible. I just thought the concept was kinda neat.

      • alexei_1917 [mirror/your pronouns]@hexbear.net · 2 points · 2 days ago

        Yeah, when you put it that way, one can see the issue. I was kind of joking myself. We have a lot of theory, and while it might be a drop in the bucket for a machine that basically needs to eat boatloads of text, it’s still a lot for a human reader; even just the core texts most orgs agree on add up to a lot of reading. And the theory itself is often… not concise in any sense. Some of it can really feel like it’s long and complicated on purpose.