• INeedMana@piefed.zip
    2 days ago

    Wait, so tokens aren’t “2 to 4 character” chunks cut from the input as it comes in anymore? They can be whole words too?

    • brucethemoose@lemmy.world
      2 days ago

      Pretty much. And more.

      “The end.”

      Might be a mere 3 tokens total:

      ‘“The ’ ‘end.”’ ‘\n\n’

      I don’t know about ClosedAI, but the Chinese models in particular (like Qwen, GLM and Deepseek) went crazy optimizing their tokenizers for English, Chinese, or code, with huge vocabs for common words/phrases and even common groupings of words + punctuation/spacing as single tokens. It makes the models more efficient, as the same text counts as far fewer tokens.
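
      If you want to poke at this yourself, here’s a minimal sketch using the Hugging Face transformers library. The Qwen model ID is just one example I picked; the exact splits and vocab size depend on the model:

      ```python
      # Inspect how a real tokenizer splits a short string.
      # Assumes `pip install transformers` and network access to fetch the tokenizer files.
      from transformers import AutoTokenizer

      tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

      text = '"The end."\n\n'
      pieces = tok.tokenize(text)
      print(len(pieces), pieces)  # a handful of pieces, not one per character
      print(len(tok))             # vocab size: roughly 150k entries for this family
      ```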

      “About 1 token per word” is a decent estimate for a block of text, even including spaces and punctuation.
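
      You can sanity-check that estimate the same way. Rough illustration only; the exact ratio depends on the tokenizer and the text:

      ```python
      # Rough tokens-per-word check on an ordinary block of English.
      # Same assumptions as above: transformers installed, example Qwen model ID.
      from transformers import AutoTokenizer

      tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

      sample = (
          "Tokenizers with large vocabularies map most common English words, "
          "punctuation and spacing included, to just a token or two each."
      )
      n_tokens = len(tok.encode(sample))
      n_words = len(sample.split())
      print(n_tokens, n_words, round(n_tokens / n_words, 2))  # ratio in the ballpark of 1
      ```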