• brucethemoose@lemmy.world · 3 days ago

    Maybe design the AI to be honest and admit that it is not sure or doesn’t know?

    That’s literally what it does!

    Under the hood, LLMs output 1 ‘word’ at a time.

    Except they don’t, exactly. What the model actually outputs is a probability for each of the thousands of tokens in its vocabulary being the next word in the block of text it’s given. It’s literally just 30% "and", 20% "but", 5% "uh" and so on, for thousands of words.

    In other words, for literally every word, they’re spitting out ‘here’s a table of what I think is most likely the next word, with this confidence.’
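    If you want to see that table yourself, here’s a rough Python sketch (Hugging Face transformers, with GPT-2 as a stand-in; any small local causal LM works the same way):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # GPT-2 is just a convenient small example; swap in any causal LM you like.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The capital of Australia is"
    inputs = tok(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # one score per token in the vocab

    probs = torch.softmax(logits, dim=-1)        # scores -> the probability table
    top = torch.topk(probs, 5)

    for p, idx in zip(top.values, top.indices):
        print(f"{tok.decode([idx.item()])!r}: {p.item():.1%}")
    ```

    That prints the five most likely next tokens with their probabilities, i.e. the top of the table the model produces for every single token it generates.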

    Thing is:

    • This is hidden from users, because the OpenAI standard and such is to treat users like children with a magic box instead of giving them a peek under the hood.

    • The ‘confidence’ is per word, not for the whole answer.

    • It’s just a numerical model. The ‘confidence’ is simply a statistical guess; the model doesn’t really know, and has no way to reason out its own correctness.

    • What’s more, there’s no going back. If an LLM gets a word obviously ‘wrong,’ it has no choice but to roll with it like an improv actor. It has no backspace button. The only sort-of exception is a reasoning block, where it can follow up an error with a ‘No, wait…’

    • This output is randomly ‘sampled,’ so the most likely prediction isn’t even always chosen! It literally means that even if the LLM gets an answer ‘right,’ there’s a chance the wrong answer will appear from a pure roll of the dice, which is something OpenAI does not like to advertise. (There’s a sketch of this just after the list.)

    • This all seems stupid, right? It is! There are all sorts of papers on alternatives to sampling, on self-correction, and on getting away from autoregressive architectures entirely, all mostly ignored by the Big Tech offerings you see. There are even ‘oldschool’ sampling methods like beam search or answer trees that have largely been forgotten because they aren’t orthodoxy anymore.
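    Here’s the dice-roll part, continuing from the sketch above. Greedy decoding always takes the top token; typical sampling draws from the distribution, so a lower-probability token can win by pure chance (the temperature value is just an illustration, not anyone’s default):

    ```python
    # Greedy: always pick the single most likely token.
    greedy_id = int(torch.argmax(probs))

    # Sampling: draw from the (temperature-scaled) distribution instead.
    temperature = 0.8                      # illustrative value
    scaled = torch.softmax(logits / temperature, dim=-1)
    sampled_id = int(torch.multinomial(scaled, num_samples=1))

    print("greedy :", tok.decode([greedy_id]))
    print("sampled:", tok.decode([sampled_id]))   # can differ from run to run
    ```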

    EDIT: If you want to see this for yourself, see mikupad: https://github.com/lmg-anon/mikupad

    Or its newer incarnation in ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp

    Its UI will show all the ‘possible tokens’ for every word and highlight the confidence of what was chosen, so you can spot cases where a low-probability word got picked by the random roll. It won’t work with OpenAI, of course, as they now hide the output’s logit probabilities for ‘safety’ (aka being anticompetitive Tech Bro jerks).
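    If you run local GGUF models, llama-cpp-python exposes the same per-token probabilities programmatically. A rough sketch, assuming a local GGUF file (the path is a placeholder, and the fields follow the old completions-style logprobs format):

    ```python
    from llama_cpp import Llama

    # Placeholder path -- point this at whatever GGUF model you have locally.
    llm = Llama(model_path="./model.gguf", logits_all=True)

    out = llm("The capital of Australia is", max_tokens=8,
              temperature=0.8, logprobs=5)

    lp = out["choices"][0]["logprobs"]
    # One entry per generated token: the chosen token plus its top-5 rivals.
    for chosen, candidates in zip(lp["tokens"], lp["top_logprobs"]):
        print(repr(chosen), candidates)
    ```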

    • INeedMana@piefed.zip · 2 days ago

      Wait, so tokens aren’t just “2 to 4 character” chunks cut from the input anymore? They can be whole words too?

      • brucethemoose@lemmy.world · 2 days ago

        Pretty much. And more.

        “The end.”

        Might be a mere 3 tokens total:

        ‘“The ’ ‘end.”’ ‘\n\n’

        I don’t know about ClosedAI, but the Chinese models in particular (like Qwen, GLM and Deepseek) went crazy optimizing their tokenizers for English, Chinese, or code, with huge vocabs for common words/phrases and even common groupings of words + punctuation/spacing as single tokens. It makes the models more efficient, as the same text counts as far fewer tokens.

        “About 1 token per word” is a decent estimate for a block of text, even including spaces and punctuation.
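        You can check this directly with any model’s tokenizer. A rough sketch with Hugging Face transformers (Qwen’s tokenizer here is just one example of the big-vocab tokenizers mentioned above; the exact split varies by model):

        ```python
        from transformers import AutoTokenizer

        # One example of a modern large-vocab tokenizer; any model's tokenizer works.
        tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

        for text in ['"The end."\n\n',
                     "About one token per word is a decent estimate."]:
            ids = tok.encode(text, add_special_tokens=False)
            print(len(ids), "tokens:", [tok.decode([i]) for i in ids])
        ```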