I don’t know about ClosedAI, but the Chinese models in particular (like Qwen, GLM and Deepseek) went crazy optimizing their tokenizers for English, Chinese, or code, with huge vocabs for common words/phrases and even common groupings of words + punctuation/spacing as single tokens. It makes the models more efficient, as the same text counts as far fewer tokens.
“About 1 token per word” is a decent estimate for a block of text, even including spaces and punctuation.
Pretty much. And more.
“The end.”
Might be a mere 3 tokens total:
‘“The ’ ‘end.”’ ‘\n\n’
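If you want to see this for yourself, here's a minimal sketch using the Hugging Face transformers library and a Qwen tokenizer (the model ID is just an example; exact token counts will vary by tokenizer and version):

```python
# Minimal sketch: count tokens for a short string with a Hugging Face tokenizer.
# The model ID and the resulting token count are assumptions; any tokenizer works.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

text = '"The end."\n\n'
token_ids = tokenizer.encode(text, add_special_tokens=False)

# Print each token's surface form to see how words, punctuation,
# and newlines get merged into single vocabulary entries.
for tid in token_ids:
    print(repr(tokenizer.decode([tid])))

print(f"{len(token_ids)} tokens for {len(text)} characters")
```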