
The Tokenizer: How AI Models Translate Text into Numbers


How do computers actually process text?

Strictly speaking: Not at all!
Computers are pure calculating machines – internally they process numbers only.
They can do nothing at all with natural language as we speak or write it.
That’s why every text must first be translated into a form the computer can understand.

One character – one number: ASCII and ANSI

Each character (e.g., a letter or punctuation mark) is assigned a number – typically in the range of 0 to 127 (for ASCII) or up to 255 (for ANSI).

These numbers fit exactly into one byte – the smallest addressable storage unit of a computer.
With this approach, all English letters, digits, and some special characters can be represented.

flowchart TD
  classDef bigText font-size:1.0em;

  A(["⌨️ Text input"]):::bigText
  B(["🔢 Convert to numbers"]):::bigText
  C(["🧮 Computer processes numbers"]):::bigText
  D(["🔡 Re-conversion to characters"]):::bigText
  E(["🖥️ Output on screen"]):::bigText

  A --> B --> C --> D --> E

Excerpt from the ASCII Table (American Standard Code for Information Interchange):

| ID | Char | ID | Char | ID | Char | ID | Char |
|----|------|----|------|----|------|-----|------|
| 32 | (space) | 33 | ! | 34 | " | 35 | # |
| 36 | $ | 37 | % | 38 | & | 39 | ' |
| 40 | ( | 41 | ) | 42 | * | 43 | + |
| 48 | 0 | 49 | 1 | 50 | 2 | 51 | 3 |
| 65 | A | 66 | B | 67 | C | 68 | D |
| 97 | a | 98 | b | 99 | c | 100 | d |
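
This mapping can be checked directly in Python with the built-in functions ord() and chr() (a minimal sketch; the values match the ASCII table above):

```python
# ord() returns the numeric code of a character, chr() converts a number back.
for ch in ["A", "a", "!", "0"]:
    print(ch, "->", ord(ch))        # A -> 65, a -> 97, ! -> 33, 0 -> 48

print(chr(68), chr(100))            # D d  (numbers converted back to characters)
print(list("Hi!".encode("ascii")))  # [72, 105, 33]  (one byte per character)
```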

This works perfectly for American English. But this simple encoding was limited to the US market – for many other languages, it was not sufficient.

Regional Code-Pages

Outside the USA, there are many languages with additional characters:
e.g., ä, ö, ü, ñ, ç, é – or completely different scripts like Cyrillic, Arabic, Chinese.

To solve this problem, so-called code pages were introduced:

  • Each country or region had its own character table
  • The characters from byte 128–255 were assigned differently depending on the code page

Examples of Regional Character Encodings (Code Pages):

| Code Page | Region / Language | Example Characters |
|-----------|-------------------|---------------------|
| ASCII | USA / English | A, B, !, @ |
| ISO-8859-1 | Western Europe | é, ä, ö, ü |
| Windows-1252 | Western Europe (Microsoft) | €, „, “, – |
| ISO-8859-6 | Arabic | ا, ب, ت |
| ISO-8859-8 | Hebrew | א, ב, ג |

This worked locally – but when exchanging data across language or country borders it led to chaos:

  • Texts were displayed incorrectly
  • A text could be readable in one country, but absolutely incomprehensible in another
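
How this chaos comes about can be illustrated with a small Python sketch: one and the same byte value is decoded as a different character depending on the assumed code page (the byte 0xE9 here is just an arbitrary example):

```python
# A single byte with the value 233 (0xE9) ...
raw = bytes([0xE9])

# ... decoded under three different code pages:
for codec in ["cp1252", "iso8859_6", "iso8859_8"]:
    try:
        print(codec, "->", raw.decode(codec))
    except UnicodeDecodeError:
        print(codec, "-> not defined in this code page")

# Without knowing which code page the sender used, the receiver cannot
# reliably reconstruct the intended character.
```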

Unicode – one character set for the whole world

To solve this problem permanently, several tech companies founded the Unicode Consortium.

Goal:

A common standard that can uniquely encode all characters of all languages in the world.

The Unicode Standard was born – and with it, it became possible to process:

  • Chinese characters
  • Cyrillic, Arabic
  • Mathematical symbols
  • Emojis 😄
  • … and many other types of characters

Unicode characters are stored using 1 to 4 bytes each, depending on the encoding (e.g., UTF-8, UTF-16, UTF-32).
UTF-8 is the most widely used encoding today: it represents every Unicode character with a variable length of 1 to 4 bytes, is space-efficient, and remains fully backward compatible with ASCII.
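
A short Python sketch shows this variable length and the ASCII compatibility in practice:

```python
# The number of UTF-8 bytes per character grows with the code point.
for ch in ["A", "é", "中", "😄"]:
    encoded = ch.encode("utf-8")
    print(ch, "->", list(encoded), f"({len(encoded)} byte(s))")

# Pure ASCII text is encoded byte-for-byte identically in UTF-8:
print(list("Hi!".encode("utf-8")))  # [72, 105, 33] – same as ASCII
```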

This finally made possible:

  • Cross-language text exchange
  • Consistent display on all devices
  • No chaos with code pages

But beware: Not all texts are stored in Unicode format – so the trouble isn’t completely over yet …

Machine text understanding: far more than just words

Making sense of text is an enormous challenge for computers: what comes naturally to us humans poses huge problems for machines.
With traditional programming techniques, this problem could not be solved.

Earlier approaches tried to analyze texts using simple algorithms – for example, to detect emails as spam.

  1. Pattern Recognition: Programs searched the text for typical word patterns or phrases that indicate spam.
  2. Bag-of-Words Model (BoW): The text was broken down into individual words and these were counted.
    Terms like “free” or “win” were statistically more likely to be found in spam messages.

These algorithms sometimes produced surprisingly good results – but they had nothing to do with real text understanding.
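
The Bag-of-Words idea can be sketched in a few lines of Python. The spam word list and the scoring rule below are purely illustrative assumptions; real filters learned such weights from large amounts of labeled mail:

```python
from collections import Counter
import re

# Hypothetical spam indicators – chosen for illustration only.
SPAM_WORDS = {"free", "win", "winner", "prize", "cash"}

def bag_of_words(text: str) -> Counter:
    """Split the text into lowercase words and count them (Bag-of-Words)."""
    return Counter(re.findall(r"[a-zäöüß]+", text.lower()))

def spam_score(text: str) -> float:
    """Toy score: the share of words that appear in the spam word list."""
    counts = bag_of_words(text)
    total = sum(counts.values())
    hits = sum(counts[w] for w in SPAM_WORDS)
    return hits / total if total else 0.0

print(spam_score("Win a FREE prize now!"))        # high score -> looks like spam
print(spam_score("Meeting moved to 3 pm today"))  # 0.0 -> looks harmless
```

Word order and meaning are ignored entirely, which is exactly why this is not real text understanding.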

Transformer: The Breakthrough

It was only through neural networks – and especially through the introduction of transformer models – that it became possible to capture language with context and meaning. This marked the beginning of a new era in machine language understanding.

Using ingenious techniques, LLMs (Large Language Models) manage to relate the meanings of individual words – more precisely: tokens – to one another.
This allows them not only to read texts but also to understand them semantically.

But even with LLMs, everything begins with an essential step:
The text must first be converted into a form the computer can understand – namely: into numbers.

For this, the input is broken down into small units – so-called tokens.

Text fragments – Tokens

Tokens are the fundamental building blocks of a text.
They can be whole words, word parts, or even individual characters.
Each token is assigned a unique numeric ID – via the so-called vocabulary, a table of all known tokens.
These IDs form the basis for the computational operations in the language model.

Excerpt from the vocabulary

| Token | ID | Description |
|-------|----|-------------|
| the | 464 | most frequent English word |
| hello | 7592 | whole word |
| Grüß | 29213 | subword (e.g., in „Grüße“) |
| e | 68 | single letter |
| ! | 0 | punctuation mark |

GPT models use about 50,000 tokens in their vocabulary (e.g., GPT-3: 50,257 tokens).
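
The exact size of such a vocabulary can be queried with OpenAI's tiktoken library (assuming it is installed, e.g. via pip install tiktoken):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the BPE vocabulary used by GPT-2/GPT-3
print(enc.n_vocab)                   # 50257 token IDs in total
```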

How was this vocabulary created?

Before a language model like GPT is trained, special algorithms analyze billions of texts to identify the most frequent character sequences.
These sequences – whole words, word parts, or individual characters – form the basis of the so-called vocabulary, i.e., the token table.

The procedure behind this is called Byte Pair Encoding (BPE).
The goal is to find recurring text patterns that can be efficiently represented as tokens.
Only on the basis of these tokens can the language model learn to compute with text and derive meaning.
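
The core of BPE can be sketched in a few lines of Python: start with single characters, repeatedly find the most frequent adjacent pair, and merge it into a new token. The toy word frequencies below are made up for illustration:

```python
from collections import Counter

def pair_counts(words):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starts as a sequence of characters.
words = {tuple("lower"): 5, tuple("low"): 7, tuple("newest"): 6, tuple("widest"): 3}

for step in range(5):
    pairs = pair_counts(words)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    words = merge_pair(words, best)
    print(f"merge {step + 1}: {best[0]} + {best[1]} -> {best[0] + best[1]}")
```

Each merge adds one entry to the merge table; applied in order, these merges later reproduce the same splits on new text.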

LLMs like ChatGPT are true language geniuses – they understand over 90 languages

The vocabulary is not limited to words from a single language.
Models like ChatGPT are multilingual – they have been trained on texts in over 90 languages.

Accordingly, the vocabulary contains tokens from many different language spaces, for example:

  • English
  • German
  • Chinese
  • Arabic
    … and many more.

This way, the model can not only recognize text in different languages – it can also understand, analyze, and further process it.

The Tokenizer

The tokenizer is the component that breaks a text down into smaller units – so-called tokens.


(GPT-3 uses the same tokenizer as GPT-2 – based on Byte Pair Encoding (BPE).)

This process takes place in 4 steps:

flowchart TD
  classDef step fill:#eef,stroke:#888,stroke-width:1px,rx:10,ry:10,font-size:1em;

  A[🧮 UTF-8-Encoding]:::step
  B[🔤 Byte-to-Unicode-Mapping]:::step
  C[🔗 Byte Pair Encoding]:::step

  subgraph LEXP [Token-ID-Assignment]
    direction LR
    D[📘Vocabulary]:::step --> E[🧾„hello“ → 7592]:::step
  end

  A --> B --> C --> LEXP
  1. UTF-8-Encoding
    The text is first converted into bytes – that is, numbers between 0 and 255.

  2. Byte-to-Unicode-Mapping
    Not all byte values correspond directly to visible characters (e.g., control characters). That’s why GPT-3 uses the so-called “byte-to-unicode trick” to uniquely convert every byte value into a representable Unicode character. This ensures that all characters can be processed safely (a sketch of this mapping follows after this list).

  3. Byte Pair Encoding (BPE)
    The character sequence is then broken down into tokens using Byte Pair Encoding.
    Frequently occurring character or word parts are combined into larger units according to a merge table.
    Example: t + h → th

  4. Token-ID-Assignment
    The identified tokens are then looked up in the vocabulary.
    Each token has a unique numeric ID there, which the model can work with.
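
The byte-to-unicode trick from step 2 can be sketched as follows. The function mirrors the idea of the bytes_to_unicode helper published with OpenAI's original GPT-2 encoder; treat it as an illustration rather than the exact production code:

```python
def bytes_to_unicode():
    """Map every byte value 0-255 to a printable Unicode character.

    Bytes that already correspond to printable characters keep their own
    character; all remaining byte values (control characters etc.) are
    shifted to unused code points starting at 256.
    """
    printable = (list(range(ord("!"), ord("~") + 1)) +
                 list(range(ord("¡"), ord("¬") + 1)) +
                 list(range(ord("®"), ord("ÿ") + 1)))
    byte_values = printable[:]
    code_points = printable[:]
    n = 0
    for b in range(256):
        if b not in byte_values:
            byte_values.append(b)
            code_points.append(256 + n)
            n += 1
    return dict(zip(byte_values, map(chr, code_points)))

mapping = bytes_to_unicode()
print(mapping[72])   # 'H' – printable bytes keep their character
print(mapping[10])   # the newline byte is replaced by a stand-in character
```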

At the end of this process, a list of token IDs is created – that is, a sequence of numbers that the language model processes as input.

Example: “Hello, my name is ChatGPT.”

| Token | Hello | , | my | name | is | … |
|-------|-------|---|----|------|----|---|
| ID | 15496 | 11 | 616 | 1438 | 318 | … |
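
The whole pipeline can be reproduced with OpenAI's tiktoken library, which also ships the GPT-2/GPT-3 encoding (a minimal sketch, assuming tiktoken is installed):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")    # GPT-2/GPT-3 BPE tokenizer

ids = enc.encode("Hello, my name is ChatGPT.")
print(ids)           # list of token IDs, starting with 15496 for "Hello"
print(enc.decode(ids))  # -> "Hello, my name is ChatGPT."

# Inspect every token individually:
for i in ids:
    print(i, repr(enc.decode([i])))
```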

An interesting aspect is the so-called “greedy behavior”:
The tokenizer always selects the longest possible matching token from the vocabulary.

Nevertheless, an existing token like “grüße” is not always used as such.
Why?

  • It depends on an exact match (capitalization, encoding).
  • If no exact match is found, the word is split into subtokens – e.g., Grüß + e or gr + ü + ße.
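
This behavior is easy to observe by decoding the tokens of a word one by one; capitalization and a leading space change the split (again a sketch using tiktoken):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for word in ["Grüße", "grüße", " Grüße"]:
    ids = enc.encode(word)
    parts = [enc.decode([i]) for i in ids]
    print(repr(word), "->", ids, parts)
# The same word is split into different subtokens depending on
# capitalization and whether it starts with a space.
```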

Tiktoken – the more modern variant

Newer models like GPT-4 no longer use the original GPT-2/3 vocabulary but a revised one, provided by OpenAI’s tokenizer library Tiktoken. Tiktoken is still based on Byte Pair Encoding (BPE), but has been optimized internally to be faster, more Unicode-safe, and more robust with multilingual input.
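
With tiktoken, both vocabularies can be compared directly; the example text is arbitrary:

```python
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2/GPT-3 tokenizer
gpt4 = tiktoken.encoding_for_model("gpt-4")  # cl100k_base, used by GPT-4

text = "Grüße aus München! 😄"
print(len(gpt2.encode(text)), "tokens with the GPT-2/3 vocabulary")
print(len(gpt4.encode(text)), "tokens with the GPT-4 vocabulary")
```

The newer vocabulary typically needs fewer tokens for non-English text.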


© 2025 Oskar Kohler. All rights reserved.
Note: The text was written manually by the author. Stylistic improvements, translations as well as selected tables, diagrams, and illustrations were created or improved with the help of AI tools.