
Large Language Models: How an AI Model Understands Our Language


What do we expect from a language model?

We want it to:

  • answer questions
  • assist with writing
  • compose emails
  • summarize texts

In short: that it produces text.

And that is exactly its purpose:
A language model was developed and trained to generate text.

💡
A language model generates text – nothing more, nothing less.
It is not a reference work, not a conscious mind – but a statistical text generator.

It is not a knowledge database, has no consciousness, and does not understand language in the human sense.
But it generates text – endlessly and reliably.
That is its function. That is its “drive”.

How does a language model generate text?

It generates text purely statistically – token by token.
Not by page, not by sentence – not even by word.
But one linguistic building block after another – technically: a token.
What exactly that is, we’ll explore later – for now, let’s simply think of it as a word.

💡
A language model generates text purely statistically – word by word.
Each new word is calculated individually and appended to the previous text –
this principle is called autoregressive.

Token animation
Mathematical, statistical, seemingly banal – yet from billions of probabilities, fluent, coherent text emerges.
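
To make this concrete, here is a minimal sketch of the autoregressive loop in Python. The function `next_token_probabilities` is a hypothetical stand-in for whatever a real model computes; the point is only the loop itself: calculate a probability distribution, pick the most likely next word, append it, and repeat.

```python
import numpy as np

# Hypothetical stand-in for a trained model: given the previous words, it
# returns one probability per word in the vocabulary. A real model computes
# this distribution with a neural network; here we just return a uniform
# placeholder so the loop itself stays visible.
def next_token_probabilities(context: list[str], vocab: list[str]) -> np.ndarray:
    return np.full(len(vocab), 1.0 / len(vocab))

def generate(prompt: list[str], vocab: list[str], max_new_tokens: int = 5) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probabilities(tokens, vocab)  # one distribution per step
        next_word = vocab[int(np.argmax(probs))]         # the most likely word wins
        tokens.append(next_word)                         # append it and repeat: autoregressive
    return tokens

vocab = ["the", "cat", "sat", "on", "mat", "."]
print(generate(["the", "cat"], vocab, max_new_tokens=3))
```

In practice, models usually sample from the distribution instead of always taking the single most likely word, but the append-and-repeat loop stays exactly the same.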

Language models existed long before ChatGPT – they have been used for many years in voice assistants like Siri or Alexa, for example.

The basic functionality is essentially the same:

A language model is trained on huge amounts of text to complete existing sentences – word by word.

The task is always the same: From the previous words, calculate the most likely next word – and append it.

Token animation
A purely statistical process: the most probable word wins – and extends the text.
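
The statistical core of this task can be illustrated with a toy model that simply counts which word follows which in a tiny, made-up corpus (a bigram model). Real LLMs learn far richer patterns with neural networks over much longer contexts, but the question they answer is the same: given what came before, which word is most likely to come next?

```python
from collections import Counter, defaultdict

# Toy illustration only: estimate the "most likely next word" from raw counts
# in a tiny, made-up corpus (a bigram model).
corpus = "the cat sat on the mat . the cat slept on the sofa .".split()

following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def most_likely_next(word: str) -> str:
    # Pick the continuation seen most often after this word.
    return following[word].most_common(1)[0][0]

print(most_likely_next("the"))  # 'cat' (the most frequent continuation of "the")
print(most_likely_next("sat"))  # 'on'
```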

Classical Language Models
#

In the past, many language models were based on recurrent neural networks (RNNs) or variants of them.
RNNs process text step by step: They read one word at a time, store the context in a hidden state, and use this context summary to statistically predict the next word.

So the text is not processed as a whole, but always sequentially.
The model moves from word to word – and at each step, it calculates which word is most likely to come next, based on the previous context.
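
The sketch below shows the idea of a single RNN step in Python, using a simple Elman-style cell with random, untrained weights purely for illustration: each word updates a fixed-size hidden state that summarizes everything read so far, and the words must be processed strictly one after another.

```python
import numpy as np

rng = np.random.default_rng(0)

embedding_dim, hidden_dim = 8, 16                      # tiny sizes, purely illustrative
W_xh = rng.normal(size=(hidden_dim, embedding_dim))    # input-to-hidden weights (untrained)
W_hh = rng.normal(size=(hidden_dim, hidden_dim))       # hidden-to-hidden weights (untrained)
b_h = np.zeros(hidden_dim)

def rnn_step(hidden_state: np.ndarray, word_vector: np.ndarray) -> np.ndarray:
    # One Elman-style RNN step: the new hidden state mixes the current word
    # with the summary of everything read so far.
    return np.tanh(W_xh @ word_vector + W_hh @ hidden_state + b_h)

# A "sentence" of five word vectors, processed strictly one after another.
sentence = [rng.normal(size=embedding_dim) for _ in range(5)]
hidden = np.zeros(hidden_dim)                          # empty context at the start
for word_vector in sentence:                           # sequential: no parallelism possible
    hidden = rnn_step(hidden, word_vector)

print(hidden.shape)  # (16,) - one fixed-size summary of the whole prefix
```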

But this technique has major disadvantages:

  • Limited context: Even with extended variants, the limit is around 100–200 words (technically: tokens) – often even less.
  • No text understanding: RNNs capture local word sequences, but no global relationships between terms.
  • Slow processing: RNN steps cannot be parallelized – each word must wait for the previous one, which makes both training and generation inefficient.

💡
Earlier language models like RNNs process text word by word – sequentially, with limited context and without true text understanding.
They are slow, not parallelizable, and only partially suitable for longer texts.

For longer texts or complex tasks, RNNs are therefore hardly practical.

The breakthrough: Large Language Models (LLMs) with transformer technology

Everything changed fundamentally in 2017 – with the publication of the groundbreaking paper
“Attention Is All You Need” by Ashish Vaswani and the team at Google Brain.

💡
Transformers are now the foundation of virtually all large language models (LLMs).
The publication “Attention Is All You Need” by Ashish Vaswani and his team was revolutionary – for me personally: the eighth wonder of the world.

This publication introduced a revolutionary new architecture: the Transformer.

What was so special about it?

  • Transformer models process all words simultaneously (in parallel)
  • They can handle much longer contexts
  • The architecture enables a deeper, semantic understanding of language

The introduction of the Transformer is considered a revolution in natural language processing (NLP).
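
At the heart of the transformer is the attention mechanism from “Attention Is All You Need”. The sketch below shows scaled dot-product attention with random placeholder matrices instead of trained weights, just enough to see that every position looks at every other position in a single, fully parallel matrix operation.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed for all
    # positions at once - this is what makes transformers parallelizable.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # relevance of every word to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence
    return weights @ V                                 # blend the value vectors accordingly

rng = np.random.default_rng(0)
seq_len, d_model = 6, 32                               # six "words", toy model size
X = rng.normal(size=(seq_len, d_model))                # stand-in word representations

# In a real transformer, Q, K and V come from learned projections of X;
# random matrices are used here purely for illustration.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)  # (6, 32): every position attends to every other, in parallel
```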

Early Transformer Models

In 2018, OpenAI released GPT (Generative Pre-trained Transformer), the first major autoregressive transformer language model – that is, a model that predicts text step by step, one word at a time.

A year later, in 2019, the next generation appeared with GPT-2 – for the first time freely available and open source.
Anyone could download the model, test it, and explore how an LLM works.

2020 saw the release of GPT-3 – with 175 billion parameters, significantly more powerful, but no longer open source. Instead, only the scientific paper was published.

With GPT-4, OpenAI fundamentally changed its strategy:
Starting with this model, unfortunately, no details about the architecture, number of parameters, or training data were published anymore.

To this day, all major language models are based on this technology:

  • ChatGPT (OpenAI)
  • Gemini (Google DeepMind)
  • Claude (Anthropic)
  • LLaMA (Meta)
  • … and many more.

Now it gets magical: How transformers really work

💡
Transformer models were a breakthrough: They process all words in parallel,
can handle long contexts, and capture semantic relationships much more precisely.

In the following sections, I’ll show you step by step
how this linguistic marvel works – clearly and vividly explained.


© 2025 Oskar Kohler. All rights reserved.
Note: The text was written manually by the author. Stylistic improvements, translations as well as selected tables, diagrams, and illustrations were created or improved with the help of AI tools.