Self-Attention – The Heart of Modern AI

Meaning through context

Transformers achieved their breakthrough because they capture the meaning of words not in isolation, but in their full context.

I am using the technically correct term “token” here. A token can be a word, a part of a word, or even a special character.
For the sake of simplicity, you can just think of them as “words”.

In contrast to earlier models, transformers analyze all tokens in parallel and thereby relate them to each other.
This way, they recognize semantic relationships and can process language on a deeper level.

What we know so far:

  • The meaning of individual tokens is represented by embeddings – that is, by their position in the embedding space.
  • The position in the sentence is additionally taken into account through a position encoding.
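
To make these two ingredients concrete, here is a minimal sketch in Python with NumPy. The vocabulary, dimensions, and embedding values are toy assumptions for illustration, not weights from a real model; only the sinusoidal position encoding follows the formula from “Attention Is All You Need”.

```python
import numpy as np

# Toy setup: a tiny vocabulary and random raw embeddings.
# All values are made up for illustration, not weights of a real model.
vocab = {"the": 0, "old": 1, "rusty": 2, "lock": 3}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # one row per token

def positional_encoding(seq_len, d_model):
    """Sinusoidal position encoding as in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dim = np.arange(d_model)[None, :]                # (1, d_model)
    angle = pos / np.power(10000, (2 * (dim // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])             # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])             # odd dimensions: cosine
    return pe

tokens = ["the", "old", "rusty", "lock"]
raw = embedding_table[[vocab[t] for t in tokens]]    # context-independent embeddings
x = raw + positional_encoding(len(tokens), d_model)  # add position information
print(x.shape)                                       # (4, 8): one vector per token
```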

But – is that enough to understand the meaning of an entire sentence?

Let’s look at an example:

The word “lock” can mean many things:
a padlock, a door lock – or a canal lock by the river.
Without further context, the meaning remains ambiguous. In technical terms, this is called ambiguity.

Let’s now look at a whole sentence:

Relation in the sentence: old → lock

Sounds poetic – but even here, it remains unclear what is meant.
The adjective “old” helps a little, but it still doesn’t narrow down the meaning precisely.

Let’s extend the sentence with another adjective:

Relation in the sentence: rusty → lock

Now it becomes clear: a canal lock doesn’t rust – so a padlock is probably meant.

Let’s refine the sentence once more:

Relation in the sentence: bike → lock

Now there is no longer any doubt: it is clearly a bike lock.

This example shows: Words only become unambiguous through context.

💡
This is exactly the basic principle of transformers:
They analyze all tokens in parallel and compute their mutual influence on meaning in order to capture the semantic structure of a sentence.

Encoder and Decoder in the original Transformer

The paper by Vaswani et al., “Attention Is All You Need”, originally introduced a model for machine translation.
For this, there were two key components:

  • Encoder: Captures the entire input text (e.g., a German sentence) and produces a context-dependent representation of each token – that is, a sequence of vectors representing the meaning within the sentence context.
  • Decoder: Uses this representation plus the tokens generated so far to produce the translation in the target language step by step.

Modern language models like GPT simplify this architecture:
They consist only of decoder stacks.

That may seem like a reduction – but for language generation it is a brilliant specialization, because a decoder can do both:

  • “Understand” the preceding context (through self-attention)
  • And “continue” it at the same time (by outputting the next token).

That’s why GPT uses only the decoder part and leaves out the encoder:
It doesn’t translate from one language to another, but simply continues its own story.
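
To make the “understand the preceding context” part a little more tangible: in a decoder, self-attention is causally masked, meaning every token may only look at itself and the tokens before it. Here is a minimal sketch of such a mask – an illustration of the principle, not GPT’s actual code:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: token i may only attend to tokens 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]   <- the first token sees only itself
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]  <- the last token sees the whole preceding context
```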

Raw Embedding

At the beginning, each token receives a raw embedding – a vector representation in embedding space that was learned during training. This embedding is initially context-independent: it does not yet take into account the other tokens in the context window.
The same token – for example, “lock” – is always embedded in the same way, regardless of the context.
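
A minimal sketch of what this means in practice (with toy numbers, not the weights of a real model): a raw embedding is just a table lookup, so the same token always returns exactly the same vector, no matter which sentence it appears in.

```python
import numpy as np

# Toy embedding table (random values, purely for illustration).
rng = np.random.default_rng(42)
vocab = {"the": 0, "old": 1, "rusty": 2, "canal": 3, "lock": 4}
embedding_table = rng.normal(size=(len(vocab), 4))

sentence_a = ["the", "old", "canal", "lock"]
sentence_b = ["the", "rusty", "lock"]

# Raw embeddings are a pure table lookup: context plays no role yet.
vec_a = embedding_table[vocab[sentence_a[-1]]]   # "lock" in sentence A
vec_b = embedding_table[vocab[sentence_b[-1]]]   # "lock" in sentence B
print(np.array_equal(vec_a, vec_b))              # True: same token, same raw vector
```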

💡
Raw embeddings are context-independent: same word, same vector – no matter whether “lock” means a canal lock or a padlock.
Only through self-attention does the meaning become clear from the context.

Only through the context of the surrounding tokens does the original embedding of “lock” gain a context-dependent meaning – its position in the embedding space shifts accordingly.

In the next example you can observe this yourself:
Depending on which sentence you choose, the embedding of “lock” moves closer either to the cluster of canal locks or to the cluster of padlocks.

How the transformer recognizes relationships between tokens

The Self-Attention mechanism

Complex neural mechanisms like the self-attention mechanism are often difficult for humans to grasp intuitively. Frequently, a fitting analogy helps to get a sense of the underlying principle.

There are many such analogies that try to make the concept of attention understandable. From my perspective, however, many of them fall short: they simplify the mechanism so much that you think you have understood it – but in reality you have only understood the image, not the underlying technique.

That’s why I came up with my own analogy – one that is illustrative, yet stays relatively close to how the mechanism actually works.

An analogy: tokens as consulting experts

You can imagine the self-attention mechanism as a kind of expert platform.

Consulting Platform

Each token in the sequence is an expert that provides two things:
a consulting profile (key) – that is, what it is knowledgeable about –
and specific knowledge (value) that it can share.

Another token submits a query – it specifically searches for expertise that helps it in the current context.

The self-attention mechanism takes care of the matching:
It compares the query with all profiles and calculates who fits best.
The better an expert matches the query, the more knowledge they contribute.

This creates a new representation – a new, context-dependent meaning – of the querying token:
enriched with exactly the information that is relevant at the moment – targeted and weighted.

Each token in the context window submits its query to all other tokens in order to better position itself within the overall context.

At the same time, each token also acts as a consultant, providing information that other tokens can access.

This exchange does not take place actively, but is instead computed in parallel and automatically by the self-attention mechanism.

Query, Key and Value

Each token is viewed from three different perspectives in the self-attention mechanism:

  • as ❓Query: a targeted request to obtain relevant information from the context,
  • as 🔐Key: a profile that describes what information this token provides,
  • as 📦Value: the actual information content it can share.

Self-Attention Mechanism

The self-attention mechanism compares a token’s ❓query with the 🔐keys of all other tokens.
The better a 🔐key matches the ❓query, the more strongly the associated 📦value is incorporated into the calculation.

💡
A ❓query is compared with all 🔐keys in the context – and the matching 📦values are incorporated, weighted, into the meaning of the token in the current context.

This way, a new representation is created for each token – enriched with the information from those tokens that are most relevant in the current context.
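
In code, this weighted lookup can be sketched roughly as follows. The input X and the projection matrices W_q, W_k and W_v are random toy values – assumptions for illustration, not trained weights – and the computation follows the scaled dot-product formula softmax(Q·Kᵀ/√d_k)·V from “Attention Is All You Need”:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence X of shape (seq_len, d_model)."""
    Q = X @ W_q                          # queries: what each token is looking for
    K = X @ W_k                          # keys:    each token's "consulting profile"
    V = X @ W_v                          # values:  the information each token can share
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well each query matches each key
    weights = softmax(scores, axis=-1)   # attention weights, one row per querying token
    return weights @ V                   # new, context-dependent token representations

# Toy input: 4 tokens, 8-dimensional embeddings, random projection matrices.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)   # (4, 8): each token, enriched with weighted context information
```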

We have now understood the basic principle of the self-attention mechanism – time to take a look under the hood.
For this, we need some mathematics, but don’t worry: it will remain easy to follow.

Such multidimensional relationships are often hard to imagine.
But with just a bit of mathematics, they can be surprisingly easy to follow.
My tip: stick with it – it’s worth it!


© 2025 Oskar Kohler. All rights reserved.
Note: The text was written manually by the author. Stylistic improvements, translations as well as selected tables, diagrams, and illustrations were created or improved with the help of AI tools.