Attention is the beginning of devotion. - Mary Oliver
This is one of my favorite quotes about mindfulness. Turns out, it’s also how AI works.
With AI now everywhere in our lives, from Claude writing code for us to ChatGPT writing our emails, I wanted to understand what’s actually happening under the hood. Turns out, it all comes down to attention.
The Problem
Before transformers, neural networks had a memory problem. Recurrent neural networks (RNNs) could barely handle 50-100 words. This is because they process text sequentially: word 1, then word 2, and so on. Information is passed through a chain like a game of telephone. By word 50, the beginning has become garbled and is barely remembered.
Core innovation
Attention! Instead of the sequential slog, every word looks at every other word simultaneously. Word 1 can directly connect to word 1000. The result: models can handle 1,000-100,000 words, a 10-1000x improvement, without sacrificing quality.
Why it’s powerful
- Speed: All words are processed in parallel. For an RNN, 100 words means 100 sequential steps; a transformer processes all 100 words at once.
- Quality: Word 1 can directly reference word 1000. The model sees the whole picture and maintains quality in its answers.
- Trade-offs: Attention is expensive. Comparing every word to every other word means N^2 operations. This quadratic cost is why longer inputs cost real money, as the quick sketch below shows.
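To make the quadratic cost concrete, here’s a quick back-of-the-envelope sketch in Python (the word counts are illustrative):

```python
# The attention score matrix has one entry per (word, word) pair,
# so the work grows with the square of the input length.
for n in [100, 1_000, 10_000, 100_000]:
    print(f"{n:>7} words -> {n * n:>18,} pairwise comparisons")
```

Going from 1,000 to 100,000 words multiplies the input by 100, but the attention work by 10,000.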
Technology deep dive
Encoder-decoder model
The original Transformer is an encoder-decoder model designed for translation. The encoder creates understanding of the input, and the decoder generates output.
Both encoder and decoder have multiple layers stacked vertically. The original paper used 6 layers each. GPT-3 has 96 layers. Modern models like GPT-4 and Claude Sonnet 4.5 are estimated to have 60-140+ layers, though the exact architectures are proprietary.
Each layer adds a level of abstraction and understanding of the input. It’s like reading a book multiple times, each time noticing different things. Each layer builds on the previous one, creating progressively deeper understanding.
Encoder layer structure (2 sub-layers)
- Self-attention: words look at each other to understand relationships and context.
- Feed-forward: process the gathered context to deepen understanding.
After all encoder layers process the input, their outputs become available to the decoder.
Decoder layer structure (3 sub-layers)
- Self-attention: look at the words generated so far
- Cross-attention: look at encoder outputs to understand the source input
- Feed-forward: process combined information to generate the next word
Cross-attention connects encoder and decoder, letting the decoder access the full context of the input at every generation step.
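Here’s a minimal NumPy sketch of how these sub-layers compose. It shows the data flow only, not a trainable model: the attention and feed_forward functions below are toy stand-ins for the real learned components.

```python
import numpy as np

def attention(queries, context):
    # Toy stand-in: each query row becomes a weighted mix of context rows.
    scores = queries @ context.T
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ context

def feed_forward(x):
    # Toy stand-in for the per-word feed-forward network.
    return np.maximum(0, x)

def encoder_layer(x):
    x = x + attention(x, x)        # 1. self-attention: words look at each other
    x = x + feed_forward(x)        # 2. feed-forward: process the gathered context
    return x

def decoder_layer(y, enc_out):
    y = y + attention(y, y)        # 1. self-attention over the words generated so far
    y = y + attention(y, enc_out)  # 2. cross-attention into the encoder's output
    y = y + feed_forward(y)        # 3. feed-forward: combine and reason
    return y

# Toy run: a 4-word source, 2 generated words, embedding size 8
rng = np.random.default_rng(0)
enc_out = encoder_layer(rng.normal(size=(4, 8)))
print(decoder_layer(rng.normal(size=(2, 8)), enc_out).shape)  # (2, 8)
```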
A concrete example
Let’s trace through the sentence “They saw their first sloth,” focusing on how attention processes the word “saw”:
In the encoder, when processing “saw”, self-attention asks: “what helps me understand this word better?” The attention weights might be:
- “They”: 45% (who did the action)
- “Sloth”: 25% (what they saw)
- “First”: 10% (contextualizes sloth)
- “Their”: 10% (possessive, ties back to “they”)
- “Saw”: 10% (itself)
The result is that “saw” now understands it’s the action performed by “they” on “sloth”. It has contextual awareness!
In the decoder, when generating “vieron”, the Spanish for “they saw”, cross-attention asks: “what do I need from the English sentence?”
Attention weights might be:
- “They”: 40% (need subject for verb conjugation)
- “Saw”: 50% (the action I’m translating)
- “Sloth”: 10% (context for what was seen)
The result: the model generates “vieron”, which is third person plural and past tense, matching “they”.
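To see what those percentages actually do, here’s a tiny NumPy example. The 4-dimensional “value” vectors are made up for illustration; the point is that the representation used to generate “vieron” is literally a weighted average:

```python
import numpy as np

# Hypothetical value vectors for the three English words (made up).
values = {
    "they":  np.array([1.0, 0.0, 0.0, 0.2]),
    "saw":   np.array([0.0, 1.0, 0.0, 0.8]),
    "sloth": np.array([0.0, 0.0, 1.0, 0.1]),
}

# The cross-attention weights from the example above.
weights = {"they": 0.40, "saw": 0.50, "sloth": 0.10}

# The vector the decoder uses to generate "vieron" is the weighted mix.
mixed = sum(w * values[word] for word, w in weights.items())
print(mixed)  # [0.4  0.5  0.1  0.49]
```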
Inside baseball: the self-attention mechanism
Here’s what’s actually happening under the hood when attention “looks at other words.” It has five key steps (sketched in code after the list):
1. Create queries (Q matrix). Each word asks a question.
   - In the encoder: “what other words help me understand my meaning?”
   - In the decoder: “what information do I need to generate the next word?”
2. Create keys (K matrix). Each word advertises what it offers.
   - “I’m a noun”
   - “I’m the subject”
   - “I’m the action”
3. Calculate compatibility (Q × K^T).
   - How well does each word’s “offer” (key) match what another word “needs” (query)?
   - The result is a score for every pair of words, showing which words are most relevant to each other.
4. Convert to probabilities (softmax): normalize the scores into attention weights that sum to 1.0.
5. Weighted combination (multiply by V): create a new representation for each word by mixing word meanings according to the attention weights.
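All five steps fit in a few lines of NumPy. This is a minimal sketch of scaled dot-product attention; in a real model the matrices Wq, Wk, and Wv are learned during training, while here they’re random:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q = X @ Wq                      # 1. queries: what each word asks for
    K = X @ Wk                      # 2. keys: what each word offers
    V = X @ Wv                      #    values: what each word contributes
    scores = Q @ K.T                # 3. compatibility of every (query, key) pair
    scores /= np.sqrt(K.shape[-1])  #    scaled by sqrt(d_k), as in the original paper
    weights = softmax(scores)       # 4. attention weights that sum to 1.0 per word
    return weights @ V              # 5. weighted combination of value vectors

# Toy run: 5 words, embedding size 8, head size 4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 4)
```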
Multi-Head Attention: Eight Perspectives at Once
What’s cool to me is that within a single self-attention layer, there are multiple processes running at once. The original paper used 8 heads, whereas modern LLMs can have 96 or more. Each “head” specializes in different patterns:
- Head 1: Subject-verb relationships
- Head 2: Word distances
- Head 3: Semantic meaning
- Head 4: Grammatical structure
- Head 5: Positional patterns
…and so on
Think of it like eight art critics analyzing the same painting. One notices color, another composition, another texture. Then they combine insights.
After all heads finish, their outputs go through an “add and norm” process (sketched in code after this list):
- Concatenation: stack the insights from the 8 perspectives.
- Linear projection: mix the 8 perspectives intelligently.
- Residual connection: add the original input to the attention output, preserving information even when attention doesn’t help.
- Layer normalization: keep values stable, which is critical for networks with 100+ layers.
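Here’s how those four steps look in code, reusing the self_attention function from the sketch above. Again, the weights are random stand-ins; real models learn the projection matrix Wo during training:

```python
import numpy as np
# (assumes softmax and self_attention from the previous sketch)

def multi_head_attention(X, heads, Wo):
    per_head = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    concat = np.concatenate(per_head, axis=-1)  # 1. concatenate the heads' insights
    mixed = concat @ Wo                         # 2. linear projection mixes them
    out = X + mixed                             # 3. residual: keep the original input
    mean = out.mean(axis=-1, keepdims=True)     # 4. layer norm keeps values stable
    std = out.std(axis=-1, keepdims=True)
    return (out - mean) / (std + 1e-5)

# Toy run: 8 heads of size 4, model size 8, so the concatenation has size 32
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(8)]
Wo = rng.normal(size=(32, 8))
print(multi_head_attention(X, heads, Wo).shape)  # (5, 8)
```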
Feed-forward network
After the self-attention mechanism, every word goes through its own feed-forward network. Attention gathers: “here’s the information for how I relate to the other words.” Feed-forward processes: “let me combine and reason about this information.”
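A sketch of that per-word processing, with made-up sizes (the original paper expanded each 512-dimensional word vector to 2048 dimensions and back):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Every word is processed independently: expand the vector,
    # apply a non-linearity (ReLU here), then project back down.
    hidden = np.maximum(0, x @ W1 + b1)
    return hidden @ W2 + b2

# Toy run: model size 8, hidden size 32
rng = np.random.default_rng(2)
x = rng.normal(size=(5, 8))
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (5, 8)
```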
How attention powers HolaMundo
HolaMundo is my AI language tutor. Drop in a URL, get flashcards, comprehension questions, and conversation practice. Here’s where attention does the heavy lifting:
Compare the student’s answer to the entire article
When a student answers a question, the attention mechanism compares the answer to the entire article, focusing on the most relevant section while giving some attention to the surrounding context and the rest of the story.
Generate specific feedback
When the answer is not quite right, the attention mechanism again finds the most relevant section and generates feedback based on that section.
After learning more about how attention works, an interesting feature to implement is adaptive difficulty: the app would track which parts stumped the student and generate follow-up questions for those specific areas. Attention makes this possible.
Without the attention mechanism, using an RNN approach, we would need to chunk the article into small pieces and might miss connections between paragraphs. Attention enables contextual, specific feedback that accelerates learning.
Key takeaways for building AI products
- Constraint: The context window is limited by attention’s quadratic cost.
- Opportunity: Attention enables explainability. This offers interesting design opportunities that leverage attention, such as highlighting relevant passages, citations, and adaptive feedback.
- Trade-offs: Understanding attention helps PMs make smarter choices, such as when to use long-context models vs. RAG (chunking + retrieval), and how to balance quality, speed, and cost.
To close out, attention fundamentally changed how models understand language, and that opens new product possibilities.
Want to see attention in action? Check out HolaMundo.