Reading the method — notation, equations, diagrams

The technical heart of the paper. Notation that looks impenetrable until you know the conventions; equations that look frightening until you slow down. Here is how working researchers read this section.

40 minutes · Hands-on with a real method · Transformer paper open

By the end of this lesson, you will:

Be able to decode the common notation conventions in machine-learning papers.
Be able to read a non-trivial equation by walking through it term by term — and know when to stop reading and start re-deriving.
Be able to read architecture diagrams and pseudocode as one coherent description of the same thing.
Be able to use AI assistants as a reading partner without surrendering your own understanding.

Notation — the conventions that travel

Different papers use different symbols, but a small set of conventions is near-universal in modern machine-learning papers. Learn these once; they show up everywhere.

Bold lowercase (e.g. x, h) = vector. A one-dimensional array of numbers. The hidden state of a network at one position is usually a vector.

Bold uppercase (e.g. X, W) = matrix. A two-dimensional array. Inputs, weights, and outputs of layers are usually matrices.

Plain lowercase (e.g. d, n, k) = scalar. A single number. Often a dimension (d_model), a count (number of heads), or a position index.

Subscripts. Mean different things in different places. Most often: an index (h_i = the i-th hidden state), a role label (W_q = the query weight matrix), or a position in a sequence.

Superscripts. Usually mean "this thing is one of several similar things". W⁽¹⁾ and W⁽²⁾ in the same paper usually mean the weight matrices of layer 1 and layer 2. Do not confuse these with exponents: the parentheses are the usual hint that the superscript is a label rather than a power, and context settles the rest.

The set-membership statement next to a symbol (e.g. x ∈ ℝ^d) tells you the shape: x ∈ ℝ^d is a vector of d real numbers, and W ∈ ℝ^(d×k) is a matrix with d rows and k columns. Read these religiously. Most early confusion in reading papers is dimension confusion: you think a thing is a vector and it is actually a matrix or a higher-order tensor.

Greek letters. Usually parameters, hyperparameters, or coefficients you can ignore on a first pass. λ is often a regularisation coefficient. β is often a momentum coefficient (as in Adam's β₁ and β₂) or a weight on a loss term. α is often a weighting or a step size, and η is a common symbol for the learning rate. θ is almost always "the parameters", i.e. all the learnable weights of the network bundled together. These are habits rather than rules, so confirm each one against the paper's own definition.
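The shape conventions above map directly onto array shapes. The snippet below is a minimal sketch in NumPy; the sizes d = 4 and k = 3 and the batch shape (2, 5, d) are hypothetical values chosen only for illustration.

```python
import numpy as np

d, k = 4, 3                  # plain lowercase: scalars (here, dimensions)

x = np.ones(d)               # bold lowercase x in R^d: a vector
W = np.ones((d, k))          # bold uppercase W in R^(d x k): a matrix
h = x @ W                    # vector times matrix gives a vector in R^k

# A batch of sequences is a higher-order tensor: (batch, positions, d).
X = np.ones((2, 5, d))
H = X @ W                    # the same W is applied at every position

print(x.shape, W.shape, h.shape, H.shape)   # (4,) (4, 3) (3,) (2, 5, 3)
```

Printing shapes like this while reading is a cheap way to catch the "I thought it was a vector" mistake early.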

Aside · Where to confirm notation

Many papers have a brief "Notation" or "Preliminaries" section on the first page of the method, or just before it. Read it once at the start. If you are reading a paper without one, the symbols are defined as they are introduced, so keep a small note of "x = input, h = hidden state, W = weight" as you go. After three or four method sections this becomes automatic.

Equations — read like a sentence

Every equation in a research paper is a sentence in mathematical form. It has a subject, a verb (the equals sign or relation), and an object. Read it as a sentence — slowly — and the structure becomes visible.

Take the most-famous equation in modern AI: the scaled dot-product attention from the transformer paper.

The attention equation, deconstructed

Attention(Q, K, V) = softmax( (Q K^T) / sqrt(d_k) ) V

Read as a sentence:

  "Attention applied to queries Q, keys K, and values V
   equals
   the softmax of (Q multiplied by K-transpose, divided by the square root of d_k),
   then multiplied by V."

Now walk through each step:

1. Q K^T: matrix multiplication of queries by transposed keys.
   If Q is (n × d_k) and K is (n × d_k), then K^T is (d_k × n)
   and Q K^T is (n × n). Each entry (i,j) is the dot product
   of the i-th query with the j-th key — a score for how
   strongly position i should attend to position j.

2. /sqrt(d_k): divide everything by the square root of the
   key dimension. This is a scaling trick — it keeps the
   dot products from getting too large when d_k is large,
   which would push softmax into low-gradient regions.

3. softmax(...): normalise each row of scores into a
   probability distribution over the keys, so each row of
   the attention matrix sums to 1. Row i tells us how much
   position i attends to every position j (itself included).

4. Multiply by V: if V is (n × d_v), each output row is a
   weighted sum of the value vectors, using the attention
   weights from step 3.

Final shape: (n × d_v). Each row is a weighted combination
of the value vectors, where the weights come from the
query-key similarity scores.
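The four steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration without masking or batching, not the paper's reference implementation; the sizes n = 5, d_k = 8 and d_v = 16 are hypothetical, and the helper names `softmax` and `attention` are our own.

```python
import numpy as np

def softmax(scores):
    # Subtract each row's maximum before exponentiating for numerical
    # stability; this does not change the result.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T                  # step 1: (n x d_k) @ (d_k x n) -> (n x n)
    scaled = scores / np.sqrt(d_k)    # step 2: scale by sqrt(d_k)
    weights = softmax(scaled)         # step 3: each row sums to 1
    return weights @ V, weights       # step 4: (n x n) @ (n x d_v) -> (n x d_v)

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 16                # hypothetical sizes, for illustration only
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out, weights = attention(Q, K, V)
print(out.shape, weights.shape)       # (5, 16) (5, 5)
```

Checking that `weights.sum(axis=-1)` is all ones confirms the reading of step 3: the softmax runs along each row, over the keys.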

That is one equation. Five lines of paper, five minutes of careful reading, and you understand a substantial piece of modern AI. The technique that works: slow down, identify what each symbol is and what shape it has, read the operation as a sentence, then re-derive in your head what the operation is doing.
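As an example of re-deriving, the scaling by sqrt(d_k) can be justified in a few lines. The sketch below assumes, as the transformer paper does in a footnote, that the components of a query q and a key k are independent random variables with mean 0 and variance 1.

```latex
\begin{aligned}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i, \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - 0 = 1, \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k, \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1.
\end{aligned}
```

Under that assumption the raw scores have a standard deviation of sqrt(d_k), so they spread out as d_k grows. Dividing by sqrt(d_k) brings the variance back to 1, which keeps the softmax away from its saturated, low-gradient regions.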

Architecture diagrams — three things to look at

Figure 1 of an AI paper is almost always the architecture diagram. They look pretty; they hide a lot of detail. When you look at one, fix three things in your mind.

1. The data flow. Where does the input enter, where does the output emerge, and which arrows go forward versus which loop back or skip ahead (recurrent and residual connections)? Trace the path with your finger (literally, on the screen).

2. Which blocks are repeated. Most modern architectures have a repeated block — the transformer's "N×" notation next to the encoder block, for instance. The diagram shows the block once; the architecture stacks it many times. Find the N.

3. Where the new contribution is. Most diagrams include components that are standard (embeddings, layer norms, residual connections, output heads). The new contribution is often a single block in the diagram — sometimes highlighted, often not. Identify which block is the contribution of this paper, versus which blocks are inherited from previous work.
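The first two points, data flow and repeated blocks, can be sketched in code. The block below is a deliberately toy one (a linear map with tanh, wrapped in a residual connection) and is not the transformer's actual encoder layer; the function names `block` and `stack` and the sizes are hypothetical. N = 6 is the value the transformer paper uses for its stacks.

```python
import numpy as np

def block(x, W):
    # One toy block: a sub-layer (linear map + tanh) with a residual
    # connection, i.e. the arrow that skips around the sub-layer.
    return x + np.tanh(x @ W)

def stack(x, weights):
    # The diagram draws the block once; the model applies it once per
    # entry in `weights`, each with its own parameters.
    for W in weights:
        x = block(x, W)
    return x

rng = np.random.default_rng(0)
N, n, d = 6, 5, 8                     # N blocks, n positions, width d
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(N)]
x = rng.normal(size=(n, d))

y = stack(x, weights)
print(len(weights), y.shape)          # 6 (5, 8)
```

The input and output shapes match, which is what lets the same block be stacked N times: a useful thing to verify when tracing a diagram.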

Pseudocode — read alongside the diagram

Many papers include pseudocode for the algorithm — either as an "Algorithm 1" box in the body, or as code listings in the appendix. Pseudocode and the architecture diagram describe the same thing in two languages: one visual, one procedural. Read them together. When the diagram is unclear, the pseudocode disambiguates. When the pseudocode is unclear, the diagram disambiguates.

If a paper has neither pseudocode nor a clear diagram, that is a quality signal worth noting. Either the authors did not feel the need to clarify their method, or they were trying to obscure something. Either way, it is harder to read.

Using AI assistants as a reading partner

Modern AI assistants — Claude, ChatGPT, Gemini — are surprisingly useful for paper reading. Three valid uses, one trap to avoid.

Valid use 1 — Notation decoder. Paste an equation in and ask "walk through this equation symbol by symbol, telling me the shape of each thing." The output is usually correct and saves you ten minutes of deciphering, but check the stated shapes against the paper's own definitions before relying on it.

Valid use 2 — Background filler. When a paper references something you have not read ("we use the standard PPO objective from Schulman et al. 2017"), ask the assistant to summarise the referenced technique. Faster than reading the original cold.

Valid use 3 — Devil's advocate. After you have read a paper, ask the assistant "what are the weaknesses of this method?" or "what counter-evidence would refute this claim?". The assistant is good at brainstorming critiques you might have missed.

The trap — the summary substitution. Pasting in the paper and asking "summarise this" gives you a summary that looks competent but is actually shallow. You will believe you understood the paper. You did not — you understood the summary. Use the assistant to clarify specific parts; do not use it to replace your reading.

Exercise — Read the attention section of the transformer paper (30 minutes)

1. Open the transformer paper (arxiv.org/abs/1706.03762). Navigate to Section 3.2, "Attention".
2. Read Section 3.2.1, Scaled Dot-Product Attention, with the deconstruction from above as a guide. Pay particular attention to the dimensions stated in each line. After reading, close the paper and try to write down, in your own words, what scaled dot-product attention does. Open the paper and check.
3. Read Section 3.2.2, Multi-Head Attention. Identify: how many heads, what each head computes, how the outputs are combined. Sketch Figure 2 from memory.
4. Read Section 3.2.3, Applications of Attention in our Model. Identify the three places attention is used in the transformer (encoder self-attention, decoder self-attention with masking, and encoder-decoder attention) and note what is different about each.
5. If you got stuck anywhere, use your AI assistant to clarify the specific part you got stuck on. Do not use it to summarise the whole section.

Self-check

1. What does it mean that a symbol appears in bold lowercase versus bold uppercase?
2. In the attention equation, why do we divide by the square root of d_k?
3. What are the three things to look at first when reading an architecture diagram?
4. What is the "summary substitution" trap, and why is it dangerous?

Looking ahead

Lesson 4 moves from method to experiments. The empirical claims of a paper are where most of the heat is — and where most of the weaknesses live. We will look at benchmarks, baselines, ablations, and the specific failure modes researchers spot when they read this section.
