The anatomy of a modern AI paper

Every paper looks intimidatingly unique on the surface; underneath, almost all use the same skeleton. Once you know the skeleton, you know where to look for what you need.

35 minutes · Hands-on with a real paper · arXiv access required

By the end of this lesson, you will:

Know the eight standard sections of an AI paper and what each is for.
Be able to map a real paper — the original transformer paper — onto the skeleton.
Know which sections to skim, which to study, and which the appendix often hides.

The standard skeleton

Almost every conference-style AI paper follows the same outline. The labels vary a little; the structure does not. Here are the eight standard sections, in the order they appear.

1. Abstract

One paragraph, typically 150–250 words. The compressed claim of the entire paper. Should contain: the problem, the gap in prior work, the proposed method in one line, the central result in numbers, and the implication. Read it twice on Pass 1; if it is not clear after two reads, the writing is at fault — the authors should have made the claim more legible.

2. Introduction

One to two pages, usually. Restates the abstract in prose, with citations. Explains why the problem matters, what previous attempts have missed, and outlines the contribution. The last paragraph of the introduction is very often a list of contributions ("In this paper, we propose ... we show ... we demonstrate ..."). This list is the most useful point-by-point summary of the whole paper. Read it carefully.

3. Related work

One to two pages. The authors' map of the field. Useful for the citation graph (who has done what, recently) and for understanding the gap the paper claims to fill. On a first pass, scan this for names and papers you recognise. On a deep read, this is where you build your reading list for follow-up.

4. Method

The technical heart. Where the new contribution lives. Mathematical notation, algorithm pseudocode, architecture diagrams. This is the section that takes the longest to read well — Lesson 3 is entirely about doing so. On Pass 1 you skim it; on Pass 2 you read it carefully; on Pass 3 you mentally re-implement it.

5. Experiments

The empirical core. Tables and figures, benchmark numbers, baseline comparisons. The section where claims are tested. Most of the criticism that working researchers level at papers is about this section — how the experiments were set up, what was compared to what, what was held constant, what was changed. Lesson 4 is entirely about reading this critically.

6. Ablations

The "what happens if we remove this part" experiments. Often placed within the experiments section but worth treating separately because they do specific work — they show which parts of the method are actually doing the heavy lifting. A paper with thorough ablations is a confident paper; a paper without them often has a contribution that is more vibes than substance.
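One way to read an ablation table is as a list of score changes against the full model. The sketch below is a minimal illustration of that habit, not a standard tool: the function name `ablation_deltas`, the component names, and every score are hypothetical and not taken from any paper.

```python
def ablation_deltas(baseline_score, variant_scores):
    """Return (variant, delta) pairs, most harmful removal first.

    delta = variant score minus full-model score, so a large negative
    delta means the removed component was doing a lot of work.
    """
    deltas = [
        (name, round(score - baseline_score, 2))
        for name, score in variant_scores.items()
    ]
    return sorted(deltas, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    full_model = 26.0
    variants = {
        "without component A": 24.0,
        "without component B": 25.5,
        "without component C": 25.9,
    }
    for name, delta in ablation_deltas(full_model, variants):
        print(f"{name}: {delta:+.1f}")
```

In this made-up case, component A carries most of the gain and component C almost none. A paper that reports no such table leaves you unable to make that distinction.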

7. Discussion / limitations

Where the authors situate their results. The good ones acknowledge limitations honestly. The less good ones avoid them. Increasingly, top venues require an explicit limitations subsection — read it. What the authors will not commit to is often as informative as what they will.

8. Appendix

Increasingly long. Hyperparameter details, additional ablations, proofs, extra results that the page limit forced out of the body. Many papers' real ablations and failure modes are in the appendix because they did not flatter the headline result. Always at least skim the appendix on Pass 2 — it tells you what the authors hoped you would not read.

Aside · The order in which papers are actually read

Note that the order in which researchers read these sections is rarely the order in which they appear. A typical Pass 1 looks like: title → abstract → conclusion (yes, the conclusion before the body) → list of contributions in the introduction → figures and figure captions → section headings. Twelve minutes. You have not read the paper, but you know what it is.

Worked example — the transformer paper

Open the original transformer paper: Vaswani et al., 2017, "Attention Is All You Need", on arXiv at arxiv.org/abs/1706.03762. Read along with the next few paragraphs.

Abstract. The opening two sentences state the status quo: the dominant sequence-transduction models are recurrent or convolutional encoder-decoder networks, and the best of them connect encoder and decoder through attention. The next sentence is the claim: "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." Then the implication: better quality, more parallelisable, less training time. Then the results in numbers: BLEU scores on English-to-German (28.4 on WMT 2014) and English-to-French translation. One paragraph, the whole paper.

Introduction. About a page. Walks through why recurrent models are bottlenecks for parallelism, why attention has been useful but tied to recurrence, and arrives at the contribution: an architecture that uses attention without recurrence. The last paragraph is the contribution statement, written as a short statement rather than a formal list.

Related work / background (Section 2, "Background"). Short. Mentions ByteNet, ConvS2S, and earlier uses of self-attention.

Method (Section 3 — "Model Architecture"). The substantial part. Has its own subsections: encoder and decoder stacks, attention (scaled dot-product attention, multi-head attention), position-wise feed-forward networks, embeddings, positional encoding. This is where the famous Figure 1 (the encoder-decoder diagram) and Figure 2 (the attention mechanism) appear. We come back to this in Lesson 3.

Design rationale (Section 4, "Why Self-Attention"). A short, often-overlooked section that explains the design choice. It compares self-attention, recurrent, and convolutional layers on three criteria: total computational complexity per layer, the number of sequential operations required, and the path length between input and output positions. This is the section that justifies the contribution beyond "it works".
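To get a feel for the comparison in Section 4, it can help to plug numbers into the growth terms from the paper's Table 1. The sketch below does that with constant factors dropped, so the outputs are relative growth terms, not measured costs. The function names and the example values n = 50 and k = 3 are assumptions chosen for illustration; d = 512 matches the base model's dimension.

```python
def ceil_log(n, k):
    """Smallest integer p with k**p >= n (integer arithmetic, no float rounding)."""
    p, reach = 0, 1
    while reach < n:
        reach *= k
        p += 1
    return p


def layer_costs(n, d, k=3):
    """Growth terms per layer type, constant factors dropped.

    n: sequence length, d: representation dimension, k: convolution kernel width.
    O(1) entries are represented as 1.
    """
    return {
        "self-attention": {
            "complexity": n * n * d,      # O(n^2 * d)
            "sequential_ops": 1,          # O(1)
            "max_path_length": 1,         # O(1)
        },
        "recurrent": {
            "complexity": n * d * d,      # O(n * d^2)
            "sequential_ops": n,          # O(n)
            "max_path_length": n,         # O(n)
        },
        "convolutional": {
            "complexity": k * n * d * d,  # O(k * n * d^2)
            "sequential_ops": 1,          # O(1)
            "max_path_length": ceil_log(n, k),  # O(log_k(n))
        },
    }


if __name__ == "__main__":
    for layer, costs in layer_costs(n=50, d=512).items():
        print(layer, costs)
```

With these terms, self-attention is cheaper per layer than recurrence whenever n is smaller than d, which is the regime the paper points to for sentence-level translation. Try a long sequence (say n = 1000) to see the comparison flip.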

Training (Section 5). Hyperparameter details, optimiser choice, regularisation. The kind of stuff you skip on Pass 1 but need on Pass 2 if you intend to reproduce the work.
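As an example of the kind of detail that only matters on Pass 2, Section 5 gives the learning-rate schedule as a formula: a linear warmup followed by decay proportional to the inverse square root of the step number. The sketch below transcribes that formula under the base-model settings (d_model = 512, 4000 warmup steps). The function name `transformer_lr` is an assumption for illustration, and the sketch is not the authors' code.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (steps start at 1).

    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    for step in (100, 1000, 4000, 16000, 100000):
        print(step, transformer_lr(step))
```

The rate rises linearly until `warmup_steps`, peaks there, and then decays. Details like this are invisible in the headline result but decisive when you try to reproduce it.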

Results (Section 6). The English-to-German and English-to-French translation tables. Comparison against the strong baselines of the time (ConvS2S Ensemble, GNMT + RL Ensemble, etc.). Note that the comparison shows their model beats the baselines while using less training compute — that is the substantive claim, and the table is structured to support it.

Conclusion (Section 7). One paragraph. Restates the contribution, mentions code availability, and gestures at future work. Nothing surprising.

Where the ablations live. The transformer paper put its ablations in Table 3 of the body (rare — more recent papers tend to push them to appendices). Read Table 3. Look at what they varied (number of heads, head dimension, dropout, label smoothing, base vs. big model size) and what each variation cost in BLEU. That table tells you which architectural choices matter and by how much. The fact that the authors did this transparently is one reason the paper was so well-received.

Exercise — Apply the skeleton to a new paper (25 minutes)

Pick any AI paper from the last six months on arXiv. If you do not know what to choose, try a paper from the "Highlights" section on Papers With Code or a paper your trusted researcher friends mentioned recently.
For each of the eight sections above, locate it in the paper (it may be combined with another section, or named differently — that is fine). Write one sentence describing what is in that section for this specific paper.
Pay particular attention to the ablations. Where are they? In the body? In the appendix? Are there enough of them to give you confidence in the contribution? Or is the contribution one big claim with no isolation of which component is doing the work?
Write one paragraph summarising the paper. Then test the paragraph: would someone who has not read the paper learn anything specific from it, or just generalities?
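If you want a head start on the mapping step of the exercise, a crude keyword match over the paper's section headings gives a first draft. The sketch below is a rough aid under stated assumptions: the keyword lists and the function name `map_headings` are hypothetical choices, and the example headings are the transformer paper's top-level sections from the worked example above.

```python
# Hypothetical keyword lists; adjust them to the papers you read.
SKELETON_KEYWORDS = {
    "Abstract": ("abstract",),
    "Introduction": ("introduction",),
    "Related work": ("related work", "background", "prior work"),
    "Method": ("method", "approach", "architecture", "model"),
    "Experiments": ("experiment", "results", "evaluation"),
    "Ablations": ("ablation",),
    "Discussion / limitations": ("discussion", "limitation", "conclusion"),
    "Appendix": ("appendix", "supplementary"),
}


def map_headings(headings):
    """Assign each heading to the first skeleton section whose keyword it contains.

    Returns (mapping, unmatched); unmatched headings need human judgment.
    """
    mapping = {section: [] for section in SKELETON_KEYWORDS}
    unmatched = []
    for heading in headings:
        lowered = heading.lower()
        for section, keywords in SKELETON_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                mapping[section].append(heading)
                break
        else:
            # No keyword matched this heading.
            unmatched.append(heading)
    return mapping, unmatched


if __name__ == "__main__":
    transformer_headings = [
        "Abstract", "Introduction", "Background", "Model Architecture",
        "Why Self-Attention", "Training", "Results", "Conclusion",
    ]
    mapping, unmatched = map_headings(transformer_headings)
    for section, found in mapping.items():
        print(f"{section}: {found or 'not found as a heading'}")
    print("Needs judgment:", unmatched)
```

The empty slots and unmatched headings are the interesting part. For the transformer paper, no heading mentions ablations, because they sit in a table inside the results, and "Why Self-Attention" and "Training" fit no skeleton label at all. Keyword matching finds the labels; you still have to find the content.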

Self-check

Name the eight standard sections of an AI paper.
Which part of the introduction is the most useful summary of the paper? Why?
What specific work do ablations do?
In what order does a researcher read these sections on a first pass?

Looking ahead

Lesson 3 zooms into the hardest section — the method. We will take the transformer paper's Section 3 (attention) and walk through how to read notation, equations, pseudocode, and architecture diagrams without getting lost.
