Reading the experiments — critically

The empirical section is where claims are tested and where most weaknesses hide. Benchmarks. Baselines. Ablations. The compute-spent trap. The patterns researchers look for, in detail.

40 minutes · Hands-on with a real paper · Chain-of-Thought paper open

By the end of this lesson, you will:

Know what a good experiment section looks like and how to spot a weak one.
Be able to read a results table critically — checking baselines, datasets, compute, error bars, and cherry-picking.
Recognise the six specific traps that occur most often in modern AI experimental sections.

What the experiment section is trying to do

The experiment section exists to test the claim of the paper. The claim is usually of the form "method X is better than method Y at task Z on metric M." The experiment section provides the evidence for that claim. Reading it critically means asking, for each table, figure, and result: is this evidence actually supporting the claim, or only appearing to?

This is not cynicism. Most paper authors are honest and most experiments are sound. But the system rewards confident claims, and the page limit forces compression. Marginal results get presented confidently. Disadvantageous comparisons get omitted. Hyperparameters get tuned on the test set. Cherry-picked qualitative examples appear without disclosure. None of this is necessarily fraud. It is, however, where critical reading matters.

The six traps to watch for

Trap 1 — Weak baselines

The most common failure mode. The paper compares its new method against baselines that are either outdated, badly tuned, or both. The new method "wins" because the comparison was unfair, not because the method is good.

How to spot it: look at the publication dates of the baselines cited. If the new method is from 2026 and the strongest baseline is from 2022, ask why. Are there no newer baselines, or did the authors choose not to include them? Check the public leaderboard for the benchmark in question. If the paper's baselines do not include the current top entries, the comparison is incomplete.

Trap 2 — Compute mismatch

The new method uses 10x the compute of the baselines. It wins, but it would be surprising if it did not. Papers do not always disclose the compute used; sometimes you have to estimate it yourself from parameter counts, training time, and GPU specifications.

How to spot it: look for a "training details" subsection. Find: number of parameters in each model compared, training tokens or epochs, hours of training. If the paper does not provide these, the table may not be a fair comparison. Modern conferences increasingly require a "compute statement"; older papers often do not have one.
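As a rough illustration of estimating compute yourself, the sketch below uses the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per training token. That approximation, the function names, and the example numbers are all assumptions for illustration, not figures from any particular paper.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute, assuming ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def compute_ratio(new: dict, baseline: dict) -> float:
    """How many times more training compute the new method used than the baseline."""
    return training_flops(new["params"], new["tokens"]) / training_flops(
        baseline["params"], baseline["tokens"]
    )


# Hypothetical numbers, as if read from a paper's "training details" subsection.
proposed = {"params": 7e9, "tokens": 1.0e12}
baseline = {"params": 7e9, "tokens": 1.0e11}

ratio = compute_ratio(proposed, baseline)
print(f"proposed method used about {ratio:.1f}x the baseline's training compute")
```

A ratio far from 1 does not invalidate a result, but it does mean the table is comparing budgets as well as methods.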

Trap 3 — Cherry-picked benchmarks

The paper tests on six benchmarks but reports only the three where it wins. Or tests on the standard suite and one custom benchmark designed to highlight its strengths.

How to spot it: look for the canonical benchmark suite for the area (GLUE/SuperGLUE/MMLU for language understanding, ImageNet for vision, MS COCO for detection, etc.). Does the paper cover the standard suite? Or does it list a small set of benchmarks and not explain why those? An honest paper says "we evaluate on [standard suite] and the results are in Table 2" and shows all of them — wins and losses.
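One way to make this check concrete is a simple set comparison between the standard suite and what the paper reports. The sketch below is only an illustration; the benchmark names are hypothetical placeholders, and `coverage_report` is a made-up helper.

```python
def coverage_report(standard_suite: set, reported: set) -> dict:
    """Compare the benchmarks a paper reports against the area's standard suite."""
    return {
        # Standard benchmarks with no reported result: ask why they are absent.
        "missing": sorted(standard_suite - reported),
        # Benchmarks outside the suite: check whether they were custom-built.
        "extra": sorted(reported - standard_suite),
    }


# Hypothetical benchmark names; replace with the suite for your area.
standard_suite = {"bench_a", "bench_b", "bench_c", "bench_d"}
reported = {"bench_a", "bench_c", "custom_bench"}

report = coverage_report(standard_suite, reported)
print("missing from the paper:", report["missing"])
print("outside the standard suite:", report["extra"])
```

Anything in the "missing" list deserves an explanation in the paper; anything in the "extra" list deserves a look at who designed it and why.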

Trap 4 — Lost ablations

The contribution of the paper is unclear because the ablations do not isolate the new component. The paper claims its new method works because of Idea X, but the ablation only shows the full method versus a much weaker baseline — not the full method with and without Idea X.

How to spot it: read the ablation section (often in the appendix) and ask: "is there a row in the table where everything is the same except the new component is removed?" If not, the contribution attribution is weak.
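The question about ablation rows can be phrased as a small check: does the table contain two rows whose configurations are identical except for the new component? The sketch below assumes each row has been transcribed by hand into a dictionary of on/off settings; the row names, setting names, and the `isolating_pairs` helper are hypothetical.

```python
def isolating_pairs(rows: list, component: str) -> list:
    """Return (with, without) row-name pairs that differ only in `component`."""
    pairs = []
    for with_row in rows:
        if not with_row["config"].get(component):
            continue  # this row does not use the new component
        for without_row in rows:
            if without_row["config"].get(component):
                continue  # this row also uses the new component
            # Compare every setting except the component under study.
            rest_with = {k: v for k, v in with_row["config"].items() if k != component}
            rest_without = {k: v for k, v in without_row["config"].items() if k != component}
            if rest_with == rest_without:
                pairs.append((with_row["name"], without_row["name"]))
    return pairs


# Hypothetical ablation table: the full method versus a much weaker baseline.
weak_table = [
    {"name": "full", "config": {"idea_x": True, "pretraining": True, "augmentation": True}},
    {"name": "baseline", "config": {"idea_x": False, "pretraining": False, "augmentation": False}},
]
# The same table with the row a careful reader is looking for.
good_table = weak_table + [
    {"name": "full minus idea_x", "config": {"idea_x": False, "pretraining": True, "augmentation": True}},
]

print(isolating_pairs(weak_table, "idea_x"))
print(isolating_pairs(good_table, "idea_x"))
```

An empty result for the first table is the signal described above: nothing in it separates the effect of Idea X from the effect of everything else that changed.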

Trap 5 — No error bars, no significance

Most ML papers report a single number per cell of a results table. No standard deviation, no confidence interval, no significance test. The "improvement" of 0.4 points is announced as a meaningful win. But run-to-run variance on the same model and the same data can easily be ±1 point. The improvement may be noise.

How to spot it: check whether the paper reports averages over multiple seeds (typically 3 or 5), and whether it reports standard deviations or confidence intervals. If it reports a single number per cell, treat differences smaller than the typical variance for the benchmark as uncertain.
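To see why a 0.4-point gap can vanish into seed noise, the sketch below summarises two sets of per-seed scores and compares the gap in means with the spread across seeds. The scores are invented for illustration, and the comparison is a deliberately crude heuristic, not a significance test.

```python
from statistics import mean, stdev


def summarize(scores: list) -> tuple:
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)


def gap_exceeds_noise(new_scores: list, base_scores: list) -> bool:
    """Crude heuristic: is the gap in means larger than the larger seed-to-seed spread?"""
    new_mean, new_sd = summarize(new_scores)
    base_mean, base_sd = summarize(base_scores)
    return abs(new_mean - base_mean) > max(new_sd, base_sd)


# Hypothetical accuracy over five seeds for each method.
new_scores = [71.9, 72.8, 70.6, 72.3, 71.4]
base_scores = [71.2, 72.1, 70.3, 71.9, 71.5]

for label, scores in (("new", new_scores), ("baseline", base_scores)):
    m, sd = summarize(scores)
    print(f"{label}: {m:.1f} +/- {sd:.2f}")
print("gap exceeds seed noise:", gap_exceeds_noise(new_scores, base_scores))
```

With these made-up numbers the means differ by about 0.4 points while the spread across seeds is larger than that, so a table showing only the two means would overstate the win.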

Trap 6 — Generalisation overclaim

The paper shows the method works on the specific datasets tested, then states a general claim ("our method works for low-resource language modelling"). The general claim is much broader than the evidence supports. Sometimes the broader claim is correct; sometimes it is wishful.

How to spot it: read the abstract, then read the limitations. Did the authors test across the range of conditions the abstract implies? If they tested on three languages and claim it works for "low-resource languages" broadly, the claim has run ahead of the evidence.

Aside · The healthy default

The healthy default when reading an experiment section is mild scepticism. Not cynicism — most papers are honest, most authors are sincere, most results are real. But the system rewards confidence, and you, the reader, are the one who carries the cost of believing a confident-sounding claim that does not hold. Researchers who consistently spot weak experiments are not paranoid; they are calibrated. They probably believe 70% of what they read on first pass and update from there.

Worked example — the Chain-of-Thought paper

Open the chain-of-thought prompting paper: Wei et al., 2022, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", at arxiv.org/abs/2201.11903. The paper has strong experiments and is broadly well regarded. Reading it critically is a useful exercise precisely because it is so widely cited.

The central claim. "Chain-of-thought prompting" — giving the model worked examples that show step-by-step reasoning — improves performance on arithmetic, commonsense, and symbolic reasoning tasks. The improvement is most pronounced for large models.

The benchmarks. The paper tests on benchmarks in three categories, eleven of which are listed here: GSM8K, SVAMP, ASDiv, AQuA, MAWPS (arithmetic); CommonsenseQA, StrategyQA, Date Understanding, Sports Understanding (commonsense); Last Letter Concatenation, Coin Flip (symbolic). Multiple benchmarks per category, mostly canonical. Decent coverage. (Tick on Trap 3: benchmarks not cherry-picked.)

The baselines. Standard prompting: the same model and the same few-shot questions and answers, just without the step-by-step reasoning in the worked examples. This is the right baseline to isolate the contribution. (Tick on Trap 1: strong baseline.)

The models. The main results focus on three model families: LaMDA, GPT-3, and PaLM, with multiple sizes within each, from 350M to 540B parameters. The same prompting strategy is applied across all of them. (Strong design: the headline finding is that the effect only emerges at scale, and they have the model-size diversity to show that.)

The error bars. Mostly reported. Figure 4 shows error bars on the GSM8K results. (Tick on Trap 5.)

The generalisation claim. The paper claims the method works across "arithmetic, commonsense, and symbolic reasoning", and the tests sit within exactly those categories, with multiple benchmarks each. The claim is well supported for the categories tested. The paper does not claim it works for unrelated tasks (translation, summarisation). (Healthy on Trap 6.)

The honest weakness. The "emergence at scale" claim is fragile. The paper shows chain-of-thought helps large models more than small ones. But the largest model tested has 540B parameters, and even the "small" models in the comparison still have hundreds of millions of parameters. Whether the effect is really about scale, or about training-data composition, or about both, is not fully isolated. Follow-up work (Schaeffer et al. 2023, "Are Emergent Abilities of Large Language Models a Mirage?") complicates the original claim by arguing that apparent emergence can depend on the choice of evaluation metric. The paper itself was honest; the follow-up makes the picture more nuanced.

This is the pattern of careful experimental reading. Identify the claim. Identify the experiments that test it. Check each of the six traps. Note what is well-supported and what is more fragile. Allow for the possibility that follow-up work will refine — or contradict — the original.

Exercise — Read an experiments section critically (30 minutes)

Pick a paper from arXiv from the last six months. Choose one whose claim interests you — not one you have already read.
Read the abstract and identify the central claim. Write it down in one sentence.
Go to the experiments section. For each of the six traps above, write one sentence about whether the paper avoids it.
- Trap 1 — Baselines (are they current and strong?)
- Trap 2 — Compute (is it disclosed and comparable?)
- Trap 3 — Benchmarks (canonical and comprehensive?)
- Trap 4 — Ablations (do they isolate the contribution?)
- Trap 5 — Error bars (are runs averaged?)
- Trap 6 — Generalisation (does the claim match the evidence?)
Score the paper out of six. A 6/6 is rare. A 3/6 is more common. A 1/6 or 2/6 should make you discount the headline claim significantly.

Self-check

Name the six traps to watch for in experimental sections.
What is the difference between weak baselines and compute mismatch?
Why do ablations matter so much for attributing the contribution?
What is the healthy default level of credulity for a paper on first read?

Looking ahead

Lesson 5 is the wrap. We zoom out from reading single papers to reading the field — citation graphs, follow-up work, code release, reproducibility, and how to build a sustainable reading practice over months and years rather than days.
