Iteration and evaluation

Your first prompt is rarely your best one. The skill that separates a good prompt engineer from a beginner is not writing the perfect first prompt — it is iterating quickly and systematically when the first one falls short.

35 minutesHands-on debuggingNotebook recommended

By the end of this lesson, you will:

Have a systematic process for debugging a prompt that is not producing what you want.
Know how to A/B test two prompts and decide which works better.
Have started a personal prompt library — the practice that compounds your skill over months and years.

Why first prompts often fail

You write a prompt. You run it. The output is not what you wanted. Three things have likely happened, in roughly this order of frequency:

1. You did not communicate clearly what you wanted. The most common failure. The output is bad because you under-specified — the model filled in the gaps with its defaults, and its defaults are not your preferences.

2. The model lacks information. The model does not know the context, the facts, the source material, the prior decisions. You wrote a prompt that assumed the model knew things it does not know.

3. The model is genuinely bad at this task. The least common failure. Most tasks that AI is bad at, you would have anticipated. When this happens, the fix is either to give up and do it yourself, or to decompose the task into pieces the model is good at (Lesson 3, Pattern 4).

The first step in iteration is diagnosing which of these three is happening. The fix is different for each.

The five-step debugging process

When a prompt is not working, do these five things, in order.

Step 1 — Read what the model produced, carefully

Often the issue is in the middle of the response, not the headline. The model got 80% of the way right. Identify the specific failure: the wrong tone, the missing constraint, the fabricated fact, the misunderstood instruction. Be precise about what went wrong before you change anything.

Step 2 — Ask the model what it did

You can literally ask: "Why did you structure the answer this way?" or "What did you assume about my context?" The model will often surface assumptions it made — which tells you what to add to the prompt. This is one of the most-undervalued debugging moves.

Step 3 — Change one thing

The temptation when a prompt is failing is to rewrite it entirely. Resist. Change one thing at a time. Add specificity. Or add an example. Or remove a contradictory constraint. If you change three things and the output improves, you do not know which change helped.

Step 4 — Re-run in a fresh conversation

Always test your improved prompt in a new conversation. The previous conversation has accumulated context — your earlier messages, the model's earlier responses — that might be helping or hurting in ways you cannot see. A clean test isolates the prompt's effect.

Step 5 — Save the working version

The moment the prompt works, paste it somewhere you will find it again. A note, a text file, a notebook. The number of times you will write the same prompt from scratch because you did not save the last good version is the number that motivates most people to start a prompt library. (We come back to this below.)

A/B testing two prompts

For an important prompt — one you will use many times, or one whose output really matters — it is worth testing two candidate prompts side by side.

The method.

Pick three test inputs that represent the range of cases you expect.
Run Prompt A on all three. Save the outputs.
Run Prompt B on all three. Save the outputs.
Compare side by side. For each input, which prompt produced the better output? On what dimensions — accuracy, tone, structure, completeness, anything else you care about?
Pick the winner based on the comparison, not on which prompt feels nicer or which one was easier to write.

This sounds heavy. For an important prompt that you will use a hundred times, twenty minutes of A/B testing saves hours over time. The discipline pays back.

A/B testing — example

Task: classify customer support tickets by urgency

Prompt A: zero-shot
"Classify the support ticket below as Urgent, High, Normal, or Low."

Prompt B: few-shot with negative examples
"Classify the support ticket below. Use these definitions:
- Urgent = service down for the customer, time-bound (exam, event, transaction)
- High = significant inconvenience, not time-bound
- Normal = standard query or request
- Low = informational or suggestion

Examples:
[3 examples following the definitions]

Note: 'I'm frustrated' or 'this is unacceptable' do NOT make a ticket
Urgent. Tone is independent of urgency.

Ticket: [input]
Classification:"

Test inputs:
1. 'Cannot log in, exam in 20 minutes' → A says Urgent, B says Urgent ✓
2. 'I'm extremely frustrated about a small UI issue' → A says High, B says Low ✓
3. 'Will the analytics export be available next quarter?' → A says Normal, B says Low ✓

Result: B is more consistent, especially on tone-vs-substance edge cases.
B becomes the production prompt.

The 80/20 of prompt debugging

Across thousands of failed prompts watched and debugged, five fixes resolve the vast majority of issues. If your prompt is not working, try these in order — they are listed in approximate priority.

Fix 1 — Add an example. If the model is doing the wrong thing, one or two examples of the right thing usually fix it. This is the single highest-leverage fix.

Fix 2 — Add a constraint. If the model is including something you do not want (tone, phrasing, structure), an explicit negative instruction usually fixes it. "Do not start with 'Certainly!' or 'I'd be happy to'."

Fix 3 — Specify the format. If the output is hard to use, ask for a specific format. JSON, bullet list, table. The model handles structured output well.

Fix 4 — Add the missing context. If the model is fabricating facts or assuming wrongly, paste in the actual facts. The model is not in the room with you.

Fix 5 — Decompose the task. If one prompt is being asked to do too much, split it into smaller prompts and check each.

In experience, fixes 1 and 2 between them solve maybe two-thirds of prompt problems. The remaining third is fixes 3, 4, and 5. The point is that the fixes are systematic — there is a small, finite list to try in order, not an endless space of magic phrases to guess at.

Building a personal prompt library

The single best practice for becoming a stronger prompt engineer is to keep a library of prompts that have worked for you. Most people do not. The few who do compound their skill much faster than those who do not.

The library does not need to be sophisticated. A document, a notes app, a folder of text files. Anything you can search. What goes in it:

The prompt itself — the working version, copy-pasted.
What it does — one sentence above the prompt: "Drafts a project status update from a list of bullet points."
What model you tested it on — sometimes prompts that work well in Claude need light adjustment for ChatGPT and vice versa.
Optional: an example input and output — useful as a sanity check when you revisit the prompt months later.

After three months of light maintenance, the library is one of the most valuable things you own as a knowledge worker. After a year, it is the single piece of "intellectual property" you have built that compounds across your career. Most of mine fits in a single Markdown file.

Aside · The library as a teaching tool

The other reason to keep a prompt library: you can share it. Teaching colleagues to use AI well is one of the most-valuable things you can do at work in 2026. Pointing someone to a library of working prompts in your domain is the fastest way to get them up to your level. It is also a quietly impressive thing to be able to do in a meeting.

The most-common iteration mistake

The mistake we see most often, including in experienced users: when a prompt fails, the user makes the prompt longer. They add more instructions, more constraints, more clarifications. Sometimes this works. Often it makes things worse — long prompts can over-constrain the model, contradict themselves, or bury the most-important instruction under noise.

The corrective: when a prompt is failing and you have already tried fixes 1 and 2 from the 80/20 list, try shortening the prompt rather than lengthening it. Strip back to essentials. Often the shorter version works better. After enough iterations, you develop a sense for when to add and when to subtract.

Exercise — Debug a real prompt (25 minutes)

Find a prompt you have used recently that produced an output you were not fully happy with. Have the model run it again so you can see what is wrong.
Apply the five-step debugging process. Read carefully, ask the model what it did, change one thing, re-run in a fresh conversation, save the working version.
Now A/B test two versions. Your original prompt versus your improved one. Three test inputs. Compare the outputs.
Save the winner in your prompt library. If you do not have one yet, start one with this entry. A simple Markdown file is plenty.
Diary a reminder two weeks from now to add three more prompts to the library. The habit only sticks if you maintain it for a few weeks.

Self-check

What are the three most-common reasons a prompt fails?
What are the five steps in the debugging process?
Why does "change one thing at a time" matter?
What goes in a personal prompt library, and why is it so valuable?

Looking ahead

Lesson 5 is the level above prompts. System prompts — the constitution of any custom bot. Long-context prompts when you have a whole document to work with. Prompts for agents that take actions in the world. How prompt engineering connects to the rest of our free course library.

← Lesson 3 Lesson 5 — From prompts to systems →