ROMEOADVANCED ACADEMY

Lesson 3 of 5 · AI for Sport Analysts


Working with real sport data

A bot is only as good as the data and the questions you give it. We will use a free, public dataset and walk through an analysis end to end — including the small-sample traps that sport analytics is famous for.

45 minutes · Hands-on with public data · CSV download required

By the end of this lesson, you will:

  • Have a working analysis pipeline: paste data into the bot, ask it questions, read its answers critically.
  • Know what kinds of sport data exist and which kinds your bot can actually handle in a browser.
  • Be able to spot, and prompt the bot to avoid, the four sample-size traps that catch new sport analysts every season.

The three kinds of sport data

Most sport data falls into one of three categories. Your bot can help with all three, but in different ways.

Event data. Every discrete on-field action: pass, shot, tackle, carry, set piece. Usually one row per event, with columns for time, player, location, type, and outcome. This is what most public sport datasets contain or are built from. StatsBomb's open-data releases and Opta's free samples are event data; FBRef's match logs are per-match aggregates derived from event data. Both forms are easy to paste into a bot and easy to ask questions of.

Tracking data. X-y coordinates of every player on the pitch, twenty-five times per second. A single match generates millions of rows. Your browser-only bot will not load a full tracking dataset, but it can absolutely work with summaries — average positions, distance covered, pressing intensity windows — that you compute and paste in. The interesting work, once it is summarised, is the same kind of work as on event data.
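The summarising step described above might look something like the following. It is a minimal sketch, not a vendor's method: the frame layout (player, time in seconds, x and y in metres) and the function name are assumptions made for illustration, and real tracking feeds differ in units and structure.

```python
from collections import defaultdict
from math import hypot


def summarise_tracking(frames):
    """Reduce raw tracking frames to one summary row per player.

    frames: iterable of (player, t_seconds, x_m, y_m) tuples, assumed to be
    sorted by time within each player (a hypothetical layout).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # player -> [sum_x, sum_y, n]
    last = {}                                   # player -> previous (x, y)
    distance = defaultdict(float)               # player -> metres covered

    for player, _t, x, y in frames:
        s = sums[player]
        s[0] += x
        s[1] += y
        s[2] += 1
        if player in last:
            px, py = last[player]
            # Straight-line distance between consecutive frames.
            distance[player] += hypot(x - px, y - py)
        last[player] = (x, y)

    return {
        player: {
            "avg_x": sx / n,
            "avg_y": sy / n,
            "distance_m": distance[player],
        }
        for player, (sx, sy, n) in sums.items()
    }
```

The output is a few dozen rows rather than millions, which is the size a browser-based bot can actually read.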

Biometric data. Heart rate, GPS speed and distance, sleep, hydration, blood markers. Usually one row per athlete per session. Sensitive (see Lesson 2 on athlete-data discipline). Bots can help with the boring parts — formatting, charting, weekly summaries — but anything diagnostic stays human.

For this lesson we will work with event data. It is the most accessible and the easiest to put into a browser-based bot. The techniques you use on it transfer to the other two.

Pick a dataset

You have three good free options, plus a fourth if you have access to your own data. Pick whichever fits your sport.

Option A — StatsBomb open data (football). StatsBomb publishes free event data for several competitions on GitHub at github.com/statsbomb/open-data. The 2022 World Cup and the FA Women's Super League are good places to start. The JSON files are large, so we will use a derived CSV. The course site links a small sample CSV (one match, simplified columns) you can download directly.

Option B — FBRef match logs (football). fbref.com lets you export per-player and per-match stats for most top European leagues. Pick a team, a competition, and a season; "Export to CSV" gives you a usable file in seconds.

Option C — World Athletics tables (track and field). worldathletics.org publishes performance tables you can scrape into CSV. The 100m, marathon, and javelin tables are particularly clean.

Option D — Your own data. If you have access to your club's or federation's data and the permission to use it for an educational exercise, this is the best option of all. Anonymise it first if it identifies athletes; we covered the discipline in Lesson 2.
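If you go with your own data, one way to replace names before anything reaches the bot is a keyed hash. This is a sketch under assumptions: the column name "player" and both function names are illustrative, and swapping names for codes is pseudonymisation only. Shirt numbers, positions, or rare events in other columns can still identify an athlete, so apply the Lesson 2 discipline to the whole file.

```python
import hashlib
import hmac


def pseudonymise(name, secret_key):
    """Map a player name to a stable code of the form 'Player-' + 8 hex digits.

    secret_key: bytes that stay on your machine and are never pasted into
    the bot; without the key, the codes cannot be rebuilt by hashing guesses.
    """
    digest = hmac.new(secret_key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"Player-{digest[:8]}"


def anonymise_rows(rows, secret_key, column="player"):
    """Return copies of the row dicts with the name column replaced."""
    cleaned = []
    for row in rows:
        new_row = dict(row)          # leave the caller's data untouched
        if new_row.get(column):      # skip empty or missing names
            new_row[column] = pseudonymise(new_row[column], secret_key)
        cleaned.append(new_row)
    return cleaned
```

The same name always maps to the same code, so per-player questions still work on the cleaned file.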

Aside · The sample dataset we will use in the walkthrough

For the worked example below, we use a small simplified CSV: one match between two teams, around 1,800 events. Columns: minute, team, player, event_type (pass / shot / tackle / etc.), x, y, outcome (success / failure), xG (for shot events only). If you do not want to download anything, you can follow the worked example by reading along; the exercise after it requires you to have actually got some data into your bot.
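For readers curious how a derived CSV like this could be produced from StatsBomb-style JSON, here is a rough sketch. It assumes each event is a nested dict with "minute", "team", "player", "type", "location" and, for shots, a "shot" object carrying "statsbomb_xg"; check the field names against the files you actually download. The outcome column is left out because it is nested differently for each event type. This is not the script used to build the course sample.

```python
import csv
import io

# Column names follow the simplified layout used in this lesson (minus outcome).
COLUMNS = ["minute", "team", "player", "event_type", "x", "y", "xG"]


def flatten_event(event):
    """Turn one nested event dict into a flat row; missing fields become ''."""
    location = event.get("location") or [None, None]
    shot = event.get("shot") or {}
    return {
        "minute": event.get("minute", ""),
        "team": (event.get("team") or {}).get("name", ""),
        "player": (event.get("player") or {}).get("name", ""),
        "event_type": (event.get("type") or {}).get("name", ""),
        "x": "" if location[0] is None else location[0],
        "y": "" if location[1] is None else location[1],
        "xG": shot.get("statsbomb_xg", ""),  # only present on shot events
    }


def events_to_csv(events):
    """Return CSV text for a list of event dicts."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for event in events:
        writer.writerow(flatten_event(event))
    return buffer.getvalue()
```

In practice you would read the match file with json.load and pass the resulting list to events_to_csv.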

Feeding data to the bot

Both Claude and ChatGPT in 2026 accept CSV file uploads on their free tiers. If you have a small CSV (under about 10MB / 50,000 rows), upload it directly and ask your question. If your dataset is larger or you only need a slice, copy a representative sample (a few hundred rows) and paste it as text.
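Cutting a slice by hand is error-prone, so here is one possible way to do it in code. The sketch assumes one record per line (no quoted fields containing line breaks); the function name and the default of 300 rows are illustrative choices, not a platform limit.

```python
import random


def sample_csv_lines(lines, n=300, seed=0):
    """Return the header plus up to n randomly chosen data lines, in original order.

    lines: list of CSV lines with the header as the first element.
    """
    header, data = lines[0], lines[1:]
    if len(data) <= n:
        return [header] + data           # already small enough to paste whole
    rng = random.Random(seed)            # fixed seed so the slice is repeatable
    keep = sorted(rng.sample(range(len(data)), n))
    return [header] + [data[i] for i in keep]
```

A random slice suits questions about overall distributions. For questions that depend on sequence, such as pressing before and after a substitution, a contiguous block of rows is probably the better sample.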

What the bot is doing under the hood differs slightly between platforms, but for our purposes the workflow is the same: you put data in, you ask a question, the bot reads, computes a bit, and writes an answer. Treat the answer as a first draft from a junior analyst — never as a final read.
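One practical way to treat an answer as a draft is to recompute a headline figure yourself and compare. The sketch below counts shots and sums xG per team; it assumes rows shaped like the simplified columns used in this lesson (team, event_type, xG) and a lowercase "shot" label, both of which you should adjust to your own file.

```python
from collections import defaultdict


def shots_and_xg(events):
    """Count shots and sum xG per team.

    events: iterable of dicts with 'team', 'event_type' and 'xG' keys.
    Empty or missing xG values are treated as 0.
    """
    totals = defaultdict(lambda: {"shots": 0, "xG": 0.0})
    for e in events:
        if e["event_type"] == "shot":
            totals[e["team"]]["shots"] += 1
            totals[e["team"]]["xG"] += float(e["xG"] or 0.0)
    return {
        team: {"shots": t["shots"], "xG": round(t["xG"], 2)}
        for team, t in totals.items()
    }
```

If your totals and the bot's disagree, ask the bot which rows it counted before trusting either number.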

A worked example — one match, four questions

Here is the kind of conversation a working analyst has with the bot once the system prompt from Lesson 2 is in place.

Worked example — match analysis

You: [uploads the match CSV] Take a look at this match. What jumps out about Team A's attacking play?

Bot: A few things from a first pass at the file:
• Team A had 14 shots to Team B's 7, with a combined xG of 1.61 vs 0.72. They created more, and created from better positions.
• 9 of Team A's shots came from inside the 18-yard box. The cluster is on the right side of the goalkeeper, suggesting a left-foot-cutting-in pattern.
• Their xT (where computed) is concentrated in the right half-space between the 18-yard line and the centre circle. This is consistent with the shot map.
One caveat: this is one match. The pattern may not hold over a longer window.

You: Good. Compare Team A's pressing in the first half versus the second.

Bot: Splitting the event log by half:
• First half PPDA: 9.2 (high pressing — coordinated press in opposition half).
• Second half PPDA: 14.6 (much less press — pulled into a mid-block).
The switch happens around minute 58, which is also when their first substitution took place. This is descriptive only — I cannot tell you why they made the change from this file alone.

You: Anything concerning about Player 11?

Bot: Player 11 (right-back, Team A):
— 4 successful tackles, 2 missed, 1 yellow card at minute 34
— Lost possession 7 times in their own half, 3 of which led to a Team B shot
— Heat-map cluster shows them being drawn 10–15m higher up the pitch in the second half, leaving space behind
That's the description. Whether it's a tactical instruction working as intended or a player struggling is a coaching call. I would want to look at two or three more matches before forming a view.

You: Predict whether Team A will win next weekend.

Bot: That's a predictive question and I am not going to answer it as posed.
What I can say from this match: Team A created more and pressed harder for an hour. To project that forward I would need (a) more matches, (b) the opponent's recent record, and (c) any injury or suspension news. Even with those, sport is adversarial and short-horizon match prediction by AI on small samples is not reliable. Happy to describe the team's recent profile if you have more data.

Three things worth noticing about that exchange. First, the bot uses the operational definitions you wrote in Lesson 2 — PPDA, xT, the right half-space — without being asked to. Second, it offers caveats unprompted (one match, descriptive only, would want more data). Third, it holds the line on prediction in exactly the way the prompt told it to.
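The pressing comparison in that exchange is the kind of figure worth recomputing by hand. The sketch below uses one common reading of PPDA: opposition passes divided by the pressing team's defensive actions. It is an assumption, not the operational definition you wrote in Lesson 2. It applies no pitch-zone filter, the set of defensive-action labels is hypothetical, and splitting at minute 45 misplaces first-half stoppage time.

```python
DEFENSIVE_ACTIONS = {"tackle", "interception", "foul"}  # assumed event labels


def ppda_by_half(events, pressing_team):
    """Return {1: ppda, 2: ppda} for pressing_team.

    PPDA here = opponent passes / pressing team's defensive actions.
    A half with no defensive actions returns None instead of dividing by zero.
    """
    passes = {1: 0, 2: 0}
    actions = {1: 0, 2: 0}
    for e in events:
        half = 1 if int(e["minute"]) < 45 else 2
        if e["team"] != pressing_team and e["event_type"] == "pass":
            passes[half] += 1
        elif e["team"] == pressing_team and e["event_type"] in DEFENSIVE_ACTIONS:
            actions[half] += 1
    return {
        half: (passes[half] / actions[half]) if actions[half] else None
        for half in (1, 2)
    }
```

A lower value means more defensive actions per opposition pass, which is why the bot reads 9.2 as a harder press than 14.6.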

The four small-sample traps

Sport analytics has a notorious set of traps that come from having too little data and too much pressure to read it. Most of them are about confusing signal with noise. Your bot can fall into all of them. Here are the four to watch for.

Trap 1 — One match is not a pattern. A team's xG in a single match has enormous variance. The same team playing the same way can produce 0.4 xG one week and 2.6 the next. Asking the bot "is this a high-xG team?" off a single match will get you a confidently wrong answer. Add a line to your prompt: "For team-level conclusions, require at least 10 matches before describing a pattern as established."
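To see why a handful of matches says so little, it can help to put an interval around the average. The sketch below is illustrative only: it uses a normal approximation, which is crude at small sample sizes, and the function name is hypothetical. The 10-match threshold is the one from the prompt line above.

```python
from statistics import mean, stdev

MIN_MATCHES = 10  # threshold taken from the prompt line above


def describe_xg(per_match_xg):
    """Summarise per-match xG with a rough 95% interval for the mean."""
    n = len(per_match_xg)
    avg = mean(per_match_xg)
    if n < 2:
        # One match gives no spread to estimate, so no interval at all.
        return {"matches": n, "mean": avg, "interval": None, "established": False}
    # Normal approximation; crude at small n but enough to show the width.
    half_width = 1.96 * stdev(per_match_xg) / n ** 0.5
    return {
        "matches": n,
        "mean": avg,
        "interval": (avg - half_width, avg + half_width),
        "established": n >= MIN_MATCHES,
    }
```

With only a few matches the interval is wide enough to cover both a poor attack and an excellent one, which is the trap in numeric form.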

Trap 2 — One season is not a player's level. One excellent breakout season at 19 is mostly noise; a second season at the same level starts to look like signal. The bot does not know this unless you tell it. For scouting questions especially, ask for the multi-season view.

Trap 3 — Per-match rates are not opportunities. "Player X averages 2.4 progressive carries per 90" sounds like a stat about Player X. It is partly a stat about Player X's manager, system, and teammates, who determine how many chances to carry the player gets. Move the player to a different team and the rate shifts with the new context, even if the player has not changed. The bot will not flag this unless prompted.
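The arithmetic behind a per-90 figure is simple, which is part of why it is easy to over-read. The sketch below shows the calculation, plus one hypothetical way to keep team context visible by reporting the player's share of the team total next to the rate. Both function names are illustrative.

```python
def per_90(count, minutes_played):
    """Convert a raw count into a per-90-minutes rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return count * 90 / minutes_played


def share_of_team(player_count, team_count):
    """Fraction of the team's total that came from this player."""
    if team_count <= 0:
        raise ValueError("team_count must be positive")
    return player_count / team_count
```

For example, 24 progressive carries in 900 minutes gives per_90(24, 900) = 2.4. The share figure does not remove the context problem, but it gives the bot something to compare when the player changes team.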

Trap 4 — Survivorship bias in scouting data. Public league tables and public stats only contain players who got minutes. The ones who did not, because they were injured, not picked, or sold, are invisible. When you ask "how did this academy produce its breakout players", you are reading the survivors, not the policy.

The defence against all four is the same: when a question touches on any of them, name the trap in the prompt. For example: "How is Player X performing this season, and what would you want to see to confirm this is not just small-sample noise?" A bot that has been told to flag sample size will reply with the answer and the appropriate caveats.

Exercise — A small analysis (35 minutes)

  1. Get data into your bot. Pick one of the three free datasets above. Either upload a CSV directly or paste a representative slice. If you have your own data and permission to use it, even better.
  2. Ask four questions, deliberately mixing easy and hard:
    • A descriptive question. ("What were the top three patterns of play for Team X?" or "What was the distribution of finish times in the men's 100m, 2024 season?")
    • A question about a specific person or athlete. (Watch for athlete-data behaviour from Lesson 2.)
    • A predictive question, framed deliberately to provoke prediction. ("Will Team X make the playoffs?") Watch the bot reframe.
    • A sample-size question. ("Is this a real pattern or could it be noise — what would you want to see to be sure?")
  3. Score the bot. For each answer, rate it 1–5 on:
    • Did it use your operational terminology correctly?
    • Did it offer appropriate caveats?
    • Did it stay descriptive when it should have?
    • Was the answer useful — would you put it in front of a coach?
  4. Adjust your prompt. Wherever the bot scored below a 4, go back to the relevant block of your system prompt and tighten it.
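If you would rather keep the scores in a small script than on paper, steps 3 and 4 can be sketched as below. The criterion labels are shorthand for the four scoring questions and, like the function name, are assumptions made for illustration.

```python
# Shorthand labels for the four scoring questions in step 3.
CRITERIA = ("terminology", "caveats", "descriptive", "useful")


def blocks_to_tighten(scores, threshold=4):
    """Return {question: [criteria scored below threshold]}.

    scores: {question label: {criterion: rating from 1 to 5}}.
    """
    flagged = {}
    for question, ratings in scores.items():
        for criterion in CRITERIA:
            if not 1 <= ratings[criterion] <= 5:
                raise ValueError(f"{question}/{criterion}: rating must be 1-5")
        low = [c for c in CRITERIA if ratings[c] < threshold]
        if low:
            flagged[question] = low
    return flagged
```

Each flagged criterion points at the block of the system prompt to revisit: terminology at your operational definitions, caveats at the sample-size instructions, and so on.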

What the bot is good at, and what it isn't

By this point in the course you should have a working feel for both. The bot is good at: reformatting data, summarising it, computing basic stats from a CSV, translating from technical to plain language, generating a first draft of a paragraph. It is less good at: anything requiring multi-season context the bot does not have, anything adversarial, anything genuinely predictive. The art of using it well is keeping it in the first set of tasks and out of the second.

Self-check

  1. What are the three main kinds of sport data, and which is the most browser-friendly?
  2. Name the four small-sample traps. Which is most relevant to your domain?
  3. What is the difference between asking "is this a pattern?" and "what would tell us this is a pattern?" — and which is the better question for a bot?
  4. Why is per-match rate data partly a stat about the team rather than the player?

Looking ahead

Lesson 4 is about communication. The same analysis means different things to a coach, a scout, a broadcaster, and a board. We will use the bot as a writing partner to do that translation, and watch for the specific failures that translation introduces.