Lesson 4 of 5 · AI Security Foundations
Defences that work
No single control will save you. A small number of well-chosen, layered controls will keep most of the trouble out.
By the end of this lesson, you will:
- Know the defence-in-depth pattern for AI systems and the seven layers it contains.
- Be able to name the specific controls in each layer and what each one is for.
- Have an honest picture of the MLSecOps function and how to staff it.
The pattern: defence in depth for AI
Defence in depth is the principle that no single control should be the only thing between an attacker and a compromise. You stack layers. Each layer catches some attacks the others miss. The attacker has to defeat all of them to succeed.
For AI systems, defence in depth has seven natural layers, working from outside in.
- Identity and access. Who is allowed to interact with the system at all.
- Input boundary. What gets accepted into the model's context.
- Model and prompt configuration. How the model is told to behave.
- Tool gating. What actions the model is allowed to invoke.
- Output handling. What happens to the model's output downstream.
- Monitoring and logging. What you see and what you can replay.
- Continuous red-teaming. The adversarial pressure that keeps the rest honest.
Each of these has specific controls. Most of them you already know in their traditional-security form; the work is applying them to the AI context.
Layer 1 — Identity and access
The first line of defence is authentication and authorisation. Every interaction with the AI system goes through your normal identity layer. AI does not get a pass on this. If anything, AI systems need stricter identity controls because they tend to be high-value internal tools that touch sensitive data.
Controls: SSO, MFA, role-based access to the AI system, audit of who used it for what.
Common failure: The AI demo became a production tool without ever going through the normal identity onboarding. Six months in, nobody knows who has access.
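The role-based access and audit controls above can be sketched in a few lines. This is a minimal illustration, not a replacement for your identity provider: the role names, the `ROLE_PERMISSIONS` mapping, and the `AccessGate` class are all hypothetical, and a real deployment would take roles from SSO rather than from a hard-coded dictionary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping. In practice this would come
# from the identity provider, not from code.
ROLE_PERMISSIONS = {
    "analyst": {"chat"},
    "admin": {"chat", "configure"},
}


@dataclass
class AccessGate:
    audit_log: list = field(default_factory=list)

    def check(self, user: str, role: str, action: str) -> bool:
        # Unknown roles get an empty permission set, so they are denied.
        allowed = action in ROLE_PERMISSIONS.get(role, set())
        # Record every decision, allowed or denied, so that
        # "who used it for what" can be answered later.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "action": action,
            "allowed": allowed,
        })
        return allowed


gate = AccessGate()
```

Calling `gate.check("alice", "analyst", "configure")` returns `False` and still leaves an audit entry, which is the property that prevents the "nobody knows who has access" failure.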
Layer 2 — Input boundary
What is allowed into the model's context? This is the layer that catches direct and indirect prompt injection, malicious file uploads, and oversized inputs.
Controls: Strict typed schemas for inputs where possible. Maximum sizes. Content-aware filters on user-submitted text. Per-source classification of retrieved content (internal vs external; trusted vs untrusted). Image and document parsing with separate sanitisation passes. For RAG systems, careful curation of what is in the vector store.
Common failure: Retrieving content from an internal source that anyone in the company can write to (Slack, Confluence) without recognising that "anyone in the company" includes the threat actor who phished an employee last quarter.
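A rough sketch of two of the input-boundary controls, a size limit and per-source classification of retrieved content, might look like the following. The limit, the source names, and the trust labels are assumptions made for illustration. Content-aware filtering and document sanitisation are deliberately left out.

```python
from dataclasses import dataclass

MAX_INPUT_CHARS = 4000  # assumed limit; tune per application

# Hypothetical source classification. An internal source that anyone
# can write to is still labelled untrusted.
SOURCE_TRUST = {
    "policy_repo": "trusted",
    "confluence": "untrusted",
    "web": "untrusted",
}


@dataclass(frozen=True)
class ContextChunk:
    source: str
    text: str
    trust: str


def accept_user_input(text: str) -> str:
    """Reject inputs that are the wrong type or too large."""
    if not isinstance(text, str):
        raise TypeError("input must be a string")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds maximum size")
    return text


def classify_chunk(source: str, text: str) -> ContextChunk:
    """Label retrieved content; unknown sources default to untrusted."""
    return ContextChunk(source, text, SOURCE_TRUST.get(source, "untrusted"))
```

The design choice worth noting is the default: a source that nobody has classified is treated as untrusted, so forgetting to classify a new connector fails safe.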
Layer 3 — Model and prompt configuration
How the model is configured to behave. The system prompt is the most visible part of this layer; the model's safety settings (where the provider exposes them) and the choice of model are the others.
Controls: System prompts that explicitly enumerate prohibited behaviours (we built one of these in the Market Research Bot course). Provider safety filters turned on. Model selection appropriate to the risk tier of the application — a customer-facing system may need a more conservative model than an internal research tool. Deterministic settings (low temperature, structured outputs) where determinism matters.
Common failure: A system prompt that documents the rules but can be overridden through user input. The system prompt is a directive, not a fence. The fence is the next two layers.
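One way to make "model selection appropriate to the risk tier" concrete is to keep the configuration in a single table keyed by tier. The sketch below assumes two tiers and uses placeholder model names. None of the names or values come from a real provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model: str                    # placeholder name, not a real model
    temperature: float
    structured_output: bool
    provider_safety_filter: bool


# Hypothetical tiers and settings, for illustration only.
CONFIG_BY_TIER = {
    "customer_facing": ModelConfig("conservative-model", 0.0, True, True),
    "internal_research": ModelConfig("general-model", 0.7, False, True),
}


def config_for(tier: str) -> ModelConfig:
    # An unknown tier falls back to the strictest configuration.
    return CONFIG_BY_TIER.get(tier, CONFIG_BY_TIER["customer_facing"])
```

Keeping the table in one place means a reviewer can see at a glance which applications run with which settings, and an unclassified application inherits the conservative defaults.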
Layer 4 — Tool gating
This is the most important layer for agents, and the one most often skipped. Each tool the agent can call is a privileged operation. Tool gating means: each tool is on an explicit allowlist, each tool is the minimum needed for the agent's job, and high-impact tools require approval flows.
Controls: Allowlist of tools per agent role. No tool available by default. Approval flow for any destructive, financial, or external-facing tool. Per-call audit of which tool was used and why. Rate limits on individual tools.
Common failure: Giving the agent broad API access "to make it useful" and then being surprised when, under prompt injection, it does something with that API access. The privilege you grant is the privilege the attacker borrows.
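The allowlist, approval flow, and per-call audit described above could be sketched as a small gate that every tool call passes through. The class names, roles, and tools here are hypothetical, and rate limiting is omitted to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    name: str
    func: Callable[..., object]
    needs_approval: bool = False  # set for destructive or financial tools


@dataclass
class ToolGate:
    allowlist: dict  # role -> set of permitted tool names
    tools: dict      # tool name -> Tool
    audit: list = field(default_factory=list)

    def call(self, role, tool_name, reason, approved=False, **kwargs):
        tool = self.tools.get(tool_name)
        # Nothing is available by default: the role must list the tool.
        if tool is None or tool_name not in self.allowlist.get(role, set()):
            outcome = "denied"
        elif tool.needs_approval and not approved:
            outcome = "pending_approval"
        else:
            outcome = "executed"
        # Audit every attempt, including the stated reason.
        self.audit.append({"role": role, "tool": tool_name,
                           "reason": reason, "outcome": outcome})
        if outcome != "executed":
            raise PermissionError(f"{tool_name}: {outcome}")
        return tool.func(**kwargs)


# Hypothetical tools for a support agent.
gate = ToolGate(
    allowlist={"support_agent": {"lookup_order", "issue_refund"}},
    tools={
        "lookup_order": Tool("lookup_order", lambda order_id: {"order_id": order_id}),
        "issue_refund": Tool("issue_refund", lambda order_id: "refunded",
                             needs_approval=True),
    },
)
```

Because the gate sits outside the model, a prompt injection can make the agent ask for `issue_refund`, but it cannot set `approved=True`. That flag belongs to whatever human or workflow handles approvals.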
Layer 5 — Output handling
The model produces text. Something downstream consumes that text. Output handling is the discipline of treating the model's output as untrusted, the same way you treat user-submitted content.
Controls: Sanitise model output before passing to any system that interprets it. Parameterise downstream queries (do not interpolate model output into SQL strings). Output content filters for sensitive patterns (PII, secrets, internal URLs). Markdown and HTML rendering with safe defaults — no raw HTML unless explicitly required and locked down.
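Common failure: A chatbot's response is rendered with full Markdown support, including raw HTML. Under prompt injection, the model emits an `<img>` tag pointing to an attacker URL with sensitive data from the conversation in the query string, and the user's browser sends that data the moment it loads the image. Data exfiltration via rendered Markdown is alarmingly common.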
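Two of the output-handling controls are easy to show with the Python standard library: escaping model output before rendering, and binding it as a query parameter instead of interpolating it into SQL. The `customers` table and the function names are assumptions made for the example.

```python
import html
import sqlite3


def render_safely(model_output: str) -> str:
    # Escape everything, so any tag the model emits is displayed as
    # text instead of being interpreted by the browser.
    return html.escape(model_output)


def find_customer(conn: sqlite3.Connection, name_from_model: str):
    # The "?" placeholder keeps the model-supplied value out of the
    # SQL text, so it can only ever be treated as data.
    return conn.execute(
        "SELECT id, name FROM customers WHERE name = ?",
        (name_from_model,),
    ).fetchall()


# Hypothetical in-memory table for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (name) VALUES ('Ada')")
```

With this in place, a model output such as `Ada' OR '1'='1` matches no rows, because the whole string is compared against the `name` column as a literal value.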
Layer 6 — Monitoring and logging
You cannot defend what you cannot see. Logging for AI systems means recording every input, every tool call, every output, with enough context to replay a session in full.
Controls: Structured logging of system prompts, user inputs, retrieved context, tool calls and arguments, tool results, model outputs. Cost monitoring with alerts on anomalies. Behavioural baselines per user and per role. SIEM ingestion of the AI logs alongside everything else. Sampling-based deep inspection of conversations.
Common failure: Logging only user inputs and final outputs, missing the tool calls and retrieved context that explain how the model reached its decision. When an incident happens, you cannot reconstruct what went wrong.
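A minimal sketch of replayable session logging follows, assuming one JSON line per event and the six event types listed in the controls. The `SessionLog` class is hypothetical. A production version would write to durable storage and forward to the SIEM instead of holding lines in memory.

```python
import json
from datetime import datetime, timezone

# The event types named in the controls above.
EVENT_TYPES = {
    "system_prompt", "user_input", "retrieved_context",
    "tool_call", "tool_result", "model_output",
}


class SessionLog:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.lines = []  # in-memory stand-in for a log sink

    def record(self, event_type: str, payload: dict) -> None:
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        self.lines.append(json.dumps({
            "session_id": self.session_id,
            "seq": len(self.lines),  # preserves ordering for replay
            "time": datetime.now(timezone.utc).isoformat(),
            "type": event_type,
            "payload": payload,
        }))

    def replay(self) -> list:
        # Return the events in the order they were recorded.
        return [json.loads(line) for line in self.lines]
```

The sequence number matters as much as the timestamp: it lets an investigator see that a particular retrieved document arrived in context before the tool call it appears to have triggered.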
Layer 7 — Continuous red-teaming
The rest of the layers are static. Red-teaming is the active layer that keeps them honest. A red team — internal, external, or a mix — continuously probes the system for the kinds of weaknesses ATLAS and OWASP describe. The point is not to find every flaw; it is to find flaws faster than attackers do.
Controls: A standing red-team practice with documented engagement rules. Coverage of the OWASP Top 10 and ATLAS tactics in each engagement. Tooling to support red-teaming (Microsoft's PyRIT and Counterfit, and others). A formal pipeline from red-team finding to remediation, with time-to-fix metrics. Public reporting of findings where appropriate.
Common failure: "We did a red team a year ago" treated as a sufficient answer. AI systems evolve weekly. Annual red-teaming is annual reassurance, not annual security.
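The finding-to-remediation pipeline becomes measurable once each finding carries a found date and a fixed date. The sketch below computes a median time-to-fix and lists what is still open. The `Finding` structure is an assumption for illustration, not a format taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class Finding:
    title: str
    found: date
    fixed: Optional[date] = None  # None while the finding is still open


def median_days_to_fix(findings):
    """Median days from discovery to fix, over closed findings only."""
    durations = [(f.fixed - f.found).days for f in findings if f.fixed]
    return median(durations) if durations else None


def open_findings(findings):
    """Titles of findings that have not been fixed yet."""
    return [f.title for f in findings if f.fixed is None]
```

Tracking the open list alongside the median guards against a flattering metric: a team that only closes easy findings will show a short time-to-fix and a growing backlog.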
The MLSecOps function
MLSecOps — Machine Learning Security Operations — is the emerging discipline that runs all of the above as a function rather than as a project. Think of it as DevSecOps for the AI stack.
What an MLSecOps function does, day to day:
- Maintains the inventory of AI systems in production and their risk tiers.
- Runs the threat-modelling step for new systems.
- Builds and runs the red-team programme.
- Integrates AI logs with the SIEM.
- Operates incident response for AI-related events.
- Liaises with the AI/ML engineering teams to embed controls during development.
- Tracks the regulatory frame (EU AI Act, ISO 42001, NIST AI RMF) and ensures compliance documentation.
For a mid-sized organisation in 2026, the MLSecOps function is typically one to three people: a senior security engineer with AI exposure, a more junior engineer or analyst, and access to the existing application-security team. Larger organisations build it out further. Smaller ones bolt the responsibility onto an existing security architect.
Aside · Tools to know about
A few specific tools that are useful in 2026. PyRIT (Microsoft) is an open-source framework for automated red-teaming of LLMs. Garak is a probe-based vulnerability scanner for language models. Lakera Guard, Protect AI, and Robust Intelligence are commercial AI-security platforms; all three offer real-time guardrails. Promptfoo is a popular tool for evaluating and red-teaming prompts. None of these alone is a complete solution. They are layers in a defence-in-depth stack.
Hands-on time
Exercise 4.1 · 20 minutes
Map controls to a system
Take the system you threat-modelled in Lesson 3 (or use the Knowledge Assistant from the worked examples). For each of the seven defence-in-depth layers, write one or two sentences answering:
- Is this layer in place for the system today?
- If not, what would the minimum viable control be?
- Who in your organisation owns this control?
Aim to finish within the 20 minutes allotted. The output is your one-page controls map for the system, a useful artefact for your next architecture review or audit conversation.
Tools required: a sheet of paper or a text document.
What you should notice
Two things, in my experience.
First, when you map a real system honestly, you find that several layers are partial or absent. This is normal. AI systems were rarely designed with defence in depth in mind because the discipline did not exist when they were first built. The map is a remediation list, not a report card.
Second, the layers where most organisations are weakest are layer 4 (tool gating), layer 6 (monitoring), and layer 7 (continuous red-teaming). The first two are technical and can be added in weeks. The third is organisational and takes a quarter or two to build. None of them is exotic; all three are the things most organisations skip.
Self-check
- Name the seven layers of defence in depth for AI systems.
- Which layer is most often skipped, and why?
- What is the difference between a system prompt as a "directive" and as a "fence"?
- Why is annual red-teaming not enough?
Looking ahead
In Lesson 5, the final one, we look at the regulatory frame. EU AI Act, NIS2, DORA, ISO 42001, NIST AI RMF — what each requires, where they overlap, and how to map your AI systems to them. You will end the course with a one-page AI security posture document you can take to your next audit committee or board meeting.