Guide
How to Run a Security Review With Cursor (Agent-Driven)
Point the agent at a repo, prime it for the vulnerability classes you care about, and fan out sub-agents per class to produce a findings table. Then run each finding through a validation loop — demand the attack chain, code snippets and a working proof of concept — so false positives collapse before a human ever reads them.
On this page
- What are the operating principles for security work with an agent?
- How do I review an unfamiliar repository for the first time?
- Are agents better than static code-scanning tools for security review?
- How do I orchestrate a thorough multi-class review?
- How do I stop the agent reporting false-positive vulnerabilities?
- Which Cursor mode and model should I use for security work?
- How do I build secure infrastructure with Plan mode?
- How should I threat-model prompt injection in an AI tool?
- How do I keep credentials away from the agent and gate risky commands?
- How do hooks, rules and Bugbot fit into the review pipeline?
- What stays human, and where does this approach fall short?
What are the operating principles for security work with an agent?
Treat the agent like a fast, fallible delegate. Four tenets carry most of the practice: give it real context (be clear about the work and what good looks like; hours of prompt engineering are no longer needed); assume it can be wrong and build the workflow around any single fact being false; ask for proof when you doubt a claim; and ask anything, because agents are unmatched at ramping you into an unfamiliar system fast.
Security people are uniquely good at distrusting a confident source, and that instinct transfers directly to working with a model. Distrust is not the same as dismissal.
Being suspicious of the model will come naturally to us. It doesn't mean that it's useless. It just means that you build workflows around assuming that any individual fact you get could be wrong.
The payoff is structural, not just speed. Tireless agents let you run deep review on every change, not a sampled few. Threat-modeling every PR used to be untenable; now it is tenable. The practitioner who built this workflow self-assessed roughly 5x more effective on older models and 7.5–10x now — self-reported, and your mileage will vary, but the direction is the point.
How do I review an unfamiliar repository for the first time?
Hand the agent a pointer to the repo and ask the questions a security reviewer asks on day one. How old is the repo? Who are the top contributors, with contribution counts and tenure? What are the project's goals? Is there a security policy, CI, a defined development process? What is the top-level architecture? You get back a digest that would take a long time to assemble by hand — and you can keep asking follow-ups until you actually understand the system.
1. Give the agent the repo pointer (a GitHub URL is enough).
2. Ask for repo age and a top-10 contributor list with counts and tenure.
3. Ask for the project goals, security policy, CI setup and dev process.
4. Ask for a top-level architecture summary, then an actual architecture diagram.
5. Read the disclosure policy: what is in and out of scope, is there a bug bounty.
Asking for an architecture diagram is one of the strongest "ask for proof" moves — it forces the model to commit to a structure you can interrogate. If it picks unreadable colors, just say so: "I can't read the diagram with the contrast you have, would you mind fixing that?" It fixes it. Critiquing an artifact is cheap; lean on it.
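One cheap way to apply "ask for proof" to the contributor digest is to recompute it yourself from git history and compare. The sketch below is an illustration, not part of the original workflow: the function names are hypothetical, and tenure is assumed to mean the number of days between an author's first and last commit.

```python
import subprocess
from collections import defaultdict

SECONDS_PER_DAY = 86_400


def read_author_log(repo_path):
    """Return one 'author<TAB>unix_timestamp' line per commit in the repo."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%aN%x09%at"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def contributor_digest(log_lines, limit=10):
    """Summarize commit count and tenure (first to last commit, in days) per author."""
    stamps = defaultdict(list)
    for line in log_lines:
        name, sep, stamp = line.rpartition("\t")
        if sep and stamp.strip().isdigit():
            stamps[name].append(int(stamp))
    rows = [
        {
            "author": name,
            "commits": len(ts),
            "tenure_days": (max(ts) - min(ts)) // SECONDS_PER_DAY,
        }
        for name, ts in stamps.items()
    ]
    # Most commits first; ties broken alphabetically so output is stable.
    rows.sort(key=lambda row: (-row["commits"], row["author"]))
    return rows[:limit]


def repo_age_days(log_lines, now):
    """Days between the oldest commit and `now` (a unix timestamp)."""
    stamps = [int(line.rpartition("\t")[2]) for line in log_lines
              if line.rpartition("\t")[2].strip().isdigit()]
    return (now - min(stamps)) // SECONDS_PER_DAY
```

Calling `contributor_digest(read_author_log("path/to/repo"))` on a local clone gives numbers you can set against the agent's answer; a mismatch is a prompt for a follow-up question, not proof that either side is wrong.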
Are agents better than static code-scanning tools for security review?
For most review work, yes — emphatically. The trouble with static rules is that they are static: a rule can't trace context through the codebase the way an agent can, so writing a good one that follows the full data flow is very hard. In a vuln scan the agent drives the linter-style checks intelligently, picking the right ones on the fly. The viable hybrid is to feed linter and scanner output into the agent and have it triage.
Agents add one new failure mode: they make things up. That gives you two problems to manage. False negatives — the agent misses something you cared about. False positives — it confidently flags a non-issue. Each has a specific countermeasure, below.
| Failure mode | Why it happens | Mitigation |
|---|---|---|
| False negative | A bare "find vulnerabilities" doesn't tell it what you care about | Prime by vuln class ("I'm looking for XSS"); run parallel runs for things you really care about |
| False positive | Models are built to be helpful and will fabricate a finding | Run every finding through the validation loop: attack chain + code + working PoC |
Agents beat static scanners on context tracing, but you manage their hallucinations with priming and a validation loop.
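The hybrid approach, feeding scanner output to the agent for triage, might look roughly like this. It assumes the scanner can emit SARIF 2.1.0 (many can, but check yours), and the prompt wording and function names are illustrative rather than anything the workflow prescribes.

```python
def sarif_findings(sarif):
    """Flatten a parsed SARIF 2.1.0 log into simple dicts, one per result."""
    findings = []
    for run in sarif.get("runs", []):
        tool = run.get("tool", {}).get("driver", {}).get("name", "unknown")
        for result in run.get("results", []):
            # Only the first reported location is used in this sketch.
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "tool": tool,
                "rule": result.get("ruleId", "unknown"),
                "message": result.get("message", {}).get("text", ""),
                "file": physical.get("artifactLocation", {}).get("uri", "?"),
                "line": physical.get("region", {}).get("startLine"),
            })
    return findings


def triage_prompt(findings, vuln_class):
    """Build a primed triage request: name the vuln class, then list the raw hits."""
    lines = [
        f"I'm looking for {vuln_class}. Triage these scanner results.",
        "For each one, trace the data flow and say whether it is exploitable.",
        "",
    ]
    for number, item in enumerate(findings, start=1):
        where = item["file"] if item["line"] is None else f"{item['file']}:{item['line']}"
        lines.append(f"{number}. [{item['tool']}/{item['rule']}] {where} {item['message']}")
    return "\n".join(lines)
```

Naming the vulnerability class in the first line is the priming step from the table; the scanner hits give the agent concrete places to start tracing.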
How do I orchestrate a thorough multi-class review?
Kick off one agent and have it spawn sub-agents grouped by vulnerability class: infrastructure, AppSec, prompt injection, XSS, dependency and SCA (Software Composition Analysis: scanning third-party dependencies for known vulnerabilities and license problems). Each sub-agent (a child agent that a main agent spawns to work in parallel with its own context window, handing results back so the parent's context stays clean) produces a table of findings. For every finding it then has to show the full attack chain with code snippets and write a simulated proof of concept. Running it this way, sub-agents routinely realize a chunk of their own findings were wrong. Sub-agents earn their keep here for two reasons: the validation loop, and keeping each vuln class in its own isolated context.
One review run: prime, fan out per vuln class, validate every finding, then merge only what survives.
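The shape of that run (prime, fan out, validate, merge) can be sketched in a few lines. Everything here is an assumption made for illustration: `run_subagent` is a hypothetical callable standing in for however you launch a sub-agent, and the `validated` flag is an invented field a sub-agent would set after its own proof-of-concept pass. In Cursor the parent agent does this orchestration for you; the code only makes the data flow explicit.

```python
from concurrent.futures import ThreadPoolExecutor

VULN_CLASSES = ["infrastructure", "AppSec", "prompt injection", "XSS", "dependency and SCA"]


def class_prompt(vuln_class):
    """Prime one sub-agent for a single vulnerability class."""
    return (
        f"I'm looking for {vuln_class} issues. Produce a table of findings. "
        "For every finding, show the full attack chain with code snippets "
        "and write a simulated proof of concept."
    )


def fan_out(run_subagent, vuln_classes=VULN_CLASSES):
    """Run one sub-agent per class in parallel, then merge only validated findings.

    `run_subagent(vuln_class, prompt)` is hypothetical: it should return a list
    of dicts, each with at least a `validated` boolean.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(vuln_classes))) as pool:
        per_class = list(
            pool.map(lambda name: run_subagent(name, class_prompt(name)), vuln_classes)
        )
    merged = []
    for vuln_class, findings in zip(vuln_classes, per_class):
        for finding in findings:
            if finding.get("validated"):
                merged.append({"class": vuln_class, **finding})
    return merged
```

Keeping one class per sub-agent mirrors the isolated-context point above: each worker only ever sees the prompt for its own class.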
Most teams never review every PR in security depth — there isn't time. Sub-agents that never tire change the economics, so deep review on every change becomes realistic rather than aspirational.
I've never met a security team where every single PR at the company is reviewed in security depth. Things like threat modeling for any change that were previously untenable are now tenable.
How do I stop the agent reporting false-positive vulnerabilities?
Run a validation loop on anything that looks alarming. When the agent reports a finding — say it calls something a P0 — copy it back and, in effect, tell it you don't believe it: "Explain this in depth including the attack chain and relevant code snippets, tell me in detail how it would be exploited, and prove that the exploit works." The agent re-investigates and often concludes it is not a vulnerability at all — for example, an intentional feature the user invokes themselves. Push findings down to docs and code until model and human reach ground truth.
Single-finding loop: demand proof, let the agent re-investigate, then accept or dismiss on evidence.
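As a rough model of that loop, the sketch below treats a finding as unproven until it carries all three kinds of evidence. The `Finding` fields, the verdict names and the `ask_agent` callable are hypothetical; the challenge text is the prompt quoted above. An "accept" here only means the evidence is present, and a human still has to read it.

```python
from dataclasses import dataclass, field

CHALLENGE = (
    "Explain this in depth including the attack chain and relevant code snippets, "
    "tell me in detail how it would be exploited, and prove that the exploit works."
)


@dataclass
class Finding:
    title: str
    severity: str
    attack_chain: list = field(default_factory=list)
    code_snippets: list = field(default_factory=list)
    poc: str = ""
    retracted: bool = False  # set when the agent concludes it is not a vulnerability


def verdict(finding):
    """Dismiss retractions, accept only with all three kinds of evidence."""
    if finding.retracted:
        return "dismiss"
    if finding.attack_chain and finding.code_snippets and finding.poc:
        return "accept"
    return "challenge"


def validate(finding, ask_agent, max_rounds=3):
    """Challenge the agent until the finding is proven, retracted, or rounds run out.

    `ask_agent(finding, prompt)` is hypothetical: it re-investigates and returns
    an updated Finding.
    """
    for _ in range(max_rounds):
        state = verdict(finding)
        if state != "challenge":
            return state
        finding = ask_agent(finding, CHALLENGE)
    final = verdict(finding)
    return final if final != "challenge" else "unproven"
```

Returning "unproven" instead of forcing a decision keeps the inverse mistake out of the loop too: a finding the agent could not prove is handed to a human, not silently dropped.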
A current frontier model like Opus 4.x is very good at security reasoning if you tell it to reason that way — and better still if you ask it to self-validate. Add this to your review prompt:
When you find security issues I want you to go and do a validation loop — generate a proof of concept and explain the entire chain end-to-end.
Hallucination is normal, not a reason to stop. In one session the agent claimed a doc line cut off mid-sentence; a quick "check GitHub, here's the actual text" produced "You're right, I was wrong." Assume the model can and will fabricate, make it show you docs and code, and you reach ground truth far faster than reviewing alone.
Which Cursor mode and model should I use for security work?
Match the mode to the task shape. Use Ask mode when you just want a question answered and don't want the agent touching code. Use Agent mode for quick changes. Use Plan mode (a mode that makes no edits: it researches the codebase and produces an editable plan you review before any code changes) for anything net-new or non-trivial, because in plan the agent aggressively farms context up front: finding the injection points ahead of time, forming context more comprehensively, and staying on track during execution. Debug mode (a mode that diagnoses a failure: it reproduces the issue, adds instrumentation and watches the logs, rather than reviewing a pull request) is worth keeping in the rotation too. The honest advice: play with all of them.
| Mode | Reach for it when | In security review |
|---|---|---|
| Ask | You want an answer, no code edits | Interrogating a system without touching it |
| Agent | Quick, contained changes | Small fixes and ad-hoc checks |
| Plan | Building anything net-new or involved | Enumerate injection points before any code |
| Debug | Tracing a stubborn failure | Saved real time on a recent hard bug |
Field-tested mode picks. Plan mode is the one a skeptic became a convert on.
On models, be a daily driver rather than a constant switcher: you build an intuition for how a model approaches problems and where it gets stuck. A current Opus 4.x model works well as a security daily driver. Some GPT-5.x models are very strong on hard problems but slower, so reach for them when you want depth over speed. Let Auto mode (a router that reads your prompt and picks a model for you, defaulting to Composer; you steer it with cues like "quickly" or "carefully") route by task when you're undecided. This is a personal observation, not the official Cursor ranking; try models yourself and form your own view.
Plan with a thinker, execute with a fast model.
Match task shape to model: daily-driver reasoning, hard-but-slow depth, design, or let Auto route.
A generated plan once defaulted to an older model name as a training-cutoff artifact — the model assumed that was what it had. Tell it explicitly which current model to use and it updates the plan. Never let the model's idea of "what I am" decide your model selection.
How do I build secure infrastructure with Plan mode?
Use Plan mode whenever you're shipping infrastructure, so you see the agent's thinking before it acts. A worked example: when deploying a third-party AI tool for a coworker, the safe-but-usable landing spot was to keep it off the public internet and reach it over a Tailscale connector (which the tool's own docs recommended), so it's locked down yet reachable from a phone. In plan, the agent writes Terraform into an infra/ directory with good defaults from the docs, explicitly not open to the internet, with reasonable plugins so it isn't trivially prompt-injected.
Read the generated plan carefully, because that's the moment to give feedback. Ask the adversarial question directly in plan: "What if somebody sends my boss something that says to ignore previous instructions and fire me?" Then have a security sub-agent look for risks in the implementation before you open a PR. You want to know about real issues before the PR exists, not after.
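One pre-PR check you can run yourself is to confirm that the planned infrastructure really is closed to the internet. The sketch below reads the JSON form of a Terraform plan (as produced by `terraform show -json` on a saved plan file) and flags ingress open to any address. It assumes AWS security groups purely for illustration; the original example does not say which provider was used, and other resource types would need their own checks.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def public_ingress(plan):
    """Return addresses of planned resources that allow ingress from anywhere.

    `plan` is the parsed JSON plan. Only two AWS resource types are inspected,
    which is an assumption of this sketch, not a complete exposure audit.
    """
    exposed = []
    for change in plan.get("resource_changes", []):
        after = (change.get("change") or {}).get("after") or {}
        rules = []
        if change.get("type") == "aws_security_group":
            rules = after.get("ingress") or []
        elif change.get("type") == "aws_security_group_rule" and after.get("type") == "ingress":
            rules = [after]
        for rule in rules:
            cidrs = set(rule.get("cidr_blocks") or []) | set(rule.get("ipv6_cidr_blocks") or [])
            if cidrs & OPEN_CIDRS:
                exposed.append(change.get("address", "?"))
                break
    return exposed
```

An empty result is weak evidence, not a guarantee: it only says that these two resource types carry no open ingress in this plan.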
Plan mode makes the agent farm context aggressively before building anything, which surfaces the injection points up front and keeps execution on track. For involved or net-new work it's the default; for a one-line change, skip it.
If we're going to build anything net new or make changes to anything that's relatively involved, then I use plan because it will go and find ahead of time all of the injection points.
How should I threat-model prompt injection in an AI tool?
Start from the root cause: in today's models there is no separation of control and data. The model doesn't understand different trust levels of whoever is feeding it information — instructions and untrusted content arrive on the same channel. Security has made this mistake before, and AI makes it again. That single fact is why prompt injection is the spectre hanging over every agent system.
Because instructions and data share a channel, any untrusted text the agent reads can act as a command. Threat-model accordingly: assume hostile content reaches the model, and design so a malicious instruction can't reach anything dangerous.
There's no separation of control and data... the models are built in a way where they don't understand different trust levels of who's giving them information.
Mitigate with exposure and capability limits, not cleverness. Don't expose the tool to the public internet. Put it behind a private network like Tailscale. Add reasonable plugins and guardrails so it isn't trivially injected. Then run the adversarial scenario explicitly: "what if an email tells the agent to ignore its instructions and take a harmful action?" If your design can't answer that, it isn't done.
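A capability limit can be as simple as refusing dangerous tools once untrusted text has entered the session. The sketch below is one conservative pattern (taint tracking), offered as an assumption about how such a limit could be built; the tool names and the `Session` class are hypothetical, and real agent frameworks expose this differently.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; anything that mutates state or reaches other people.
DANGEROUS_TOOLS = {"send_email", "run_shell", "delete_record"}


@dataclass
class Session:
    tainted: bool = False
    history: list = field(default_factory=list)

    def read(self, text, trusted):
        """Record content the model has seen; untrusted content taints the session."""
        if not trusted:
            self.tainted = True
        self.history.append(("trusted" if trusted else "untrusted", text))

    def may_call(self, tool):
        """Dangerous tools are blocked once any untrusted text is in context."""
        return not (self.tainted and tool in DANGEROUS_TOOLS)
```

The gate never tries to decide whether the email is malicious. Because the model cannot separate control from data, the design assumes any untrusted text might be an instruction and removes the capability instead.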
How do I keep credentials away from the agent and gate risky commands?
Never hand the agent your credentials, even when you're a GitHub and AWS admin. Use a helper script the agent calls — you type something like creds prod, and the script sets environment variables in the shell the agent drives but cannot itself read. The agent can then run elevated AWS commands without the credentials ever being exposed to it. Architecturally, the agent drives a terminal through a tool; it sees terminal output, but the shell's env properties are not exposed to it. The same pattern works for any privileged system.
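A variant of the helper-script idea, sketched under assumptions: here the wrapper injects secrets into a child process's environment and redacts any secret value that leaks into the output, so what the agent reads back never contains the credential. The function names are hypothetical, the secrets dict would come from your own secret store, and this is not the exact `creds` script described above.

```python
import os
import subprocess
import sys

# A harmless child process for trying the wrapper: it echoes one variable.
PROBE = [sys.executable, "-c", "import os; print(os.environ.get('DEMO_TOKEN', 'unset'))"]


def child_env(secrets, base=None):
    """Copy the current environment and add secrets for the child process only."""
    env = dict(os.environ if base is None else base)
    env.update(secrets)
    return env


def redact(text, secrets):
    """Replace any secret value that appears in output; longest values first."""
    for value in sorted(secrets.values(), key=len, reverse=True):
        if value:
            text = text.replace(value, "[REDACTED]")
    return text


def run_with_creds(argv, secrets):
    """Run argv with secrets in its environment; return redacted results."""
    proc = subprocess.run(
        argv, env=child_env(secrets), capture_output=True, text=True
    )
    return proc.returncode, redact(proc.stdout, secrets), redact(proc.stderr, secrets)
```

Running `run_with_creds(PROBE, {"DEMO_TOKEN": "s3cr3t-value"})` returns output containing `[REDACTED]`, and the parent process's own environment is left unchanged. Redaction only catches verbatim echoes, so it supplements command approval and does not replace it.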
Approve every state-mutating and network command; let reads run free. Reading files is fine almost full-stop. Mutating state or anything needing network access asks for approval every time, because your user has more permissions to systems than you're comfortable handing the agent. Writing outside the workspace is a non-issue when you run in git workspaces — if the agent changes something, reset it. Approval isn't friction: the agent writes a complex aws ... | jq command faster than even a CLI expert could, so spending a second to read it and confirm it's sane is cheap. If the agent does something wrong, that's on you.
Where you draw the auto-run line depends on your blast radius (how much breaks if a change goes wrong; the scope of potential damage). A high-privilege admin gates network and mutations because their user can reach production. A local crypto developer would set it up differently. Cursor is actively building more sandbox controls to let you decide exactly what the agent can and can't run; check the docs for what's available.
| Command type | Control | Why |
|---|---|---|
| Read files | Auto-run | Low blast radius; reads are how the agent does its job |
| Mutate state | Approve every time | Your user can change things you don't want touched silently |
| Network access | Approve every time | Reaches systems where your permissions exceed the agent's trust |
| Write outside workspace | Git reset if needed | Running in a git workspace makes unwanted writes recoverable |
An allow-reads / gate-mutations-and-network policy, tuned to a high-privilege user's blast radius.
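The policy in the table can be expressed as a small fail-closed classifier: a command auto-runs only if every stage of the pipeline is on a read-only allowlist, and everything else asks for approval. This is a sketch of the idea, not Cursor's allowlist mechanism, and the command lists are illustrative; tune them to your own blast radius.

```python
import shlex

READ_ONLY = {"cat", "ls", "grep", "rg", "head", "tail", "wc", "jq"}
GIT_READ = {"log", "diff", "status", "show", "blame"}
# Redirection, chaining, substitution and subshells all force an approval.
FORBIDDEN = set(";&<>`$()\n")


def decide(command):
    """Return 'auto-run' for plain read-only pipelines, otherwise 'approve'."""
    if any(ch in FORBIDDEN for ch in command):
        return "approve"
    for segment in command.split("|"):
        try:
            argv = shlex.split(segment)
        except ValueError:  # unbalanced quotes: do not guess
            return "approve"
        if not argv:
            return "approve"
        if argv[0] == "git":
            if len(argv) < 2 or argv[1] not in GIT_READ:
                return "approve"
        elif argv[0] not in READ_ONLY:
            return "approve"
    return "auto-run"
```

For example, `decide("git log --oneline | head -5")` auto-runs, while `decide("aws s3 ls | jq .")` asks for approval because `aws` reaches the network. The default is approval, so an unrecognized command costs a second of reading and never a silent mutation.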
How do hooks, rules and Bugbot fit into the review pipeline?
Layer the controls across the lifecycle. Cursor hooks plus Cursor rules run validation while the agent is performing actions, in-flight. Bugbot (Cursor's automated PR reviewer, which posts inline findings and can push fix commits from isolated VMs) runs at PR time as the last gate before merge. And you manually drive the agent in between, periodically telling it to go find its own bugs as you develop. Hooks are specifically powerful for that in-flight validation; rules pin codebase truths the model keeps getting wrong, for example a rule that the service runs on port 40001, not the 3000 the model kept hallucinating.
Each control engages at a different point: in-flight, then at PR, with manual bug-hunting throughout.
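An in-flight check for a pinned codebase truth can be very small. The sketch below scans proposed text for the wrong port from the example above and returns a correction to hand back to the agent. It is deliberately not tied to Cursor's hook interface, whose input and output format you should take from the Cursor docs; the `PINNED_FACTS` structure and `check_edit` name are hypothetical.

```python
import re

# Each entry pairs a pattern for a known-wrong value with the correction to send back.
PINNED_FACTS = [
    {
        "name": "service-port",
        "wrong": re.compile(r"\b(?:localhost|127\.0\.0\.1):3000\b"),
        "hint": "The service runs on port 40001, not 3000.",
    },
]


def check_edit(text):
    """Return a correction hint for every pinned fact the text contradicts."""
    return [fact["hint"] for fact in PINNED_FACTS if fact["wrong"].search(text)]
```

A hook script would call something like `check_edit` on the agent's proposed change and block or warn when the list is non-empty; the rule states the truth up front, and the check catches the cases where the model ignores it.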
Bugbot has caught real security issues in code the reviewer produced themselves. It's the final validation loop before merge. There's a one-click "send to agent and fix" now; the older move of copy-pasting the finding back to the agent to evaluate works fine too. On speed: it's not, by a long shot, the slowest thing in the pipeline, and you want it comprehensive.
Rules that apply on every request eat context every time, so keep the always-applied set lean. Prefer file-pattern rules (all Python files, all scripting files) and nested rules that apply to a folder and its subfolders. A lighter trick: a rule that says "for these top-level operations, consult this document," with the doc organized into sections — so the agent knows where to look without bloating context.
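To see why scoping matters, it helps to model which rules load for a given file. The sketch below is a toy model, not Cursor's rule format or matching logic: the `Rule` fields and the token counts are invented for illustration, and real rule files are configured as described in the Cursor docs.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Rule:
    name: str
    tokens: int          # rough context cost, an invented number for the sketch
    always: bool = False  # applied on every request
    glob: str = ""        # file-pattern rule, matched against the file name
    folder: str = ""      # nested rule: this folder and its subfolders


def applies(rule, path):
    """Decide whether a rule loads for the file at `path` (POSIX-style)."""
    target = PurePosixPath(path)
    if rule.always:
        return True
    if rule.glob and fnmatch(target.name, rule.glob):
        return True
    if rule.folder and PurePosixPath(rule.folder) in target.parents:
        return True
    return False


def context_cost(rules, path):
    """Total tokens spent on rules when the agent works on `path`."""
    return sum(rule.tokens for rule in rules if applies(rule, path))
```

With a lean always-applied rule plus scoped rules for Python files and an infra folder, a README edit pays only for the always-applied rule, which is the effect the advice above is after.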
What stays human, and where does this approach fall short?
Merging stays human. The agent finds, proves and proposes; you read every line you'd ship and own the outcome. Don't accept "the model thinks it's a vulnerability, therefore it is" — or the inverse. For penetration testing, the workflow already finds bugs and writes PoCs, so running the PoC as a live exploit isn't a giant leap, and teams are chaining it. But validated-from-source-of-truth (perfect visibility of code and config) is often enough without a live exploit, and you should never warranty offensive work you haven't done yourself.
A few honest limits. Effectiveness multipliers here are self-reported, not measured. Context-window management used to dominate the workflow; modern compaction is good enough that a roughly 9-hour session ran without a single reset. When an agent does go off the rails, though, the recovery is still manual: have it summarize, copy that out, spin up a fresh agent, paste it in. Some teased sandbox and PR-security-loop features aren't announced yet, so don't plan around them. And enterprise-wide controls, such as MCP governance and org-level policy, are a separate topic from this single-reviewer workflow. (MCP is the Model Context Protocol, a standard that lets an AI agent pull in context from outside the repo, like Jira tickets or internal docs.)
The division of labor that makes this work: the agent codes fast; the human thinks in systems, reasons from first principles and hunts failure cases. Most of the value is in asking the right questions, not typing the code.
Between the two of us — this very capable agent that writes code very fast and my analytical brain — we tend to do stuff well.
Frequently asked questions
Are AI agents better than static scanners for security review?
For most review work, yes. Static rules can't trace context through a codebase the way an agent can. Agents add one failure mode — they fabricate — so prime them by vulnerability class to cut false negatives and run a validation loop (attack chain plus a working proof of concept) to cut false positives. A good hybrid feeds linter and scanner output into the agent for triage.
What is the validation loop for agent-found vulnerabilities?
When the agent reports a finding, copy it back and demand proof: explain the attack chain, show the code snippets, describe exactly how it would be exploited, and prove the exploit works. The agent re-investigates and frequently concludes it isn't a real vulnerability — often an intentional feature. You can also tell the model up front to self-validate every finding with a proof of concept.
How do I keep my credentials out of the coding agent?
Use a helper script the agent calls (for example, you type creds prod) that sets environment variables in the shell the agent drives but cannot itself read. The agent runs elevated commands without ever seeing the credentials. The agent drives a terminal through a tool and sees output, but the shell's env properties aren't exposed to it. Approve every elevated command and read each one.
Which commands should I auto-run versus approve when reviewing with an agent?
Let reads run free; approve every state-mutating and network command. Reads are low blast radius and how the agent does its work. Mutations and network access touch systems where your user has more permission than you want to hand the agent. Running in a git workspace makes unwanted writes recoverable with a reset. Your exact line depends on your own threat model.
Why is prompt injection so hard to defend against?
Because today's models have no separation of control and data — instructions and untrusted content arrive on the same channel, and the model doesn't distinguish trust levels of whoever feeds it information. Mitigate with exposure and capability limits: keep tools off the public internet, put them behind a private network like Tailscale, add guardrails, and test the adversarial scenario explicitly.
When should I use Plan mode for security work?
Use Plan mode for anything net-new or non-trivial, especially infrastructure. In plan the agent farms context aggressively up front, surfacing injection points before it writes code and staying on track during execution. Read the generated plan carefully and give feedback there. For a quick one-line change, skip plan and use Agent mode.
Sources & last verified
Cursor ships frequently. Facts verified against primary sources on June 25, 2026.