AI-assisted review, testing & anti-patterns
Add AI as an independent signal inside the review process teams already trust.
The PR is the unit of trust
Everything an AI writes is a draft until it lands as a pull request. The PR is where intent, change, evidence, and human accountability fuse into one reviewable, revertable, auditable artifact. If you only internalize one idea from this module: the unit of trust is the PR — not the diff, not the chat transcript, not the agent run.
Why the PR and not the commit or the agent output? Because a PR is the smallest object that carries all four things an enterprise needs at once: a bounded scope (a blast radius you can reason about, meaning how much breaks if the change goes wrong), a description (the intent and the why), a set of checks (the evidence), and an approval (the human who is now accountable). A raw agent transcript has none of these as durable, queryable records. The PR is what your ITGC (IT General Controls: the baseline IT controls auditors check, covering who can change what, how changes get approved, and how systems are run) auditor will pull six months from now and ask: who approved this, what did the checks say, and does the description match the diff?
AI changes who drafts the code. It must not change how code lands. The pipeline that protects production — scoped PR, deterministic checks, independent review, code-owner approval — stays exactly the same whether a human or an agent typed the characters.
The job of a field engineer is to make AI productive inside that pipeline, never to weaken the pipeline to make AI look productive.
The practical mandate this puts on every author — human or agent-driven — is that the PR must be well-scoped, well-tested, and well-described so that nothing downstream changes. Downstream here means your reviewers, your CI, your CODEOWNERS routing, your release process, your auditors. If an AI-authored PR forces any of those to behave differently — bigger diffs, weaker checks, skipped approvals — you've leaked the cost of AI velocity onto the parts of the system that exist to contain risk.
- Well-scoped: One coherent change. A reviewer can hold the whole blast radius in their head. Refactor and behavior change are not mixed in one PR.
- Well-tested: Tests assert the intended behavior, run in CI, and would fail if the change regressed. Coverage is an input to the gate, not the trophy.
- Well-described: Title and body state intent, approach, and risk. The description must match the diff; drift between them is itself a review finding.
- Accountable: A required human code-owner approves. That signature is the non-collapsible step. AI is a signal feeding it, never a substitute for it.
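The four properties above can be approximated as a pre-review lint. What follows is a minimal sketch, not a real platform integration: the `PullRequest` shape, the thresholds, and the "a path containing `test`" heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a PR; not any real platform's API."""
    title: str
    body: str
    changed_files: list[str]
    checks_passed: bool
    author: str
    approvers: list[str] = field(default_factory=list)


# Illustrative thresholds (assumptions); a real team would tune these per repo.
MAX_FILES = 25
MIN_BODY_CHARS = 80


def readiness_findings(pr: PullRequest, code_owners: set[str]) -> list[str]:
    """Return human-readable findings; an empty list means all four properties hold."""
    findings = []
    if len(pr.changed_files) > MAX_FILES:
        findings.append("scope: too many files for one coherent change")
    if not any("test" in path for path in pr.changed_files):
        findings.append("tests: no test files changed")
    if not pr.checks_passed:
        findings.append("tests: required checks are not green")
    if len(pr.body.strip()) < MIN_BODY_CHARS:
        findings.append("description: body too short to state intent, approach, and risk")
    # Accountability: an approval only counts if it comes from a code owner
    # who is not the author (human or agent).
    independent = [a for a in pr.approvers if a in code_owners and a != pr.author]
    if not independent:
        findings.append("accountability: no code-owner approval from someone other than the author")
    return findings
```

A lint like this can only flag the obvious misses. Whether the description actually matches the diff still takes a human read.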
"AI changes who writes the first draft. It does not change how code earns its way into production. The PR is still the unit of trust, and a human code-owner still signs."
Layered review: AI as an independent signal
Review is not a single gate; it's a defense in depth. Each layer catches a different class of failure, and — critically — each layer is independent of the others. The moment two layers share the same blind spot, you've spent process without buying safety. The whole architecture is built so that the cheap, fast, deterministic layers catch the boring stuff, and scarce human attention is reserved for judgment: design, intent, risk, and the things only a domain expert sees.
Author self-review → deterministic checks → AI review (independent signal) → specialist review → required human code-owner approval. The code-owner gate never collapses into the others; it is the accountable signature.
1. Author self-review. Before requesting review, the author reads their own diff line by line. With AI-authored code this is non-negotiable: you are claiming authorship of what the agent produced. If you can't explain a line, it doesn't ship.
2. Deterministic checks. Lint, type-check, build, tests, SAST (Static Application Security Testing: scanning source code for vulnerabilities without running it), secret scanning, license checks. These are reproducible and non-negotiable: same input, same verdict. They are the floor, not the ceiling.
3. AI review (independent signal). Bugbot (Cursor's automated PR reviewer, which posts inline findings and can push fix commits from isolated VMs) reads the PR fresh and comments inline. It is a second pair of eyes that did not write the code, and that independence is the entire point.
4. Specialist review. Security, SRE (Site Reliability Engineering: the team and practice that keeps production reliable through monitoring, on-call, and incident response), data, or domain experts for changes that touch their risk surface. Routed by CODEOWNERS and risk tier, not by vibes.
5. Required human code-owner approval. The accountable signature. This is the separation-of-duties control: the person who merges is not the person (or agent) who authored. It never collapses into any earlier layer.
The dangerous shortcut is letting AI review substitute for human approval: "Bugbot passed, ship it." That collapses two independent layers into one and quietly destroys separation of duties: the rule that no single person can author, approve, and deploy the same change, and the core control AI autonomy has to respect.
AI review is an input to the human decision, never the decision. If an org wires Bugbot's pass as an auto-merge trigger with no code-owner sign-off on risk-bearing code, they've built a control that an auditor will flag, and that will eventually merge something nobody owns.
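The rule that AI review informs the decision but never makes it can be written as a merge predicate. This is a sketch under stated assumptions: `ReviewState` and its fields are hypothetical, and a real setup would enforce this through branch protection, not application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewState:
    """Hypothetical snapshot of a PR's review layers."""
    author: str
    deterministic_checks_green: bool
    ai_review_open_findings: int          # unresolved AI reviewer comments
    code_owner_approvals: frozenset[str]  # code owners who approved


def may_merge(state: ReviewState) -> bool:
    """Merge requires green checks AND an approval from a code owner who is not the author.

    ai_review_open_findings is deliberately absent from the condition: the AI
    reviewer's output is shown to the human approver, who decides what to do
    with it. A clean AI review alone never unlocks the merge.
    """
    independent_approvals = state.code_owner_approvals - {state.author}
    return state.deterministic_checks_green and bool(independent_approvals)
```

With this shape, a PR with zero AI findings and no code-owner approval is still blocked, and self-approval by the author does not count.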
Why independence is load-bearing: the core interview point
If the same model that wrote the code also reviews it, the review inherits the author's blind spots: it will rationalize the same flawed assumption it baked in. Independence is what makes a second signal worth anything. That's why Bugbot reviewing an agent's PR is valuable (different context, fresh read of the final diff) and why a human code-owner reviewing both is still required: the human carries accountability and judgment that no model layer provides.
| Layer | Nature | Catches | Trust property |
|---|---|---|---|
| Deterministic checks | Reproducible, binary verdict | Syntax, types, regressions, secrets, license violations | Same input → same output, every time |
| AI review | Probabilistic, contextual second read | Logic bugs, edge cases, missing error handling, intent drift | Independence: didn't author the code |
| Human code-owner approval | Judgment + accountability | Design fit, risk, business intent, the unknown-unknowns | The non-collapsible signature. Separation of duties lives here |
Self-check
Q: In the layered review model, which layer must NEVER collapse into the others, and why?
Bugbot mechanics and the discipline of tuning
Bugbot is Cursor's AI code reviewer. Mechanically: it auto-runs when a PR is opened or updated, reads the diff in the context of the repo, and leaves inline comments on the specific lines it's worried about. It is not a linter with a fixed ruleset; it reasons about logic, edge cases, error handling, and whether the change matches its apparent intent. As of June 2026 it's roughly 3x faster and 22% cheaper than the prior generation, finds about 10% more bugs, and 90% of runs finish in under 3 minutes, which is fast enough to sit in the PR loop without becoming the bottleneck. (Treat the specific percentages as perishable and verify them before quoting in a customer setting.)
- Trigger: auto-runs on PR open and update (every push to the PR)
- Output: inline comments on the exact lines of concern
- Customization: .cursor/BUGBOT.md, with repo- and area-specific custom rules
- Autofix: isolated cloud-VM agents propose fixes; ~35% of autofix changes get merged
- Speed (Jun 2026): ~90% of runs under 3 minutes; ~3x faster, ~22% cheaper than prior gen (verify)
Custom rules: .cursor/BUGBOT.md (per-area, checked into the repo)
Generic review advice is noise. The leverage is .cursor/BUGBOT.md, a checked-in file where a team encodes its hard-won rules: "never call the billing API without an idempotency key," "all DB migrations must be backward-compatible for one release," "PII fields must go through the redaction helper." (PII is Personally Identifiable Information: data that can identify a person, such as names, emails, and SSNs; it is regulated and sensitive.) Because it lives in the repo and can be scoped per area, the same Bugbot enforces different standards in the payments directory than in the marketing-site directory. That's how you turn an AI reviewer from a generic nag into a guardian of your specific risk tiers.
Autofix and isolated VMs: propose, don't presume
When Bugbot can propose a fix, Autofix spins up an isolated cloud-VM agent to generate the change. The agent is sandboxed, so fix generation has no ambient access to your laptop or secrets. Roughly 35% of autofix-proposed changes end up merged. Read that number correctly in an interview: it is not a failure rate. It means about a third of proposals were good enough to accept and the rest were correctly rejected by a human, which is exactly the AI-proposes / human-disposes model working as designed. A 100% merge rate would actually be alarming; it would mean nobody was reviewing.
The single fastest way to kill an AI reviewer's credibility is false positives. Three noisy comments and engineers start reflexively dismissing every Bugbot comment, including the true one that would have caught the incident.
This makes tuning first-class work, not a someday-chore. It needs an explicit owner and a cadence: someone reviews dismissed/ignored comments, prunes or sharpens .cursor/BUGBOT.md rules, and tracks the signal-to-noise ratio over time. An untuned reviewer doesn't just fail to help — it actively trains your team to ignore review.
If asked 'how do you know Bugbot is working?', don't cite bugs found. Cite signal-to-noise trend and comment-resolution rate, with a named owner and a review cadence. The metric that matters is: are engineers acting on its comments, or dismissing them?
Bonus credibility: point out that 'tuning has an owner and a cadence' is the same operating discipline you'd apply to any noisy alerting system. False positives are an SRE (Site Reliability Engineering) problem, not a tooling quirk.
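The comment-resolution rate described above is simple arithmetic once each AI review comment has a recorded outcome. This is a minimal sketch: the outcome labels are hypothetical, and how a team actually collects them (reactions, resolved threads, manual triage) is an assumption left open here.

```python
from collections import Counter

# Hypothetical outcome labels a team might record for each AI review comment.
ACTED_ON = {"fixed", "acknowledged"}
NOISE = {"dismissed", "ignored"}


def resolution_rate(outcomes: list[str]) -> float:
    """Fraction of comments engineers acted on; 0.0 when there are no comments."""
    counts = Counter(outcomes)
    acted = sum(counts[label] for label in ACTED_ON)
    noise = sum(counts[label] for label in NOISE)
    total = acted + noise
    return acted / total if total else 0.0


def trend(weekly_outcomes: list[list[str]]) -> list[float]:
    """Resolution rate per week, rounded, so the owner can watch the direction."""
    return [round(resolution_rate(week), 2) for week in weekly_outcomes]
```

A falling trend is the cue for the tuning owner to prune or sharpen rules before engineers start dismissing comments by reflex.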
Test generation discipline: assert intent, not implementation
AI is fantastic at generating tests and terrible at knowing what's worth testing. The discipline is the whole game: a generated test that mirrors the implementation is not a safety net — it's a tripwire that fires only when you change the code, never when the code is wrong.
Assert intent, not the implementation: the cardinal rule
The classic AI test anti-pattern: the agent writes the function, then writes a test that re-states what the function does line for line. If calculateTax has an off-by-one, the test that was generated from that buggy function will assert the buggy output and pass forever. The test must instead encode the intent — the spec, the business rule, the expected behavior at the boundaries — derived independently of how the code happens to work. A good prompt is "write tests that assert the documented behavior and edge cases of this function" — and then you read them and confirm they'd fail against a wrong implementation.
| Mirrors implementation (bad) | Asserts intent (good) |
|---|---|
| Reads the code, restates its output | Reads the spec/requirement, states expected behavior |
| Passes even when the code is wrong | Fails when behavior is wrong, regardless of implementation |
| Breaks on every harmless refactor | Survives refactors; breaks only on behavior change |
| Generated and merged unread | Generated, then read and challenged by a human |
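To make the contrast concrete, here is a small sketch with an invented business rule: purchases at or above 100.00 (10,000 cents) are taxed at 10%. The rule, the function names, and the values are all assumptions for illustration. The buggy version uses `>` where that spec says "at or above".

```python
def calculate_tax_buggy(amount_cents: int) -> int:
    # Off-by-one at the boundary: the hypothetical spec says "at or above".
    return amount_cents // 10 if amount_cents > 10_000 else 0


def calculate_tax_fixed(amount_cents: int) -> int:
    return amount_cents // 10 if amount_cents >= 10_000 else 0


def mirror_test(calc) -> bool:
    """Mirrors the code: expected values were copied from the buggy function's output."""
    expected = {9_999: 0, 10_000: 0, 20_000: 2_000}
    return all(calc(amount) == want for amount, want in expected.items())


def intent_test(calc) -> bool:
    """Encodes the spec, including the boundary, without reading the code."""
    expected = {9_999: 0, 10_000: 1_000, 20_000: 2_000}
    return all(calc(amount) == want for amount, want in expected.items())
```

The mirror test passes against the buggy function and fails once the bug is fixed, which is backwards. The intent test fails against the bug and passes against the fix.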
Characterization tests before a refactor: pin behavior first
Before you let an agent refactor a gnarly legacy module, write characterization tests first: tests that capture the current observable behavior, warts and all, even behavior you suspect is wrong. This is the one place where 'mirror the existing behavior' is correct, because the goal is to pin behavior so the refactor can't silently change it. AI is excellent at this: point it at the module, have it generate characterization tests across the input space, confirm they pass against the current code, then refactor. If a characterization test breaks during the refactor, you've caught a behavior change you didn't intend, which is exactly what you wanted.
Characterization tests answer 'what does this code do today?' — you mirror behavior on purpose, as a safety harness for change.
Intent tests answer 'what should this code do?' — you derive from the spec, independent of the implementation.
Both are useful. Confusing them — writing intent-shaped tests that secretly just mirror the code — is how you ship a green suite that proves nothing.
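A characterization harness can be as small as "record outputs, then diff them after the refactor". The sketch below uses an invented `legacy_discount` function as a stand-in for a legacy module; every name in it is hypothetical.

```python
def legacy_discount(quantity: int) -> int:
    """Stand-in for a gnarly legacy function (hypothetical)."""
    if quantity < 0:
        return 0
    if quantity > 10:
        return 15
    return quantity  # oddity: discount equals quantity for small orders


def characterize(func, inputs):
    """Record current behavior, warts and all, as the baseline."""
    return {x: func(x) for x in inputs}


def behavior_changes(baseline, candidate):
    """Inputs where the candidate disagrees with the baseline, as {input: (old, new)}."""
    return {x: (old, candidate(x)) for x, old in baseline.items() if candidate(x) != old}


def refactored_discount(quantity: int) -> int:
    # A tidier rewrite that accidentally moves the boundary from > 10 to >= 10.
    return 15 if quantity >= 10 else max(quantity, 0)
```

Running `behavior_changes(characterize(legacy_discount, range(-2, 13)), refactored_discount)` reports the single input where the refactor changed behavior (quantity 10). Whether the old behavior was correct is a separate question for an intent test.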
Coverage is a gate input, never the goal: Goodhart's Law
Coverage is useful as a floor — a PR that drops coverage on a critical module is worth a second look. But the moment coverage becomes a target, Goodhart's Law kicks in: "when a measure becomes a target, it ceases to be a good measure." Teams (and agents) chasing a 90% number generate tests that execute lines without asserting anything meaningful — coverage goes up, real safety goes flat or down. AI makes this worse because it can manufacture coverage-padding tests by the dozen in seconds.
"Coverage tells you what code ran, never whether it's correct. We gate on coverage as a floor and review tests for whether they assert intent. An agent can take any number to 90% — that's exactly why the number can't be the goal."
Self-check
Q: Coverage on a team's repo jumps from 60% to 92% the month after they adopt an AI agent, but production incidents don't drop. What's the most likely explanation?
Cursor in CI and the bright line of autonomy
Cursor's agent surface now reaches into the CI loop. Cloud Agents (shipped in 3.5, May 2026) run in isolated cloud VMs with terminal and browser access, can work across multiple repos in parallel, and report back asynchronously — which means an agent can be triggered by a failing pipeline, investigate, and propose a fix without a human babysitting the terminal. The CLI gained /debug in 3.1. This is powerful, and it's exactly where the governance question gets sharp.
The bright line: propose vs commit
There is one distinction that decides everything about how you deploy agents in CI:
AI proposes:
- Agent opens a PR. A human code-owner reviews and merges.
- All the existing gates still apply: checks, AI review, approval.
- This is the default. It scales without expanding blast radius.
- Failure mode is contained: a bad proposal is a rejected PR, not an incident.

AI commits autonomously:
- Agent merges or deploys without a human in the loop.
- Requires explicit governance: narrow scope, low risk tier, audit logging, kill switch, named owner.
- Never the default. Justified case-by-case, reversible, observable.
- Failure mode reaches production directly, so it's earned, not assumed.
AI proposes is default-safe and should be ~99% of your footprint. AI commits autonomously is a governed exception — allowed only on a narrow, low-risk surface, with logging and a kill switch, owned by someone accountable.
The blast radius determines which side of the line a use case sits on. A doc typo fix on a static site can earn autonomy. A schema migration on the payments DB never does.
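The propose-versus-commit line can be encoded as a default-deny policy. This is a sketch only: the `UseCase` fields and the string risk tiers are hypothetical, and a real deployment would source them from the org's own governance records.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    PROPOSE = "open a PR for human review"
    COMMIT = "merge or deploy autonomously"


@dataclass(frozen=True)
class UseCase:
    """Hypothetical description of an agent use case; field names are illustrative."""
    name: str
    risk_tier: str            # "low", "medium", or "high"
    narrow_scope: bool
    audit_logging: bool
    kill_switch: bool
    named_owner: Optional[str]


def allowed_mode(use_case: UseCase) -> Mode:
    """Default to PROPOSE; COMMIT only when every governance condition holds."""
    earned = (
        use_case.risk_tier == "low"
        and use_case.narrow_scope
        and use_case.audit_logging
        and use_case.kill_switch
        and bool(use_case.named_owner)
    )
    return Mode.COMMIT if earned else Mode.PROPOSE
```

The design choice worth noting is that autonomy is the exception branch: a missing owner or kill switch silently falls back to proposing, never the other way around.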
Never fix a pipeline by disabling a check: the cardinal sin
When CI goes red and the pressure is on, the seductive 'fix' is to disable the failing check, mark the test as skipped, or merge with admin override. This is the one move that is always wrong. The check is the evidence; disabling it doesn't fix the problem, it deletes the proof that a problem exists and silently lowers the bar for everyone after you. An agent under instruction to 'make CI green' will absolutely do this if you let it — which is why agents that touch CI need rules that forbid it, and why a human reviews any change to the pipeline config itself.
1. Read the failure. Get the actual error and the failing check's logs; don't guess. (The CLI /debug and Cloud Agents are good at exactly this.)
2. Reproduce locally (or in an isolated agent VM). A failure you can't reproduce, you can't trust you've fixed.
3. Find the root cause. Is the code wrong, or is the test/check wrong? Both are legitimate, but you must know which, and changing a test to match buggy code is itself the bug.
4. Fix the cause, not the symptom. Repair the code, or correct the test if the test was genuinely wrong (with justification in the PR). Never disable, skip, or override to go green.
5. Re-run the full suite and confirm green for the right reason. Land it as a normal PR through the normal gates.
"You never fix a red pipeline by disabling the check. The check isn't the obstacle — it's the evidence. Disabling it doesn't solve the problem, it just deletes the proof and lowers the bar for everyone who merges after you."
The anti-pattern taxonomy
Every AI-in-the-SDLC failure mode rhymes. Learn the taxonomy and you can name what's going wrong in a customer's org in one sentence, then point at the guardrail. For each: the failure, who loses trust, and the control that contains it.
| Anti-pattern | Failure → who loses trust | Guardrail |
|---|---|---|
| Vibe-merges | Merging on a green-checkmark feeling without reading the diff → reviewers, then prod when it breaks | Required author self-review + code-owner approval; AI review as independent signal, never as the merge trigger |
| Mega-diffs | One PR touches 80 files mixing refactor + behavior → reviewers can't hold the blast radius, so they rubber-stamp | Enforce scope: split refactor from behavior; small, coherent PRs; flag oversized diffs in review |
| Prompt-and-pray | Fire a vague prompt, accept whatever comes back unverified → the author (their name is on it) | Specify intent, read the output, test it. The author owns the agent's work, full stop |
| Fabricated confidence | Model states a wrong answer fluently (hallucinated API, invented behavior) → whoever trusted the fluent tone | Verify against ground truth — docs, types, a running test. Tone is not evidence |
| Hidden generated code | Large AI-generated blocks merged with no signal they're generated → future maintainers + auditors | AI-code tracking / attribution; honest PR descriptions; review density scales with how much is generated |
| Context rot | Long agent session drifts off the original task; later edits contradict earlier ones → the author, silently | Short, scoped sessions; re-ground the agent; review the final diff against intent, not the conversation |
| Secrets / prompt-injection | Secrets pasted into context, or hostile content in a repo/issue hijacks the agent → security, the whole org | Secret scanning, Privacy Mode/ZDR (zero data retention), terminal sandboxing, least-privilege MCP (Model Context Protocol), treat external content as untrusted |
| Excessive agent permissions | Agent granted broad repo/tool/prod access 'to be convenient' → security + SRE; blast radius is now huge | Least privilege: model/MCP/repo allowlists, RBAC (role-based access control), terminal sandboxing, isolated VMs, scoped tokens |
| Volume-as-success / mandated usage | Measuring lines-of-AI-code or mandating usage → leadership credibility, then engineers who game the metric | Measure outcomes (DORA metrics such as deployment frequency, lead time for changes, change failure rate, and time to restore service; throughput; defect rate), not AI volume; adoption via mentorship, not mandate |
Volume-as-success and mandated usage are dangerous precisely because they look like leadership 'driving adoption.' They produce a metric that goes up while trust goes down — engineers game the number, and the org learns that AI means 'theater you're forced to perform.'
The Box case study is the counter-model: 85%+ daily active and 30–50% throughput gains came from mentorship (+75% usage in 6 weeks via peer enablement), not from a mandate. Pull, not push.
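The gap between a volume metric and an outcome metric shows up in a few lines of arithmetic. In this sketch the `Deployment` record and its fields are hypothetical; change failure rate is computed in the usual DORA sense, as the share of deployments that caused a failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment record."""
    caused_incident: bool
    ai_authored_lines: int


def change_failure_rate(deploys: list[Deployment]) -> float:
    """Share of deployments that caused an incident; 0.0 when there are none."""
    if not deploys:
        return 0.0
    return sum(d.caused_incident for d in deploys) / len(deploys)


def ai_volume(deploys: list[Deployment]) -> int:
    """Total AI-authored lines shipped: the vanity number."""
    return sum(d.ai_authored_lines for d in deploys)
```

If `ai_volume` climbs quarter over quarter while `change_failure_rate` stays flat or worsens, the volume number is measuring activity, not value.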
How they cluster: three root causes
Notice the through-line: nearly every guardrail in the right column is something we already covered — the well-formed PR, the layered review with a non-collapsible human gate, intent-asserting tests, least privilege, and outcome metrics over volume. The anti-patterns aren't exotic; they're what you get when you skip the disciplines. Your job in the field is to spot which discipline a team dropped and restore it.
When a customer describes a mess, resist solutioning immediately. Name the anti-pattern first ('that's classic vibe-merging' / 'you've made coverage a target'), say who's losing trust, then prescribe the single guardrail. That sequence — name → trust impact → control — is what separates a field engineer from a feature-lister.
Strongest closing line: 'AI didn't break your pipeline. Skipping a discipline did. We put the discipline back — the PR stays the unit of trust.'