Field Academy
DAY 5 20 min

AI-assisted review, testing & anti-patterns

Add AI as an independent signal inside the review process teams already trust.

The PR is the unit of trust

Everything an AI writes is a draft until it lands as a pull request. The PR is where intent, change, evidence, and human accountability fuse into one reviewable, revertable, auditable artifact. If you only internalize one idea from this module: the unit of trust is the PR — not the diff, not the chat transcript, not the agent run.

Why the PR and not the commit or the agent output? Because a PR is the smallest object that carries all four things an enterprise needs at once: a bounded scope (a blast radius, meaning how much breaks if the change goes wrong, that you can reason about), a description (the intent and the why), a set of checks (the evidence), and an approval (the human who is now accountable). A raw agent transcript has none of these as durable, queryable records. The PR is what your ITGC auditor will pull six months from now to ask: who approved this, what did the checks say, and does the description match the diff? (ITGC stands for IT General Controls, the baseline IT controls auditors check: who can change what, how changes get approved, and how systems are run.)

The thesis

AI changes who drafts the code. It must not change how code lands. The pipeline that protects production — scoped PR, deterministic checks, independent review, code-owner approval — stays exactly the same whether a human or an agent typed the characters.

The job of a field engineer is to make AI productive inside that pipeline, never to weaken the pipeline to make AI look productive.

The practical mandate this puts on every author — human or agent-driven — is that the PR must be well-scoped, well-tested, and well-described so that nothing downstream changes. Downstream here means your reviewers, your CI, your CODEOWNERS routing, your release process, your auditors. If an AI-authored PR forces any of those to behave differently — bigger diffs, weaker checks, skipped approvals — you've leaked the cost of AI velocity onto the parts of the system that exist to contain risk.

What makes a PR "well-formed"
Well-scoped
One coherent change. A reviewer can hold the whole blast radius in their head. Refactor and behavior change are not mixed in one PR.
Well-tested
Tests assert the intended behavior, run in CI, and would fail if the change regressed. Coverage is an input to the gate, not the trophy.
Well-described
Title + body state intent, approach, and risk. The description must match the diff — drift between them is itself a review finding.
Accountable
A required human code-owner approves. That signature is the non-collapsible step. AI is a signal feeding it, never a substitute for it.
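The four properties above can be approximated as a pre-review lint. The sketch below is illustrative only: the `PullRequest` fields, the 400-line threshold, the 50-character description minimum, and the `well_formed_findings` helper are all assumptions made for this example, not part of any real tool or API.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a PR; field names are illustrative."""
    author: str
    title: str
    body: str
    changed_lines: int
    mixes_refactor_and_behavior: bool
    tests_changed: bool
    ci_passed: bool
    approvers: list[str] = field(default_factory=list)
    code_owners: list[str] = field(default_factory=list)


MAX_CHANGED_LINES = 400  # assumed team threshold, tune per repo


def well_formed_findings(pr: PullRequest) -> list[str]:
    """Return the reasons a PR is not well-formed (an empty list means OK)."""
    findings = []
    # Well-scoped: one coherent change a reviewer can hold in their head.
    if pr.changed_lines > MAX_CHANGED_LINES:
        findings.append("scope: diff too large to review as one change")
    if pr.mixes_refactor_and_behavior:
        findings.append("scope: refactor and behavior change are mixed")
    # Well-tested: tests exist for the change and CI is green.
    if not pr.tests_changed:
        findings.append("tests: no test was added or updated")
    if not pr.ci_passed:
        findings.append("tests: CI checks are not green")
    # Well-described: a title and a body long enough to state intent and risk.
    if not pr.title.strip() or len(pr.body.strip()) < 50:
        findings.append("description: intent, approach, and risk are missing")
    # Accountable: at least one code-owner other than the author approved.
    owner_approvals = [
        a for a in pr.approvers if a in pr.code_owners and a != pr.author
    ]
    if not owner_approvals:
        findings.append("accountability: no code-owner approval from a non-author")
    return findings
```

A lint like this only catches the mechanical failures. Whether the description actually matches the diff, and whether the tests assert the right behavior, still needs a human read.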
Say it like this

"AI changes who writes the first draft. It does not change how code earns its way into production. The PR is still the unit of trust, and a human code-owner still signs."

Layered review: AI as an independent signal

Review is not a single gate; it's a defense in depth. Each layer catches a different class of failure, and — critically — each layer is independent of the others. The moment two layers share the same blind spot, you've spent process without buying safety. The whole architecture is built so that the cheap, fast, deterministic layers catch the boring stuff, and scarce human attention is reserved for judgment: design, intent, risk, and the things only a domain expert sees.

The layered review pipeline
PR → checks → human review (⛔ gate) → merge → build → staging → canary → prod
One human gate: everything else automated; roll back by redeploying the last good artifact (mind the DB migrations).

Author self-review → deterministic checks → AI review (independent signal) → specialist review → required human code-owner approval. The code-owner gate never collapses into the others; it is the accountable signature.

  1. Author self-review. Before requesting review, the author reads their own diff line by line. With AI-authored code this is non-negotiable: you are claiming authorship of what the agent produced. If you can't explain a line, it doesn't ship.
  2. Deterministic checks. Lint, type-check, build, tests, SAST (static application security testing, which scans source code for vulnerabilities without running it), secret scanning, license checks. These are reproducible and non-negotiable: same input, same verdict. They are the floor, not the ceiling.
  3. AI review (independent signal). Bugbot (Cursor's automated PR reviewer, which posts inline findings and can push fix commits from isolated VMs) reads the PR fresh and comments inline. It is a second pair of eyes that did not write the code, and that independence is the entire point.
  4. Specialist review. Security, SRE (site reliability engineering, the team and practice that keeps production reliable), data, or domain experts for changes that touch their risk surface. Routed by CODEOWNERS and risk tier, not by vibes.
  5. Required human code-owner approval. The accountable signature. This is the separation-of-duties control: the person who merges is not the person (or agent) who authored. It never collapses into any earlier layer.
Watch out — the collapse failure

The dangerous shortcut is letting AI review substitute for human approval: "Bugbot passed, ship it." That collapses two independent layers into one and quietly destroys separation of duties: the rule that no single person can author, approve, and deploy the same change, and the core control AI autonomy has to respect.

AI review is an input to the human decision, never the decision. If an org wires Bugbot's pass as an auto-merge trigger with no code-owner sign-off on risk-bearing code, they've built a control that an auditor will flag, and one that will eventually merge something nobody owns.
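One way to make the non-collapsible gate concrete is to write the merge decision as a function in which AI review can block or inform but can never approve. This is a hedged sketch: `ReviewState` and `may_merge` are hypothetical names, the rule that open AI findings must be resolved before merge is an assumed policy, and real enforcement would normally live in branch-protection settings rather than in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewState:
    """Hypothetical snapshot of a PR's review signals."""
    author: str
    deterministic_checks_passed: bool
    ai_review_open_findings: int          # unresolved AI reviewer comments
    code_owner_approver: Optional[str]    # None means no code-owner has approved


def may_merge(state: ReviewState) -> tuple[bool, str]:
    """Decide whether a PR may merge. AI review is an input, never the decision."""
    if not state.deterministic_checks_passed:
        return False, "deterministic checks failed"
    if state.code_owner_approver is None:
        # A clean AI review does not substitute for the human signature.
        return False, "missing code-owner approval"
    if state.code_owner_approver == state.author:
        return False, "approver is the author (separation of duties)"
    if state.ai_review_open_findings > 0:
        # Assumed policy: a human resolves or dismisses findings before merge.
        return False, "unresolved AI review findings"
    return True, "ok"
```

Note that there is no branch in which a clean AI review alone returns `True`. That asymmetry is the point: the AI layer can add reasons to stop, and only the code-owner can supply the reason to go.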

Why independence is load-bearing (the core interview point)

If the same model that wrote the code also reviews it, the review inherits the author's blind spots: it will rationalize the same flawed assumption it baked in. Independence is what makes a second signal worth anything. That's why Bugbot reviewing an agent's PR is valuable (different context, fresh read of the final diff) and why a human code-owner reviewing both is still required: the human carries accountability and judgment that no model layer provides.

Deterministic checks

Reproducible, binary verdict.

Catch: syntax, types, regressions, secrets, license violations.

Trust property: same input → same output, every time.

AI review

Probabilistic, contextual second read.

Catch: logic bugs, edge cases, missing error handling, intent drift.

Trust property: independence — didn't author the code.

Human code-owner

Judgment + accountability.

Catch: design fit, risk, business intent, the unknown-unknowns.

Trust property: the non-collapsible signature. Separation of duties lives here.

Self-check

Q: In the layered review model, which layer must NEVER collapse into the others, and why?

Bugbot mechanics and the discipline of tuning

Bugbot is Cursor's AI code reviewer. Mechanically: it auto-runs when a PR is opened or updated, reads the diff in the context of the repo, and leaves inline comments on the specific lines it's worried about. It is not a linter with a fixed ruleset; it reasons about logic, edge cases, error handling, and whether the change matches its apparent intent. As of June 2026 it's roughly 3x faster and 22% cheaper than the prior generation, finds about 10% more bugs, and 90% of runs finish in under 3 minutes, which is fast enough to sit in the PR loop without becoming the bottleneck. (Treat the specific percentages as perishable and verify them before quoting in a customer setting.)

Bugbot at a glance
Trigger
Auto-runs on PR open / update (every push to the PR)
Output
Inline comments on the exact lines of concern
Customization
.cursor/BUGBOT.md — repo- and area-specific custom rules
Autofix
Isolated cloud-VM agents propose fixes; ~35% of autofix changes get merged
Speed (Jun 2026)
~90% of runs under 3 minutes; ~3x faster, ~22% cheaper than prior gen (verify)

Custom rules: .cursor/BUGBOT.md (per-area, checked into the repo)

Generic review advice is noise. The leverage is .cursor/BUGBOT.md, a checked-in file where a team encodes its hard-won rules: "never call the billing API without an idempotency key," "all DB migrations must be backward-compatible for one release," "PII (personally identifiable information) fields must go through the redaction helper." Because it lives in the repo and can be scoped per area, the same Bugbot enforces different standards in the payments directory than in the marketing-site directory. That's how you turn an AI reviewer from a generic nag into a guardian of your specific risk tiers.

Autofix and isolated VMs (propose, don't presume)

When Bugbot can propose a fix, Autofix spins up an isolated cloud-VM agent to generate the change. The agent is sandboxed, so fix generation has no ambient access to your laptop or secrets. Roughly 35% of autofix-proposed changes end up merged. Read that number correctly in an interview: it is not a failure rate. It means about a third of proposals were good enough to accept and the rest were rejected by a human, which is exactly the AI-proposes / human-disposes model working as designed. A 100% merge rate would actually be alarming; it would mean nobody was reviewing.

What destroys trust: false positives

The single fastest way to kill an AI reviewer's credibility is false positives. Three noisy comments and engineers start reflexively dismissing every Bugbot comment, including the true one that would have caught the incident.

This makes tuning first-class work, not a someday-chore. It needs an explicit owner and a cadence: someone reviews dismissed/ignored comments, prunes or sharpens .cursor/BUGBOT.md rules, and tracks the signal-to-noise ratio over time. An untuned reviewer doesn't just fail to help — it actively trains your team to ignore review.
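A tuning cadence needs numbers to look at. The sketch below shows one possible way to compute a comment-resolution rate and surface the noisiest rules from a log of review comments. The `ReviewComment` record, its `outcome` labels, and the 50% noise cutoff are assumptions for illustration; they are not drawn from any Bugbot export format or API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    """Hypothetical record of one AI review comment and what happened to it."""
    rule: str      # which rule or category raised it (illustrative labels)
    outcome: str   # "fixed", "dismissed", or "ignored"


def resolution_rate(comments: list[ReviewComment]) -> float:
    """Share of comments that engineers acted on (outcome == 'fixed')."""
    if not comments:
        return 0.0
    fixed = sum(1 for c in comments if c.outcome == "fixed")
    return fixed / len(comments)


def noisiest_rules(comments: list[ReviewComment], min_comments: int = 3) -> list[str]:
    """Rules whose comments are mostly not acted on: candidates to prune or sharpen."""
    totals = Counter(c.rule for c in comments)
    unacted = Counter(c.rule for c in comments if c.outcome != "fixed")
    return sorted(
        rule
        for rule, total in totals.items()
        if total >= min_comments and unacted[rule] / total > 0.5
    )
```

The owner would run something like this on each review cycle, watch whether `resolution_rate` trends up or down, and take the output of `noisiest_rules` as the shortlist for editing the rules file.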

Interview framing

If asked 'how do you know Bugbot is working?', don't cite bugs found. Cite signal-to-noise trend and comment-resolution rate, with a named owner and a review cadence. The metric that matters is: are engineers acting on its comments, or dismissing them?

Bonus credibility: point out that 'tuning has an owner and a cadence' is the same operating discipline you'd apply to any noisy alerting system. False positives are an SRE problem, not a tooling quirk.

Test generation discipline: assert intent, not implementation

AI is fantastic at generating tests and terrible at knowing what's worth testing. The discipline is the whole game: a generated test that mirrors the implementation is not a safety net — it's a tripwire that fires only when you change the code, never when the code is wrong.

Assert intent, not the implementation (the cardinal rule)

The classic AI test anti-pattern: the agent writes the function, then writes a test that re-states what the function does line for line. If calculateTax has an off-by-one, the test that was generated from that buggy function will assert the buggy output and pass forever. The test must instead encode the intent — the spec, the business rule, the expected behavior at the boundaries — derived independently of how the code happens to work. A good prompt is "write tests that assert the documented behavior and edge cases of this function" — and then you read them and confirm they'd fail against a wrong implementation.
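The difference is easiest to see side by side. The sketch below uses an invented spec (orders of 100.00 or more are taxed at 10%) and a Python-style `calculate_tax` with a boundary bug standing in for the off-by-one described above. The spec, the threshold, and all function names are assumptions for illustration. The "tests" return booleans rather than raising, so both outcomes can be shown in one run.

```python
THRESHOLD_CENTS = 10_000  # hypothetical spec: orders of 100.00 or more are taxed at 10%


def calculate_tax_buggy(amount_cents: int) -> int:
    # Bug: the spec says "or more", so this comparison should be >=
    if amount_cents > THRESHOLD_CENTS:
        return amount_cents // 10
    return 0


def calculate_tax_fixed(amount_cents: int) -> int:
    if amount_cents >= THRESHOLD_CENTS:
        return amount_cents // 10
    return 0


def mirror_test(impl) -> bool:
    """Anti-pattern: the expected value is re-derived from the buggy code's own logic."""
    for amount in (5_000, 10_000, 20_000):
        expected = amount // 10 if amount > THRESHOLD_CENTS else 0
        if impl(amount) != expected:
            return False
    return True


def intent_test(impl) -> bool:
    """Expected values come from the spec, including the boundary case."""
    return (
        impl(9_999) == 0
        and impl(10_000) == 1_000
        and impl(20_000) == 2_000
    )


if __name__ == "__main__":
    print(mirror_test(calculate_tax_buggy))  # True: green suite, wrong code
    print(intent_test(calculate_tax_buggy))  # False: the boundary bug is caught
    print(intent_test(calculate_tax_fixed))  # True
    print(mirror_test(calculate_tax_fixed))  # False: the mirror test punishes the fix
```

The last line is the tripwire in action: the mirror test goes red only when someone repairs the code, which is the opposite of what a safety net should do.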

Mirrors implementation (bad) | Asserts intent (good)
Reads the code, restates its output | Reads the spec/requirement, states expected behavior
Passes even when the code is wrong | Fails when behavior is wrong, regardless of implementation
Breaks on every harmless refactor | Survives refactors; breaks only on behavior change
Generated and merged unread | Generated, then read and challenged by a human

Characterization tests before a refactor (pin behavior first)

Before you let an agent refactor a gnarly legacy module, write characterization tests first: tests that capture the current observable behavior, warts and all, even behavior you suspect is wrong. This is the one place where 'mirror the existing behavior' is correct, because the goal is to pin behavior so the refactor can't silently change it. AI is excellent at this: point it at the module, have it generate characterization tests across the input space, confirm they pass against the current code, then refactor. If a characterization test breaks during the refactor, you've caught a behavior change you didn't intend, which is exactly what you wanted.
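A minimal sketch of that workflow: snapshot the legacy function's outputs over a grid of inputs, then diff any refactor against the snapshot. The `legacy_shipping_fee` function, its quirk, and the helper names are all invented for this example.

```python
from itertools import product


def legacy_shipping_fee(weight_kg: int, express: bool) -> int:
    """Invented legacy code, with a quirk we deliberately pin rather than fix."""
    fee = 500 + weight_kg * 100
    if express:
        fee = fee * 2
    if weight_kg > 20:
        fee = fee - 300  # quirk: the discount is applied after express doubling
    return fee


def faithful_refactor(weight_kg: int, express: bool) -> int:
    base = 500 + weight_kg * 100
    multiplier = 2 if express else 1
    discount = 300 if weight_kg > 20 else 0
    return base * multiplier - discount


def tidy_refactor(weight_kg: int, express: bool) -> int:
    """Looks cleaner, but silently moves the discount before the doubling."""
    base = 500 + weight_kg * 100
    discount = 300 if weight_kg > 20 else 0
    return (base - discount) * (2 if express else 1)


def characterize(fn, inputs):
    """Record current behavior: a mapping from each input tuple to its output."""
    return {args: fn(*args) for args in inputs}


def behavior_changes(snapshot, candidate):
    """Inputs where the candidate no longer matches the pinned behavior."""
    return [args for args, expected in snapshot.items() if candidate(*args) != expected]


INPUTS = list(product((0, 1, 20, 21, 50), (False, True)))
SNAPSHOT = characterize(legacy_shipping_fee, INPUTS)

if __name__ == "__main__":
    print(behavior_changes(SNAPSHOT, faithful_refactor))  # []
    print(behavior_changes(SNAPSHOT, tidy_refactor))      # [(21, True), (50, True)]
```

Whether the post-doubling discount is a bug is a separate question for a separate PR. The snapshot only guarantees that the refactor does not answer that question by accident.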

The mental model

Characterization tests answer 'what does this code do today?' — you mirror behavior on purpose, as a safety harness for change.

Intent tests answer 'what should this code do?' — you derive from the spec, independent of the implementation.

Both are useful. Confusing them — writing intent-shaped tests that secretly just mirror the code — is how you ship a green suite that proves nothing.

Coverage is a gate input, never the goal (Goodhart's Law)

Coverage is useful as a floor — a PR that drops coverage on a critical module is worth a second look. But the moment coverage becomes a target, Goodhart's Law kicks in: "when a measure becomes a target, it ceases to be a good measure." Teams (and agents) chasing a 90% number generate tests that execute lines without asserting anything meaningful — coverage goes up, real safety goes flat or down. AI makes this worse because it can manufacture coverage-padding tests by the dozen in seconds.
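Treating coverage as a floor rather than a target can be expressed as a check that only ever flags drops and never rewards a high number. The sketch below is one possible shape; the `coverage_flags` name, the per-module percentage dictionaries, and the one-point tolerance are assumptions for illustration, not the interface of any coverage tool.

```python
def coverage_flags(
    base: dict[str, float],
    head: dict[str, float],
    critical: set[str],
    max_drop: float = 1.0,
) -> list[str]:
    """Return critical modules whose coverage fell by more than max_drop points.

    base and head map module path to line-coverage percentage before and
    after the PR. A module missing from either side counts as 0.0.
    """
    flagged = []
    for module in sorted(critical):
        before = base.get(module, 0.0)
        after = head.get(module, 0.0)
        if before - after > max_drop:
            flagged.append(module)
    return flagged


if __name__ == "__main__":
    base = {"payments/ledger.py": 88.0, "site/banner.py": 70.0}
    head = {"payments/ledger.py": 81.5, "site/banner.py": 40.0}
    # Only the critical module is flagged, and the result is a prompt for
    # a human second look, not a score to maximize.
    print(coverage_flags(base, head, critical={"payments/ledger.py"}))
```

The function has no notion of "good enough" coverage on purpose. Nothing in it can be satisfied by adding tests that execute lines without asserting anything.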

Say it like this

"Coverage tells you what code ran, never whether it's correct. We gate on coverage as a floor and review tests for whether they assert intent. An agent can take any number to 90% — that's exactly why the number can't be the goal."

Self-check

Q: Coverage on a team's repo jumps from 60% to 92% the month after they adopt an AI agent, but production incidents don't drop. What's the most likely explanation?

Cursor in CI and the bright line of autonomy

Cursor's agent surface now reaches into the CI loop. Cloud Agents (shipped in 3.5, May 2026) run in isolated cloud VMs with terminal and browser access, can work across multiple repos in parallel, and report back asynchronously — which means an agent can be triggered by a failing pipeline, investigate, and propose a fix without a human babysitting the terminal. The CLI gained /debug in 3.1. This is powerful, and it's exactly where the governance question gets sharp.

The bright line (propose vs commit)

There is one distinction that decides everything about how you deploy agents in CI:

AI proposes (default-safe)

Agent opens a PR. A human code-owner reviews and merges.

All the existing gates still apply — checks, AI review, approval.

This is the default. It scales without expanding blast radius.

Failure mode is contained: a bad proposal is a rejected PR, not an incident.

AI commits autonomously (governed exception)

Agent merges or deploys without a human in the loop.

Requires explicit governance: narrow scope, low risk tier, audit logging, kill switch, named owner.

Never the default. Justified case-by-case, reversible, observable.

Failure mode reaches production directly — so it's earned, not assumed.

The principle

AI proposes is default-safe and should be ~99% of your footprint. AI commits autonomously is a governed exception — allowed only on a narrow, low-risk surface, with logging and a kill switch, owned by someone accountable.

The blast radius determines which side of the line a use case sits on. A doc typo fix on a static site can earn autonomy. A schema migration on the payments DB never does.
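The bright line can be written down as a default-deny policy: propose unless every governance condition holds. The sketch below is an assumption-laden illustration; `UseCase`, `Mode`, and `allowed_mode` are hypothetical names, and the risk-tier labels are invented.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    PROPOSE = "propose"        # agent opens a PR, a human code-owner merges
    AUTONOMOUS = "autonomous"  # agent merges or deploys on its own


@dataclass(frozen=True)
class UseCase:
    """Hypothetical description of an agent use case; fields are illustrative."""
    name: str
    risk_tier: str       # "low", "medium", or "high"
    narrow_scope: bool
    audit_logging: bool
    kill_switch: bool
    named_owner: str     # empty string means nobody is accountable


def allowed_mode(use_case: UseCase) -> Mode:
    """Default to PROPOSE; grant AUTONOMOUS only when every condition holds."""
    governed = (
        use_case.risk_tier == "low"
        and use_case.narrow_scope
        and use_case.audit_logging
        and use_case.kill_switch
        and bool(use_case.named_owner)
    )
    return Mode.AUTONOMOUS if governed else Mode.PROPOSE


if __name__ == "__main__":
    typo_fix = UseCase("doc typo fix on static site", "low", True, True, True, "docs-team-lead")
    migration = UseCase("payments schema migration", "high", True, True, True, "payments-lead")
    print(allowed_mode(typo_fix).value)   # autonomous
    print(allowed_mode(migration).value)  # propose
```

Because the conditions are joined with `and`, removing any single control (the kill switch, the owner, the logging) drops the use case back to propose, which mirrors the idea that autonomy is earned and revocable.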

Never fix a pipeline by disabling a check (the cardinal sin)

When CI goes red and the pressure is on, the seductive 'fix' is to disable the failing check, mark the test as skipped, or merge with admin override. This is the one move that is always wrong. The check is the evidence; disabling it doesn't fix the problem, it deletes the proof that a problem exists and silently lowers the bar for everyone after you. An agent under instruction to 'make CI green' will absolutely do this if you let it — which is why agents that touch CI need rules that forbid it, and why a human reviews any change to the pipeline config itself.

  1. Read the failure. Get the actual error and the failing check's logs; don't guess. (The CLI /debug and Cloud Agents are good at exactly this.)
  2. Reproduce locally (or in an isolated agent VM). A failure you can't reproduce, you can't trust you've fixed.
  3. Find the root cause. Is the code wrong, or is the test/check wrong? Both are legitimate, but you must know which, and changing a test to match buggy code is itself the bug.
  4. Fix the cause, not the symptom. Repair the code, or correct the test if the test was genuinely wrong (with justification in the PR). Never disable, skip, or override to go green.
  5. Re-run the full suite and confirm green for the right reason. Land it as a normal PR through the normal gates.
Say it like this

"You never fix a red pipeline by disabling the check. The check isn't the obstacle — it's the evidence. Disabling it doesn't solve the problem, it just deletes the proof and lowers the bar for everyone who merges after you."

The anti-pattern taxonomy

Every AI-in-the-SDLC failure mode rhymes. Learn the taxonomy and you can name what's going wrong in a customer's org in one sentence, then point at the guardrail. For each: the failure, who loses trust, and the control that contains it.

Anti-pattern | Failure → who loses trust | Guardrail
Vibe-merges | Merging on a green-checkmark feeling without reading the diff → reviewers, then prod when it breaks | Required author self-review + code-owner approval; AI review as independent signal, never as the merge trigger
Mega-diffs | One PR touches 80 files mixing refactor + behavior → reviewers can't hold blast radius, they rubber-stamp | Enforce scope: split refactor from behavior; small, coherent PRs; flag oversized diffs in review
Prompt-and-pray | Fire a vague prompt, accept whatever comes back unverified → the author (their name is on it) | Specify intent, read the output, test it. The author owns the agent's work, full stop
Fabricated confidence | Model states a wrong answer fluently (hallucinated API, invented behavior) → whoever trusted the fluent tone | Verify against ground truth: docs, types, a running test. Tone is not evidence
Hidden generated code | Large AI-generated blocks merged with no signal they're generated → future maintainers + auditors | AI-code tracking / attribution; honest PR descriptions; review density scales with how much is generated
Context rot | Long agent session drifts off the original task; later edits contradict earlier ones → the author, silently | Short, scoped sessions; re-ground the agent; review the final diff against intent, not the conversation
Secrets / prompt-injection | Secrets pasted into context, or hostile content in a repo/issue hijacks the agent → security, the whole org | Secret scanning, Privacy Mode/ZDR, terminal sandboxing, least-privilege MCP, treat external content as untrusted
Excessive agent permissions | Agent granted broad repo/tool/prod access 'to be convenient' → security + SRE; blast radius is now huge | Least privilege: model/MCP/repo allowlists, RBAC, terminal sandboxing, isolated VMs, scoped tokens
Volume-as-success / mandated usage | Measuring lines-of-AI-code or mandating usage → leadership credibility, then engineers who game the metric | Measure outcomes (DORA, throughput, defect rate), not AI volume; adoption via mentorship, not mandate

Terms: Privacy Mode is Cursor's setting that routes requests under zero-data-retention terms so providers don't store or train on your code. ZDR (Zero Data Retention) is a contractual guarantee that the model provider won't store your code or train on it. MCP (Model Context Protocol) is a standard that lets an AI agent pull in context from outside the repo, like Jira tickets or internal docs. RBAC (Role-Based Access Control) grants permissions by role rather than configuring each person individually. DORA metrics are four widely used delivery measures: deployment frequency, lead time for changes, change failure rate, and time to restore service.
The two that masquerade as wins

Volume-as-success and mandated usage are dangerous precisely because they look like leadership 'driving adoption.' They produce a metric that goes up while trust goes down — engineers game the number, and the org learns that AI means 'theater you're forced to perform.'

The Box case study is the counter-model: 85%+ daily active and 30–50% throughput gains came from mentorship (+75% usage in 6 weeks via peer enablement), not from a mandate. Pull, not push.
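Measuring outcomes instead of volume means computing delivery metrics from deployment records rather than counting AI-generated lines. The sketch below covers two of the four DORA measures, change failure rate and deployment frequency. The `Deployment` record and both helper functions are hypothetical, and the sample data is invented purely to exercise the code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment record; fields are illustrative."""
    day: date
    caused_failure: bool  # needed a rollback, hotfix, or incident response


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that caused a failure in production."""
    if not deployments:
        return 0.0
    return sum(d.caused_failure for d in deployments) / len(deployments)


def deployments_per_week(deployments: list[Deployment]) -> float:
    """Average deployments per week over the span from first to last deployment."""
    if not deployments:
        return 0.0
    days = [d.day for d in deployments]
    span_days = (max(days) - min(days)).days + 1
    return len(deployments) / (span_days / 7)


if __name__ == "__main__":
    history = [  # invented sample data
        Deployment(date(2026, 1, 1), False),
        Deployment(date(2026, 1, 5), True),
        Deployment(date(2026, 1, 9), False),
        Deployment(date(2026, 1, 14), False),
    ]
    print(change_failure_rate(history))   # 0.25
    print(deployments_per_week(history))  # 2.0
```

Neither number mentions AI at all, which is the point: if adoption is working, it shows up here, and if it only shows up in a usage dashboard, it is theater.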

How they cluster (three root causes)

Process collapse: vibe-merges, mega-diffs, prompt-and-pray
Epistemics: fabricated confidence, hidden generated code, context rot
Security & incentives: secrets/injection, excessive permissions, volume-as-success

Notice the through-line: nearly every guardrail in the right column is something we already covered — the well-formed PR, the layered review with a non-collapsible human gate, intent-asserting tests, least privilege, and outcome metrics over volume. The anti-patterns aren't exotic; they're what you get when you skip the disciplines. Your job in the field is to spot which discipline a team dropped and restore it.

Interview framing

When a customer describes a mess, resist solutioning immediately. Name the anti-pattern first ('that's classic vibe-merging' / 'you've made coverage a target'), say who's losing trust, then prescribe the single guardrail. That sequence — name → trust impact → control — is what separates a field engineer from a feature-lister.

Strongest closing line: 'AI didn't break your pipeline. Skipping a discipline did. We put the discipline back — the PR stays the unit of trust.'