Guardrail Tests: Refactor Legacy Code Safely
Guardrail tests pin the current behavior of legacy code before you refactor. You identify how the code behaves today, write tests that capture that exact input and output, confirm each one fails before it passes, then make the minimal change while the tests stay green. The suite becomes the spec, so the agent cannot silently alter behavior.
On this page
- What are guardrail tests for a refactor?
- Why write tests before refactoring instead of after?
- How do I plan guardrail tests for a large codebase?
- Why confirm each guardrail test fails before it passes?
- How do I stop the agent from generating silently-passing tests?
- How do guardrail tests act as the spec for the refactor?
- Which model should plan the work and which should build it?
- How do I close the loop into Confluence and Jira?
- What are the limits of this approach?
What are guardrail tests for a refactor?
Guardrail tests pin the current behavior of a piece of code so a refactor cannot change it without you noticing. You identify the current state, write tests that capture that state, then make the minimal code change while keeping the same inputs and outputs. They are sometimes called characterization tests: they describe how the code behaves now, not how you wish it behaved.
This matters most on large, idiosyncratic repos where no one person holds the whole behavior in their head. A coding agent has a bias for action and will happily rewrite a module. Without a behavioral net under it, you have no proof the new code does what the old code did. The tests are that net.
As the workshop put it: identify the current state, write tests for that current state, and then do the minimal code changes by maintaining the output of the current state.
A legacy Java path migrated to Go should take the same inputs and return the same outputs. Speed and flexibility improve; behavior does not move.
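To make the idea concrete, here is a minimal sketch of a characterization test in Python. The function `legacy_format_duration` and the pinned values are hypothetical stand-ins invented for illustration; in practice the pairs would be recorded from whatever the real code returns today, quirks included.

```python
def legacy_format_duration(seconds):
    """Hypothetical stand-in for a legacy function; not from any real codebase."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs:02d}s"


# Input/output pairs observed from the current code, quirks included.
# These values are illustrative assumptions for this sketch.
PINNED_CASES = [
    (0, "0m00s"),   # zero still prints a minutes field
    (59, "0m59s"),
    (60, "1m00s"),
    (125, "2m05s"),
]


def test_format_duration_is_pinned():
    """Fail loudly if any pinned input stops producing its recorded output."""
    for seconds, expected in PINNED_CASES:
        actual = legacy_format_duration(seconds)
        assert actual == expected, f"{seconds!r}: expected {expected!r}, got {actual!r}"
```

The test describes what the code does, not what it should do. If the `0m` prefix looks like a bug, it still gets pinned; changing it is a separate, deliberate decision.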
Why write tests before refactoring instead of after?
Tests written after a refactor only prove the new code is self-consistent. They cannot tell you whether you preserved the old behavior, because the old behavior is already gone. Tests written first capture the behavior while it still exists, so any drift the refactor introduces shows up as a red test instead of a production incident.
Writing guardrails first turns behavior drift into a failing test. Writing them after only checks the rewrite against itself.
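One way to picture the difference is to record outputs while the old code still exists and then replay them against the rewrite. The sketch below assumes two hypothetical functions, `legacy_slug` and `rewritten_slug`; neither comes from the workshop, and the helper names are illustrative.

```python
def legacy_slug(title):
    """Hypothetical old implementation: every space becomes a hyphen."""
    return title.lower().replace(" ", "-")


def rewritten_slug(title):
    """Hypothetical rewrite that quietly collapses repeated spaces."""
    return "-".join(title.lower().split())


def record_golden(legacy_fn, inputs):
    """Capture what the legacy function returns today, before it is touched."""
    return [(value, legacy_fn(value)) for value in inputs]


def find_drift(new_fn, golden):
    """Return (input, expected, actual) for every recorded case the rewrite changes."""
    drift = []
    for value, expected in golden:
        actual = new_fn(value)
        if actual != expected:
            drift.append((value, expected, actual))
    return drift


golden = record_golden(legacy_slug, ["Hello World", "a  b"])
drift = find_drift(rewritten_slug, golden)
```

Here the rewrite looks cleaner, but the recorded pair for `"a  b"` exposes the change. Tests written only against `rewritten_slug` would have accepted `"a-b"` as correct.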
How do I plan guardrail tests for a large codebase?
Start in plan mode, not in agent mode. Plan mode makes no edits. It runs the same search-and-retrieval over your repo but produces a Markdown plan that you and the agent review and refine before any code is written. You can hand it a vague, half-formed prompt and use the planning pass to check your assumptions and tighten the scope. On a big repo this is where the real work happens.
In the workshop the target was Grafana - an open-source codebase that has been around for about ten years with thousands of contributors. The presenter ran one plan-mode prompt: find all the FIXME and TODO comments, rank the top five most critical, write a testing plan that pins current behavior, then migrate to the new functionality while making sure nothing else breaks. Plan mode came back with a Markdown plan listing the items as P0 through P2, each with a problem statement and a per-item unit-test plan.
A plan-mode pass turns a vague prompt into a ranked, test-anchored migration plan before any code is touched.
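The inventory half of that prompt can be approximated by hand, which is useful for sanity-checking what the plan claims to have found. The sketch below is an assumption-laden illustration: the file name is made up, and the ranking rule (FIXME before TODO) is a crude placeholder for the judgment the agent applies, not how plan mode actually ranks items.

```python
import re

# Matches a FIXME or TODO marker and captures the note that follows it.
MARKER = re.compile(r"\b(FIXME|TODO)\b[:\s]*(.*)")


def find_markers(path, text):
    """Return (path, line number, marker, note) for each marker in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = MARKER.search(line)
        if match:
            hits.append((path, lineno, match.group(1), match.group(2).strip()))
    return hits


def rank(hits):
    """Placeholder ranking: FIXME first, then by file and line."""
    return sorted(hits, key=lambda hit: (hit[2] != "FIXME", hit[0], hit[1]))


# Hypothetical file contents, for illustration only.
sample = "x := 1 // TODO: remove after migration\n// FIXME handle nil\ny := 2\n"
ranked = rank(find_markers("example.go", sample))
```

A list like this gives you something to compare against the generated plan: if the plan's top five omits a marker you know is critical, that is a prompt to refine the plan before building.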
Why confirm each guardrail test fails before it passes?
A test that has never failed proves nothing. If a generated test asserts the wrong thing - or asserts nothing at all - it will pass green and give you false confidence. The discipline is old and still applies: never trust an automated test you did not see fail. Make each guardrail fail for the reason you expect, then make it pass.
As one attendee warned: never trust an automated test that you didn't see fail. You write tests based on existing code. This is dangerous - you have no proof.
A real report from the session: Cursor sometimes generates tests that log a failure with something like reporter.log instead of calling the framework's fail() method, or wraps validation in if / try-catch blocks that swallow the assertion. The test then passes while proving nothing. You only catch this by seeing the test go red first and by reading the generated assertions yourself.
The fix is engineering discipline, not a setting. Determine the golden state - the happy-path input and the output it should produce - explicitly. Feed that input/output pair to the agent as the anchor for the test, or codify it in the plan and have the agent adhere to it. Then inspect the generated test code. Cursor is an accelerant to your process here, not a replacement for verifying what is actually under test.
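The silent-pass pattern described above translates into any language. The following Python sketch contrasts a test that logs and swallows with one that asserts, then checks each against a deliberately broken implementation. All names here (`broken_add`, `silent_test`, `honest_test`, `can_fail`) are hypothetical and exist only for this illustration.

```python
import logging


def broken_add(a, b):
    """Deliberately wrong implementation, used only to prove a test can fail."""
    return a - b


def silent_test(add):
    """Anti-pattern: logs the mismatch and swallows errors, so it never fails."""
    try:
        if add(2, 3) != 5:
            logging.warning("add(2, 3) did not return 5")
    except Exception:
        pass


def honest_test(add):
    """The assertion is unconditional and nothing catches it."""
    assert add(2, 3) == 5, "add(2, 3) should return 5"


def can_fail(test, broken_impl):
    """Return True if the test goes red against a known-bad implementation."""
    try:
        test(broken_impl)
    except AssertionError:
        return True
    return False
```

`can_fail(honest_test, broken_add)` returns `True`, while `can_fail(silent_test, broken_add)` returns `False`: the second test stays green even though the code is wrong. Running each generated guardrail against a known-bad variant is one way to see it fail for the expected reason before you trust it.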
How do I stop the agent from generating silently-passing tests?
Four remedies came out of the workshop, all aimed at keeping the agent honest about failures. None of them is a magic flag; they are about how you scope the task and verify the output.
- Keep the rule file lean and specific. A long rule competes with everything else for the agent's attention; a short one aimed at exactly this problem is more likely to be followed.
- Keep the context window as empty as possible for test work. In a long conversation the rule context can get overwritten as the window fills, so the agent quietly stops obeying it. Open a fresh agent that only runs and writes tests.
- Add a validator step - a skill or a sub-agent (a child agent that a main agent spawns to work in parallel with its own context window, handing results back so the parent's context stays clean) that runs the test, reads the output, and reruns to confirm a real failure surfaces before the test is accepted.
- For a complex suite, isolate test-running in its own agent rather than mixing it with feature work, so nothing crowds out the verification step.
If you have had a very long chat, the rule context can get overwritten just from the window filling up. That is often why a rule the agent followed earlier gets dropped later. For guardrail tests, start clean and do one thing.
How do guardrail tests act as the spec for the refactor?
Once the guardrails are green against the old code, they become the contract the new code must satisfy. This is test-driven development adapted to agents: agents work well when the requirements are pinned up front, so a skeleton of passing behavioral tests is an ideal guiding light for the implementation that follows.
1. Pin the behavior: write guardrail tests that capture current inputs and outputs, and confirm each fails before it passes.
2. Make the minimal change: refactor or migrate the smallest unit that moves your goal - e.g. swap a slow Java path for Go - keeping inputs and outputs identical.
3. Verify before moving on: re-run the guardrail suite. Only advance to the next item once the current tests are still green. The plan drives the order.
4. Iterate item by item: walk the ranked P0-P2 list, repeating pin-change-verify so behavior never drifts more than one small step at a time.
Because the suite gates every step, the agent is constrained to changes that keep the tests passing. That is what stops it from silently rewriting behavior on a codebase too large to eyeball.
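The pin-change-verify loop can be sketched as a small driver that applies one plan item, re-runs the guardrails, and reverts if anything turns red. Everything below is a hypothetical illustration: the swappable `impl` slot, the step names, and the `migrate` helper are assumptions, not part of Cursor or the workshop demo.

```python
# Hypothetical code under refactor: one function held in a swappable slot.
impl = {"slug": lambda title: title.lower().replace(" ", "-")}


def test_slug_is_pinned():
    """Guardrail: current outputs, including the double-hyphen quirk."""
    assert impl["slug"]("Hello World") == "hello-world"
    assert impl["slug"]("a  b") == "a--b"


def make_step(name, new_fn):
    """Build one plan item as (name, apply, revert)."""
    previous = []

    def apply():
        previous.append(impl["slug"])
        impl["slug"] = new_fn

    def revert():
        impl["slug"] = previous.pop()

    return name, apply, revert


def run_guardrails(tests):
    """Run every guardrail and return the names of the ones that fail."""
    failed = []
    for test in tests:
        try:
            test()
        except AssertionError:
            failed.append(test.__name__)
    return failed


def migrate(plan, tests):
    """Apply items in order; revert and stop at the first one that goes red."""
    if run_guardrails(tests):
        raise RuntimeError("guardrails must be green before the first change")
    done = []
    for name, apply, revert in plan:
        apply()
        failed = run_guardrails(tests)
        if failed:
            revert()
            return done, (name, failed)
        done.append(name)
    return done, None


plan = [
    make_step("P0: rewrite with split/join", lambda t: "-".join(t.lower().split(" "))),
    make_step("P1: collapse whitespace", lambda t: "-".join(t.lower().split())),
]
done, blocked = migrate(plan, [test_slug_is_pinned])
```

In this run the first item preserves behavior and is kept, while the second changes the output for `"a  b"`, fails the guardrail, and is reverted. The suite, not the agent's confidence, decides whether a step counts as done.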
Which model should plan the work and which should build it?
Split the job. Use a larger, more capable reasoning model to produce the high-level plan, then a leaner, faster in-house model to do the building. In the demo the presenter planned with a frontier Opus-class model and built with one of Cursor's in-house Composer models, which are fast thinking models tuned for execution and priced well below frontier models. After plan mode you pick the build model and click Build to move into the writing phase.
In the presenter's words: I like to leverage a larger intelligent model to come up with the high-level plan and I like to leverage the leaner faster model to do the building.
The frontier model is worth its cost on the nebulous planning step; Composer is fast and capable enough to carry out a plan that is already well specified. Auto mode, a router that reads your prompt and picks a model for you (defaulting to Composer, and steerable with cues like "quickly" or "carefully"), can route this for you if you would rather not pick per phase.
The workshop named specific point releases for these models. Cursor ships new model versions often, so check the in-app model picker and cursor.com/docs for the current Composer and frontier options rather than pinning a version from any guide, including this one.
How do I close the loop into Confluence and Jira?
Guardrail work is also a reporting and ticketing job. MCP (Model Context Protocol) is a standard that lets an AI agent pull in context from outside the repo, like Jira tickets or internal docs. With the Atlassian MCP connected, the agent can push the migration plan into a Confluence doc and file Jira stories straight off the plan. In the workshop the presenter appended a final task to the plan - update the Confluence doc with the migration details - and separately asked the agent to turn the TODO comments in the alerting module into Jira stories.
The recommended order is plan first, then report and ticket. Run plan mode, refine the plan, then let the last task fan it out: a Confluence page for the team and a set of Jira tickets to distribute the work. Anything you could do by hand in Atlassian, you can outline to the agent and wrap as a repeatable skill.
In the session the Atlassian MCP hung partway through creating tickets. It is a real, useful integration, but treat ticket creation as best-effort: confirm the tickets actually landed rather than assuming the run finished cleanly.
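Confirming that tickets landed amounts to reconciling the plan against what the tracker reports back. The sketch below deliberately avoids any real Jira or MCP call: `fetch_existing_titles` is a hypothetical callable you would supply yourself, and the titles are made-up data simulating a run that hung after two of three tickets.

```python
def missing_tickets(planned_titles, fetch_existing_titles):
    """Return planned titles that the tracker does not report back.

    fetch_existing_titles is a hypothetical callable supplied by the caller;
    it should return the titles currently present in the tracker.
    """
    existing = {title.strip().lower() for title in fetch_existing_titles()}
    return [t for t in planned_titles if t.strip().lower() not in existing]


# Illustrative data only: pretend the run stopped before the last ticket.
planned = [
    "Pin behavior of rule evaluation",
    "Migrate rule evaluation",
    "Update migration doc",
]


def fake_tracker():
    """Stand-in for a real lookup; returns what was actually created."""
    return ["pin behavior of rule evaluation", "Migrate rule evaluation "]


not_created = missing_tickets(planned, fake_tracker)
```

Matching on normalized titles is a simplification. If the tracker rewrites titles or you file duplicates, a stable identifier written into each ticket would be a sturdier key.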
What are the limits of this approach?
Guardrail tests reduce risk; they do not remove judgment. A few honest limits to hold onto.
- Guardrails only protect behavior you actually captured. Untested paths can still change silently, so coverage of the critical behavior is the real safety boundary - not the green checkmark.
- You still have to define the golden state. The agent will not decide what 'correct' means for you; pinning the happy-path input and output is human work.
- Generated tests need a human read. Inspect the assertions for the silent-fail patterns above before you trust the suite as a spec.
- MCP-driven reporting and ticketing is convenient but not guaranteed. Verify Confluence pages and Jira tickets landed; integrations like Atlassian can be unreliable mid-run.
The throughline: Cursor accelerates a sound engineering process. It does not replace the discipline of knowing what behavior you are protecting and proving your tests can fail.
Frequently asked questions
What is a guardrail test?
A guardrail test pins the current behavior of code so a refactor cannot change it unnoticed. You capture today's inputs and outputs as tests, confirm each fails before it passes, then refactor while the suite stays green. It is a form of characterization testing aimed at safe migration.
Why write guardrail tests before refactoring rather than after?
Tests written first capture the old behavior while it still exists, so any drift the refactor introduces shows up as a red test. Tests written after only check the rewrite against itself and cannot prove you preserved the original behavior.
How do I plan guardrail tests for a huge legacy codebase?
Use plan mode. Have the agent mine FIXME and TODO comments, rank the most critical, then generate a Markdown plan with per-item test plans. Refine the plan, confirm each test fails first, then migrate minimally item by item, re-running the suite between steps.
Why must I see a test fail before trusting it?
A test that never failed proves nothing - it may assert nothing or assert the wrong thing and still pass green. Seeing it fail for the expected reason confirms it actually exercises the behavior. AI-generated tests can silently pass by logging instead of failing or by swallowing assertions in try-catch.
How do I stop Cursor from generating tests that pass silently?
Keep the rule file lean and specific, keep the context window nearly empty so the rule is not overwritten, add a validator skill or sub-agent that confirms a real failure surfaces, and run tests in a fresh isolated agent. Always read the generated assertions yourself.
Which model should plan the refactor and which should build it?
Use a larger frontier reasoning model for the high-level plan and a leaner, faster in-house Composer model to build it. Plan in plan mode, pick the build model, then click Build. Check the in-app model picker for current versions, since Cursor updates models often.
Sources & last verified
Cursor ships frequently. Facts verified against primary sources on June 25, 2026.