How to Measure AI Coding Productivity and ROI
Measure AI coding productivity with two pillars at once: velocity and quality. Velocity runs from PR throughput (easy, far from the outcome) to feature-completion (hard, closest to it). Quality means customer-facing defects stay flat while internal health improves. Read both against one number - time-to-market.
How should you measure AI coding productivity?
Hold two pillars side by side: velocity and quality. Velocity on its own rewards shipping faster even when the code rots. Quality on its own rewards safety even when nothing ships. Track them together and the picture stays honest.
Velocity climbs from easy-but-distant (PR throughput) to hard-but-decisive (feature completion). Quality splits into what customers feel and what the codebase carries.
Pillar one: velocity
Velocity has three layers, and they trade ease of measurement against closeness to the outcome you actually care about.
- PR velocity
- Easiest to measure and the furthest from the outcome. Still worth recommending as a starting signal - just don't mistake it for value delivered.
- Story-point velocity
- Business value as read through Jira: how much planned work clears per cycle. Useful where teams estimate consistently, but org-dependent, since point scales vary.
- Feature-completion velocity
- Hardest to measure and closest to the outcome: how fast the team moves through the roadmap and gets features in front of customers.
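As a starting signal, PR velocity can be computed from nothing more than merge dates. The sketch below is a minimal illustration, assuming you have already exported merge dates from your git host; the function name `weekly_pr_throughput` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def weekly_pr_throughput(merge_dates):
    """Count merged PRs per ISO week, keyed as (iso_year, iso_week)."""
    counts = Counter()
    for merged_on in merge_dates:
        iso = merged_on.isocalendar()  # (year, week, weekday)
        counts[(iso[0], iso[1])] += 1
    # Sort by week so the trend reads chronologically.
    return dict(sorted(counts.items()))


# Hypothetical merge dates, for illustration only.
merged = [date(2025, 3, 3), date(2025, 3, 4), date(2025, 3, 7), date(2025, 3, 10)]
print(weekly_pr_throughput(merged))
```

A rising weekly count says more PRs merged, not that more value shipped, so it belongs next to the feature-completion and quality signals rather than in place of them.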
Pillar two: quality
Quality has an external face and an internal one. Watch both, because more code from an agent can quietly trade one for the other.
Customer-facing defects. The bar is that they stay flat or fall as you ship faster. Rising velocity with rising defects is not a win - it is borrowed time.
Test coverage, code quality, extensibility and maintainability. Expect incremental gains here, not an overnight jump. These move slowly because they are about the shape of the codebase, not today's output.
Why are lines of code and active users weak metrics?
Both feel measurable, and both mislead. More AI-written code is not better code, and it is not faster delivery - it is just more code to review, test and maintain.
"Active users" hides depth. Someone who opens the agent once a week counts the same as someone who has rebuilt their workflow around it, even though the first person has barely integrated the tool at all. A headcount of logins tells you reach, never adoption.
Start from customer and business value, then ask which signals predict it. Lines of code and seat counts predict almost nothing. Time through the roadmap and defect trends predict a lot.
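To make the gap between reach and adoption concrete, here is a small sketch that counts both from per-user activity. It assumes you can obtain the number of days each person used the agent in a given week; the `deep_threshold` of four days is an arbitrary, hypothetical cut-off, not a figure from Cursor.

```python
def reach_and_depth(active_days_by_user, deep_threshold=4):
    """Return (reach, depth) for one week of usage.

    reach: users who opened the tool at least once.
    depth: users active on at least `deep_threshold` days (an assumed cut-off).
    """
    reach = sum(1 for days in active_days_by_user.values() if days > 0)
    depth = sum(1 for days in active_days_by_user.values() if days >= deep_threshold)
    return reach, depth


# Hypothetical usage data: days active in the week, per user.
usage = {"ana": 1, "ben": 5, "chi": 1, "dev": 0, "eli": 4}
reach, depth = reach_and_depth(usage)
print(reach, depth)  # 4 2
```

Four "active users" here contain only two people who use the tool most days, which is the difference a login headcount hides.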
What is code half-life?
Code half-life is how long a piece of code survives before it has to change. Short-lived code that gets rewritten within days suggests the first pass missed the mark or the design could not hold weight.
Read it as a maintainability signal. Code that stays stable for a long time is usually code that fit the problem; code that churns constantly is a flag worth investigating, whoever or whatever wrote it.
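One way to put a number on this, offered as an assumption rather than a standard definition, is the age at which half of a cohort of lines has been changed. The sketch below takes a hypothetical list with one entry per line: the number of days until the line was first modified, or `None` if it is still untouched. It assumes every line in the cohort has been observed for the same window.

```python
import math


def code_half_life(days_until_changed):
    """Days until half of a cohort of lines has been modified.

    Each entry is the days a line survived before its first change,
    or None if it has not changed yet. Returns None when fewer than
    half of the lines have changed, i.e. the half-life is longer than
    the observation window.
    """
    total = len(days_until_changed)
    if total == 0:
        return None
    changed = sorted(d for d in days_until_changed if d is not None)
    needed = math.ceil(total / 2)  # lines that must change to reach "half"
    if len(changed) < needed:
        return None
    return changed[needed - 1]


# Hypothetical cohorts, for illustration only.
print(code_half_life([2, 3, 5, 40, None, None]))  # 5
print(code_half_life([90, None, None, None]))     # None
```

A half-life of a few days matches the short-lived pattern described above; `None` means most of the cohort is still standing.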
What does Cursor show you? Conversation Insights, Cursor Blame, the Analytics API
Cursor ships its own measurement surface so you do not have to infer everything from git. Four pieces do most of the work.
- Conversation Insights
- Passively categorises work as new feature, bug fix or refactor. It can flag under-specified agent turns that would have gone better through Plan mode first. Plan mode makes no edits: it researches the codebase and produces an editable plan you review before any code changes.
- Cursor Blame
- Augments git blame with line-level human and agent co-authorship - an AI-code-tracking API that tells you which lines an agent actually wrote.
- Dashboard
- AI share of committed code, agent edits, Tab completions and active users across Agent, Bugbot, cloud and CLI. Bugbot is Cursor's automated PR reviewer, which posts inline findings and can push fix commits from isolated VMs.
- Audit log + Analytics API
- An exportable audit log plus a read-only Analytics API, so you can pull the numbers into your own reporting instead of screenshotting a dashboard.
The point of Cursor Blame is attribution. Once you know which lines came from an agent, you can ask whether agent-authored code carries more defects, churns faster or survives just as long as human-authored code - rather than guessing.
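Once attribution is available, the comparison itself is simple arithmetic. The sketch below assumes you have already joined attribution data with your own churn data into `(author, was_rewritten)` pairs; that record shape, the `"agent"` and `"human"` labels and the function name are hypothetical, and do not reflect the format Cursor Blame or the Analytics API actually returns.

```python
from collections import defaultdict


def churn_rate_by_author(lines):
    """Share of lines rewritten, grouped by who wrote them.

    `lines` is an iterable of (author, was_rewritten) pairs, where
    author is a label such as "agent" or "human" (assumed labels).
    """
    totals = defaultdict(int)
    churned = defaultdict(int)
    for author, was_rewritten in lines:
        totals[author] += 1
        if was_rewritten:
            churned[author] += 1
    return {author: churned[author] / totals[author] for author in totals}


# Hypothetical sample, for illustration only.
sample = [
    ("agent", True), ("agent", False), ("agent", False), ("agent", False),
    ("human", True), ("human", False),
]
print(churn_rate_by_author(sample))  # {'agent': 0.25, 'human': 0.5}
```

The same grouping works for defects or survival time: swap the boolean for whichever outcome you are comparing.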
What does the research say?
A University of Chicago study, published in November, looked at 1,000 organisations adopting Cursor and found a 39% increase in org-level output.
Teams merged 39% more pull requests after Agent became the default mode. More-experienced developers were more likely to accept agent-generated code - plausibly because they plan more up front, so what comes back is closer to what they wanted.
Take the 39% as one well-sourced data point, not a guarantee for your team. It is an org-level output figure tied to Agent becoming the default, measured across a large sample - which is exactly why it is worth citing rather than inflating.
What's the one metric that matters?
Time-to-market acceleration. Are features reaching customers faster than they did before? If the answer is yes and defects are not climbing, the tooling is paying for itself. If you can only watch one number, watch this one.
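The yes-and-defects-flat test can be sketched as a before-and-after comparison. This is a minimal illustration, assuming you track lead time per feature (days from start of work to customer release) and count customer-facing defects over comparable periods; the function name, the use of the median and all sample numbers are assumptions.

```python
from statistics import median


def time_to_market_check(before_days, after_days, defects_before, defects_after):
    """Compare median feature lead time and defect counts across two periods."""
    before = median(before_days)
    after = median(after_days)
    defects_ok = defects_after <= defects_before
    return {
        "median_before": before,
        "median_after": after,
        # Fraction of lead time removed; negative means features got slower.
        "speedup": (before - after) / before,
        "defects_flat_or_down": defects_ok,
        "paying_off": after < before and defects_ok,
    }


# Hypothetical lead times (days) and defect counts, for illustration only.
result = time_to_market_check([30, 42, 36], [24, 27, 33], defects_before=12, defects_after=11)
print(result["speedup"], result["paying_off"])  # 0.25 True
```

The median keeps one unusually slow feature from dominating the comparison; with only a handful of features per period, treat the result as a direction rather than a measurement.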
Everything else can become noise. PR counts, completion rates and acceptance ratios are useful inputs, but they are means to the end of shipping value sooner, and it is easy to optimise them while the roadmap stalls.
Generation is the easy part now. Review, quality and security are the new constraints - which is why a velocity number with no quality counterweight is the wrong thing to celebrate.
Frequently asked questions
Is PR velocity a good metric?
It is the easiest velocity signal to measure and worth tracking as a starting point, but it sits furthest from the outcome you care about. More merged PRs is not the same as more value shipped - pair it with feature-completion velocity and quality.
What is Cursor Blame?
Cursor Blame augments git blame with line-level human and agent co-authorship. It is an AI-code-tracking API that shows which lines an agent wrote, so you can compare agent-authored and human-authored code for defects, churn and survival.
Does AI improve test coverage immediately?
No. Internal quality - test coverage, code quality, extensibility, maintainability - improves incrementally, not overnight. Expect gradual gains and measure the trend rather than expecting a step change.
What is the single most important metric for AI coding ROI?
Time-to-market acceleration: whether features reach customers faster without defects rising. Once generation is cheap, the bottleneck shifts to review, quality and security, so that one outcome metric outranks the rest.
Sources & last verified
Cursor ships frequently. Facts verified against primary sources on June 25, 2026.