How Cursor Trains Composer 2
Composer 2 is Cursor's in-house coding model. It starts from an open base model, gets continued pre-training on code, then long-horizon reinforcement learning on real coding tasks run in copies of real repos. Cursor grades it on CursorBench, an internal benchmark built from real engineer queries.
What is Composer 2?
Composer 2 is the coding model Cursor trained in-house to run inside Cursor. At release it scored about level with Opus 4.6 and slightly behind GPT-5.4 on coding, while serving much faster and costing far less. Composer 2 Fast generates at roughly 200 tokens per second, and the model is priced about an order of magnitude below Opus 4.6.
It is a specialist. Cursor did not build it to win at legal reasoning, financial analysis or general knowledge. The bet is narrower: be excellent at software engineering, and serve quickly because the model carries fewer parameters than a general-purpose one.
Fewer parameters is the whole point. A model that only has to be good at coding can be smaller, which is why Composer 2 Fast answers at conversational speed while a larger general model is still thinking. The trade is deliberate, not a shortcut.
Open base model -> continued pre-training on code -> long-horizon RL on real tasks -> graded on CursorBench.
What base model is Composer 2 built on?
Composer 2 starts from Kimi K2.5, an open model. Cursor tested many open base models and several were strong; K2.5 won mostly on infrastructure fit, meaning how cleanly it slots into the serving stack Cursor already runs.
- Total parameters: 1T
- Activated per token: 32B
- Layers: 61
- Context window: 256K
- Attention: Multi-head latent attention (cheap to serve)
K2.5 is a mixture-of-experts model: 1T parameters exist, but only 32B activate on any given token, which keeps serving cost down.
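To make the 1T versus 32B figures concrete, here is a minimal sketch of the arithmetic, plus a toy top-k router of the kind mixture-of-experts layers commonly use. The routing function, the four-expert setup and the logits are illustrative assumptions; the source does not describe K2.5's actual router or expert count.

```python
import math

TOTAL_PARAMS = 1_000_000_000_000   # 1T total parameters (figure from the text)
ACTIVE_PARAMS = 32_000_000_000     # 32B activated per token (figure from the text)

def active_fraction(total: int, active: int) -> float:
    """Share of the model's parameters that run for a single token."""
    return active / total

def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Toy router (assumed design): keep the k highest-scoring experts
    and softmax-normalize their weights so they sum to 1."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active per token: {fraction:.1%}")  # prints "Active per token: 3.2%"

# Hypothetical example: 4 experts, 2 selected for this token.
route = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print([index for index, _ in route])  # prints [1, 3]
```

Only the selected experts run for a given token, which is why the per-token compute tracks the 32B figure rather than the 1T one.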
Multi-head latent attention matters for the same reason the parameter count does. It is efficient to serve, so the base already leans toward the speed Composer 2 is built around.
What is continued pre-training?
Before any reinforcement learning, Cursor keeps training the base model on code. This runs in stages: short-context pre-training over a large pile of tokens, a long-context extension out to 256K, then supervised fine-tuning on agent-like data that resembles how the model will actually work in Cursor.
1. Short-context pre-training over many tokens to raise code-domain knowledge.
2. Long-context extension to 256K so the model can hold a real repo and a long task in view.
3. Supervised fine-tuning on agent-like traces (tool calls, edits, multi-step work).
This step targets code knowledge specifically rather than general ability. Ablations, where Cursor trains a version with the step removed and compares, showed a real effect on the final model. The continued pre-training is doing work, not decoration.
How does the reinforcement learning work?
The reinforcement learning loop teaches Composer 2 to finish real engineering work. Cursor collects actual coding problems (features, debugging, migrations and documentation), then runs many attempts per problem inside simulated copies of real repositories. Successful attempts get reinforced; failures push the model away.
A single rollout can reach 200,000 tokens and hundreds of tool calls. The model is not answering a question; it is doing a job: reading files, editing, running tests, reacting to what broke.
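One common way to turn "many attempts per problem" into a training signal is to score each attempt against the average of its own group. The sketch below assumes that scheme; the source does not name the algorithm Cursor uses, and the function name and reward values are hypothetical.

```python
from statistics import mean

def group_advantages(rewards: list[float]) -> list[float]:
    """Assumed scheme: subtract the group's mean reward from each attempt.
    Positive values mean "reinforce this rollout", negative mean "push away"."""
    baseline = mean(rewards)
    return [reward - baseline for reward in rewards]

# Hypothetical rewards for four attempts at one problem (1.0 = tests passed).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)  # prints [0.5, -0.5, -0.5, 0.5]
```

If every attempt in a group succeeds or every attempt fails, all advantages are zero under this scheme, so the problem contributes no gradient.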
Auto Install builds the environment first
A rollout only means something if the repo it runs in actually works. Before RL begins, an "Auto Install" pass, run by the prior model Composer 1.5, sets up each environment in two stages.
1. Explore the repo, propose about 10 install commands, and write tests that prove the project is functional.
2. Run the install, mock what cannot run for real, and retry until the tests pass.
Only then does the new model train against the repo. Composer 1.5 prepares the ground that Composer 2 learns on.
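The second Auto Install stage is, in outline, a retry loop gated on tests. The sketch below shows that control flow only. Every name in it (`auto_install`, `run_install`, `run_tests`, `revise`, `max_attempts`) is hypothetical, and the fake environment stands in for a real repo; none of it reflects Cursor's actual tooling.

```python
from typing import Callable, List, Tuple

def auto_install(
    commands: List[str],
    run_install: Callable[[List[str]], None],
    run_tests: Callable[[], Tuple[bool, str]],
    revise: Callable[[List[str], str], List[str]],
    max_attempts: int = 5,
) -> int:
    """Run install commands, check the tests, and revise until they pass.
    Returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        run_install(commands)
        passed, failure = run_tests()
        if passed:
            return attempt
        # Feed the failure back so the next attempt can differ.
        commands = revise(commands, failure)
    raise RuntimeError(f"environment still failing after {max_attempts} attempts")

# Fake environment for illustration: tests pass once "install-deps" has run.
installed: List[str] = []

def fake_run_install(commands: List[str]) -> None:
    installed[:] = commands

def fake_run_tests() -> Tuple[bool, str]:
    if "install-deps" in installed:
        return True, ""
    return False, "missing dependencies"

def fake_revise(commands: List[str], failure: str) -> List[str]:
    return commands + ["install-deps"]

attempts = auto_install(["configure"], fake_run_install, fake_run_tests, fake_revise)
print(attempts)  # prints 2
```

Gating on tests rather than on exit codes matches the text's requirement that the build "verifiably works" before any rollout uses it.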
How does it solve very long tasks?
Two mechanics let Composer 2 run long without wasting effort: a nonlinear length penalty and self-summarization.
The length penalty
A nonlinear penalty on length keeps easy problems efficient while still allowing hard ones to run long. A trivial fix that rambles for thousands of tokens gets discouraged; a genuine migration that needs the room is not punished for taking it. The shape is the point, since a flat penalty would either bloat the easy cases or starve the hard ones.
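The source does not give the penalty's formula, so the following is only one possible shape: a logarithmic penalty, chosen as an assumption because extra tokens cost a lot on short answers and very little on long ones. The function names, `scale` and `weight` are all hypothetical.

```python
import math

def length_penalty(tokens: int, scale: float = 1_000.0, weight: float = 0.1) -> float:
    """Assumed concave penalty: grows with length, but ever more slowly."""
    return weight * math.log1p(tokens / scale)

def shaped_reward(task_reward: float, tokens: int) -> float:
    """Task reward minus the length penalty."""
    return task_reward - length_penalty(tokens)

# Marginal cost of 1,000 extra tokens on a short answer vs. a long rollout.
short_cost = length_penalty(2_000) - length_penalty(1_000)
long_cost = length_penalty(101_000) - length_penalty(100_000)
print(short_cost > long_cost)  # prints True
```

A linear penalty would make `short_cost` and `long_cost` equal, which is the flat-penalty trade-off the paragraph above warns about.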
Self-summarization
When a task approaches the context limit, the model hits a trigger, summarizes what it has done so far, and continues from that summary. Training this behavior teaches Composer 2 to work past its own context window. The same mechanism is what powers context compaction in the product, so a long Cursor session does not simply run out of room.
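The trigger-then-summarize loop can be sketched in a few lines. This is an illustration under stated assumptions, not Cursor's implementation: tokens are approximated by whitespace-separated words, the trigger fires at 80% of the limit, and `summarize` is a stand-in for the model writing its own summary.

```python
from typing import Callable, List, Tuple

def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())

def run_with_compaction(
    steps: List[str],
    context_limit: int,
    summarize: Callable[[List[str]], str],
    trigger_ratio: float = 0.8,
) -> Tuple[List[str], int]:
    """Append each step to the context; when usage reaches the trigger,
    replace the whole context with a single summary and keep going."""
    context: List[str] = []
    compactions = 0
    for step in steps:
        context.append(step)
        used = sum(count_tokens(entry) for entry in context)
        if used >= trigger_ratio * context_limit:
            context = [summarize(context)]
            compactions += 1
    return context, compactions

# Hypothetical run: eight 5-token steps against a 20-token limit.
steps = ["w w w w w"] * 8
context, compactions = run_with_compaction(
    steps,
    context_limit=20,
    summarize=lambda entries: f"summary of {len(entries)} entries",
)
print(compactions)  # prints 2
```

The eight steps total 40 tokens, twice the limit, yet the working context never exceeds it because each summary replaces the entries it covers.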
What is CursorBench?
CursorBench is Cursor's internal benchmark, built from real engineer queries and kept uncontaminated, meaning it is not in any training set. It is deliberately realistic in a way public benchmarks usually are not.
- Prompts are short and under-specified on purpose. Resolving the ambiguity is part of the task, the way it is for a real engineer.
- Scoring counts both quality and completion tokens, so a correct-but-bloated answer does not look as good as a correct-and-tight one.
- It separates strong from weaker models more sharply than SWE-bench does.
- Across training, best-of-16 kept climbing, which is the signal Cursor watched to know the model was still improving.
Real requests rarely spell out every constraint. A benchmark of clean, fully specified prompts rewards a model that follows instructions but never tests whether it can figure out what you actually meant. CursorBench keeps that ambiguity in on purpose.
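Two of the measurements mentioned in the list above, best-of-16 and token-aware scoring, can be sketched as follows. The data layout, function names and sample numbers are hypothetical, and the source does not say how CursorBench combines quality with completion tokens, so the sketch reports them side by side rather than inventing a formula.

```python
from statistics import mean
from typing import Dict, List

def best_of_k(scores_per_task: List[List[float]], k: int) -> float:
    """For each task, take the best of its first k sampled attempts,
    then average across tasks."""
    return mean(max(scores[:k]) for scores in scores_per_task)

def summarize_run(qualities: List[float], completion_tokens: List[int]) -> Dict[str, float]:
    """Report quality and token usage together, without merging them."""
    return {
        "mean_quality": mean(qualities),
        "mean_completion_tokens": mean(completion_tokens),
    }

# Hypothetical scores: 3 tasks, 3 sampled attempts each.
scores = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.5, 0.2, 1.0],
]
print(best_of_k(scores, k=3) >= best_of_k(scores, k=1))  # prints True

report = summarize_run([1.0, 1.0], [800, 2400])
print(report["mean_completion_tokens"])  # prints 1600
```

Best-of-k can only stay level or rise as k grows, so a best-of-16 curve that keeps climbing across training reflects the model itself improving rather than the sampling budget.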
Why does the harness matter?
Cursor runs one harness across every model: the same scaffolding of tools, prompts and orchestration that turns a raw model into something that can edit a codebase. A separate team owns it.
Cursor's framing for why a dedicated team owns the harness: the model is one input, and the harness around it is much of what users actually experience.
There is outside evidence for how much the harness carries. Artificial Analysis found that the best coding results came from other labs' models running inside Cursor's harness, not from those models on their own. The scaffolding travels.
What's next? Composer 2.5 and 3
Composer 2.5 is close. It posts a stronger Terminal-Bench score, and the gains come mostly from cleaner reward signals and refined training data rather than a new base, which barely changes between 2 and 2.5. Composer 3 is the bigger jump: it trains on a much larger cluster.
- Composer 2.5: Imminent; better Terminal-Bench from cleaner rewards + refined data, base barely changes
- Composer 3: Trains on a much larger cluster
For scale, about 40 people built Composer 2, split roughly half researchers and half engineers. A focused team, a focused model.
Frequently asked questions
Is Composer 2 a fully from-scratch Cursor model?
No. It starts from the open Kimi K2.5 base, then Cursor adds continued pre-training on code and long-horizon reinforcement learning on real tasks. The training that makes it Composer 2 is Cursor's; the starting weights are open.
How fast is Composer 2?
Composer 2 Fast generates at roughly 200 tokens per second. It serves quickly because it is a coding specialist with fewer parameters than a general-purpose model, and the base it builds on already uses efficient multi-head latent attention.
How good is Composer 2 at coding?
At release it scored about level with Opus 4.6 and slightly behind GPT-5.4 on coding, while costing roughly an order of magnitude less than Opus 4.6. It is tuned for software engineering, not general knowledge.
What is CursorBench and why not just use SWE-bench?
CursorBench is Cursor's uncontaminated internal benchmark built from real engineer queries, with deliberately short, under-specified prompts and scoring that counts both quality and completion tokens. It separates strong from weaker models more sharply than SWE-bench.
What is coming after Composer 2?
Composer 2.5 is imminent, with a stronger Terminal-Bench score from cleaner rewards and refined data while the base barely changes. Composer 3 trains on a much larger cluster.
Sources & last verified
Cursor ships frequently. Facts verified against primary sources on June 25, 2026.