Cursor's internal benchmark that scores models on both task performance and token efficiency, not accuracy alone.