Early in my career, I inherited a team with a beautiful dashboard. Every metric was green. Velocity was up. Story points completed per sprint were trending in the right direction. Code coverage was above the target. Release frequency was on schedule. By every measure on that dashboard, the team was performing brilliantly.

The product they were building had almost no users.

That was my introduction to what Osmani calls the watermelon effect: metrics that look green on the surface but are red underneath. The team was producing outputs at an impressive rate. They just weren’t producing outcomes that mattered.

The Distinction

The difference between outputs and outcomes is simple to state and surprisingly hard to internalise:

Outputs are what you produce: features shipped, bugs fixed, PRs merged, story points completed. They measure activity.

Outcomes are the results of what you produce: user adoption, revenue impact, customer satisfaction, time saved, problems solved. They measure value.

Most engineering organisations measure outputs because they’re easy to count. Lines of code, velocity, release frequency: these are all readily available from your tools. Outcomes are harder because they require understanding the connection between what you built and the effect it had, and that connection is often indirect and delayed.

Osmani frames this clearly: outputs are deliverables; outcomes are actual results. You can ship a feature (output) that nobody uses (no outcome). You can fix a hundred bugs (output) while the product’s core usability problem remains unsolved (no outcome). You can release every two weeks (output) while customer satisfaction declines (negative outcome).

Why We Default to Outputs

The pull toward output metrics is understandable. They’re concrete, they’re available immediately, and they feel objective. When someone asks “how is the team doing?” it’s much easier to say “we shipped 15 features this quarter” than to say “we improved user retention by 3%, though we’re still investigating the causal relationship.”

There’s also a psychological comfort in output metrics. They make you feel productive. A team that ships a lot of features feels like it’s doing well, even if those features aren’t moving the needle on anything important. Fournier makes this point sharply: teams that don’t ship are usually disengaged, but teams that ship without purpose are burning cycles on things that ultimately don’t matter, which contributes to a sense of purposelessness.

The danger is what happens when output metrics become targets. Once velocity becomes a goal rather than a diagnostic, teams optimise for it: inflating estimates, breaking work into smaller tickets, avoiding risky work that might slow them down. The metric goes up while actual effectiveness goes down.

Frameworks That Help

Several frameworks exist for shifting focus from outputs to outcomes. The ones I’ve found most useful:

OKRs (Objectives and Key Results). The objective describes what you want to achieve in qualitative terms. The key results describe how you’ll know you’ve achieved it in measurable terms. The key results should be outcomes, not outputs. “Launch the new checkout flow” is an output. “Reduce checkout abandonment by 15%” is an outcome. Google uses OKRs extensively, and research at Sears showed teams using them were 11.5% more likely to reach higher performance categories.
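To make the output/outcome distinction concrete, here is a minimal sketch of how an outcome key result like "reduce checkout abandonment by 15%" reduces to arithmetic. The function name and all numbers are illustrative, not drawn from any particular product:

```python
# Illustrative sketch: an outcome key result as arithmetic.
# All names and numbers here are hypothetical.

def abandonment_rate(started: int, completed: int) -> float:
    """Fraction of started checkouts that were abandoned."""
    if started == 0:
        return 0.0
    return 1 - completed / started

baseline = abandonment_rate(10_000, 6_500)  # 0.35
current = abandonment_rate(10_000, 7_050)   # 0.295

# Relative improvement against the baseline: the key result
# "reduce checkout abandonment by 15%" asks for >= 0.15 here.
improvement = (baseline - current) / baseline
print(f"{improvement:.1%}")  # prints "15.7%"
```

Note that "launch the new checkout flow" never appears in the calculation: the output is an input to the outcome, not the thing being measured.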

GQM (Goal-Question-Metric). Start with a goal, derive questions that would tell you whether you’re achieving it, then identify metrics that answer those questions. This prevents the common failure of collecting metrics first and trying to derive meaning from them afterwards.
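A GQM derivation can be written down as plain data before any instrumentation exists, which is exactly what keeps metrics tied to questions rather than the other way around. The goal, questions, and metric names below are hypothetical:

```python
# Hypothetical GQM worksheet expressed as data: the goal comes first,
# questions are derived from it, and metrics exist only to answer questions.
gqm = {
    "goal": "New sign-ups become active users within their first week",
    "questions": {
        "Where do new users drop out of onboarding?": [
            "completion rate per onboarding step",
            "median time spent per step",
        ],
        "Do users who finish onboarding come back?": [
            "7-day retention of onboarded users",
        ],
    },
}

# Any metric not reachable from a question has no reason to be collected.
metrics = [m for qs in gqm["questions"].values() for m in qs]
```

The useful discipline is the direction of derivation: if a dashboard chart can't be traced back to a question, and the question back to the goal, it's a candidate for deletion.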

SMART goals. Specific, Measurable, Achievable, Relevant, Time-bound. The “Relevant” criterion is the one that matters most here: it forces you to connect the goal to an outcome that actually matters to the business.

The Measurement Challenge

The honest truth is that outcome metrics are harder to collect, slower to materialise, and more ambiguous to interpret than output metrics. A feature you ship today might not show its impact for months. The impact might be confounded by other changes happening simultaneously. The causal chain from “we built this” to “this happened” is rarely clean.

This doesn’t mean you shouldn’t try. It means you need to be comfortable with imperfection. Larson’s advice on measurement is pragmatic: measure easy things first to build trust, only measure difficult things if you’ll actually use the data, and introduce one new measurement at a time. Don’t let the perfect be the enemy of the good.

I’ve also found that combining quantitative outcome metrics with qualitative signals works better than relying on either alone. User interviews, support ticket themes, customer feedback, and your own team’s intuition about whether the product is getting better: these are all valuable data points that don’t show up in dashboards.

The Cultural Shift

Moving from output measurement to outcome measurement isn’t just a metrics change; it’s a cultural change. It requires the organisation to accept that some sprints will produce fewer visible outputs because the team is investing in understanding the problem better. It requires product and engineering to collaborate on defining what success looks like before building starts. It requires leadership to resist the urge to judge teams by how busy they look.

This is a shift I’ve had to argue for at more than one company. The conversation usually starts with someone asking “why did the team only ship three features this quarter?” and the answer being “because those three features moved the retention metric by 8%, which is worth more than the twenty features we shipped last quarter that moved nothing.”

That’s a hard conversation to have, especially in organisations where shipping volume is culturally valued. But it’s the right conversation, and having it repeatedly is how you shift the culture from celebrating activity to celebrating impact.

The Metrics You Keep

If I had to pick a small set of metrics for an engineering team, they’d be outcome-focused:

  • Customer/user impact metrics specific to what the team owns
  • Quality indicators: not just bug counts, but user-facing quality measures
  • Cycle time: how long from idea to user value (not just to deployment)
  • Team health: engagement, retention, sustainable pace
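The cycle-time item above is easy to get subtly wrong, because most tooling stops the clock at deployment. A small sketch of the distinction, with hypothetical field names:

```python
# Hypothetical sketch: cycle time measured to user value, not just deployment.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class WorkItem:
    idea_at: datetime      # idea accepted into the backlog
    deployed_at: datetime  # change reached production
    adopted_at: datetime   # users first derived value (e.g. first real usage)

def time_to_deploy(item: WorkItem) -> timedelta:
    return item.deployed_at - item.idea_at

def time_to_value(item: WorkItem) -> timedelta:
    return item.adopted_at - item.idea_at

item = WorkItem(
    idea_at=datetime(2024, 1, 1),
    deployed_at=datetime(2024, 1, 10),
    adopted_at=datetime(2024, 2, 1),
)
# The gap between the two is where shipped outputs wait to become outcomes.
print(time_to_deploy(item).days, time_to_value(item).days)  # prints "9 31"
```

A team that only tracks the first number can look fast while users wait weeks for anything to change for them.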

Notice what’s not on the list: velocity, story points, lines of code, number of PRs. These can be useful diagnostics for the team itself, but they should never be reported upward as measures of effectiveness. The moment they become targets, they stop being useful.

Measure what matters. Accept that it’s harder. Do it anyway.