Mind the (Opportunity) Gap

The capability overhang debate is measuring an insufficient gap.


Consider a gym member who pays $80 a month and typically goes twice a week. They use the treadmill and the free weights. They have never swum in the pool, never taken a class, and never used the basketball courts. By any measure of utilization, they're barely scratching the surface of what the gym offers.

Is this a problem?

By their fitness goals, seemingly not. Nor by their willingness to keep paying. Perhaps not even by their likelihood of recommending the gym to a friend. They're optimizing for what they need. The gym is optimizing its margins. Everyone wins.

Yet this is the structure of the capability overhang debate in AI: a pattern the industry has decided is broken.

The Conventional Frame

The term “capability overhang” was popularized by Microsoft's CTO Kevin Scott and has since been adopted by OpenAI, enterprise consultancies, and most of the product discourse around AI adoption. The argument goes like this: models are getting dramatically more capable every quarter, but user behavior isn't keeping up. Power users engage at several times the depth of typical users. There's a huge delta between what AI can do and what people are actually doing with it. And the consensus is that this is a problem to solve.

OpenAI's framing at Davos splits the solutions into two tracks. A macro track pushes nations and workforces to invest in AI fluency and build extensive training programs, treating that fluency as a strategic capability. An enterprise track says organizations need change management, AI literacy training, and structured adoption programs. Both frame the overhang as a gap between model capability and user behavior that gets closed through exposure and education.

Product teams at the AI companies themselves have responded. More onboarding. More tutorial flows. More feature prompts, and more surfaces: ChatGPT spans Chat, Canvas, Projects, Custom GPTs, and Codex; Claude has Chat, Projects, Artifacts, and Computer Use; Gemini layers on top of Workspace. Each one represents more of the model's capability made available, assuming users will learn the mental model to use each one appropriately. I wrote about that surface fragmentation problem in “Good Listener, Difficult Coworker.” It's a symptom of this same diagnosis.

All of this shares an underlying assumption: the gap between model capability and user behavior is the gap worth measuring and closing. That assumption is where the debate goes sideways.

Two Gaps

There are actually two gaps worth distinguishing.

Gap A is the gap between what the model can do and what the user is doing with it. This is the gap the industry is measuring today. It is also the easy one to track: tokens, sessions, feature touches.

Gap B is the gap between what the user needs and what the user is getting — including needs they don't yet know to ask for. This is the gap that actually matters. It's harder to measure, and it leads to a completely different product strategy.

Gap A optimization produces engagement metrics: tokens per user, session length, retention curves driven by habitual use, feature adoption rates. Gap B optimization produces outcome metrics: did the user accomplish what they came to do, did they get the right depth of help for their actual situation, did they leave the interaction more capable than they arrived.

Let's use another analogy — think about a sports car. An owner drives their Porsche to get groceries. They never see anything close to peak performance. Is that capability overhang a problem? No. That's capability as margin, not waste. The same suspension that handles at 150 mph makes the car enjoyable to drive at 45 mph. The engineering that enables peak performance contributes to the experience at every speed. The owner is getting exactly what they paid for.

[Image: side-by-side photos of the same silver Porsche, parked at a grocery store and on an open mountain road]

Now consider the same car, but with a different owner who loves driving. They live near open roads or mountain passes that would be exhilarating for someone with the right vehicle. But they have no idea the car can handle those roads. They never push it, never experience the thing that would have made this purchase genuinely meaningful for them. That's a real loss. Not because they failed to utilize the car, but because they're missing an outcome that would have served them.

This is where the capability overhang debate fundamentally confuses two different problems. The first driver, the grocery shopper, isn't a problem. The second, the aspiring driver who doesn't know what their car can do, is. But the solution to this problem isn't to send every driver to a racetrack. It's to surface the right capability at the right moment for the specific driver behind the wheel.

The product's job isn't to expose more capability. It's to understand the user well enough to reveal the right capability.

What Gap A Optimization Actually Produces

When companies measure Gap A and build to close it, the product failures are predictable.

The first is surface proliferation. The product keeps adding modes, tabs, and specialized interfaces to expose more capability. The cognitive overhead shifts from the model to the user, which is exactly the burden AI was supposed to eliminate. Chat was originally magical because it required zero learning. The more surfaces a product adds in pursuit of utilization, the more that original promise erodes.

The second is engagement hacks. Streaks, usage badges, “have you tried this feature?” nudges, weekly recap emails about how much you used the product. These drive measurable engagement without driving real user success.

The third is feature walls. “To do that, upgrade to Pro.” Capability is gated behind awareness and onboarding rather than delivered in context, so the user has to know what to ask for before the product will serve them. That's the opposite of what AI was supposed to make possible.

What links these failures is a misread of when friction serves the user. Friction in front of someone with strong intent can be productive. It engages, teaches, and builds agency. Friction in front of someone without intent is just friction. Gap A optimization applies friction indiscriminately, assuming every user wants to go deeper when they're actually there to accomplish something specific.

None of this happens because AI product teams are careless or naive. It happens because what they're measuring is insufficient. The metrics dashboard rewards utilization. Utilization is easy to measure. Outcome delivery isn't. So the organization optimizes toward the thing it can see.

What Gap B Changes

If the metric becomes “did the user get what they needed, including what they didn't know to ask for,” the product strategy shifts in concrete ways.

Intent understanding becomes the core product problem. Not “how do we expose more features” but “how do we figure out what the user is actually trying to accomplish and deliver the right depth of capability for their actual situation.” This is a question no other software category can answer well. A document editor doesn't know what you're trying to say. A spreadsheet doesn't know if your analysis is answering the question you care about. AI products, uniquely, have the context to understand user intent. That understanding is the foundation for everything else.

Discovery can then become contextual, not promotional. The capability the user doesn't know about gets surfaced when they'd actually benefit from it. A user who has spent three hours iterating on a complex research task can be shown how to structure it as a multi-step operation. A user writing a simple email shouldn't be nudged toward agent workflows. The model has enough context to distinguish between these cases.

Teaching extends into the interaction itself, not only through separate modules. When a user asks a simple question that a more sophisticated prompt would have answered better, the model can surface what would have worked better or guide the user to add definition as a natural extension of the conversation. “If you want to go deeper on this, I could structure it as...” The user learns by doing and builds agency through use. The model becomes a teacher the way a good colleague does, by watching how you work and occasionally saying “have you tried...” when it would actually help. And, to be fair, this is starting to happen — Claude Code's AskUserQuestion tool is an early indication of where this can work.

The Measurement Problem

Here's the hard part: if outcome satisfaction is the right metric, how do we measure it? Activity is easy to measure. Utility isn't. And understanding whether users are becoming more capable over time is harder still. It may not show up in any single interaction, but instead in whether a user's goals get bigger, whether their asks get more sophisticated when warranted, or whether they develop the judgment to know when to go deeper and when a simple answer will do.

This is where AI products have an advantage no other software category has.

The model itself has semantic understanding of what the user asked for and what got delivered. An AI model can make a judgment about whether its response actually served the user's intent. It can also estimate the user's satisfaction from behavioral signals. Did the output succeed on the first attempt? How many turns did it take to complete the task? Did the user come back later with follow-up questions?
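To make the contrast concrete, here is a minimal sketch of what computing both kinds of metric from the same session log might look like. Everything in it is an assumption for illustration: the Session record, its field names, and the idea that task_completed comes from a model-side judgment are hypothetical, not a description of any real product's telemetry.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One task-oriented session. All fields are hypothetical log fields."""
    tokens: int                   # raw activity (a Gap A signal)
    turns: int                    # user turns before the task ended
    task_completed: bool          # assumed to come from a model-side judgment
    returned_for_follow_up: bool  # did the user come back to build on the result?


def activity_metrics(sessions):
    """Gap A view: how much the product was used."""
    if not sessions:
        raise ValueError("need at least one session")
    n = len(sessions)
    return {
        "sessions": n,
        "tokens_per_session": sum(s.tokens for s in sessions) / n,
    }


def outcome_metrics(sessions):
    """Gap B view: whether the user got what they came for."""
    if not sessions:
        raise ValueError("need at least one session")
    n = len(sessions)
    completed = [s for s in sessions if s.task_completed]
    first_try = [s for s in completed if s.turns == 1]
    avg_turns = (
        sum(s.turns for s in completed) / len(completed) if completed else None
    )
    return {
        "completion_rate": len(completed) / n,
        "first_attempt_rate": len(first_try) / n,
        "avg_turns_to_complete": avg_turns,
        "follow_up_rate": sum(s.returned_for_follow_up for s in sessions) / n,
    }


# Illustrative data only: the heaviest session by tokens is the one that failed.
sessions = [
    Session(tokens=12000, turns=9, task_completed=False, returned_for_follow_up=False),
    Session(tokens=800, turns=1, task_completed=True, returned_for_follow_up=True),
    Session(tokens=1600, turns=3, task_completed=True, returned_for_follow_up=False),
    Session(tokens=600, turns=1, task_completed=True, returned_for_follow_up=True),
]

activity = activity_metrics(sessions)
outcome = outcome_metrics(sessions)
```

In this made-up log, the session that contributes most to tokens per session is the one where the user never got what they needed. A utilization dashboard would count it as a win.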

Agency becomes measurable through patterns over time. Is this user taking on more complex tasks? Are their prompts getting sharper? Are they spending less time iterating to get to useful output? Are they using the tool in ways that compound their own capabilities or are they creating dependency? These are hard questions. But they're answerable by the system that has the most context on every interaction.
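As a rough sketch of "patterns over time," one option is to fit a simple trend line to per-task signals for a single user. The two series below (turns needed to reach useful output, and a 1-to-3 task complexity score) are hypothetical inputs I am assuming some upstream system can produce; the least-squares slope is just one plausible way to summarize them, not an established agency metric.

```python
def trend(values):
    """Ordinary least-squares slope of values against their position 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def agency_signal(turns_per_task, complexity_per_task):
    """Summarize one user's history, oldest task first.

    Assumption: agency is "growing" when tasks get harder while the
    iteration needed to reach useful output does not increase.
    """
    iteration_trend = trend(turns_per_task)
    complexity_trend = trend(complexity_per_task)
    return {
        "iteration_trend": iteration_trend,
        "complexity_trend": complexity_trend,
        "growing": complexity_trend > 0 and iteration_trend <= 0,
    }


# Illustrative history: harder tasks over time, fewer turns to get there.
signal = agency_signal(
    turns_per_task=[8, 7, 5, 4, 3, 3],
    complexity_per_task=[1, 1, 2, 2, 3, 3],
)
```

A user whose complexity stays flat while iteration climbs would show the opposite pattern, which is closer to the dependency case than the compounding one.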

That's a fundamentally different kind of product instrument than anything that's come before. It's also the reason this moment is the right moment to rethink what we're measuring. Five years ago, you couldn't build a measurement system around semantic understanding of outcomes. Now you can. The companies that figure out how to do it will have a real advantage over the ones still optimizing for tokens.

Why This Matters Commercially

Back to the gym. The economics of subscription fitness depend on a specific dynamic: you want members to come enough that they perceive value and keep paying, but not so much that you strain your equipment and your margins. The gym that converts every member into a daily power user would go out of business. The gym whose members never show up goes out of business too.

AI subscriptions are in the same position. Inference costs now consume a substantial share of revenue at frontier labs; recent analysis of OpenAI's economics puts it at or above half. A power user on a flagship model can cost 10 times as much to serve as a casual user, or more. OpenAI is now targeting roughly $600 billion in total compute spending by 2030. There is a real financial reason not to convert every $20-a-month subscriber into a maximum-depth user.

But low utilization with no outcome delivery is the worst of both worlds. The user keeps paying for a while. Then they cancel, because they never encountered the thing that would have made the subscription feel essential.
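A back-of-the-envelope sketch shows why. The $20 price and the 10x multiplier come from the figures above; the $5 monthly inference cost for a casual user is purely an assumption I picked to make the arithmetic visible, not a reported number.

```python
PRICE = 20.0           # from the text: a $20-a-month subscriber
POWER_MULTIPLIER = 10  # from the text: a power user costs roughly 10x a casual user
CASUAL_COST = 5.0      # ASSUMPTION: illustrative monthly inference cost, casual user


def blended_margin(power_share):
    """Average monthly margin per subscriber when a fraction are power users."""
    power_cost = CASUAL_COST * POWER_MULTIPLIER
    avg_cost = (1 - power_share) * CASUAL_COST + power_share * power_cost
    return PRICE - avg_cost


def break_even_power_share():
    """Share of power users at which the average subscriber stops being profitable."""
    power_cost = CASUAL_COST * POWER_MULTIPLIER
    return (PRICE - CASUAL_COST) / (power_cost - CASUAL_COST)


margin_all_casual = blended_margin(0.0)    # 15.0 dollars per subscriber
margin_ten_percent = blended_margin(0.10)  # about 10.5 dollars per subscriber
break_even = break_even_power_share()      # one third of subscribers
```

Under these assumed numbers, the average subscriber stops being profitable once a third of the base behaves like power users. The exact threshold moves with the cost assumption, but the shape of the problem does not.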

This is the position AI companies are about to find themselves in. Google will keep giving away Gemini through Workspace. The floor of “good enough” keeps rising. The question every AI company will face in the next 18 months is: why should users keep paying our subscription when the free alternative is 80 percent as capable for their use case?

“Because we have more features” isn't going to work. Users already aren't using the features they have.

The durable answer is something closer to: “Because our product delivers more of what you actually need, and it keeps getting better at understanding what you need over time.” That's an agency argument, not a utilization argument. And it's the only argument that holds up against commoditization.

Retention in subscription businesses is ultimately a function of whether users feel they've built something with the tool. A gym member who's gotten fit feels real loss at canceling. An AI user who's become materially more effective at their work feels the same loss. Neither is driven by maximum utilization. Both are driven by compounded outcome delivery. That's the flywheel: outcomes build agency, agency builds durable usage, durable usage builds retention and the word-of-mouth distribution that — as I argued in “The Product Was the Fixed Thing” — becomes the actual moat in a market where anyone can build anything.

What Gets Built

The capability overhang isn't going to close because we get better at teaching users. It closes because products get better at meeting users where they are, delivering the right depth for the task at hand, and surfacing the capability the user didn't know to ask for (when that capability would genuinely serve them).

This is a product design problem, not an adoption problem. The measurement problem is hard. The product design problem is hard. The organizational discipline to optimize for outcome over engagement may be the hardest part. But it's the work that matters. Not every gym member needs to swim. Not every Porsche needs a racetrack. But every user deserves a product that understands them well enough to know the difference.

Mind that gap.