Enxiang Qiu


The Hard Part of AI Startups Isn't Building the Demo

Apr 30, 2026 · Blog

After watching Jake Heller at YC AI Startup School, I keep coming back to one point: in this AI wave, the scarce thing is no longer the impressive demo. It is turning that demo into a product customers are willing to rely on.


I recently watched a talk by Jake Heller at YC AI Startup School.

He is the co-founder of Casetext, the company behind CoCounsel, which was later acquired by Thomson Reuters.

But the part of the talk that stayed with me was not the $650M exit. It was that he made one thing unusually clear:

In this wave of AI startups, building a demo that looks smart is no longer as rare as it used to be. What is rare is turning that demo into a product customers are willing to depend on over time.

That sentence feels like a useful dividing line for a lot of AI projects today.

My read on it

If I had to compress this AI startup wave into one sentence, it would be this:

The real barrier is no longer "can you get access to the strongest model?" It is "can you go deep enough on a workflow to make it stable, trustworthy, and operationally useful?"

Many teams can now produce the first impressive demonstration.

But demoable and deliverable are not the same thing.

What customers actually pay for is not "it is occasionally brilliant." It is "it works reliably on the task we actually need done."

That gap is what Jake Heller keeps returning to, and I think that is the strongest part of the talk.

In the AI era, the idea-selection logic has changed

In older software markets, teams often had to guess what users might want.

In AI, there is a more direct starting point: look at what users are already paying people to do today.

This was one of my favorite frames in the entire talk.

If a company is already paying humans to do something, that usually tells you at least two things: the demand is real, and there is already a budget attached to it.

That is much more grounded than starting from "what new capability did the model gain this month?"

He breaks AI opportunities into three types.

On the surface, that sounds like an idea-generation framework. In practice, it forces you back to a harder question:

Are you building around a real workflow, or just around a feature that looks cool?

I strongly agree with that distinction. A lot of AI ideas do not fail because they are unoriginal. They fail because they are not real enough. They sound like slogans rather than like tasks that already exist, are already performed, and are already paid for.

Great AI products are built around workflows, not prompts

The second big part of the talk is how he thinks about building the product itself.

His first question is not "what can the model do?" It is:

How does the best human in this field actually get the work done?

That is an important difference.

Many AI products stay shallow not because the model is weak, but because the team never really understood the task they were trying to automate. They are automating something they do not truly understand.

Heller's method is simple and disciplined: sit with the best practitioners, break the work into the steps they actually follow, and only then decide which steps the model should handle.

This is also a useful way to cool down some of the hype around agents.

Not every problem needs to be "agentic." Many tasks are better handled as straightforward workflows when the path is stable and the steps are known. You only need a more agent-like structure when the task genuinely changes with context.
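As a sketch of that distinction, here is what a fixed workflow looks like when the path is stable and the steps are known. The step functions below are hypothetical stand-ins for model calls, not anything from the talk; the point is the fixed pipeline rather than an agent loop:

```python
# Hypothetical stand-ins for model calls; the point is the fixed pipeline.
def extract_facts(doc: str) -> list[str]:
    return [line.strip() for line in doc.splitlines() if line.strip()]

def draft_summary(facts: list[str]) -> str:
    return " ".join(facts)

def check_coverage(summary: str, facts: list[str]) -> bool:
    return all(fact in summary for fact in facts)

def run_workflow(doc: str) -> str:
    """The path is stable and the steps are known, so there is no
    agent loop: each step runs once, in a fixed order, with a check."""
    facts = extract_facts(doc)
    summary = draft_summary(facts)
    if not check_coverage(summary, facts):
        raise ValueError("summary dropped a fact; flag for human review")
    return summary
```

An agent-like structure would only replace this when the next step genuinely depends on what the previous step produced.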

In that sense, the core capability is not "intelligence theater." It is workflow modeling.

That is valuable because it changes the starting point of a project. You stop asking what the model can do and start asking how the work is actually done.

Evals are the line between demo and product

If the first half of the talk is about choosing the right problem and structuring the workflow, I think the hardest and most important part is his discussion of evals.

His point is blunt:

Many teams can get a system to 60–70%. That is enough to raise money, enough to demo, and sometimes enough to win a pilot.

But it is still nowhere near a real product.

In production, the important thing is not that the system is occasionally impressive. The important thing is that it is reliably usable. And reliable usability cannot be produced by intuition alone. It has to be built through evaluation, feedback, and repeated iteration.

That is why I increasingly believe this:

The real engineering discipline of AI application teams is not wiring up the API. It is building the eval layer.

Without evals, you do not know where the system is wrong. Without a holdout set, you do not know whether you are just tuning prompts against a familiar example set. Without error analysis, you are left with a vague feeling that "this seems better now."
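A minimal version of that eval layer, as I would sketch it (nothing here is from Casetext's stack): split off a holdout set, score the system on it, and bucket failures by category so error analysis replaces vague impressions.

```python
import random
from collections import Counter

def evaluate(system, examples, holdout_frac=0.3, seed=0):
    """Score `system` on a held-out slice of `examples`.
    Each example is {"input": ..., "expected": ..., "category": ...}."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    holdout = shuffled[:cut] or shuffled  # fall back if the set is tiny
    correct = 0
    failures = Counter()  # error analysis: which failure modes dominate?
    for ex in holdout:
        if system(ex["input"]) == ex["expected"]:
            correct += 1
        else:
            failures[ex.get("category", "uncategorized")] += 1
    return {"accuracy": correct / len(holdout), "failures": failures}
```

Tuning prompts against the non-holdout examples while only reporting holdout accuracy is what keeps "this seems better now" honest.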

Heller makes this sound very practical. Many people stop at 60% and conclude that AI cannot do the task. What he treats as the real threshold is whether a team is willing to spend the long stretch of time needed to refine prompts, add examples, and systematically repair failure modes.

That points to a very grounded truth:

Moats are often less about abstract vision and more about whether someone is willing to do the exhausting work all the way through.

AI products are not just interfaces. They are trust systems.

What I most want to keep from this part of the talk is not only the line that products are not just the pixels on the screen. It is the fuller trust model underneath that line.

Between [00:27:06] and [00:30:15], Heller is not really explaining how to make customers think your AI looks impressive. He is explaining how customers slowly become willing to use it inside real work. I think he is making five layers explicit.

First, AI comes with a natural trust gap.
For many companies, AI is not just another software upgrade. It is something new and scary. Customers are not uninterested. Many are eager to try. But they used to hand work to people, and people can be trained, corrected, and managed. AI does not come with that default trust foundation.

Second, trust is not built through claims. It is built through comparative evidence.
That is why he mentions head-to-head comparisons. Do not just stage a demo. Take a real task, run the current human workflow, run the AI workflow, and compare them side by side. Which one is faster? Which one is better? Where does each one fail? Customers do not trust the positioning line. They trust what they can see.

Third, a pilot is not the finish line. It is the start of trust-building.
After [00:28:22], he makes the point very directly: many teams treat winning a pilot like they have already won the account, but many pilots never become real revenue. That means "willing to try" is not the same as "willing to rely," and neither one means the customer is ready to move critical work into the system.

Fourth, trust is built through rollout and onboarding.
From [00:29:05] to [00:29:44], he is very practical: one of the founder's jobs is to make sure customers actually understand the product and can really use it. Sometimes that means more than sending over a link. It means sitting with the customer and helping them run the first real workflow end to end. Many adoption problems look like sales problems, but are really deployment and enablement problems.

Fifth, the product is not just the pixels on the screen.
From [00:29:46] to [00:30:13], this becomes the key point. Trust does not come only from the UI, or from whether one answer looked correct once. It also comes from support, customer success, training, deployment, and the whole surrounding system. In other words, AI products do not just sell capability. They also sell verification paths, onboarding structure, failure handling, and operational support.
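The head-to-head comparison in the second layer can be sketched as a tiny harness. The function names and the judge are hypothetical; the point is that both workflows run on the same real tasks:

```python
def head_to_head(tasks, human_fn, ai_fn, judge_fn):
    """Run the current human workflow and the AI workflow on the same
    tasks; `judge_fn` returns "human", "ai", or "tie" for each task."""
    tally = {"human": 0, "ai": 0, "tie": 0}
    for task in tasks:
        tally[judge_fn(human_fn(task), ai_fn(task))] += 1
    return tally
```

What the customer sees is the tally on their own work, not a staged demo, which is exactly the kind of comparative evidence Heller is describing.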

This is why I think many teams still misunderstand adoption. They frame it as a product problem or a sales problem, when in practice it is much closer to a systems problem.

A customer being willing to try is not the same as a customer being willing to migrate.
A customer being willing to run a pilot is not the same as a customer being willing to renew.
A customer praising the demo is not the same as a customer entrusting the product with real work.

That is also why Heller warns against over-reading pilot revenue. A lot of AI companies look healthy right now because they are funded by curiosity budgets, not because they have become indispensable.

So the right question is not just "will someone try this?" It is "will someone come to rely on this?"

So what does trust-building look like in practice?

If I turn that section of the talk into concrete product moves, they all converge on the same test.

Real trust is not the first purchase. It is the moment the customer starts to depend on the system.

What I want to take from this

If I turn the talk into reminders for myself, I end up with four.

First, do not start from "what AI feature can I build?" Start from "what work are people already paying to get done?"

Second, do not start with agents. Start with the workflow. Break the task down, then decide which parts deserve model involvement.

Third, do not confuse "it runs" with "it works." Any AI project I want to take seriously should have a minimum eval set from the beginning.

Fourth, do not think of the product as a response box. Evidence, citations, confidence, verification paths, and onboarding are all part of the product.

For example, a more interesting small project to me is not "AI video summarizer." It is a more specific Founder Research Assistant:

input a founder interview or talk, and output a timestamped, evidence-backed, reviewable founder memo.

The value there is not "summary." It is "credibility." The system cannot hallucinate opinions, quotes, or timestamps.

If I applied Heller's framework to that product, the weak version would be: paste a YouTube link in, get back a paragraph that sounds like a summary.
The stronger version would be: every claim comes with a timestamp, a quoted source line, and a low-confidence marker when the system is uncertain, so the user can click back and verify.
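To make the stronger version concrete, here is one possible shape for its output unit. The field names and the confidence threshold are my assumptions, not a spec from the talk:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """One reviewable unit of the memo: no claim ships without a source."""
    text: str
    timestamp: str     # e.g. "00:28:22", pointing back into the talk
    quote: str         # the source line the claim is grounded in
    confidence: float  # 0.0-1.0; low values get a visible marker

def render(claim: Claim, low_threshold: float = 0.7) -> str:
    """Render a claim so the reader can click back and verify it."""
    marker = " [low confidence]" if claim.confidence < low_threshold else ""
    return f'{claim.text}{marker} ("{claim.quote}", at {claim.timestamp})'
```

The design choice is that verification is part of the rendered output, not a hidden log: every claim carries its own path back to the source.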

The pilot should also start much narrower. Not "general research assistant," but something like "extract five actionable insights from founder interviews."

And the trust design should be explicit: compare the system's output against a user's own manual notes and ask which version misses less, invents less, and is easier to verify.

That is exactly where Heller's framework becomes useful: start from paid work, model the real workflow, build the eval layer, and design for verification rather than impression.

That starts to feel like building a product rather than staging a performance.

Closing

Models will keep improving, and tools will keep getting cheaper.

But I increasingly think the teams that actually separate themselves will not be the ones who got access to the newest model first. They will be the ones who pick workflows people already pay for, go deep enough to make them reliable, and do the unglamorous eval and trust work all the way through.

The teams that can cross the gap from demo -> dependable product look much more like real companies. The rest are still mostly in the presentation layer.