Enxiang Qiu


The Hard Part of AI Startups Isn't Building the Demo

Apr 30, 2026 · Blog

After watching Jake Heller at YC AI Startup School, I keep coming back to one point: in this AI wave, the scarce thing is no longer the impressive demo. It is turning that demo into a product customers are willing to rely on.


I recently watched a talk by Jake Heller at YC AI Startup School.

He is the co-founder of Casetext, the company behind CoCounsel, which was later acquired by Thomson Reuters.

But the part of the talk that stayed with me was not the $650M exit. It was that he made one thing unusually clear:

In this wave of AI startups, building a demo that looks smart is no longer as rare as it used to be. What is rare is turning that demo into a product customers are willing to depend on over time.

That sentence feels like a useful dividing line for a lot of AI projects today.

My read on it

If I had to compress this AI startup wave into one sentence, it would be this:

The real barrier is no longer "can you get access to the strongest model?" It is "can you go deep enough on a workflow to make it stable, trustworthy, and operationally useful?"

Many teams can now produce the first impressive demonstration.

But demoable and deliverable are not the same thing.

What customers actually pay for is not "it is occasionally brilliant." It is "it works reliably on the task we actually need done."

That gap is what Jake Heller keeps returning to, and I think that is the strongest part of the talk.

In the AI era, the idea-selection logic has changed

In older software markets, teams often had to guess what users might want.

In AI, there is a more direct starting point: look at what users are already paying people to do today.

This was one of my favorite frames in the entire talk.

If a company is already paying humans to do something, that usually tells you at least two things: the demand is real, and there is already a budget attached to it.

That is much more grounded than starting from "what new capability did the model gain this month?"

He breaks AI opportunities into three types.

On the surface, that sounds like an idea-generation framework. In practice, it forces you back to a harder question:

Are you building around a real workflow, or just around a feature that looks cool?

I strongly agree with that distinction. A lot of AI ideas do not fail because they are unoriginal. They fail because they are not real enough. They sound like slogans rather than like tasks that already exist, are already performed, and are already paid for.

Great AI products are built around workflows, not prompts

The second big part of the talk is how he thinks about building the product itself.

His first question is not "what can the model do?" It is:

How does the best human in this field actually get the work done?

That is an important difference.

Many AI products stay shallow not because the model is weak, but because the team never really understood the task they were trying to automate. They are automating something they do not truly understand.

Heller's method is simple and disciplined: sit with the best practitioners, break the work into the steps they actually follow, and only then decide which steps the model should handle.

This is also a useful way to cool down some of the hype around agents.

Not every problem needs to be "agentic." Many tasks are better handled as straightforward workflows when the path is stable and the steps are known. You only need a more agent-like structure when the task genuinely changes with context.
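As a sketch of that distinction, here is what a fixed workflow looks like when the path is stable and the steps are known. The step functions below are hypothetical stand-ins for model calls, not anything from the talk; the point is the fixed pipeline rather than an agent loop:

```python
# Hypothetical stand-ins for model calls; the point is the fixed pipeline.
def extract_facts(doc: str) -> list[str]:
    return [line.strip() for line in doc.splitlines() if line.strip()]

def draft_summary(facts: list[str]) -> str:
    return " ".join(facts)

def check_coverage(summary: str, facts: list[str]) -> bool:
    return all(fact in summary for fact in facts)

def run_workflow(doc: str) -> str:
    """The path is stable and the steps are known, so there is no
    agent loop: each step runs once, in a fixed order, with a check."""
    facts = extract_facts(doc)
    summary = draft_summary(facts)
    if not check_coverage(summary, facts):
        raise ValueError("summary dropped a fact; flag for human review")
    return summary
```

An agent-like structure would only replace this when the next step genuinely depends on what the previous step produced.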

In that sense, the core capability is not "intelligence theater." It is workflow modeling.

That is valuable because it changes the starting point of a project. You stop asking what the model can do and start asking how the work is actually done.

Evals are the line between demo and product

If the first half of the talk is about choosing the right problem and structuring the workflow, I think the hardest and most important part is his discussion of evals.

His point is blunt:

Many teams can get a system to 60–70%. That is enough to raise money, enough to demo, and sometimes enough to win a pilot.

But it is still nowhere near a real product.

In production, the important thing is not that the system is occasionally impressive. The important thing is that it is reliably usable. And reliable usability cannot be produced by intuition alone. It has to be built through evaluation, feedback, and repeated iteration.

That is why I increasingly believe this:

The real engineering discipline of AI application teams is not wiring up the API. It is building the eval layer.

Without evals, you do not know where the system is wrong. Without a holdout set, you do not know whether you are just tuning prompts against a familiar example set. Without error analysis, you are left with a vague feeling that "this seems better now."
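A minimal version of that eval layer, as I would sketch it (nothing here is from Casetext's stack): split off a holdout set, score the system on it, and bucket failures by category so error analysis replaces vague impressions.

```python
import random
from collections import Counter

def evaluate(system, examples, holdout_frac=0.3, seed=0):
    """Score `system` on a held-out slice of `examples`.
    Each example is {"input": ..., "expected": ..., "category": ...}."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    holdout = shuffled[:cut] or shuffled  # fall back if the set is tiny
    correct = 0
    failures = Counter()  # error analysis: which failure modes dominate?
    for ex in holdout:
        if system(ex["input"]) == ex["expected"]:
            correct += 1
        else:
            failures[ex.get("category", "uncategorized")] += 1
    return {"accuracy": correct / len(holdout), "failures": failures}
```

Tuning prompts against the non-holdout examples while only reporting holdout accuracy is what keeps "this seems better now" honest.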

Heller makes this sound very practical. Many people stop at 60% and conclude that AI cannot do the task. What he treats as the real threshold is whether a team is willing to spend the long stretch of time needed to refine prompts, add examples, and systematically repair failure modes.

That points to a very grounded truth:

Moats are often less about abstract vision and more about whether someone is willing to do the exhausting work all the way through.

AI products are not just interfaces. They are trust systems.

What I most want to keep from this part of the talk is not only the line that products are not just the pixels on the screen. It is the fuller trust model underneath that line.

Between [00:27:06] and [00:30:15], Heller is not really explaining how to make customers think your AI looks impressive. He is explaining how customers slowly become willing to use it inside real work. I think he is making five layers explicit.

First, AI comes with a natural trust gap.
For many companies, AI is not just another software upgrade. It is something new and scary. Customers are not uninterested. Many are eager to try. But they used to hand work to people, and people can be trained, corrected, and managed. AI does not come with that default trust foundation.

Second, trust is not built through claims. It is built through comparative evidence.
That is why he mentions head-to-head comparisons. Do not just stage a demo. Take a real task, run the current human workflow, run the AI workflow, and compare them side by side. Which one is faster? Which one is better? Where does each one fail? Customers do not trust the positioning line. They trust what they can see.

Third, a pilot is not the finish line. It is the start of trust-building.
After [00:28:22], he makes the point very directly: many teams treat winning a pilot like they have already won the account, but many pilots never become real revenue. That means "willing to try" is not the same as "willing to rely," and neither one means the customer is ready to move critical work into the system.

Fourth, trust is built through rollout and onboarding.
From [00:29:05] to [00:29:44], he is very practical: one of the founder's jobs is to make sure customers actually understand the product and can really use it. Sometimes that means more than sending over a link. It means sitting with the customer and helping them run the first real workflow end to end. Many adoption problems look like sales problems, but are really deployment and enablement problems.

Fifth, the product is not just the pixels on the screen.
From [00:29:46] to [00:30:13], this becomes the key point. Trust does not come only from the UI, or from whether one answer looked correct once. It also comes from support, customer success, training, deployment, and the whole surrounding system. In other words, AI products do not just sell capability. They also sell verification paths, onboarding structure, failure handling, and operational support.
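The head-to-head comparison in the second layer can be sketched as a tiny harness. The function names and the judge are hypothetical; the point is that both workflows run on the same real tasks:

```python
def head_to_head(tasks, human_fn, ai_fn, judge_fn):
    """Run the current human workflow and the AI workflow on the same
    tasks; `judge_fn` returns "human", "ai", or "tie" for each task."""
    tally = {"human": 0, "ai": 0, "tie": 0}
    for task in tasks:
        tally[judge_fn(human_fn(task), ai_fn(task))] += 1
    return tally
```

What the customer sees is the tally on their own work, not a staged demo, which is exactly the kind of comparative evidence Heller is describing.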

This is why I think many teams still misunderstand adoption. They frame it as a product problem or a sales problem, when in practice it is much closer to a systems problem.

A customer being willing to try is not the same as a customer being willing to migrate.
A customer being willing to run a pilot is not the same as a customer being willing to renew.
A customer praising the demo is not the same as a customer entrusting the product with real work.

That is also why Heller warns against over-reading pilot revenue. A lot of AI companies look healthy right now because they are funded by curiosity budgets, not because they have become indispensable.

So the right question is not just "will someone try this?" It is "will someone come to rely on this?"

So what does trust-building look like in practice?

If I turn that section of the talk into concrete product moves, they all converge on the same test.

Real trust is not the first purchase. It is the moment the customer starts to depend on the system.

What I want to take from this

If I turn the talk into reminders for myself, I end up with four.

First, do not start from "what AI feature can I build?" Start from "what work are people already paying to get done?"

Second, do not start with agents. Start with the workflow. Break the task down, then decide which parts deserve model involvement.

Third, do not confuse "it runs" with "it works." Any AI project I want to take seriously should have a minimum eval set from the beginning.

Fourth, do not think of the product as a response box. Evidence, citations, confidence, verification paths, and onboarding are all part of the product.

For example, a more interesting small project to me is not "AI video summarizer." It is a more specific Founder Research Assistant:

input a founder interview or talk, and output a timestamped, evidence-backed, reviewable founder memo.

The value there is not "summary." It is "credibility." The system cannot hallucinate opinions, quotes, or timestamps.

If I applied Heller's framework to that product, the weak version would be: paste a YouTube link in, get back a paragraph that sounds like a summary.
The stronger version would be: every claim comes with a timestamp, a quoted source line, and a low-confidence marker when the system is uncertain, so the user can click back and verify.
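To make the stronger version concrete, here is one possible shape for its output unit. The field names and the confidence threshold are my assumptions, not a spec from the talk:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """One reviewable unit of the memo: no claim ships without a source."""
    text: str
    timestamp: str     # e.g. "00:28:22", pointing back into the talk
    quote: str         # the source line the claim is grounded in
    confidence: float  # 0.0-1.0; low values get a visible marker

def render(claim: Claim, low_threshold: float = 0.7) -> str:
    """Render a claim so the reader can click back and verify it."""
    marker = " [low confidence]" if claim.confidence < low_threshold else ""
    return f'{claim.text}{marker} ("{claim.quote}", at {claim.timestamp})'
```

The design choice is that verification is part of the rendered output, not a hidden log: every claim carries its own path back to the source.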

The pilot should also start much narrower. Not "general research assistant," but something like "extract five actionable insights from founder interviews."

And the trust design should be explicit: compare the system's output against a user's own manual notes and ask which version misses less, invents less, and is easier to verify.

That is exactly where Heller's framework becomes useful: start from paid work, model the real workflow, build the eval layer, and design for verification rather than impression.

That starts to feel like building a product rather than staging a performance.

Closing

Models will keep improving, and tools will keep getting cheaper.

But I increasingly think the teams that actually separate themselves will not be the ones who got access to the newest model first. They will be the ones who pick workflows people already pay for, go deep enough to make them reliable, and do the unglamorous eval and trust work all the way through.

The teams that can cross the gap from demo -> dependable product look much more like real companies. The rest are still mostly in the presentation layer.