Lenny's Podcast·January 11, 2026

Why Most AI Products Fail: Lessons from 50+ Enterprise Deployments

OpenAI and Google veterans share the CCCD framework for building AI products that avoid trust erosion and endless hot fixes.


The Two Fundamental Differences That Break Traditional Product Development

Aishwarya Ranti worked on AI research at Alexa and Microsoft, with 35+ published research papers. Kiriti Bhattam leads Codex at OpenAI after a decade building AI infrastructure at Google and Kumo. Together they've supported 50+ AI deployments and teach the #1-rated AI course on Maven. Their core message: AI products demand fundamentally different thinking from traditional software development.

The first difference is non-determinism. "You don't know how your user might behave with your product and you also don't know how the LLM might respond to that." In traditional software, you build a well-mapped decision engine. Booking.com has buttons and forms that convert intent to action predictably. With AI, both input (natural language can express the same intent countless ways) and output (LLMs are probabilistic black boxes) are unpredictable. You're working with an input, output, and process you don't fully understand.

The second difference is the agency-control trade-off. "Every time you hand over decision-making capabilities to agentic systems, you're kind of relinquishing some amount of control on your end." Ash finds it shocking that more people don't discuss this. The AI community is obsessed with building autonomous agents, but autonomy means losing control. Before giving an AI agent more agency, you need to verify it has earned trust through demonstrated reliability.
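
To make the trade-off concrete, here is a minimal sketch of an "earned autonomy" gate in Python. The level names, thresholds, and acceptance metric are hypothetical, not something the guests prescribe; the point is only that the agent moves up an agency level after enough human-reviewed outcomes confirm its reliability, and not before.

```python
# Illustrative "earned autonomy" gate. All names and thresholds are hypothetical.
from dataclasses import dataclass

LEVELS = ["route_only", "suggest", "act_autonomously"]  # low -> high agency

@dataclass
class AgencyPolicy:
    level: str                  # current agency level
    min_reviewed: int           # human-reviewed outcomes needed before any promotion
    min_acceptance_rate: float  # share of outputs humans accepted unchanged

def next_agency_level(reviewed: int, accepted: int, policy: AgencyPolicy) -> str:
    """Promote one level only after reliability is demonstrated on reviewed work."""
    if reviewed < policy.min_reviewed:
        return policy.level  # not enough evidence yet: keep the lower-agency level
    if accepted / reviewed >= policy.min_acceptance_rate:
        i = LEVELS.index(policy.level)
        return LEVELS[min(i + 1, len(LEVELS) - 1)]  # one step at a time, never a leap
    return policy.level  # reliability not proven: do not relinquish more control

# 470 of 500 reviewed drafts accepted unchanged (94%) misses a 95% bar: stay at "suggest".
policy = AgencyPolicy(level="suggest", min_reviewed=300, min_acceptance_rate=0.95)
print(next_agency_level(reviewed=500, accepted=470, policy=policy))
```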

The reliability problem is real: a UC Berkeley paper found that 74-75% of enterprises cited reliability as their biggest problem. That's why they weren't comfortable deploying customer-facing products; they couldn't trust the system. This explains why most enterprise AI today focuses on productivity tools rather than end-to-end workflow replacement.

Why the CCCD Framework Prevents Catastrophic AI Failures

The guests developed the Continuous Calibration, Continuous Development (CCCD) framework after painful experience. They built an end-to-end customer support agent that required so many hot fixes they had to shut it down. Air Canada's chatbot hallucinated a refund policy that didn't exist, and a tribunal ruled the airline had to honor it. These disasters are preventable.

Start with high control and low agency. "It's not about being the first company to have an agent among your competitors. It's about have you built the right flywheels in place so that you can improve over time." For a customer support agent: V1 just routes tickets to departments (humans still decide). V2 suggests draft responses that humans can edit, logging what changes they make. V3 handles end-to-end resolution only after V1 and V2 have proven reliable.

For coding assistants, the same pattern applies. V1: suggest inline completions and snippets. V2: generate larger blocks like tests or refactors for human review. V3: apply changes and open PRs autonomously. For marketing: V1 drafts copy, V2 builds and runs campaigns with approval, V3 launches and auto-optimizes across channels.

The customer support progression teaches everything. Even routing—seemingly simple—can be incredibly complex in enterprises. Taxonomies are messy with duplicate categories and dead nodes from 2019. Human agents know these quirks from experience; AI doesn't. By starting with routing, you fix data issues before they torpedo more ambitious automation. The flywheel effect means each version generates training data for the next.
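
As a rough sketch of what that V1 could look like, assuming a hypothetical support pipeline (the keyword router, department list, and log format are placeholders, not anything the guests built): the model only proposes a queue, a human makes the call, and every disagreement is logged as the training and eval data that earns V2 its extra agency.

```python
# Illustrative V1 router: the agent suggests, a human decides, disagreements feed the flywheel.
# Department names, the keyword stand-in for a model call, and the log format are hypothetical.
import json
from datetime import datetime, timezone

def propose_route(ticket_text: str) -> str:
    """Stand-in for an LLM or classifier call that suggests a department."""
    keywords = {"refund": "billing", "invoice": "billing", "late": "shipping",
                "broken": "technical_support", "return": "returns"}
    for word, dept in keywords.items():
        if word in ticket_text.lower():
            return dept
    return "technical_support"  # default queue; a human still confirms

def handle_ticket(ticket_id: str, ticket_text: str, human_decision: str,
                  log_path: str = "routing_log.jsonl") -> dict:
    """Humans keep the final say; every correction becomes data for V2."""
    suggestion = propose_route(ticket_text)
    record = {
        "ticket_id": ticket_id,
        "text": ticket_text,
        "suggested": suggestion,
        "final": human_decision,
        "agreed": suggestion == human_decision,
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

handle_ticket("T-1042", "I was charged twice for one invoice", human_decision="billing")
```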

What Separates Companies That Succeed With AI Products

The guests see a "success triangle" with three dimensions: great leaders, good culture, and technical progress. None work in isolation.

Leaders must rebuild their intuitions. "Leaders have to get back to being hands-on... You must be comfortable with the fact that your intuitions might not be right and you probably are the dumbest person in the room." One CEO Ash worked with blocked out 4-6am every morning for "catching up with AI": no meetings, just learning from trusted sources. He'd come back with questions to bounce off AI experts. Leaders who built intuitions over 10-15 years now need to relearn them.

A culture of empowerment beats fear of replacement. Subject matter experts are critical: they understand what AI should actually do. But in many companies, they refuse to help because they think their jobs are being replaced. Leaders must frame AI as augmentation for 10x productivity, not replacement, and get the entire organization working together to make AI useful.

Technical obsession with workflows, not tools. Successful teams understand their workflows deeply before choosing technology. "80% of so-called AI engineers, AI PMs spend their time actually understanding their workflows very well." The agent might only handle part of a workflow. Machine learning might handle another part. Deterministic code handles the rest. Tool obsession without workflow understanding leads to failure.
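
As a rough sketch of that decomposition, the pipeline below splits one workflow across deterministic rules, a trained model, and a generative step. Every function name is a hypothetical placeholder; the point is only that the LLM is confined to the part of the workflow it is actually suited for.

```python
# Sketch of one workflow split across deterministic code, ML, and an LLM.
# All functions are hypothetical placeholders, not a specific product's API.

def validate_order(order: dict) -> bool:
    """Deterministic step: business rules that never need a model."""
    return order.get("amount", 0) > 0 and order.get("currency") in {"USD", "EUR"}

def score_fraud_risk(order: dict) -> float:
    """ML step: a trained model would run here; a constant stands in for the sketch."""
    return 0.12

def draft_customer_reply(order: dict, decision: str) -> str:
    """LLM step: only the open-ended language task is delegated to a generative model."""
    return f"Hi! Your order {order['id']} was {decision}. (placeholder for an LLM call)"

def process_order(order: dict) -> str:
    if not validate_order(order):         # deterministic gate
        return draft_customer_reply(order, "rejected")
    if score_fraud_risk(order) > 0.8:     # ML gate
        return draft_customer_reply(order, "held for review")
    return draft_customer_reply(order, "approved")  # LLM only writes the message

print(process_order({"id": "A-7", "amount": 40, "currency": "USD"}))
```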

Why Evals Are Misunderstood and What to Do Instead

The "eval" debate has become semantic diffusion—everyone uses the term differently. Data labeling companies call expert annotations "evals." PMs writing acceptance criteria call that "evals." Model benchmark comparisons get called "evals." A client told Ash "we do evals" and meant they checked LM Arena rankings.

Neither evals nor production monitoring alone is sufficient. Evals are your trusted product knowledge encoded in test datasets—things your agent absolutely shouldn't do wrong. Production monitoring catches implicit signals: users regenerating answers (indicating dissatisfaction), thumbs down, or switching off features entirely. Evals catch known failure modes; production monitoring catches emerging patterns you couldn't predict.
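
A minimal sketch of how the two fit together, assuming hypothetical eval cases and event fields: the eval set encodes the failures the product can never afford, while the monitoring check watches for the implicit dissatisfaction signals described above.

```python
# Illustrative eval set plus production signals. Cases, strings, and event fields are hypothetical.

EVAL_CASES = [
    {"input": "What is your refund window?", "must_contain": "30 days"},
    {"input": "Cancel my subscription", "must_not_contain": "lifetime refund guarantee"},
]

def run_evals(generate) -> list:
    """Run the agent against known failure modes before every release."""
    failures = []
    for case in EVAL_CASES:
        output = generate(case["input"])
        if "must_contain" in case and case["must_contain"] not in output:
            failures.append(case)
        if "must_not_contain" in case and case["must_not_contain"] in output:
            failures.append(case)
    return failures

def flag_for_review(event: dict) -> bool:
    """Production monitoring: implicit dissatisfaction signals worth examining."""
    return event.get("regenerated", False) or event.get("thumbs_down", False)

print(run_evals(lambda q: "Our refund window is 30 days."))  # [] -> known failure modes pass
print(flag_for_review({"regenerated": True}))                # True -> trace worth examining
```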

The process is: deploy, monitor, analyze, iterate. You can't predict every failure mode upfront. Production monitoring alerts you to traces worth examining. Error analysis reveals patterns. Only then do you decide: is this a one-off fix, or a systemic issue requiring new evaluation criteria? Building too many evals too early creates maintenance burden without catching real problems.
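
Continuing that loop, here is a sketch of the triage step, with an illustrative threshold: flagged traces are grouped by failure pattern, and only patterns that keep recurring get promoted into new eval cases, which keeps the eval suite small enough to maintain.

```python
# Sketch of deploy -> monitor -> analyze -> iterate triage. The threshold is illustrative.
from collections import Counter

def triage(flagged_traces: list, promote_threshold: int = 5):
    """Separate one-off failures from systemic issues that deserve a new eval."""
    counts = Counter(trace["failure_pattern"] for trace in flagged_traces)
    systemic = [pattern for pattern, n in counts.items() if n >= promote_threshold]
    one_offs = [pattern for pattern, n in counts.items() if n < promote_threshold]
    return systemic, one_offs

traces = [{"failure_pattern": "hallucinated_policy"}] * 7 + [{"failure_pattern": "wrong_language"}]
systemic, one_offs = triage(traces)
print(systemic)  # ['hallucinated_policy'] -> write a new eval case
print(one_offs)  # ['wrong_language']      -> fix once, keep watching
```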

5 Takeaways for Building AI Products That Actually Work

  • Problem first, always - Starting small forces you to define the actual problem; solution complexity is a slippery slope
  • Pain is the new moat - Companies succeeding went through the pain of learning what works; there's no playbook or textbook yet
  • One-click agents are marketing - Anyone selling instant autonomous deployment is misleading you; enterprise data is messy and needs calibration
  • Multi-agent is misunderstood - Dividing responsibilities across peer agents without human orchestration is extremely hard to control
  • Coding agents remain underrated - Despite Twitter/Reddit chatter, penetration outside the Bay Area is still low; massive value creation ahead

What This Means for Organizations Deploying AI Agents

The core insight: AI product development isn't traditional software development with AI swapped in. Non-determinism and the agency-control trade-off mean you can't predict behavior, can't fully control outcomes, and must earn trust incrementally. The CCCD framework—starting with high control, gradually increasing agency as reliability proves out—prevents the catastrophic failures that force shutdowns and erode customer trust. Companies winning at AI aren't moving fastest; they're building flywheels that compound improvement over time.
