Building AI Agents? The Hard Part Isn’t the Agent
Why 95% of enterprise AI pilots fail, and what the successful ones do differently.
MIT research found that 95% of enterprise AI pilots fail to deliver measurable returns.
Not because the technology doesn’t work. Not because the models aren’t capable.
MIT calls it a “learning gap”: organizations don’t know how to integrate AI into real workflows.
From what I’ve seen, the gap is architectural. Teams build agents when they should be building systems.
I work with teams building agentic platforms.
The first enterprise customer conversation is usually humbling.
They don’t ask about your model. They don’t ask about your prompts or your reasoning capabilities.
They ask:
“What happens when the agent is wrong? How does it recover?”
“How will I know if something breaks at 2 AM?”
“What happens if a workflow fails halfway through?”
“When do humans get involved? How do you learn from that?”
“How do I audit what the agent did and why? Can one customer’s data touch another’s?”
“Where does inference happen? Does data stay in our region?”
Most teams have good answers to maybe three of these.
That gap is where projects stall. Or die.
An agentic system is not an LLM with tools.
It is a production system where probabilistic reasoning is wrapped by deterministic execution, visibility, and control.
The agent gets all the attention. But the system around it is what actually ships.
Reliability beats intelligence
The most successful deployments aren’t the ones with the cleverest agents. They’re the ones that work consistently.
An agent that gives you an 8/10 answer every time is more valuable than one that gives you 10/10 half the time and fails unpredictably the other half. Enterprise workflows don’t tolerate variance. A finance team running month-end close doesn’t want “usually works.” They want “always works, and when it doesn’t, we know immediately.”
The optimization target that matters is not “how good is the output” but “how predictable is the outcome.”
Evals aren’t optional for agentic systems. They’re foundational.
In traditional software, you write tests once and they pass or fail deterministically. Agentic systems don’t work that way. The same input can produce different outputs. Model updates change behavior silently. Prompts that worked last month drift without warning.
Run evals continuously in production, not just during development. Catch drift before users do.
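One way to picture continuous evals is a rolling pass rate compared against a baseline. The sketch below is a minimal illustration, not a recommendation for specific thresholds: the class name DriftMonitor, the 5-point tolerance, and the window size are all assumptions made for the example.

```python
from collections import deque
from typing import Optional


class DriftMonitor:
    """Tracks a rolling eval pass rate and flags drops below a baseline.

    Hypothetical helper: names and default thresholds are illustrative.
    """

    def __init__(self, baseline_pass_rate: float, tolerance: float = 0.05,
                 window: int = 100) -> None:
        self.baseline = baseline_pass_rate
        self.tolerance = tolerance
        self.window = window
        # Keeps only the most recent `window` eval results.
        self.results = deque(maxlen=window)

    def record(self, passed: bool) -> None:
        """Store the outcome of one eval run on production traffic."""
        self.results.append(passed)

    def pass_rate(self) -> Optional[float]:
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def drifted(self) -> bool:
        # Only judge once the window is full, to avoid noisy early alerts.
        if len(self.results) < self.window:
            return False
        return self.pass_rate() < self.baseline - self.tolerance
```

In practice record() would be fed by whatever scores your eval suite produces, and drifted() would trigger an alert rather than wait for a user complaint.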
And when you have multiple agents working together, evals become even more critical. Each agent might perform fine in isolation. But chain them together and failures compound. Agent A’s slightly off output becomes Agent B’s confidently wrong input. Without evals at every handoff, you’re debugging in the dark.
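A back-of-the-envelope sketch makes the compounding concrete. It assumes every agent in the chain succeeds independently with the same probability, which is a simplification, and the 95% figure is illustrative rather than measured.

```python
def chain_success_rate(per_agent_success: float, num_agents: int) -> float:
    """Probability that every agent in a sequential chain succeeds,
    assuming independent failures (a simplifying assumption)."""
    if not 0.0 <= per_agent_success <= 1.0:
        raise ValueError("per_agent_success must be between 0 and 1")
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    return per_agent_success ** num_agents


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        rate = chain_success_rate(0.95, n)
        print(f"{n} agents at 95% each: {rate:.1%} end to end")
```

Under these assumptions, five agents that are each right 95% of the time are right together only about 77% of the time, and ten agents about 60%.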
The teams that treat evals as infrastructure, not an afterthought, are the ones shipping reliably.
Observability is the moat
If you can’t see what your agent is doing in production, you can’t improve it. You’re guessing. You’re relying on user complaints to surface issues. By the time you hear about a problem, it’s been happening for weeks.
Tools like Langfuse and Arize give you this visibility. Inputs, outputs, latency, token usage, outcomes, all traceable end to end.
Most teams can’t answer basic questions about their production agents. What’s your p95 latency? What percentage of requests need a retry? Which tool calls fail most often?
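Those three questions are cheap to answer once traces are captured. The sketch below assumes a simplified, hypothetical Trace record; it is not the schema of Langfuse, Arize, or any other tool, and the percentile uses the nearest-rank method.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    """Hypothetical, simplified trace record for one agent request."""
    latency_ms: float
    retries: int
    failed_tool: Optional[str] = None  # name of the tool call that failed, if any


def p95(values: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(traces: List[Trace]) -> dict:
    if not traces:
        raise ValueError("need at least one trace")
    retried = sum(1 for t in traces if t.retries > 0)
    failures = Counter(t.failed_tool for t in traces if t.failed_tool)
    return {
        "p95_latency_ms": p95([t.latency_ms for t in traces]),
        "retry_rate": retried / len(traces),
        "top_failing_tools": failures.most_common(3),
    }
```

If producing this summary for last week's traffic takes more than a few minutes, that is usually a sign the traces are not being captured in a usable form.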
Observability isn’t a feature you add later. It’s infrastructure that determines how fast you can iterate. Teams with better visibility ship improvements faster. That compounds.
Orchestration is the unsexy superpower
Your agent’s reasoning can be probabilistic. Your execution layer cannot.
When an agent decides to call an API, update a record, or send a message, that action needs to happen reliably. If it fails, it needs to retry. If it retries, it can’t duplicate side effects. If the workflow crashes midway, it needs to resume from where it left off, not start over.
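The retry-without-duplication requirement usually comes down to idempotency keys. The sketch below is a toy illustration under stated assumptions: IdempotentExecutor and TransientError are hypothetical names, and the completed-keys store is an in-memory dict, where a real system would need durable storage and would ideally pass the key through to the downstream API as well.

```python
import time
from typing import Any, Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


class IdempotentExecutor:
    """Retries an action on transient failure and never re-runs a key
    that has already completed."""

    def __init__(self, max_attempts: int = 3, base_delay_s: float = 0.1) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.max_attempts = max_attempts
        self.base_delay_s = base_delay_s
        # In-memory for illustration only; this does not survive a crash.
        self._completed: Dict[str, Any] = {}

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        # A completed key returns the stored result instead of repeating
        # the side effect.
        if key in self._completed:
            return self._completed[key]
        for attempt in range(1, self.max_attempts + 1):
            try:
                result = action()
            except TransientError:
                if attempt == self.max_attempts:
                    raise
                # Exponential backoff between attempts.
                time.sleep(self.base_delay_s * 2 ** (attempt - 1))
            else:
                self._completed[key] = result
                return result
        raise AssertionError("unreachable")
```

Note the gap this toy version leaves open: if the process dies after the action succeeds but before the result is recorded, the action runs again on restart. Closing that gap is the job of a durable execution layer.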
Tools like Temporal and Restate give you durable execution, automatic retries, and state persistence. Paired with idempotent actions, that gets you effectively exactly-once behavior.
Simple try-catch logic works for demos. It falls apart in production with multi-step workflows, external dependencies, and edge cases you didn’t anticipate.
Orchestration is what turns an agent into a system. Without durable orchestration, you have a demo.
HITL is your learning loop
The common assumption is that human-in-the-loop (HITL) is a transitional state. You start with humans in the loop, then gradually remove them as the agent gets better. Full autonomy is the goal.
That’s not what I see working.
The goal isn’t fewer humans. The goal is faster correct outcomes. Sometimes that means more autonomy. Sometimes it means a human reviewing something for 10 seconds before the agent proceeds.
But here’s what most teams miss: HITL isn’t just about catching errors. It’s your learning loop.
Every human correction is training data. Every escalation shows where your agent is weak. Every approval pattern tells you where you can safely increase autonomy.
The teams that treat HITL as a feedback system improve faster than everyone else. They’re not trying to remove humans from the loop. They’re using humans to make the system smarter.
Design for this from day one. Capture why humans intervene, not just that they did. Track which corrections repeat. Feed that back into your evals, your prompts, your guardrails.
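Capturing the why can be as simple as a structured reason on every intervention. The sketch below is one possible shape, not a prescribed schema: the Intervention fields, the action labels, and the reason codes are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Intervention:
    """Hypothetical record of one human touchpoint in a workflow."""
    workflow: str
    step: str
    action: str  # "approved", "corrected", or "escalated"
    reason: str  # short structured reason code, e.g. "wrong_currency"


def repeated_corrections(
    log: List[Intervention], min_count: int = 2
) -> List[Tuple[Tuple[str, str, str], int]]:
    """Corrections that keep recurring: candidates for new evals or guardrails."""
    counts = Counter(
        (i.workflow, i.step, i.reason) for i in log if i.action == "corrected"
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


def approval_rate(log: List[Intervention], workflow: str, step: str) -> Optional[float]:
    """Share of reviews approved unchanged: a signal for where autonomy can grow."""
    relevant = [i for i in log if i.workflow == workflow and i.step == step]
    if not relevant:
        return None
    return sum(i.action == "approved" for i in relevant) / len(relevant)
```

A step with a consistently high approval rate is a candidate for more autonomy; a reason code that keeps reappearing is a candidate for a new eval case.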
The system should get smarter because humans are in the loop, not despite it.
Security and governance are architecture
Enterprise customers ask about security on day one. The temptation is to add it later. That’s a mistake.
Prompt injection defense. Tenant isolation. RBAC not just for users, but for agents acting on behalf of users. Audit trails for every autonomous action. Data classification that handles derived data, not just raw inputs.
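Three of these controls (tenant isolation, agent-scoped RBAC, and audit trails) can be sketched in a single authorization check. Everything named here is hypothetical: Principal, AuditLog, and authorize are illustrative, and the rule that an agent gets only the intersection of its own permissions and the user's is one reasonable design assumption, not the only one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import FrozenSet, List


@dataclass(frozen=True)
class Principal:
    """An agent acting on behalf of a user within one tenant (hypothetical)."""
    user_id: str
    tenant_id: str
    user_permissions: FrozenSet[str]
    agent_permissions: FrozenSet[str]


@dataclass
class AuditLog:
    entries: List[dict] = field(default_factory=list)

    def record(self, **event) -> None:
        event["at"] = datetime.now(timezone.utc).isoformat()
        self.entries.append(event)


def authorize(principal: Principal, permission: str,
              resource_tenant_id: str, audit: AuditLog) -> bool:
    # Tenant isolation: never touch another tenant's resources.
    same_tenant = principal.tenant_id == resource_tenant_id
    # The agent may act only where BOTH the user and the agent are allowed.
    effective = principal.user_permissions & principal.agent_permissions
    allowed = same_tenant and permission in effective
    # Every decision is recorded, including denials.
    audit.record(user_id=principal.user_id, tenant_id=principal.tenant_id,
                 resource_tenant_id=resource_tenant_id,
                 permission=permission, allowed=allowed)
    return allowed
```

Because the check and the audit record live in one place, "what did the agent do and why was it allowed" has a single source of truth.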
You can’t retrofit this cleanly. Security boundaries are system-level concerns. Design them in from the start and they’re natural. Add them later and they’re a patchwork.
And before you even get to the technical conversation, there’s the compliance gate. SOC 2, ISO 27001, and depending on the industry, HIPAA or PCI-DSS. For agentic systems, this is harder than traditional software: you’re not just auditing what humans did, you’re auditing what agents decided to do autonomously. If you don’t have a clear answer for how you’ll pass an audit, enterprise deals stall before they start.
Every security question is really an architecture question. “How do you prevent X” is really “how is your system designed so that X can’t happen.”
Data locality kills deals
This one catches teams off guard. You build the system, it works, you’re ready to close a deal. Then the customer asks “where does inference happen?” and the project stalls.
Data residency and model residency are different problems. You might store customer data in the right region, but if inference happens via an external API that crosses geographic boundaries, you still have a compliance issue.
Use infrastructure-as-code for multi-region deployment. Your entire stack should be deployable to any supported region with configuration changes, not re-architecture.
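The difference between data residency and model residency is easy to encode as a pre-deployment check. The sketch below assumes a hypothetical Deployment config with one required region per tenant; the field names and region strings are illustrative and not tied to any cloud provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Deployment:
    """Hypothetical per-tenant deployment configuration."""
    tenant: str
    required_region: str    # where the customer's contract says data must stay
    data_store_region: str  # where customer data is stored
    inference_region: str   # where model inference actually runs


def residency_violations(d: Deployment) -> List[str]:
    """Checks data residency and model residency as separate constraints."""
    violations = []
    if d.data_store_region != d.required_region:
        violations.append(
            f"{d.tenant}: data stored in {d.data_store_region}, "
            f"required {d.required_region}"
        )
    if d.inference_region != d.required_region:
        violations.append(
            f"{d.tenant}: inference runs in {d.inference_region}, "
            f"required {d.required_region}"
        )
    return violations
```

A deployment that stores data in the right region but calls an out-of-region inference endpoint fails the second check even though it passes the first, which is exactly the case that catches teams off guard.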
Some customers need bring-your-own-cloud (BYOC) deployment for compliance or control. It adds complexity for everyone. But for customers with strict requirements, it’s the only path forward.
Compliance and data locality aren’t features you add at the end. They’re constraints that shape your architecture from the start.
The shift
95% of AI pilots fail not because the technology doesn’t work. They fail because teams build agents when they should be building systems.
The agent is a component. The system includes orchestration, observability, human escalation, security, reliability, and everything that makes it work in the real world.
Stop thinking “I’m building an agent.” Start thinking “I’m building a system that includes an agent.”
The agent demos. The system ships.
I advise teams building agentic systems for enterprise. If you’re navigating this, let’s talk.
