sohit’s Newsletter

Domain Understanding Is the Moat

sohit kumar — Thu, 28 May 2026 17:58:15 GMT

We have been hearing a lot about what the moat is. Some say the harness is the moat. Some say the model is the moat. It keeps changing every day.

What I believe is - the moat is domain understanding.

Domain understanding means knowing the customer’s workflows, constraints, edge cases, goals, failure modes, and what “good” actually means in their world.

You encode domain understanding into software.

Earlier, you encoded domain knowledge into workflows, databases, and CRUD APIs packaged as SaaS. The knowledge you were able to capture was limited because you could only read/write data and represent it through workflows. That is why we needed humans to work on top of SaaS to provide additional domain understanding which we could not encode.

Today, that domain understanding is being encoded in evals, prompts, and harnesses.

When folks say the harness is the moat - they mean choices like this: during compaction, when you hit the context window limit, the agent still needs past summaries to reach the outcome, so you use summarisation instead of a sliding window. This is nothing but domain understanding encoded in the harness. Claude Code’s harness understands the coding domain really well.
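To make the compaction choice concrete, here is a minimal sketch of the two strategies. The function names and the `naive_summarize` stand-in are my own assumptions for illustration; a real harness would call a model to produce the summary, and this is not how any particular product implements it.

```python
from typing import Callable, List


def sliding_window(messages: List[str], keep_last: int) -> List[str]:
    """Keep only the most recent messages; everything older is dropped."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    return messages[-keep_last:]


def compact_with_summary(
    messages: List[str],
    keep_last: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace older messages with a single summary, then keep the recent tail."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    if len(messages) <= keep_last:
        return list(messages)
    older, recent = messages[:-keep_last], messages[-keep_last:]
    return [summarize(older)] + recent


def naive_summarize(older: List[str]) -> str:
    # Hypothetical stand-in for an LLM summarisation call.
    return "SUMMARY: " + " | ".join(older)
```

With the sliding window, the original goal and earlier decisions simply fall out of context. With summarisation, they survive in compressed form, which is the domain-specific bet the harness is making.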

When folks say the model is the moat - they mean encoding domain understanding into model weights because we need to deliver value quickly and latency is important.

And maybe tomorrow we will have specialized chips that help optimize and solve problems in a certain domain.

But you get the idea. It is all about understanding the domain to deliver value to the customer. The shape and form can keep changing.

The more you understand the domain, the more value you can provide to the customer. This is business 101 - provide value to the customer.

This is why you want a feedback loop: iterate fast and collect feedback from customers and traces, so you learn more about the domain you operate in, encode that into your software, and offload as much work from customers as possible. The feedback loop compounds your moat.

And when you have constraints, you pick the best place to encode that domain understanding.

If training a model does not make business sense because of Capex, you encode it elsewhere: in the harness, prompts, skills, evals, memory, and context layer.

The form changes. The goal stays the same - capture domain understanding and deliver value to the customer.

The moat is how much domain understanding you can capture, encode, and compound inside the product.

Coding Agents Need Software Factories

sohit kumar — Wed, 20 May 2026 15:55:04 GMT

Most engineering teams are making the same mistake with coding agents - confusing faster code generation with faster software delivery.

The bottleneck was never just writing code. It was turning code into software the organization can trust and deploy.

A real software change often touches frontend, APIs, database migrations, infra, rollout flags, integration tests, monitoring, and deployment plans. It has to fit the architecture, respect service contracts, work across repos, pass the right tests, survive rollout, and be trusted in production.

This is where most coding agent workflows still break.

The agent writes the code, but the developer still carries the engineering system context - architecture, repo ownership, service contracts, verification paths, rollout risks, and memory of what failed last time.

I have seen agents open 10,000-line PRs and changes with multiple consumers. At that point, no one can confidently say whether the change works, what it breaks, or whether it still follows the architecture. PRs start queuing up. Reviews take days or weeks. Verification becomes the bottleneck.

The real gap is between generating code and delivering software the organization can trust.

This is why coding agents need software factories.

A Software Factory is a control plane around coding agents. It preserves context, coordinates work, verifies outcomes, learns from every run, and improves itself.

A Software Factory needs three things around coding agents:

A brain that holds the engineering context - repos, architecture, contracts, infra, ownership, decisions, failures, and verification paths.
A sandbox where the factory can bring up the system, run checks, validate integrations, and prove the change works.
A learning loop that turns failed runs and repeated manual work into better workflows, skills, and agents.

Without this, agents produce changes that look right locally but fail at integration points. Developers still carry the context, risk, and memory. Every session starts from zero.

Example: Cross-Repo Feature Development

Imagine a feature touches three repositories owned by three teams.

The code changes are easy.

The problem is knowing what changes first, what depends on what, which contract can break, which tests need to run, and what has to be deployed in what order.

Today, a developer opens Codex or Claude in each repo and asks for the local change.

But the real coordination still happens in the developer’s head.

This API is changing here.
This frontend needs to consume it.
This worker needs to emit a new event.
The current infra setup needs to be considered.
This test needs to run after all three changes land.
This rollout should happen after the backend is deployed.
The agents are writing code. The developer is still doing the delivery work - coordinating branches, tests, PRs, dependencies, and hidden integration risks.
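The ordering problem in this example ("what changes first, what depends on what") is a dependency graph, and a topological sort gives one valid order. This is a small sketch using Python's standard `graphlib`; the repo names and the dependency edges are hypothetical, not taken from a real system.

```python
from graphlib import TopologicalSorter

# Hypothetical repos. Each key maps to the set of things that must land before it.
depends_on = {
    "backend-api": set(),
    "event-worker": {"backend-api"},
    "frontend": {"backend-api"},
    "integration-tests": {"backend-api", "event-worker", "frontend"},
}


def rollout_order(graph: dict) -> list:
    """Return one order in which every dependency lands before its dependents.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(graph).static_order())


order = rollout_order(depends_on)
```

The sort itself is the easy part. The hard part, and what the developer carries in their head today, is knowing the edges in the first place.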

In a Software Factory model, you ask once:

Build this feature across the affected repos.

The factory runs a simple loop:

Spec → Build → Verify → Learn → Evolve

Spec Agent brainstorms and defines the outcome.
Build Agents brainstorm, turn the spec into coordinated changes across repos, contracts, data flows, and rollout paths, and implement them.
Verification Agent proves whether the change actually works.
Learning Agent captures what was missing, what failed, and what needs to improve next time.
Evolve Agent updates the factory so the next run is better.

A human coordinates and brainstorms with these agents to ship the feature.

I ran this on a real change spanning four services, including frontend. The architect agent identified the repositories that needed to change and came up with a high-level design. It delegated the work to lead agents, who broke the request into steps, designed the implementation, and made the changes.

Then a verifier agent spun up the services, tested them, validated the frontend-backend integration, captured screenshots of the working flow using Playwright MCP, and produced a PR summary reviewers could trust.

The system did not just generate code. It coordinated the change, verified the outcome, and produced evidence for review.

Instead of asking an agent to make a code change, you ask the factory to deliver a change.

The Factory Is Not Just For Code Changes

Once the factory has the engineering brain, it is not limited to feature work.

It can diagnose production, cost, reliability, security, and compliance issues because all of them are connected to code, infra, data flows, ownership, policies, and deployment history.

For example, I asked the factory for the top three cost drivers from last month and the reason behind each spike.

The main agent connected to AWS, pulled last month’s cost data, identified the top three services driving spend, checked metrics and infra configuration, correlated the spike with code changes and deployment history, and was able to pinpoint the repo and module where the increase came from.

A normal coding agent cannot do that by only reading the codebase.

A shallow answer would be:

Your database costs are high.

A useful answer is:

Cost increased after this deployment by 28%.
The spike came from this worker because a filter moved from the database layer to application code.
The affected repos are backend-api and reporting-worker.
Verification should include query plan comparison, worker runtime tests, and a cost estimate after replaying the workload.

The useful part is that it can trace the issue back properly - which deployment caused the spike, which service changed, which repo owns it, and what needs to be tested before the fix goes out.

The same factory that helps ship features can also help debug cost, reliability, security, and compliance issues.

The foundation is the same - connected context, verification, and learning loop.

The Brain Is The Engineering Context Layer

The brain gives the factory its engineering memory.

It cannot be a folder of docs or one long prompt. It needs to behave like a knowledge graph: repos, services, contracts, workflows, infra, tests, owners, decisions, failures, and skills connected to each other.

This is important because engineering knowledge is not flat.

A code change depends on architecture.
Architecture depends on infra.
Infra affects cost, reliability, and security.
Verification depends on service contracts.
Future work depends on decisions made today.

The knowledge graph helps the factory understand these connections: which repo owns an API, which service consumes an event, which migration affects a workflow, which test proves a contract still works, which past failure needs to be considered, and which agent or skill fits the work.
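As a rough sketch of what "connected" means here, the graph can be as simple as typed edges between repos, contracts, consumers, and tests. The class, relation names, and example entities below are all hypothetical; a real brain would have many more node types and a proper store behind it.

```python
from collections import defaultdict


class EngineeringGraph:
    """Tiny in-memory graph: (subject, relation) -> set of objects."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._edges[(subject, relation)].add(obj)

    def query(self, subject: str, relation: str) -> list:
        return sorted(self._edges.get((subject, relation), set()))

    def impact_of(self, repo: str) -> dict:
        """For each contract the repo owns, list its consumers and the tests that prove it."""
        return {
            contract: {
                "consumers": self.query(contract, "consumed_by"),
                "tests": self.query(contract, "verified_by"),
            }
            for contract in self.query(repo, "owns")
        }


graph = EngineeringGraph()
graph.add("backend-api", "owns", "GET /invoices")
graph.add("GET /invoices", "consumed_by", "frontend")
graph.add("GET /invoices", "consumed_by", "reporting-worker")
graph.add("GET /invoices", "verified_by", "tests/contract/test_invoices.py")
```

Asking `graph.impact_of("backend-api")` before a change answers the questions a flat folder of docs cannot: who consumes this, and which test proves it still works.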

If agents do not have access to that connected system context, they will keep producing changes that pass locally and fail in production.

Once the factory knows the affected repos, contracts, workflows, and verification paths, it decides which agents are involved, what each agent does, which repos need changes, what order the work happens in, and how the final change comes together.

Every meaningful change gets checked against the architecture, contracts, ownership boundaries, operational patterns, and previous decisions.

For simple work, the factory keeps the path lightweight - one agent, one repo, one clear verification step.

For larger work, it coordinates multiple agents across frontend, backend, infra, tests, and documentation.

Developers should not be the coordination layer. They should not have to remember which repo changed, which branch depends on what, which test proves the contract, or which dependency can break the release.

The factory coordinates the run.

Developers should guide the system when judgment is needed, not manually coordinate every repo, branch, test, and rollout step.

The Sandbox Verifies The Work

The sandbox is where the factory verifies work.

The system brings up all the services, runs tests, checks contracts, validates integration paths, and proves whether the change works.

This does not mean every change goes through a heavy process.

A good factory has two lanes.

Fast lane handles small fixes: one repo, low risk, obvious verification.

Full lane handles work with real coordination risk - cross-repo features, API changes, infra changes, data migrations, cost work, and anything that affects rollout confidence.

The factory chooses the lane, runs the right checks, and produces evidence.

The developer reviews the evidence and makes the final release call.
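The lane choice can be sketched as a small routing function. The fields on `Change` and the exact criteria are my assumptions, chosen to mirror the fast-lane and full-lane descriptions; a real factory would likely derive them from the engineering brain rather than from flags.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    repos: List[str]
    touches_api_contract: bool = False
    touches_infra: bool = False
    has_migration: bool = False


def choose_lane(change: Change) -> str:
    """Route to 'full' when there is real coordination risk, else 'fast'."""
    risky = (
        change.touches_api_contract
        or change.touches_infra
        or change.has_migration
    )
    if len(change.repos) > 1 or risky:
        return "full"
    return "fast"
```

A typo fix in one repo goes through the fast lane. Anything cross-repo, or anything touching a contract, infra, or a migration, gets the full verification path.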

How The Factory Improves

Each run should make the next one easier.

After every run, the factory asks:

What context was missing?
Which workflow was too manual?
Which verification step failed too late?
Which repeated step should become a skill?
Which area needs a specialized agent?
Which decision needs to be recorded for future runs?

After each run, feedback is incorporated back into the system.

If a cross-repo change fails because one service emits created_at but another expects createdAt, the factory does not just fix the bug. It records the contract mismatch, updates the verification workflow, and checks that boundary before future PRs.
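A boundary check for this kind of mismatch can be very small. The sketch below compares the fields a producer emits with the fields a consumer expects, and flags pairs that differ only by naming convention. The function name and the return shape are assumptions for illustration.

```python
def contract_mismatches(emitted: dict, expected_fields: set) -> dict:
    """Compare an emitted payload with the field names a consumer expects."""
    emitted_fields = set(emitted)
    missing = expected_fields - emitted_fields
    unexpected = emitted_fields - expected_fields

    def normalize(name: str) -> str:
        # Ignore case and underscores so created_at and createdAt compare equal.
        return name.replace("_", "").lower()

    likely_renames = sorted(
        (m, u) for m in missing for u in unexpected if normalize(m) == normalize(u)
    )
    return {
        "missing": sorted(missing),
        "unexpected": sorted(unexpected),
        "likely_renames": likely_renames,
    }


report = contract_mismatches(
    {"id": 1, "created_at": "2026-05-20T15:55:04Z"},
    {"id", "createdAt"},
)
```

Here `report["likely_renames"]` contains the pair `("createdAt", "created_at")`, which is the kind of finding the factory would record and re-check at that boundary before future PRs.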

If local setup fails because seed data is missing, the factory should update the setup skill so future runs start from a working sandbox.

For example, if the authentication flow requires a dummy user and password, the factory should seed those credentials during setup and update the local setup instructions. The next time an agent runs the workflow, it should not have to rediscover the same missing dependency again.

If frontend-heavy work repeatedly needs the same review pattern, the factory turns that pattern into a reusable frontend verification skill, or proposes and creates a specialized frontend lead agent.

For example, if an agent needs a skill that does not exist, such as python-best-practices, it can propose it, create it, and use it before writing code.

The factory is version-controlled so that skills, workflows, verification rules, setup instructions, architecture decisions, and lessons from failed runs are all reviewable and teams can collaborate.

Developers and agents work together and improve the factory together, which compounds the system.

The Shift

The unit of work changes from “make this edit” to “ship this outcome safely.”

You no longer ask an agent to make the code change. You ask the factory to take an outcome and carry it through spec, build, verification, and learning.

Claude, Codex, and Cursor are not the factory.

They are the workers inside it.

The factory is the control plane around them: one shared engineering brain for every kind of engineering work, where agents coordinate, the sandbox proves changes, and the system improves with every run.

A prompt gives you output.

A factory gives you a change with tests, screenshots, affected repos, rollout order, and a trail reviewers can trust.

Build your own factory using prinevo.ai.

Why DDD Will Come Back for AI Agents?

sohit kumar — Tue, 12 May 2026 20:40:23 GMT

DDD will come back because agents need a model of the domain they are operating in.

What are we doing when we create different skills for different processes and give the agent access to them? We are passing business context - the context for the domain your business operates in.

Payment should happen in 30 days, else charge 10% extra - this is a business invariant you need to pass to the agent, as a skill, prompt, static logic, or whatever the case may be.
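If you choose to encode that invariant as static logic, a minimal sketch could look like this. Two details are my assumptions, since the rule as stated does not settle them: payment on day 30 exactly counts as on time, and the 10% is applied to the full invoice amount.

```python
from datetime import date
from decimal import Decimal

PAYMENT_TERM_DAYS = 30
LATE_FEE_RATE = Decimal("0.10")


def amount_due(invoice_amount: Decimal, issued_on: date, paid_on: date) -> Decimal:
    """Apply the late fee when payment lands after the 30-day term."""
    days_elapsed = (paid_on - issued_on).days
    if days_elapsed > PAYMENT_TERM_DAYS:  # assumption: day 30 is still on time
        return invoice_amount * (1 + LATE_FEE_RATE)
    return invoice_amount
```

The point is less the arithmetic and more the ambiguity it exposes. Writing the invariant down forces the edge cases (day 30, partial payments, which amount the fee applies to) to be decided by the business rather than guessed by the agent.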

When people talk about context, the deeper point is not just “give the agent more context.”

The real point is: the agent needs a structured model of the business.

Not just files, docs, embeddings, or tool descriptions. It needs to understand how the domain is shaped.

For example:

What is a customer?
What is a user?
Are they the same thing?
When are they different?
What is an account?
What state can an invoice move through?
Which actions are allowed before approval?
Which events matter?
Which policy applies in which context?

That is why DDD will come back.

DDD gave us language for:

ubiquitous language
entities
value objects
aggregates
bounded contexts
domain events
invariants

Because when an agent operates inside a business, it cannot rely only on code syntax or generic reasoning.

It has to know the meaning of the system.

If the agent collapses these concepts incorrectly, it will make bad decisions.

This is where context graphs (the domain model represented as a graph) and DDD meet.

A context graph is not just a retrieval graph.

It becomes a living domain model:

Customer
  -> belongs to Account
  -> has Users
  -> has Contracts
  -> receives Invoices
  -> opens Support Tickets

But the graph also needs boundaries:

Billing.Customer != CRM.Customer
Auth.User != Workspace.Member
Finance.Account != Product.Account

That is bounded context.

Without bounded context, the agent will over-generalize.

It will see the same word in two places and assume it means the same thing.

Or it will see two different words and miss that they refer to the same business concept.
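One way to make both failure modes explicit is to qualify every term with its bounded context and to treat equivalence as something you declare, never something inferred from the name. This is a sketch under my own assumptions: the `Term` and `DomainGlossary` names are hypothetical, and so is the `Support.Requester` / `CRM.Contact` pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    context: str
    name: str


class DomainGlossary:
    """Terms are equal only within a context, unless explicitly declared the same."""

    def __init__(self):
        self._same_as = set()

    def declare_same(self, a: Term, b: Term) -> None:
        self._same_as.add(frozenset((a, b)))

    def same_concept(self, a: Term, b: Term) -> bool:
        return a == b or frozenset((a, b)) in self._same_as


glossary = DomainGlossary()
billing_customer = Term("Billing", "Customer")
crm_customer = Term("CRM", "Customer")
support_requester = Term("Support", "Requester")  # hypothetical
crm_contact = Term("CRM", "Contact")              # hypothetical
glossary.declare_same(support_requester, crm_contact)
```

With this shape, `Billing.Customer` and `CRM.Customer` are different concepts by default, while two differently named terms can be linked once someone confirms they mean the same thing.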

So the real job is not just “stuff more context into the model.”

The job is to help the agent build and evolve a correct domain model.

And this model will not be perfect on day one.

You start with a baseline:

existing code
docs
database schemas
API contracts
workflows
product language
support docs
analytics events
tribal knowledge

If you are building a one-off agent, you may not see the need. But if you are automating a business process or making a function autonomous, then you need to understand the shape of the domain and design your multi-agent system around these bounded contexts.

Once the domain model exists, evals also need to operate against that model.

Right now, evals are more focused on prompts, tool calling, and whether the model gave the correct output. That is true to a certain extent.

But what we really want to validate is:

Did the agent complete the business process as expected?

Eval traces should also contain domain events: the events agents emit when they complete certain tasks. This event-sourced record becomes evidence of how the agent understood and acted inside the domain.

A trace should not only say:

Tool called: update_invoice

It should say:

{
  "actor": "agent",
  "action": "approve",
  "entity": {
    "type": "invoice",
    "id": "INV-123",
    "attributes": {
      "bounded_context": "finance",
      "state": "validated",
      "amount": "1200"
    }
  },
  "outcome": {
    "result": "blocked",
    "reason": "approval_required"
  }
}

Then every run teaches the system more:

this term was ambiguous
this workflow had a hidden gate
this entity was misclassified
this repo owns this concept
this policy applies only in this context
this event means something different in another service

That learning should update the domain model.

Context graph -> domain model -> domain events -> eval traces -> learning loop

The context helps the agent reason.

The domain model gives meaning to the context.

The trace records how the agent acted against that model.

The eval checks whether the action was valid.

The learning loop updates the model when the trace reveals ambiguity or drift.
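To show what "the eval checks whether the action was valid" could look like, here is a sketch that evaluates the invoice event from the trace example against a small state model. The `ALLOWED_ACTIONS` table, the finding labels, and the `"success"` result value are assumptions I made up for illustration; only the event shape comes from the example above.

```python
# Hypothetical finance rules: which actions are valid from which invoice state.
ALLOWED_ACTIONS = {
    ("finance", "invoice", "draft"): {"validate"},
    ("finance", "invoice", "validated"): {"request_approval"},
    ("finance", "invoice", "approval_granted"): {"approve"},
}


def evaluate_event(event: dict) -> dict:
    """Compare what the agent did with what the domain model allows."""
    entity = event["entity"]
    attrs = entity["attributes"]
    key = (attrs["bounded_context"], entity["type"], attrs["state"])
    allowed = event["action"] in ALLOWED_ACTIONS.get(key, set())
    blocked = event["outcome"]["result"] == "blocked"

    if allowed and not blocked:
        finding = "valid_action"
    elif not allowed and blocked:
        finding = "guardrail_held"
    elif not allowed and not blocked:
        finding = "invariant_violated"
    else:
        finding = "valid_action_blocked"
    passed = finding in {"valid_action", "guardrail_held"}
    return {"passed": passed, "finding": finding}


trace_event = {
    "actor": "agent",
    "action": "approve",
    "entity": {
        "type": "invoice",
        "id": "INV-123",
        "attributes": {
            "bounded_context": "finance",
            "state": "validated",
            "amount": "1200",
        },
    },
    "outcome": {"result": "blocked", "reason": "approval_required"},
}

result = evaluate_event(trace_event)
```

For the example event, the agent tried to approve a validated invoice and was blocked, so the eval passes with `guardrail_held`. Had the same action succeeded, the eval would report `invariant_violated`, which is a much more useful signal than "tool called: update_invoice".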

DDD will come back as the practical operating system for AI agents inside real businesses.

Because the bottleneck will not be whether the model can write code.

The bottleneck will be whether the system understands business concepts well enough to act safely.

In the old world, DDD helped humans align around software.

In the agentic world, DDD helps humans and agents align around the business.

Building a Software Factory: From Prompts to Compounding Systems

sohit kumar — Sun, 26 Apr 2026 12:28:18 GMT

As coding agents get better, the bottleneck shifts. Code generation is commoditizing. Planning what to build, defining guardrails and context for how to build it, and verifying the change become the work.

Here’s a metric most teams aren’t tracking: changes to your agent setup (CLAUDE.md, skills, sub-agents) vs changes to your codebase. Does your agent need less instruction over time? That’s the only signal proving your setup is compounding.
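One rough way to track this is the share of changed files that belong to the agent setup, per week. The sketch below only computes that ratio. Which paths count as "setup" is my assumption (anything under `.claude/` plus any file named `CLAUDE.md`), and the file lists would come from your own version control history.

```python
SETUP_PREFIXES = (".claude/",)   # assumption: skills, rules, sub-agents live here
SETUP_FILENAMES = ("CLAUDE.md",)


def is_setup_file(path: str) -> bool:
    filename = path.rsplit("/", 1)[-1]
    return path.startswith(SETUP_PREFIXES) or filename in SETUP_FILENAMES


def setup_change_share(changed_files_by_week: list) -> list:
    """For each week, return the fraction of changed files that are agent setup."""
    shares = []
    for files in changed_files_by_week:
        if not files:
            shares.append(0.0)
            continue
        setup_count = sum(1 for path in files if is_setup_file(path))
        shares.append(setup_count / len(files))
    return shares


weeks = [
    ["CLAUDE.md", ".claude/skills/eval-design/SKILL.md", "src/a.py", "src/b.py"],
    ["src/a.py", "src/b.py", "src/c.py", "services/api/CLAUDE.md"],
]
shares = setup_change_share(weeks)
```

The number on its own proves nothing. Read it next to how often engineers have to repeat the same instruction in prompts, which is the thing you actually want to see going down.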

Most teams aren’t measuring this. Engineers are prompting on top of the codebase, shipping fast, feeling productive. But the system isn’t getting smarter. Everyone is prompting the same things over and over.

That’s individual speedup with zero org-level compounding. Your team may feel faster, but the system itself isn’t improving.

Compounding starts when repeated prompts become shared system behavior. Before the two shifts that make it happen, a frame worth naming.

A compounding setup has two layers.

The brain holds context, policy, and decisions. CLAUDE.md, skills, sub-agents, architecture rules, security posture. This is where org knowledge lives.

The sandbox is where the agent executes and verifies. Local services, seed data, integration tests, the ability to bring the system up and watch it break.

Most teams have neither. They have a chat window and a codebase. That’s why prompting feels productive but never compounds. There’s nowhere for the learning to land, and nowhere for the agent to test what it built.

The two shifts below are how you build each layer.

1. Invest in the verification layer (this is the sandbox)

How fast can you verify what the agent shipped? Unit tests, integration tests, e2e automation, the ability to bring services up locally with seed data.

This mattered when humans wrote code. It’s 10x more important now.

Last week an agent shipped a multi-repo change where one service emitted created_at and the consumer expected createdAt. Code looked clean in both repos. Tests passed in isolation. An integration test caught it in the sandbox before it hit higher environments.

Most failures aren’t generation failures. They’re verification failures at integration boundaries.

A real feedback loop moves the human up the stack. The agent completes the task, brings services up locally, tests, iterates when it fails, and keeps going until acceptance criteria are met.

2. Turn prompts into policy (this is the brain)

When an agent misbehaves, don’t just re-prompt and move on. Pass the feedback into the system. Update CLAUDE.md, agent.md, sub-agent skills.

Re-prompting is individual work. Updating the context and guardrail layer is org work.

The team’s job becomes managing how the agent builds: security posture, architecture patterns, migration strategy. Every mistake the agent makes is input for the guardrail layer.

Once both layers are in place, you can build a compounding software factory above the codebase: an agent with context across repos and infra, taking a request, making multi-repo changes, bringing services up, testing, and shipping end to end.

Code generation is no longer the hard part. It’s managing shared context across repos and having a verification layer that lets the agent iterate until it converges.

The created_at vs createdAt story is the small version. At scale it’s schema drift, contract mismatches, and infra assumptions that only show up when services talk.

Teams that treat coding agents as individual productivity tools will get speed.

Teams that treat them as compounding software factories will get leverage.

This is Part 1. Next: how to actually build the brain and sandbox layers. Verification loops, feedback systems, shared context, and what has to be in place for agents to compound.

Context Tiering for Claude Code: The CLAUDE.md Setup That Survives Long Sessions

sohit kumar — Fri, 10 Apr 2026 11:44:22 GMT

TL;DR: Claude Code has 3 context tiers. Put rules in the wrong tier and Claude “forgets.” This is the Context Tiering setup that took me from a 470-line CLAUDE.md and constant reminders to 94 lines and ~99% first-try eval pass.

I had a rule in CLAUDE.md saying “every PR touching ORM models MUST include a migration.” Claude followed it for 20 turns. By turn 40, it was happily adding model fields without migrations. Bold warnings, checklists, repetition: none of it worked.

The fix wasn’t a better prompt. It was moving the rule out of CLAUDE.md entirely.

Claude Code doesn’t forget. It compresses. And if your rules live in the wrong tier, they get compressed first, which is why your 470-line CLAUDE.md feels like it works for 20 turns and then quietly stops.

3 months running Claude Code in production, mostly on Opus, building real systems for real users. Started at 470 lines of CLAUDE.md. Ended at 94 lines plus a layered system of rules, skills, and guidelines. First-try eval pass rate went from ~70% to ~99%. I used to write “don’t forget the migration” or “add eval cases” in almost every prompt. Now I write it zero times. The system handles it.

The model didn’t get smarter. The context architecture did.

The Mistake Everyone Makes

You start by dumping everything into CLAUDE.md: architecture, coding standards, workflows, security rules, eval framework, design patterns. Then Claude:

Forgets your migration checklist on long conversations
Doesn’t follow coding standards 50 messages deep
Lets subagents invent their own rules

I wrote the same rule three times hoping repetition would help. It didn’t. Repetition isn’t reinforcement. Placement is.

You’re treating CLAUDE.md like a knowledge base. It’s not. It’s permanent context, loaded every single turn, never compressed. A 470-line file eats ~2000 tokens per message and Claude’s attention to any individual rule drops as the file grows past ~200 lines.

The Mental Model: Context Tiering

Claude Code has three context tiers, each with a different lifetime:

PERMANENT (every turn, never compressed)
  CLAUDE.md                      keep TINY
  .claude/rules/* (glob-scoped)  auto-loaded invariants
  Memory index                   cross-session recall

ON DEMAND (loaded when read, compressed over time)
  guidelines files               reference docs
  skill bodies                   workflow instructions

TEMPORARY (compressed first when context fills)
  conversation, file reads, search results

Critical rules go where they’ll never be compressed. Reference knowledge loads only when needed. Everything else is temporary, and that’s fine.

Get this wrong and Claude “forgets.” Get this right and Claude behaves like a senior engineer who actually reads the docs.

The Decision Framework (the most reusable thing in this post)

For any instruction, ask in order:

Q1: If Claude violates this, does something BREAK?
    YES -> RULE (.claude/rules/)

Q2: Is this triggered by a specific task with ordered steps?
    YES -> SKILL (.claude/skills/)

Q3: Is this reference knowledge Claude looks up while coding?
    YES -> GUIDELINE (-guidelines.md)

Q4: Does Claude need this every single turn to understand the project?
    YES -> CLAUDE.md (keep it SHORT)

Q5: Is this an isolated, parallelizable task that would bloat main context?
    YES -> SUBAGENT (fresh instance, brief it like a new hire)

NO to all -> don't add it.

Examples:

“Never import from customers/ in platform code” → breaks multi-tenancy → Rule
“When building a skill: add tracing, write evals, register, sync to Langfuse” → 7 ordered steps with a clear trigger → Skill
“PascalCase classes, snake_case functions” → reference, no trigger → Guideline
“B2B SaaS for insurance document processing” → needed every turn for context → CLAUDE.md
“Audit all Dockerfiles and report inconsistencies” → isolated, read-heavy, parallelizable → Subagent

If you remember nothing else from this post, remember this framework.
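The framework is an ordered check, so it maps naturally onto a function that returns at the first "yes". This is a sketch: the `Instruction` fields are hypothetical flags that a human answers per instruction, not something Claude Code computes for you.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    text: str
    breaks_if_violated: bool = False       # Q1
    task_with_ordered_steps: bool = False  # Q2
    reference_knowledge: bool = False      # Q3
    needed_every_turn: bool = False        # Q4
    isolated_parallel_task: bool = False   # Q5


def place(instruction: Instruction) -> str:
    """Walk Q1 to Q5 in order and return the first tier that applies."""
    if instruction.breaks_if_violated:
        return "rule"
    if instruction.task_with_ordered_steps:
        return "skill"
    if instruction.reference_knowledge:
        return "guideline"
    if instruction.needed_every_turn:
        return "CLAUDE.md"
    if instruction.isolated_parallel_task:
        return "subagent"
    return "do not add"
```

Order matters: an instruction that both breaks something and is reference knowledge lands in a rule, because Q1 is asked first.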

The Five File Types

1. CLAUDE.md (~100 lines): identity and navigation. Project overview, essential commands, folder structure, pointers to other docs. Nothing else. Past 100 lines, Claude’s attention to any single rule starts dropping.

2. Rules files (~30 lines, in .claude/rules/): hard constraints, glob-scoped, never compressed.

.claude/rules/no-customer-imports.md

globs: src/myapp/**

Platform code must NEVER import from customers/. Breaks multi-tenancy.

This completely solved my migration problem. The 48-line Alembic section that Claude forgot in CLAUDE.md became a 12-line rule that fires every time Claude touches models.py. Haven’t missed a migration since.

Glob scope is the lever: globs: ** = permanent everywhere (use sparingly). globs: **/Dockerfile = nearly free.
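To build intuition for how scope changes cost, here is a sketch that approximates glob scoping with Python's `fnmatch`. This is not how Claude Code matches globs internally, and `fnmatch` is not path-aware (its `*` also matches `/`), so treat it as an illustration only. The first rule comes from the example above; the other two rule names are hypothetical.

```python
from fnmatch import fnmatchcase

# Rule name -> glob scope.
RULE_GLOBS = {
    "no-customer-imports": "src/myapp/**",
    "dockerfile-conventions": "**/Dockerfile",  # hypothetical rule
    "no-secrets": "**",                         # hypothetical rule
}


def rules_loaded_for(path: str) -> list:
    """Return the rules whose glob matches the file being touched."""
    return sorted(
        name for name, glob in RULE_GLOBS.items() if fnmatchcase(path, glob)
    )
```

A rule scoped to `**` shows up for every path, while the Dockerfile rule only shows up when a Dockerfile is touched, which is why narrow globs are nearly free.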

3. Guidelines files (~250 lines): coding standards, design patterns. Loaded on demand, compressed over time. Fine, because rules catch the critical stuff.

Split by concern. My original coding-guidelines.md was 727 lines. Every time Claude needed to check a naming convention, it loaded 727 lines covering Python, architecture, skills, migrations, and security. ~3000 tokens for a 20-line answer. I split it into 3 files. Now “fix a Python bug” loads 215 lines. “Design a feature” loads 91. “Build a skill” loads 145.

4. Skills (~200 lines, in .claude/skills/): workflows with a trigger and ordered steps. The ~50-token description is always in context so Claude auto-detects when to fire. The body loads only when invoked. Test: can you describe the trigger in one sentence? Are there 3+ ordered steps? Yes to both = skill.

5. Per-directory CLAUDE.md (~50 lines): only what’s unique to that module. I audited my project and found 74 CLAUDE.md files that were nothing but auto-generated activity logs with zero instructions. Deleted all of them. Kept 31 that had real documentation. Empty per-directory files are noise tax on every session.

Subagents: The Insight That Changes Everything

Subagents are fresh Claude instances. They inherit your CLAUDE.md and rules. They inherit zero conversation history.

If a constraint only exists in your chat, it doesn’t exist for the subagent.

BAD: “Based on our earlier research, fix the bug”

GOOD: “In src/myapp/skills/extract_core.py line 280,
extract_from_classified() fails when classified images
contain duplicate tags. Fix: deduplicate by image path
before the extraction loop.”

I use subagents heavily. When I needed to audit all my Dockerfiles, CLAUDE.md files, and coding guidelines, I launched 4 in parallel. Each explored a different area and came back with findings. Main context stayed clean. If I’d done it inline, file reads alone would have eaten half my window.

The 70% → 99% Trick: Triple Reinforcement

“Always add evals” written in CLAUDE.md alone works ~70% of the time. To reach ~99%:

Layer 1: Skill DESCRIPTION (always in context)
  Auto-detects "I'm building a skill" and triggers /eval-design

Layer 2: CLAUDE.md INSTRUCTION
  "Every change to LLM code MUST use /eval-design"

Layer 3: Rule FILE (auto-loaded on skills/** edits)
  "Every skill needs tracing + evals + tests + registry"

This is how I solved the eval problem. Our pipeline can silently degrade accuracy on any prompt change. Just writing the rule wasn’t enough. Claude would build a skill, write tests, and forget evals. With three layers, compliance is ~99%. If any one fails, the other two catch it.

Build Rules Reactively

Start with zero. Add them when things break:

Week 1: Claude commits a .env file       -> no-secrets rule
Week 2: Claude skips a migration         -> migration-required rule
Week 3: Skill ships without evals        -> skill-completeness rule
Week 4: Platform imports customer code   -> isolation rule

All 7 of my rules came from real incidents. The platform isolation rule? Claude imported a customer-specific config parser into the platform module. Broke the second customer’s integration. The dependency pinning rule? Claude added litellm>=1.0 and the next build pulled a breaking version. Each rule paid for itself within a week.

Don’t pre-create 20 hypothetical rules. They bloat permanent context and dilute attention across the rules that actually matter.
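A rule like dependency pinning is also easy to back with a mechanical check. The sketch below flags requirement lines that are not pinned with `==`. It is a simplified illustration under my own assumptions: it ignores extras markers, environment markers, and URLs, and `examplepkg` is a made-up package name.

```python
import re

# A line counts as pinned if it looks like "name==version" (optionally with extras).
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[A-Za-z0-9_,.\-]+\])?==[^=\s]+$")


def unpinned_requirements(lines: list) -> list:
    """Return requirement specs that are not pinned to an exact version."""
    flagged = []
    for line in lines:
        spec = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if spec and not PINNED.match(spec):
            flagged.append(spec)
    return flagged


requirements = [
    "litellm>=1.0",
    "examplepkg==1.2.3",
    "# a comment line",
    "",
]
flagged = unpinned_requirements(requirements)
```

Here only `litellm>=1.0` is flagged. A check like this can sit next to the written rule, so the rule is enforced even when attention slips.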

The Real Lesson - Feedback Loop

One of the keys to this setup is the feedback loop. Every time I merge a PR, a hook runs that collects the learnings from the session and updates the setup following the guidelines, so your setup evolves with your codebase.

Writing code with an LLM is becoming a commodity. Anyone can spin up Claude Code in 5 minutes. The leverage isn’t in the model anymore. It’s in the context architecture and feedback loop you build around it.

Context is the moat. The rules, skills, guidelines, and glob-scoped invariants are the asset. They compound. They make every future session better. They turn a flaky genius into a predictable senior engineer.

If your AI workflow feels magical-but-fragile, you don’t need a better model. You need a better context architecture.

This refactor will take one afternoon. It saves that much every single day.

Building a Context Graph That Makes Your AI Agent Smarter With Every Run

sohit kumar — Sun, 29 Mar 2026 16:39:28 GMT

Summary

Most AI agents are stateless wrappers around LLMs. They process each request in isolation, with no context from what they decided yesterday or why. In regulated domains like insurance, this is a dead-end.

I built a context graph for motor insurance claims, a knowledge structure where every case the agent processes makes the next case smarter. Entities, parts, costs, decisions, and the reasoning behind them accumulate into a judgment layer that mirrors what experienced surveyors carry in their heads.

By case 5, the agent was finding its own precedents. By case 20, it was flagging cost anomalies and suspicious part combinations that nobody programmed. Not from rules. From the graph growing.

The core thesis: prompt engineering is a ceiling. A context graph is a flywheel.

The Two Knowledge Systems

Every organization runs on two knowledge systems.

The one in the database. And the one in people’s heads.

The database stores what happened: transactions, records, timestamps. But the person who’s been doing the job for fifteen years knows why things happened the way they did. Which exceptions were granted. Which patterns to watch for. Which rules matter on paper and which ones actually matter.

That second knowledge system runs most of the real decision-making. And it has no backup.

This is the reason AI agents hit a ceiling. You can give an agent access to every database, every document, every API. But if the judgment layer lives in someone’s head (the pattern recognition, the institutional memory, the “this doesn’t smell right” instinct) the agent will keep making technically correct decisions that any experienced person would override in seconds.

Agents without a context graph are just stateless wrappers around LLMs. The missing piece isn’t better models. It’s structured knowledge that compounds.

The Gap in Motor Insurance

When a surveyor assesses a claim, they don’t just check if the policy is valid and the documents are in order. They remember that the garage on MG Road always inflates bumper costs. That a rear-end collision claiming both a tail light and a boot lid usually means the damage is real. That a vehicle with three claims in two months deserves closer scrutiny, even if each individual claim checks out.

None of this lives in any database. It lives in experience.

So I ran an experiment. An AI agent that processes motor insurance claims one by one, building a context graph as it goes. Every case adds to the collective knowledge: entities, parts, costs, decisions, and the reasoning behind them.

The term “context graph” comes from Jaya Gupta. The idea is that what makes AI agents truly capable isn’t the model, it’s the structured context they accumulate over time.

Here’s what I didn’t expect.

By case 5, the agent started finding its own precedents. Similar accident, similar vehicle, similar amount, and it pulled the past decision to calibrate its confidence.

By case 20, it was flagging things nobody told it to flag:

Cost patterns that didn’t match what it had learned from prior cases
Part combinations that seemed unusual for the type of accident
Repair patterns at specific garages that looked suspiciously consistent

None of these were hardcoded rules. They emerged from the graph growing.

The biggest surprise: making the agent navigate the graph (follow relationships, inspect connected entities, check cross-case statistics) produced far better reasoning than putting everything into a prompt. The structure itself encodes knowledge that flat text loses.

What I Built

A system where:

Every insurance claim becomes a graph: entities (person, vehicle, garage), evidence (invoice, documents), parts (bumper, headlight), and relationships between them
The agent validates claims against knowledge rules (IRDAI regulations, vehicle parts ontology, document requirements)
The agent searches for precedent, including similar past cases, entity history, and human overrides
The agent navigates the graph using tools. It decides what to investigate, follows edges, checks cross-case statistics
Every decision is stored as a decision trace: what was read, what was compared, what the reasoning was
The knowledge graph grows automatically with every case

The flywheel: more cases → richer graph → better precedent → smarter decisions → more trustworthy traces.
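The decision trace mentioned in the list above could be stored as a small record. This is a minimal sketch, not the author's schema: the class name `DecisionTrace` and all of its field names are assumptions chosen to mirror "what was read, what was compared, what the reasoning was".

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DecisionTrace:
    """Hypothetical record of one agent decision, kept for audit and precedent."""
    case_id: str
    decision: str                                        # e.g. "APPROVE" or "ESCALATE"
    confidence: float
    nodes_read: list = field(default_factory=list)       # graph nodes the agent inspected
    rules_applied: list = field(default_factory=list)    # validation rules that ran
    precedents: list = field(default_factory=list)       # similar past case IDs
    reasoning: str = ""

    def to_json(self) -> str:
        # Stable key order makes traces easy to diff and archive.
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
trace = DecisionTrace(
    case_id="C-EXAMPLE",
    decision="APPROVE",
    confidence=0.95,
    nodes_read=["person:EXAMPLE", "policy:EXAMPLE"],
    rules_applied=["policy_active", "dl_valid"],
    reasoning="Part costs match cross-case averages.",
)
```

Serialising the trace as JSON keeps it independent of the graph store, so the same record can be attached to a case node and shipped to an audit log.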

Modeling the Graph

Nodes are what a surveyor would recognize: Person, Vehicle, Policy, Accident, Garage, Part, Invoice, Documents. Each gets a deterministic ID from the most reliable field available:

Person  → dl_number → aadhar_no → pan_no → mobile → name
Vehicle → registration_number → chassis_number → engine_number

When the same vehicle appears in case 5 and case 15, it’s the same node. You can ask “how many claims has this vehicle had?”
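The fallback chains above can be sketched as a small function. This is an illustration under assumptions: the function name `node_id`, the `type:value` ID format, and the whitespace/case normalisation are mine, not necessarily what the prototype does.

```python
# Field priority per node type, most reliable first (taken from the chains above).
ID_FIELDS = {
    "person": ["dl_number", "aadhar_no", "pan_no", "mobile", "name"],
    "vehicle": ["registration_number", "chassis_number", "engine_number"],
}


def node_id(node_type: str, record: dict) -> str:
    """Return a stable ID built from the first non-empty identifying field."""
    for field_name in ID_FIELDS[node_type]:
        value = record.get(field_name)
        if value:
            # Normalise so "ka 01 ab 1234" and "KA01AB1234" resolve to one node.
            normalised = "".join(str(value).split()).upper()
            return f"{node_type}:{normalised}"
    raise ValueError(f"no identifying field found for {node_type}")
```

One caveat of this design: if one case carries a registration number and another carries only a chassis number for the same vehicle, the two IDs differ. Merging such nodes needs a separate entity-resolution step.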

Edges represent real-world relationships: person DRIVES vehicle, vehicle REPAIRED_AT garage, part DAMAGED_IN accident. Each edge is something a surveyor would draw on a whiteboard.

The Global Layer (Where the Magic Happens)

Beyond per-case graphs, global nodes accumulate across cases:

GlobalPart tracks cost benchmarks:

(:GlobalPart {part_id: "rear_bumper", claim_count: 19,
              avg_claimed: 7200, min: 7200, max: 7200})

CO_OCCURS_WITH captures which parts appear together:

(:GlobalPart {part_id: "front_bumper"})-[:CO_OCCURS_WITH {count: 11}]->(:GlobalPart {part_id: "radiator_grill"})
(:GlobalPart {part_id: "rear_bumper"})-[:CO_OCCURS_WITH {count: 10}]->(:GlobalPart {part_id: "tail_light"})

After 20 cases, the agent knows: “rear_bumper is typically Rs.7,200, always in rear_center zone, and 53% of the time appears with tail_light.” Nobody programmed this. The graph learned it from data.
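To make the accumulation concrete, here is a minimal in-memory sketch of how per-part benchmarks and co-occurrence counts could be updated case by case. The class `GlobalParts` and its method names are hypothetical, and the tail light amount is invented for illustration. The prototype itself persists this in Neo4j.

```python
from collections import Counter
from itertools import combinations


class GlobalParts:
    """Hypothetical in-memory stand-in for GlobalPart nodes and CO_OCCURS_WITH edges."""

    def __init__(self):
        self.stats = {}              # part_id -> running benchmark
        self.co_occurs = Counter()   # (part_a, part_b) sorted pair -> count

    def add_case(self, parts: dict) -> None:
        """parts maps part_id -> claimed amount for a single case."""
        for part_id, amount in parts.items():
            s = self.stats.setdefault(
                part_id, {"claim_count": 0, "total": 0.0, "min": amount, "max": amount}
            )
            s["claim_count"] += 1
            s["total"] += amount
            s["min"] = min(s["min"], amount)
            s["max"] = max(s["max"], amount)
        # Every unordered pair of parts in the same case co-occurs once.
        for pair in combinations(sorted(parts), 2):
            self.co_occurs[pair] += 1

    def avg_claimed(self, part_id: str) -> float:
        s = self.stats[part_id]      # KeyError if the part was never claimed
        return s["total"] / s["claim_count"]

    def co_occurrence_rate(self, part_id: str, other: str) -> float:
        pair = tuple(sorted((part_id, other)))
        return self.co_occurs[pair] / self.stats[part_id]["claim_count"]


g = GlobalParts()
for _ in range(10):
    g.add_case({"rear_bumper": 7200, "tail_light": 1500})   # tail_light amount is made up
for _ in range(9):
    g.add_case({"rear_bumper": 7200})
```

With these inputs, `rear_bumper` has 19 claims and appears with `tail_light` in 10 of them, which is the roughly 53% figure quoted above.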

The Validation Layer

IRDAI regulations and domain knowledge encoded as rules that read specific nodes and produce VERIFIED, CONFLICT, or ANOMALY edges:

| Rule | What It Checks | Source |
| --- | --- | --- |
| policy_active | policy.end_date >= accident.date | IRDAI Regulations 2024 |
| dl_valid | DL expiry >= accident date | Motor Vehicles Act 1988 |
| parts_plausibility | Claimed parts reachable from impact zone | Vehicle parts spatial ontology |
| document_completeness | All required documents present | IRDAI regulations |
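The simpler rules in this table reduce to small pure functions. The sketch below is an assumption about their shape: each function returns a status string that a caller could then attach to the graph as a VERIFIED or CONFLICT edge. The function signatures and the example document names are hypothetical.

```python
from datetime import date


def policy_active(policy_end: date, accident_date: date) -> str:
    """The policy must still be in force on the day of the accident."""
    return "VERIFIED" if policy_end >= accident_date else "CONFLICT"


def dl_valid(dl_expiry: date, accident_date: date) -> str:
    """The driving licence must not have expired before the accident."""
    return "VERIFIED" if dl_expiry >= accident_date else "CONFLICT"


def document_completeness(required: set, present: set) -> tuple:
    """Return the status plus the sorted list of missing documents."""
    missing = sorted(required - present)
    return ("VERIFIED" if not missing else "CONFLICT", missing)
```

Keeping each rule as a function of a few node fields makes it easy to record which nodes were read when the rule fired.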

The parts plausibility check uses a spatial ontology: 74 parts and 206 adjacency edges modeling how damage propagates through a vehicle. Rear bumper claimed for a rear impact? Propagation score 1.0, consistent. Front bumper claimed for a rear impact? Propagation score 0.015, anomalous.

The ontology is a reference graph the agent consults: spatial reasoning, not hardcoded logic.
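One way to compute a propagation score over such an ontology is a breadth-first search from the parts in the impact zone, with the score decaying per hop. This is a sketch under loud assumptions: the tiny adjacency map, the zone mapping, and the `decay ** hops` formula are all hypothetical, so it will not reproduce the 0.015 figure from the real 74-part ontology.

```python
from collections import deque

# Hypothetical miniature ontology: part -> physically adjacent parts.
ADJACENCY = {
    "rear_bumper": ["tail_light", "boot_lid"],
    "tail_light": ["rear_bumper", "rear_door"],
    "boot_lid": ["rear_bumper"],
    "rear_door": ["tail_light", "front_door"],
    "front_door": ["rear_door", "front_fender"],
    "front_fender": ["front_door", "front_bumper"],
    "front_bumper": ["front_fender"],
}

# Hypothetical mapping from impact zone to the parts hit directly.
ZONE_PARTS = {"rear_center": ["rear_bumper"]}


def propagation_score(part: str, impact_zone: str, decay: float = 0.5) -> float:
    """Score 1.0 for directly hit parts, decaying by `decay` per hop away."""
    seeds = ZONE_PARTS[impact_zone]
    hops = {p: 0 for p in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for neighbour in ADJACENCY.get(current, []):
            if neighbour not in hops:
                hops[neighbour] = hops[current] + 1
                queue.append(neighbour)
    if part not in hops:
        return 0.0   # unreachable from the impact zone
    return decay ** hops[part]
```

In this toy graph a rear bumper in a rear impact scores 1.0, while a front bumper sits five hops away and scores 0.5 ** 5. A threshold on the score then decides whether the claim gets an ANOMALY edge.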

The Agent Loop (Where Context Graph Meets Agentic Reasoning)

The key design choice: the agent is not a linear pipeline. It’s a loop where the LLM has tools to navigate the graph and decides for itself what to explore.

Tools the LLM Can Call

The LLM typically makes 4-8 tool calls per case. For clean cases, it checks a couple of parts and decides. For suspicious cases, it digs deeper, following edges, checking history, profiling the garage.

Nobody told the LLM to check part costs or co-occurrences. The LLM decides what’s worth investigating based on what it sees in the graph. This is what makes it agentic rather than scripted.
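The loop itself is small. Below is a minimal sketch with the LLM call replaced by a plain callable so it runs without any model or API key. `run_agent`, `scripted_policy`, and the stubbed tool return values are assumptions for illustration. The tool names are the ones that show up in the investigation trace in this post.

```python
def run_agent(choose_action, tools, max_steps=8):
    """Tool-calling loop.

    choose_action(history) -> (tool_name, kwargs) stands in for the LLM call.
    The loop ends when the model calls make_decision or the step budget runs out.
    """
    history = []
    for _ in range(max_steps):
        tool_name, kwargs = choose_action(history)
        result = tools[tool_name](**kwargs)
        # Each step is recorded, so the history doubles as a decision trace.
        history.append({"tool": tool_name, "args": kwargs, "result": result})
        if tool_name == "make_decision":
            return result, history
    return {"decision": "ESCALATE", "reason": "step budget exhausted"}, history


# Stubbed tools with made-up return values.
TOOLS = {
    "query_global_part": lambda part_id: {"part_id": part_id, "claims": 16, "avg": 12000},
    "make_decision": lambda decision, confidence: {
        "decision": decision,
        "confidence": confidence,
    },
}


def scripted_policy(history):
    """Hypothetical stand-in for the model: check one part, then decide."""
    if not history:
        return "query_global_part", {"part_id": "door_panel"}
    return "make_decision", {"decision": "APPROVE", "confidence": 0.95}


decision, steps = run_agent(scripted_policy, TOOLS)
```

Falling back to ESCALATE when the budget runs out is a design choice in this sketch: an agent that cannot reach a decision hands the case to a human instead of guessing.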

Watching the Agent Investigate (A Real Case)

Case C-002: Rajesh K. is driving a vehicle insured under Suresh M.’s policy. Side impact at an intersection. Three parts claimed: side mirror (Rs.2,800), door panel (Rs.12,000), running board (Rs.3,500). Total: Rs.18,300.

The agent’s actual investigation:

[THINK] "I see a driver-policyholder mismatch. Let me investigate."

[ACT]  inspect_node(person:TN31XXXXXXXX)
       → name=Rajesh K., relation_to_insured=self, insured_name=Suresh M.

[ACT]  inspect_node(policy:0407XXXXXXXX)
       → insurer=Insurer Co. Ltd, sum_insured=388864, active

[ACT]  query_global_part(rear_view_mirror)
       → 17 past claims, avg Rs.2,800

[ACT]  query_global_part(door_panel)
       → 16 past claims, avg Rs.12,000

[ACT]  query_global_part(running_board)
       → 11 past claims, avg Rs.3,500

[ACT]  get_part_cooccurrences(rear_view_mirror)
       → door_panel co-occurs 8 times, running_board 8 times
       → Pattern is typical for side impacts

[ACT]  make_decision(APPROVE, confidence=0.95)
       → "Driver-policyholder mismatch noted but AUTHORIZED_BY edge exists.
          Parts costs match cross-case averages exactly.
          Part combination typical for side impacts.
          4 similar precedent cases all APPROVED."

Notice: the agent chose to inspect the person node because it saw the mismatch. It chose to verify every part cost. It chose to check co-occurrence patterns. The graph gave it the information. The tools gave it the agency. The LLM connected the dots.

How Precedent Makes the Agent Learn

Before deciding, the agent queries four sources:

1. Entity History. Has this vehicle, person, or garage been in a claim before? A vehicle with 3 claims in 2 months is different from one with its first claim.

2. Tag-Based Similar Cases. Past cases with overlapping tags (impact zone, severity, vehicle type). Four similar cases all APPROVED? High confidence. Two DENIED? Mixed signals, lower confidence or escalate.

3. Dimension Matching. Numerical similarity across case attributes. The architecture supports vector/cosine search at scale.

4. Human Override History. Cases where a human corrected the agent. If a human changed APPROVE to ESCALATE because “front bumper at Rs.12,000 but market rate for Swift hatchback is Rs.6,000,” the agent sees this correction and adjusts on similar patterns.

All four sources merge into the LLM context before it decides. The LLM doesn’t follow a formula. It weighs everything contextually. This is how precedent compounds. Case 1 has zero precedent. Case 20 finds 4+ similar cases. At case 1,000, the agent would have deep precedent for every pattern: not just “similar cases” but “similar cases at this garage, with this vehicle type, in this impact zone, with these parts.”
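Sources 2 and 3 can be sketched with two standard similarity measures. Using Jaccard overlap for tags is my assumption, since the post does not say how tag overlap is scored. Cosine similarity is the measure the post names for dimension matching. The function names and the case dictionary shape are hypothetical.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Share of tags two cases have in common (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u: list, v: list) -> float:
    """Cosine similarity between two numeric case vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similar_cases(query_tags: set, past_cases: list, threshold: float = 0.5) -> list:
    """Return (score, case_id, decision) for past cases above the threshold, best first."""
    scored = [
        (jaccard(query_tags, case["tags"]), case["case_id"], case["decision"])
        for case in past_cases
    ]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```

The threshold is the knob that matters here. If tags are too uniform, almost every past case clears it and precedent stops being discriminating.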

How the Graph Evolves

Business Impact: What This Unlocks

This isn’t just a technical exercise. The context graph creates measurable business outcomes:

Speed. Clean cases (strong precedent, no anomalies, low amount) can be fast-tracked. The graph provides the confidence signal. A claim that matches 10 prior approvals with identical part costs doesn’t need 30 minutes of human review.

Fraud detection that emerges instead of being programmed. The agent catches a garage that always claims the same three parts at identical prices. Not because someone wrote a fraud rule. Because the graph made the pattern visible. This is cheaper and more adaptive than rule-based fraud systems.

Reduced human dependency without removing humans. Surveyors review the hard cases instead of every case. The decision trace means they can see exactly why the agent decided what it did, and correct it. Those corrections feed back into the graph.

Regulatory defensibility. Every decision has a full audit trail: which nodes were read, which rules applied, what precedent was found, what the reasoning was. In a regulated industry, “the AI said so” isn’t acceptable. “The AI checked these 6 rules, found 4 similar precedents, verified costs against 16 data points, and here’s the reasoning” is.

Directional hypothesis: A mature context graph processing 1,000+ claims should reduce average claim processing time by 40-60% while catching cost anomalies that manual review misses at scale.

What Worked and What Didn’t (Honest Assessment)

What worked:

The LLM genuinely navigates the graph. It’s not a fancy database. The agent follows edges, checks cross-case stats, references specific nodes in its reasoning
Decision traces are powerful for audit and trust
Part co-occurrence patterns and cost benchmarks emerged naturally from data
The tool-calling loop gives the agent real investigative agency

What didn’t work yet:

All 20 cases approved (the test data had no critical conflicts: valid DLs, active policies, reasonable costs). This validated the happy path. Adversarial testing with expired policies, inflated costs, and invalid documents is next
Tags are too uniform. Precedent search finds too many matches because tags aren’t discriminating enough
Missing data is the biggest problem: 100% of cases had no impact_zone, severity, or vehicle body_type in the ground truth

The gap I haven’t closed yet: image reasoning. A real surveyor looks at photos. Tying visual evidence to graph nodes (“this dent pattern is consistent with a side impact”) is the next frontier. A part node could be linked to 1,000 images and to other cases, all of which could feed the agent’s reasoning.

What I Intentionally Kept Simple

NetworkX for in-memory per-case graphs (fast, no server needed for prototyping)
Neo4j for persistent cross-case graph storage
Custom ReAct loop with litellm, not LangChain/CrewAI (I wanted full control over graph construction and trace capture)
No vector database. Overkill for 20 cases. In-memory dimension matching is sufficient. Would add for 1,000+ cases
No LLM fine-tuning. The context graph gives the LLM enough information to reason well. Structured context beats fine-tuning for this use case

Conclusion: The Context Graph Flywheel

The next generation of agents won’t be better because of better models. They’ll be better because they accumulate context.

Every case processed adds entities to the global graph, cost data points for benchmarking, co-occurrence patterns, decision traces for precedent, and validation results that test rule effectiveness. The agent’s first case is a cold start. By case 20, it’s checking costs against 15+ data points and finding 4+ similar past cases. By case 1,000, it would have garage profiles, seasonal patterns, segment-specific cost curves, and a library of human corrections teaching it where its initial judgments were wrong.

The surveyor’s intuition isn’t magic. It’s pattern recognition from hundreds of cases, compounding silently over a career. A context graph captures that same compounding, except it never degrades, never retires, and every new case makes every future case smarter.

The hard part isn’t the technology. It’s modeling the domain correctly and feeding the system enough real data. The flywheel (the knowledge graph grows, the agent learns, the agent gets better) is buildable today.

I’m just getting started.

Retrieval Debt: The Technical Debt Your Agent Is Paying Right Now

sohit kumar — Wed, 25 Feb 2026 07:38:44 GMT

I started thinking about a simple question: will software design principles still matter when AI agents do most of the coding?

The answer I kept coming back to was: more than ever. Not because the principles changed. Because the reader did.

Your agent indexes your codebase once. But every session it reasons from scratch. No memory of last week’s decisions. No context about why that tradeoff was made. Just retrieval and a best guess.

Token costs spike. Wrong files get edited. You end up guiding your agent through your own codebase like onboarding a new hire who cannot ask questions and never remembers the answers.

Most teams think this is an AI problem. It is a design problem.

Agents Inherit Your Design Decisions

Software design principles were never about aesthetics. They existed to manage ambiguity for whoever changes the code next. The audience was always the next reader who needs to understand this and modify it safely.

That reader is now an agent. And agents suffer from ambiguity more than humans do, not less.

A human encountering unclear code slows down. They hesitate. They ask questions. An agent does the opposite. It makes a confident decision and moves forward.

Ambiguity doesn’t create caution in agents. It creates confident mistakes at scale.

Good design amplifies good agent output. Bad design amplifies bad agent output. The principles didn’t become irrelevant. They became more expensive to ignore.

Three Constraints That Make Violations Expensive

Agents operate under three hard constraints humans never had, and understanding these is why the rest of this post matters.

Context window means agents can only reason about what they retrieve. Scattered logic means partial picture, wrong assumptions, missed files. What a human fills with intuition and experience, an agent either retrieves correctly or gets wrong confidently. There is no middle ground.

Cost means every token loaded is a line item on your bill. Poor design is no longer just slow. A messy codebase costs significantly more to operate on than a clean one. Bad architecture now has a direct, measurable infrastructure cost.

Accuracy is the dangerous one. A human who encounters ambiguous code slows down, double checks, asks questions. An agent encountering the same ambiguity does not hesitate. It makes a confident decision and moves forward. A human who misses something feels unsure. An agent that misses something believes it is done.

Every principle that helped humans manage complexity maps directly onto these three constraints.

Every Principle, Through the Agent Lens

The problems didn’t change. The cost of getting them wrong did.

This Is Not Theoretical

Before going further, here is why this pattern shows up consistently across tools and research.

Cursor ships a 500-line file limit recommendation specifically because large files degrade agent retrieval quality.¹ LLM research consistently documents accuracy loss on information buried in long contexts, what researchers call the “lost in the middle” problem.² And in January 2026 Cursor shipped dynamic context discovery, moving away from static context loading toward pulling only what the agent needs on demand. Their A/B tests showed a 46.9% token reduction just from tighter context loading.³

Three independent signals pointing at the same thing. Retrievability dominates cost, speed, and correctness.

They are building tooling to compensate for bad codebases. Low Retrieval Debt makes that compensation unnecessary. The teams that win won’t have the best agent prompts. They will have codebases that are cheap to understand, cheap to modify, and hard to misunderstand.

Naming Is Now Architecture

Agents don’t remember your codebase. They find it through semantic search. Your repository is indexed as vector embeddings and the agent retrieves what looks relevant to the query.

Poor naming isn’t a style issue. It is a retrieval failure.

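# Agent searching "discount logic" will miss this
def compute_final_value(u, amt, fl=False):
    if u.tier == 2:  # magic number, assumed here to mean the premium tier
        amt = amt * 0.9
    return amt

# Agent finds this immediately
PREMIUM_TIER = 2  # assumed meaning of tier 2 in the version above

def apply_user_discount(user, order_amount: float, is_flash_sale: bool = False) -> float:
    # Flash-sale handling is omitted in both versions.
    premium_discount = 0.10
    if user.tier == PREMIUM_TIER:
        return order_amount * (1 - premium_discount)
    return order_amount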

Same logic. One surfaces. One doesn’t. The agent that misses the first version doesn’t raise a hand. It reimplements somewhere else and moves on confidently.

A function with single responsibility averages 20 to 50 lines. A god class doing six things averages 400 or more. An agent loading the god class to change one behavior loads 8 to 20 times more tokens than necessary. That is not a style problem. That is an infrastructure cost.

Naming conventions are no longer bikeshedding. They are part of your agent infrastructure.

Retrieval Debt: The Metric That Matters

Retrieval Debt is how much context an agent must load to safely understand, change, and verify a single unit of behavior. Lower is better. Always.

This applies to your static codebase, core domain logic, data models, API contracts. Ephemeral generated code and throwaway scripts are a different conversation. But the foundation your agents build everything else on top of will always need these principles, and in an agentic world it needs them more rigorously than ever.

Discount logic scattered across three services means three retrievals, three formula structures, three edits, and a silent inconsistency the agent believes it already fixed. The same logic centralized means one retrieval, one edit, one test.

# High Retrieval Debt
# cart_service.py
if user.is_premium:
    total *= 0.9

# order_service.py
if user.is_premium:
    price = price * 0.9

# invoice_service.py
discount = 0.1 if user.is_premium else 0
final = amount - (amount * discount)

# Low Retrieval Debt
# pricing/discounts.py
PREMIUM_DISCOUNT = 0.10

def apply_user_discount(user, amount: float) -> float:
    if user.is_premium:
        return amount * (1 - PREMIUM_DISCOUNT)
    return amount

A human missing one location creates a bug. An agent missing one location creates a silent production incident it is confident never happened.

What This Means for Your Team

Agents have no domain understanding, no memory of past tradeoffs, no intuition about why your system looks the way it does. Their guess is often structurally plausible and contextually wrong.

Senior engineers become more valuable, not less. Their role shifts from writing code to shaping the context agents operate in: boundaries, names, abstractions, architectural intent. They nudge agents toward the right design by making the problem clear enough that the agent cannot go too far wrong.

For product and engineering leaders who don’t live in the code: ask your team one question. How many files does an agent need to read to safely change this feature? If they don’t know, or if the answer is more than three, you have a Retrieval Debt conversation worth having before your next planning cycle.

If your best engineers are shepherding agents through a messy codebase instead of making architectural calls, your structure is the bottleneck. That is Retrieval Debt showing up as a people cost.

How to Reduce Retrieval Debt This Week

The 2-3 file rule. If a safe change requires more than 2-3 files, you have a problem worth fixing now.

The single retrieval test. Could an agent find the right place with one semantic search? If not, your naming or boundaries are wrong.

The edit surface check. Count how many places change for one business rule update. More than two is Retrieval Debt.
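The edit surface check can be roughed out with a short script that counts how many files mention a given business rule. This is a crude proxy and an assumption on my part: a text match on a pattern such as `is_premium` stands in for "places that change", and the function name `edit_surface` is hypothetical.

```python
import re
from pathlib import Path


def edit_surface(root: str, pattern: str, glob: str = "*.py") -> list:
    """Return the files under root whose text matches pattern.

    More than two hits for one business rule suggests Retrieval Debt.
    """
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if regex.search(text):
            hits.append(str(path.relative_to(root)))
    return hits


# Example call (the path and pattern are placeholders):
# edit_surface("src", r"is_premium")
```

Run against the scattered discount example earlier in this post, this would list all three service files. Against the centralized version it would list only the pricing module plus its callers.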

The Bottom Line

Your codebase is no longer just a communication medium between engineers. It is a knowledge base your agents query under cost and accuracy constraints, thousands of times a day.

Agent debt is the new technical debt. Retrieval Debt is how you measure it.

Start with your most changed modules. Count the retrievals needed to make a safe change. Drive that number down.

Most teams are accumulating this without noticing. The ones who design against it now will compound quietly while everyone else wonders why their agent productivity hit a ceiling.

¹ Cursor, Working with Context, docs.cursor.com/en/guides/working-with-context

² Lost in the Middle, Stanford NLP Research, 2023

³ Cursor, Dynamic Context Discovery, January 2026. cursor.com/blog/dynamic-context-discovery

Building AI Agents? The Hard Part Isn’t the Agent

sohit kumar — Sat, 31 Jan 2026 12:57:27 GMT

MIT research found that 95% of enterprise AI pilots fail to deliver measurable returns.

Not because the technology doesn’t work. Not because the models aren’t capable.

MIT calls it a “learning gap” - organizations don’t know how to integrate AI into real workflows.

From what I’ve seen, the gap is architectural. Teams build agents when they should be building systems.

I work with teams building agentic platforms.

The first enterprise customer conversation is usually humbling.

They don’t ask about your model. They don’t ask about your prompts or your reasoning capabilities.

They ask:

“What happens when the agent is wrong? How does it recover?”
“How will I know if something breaks at 2 AM?”
“What happens if a workflow fails halfway through?”
“When do humans get involved? How do you learn from that?”
“How do I audit what the agent did and why? Can one customer’s data touch another’s?”
“Where does inference happen? Does data stay in our region?”

Most teams have good answers to maybe three of these.

That gap is where projects stall. Or die.

An agentic system is not an LLM with tools.

It is a production system where probabilistic reasoning is wrapped by deterministic execution, visibility, and control.

The agent gets all the attention. But the system around it is what actually ships.

Reliability beats intelligence

The most successful deployments aren’t the ones with the cleverest agents. They’re the ones that work consistently.

An agent that gives you an 8/10 answer every time is more valuable than one that gives you 10/10 half the time and fails unpredictably the other half. Enterprise workflows don’t tolerate variance. A finance team running month-end close doesn’t want “usually works.” They want “always works, and when it doesn’t, we know immediately.”

The optimization target that matters is not “how good is the output” but “how predictable is the outcome.”

Evals aren’t optional for agentic systems. They’re foundational.

In traditional software, you write tests once and they pass or fail deterministically. Agentic systems don’t work that way. The same input can produce different outputs. Model updates change behavior silently. Prompts that worked last month drift without warning.

Run evals continuously in production, not just during development. Catch drift before users do.

And when you have multiple agents working together, evals become even more critical. Each agent might perform fine in isolation. But chain them together and failures compound. Agent A’s slightly off output becomes Agent B’s confidently wrong input. Without evals at every handoff, you’re debugging in the dark.

The teams that treat evals as infrastructure - not an afterthought - are the ones shipping reliably.

Observability is the moat

If you can’t see what your agent is doing in production, you can’t improve it. You’re guessing. You’re relying on user complaints to surface issues. By the time you hear about a problem, it’s been happening for weeks.

Tools like Langfuse and Arize give you this visibility. Inputs, outputs, latency, token usage, outcomes, all traceable end to end.

Most teams can’t answer basic questions about their production agents. What’s your p95 latency? What percentage of requests need retry? Which tool calls fail most often?

Observability isn’t a feature you add later. It’s infrastructure that determines how fast you can iterate. Teams with better visibility ship improvements faster. That compounds.
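Those three questions are easy to answer once traces are recorded in a consistent shape. The sketch below assumes a hypothetical record format (`latency_ms`, `retries`, `tool`, `ok`). It does not use the Langfuse or Arize APIs. Both tools expose this kind of data through their own interfaces.

```python
from collections import Counter


def p95(values: list) -> float:
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(traces: list) -> dict:
    """Answer: what is p95 latency, how often do we retry, which tool fails most?"""
    latencies = [t["latency_ms"] for t in traces]
    retried = sum(1 for t in traces if t["retries"] > 0)
    failures = Counter(t["tool"] for t in traces if not t["ok"])
    return {
        "p95_latency_ms": p95(latencies),
        "retry_rate": retried / len(traces),
        "top_failing_tool": failures.most_common(1)[0][0] if failures else None,
    }
```

The point is less the arithmetic than the habit: if every tool call emits a record like this, the questions above become a query instead of an investigation.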

Orchestration is the unsexy superpower

Your agent’s reasoning can be probabilistic. Your execution layer cannot.

When an agent decides to call an API, update a record, or send a message, that action needs to happen reliably. If it fails, it needs to retry. If it retries, it can’t duplicate side effects. If the workflow crashes mid-way, it needs to resume from where it left off, not start over.

Tools like Temporal and Restate give you durable execution, automatic retries, state persistence, and exactly-once semantics.

Simple try-catch logic works for demos. It falls apart in production with multi-step workflows, external dependencies, and edge cases you didn’t anticipate.
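To show what "retry without duplicating side effects" asks for, here is a minimal sketch built around an idempotency key. It is not how Temporal or Restate work internally. The class `IdempotentExecutor` is hypothetical and keeps results in memory, where a real system would persist them durably.

```python
import time


class IdempotentExecutor:
    """In-memory stand-in for a durable result store keyed by idempotency key."""

    def __init__(self):
        self._results = {}

    def run(self, key: str, action, max_attempts: int = 3, base_delay: float = 0.0):
        # A completed key is never executed again, so replays are safe.
        if key in self._results:
            return self._results[key]
        last_error = None
        for attempt in range(max_attempts):
            try:
                result = action()
                self._results[key] = result
                return result
            except Exception as exc:      # demo only: real code should catch narrower errors
                last_error = exc
                time.sleep(base_delay * (2 ** attempt))   # exponential backoff
        raise last_error
```

This still has a gap: if the action succeeds but the process dies before the result is recorded, a retry repeats the side effect. Closing that gap needs the downstream API to accept the same idempotency key, or a durable execution engine that journals each step.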

Orchestration is what turns an agent into a system.

An agent without durable orchestration is a demo, not a system.

HITL is your learning loop

The common assumption is that HITL is a transitional state. You start with humans in the loop, then gradually remove them as the agent gets better. Full autonomy is the goal.

That’s not what I see working.

The goal isn’t fewer humans. The goal is faster correct outcomes. Sometimes that means more autonomy. Sometimes it means a human reviewing something for 10 seconds before the agent proceeds.

But here’s what most teams miss: HITL isn’t just about catching errors. It’s your learning loop.

Every human correction is training data. Every escalation shows where your agent is weak. Every approval pattern tells you where you can safely increase autonomy.

The teams that treat HITL as a feedback system improve faster than everyone else. They’re not trying to remove humans from the loop. They’re using humans to make the system smarter.

Design for this from day one. Capture why humans intervene, not just that they did. Track which corrections repeat. Feed that back into your evals, your prompts, your guardrails.

The system should get smarter because humans are in the loop, not despite it.
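Capturing why humans intervene can start as a very small data structure. This sketch assumes a hypothetical `Correction` record with a `reason_code` field. The reason codes shown are invented examples.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One human intervention: what changed and, crucially, why."""
    case_id: str
    agent_decision: str
    human_decision: str
    reason_code: str          # e.g. "cost_above_market" (hypothetical code)


def repeated_reasons(corrections: list, min_count: int = 2) -> list:
    """Reasons that keep recurring are candidates for a new eval or guardrail."""
    counts = Counter(c.reason_code for c in corrections)
    return [(reason, n) for reason, n in counts.most_common() if n >= min_count]
```

A reason that shows up repeatedly is a signal to add an eval case or tighten a prompt. A reason that appears once may just be noise.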

Security & Governance is architecture

Enterprise customers ask about security on day one. The temptation is to add it later. That’s a mistake.

Prompt injection defense. Tenant isolation. RBAC not just for users, but for agents acting on behalf of users. Audit trails for every autonomous action. Data classification that handles derived data, not just raw inputs.

You can’t retrofit this cleanly. Security boundaries are system-level concerns. Design them in from the start and they’re natural. Add them later and they’re a patchwork.

And before you even get to the technical conversation, there’s the compliance gate. SOC 2, ISO 27001, and depending on the industry, HIPAA or PCI-DSS. For agentic systems, this is harder than traditional software - you’re not just auditing what humans did, you’re auditing what agents decided to do autonomously. If you don’t have a clear answer for how you’ll pass an audit, enterprise deals stall before they start.

Every security question is really an architecture question. “How do you prevent X” is really “how is your system designed so that X can’t happen.”

Data locality kills deals

This one catches teams off guard. You build the system, it works, you’re ready to close a deal. Then the customer asks “where does inference happen?” and the project stalls.

Data residency and model residency are different problems. You might store customer data in the right region, but if inference happens via an external API that crosses geographic boundaries, you still have a compliance issue.
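The distinction can be made checkable with a deployment-time guard. This is a minimal sketch with hypothetical names. The region strings are placeholders, not any provider's real identifiers.

```python
def residency_violations(required_region: str, data_region: str, inference_region: str) -> list:
    """Check data residency and model residency separately against one requirement."""
    violations = []
    if data_region != required_region:
        violations.append(f"data stored in {data_region}, required {required_region}")
    if inference_region != required_region:
        violations.append(f"inference runs in {inference_region}, required {required_region}")
    return violations
```

Storage can pass this check while inference fails it, which is exactly the case that stalls a deal late.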

Infrastructure-as-code for multi-region deployment. Your entire stack should be deployable to any supported region with configuration changes, not re-architecture.

Some customers need BYOC for compliance or control. It adds complexity for everyone. But for customers with strict requirements, it’s the only path forward.

Compliance and data locality aren’t features you add at the end. They’re constraints that shape your architecture from the start.

The shift

95% of AI pilots fail not because the technology doesn’t work. They fail because teams build agents when they should be building systems.

The agent is a component. The system includes orchestration, observability, human escalation, security, reliability, and everything that makes it work in the real world.

Stop thinking “I’m building an agent.” Start thinking “I’m building a system that includes an agent.”

The agent demos. The system ships.

I advise teams building agentic systems for enterprise. If you’re navigating this, let’s talk.

Tech

sohit kumar — Fri, 08 Jan 2021 15:15:19 GMT

Welcome to sohit’s Newsletter by me, sohit kumar. building tech platforms. https://t.co/oyf9UWt2kI https://x.com/ksohit
