What the Ants Taught Us: Model the Fix, Not the Problem

Cynthia Unwin
Jun 3

The ants are doing well.

That’s not a hedge or a “we’re learning a lot” consolation line — they’re genuinely good at the job now.

Have they ever publicly shamed themselves in front of an audience, say the SVP of IBM Consulting? Maybe. That nightmare day aside, they have new tools, they’re solving patterns we never explicitly taught them, and the base architecture has settled to the point where adding a new capability is no longer a project. It’s an afternoon. Right now, where they really shine is database-related issues. Blocked queries, runaway connections, indexes gone stale, the slow creep of a table that needs maintenance: these are incidents the colony resolves cleanly and without drama.

Databases are where a lot of automation gets brittle, because the “right” action depends so heavily on the state of the system at that moment. The ants handle it because they reason about that state rather than following a script.

Here’s the shift that made the biggest difference, and it sounds almost too simple to matter: we stopped modelling problems and started modelling fixes.

The instinct when you build operational automation is to enumerate what can go wrong. Users can’t log in. Page loads are slow. A batch job dies halfway through. Error rates increase. We build a catalogue of failure modes, and for each one we write the response. It feels rigorous, but it is also a trap, because the catalogue is only ever as complete as your imagination on the day you wrote it, and production has a richer imagination than you do.

So we inverted it. Instead of asking “what are the problems,” we asked “what do we actually do.” And that list turns out to be short, concrete, and stable: clear drive space, unblock a database, reindex, restart a service, restart a batch job, strip empty space from an input file, clean up bad data. These are capabilities. They don’t belong to any one incident — clearing space is the right move in a dozen unrelated situations, and a service can hang for reasons you’ll never finish listing. The trick is that a capability isn’t just an action. It’s an action paired with an understanding of when it’s the right thing to reach for. And “when” here doesn’t mean a trigger — it isn’t “reindex if latency crosses some line.” It’s closer to what a good engineer carries in their head: a sense of what reindexing is for, what it costs, what it tends to fix and what it won’t touch, and what the system usually looks like when it’s the move that helps.

Give an ant that, the action and the shape of the judgement around it, and it doesn’t need you to have anticipated the exact incident. It can look at a situation it has never seen, weigh it against what it understands the capability to be good for, and decide. That’s the difference between a tool and a rule: a rule fires when its condition is met, but judgement is exercised against a situation. We wanted ants that exercise judgement. So: model the fix, not the problem. Everything else the colony does well grows out of that.
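One way to picture a capability as “an action plus the shape of the judgement around it” is a small record with no trigger field at all. The sketch below is only an illustration: the `Capability` class, its field names, and the wording of the reindex entry are assumptions made for this example, not the colony’s actual data model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Capability:
    """Hypothetical record: an action plus the judgement around it."""
    name: str
    action: Callable[[], str]      # the thing the ant can actually do
    purpose: str                   # what the action is for
    cost: str                      # what it costs to run
    tends_to_fix: tuple[str, ...]  # symptoms it usually helps with
    wont_touch: tuple[str, ...]    # symptoms it does nothing for
    # Deliberately absent: any "fire when X crosses Y" condition.

    def briefing(self) -> str:
        """Render the judgement half as prose to weigh against a situation."""
        return (
            f"{self.name}: {self.purpose} Cost: {self.cost} "
            f"Tends to fix: {', '.join(self.tends_to_fix)}. "
            f"Won't touch: {', '.join(self.wont_touch)}."
        )


# Illustrative entry only; the descriptions are invented for this sketch.
reindex = Capability(
    name="reindex",
    action=lambda: "reindex requested",  # placeholder, no real database call
    purpose="Rebuilds an index so lookups stop degrading.",
    cost="Extra I/O while the rebuild runs.",
    tends_to_fix=("slow lookups on a heavily updated table",),
    wont_touch=("blocked queries", "exhausted connections"),
)

print(reindex.briefing())
```

The briefing is what an ant would weigh against the situation in front of it; nothing in the record says when to fire.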

Why problem-first automation stays brittle

It’s worth being precise about why the problem-first approach breaks down, because “it doesn’t scale” is too easy and misses the interesting part. It fails in two distinct ways, and capability-first avoids both.

The first is that a runbook encodes a moment in time. The day you write it, it describes the system accurately. But the system doesn’t hold still: services move, dependencies get added, a threshold that was right last quarter is generous now, the topology quietly reorganises around you. The runbook doesn’t know any of this happened. So it goes on firing with full confidence, and the worst part is the timing: it is most assured precisely when it is most out of date, because nothing in it can tell that the ground has shifted. Stale automation doesn’t announce itself. It just starts being wrong.

The second is subtler, and it’s baked into the structure of branching logic itself. Every condition you add to handle a new case multiplies the paths through the tree. Two conditions, four combinations; ten conditions and you’ve lost the ability to hold the thing in your head. And the incident that actually pages you at 3am is almost never one of the branches you drew — it’s some combination of conditions nobody thought to encode, sitting in the gap between the paths. So the tree grows more complex without growing more complete. You pay for the complexity and you don’t get the coverage. Worse, the complexity itself becomes a hazard: a decision tree nobody fully understands is its own source of incidents.
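The arithmetic behind “two conditions, four combinations” is worth seeing once. This sketch assumes each condition is an independent yes/no check, which is a simplification, and the condition names are invented for illustration.

```python
from itertools import product


def all_paths(conditions: list[str]) -> list[dict[str, bool]]:
    """Enumerate every true/false combination of the given conditions."""
    return [
        dict(zip(conditions, values))
        for values in product([False, True], repeat=len(conditions))
    ]


two = all_paths(["db_blocked", "disk_full"])
print(len(two))  # 4

# Ten independent yes/no conditions give 2**10 combinations.
ten = all_paths([f"condition_{i}" for i in range(10)])
print(len(ten))  # 1024
```

A hand-drawn tree covers only the combinations someone thought to draw; the rest are the gaps described above.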

Capability-first sidesteps both failures, and for the same underlying reason: there’s no tree to rot and no enumeration to fall short of. An ant that understands what reindexing is for doesn’t go stale when the topology shifts; it looks at the system as it is right now and reasons from there. It doesn’t have a path that’s missing, because it isn’t following paths. You’re not maintaining a map that’s forever drifting out of sync with the territory. You’re giving something the judgement to read the territory directly.

How a colony of small agents reasons about “the system as it is right now”, without anyone coordinating them, is the part the architecture does quietly. The ants don’t talk to each other or follow a shared plan. They read and write to a common space, a blackboard, where the current state of the incident accumulates: what’s been observed, what’s been tried, what each ant suspects. An ant reasons from what’s on the blackboard at that moment, not from a script. So “read the territory directly” isn’t a metaphor for one clever agent; it’s what falls out of many simple agents sharing a live picture instead of following a fixed map.
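A minimal sketch of the blackboard idea, under loud assumptions: the `Blackboard` class, its three lists, and the two toy ants are hypothetical names for illustration, and the one-line check inside `lock_ant` merely stands in for whatever reasoning a real ant does. The point it shows is structural: ants never call each other, they only read and write the shared space.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Hypothetical shared space where the state of one incident accumulates."""
    observed: list[str] = field(default_factory=list)
    tried: list[str] = field(default_factory=list)
    suspected: list[tuple[str, str]] = field(default_factory=list)  # (ant, suspicion)


def watcher_ant(board: Blackboard) -> None:
    # Stand-in for observation: records what it sees, nothing more.
    board.observed.append("queries on product database are blocked")


def lock_ant(board: Blackboard) -> None:
    # Reads only the board; it holds no reference to watcher_ant.
    if any("blocked" in note for note in board.observed):
        board.suspected.append(("lock_ant", "a long-held lock may be the cause"))


board = Blackboard()
for ant in (watcher_ant, lock_ant):  # turn order is arbitrary scheduling, not a plan
    ant(board)

print(board.suspected)
```

If `lock_ant` happened to run against an empty board it would simply add nothing, because all it knows is what the board holds at that moment.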

Dependencies are information, not a workflow

There’s an objection that comes up fast when you describe capability-first agents: fine, but real systems are tangled. You can’t unblock a database in isolation when six services depend on it and one of them is mid-batch. Order matters. Surely that has to be encoded.

It’s a fair point, and it’s where a lot of designs quietly slide back into the thing they were trying to escape. Because the obvious move is to write down the dependencies as a sequence: if the product database is wedged, first pause the ingest job, then drain the connection pool, then clear the lock, then resume. That feels like you’re respecting the system’s structure. What you’ve actually done is build a workflow, with all of the previous section’s problems waiting inside it. You’ve just hidden the decision tree behind the word “dependency.”

The fix is to change what the dependency information is. Don’t give the ants a sequence. Give them the map — the product database feeds the order service, the ingest job writes to the database, the cache sits in front of both — as plain facts about how things relate, with no instructions attached. Then let the ant reason about what that map implies for the situation in front of it. The same knowledge that you might have frozen into “step one, step two, step three” becomes something the ant consults and interprets: the ingest job writes here, so if I clear this lock while it’s running I’ll just recreate the problem — I should deal with the job first.

The ant reached the right order, but it reasoned its way there from the topology rather than replaying a sequence someone recorded. That distinction, dependencies as information to reason over and not a sequence to execute, is the whole game. It’s the difference between a colony and a pipeline wearing a colony’s clothes. Give the ants a richer and richer picture of how the system fits together and they get more capable, because there’s more for their judgement to work with. Give them more fixed sequences and you’ve just rebuilt the runbook, one well-intentioned dependency at a time.
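To make “dependencies as information” concrete, here is a sketch that stores the map from the example as plain facts and lets a helper answer questions about it. Everything named here (the `Fact` triple, the relation strings, `writers_to`, `as_prose`) is an assumption for illustration, and the lookup is only a stand-in for the ant’s reasoning. Note what is missing: no step numbers and no ordering.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """Hypothetical triple: how one component relates to another."""
    subject: str
    relation: str
    target: str


# The map from the text, as facts with no instructions attached.
TOPOLOGY = [
    Fact("product_database", "feeds", "order_service"),
    Fact("ingest_job", "writes_to", "product_database"),
    Fact("cache", "sits_in_front_of", "product_database"),
    Fact("cache", "sits_in_front_of", "order_service"),
]


def writers_to(component: str, facts: list[Fact]) -> list[str]:
    """Answer one question an ant might ask: who writes to this component?"""
    return [f.subject for f in facts
            if f.relation == "writes_to" and f.target == component]


def as_prose(facts: list[Fact]) -> str:
    """Render the facts as plain sentences to reason over."""
    return "\n".join(
        f"{f.subject} {f.relation.replace('_', ' ')} {f.target}" for f in facts
    )


print(writers_to("product_database", TOPOLOGY))  # ['ingest_job']
print(as_prose(TOPOLOGY))
```

The same facts can serve a different question in a different incident; they do not change when the incident does.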

What evolution actually buys you

The colony evolves, and when I describe that, people tend to picture the ants getting cleverer, some gradual sharpening of the agents themselves. That’s not really where the value has shown up. Evolution has turned out to be useful in two much more concrete ways, and both are about exposing things we couldn’t see from the inside.

The first is that it surfaces the tools we’re missing. When you design the capability set by hand, you give the ants what you think they’ll need, and your blind spots become theirs: there’s no way for the colony to want a tool you never imagined. Under selection pressure that changes. When ants repeatedly reach a situation they can’t resolve with what they have, that gap stops being invisible. It shows up as a pattern of failure clustered around the same kind of incident, and the shape of the failure tells you what’s absent. The colony can’t reach for clearing a queue if no ant has ever been able to clear a queue, but the run of incidents where that’s exactly what was needed points straight at the missing capability. Evolution doesn’t invent the tool, but it tells you, unambiguously, that you need one and roughly what it should do.
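One way such a cluster of failures could be made visible is a simple count of unresolved incidents by kind. This is a sketch only: the `Incident` record, the `kind` labels, and the `min_cluster` cut-off are assumptions for illustration, not how the colony actually records its history.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    kind: str        # hypothetical label for the type of incident
    resolved: bool   # did the colony restore service on its own?


def unresolved_clusters(history: list[Incident],
                        min_cluster: int = 3) -> list[tuple[str, int]]:
    """Return incident kinds that repeatedly went unresolved, most frequent first."""
    misses = Counter(i.kind for i in history if not i.resolved)
    return [(kind, n) for kind, n in misses.most_common() if n >= min_cluster]


# Made-up history for the example.
history = [
    Incident("queue backed up", False),
    Incident("blocked query", True),
    Incident("queue backed up", False),
    Incident("disk nearly full", True),
    Incident("queue backed up", False),
    Incident("blocked query", False),
]

print(unresolved_clusters(history))  # [('queue backed up', 3)]
```

The output names the gap; deciding that the missing tool is “clear a queue” is still a human reading of the pattern.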

The second is that it sharpens the judgement around when to act. Remember from earlier that a capability is an action plus a sense of the circumstances it’s right for — and that sense is the hard part to get right by hand. Our first attempts at describing when to reach for a capability are always a little off: too eager, too cautious, right about the obvious cases and wrong about the edges. Selection does the tuning we can’t do from a whiteboard. The descriptions that lead to good outcomes survive and propagate; the ones that fire at the wrong moment don’t. Over generations the colony converges on a far better articulation of when than we wrote down originally — not because any single ant reasoned it out, but because the population kept what worked.
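The selection step can be sketched as keeping the descriptions with the best track record and letting them fill the next generation. The post does not say how selection is implemented, so treat this as one plausible shape: the truncation rule, the `keep` fraction, and the outcome lists are all assumptions, and the sketch leaves out how new variants of a description would arise.

```python
from itertools import cycle, islice


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of uses of a description that led to a good outcome."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def next_generation(track_record: dict[str, list[bool]],
                    keep: float = 0.5) -> list[str]:
    """Keep the best-performing descriptions and copy them to refill the population."""
    ranked = sorted(track_record,
                    key=lambda d: success_rate(track_record[d]),
                    reverse=True)
    survivors = ranked[: max(1, int(len(ranked) * keep))]
    return list(islice(cycle(survivors), len(ranked)))


# Hypothetical "when to reindex" descriptions with made-up outcomes.
track_record = {
    "reindex whenever anything is slow": [True, False, False, False],
    "reindex when lookups degrade after heavy updates": [True, True, True, False],
    "reindex only as a last resort": [True, False],
    "reindex when read latency grows but writes are healthy": [True, True, False, False],
}

for description in next_generation(track_record):
    print(description)
```

With these invented numbers the over-eager description drops out and the better-performing ones are inherited, which is the tuning the paragraph above describes.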

Put those together and the picture is clear: evolution isn’t making the ants smarter, it’s making the colony more complete. It closes the gaps in the toolkit and refines the judgement around the tools already there. The cleverness was always in the reasoning. What evolution adds is the relentless, population-scale feedback that tells us where the reasoning had nothing to work with.

Spiders and the library

Evolution works on what the colony already has. Looking at execution history can tell you a tool is missing, and it can find patterns, but it can’t read the manual and hand you the answer. Bringing genuinely new knowledge in from the outside is a different job, and it belongs to a different team. We call them spiders, and some of what they do is read product documentation and turn it into something the ants can inherit.

The tempting version of this is also the wrong one, though. Faced with a pile of product docs, the obvious move is to stuff the relevant pages into the ants’ context: cram every fact about the subsystem into the prompt and trust the ant to use it. That doesn’t work, and it fails in a way worth understanding. An ant drowning in product detail isn’t more capable; it’s less. The signal it actually needs to reason well is buried under everything that happened to be documented, and more text is not more judgement. You end up with an agent that has the information and still doesn’t weigh it when it counts.

What the spiders do instead is pull on the threads that matter. They read the documentation the way a senior engineer reads it — not memorising it, but noticing the few things that genuinely change how you should act. This subsystem silently drops writes under backpressure. That setting looks cosmetic but reorders everything downstream. These two services share a lock you’d never guess from the topology.

The spider’s product isn’t a summary of the docs; it’s a small set of considerations distilled to the point where an ant will reliably bring them to bear when judging a situation. It enriches both halves of what we’ve been talking about, the world model and the judgement, but the discipline is the same in both: surface the load-bearing thread, leave the rest in the manual or the LLM where it belongs. And because this lands in what the next generation inherits rather than in a single ant’s prompt, it compounds. A thread a spider pulls today becomes part of how the colony reasons tomorrow, without anyone having to remember to wire it in.

Restore first, investigate second

Remember, through all of this, what the colony is actually for. The ants restore service. When something is broken in production, the job is to make it work again: quickly, safely, and without a human in the loop for the routine cases. That’s the goal everything above is in service of.

It’s worth saying because it’s easy to lose, especially once you start talking about judgement and evolution and distilled knowledge. None of that is the ants doing deep root-cause analysis. Restoring service and explaining why it broke are different jobs on different clocks: one is measured in minutes and is about stopping the bleeding, the other can take days and wants patience and forensic care.

Restoring service always comes first. When the colony spots that the second job is needed, that something deeper is going on and a human should dig in once the fire’s out, its role is to flag that clearly, not to chase it. Knowing the difference, and staying on the right side of it, is part of what makes the ants trustworthy enough to let loose on production at all. We have other, related systems that dig into root cause, but they are separate.

Where this points

If there’s one thing the ants have taught me, it’s that the leverage was never in anticipating problems. It was in describing what we know how to do, handing that to something that can reason, and getting out of its way. Every time we’ve been tempted to add control (another branch, another sequence, another rule for a case we’re worried about), we’ve made the colony worse. Every time we’ve added capability and judgement instead, it’s gotten better. That’s the lesson, and it keeps being the lesson.

There’s a lot still ahead. The learning loops will get tighter — faster feedback from outcomes back into what the next generation inherits. The ants will get better at knowing their own limits, raising the high-risk, high-blast-radius actions to a human instead of acting alone — judgement includes knowing when not to trust your judgement.

None of this changes the shape of the thing, though. It’s still a colony of simple agents, each holding a few capabilities and the judgement to use them, coordinating through a shared space rather than a controller, getting better across generations.

Right now, the ants are doing well. Mostly, the job is to keep resisting the urge to tell them what to do.
