Why AIOps Needs to Think Like an Ant Colony

Cynthia Unwin
Mar 7

We SREs live in a world of black swans and thundering herds. Cascading failures and unknown unknowns are our territory. In this world, orchestration is fragile and doesn't scale. We don't just need resilience; we need adaptability. Adaptability does not come from control. It comes from the ability to assess and react in novel situations, and that is the realm of choreography. For us to lead in the AIOps space, we need to cede the control that comes with pervasive orchestration and embrace the emergent behaviour of event-driven systems.

The concept of ant-based AIOps systems leans on blackboard architecture, stigmergy, and event-based patterns to build an LLM-forward system that is resilient, adaptive, and scalable. These aren't new ideas. They're old ideas that have been waiting for the right substrate — and LLMs are that substrate.

The blackboard: an architecture that was right too early

The blackboard architecture was born in the early 1970s at Carnegie Mellon, inside a system called HEARSAY-II. The problem was continuous speech understanding — getting a computer to parse human speech by fusing knowledge from acoustics, phonetics, syntax, semantics, and pragmatics all at once. No single knowledge source could solve the problem alone, and no one could predict in advance which combination of insights would crack a given utterance.

The solution seems disarmingly simple: put a shared data structure — the blackboard — in the middle. Let independent knowledge sources read from it and write to it. Each source watches the shared state and steps up when it recognizes something relevant to its expertise. Think of it as a group of specialists standing around a blackboard in a room, each contributing when they have something useful to add, building on what others have written.

Barbara Hayes-Roth formalized this into a consistent architecture in her 1985 paper "A Blackboard Architecture for Control." Three components: the blackboard itself — a structured shared memory holding the evolving solution state; knowledge sources — independent modules with specific expertise that don't need to know about each other; and a control component that decides what to attend to next. The architecture spread to sonar interpretation, protein analysis, military vehicle monitoring, and manufacturing planning. Then the AI winter hit, and the blackboard model got filed away as a relic of an older time.
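
To make those three components concrete, here is a minimal sketch in Python. It is not HEARSAY-II or Hayes-Roth's design; the class names, the priority-based control rule, and the toy "sounds to words to phrase" pipeline are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Blackboard:
    """Structured shared memory holding the evolving solution state."""
    entries: dict = field(default_factory=dict)


@dataclass
class KnowledgeSource:
    """An independent specialist; it knows the blackboard, not its peers."""
    name: str
    can_contribute: Callable[[Blackboard], bool]
    contribute: Callable[[Blackboard], None]
    priority: int = 0


def control(board: Blackboard, sources: list) -> Optional[KnowledgeSource]:
    """Control component (assumed rule): pick the highest-priority ready source."""
    ready = [s for s in sources if s.can_contribute(board)]
    return max(ready, key=lambda s: s.priority) if ready else None


def run(board: Blackboard, sources: list, max_cycles: int = 20) -> list:
    """Ask control who goes next, let that source write, stop when nobody is ready."""
    trace = []
    for _ in range(max_cycles):
        chosen = control(board, sources)
        if chosen is None:
            break
        chosen.contribute(board)
        trace.append(chosen.name)
    return trace


# Toy pipeline: raw "sounds" become words, and words become a phrase.
sources = [
    KnowledgeSource(
        "phonetics",
        lambda b: "sounds" in b.entries and "words" not in b.entries,
        lambda b: b.entries.update(words=b.entries["sounds"].split("|")),
    ),
    KnowledgeSource(
        "syntax",
        lambda b: "words" in b.entries and "phrase" not in b.entries,
        lambda b: b.entries.update(phrase=" ".join(b.entries["words"])),
        priority=1,
    ),
]

board = Blackboard({"sounds": "restart|the|service"})
trace = run(board, sources)
print(trace, board.entries["phrase"])
```

Notice that neither knowledge source references the other. The only thing that knows about both is `control`, and that is the piece worth keeping an eye on.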

But here's the thing: the blackboard architecture didn't fail. It was premature. The knowledge sources you could plug into it in the 1980s were brittle rule-based expert systems and hand-coded heuristics. While the architecture was sound, the pieces were expensive to build and fragile in practice. The blackboard was just waiting for knowledge sources that could reason flexibly, handle ambiguity, and operate across domains.

Agents using LLMs and tools are exactly those knowledge sources.

The 2025 revival: the data backs this up

Two papers published in 2025 put real numbers behind the blackboard's return.

Han et al. proposed LbMAS in July 2025: a system where LLM agents with various roles share all information through a blackboard during problem-solving. Agents are selected to act based on what's currently on the blackboard, and the cycle repeats until consensus emerges. Their system achieved performance competitive with state-of-the-art multi-agent approaches while spending fewer tokens. Fewer tokens, comparable results.

Salemi et al. went further in October 2025, applying the blackboard paradigm to data science tasks requiring information discovery across large data lakes. In their framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents volunteer to respond based on their capabilities. The results: 13% to 57% improvement in end-to-end task success over master-slave multi-agent paradigms. The key finding — and the one that should matter most to anyone building agent systems — is that this approach eliminates the need for a central coordinator to have prior knowledge of all sub-agents' expertise. You don't need to know what your agents can do ahead of time. They decide for themselves whether they can contribute. That's a fundamentally different design assumption, and it allows systems to scale.

Stigmergy: removing the last layer of control

But here's where it gets interesting for us. Even in these revived blackboard systems, there's still a control component. Something still decides who goes next. There's still a scheduler, a decider, a turn-taking protocol. The blackboard decentralizes knowledge but it doesn't fully decentralize control.

Stigmergy removes that final layer.

The term was coined by Pierre-Paul Grassé in 1959 while studying termite nest-building. He observed that partial constructions triggered further building activities without any direct communication between individual termites. The work itself — the trace left in the environment — was the signal. No coordinator. No scheduler. No master plan. The environment is the coordination mechanism.

In an ant colony, a foraging ant finds food and leaves a pheromone trail. Another ant detects the pheromone and follows it. That second ant reinforces the trail. Over time, the shortest path between the colony and the food source emerges — not because anyone designed it, but because the environment itself became the coordination layer. No messages passed between agents. No central scheduler. No single point of failure.
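
The trail dynamics can be sketched in a few lines of Python. This is a deliberately simplified, deterministic model, not a faithful ant simulation: it tracks the share of traffic on each path rather than individual ants, and the evaporation rate, the step count, and the rule that a path receives new pheromone in proportion to its traffic divided by its length (shorter round trips lay scent more often) are all assumptions.

```python
def simulate_trails(lengths, steps=200, evaporation=0.1):
    """Return each path's final share of pheromone under a mean-field model."""
    pheromone = [1.0] * len(lengths)  # every path starts equally attractive
    for _ in range(steps):
        total = sum(pheromone)
        # Ants pick a path in proportion to the scent already on it.
        shares = [p / total for p in pheromone]
        # Old scent fades; new scent is laid faster on shorter paths.
        pheromone = [
            (1.0 - evaporation) * p + share / length
            for p, share, length in zip(pheromone, shares, lengths)
        ]
    total = sum(pheromone)
    return [p / total for p in pheromone]


# Two routes to the same food source; the second is twice as long.
shares = simulate_trails(lengths=[1.0, 2.0])
print([round(s, 3) for s in shares])
```

Nothing in the loop compares the two paths or picks a winner. The short path comes to dominate purely because reinforcement outpaces evaporation there and falls behind it on the long one.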

Now map that onto the blackboard model. Take the blackboard — the shared state where knowledge sources contribute and observe — and remove the control component entirely. Instead of a scheduler deciding which agent goes next, agents autonomously decide whether and when to participate based on what they perceive in the shared state. The blackboard stops being a managed workspace and becomes a stigmergic environment — a shared medium that agents modify through their contributions and observe to determine their own next actions.
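
Here is a minimal sketch of that move, under stated assumptions: the `Agent` class, the trace names, and the polling loop are hypothetical, and a real system would replace the hard-coded `needs`/`leaves` rule with an LLM deciding whether it has something to add. The point is only that there is no function deciding who goes next.

```python
import random


class Agent:
    """Hypothetical agent: reacts to a trace it perceives, leaves a new trace."""

    def __init__(self, name, needs, leaves):
        self.name, self.needs, self.leaves = name, needs, leaves

    def step(self, environment: set) -> bool:
        # The agent alone decides whether to act, based only on what it perceives.
        if self.needs in environment and self.leaves not in environment:
            environment.add(self.leaves)
            return True
        return False


def settle(environment: set, agents: list, seed: int) -> set:
    """Let agents poll the shared environment in arbitrary order until it is quiet."""
    rng = random.Random(seed)
    changed = True
    while changed:
        rng.shuffle(agents)  # no agent is privileged; order is deliberately arbitrary
        # A list (not a generator) so that every agent gets its turn each round.
        changed = any([agent.step(environment) for agent in agents])
    return environment


def make_agents():
    return [
        Agent("detector", needs="alert", leaves="anomaly"),
        Agent("correlator", needs="anomaly", leaves="suspect"),
        Agent("diagnoser", needs="suspect", leaves="root_cause"),
    ]


# Five runs with five different arbitrary orderings.
results = [settle({"alert"}, make_agents(), seed) for seed in range(5)]
print(sorted(results[0]))
```

Every ordering settles on the same final state, because sequencing comes from what is in the environment rather than from a schedule. Adding a fourth agent means appending it to the list; nothing else has to learn that it exists.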

This is the move from blackboard-with-controller to pure choreography. And this is what makes the system truly adaptive. When a novel situation hits — a failure mode no one anticipated, a cascading interaction no one mapped — an orchestrated system breaks because the orchestrator doesn't have a playbook for this scenario. A stigmergic system adapts because agents respond to the state of the environment, not to instructions. They don't need a playbook. They need perception and the ability to act on what they perceive. That's what LLMs give us.

We've already learned this lesson once

This should sound familiar. The enterprise software community arrived at this same conclusion through an entirely different path — and we lived through the transition.

Remember the ESB? I bet you still support one. The Enterprise Service Bus was the great centralized orchestrator of the SOA era. Every service talked to the bus. The bus knew the routes. The bus managed the transformations. The bus was the god-object that understood every service's capabilities and every possible interaction.

And it broke. At scale, under load, in the face of changing requirements, the ESB became a bottleneck, a single point of failure, and a change management nightmare. We spent a decade learning that lesson. We moved from ESBs to event-driven architectures. From centralized orchestration to choreography. From monolithic coordinators to domain-driven, loosely coupled services reacting to events. Kafka, event sourcing, CQRS, saga patterns: the entire modern microservices playbook is built on the principle that loosely coupled components reacting to environmental signals scale better than centralized coordination.

The AIOps agent community is repeating the same architectural mistake. LangGraph uses graph-based state machines with a central coordinator. CrewAI assigns a manager agent to delegate tasks. AutoGen routes everything through conversation-driven orchestration. Every major framework assumes that agents need to be told what to do, and when. These are ESBs wearing LLM clothing.

The production teams I talk to are already hitting walls. Orchestrated agent systems work beautifully in demos. They fall apart in production for exactly the same reasons centralized microservice orchestration fell apart: the coordinator becomes a bottleneck, a single point of failure, and an increasingly complex god-object that has to understand every agent's capabilities and every possible interaction.

This is our playbook

For those of us in operations, the pattern for choreography-based agent systems is hiding in plain sight. Think about how incident response actually works in a well-run environment. An event fires — an alert, a log anomaly, a metric threshold breach. Specialized responders observe the events relevant to their domain. Each responder contributes their findings to a shared space — a war room, a shared timeline, an incident channel. Other responders react to the updated shared state. Resolution emerges from the collective contributions, not from a single coordinator directing every action.

This is stigmergy. This is choreography. The shared incident timeline is the pheromone trail.

Now replace those human responders with AI agents. A log-analysis agent that observes log anomaly events. A metrics agent that watches for correlated metric deviations. A topology agent that maps blast radius. A remediation agent that watches for confirmed root causes and proposes actions. No master agent directing traffic. Each agent autonomously decides whether to participate based on the events and shared state it observes. Resolution emerges.
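
As a rough sketch of that arrangement, assume a shared incident timeline that doubles as a topic-based event stream. Everything here is hypothetical: the topic names, the payloads, and the agent bodies, which stand in for LLM calls. The remediation agent also uses a simplification, treating the presence of both a log finding and a metric finding as a confirmed root cause.

```python
from collections import defaultdict, deque


class IncidentTimeline:
    """Shared timeline plus event stream; it delivers events but directs no one."""

    def __init__(self):
        self.findings = []                    # append-only shared state
        self.subscribers = defaultdict(list)  # topic -> handlers
        self.pending = deque()

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        self.findings.append((topic, payload))
        self.pending.append((topic, payload))

    def drain(self):
        """Deliver queued events to whoever subscribed, until the stream is quiet."""
        while self.pending:
            topic, payload = self.pending.popleft()
            for handler in self.subscribers[topic]:
                handler(self, payload)


def log_agent(timeline, alert):
    # Stand-in for an LLM reading logs around the alert.
    timeline.publish("log.anomaly", {"service": alert["service"], "signal": "OOM kills"})


def metrics_agent(timeline, alert):
    timeline.publish("metric.deviation", {"service": alert["service"], "signal": "memory climbing"})


def topology_agent(timeline, anomaly):
    timeline.publish("blast.radius", {"service": anomaly["service"], "dependents": ["checkout"]})


def remediation_agent(timeline, _event):
    # Assumed rule: act once both kinds of evidence are on the timeline, and only once.
    topics = {topic for topic, _ in timeline.findings}
    if {"log.anomaly", "metric.deviation"} <= topics and "remediation.proposed" not in topics:
        timeline.publish("remediation.proposed", {"action": "restart with higher memory limit"})


timeline = IncidentTimeline()
timeline.subscribe("alert", log_agent)
timeline.subscribe("alert", metrics_agent)
timeline.subscribe("log.anomaly", topology_agent)
timeline.subscribe("log.anomaly", remediation_agent)
timeline.subscribe("metric.deviation", remediation_agent)

timeline.publish("alert", {"service": "payments"})
timeline.drain()
topics = [topic for topic, _ in timeline.findings]
print(topics)
```

The timeline knows topics, not agents. Each agent declares what it watches and decides for itself whether the current state warrants a contribution, which is why the remediation agent can be woken twice and still propose only once.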

Three independent intellectual traditions (classical AI's blackboard systems, biological stigmergy, and enterprise event-driven choreography) all converged on the same architectural principle: decentralized coordination through shared environmental state outperforms centralized orchestration as systems grow in complexity. We don't need to theorize about this. The blackboard researchers demonstrated it in the 1980s. The ants have been demonstrating it for more than 100 million years. And we proved it ourselves when we burned down the ESB and built event-driven microservices in its place.

The only question is whether we'll apply what we already know, or whether we'll spend another decade relearning it with agents.