It's really a search problem...
- Cynthia Unwin

- Jan 1
- 3 min read
Or, more specifically, it's a knowledge synthesis problem.
Building AI agents or agent teams for AIOps systems isn't hard. Even if you build them from scratch, a bit of Python and an API key gets you a piece of non-deterministic software that can legitimately do some cool things. It can even do some smart things. The trick is getting it to do consistently useful things, and this is much harder. Many things contribute to this, from choosing the right model, adding the right guardrails, and understanding your real goals, to getting your prompts just right. But while all of these things (and more) are important, the core bottleneck is getting the agent the information it needs to make a good decision.
I'm not talking about having the right MCP server to expose information for an agent to use. Exposing information is reasonably easy. Exposing the appropriate and correct information for a specific context is hard. Synthesizing understanding from fragments of data is hard. Knowing what information is relevant and correct at a specific time, in a specific context, and under a specific set of conditions is a difficult problem, especially when that time, context, and those conditions cannot be fully predicted during design.
Let us start with the anchoring hypothesis that successful, scalable AIOps systems that maintain a stable, productive, and secure system state will, out of necessity, follow a choreography pattern rather than an orchestrator pattern, with minimal centralized control (Ants vs Armies). If we assume that, like ants, simple agents with good communication infrastructure will outperform sophisticated agents in isolation, we need to make sure that each agent (ant) knows what it needs to know, when it needs to know it. We also need to be careful not to flood the agent with irrelevant information, starve its decision-making process by not providing enough information, or skew its decisions by providing out-of-context information. While not providing necessary information is an obvious problem, providing wrong, irrelevant, or out-of-context information is every bit as problematic, because everything the agent is given shifts the outcome of its decision-making process. Agents can identify irrelevant information in some scenarios, but they don't reliably ignore it. If you don't believe me, read this excellent paper: https://arxiv.org/abs/2408.10615
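To make the choreography pattern concrete, here is a minimal sketch (the `EventBus`, topic names, and agent roles are my own illustration, not a reference design): deliberately simple agents that know only their own topic, coordinating through shared events rather than a central orchestrator.

```python
from collections import defaultdict

class EventBus:
    """Minimal pub/sub channel: agents coordinate through shared events,
    with no central orchestrator deciding who acts next."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

class Agent:
    """A deliberately simple agent (an 'ant'): it only knows the topic it
    listens to and the topic it emits back onto the bus."""
    def __init__(self, name, bus, listens_to, emits=None):
        self.name, self.bus, self.emits = name, bus, emits
        bus.subscribe(listens_to, self.handle)

    def handle(self, event):
        # React locally, then share the result; no agent sees the whole system.
        result = {"source": self.name, "observed": event}
        if self.emits:
            self.bus.publish(self.emits, result)

bus = EventBus()
Agent("detector", bus, listens_to="metric.anomaly", emits="incident.open")
Agent("triager", bus, listens_to="incident.open", emits="incident.triaged")

log = []
bus.subscribe("incident.triaged", log.append)
bus.publish("metric.anomaly", {"cpu": 0.97})  # detector -> triager -> log
```

The point of the sketch is the shape, not the plumbing: each agent's usefulness depends entirely on whether the events reaching it carry the right information at the right time.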
This is a hard problem for a variety of reasons. Finding fragments of relevant information is achievable given a good understanding of the problem being solved, but reliably synthesizing understanding from those fragments in real time is more difficult. It is difficult because:
Your state space grows exponentially as variables are added. That is, as a system becomes more interconnected and more components influence real-world state, the amount of data to query becomes larger and more ambiguous.
It is easier to identify that you have found a relevant piece of information than it is to identify that you didn't find one.
Discrete fragments such as incident data are point-in-time records; they don't reflect context changes that occur over time.
It's difficult to tell whether the agent's synthesis of data is correct before it takes action (in some cases it can take days to discover that an action led to a less-than-optimal outcome). In addition, system restoration often requires multiple discrete actions to produce the desired outcome.
As agents work and share information, noise accumulates and clear communication signals inevitably degrade.
Even in very high-performing, meticulously documentation-conscious enterprises, 'tribal knowledge' is incredibly pervasive (and probably impossible to eliminate), meaning that full context is almost never available.
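The first point above, state-space growth, is easy to underestimate. A back-of-the-envelope sketch (the component and state counts are invented purely for illustration): the joint state space is the product of per-component states, so each added component multiplies it rather than adding to it.

```python
# Hypothetical numbers for illustration: each component has a handful of
# observable states; the joint state space is the product of all of them.
components = {"service": 3, "db": 3, "cache": 2, "network": 4}

joint_states = 1
for states in components.values():
    joint_states *= states

print(joint_states)       # 3 * 3 * 2 * 4 = 72 joint states

# Adding one more 4-state component multiplies the space, it doesn't add:
print(joint_states * 4)   # 288
```

With realistic component counts the joint space quickly dwarfs anything an agent can query exhaustively, which is why curation matters more than raw access.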
Observability, CMDB, incident history, topology, policies, etc. are all necessary data that must be exposed to agents by an operational data fabric in an AIOps environment, but they are not sufficient. Forcing a non-deterministic system to synthesize meaning on the fly from fragments of data will not be consistent enough to create really useful AIOps, especially if we seek to solve problems that were not anticipated during design (which is core to the effectiveness of an intelligent system). While experimentation and experience will need to drive the shape of the full solution, from where we stand now it seems that to be successful we will need systems that create and maintain an operational data synthesis layer, one that takes over some of the harder tasks of curating knowledge for AIOps agents. We need to stop treating incidents or anomalies as independent search problems and begin to assemble an agent-usable, persistent operational understanding of an enterprise, so that context and situational awareness are available to agents. Expecting agents to consistently synthesize understanding from raw data on the fly feels a lot like letting agents write automation code on the fly and execute it in production without testing, which is not something I would suggest actually doing.
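One minimal way to picture such a synthesis layer (the names and structure here are my own sketch under stated assumptions, not a reference design): a persistent store that accumulates fragments per entity and serves a curated, time-aware context, rather than handing agents raw query results.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Fragment:
    entity: str            # e.g. a service name
    kind: str              # "incident", "topology", "policy", ...
    content: str
    observed_at: datetime

@dataclass
class SynthesisLayer:
    """Sketch of a persistent operational context store: it curates what an
    agent sees instead of making the agent search raw data on the fly."""
    fragments: list = field(default_factory=list)

    def ingest(self, fragment: Fragment):
        self.fragments.append(fragment)

    def context_for(self, entity: str, now: datetime, max_age: timedelta):
        # Curate: only this entity, only fragments fresh enough to still
        # plausibly reflect current context.
        return [f for f in self.fragments
                if f.entity == entity and now - f.observed_at <= max_age]

layer = SynthesisLayer()
now = datetime(2025, 1, 1, 12, 0)
layer.ingest(Fragment("payments", "incident", "latency spike", now - timedelta(hours=2)))
layer.ingest(Fragment("payments", "incident", "disk alert", now - timedelta(days=30)))
layer.ingest(Fragment("billing", "topology", "depends on payments", now))

ctx = layer.context_for("payments", now, max_age=timedelta(days=7))
# Only the fresh, on-entity fragment survives curation.
```

A real layer would need far more (relationship synthesis across entities, confidence scoring, decay rather than hard cutoffs), but even this toy version shows the shift: the filtering judgment moves out of each agent and into shared, persistent infrastructure.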