Framing the Problem

Cynthia Unwin
Jan 8

Following through on the fundamental assumption that the key to solving a problem is understanding what the problem is, and being able to ask the right questions to expose a path to a solution, it's time to step back and quickly review what we, the AI Enabled SRE community, are talking about when we discuss AIOps.

When we look at IT Operations through the lens of how we implement AIOps, we deal with a few types of problems:

Processes: Things for which there is a process with steps that need to be followed. These are orchestratable problems like incident routing, patch management, code release, change approval, landing zone inflation, and code review: things with clear playbooks. These problems can be quite complex, but they follow a known path and need to follow a process. They can benefit from AI in the form of advanced pattern matching, information processing, multi-process assembly, natural language interfaces, and so on. These are the types of problems that are being successfully tackled with AI in Enterprises right now. They can be reasonably easy or very hard, but we succeed with them when the problem space both can be, and is, well understood. There is a huge amount of very valuable work to do in this space.
State Management: The next type of problem is what I call a state problem. This is a lot of what we do in IT operations. We maintain the state of a system so that it can perform the tasks it was designed to perform. This is the realm of resiliency and the non-functional requirement. As AI Enabled SRE this is our core space; this is the realm of managing complex systems. Because we work with complex, dynamic, interdependent systems, the problem space is in some ways unknowable. Or more to the point, for practical purposes it is unknowable, as there will always be an unconsidered external factor, a hidden dependency, a next problem or black swan event that we did not foresee, or another feedback loop that we failed to consider. I am going to strenuously argue that because of this intrinsic unknowability, this problem space is not best served by orchestration, because orchestration in an undefined problem space lacks resilience. For AI in this space to meet our expectations, it needs to be able to handle novel situations that were not imagined at design time. People will argue with me about this (please do so in the comments); however, if we look at natural systems that are undeniably complex, we don't see orchestration, we see choreography and shared information spaces. Natural systems manage exceedingly complex, dependent processes without orchestration, and complex behavior is emergent. It follows from that observation that if we want to match this management of complexity at scale in Enterprise platforms, we need to stop trying to orchestrate everything. We are doomed to create fragile, maintenance-intensive systems if we leave the responsibility for maintaining state to systems that seek to fit all conditions into a predefined orchestrated pattern.
I fully understand that these choreographed systems will need to make use of orchestrated sub-processes to complete their tasks. This allows for governance and testing, but the sub-processes will be selected sub-tasks invoked by the choreographed system. For example, an agent that responds to a local threshold and decides on the appropriate in-context action will be better served by selecting and executing a known and tested Ansible playbook than by creating a script on the fly. Allowing the agent to select the appropriate playbook for the situation still lets it make decisions based on context and situational awareness.
Error Management: The third group is probably a sub-group of the two above, but I'm calling it out here as a separate group because I think it is worth thinking about on its own. This is software error handling: what we do within software to manage unforeseen conditions. Think of this as next generation error handling. This is the category that deals with issues that aren't systemic. These are data issues, weird exceptions, things that happen within the software while it runs in the overall environment. These are not errors that are interesting from a platform perspective, but they are errors that chew through time from a support perspective. I propose that we shift how we look at these. Most CRUD operations executed by business systems are well served by deterministic code that can be tested and behaves in a predictable fashion. However, traditional software falls short when it comes to error handling, as in many cases the best available course of action is a choice between retrying and failing gracefully. If one shifted this to AI enabled error handling, an agent (or colony of agents) could be invoked to assess a problem in real time and in some cases provide a better solution. For example, an AI enabled API could query a back-end system to get additional information, invoke a different sub-process, strip problematic characters, or even just log better errors based on the real-time situation and context. We have an opportunity to rethink and reset our expectations about how software handles errors in all of the software that we design.
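As a rough illustration of that last idea, here is a minimal Python sketch of error handling that consults an "agent" only when a deterministic operation fails. Everything in it is an assumption made for illustration: `choose_remedy` is a rule-based stand-in for whatever an actual agent would do, and `ErrorContext`, `run_with_handling`, and `store_record` are hypothetical names, not part of any existing library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ErrorContext:
    """What the handler knows about a failure (hypothetical shape)."""
    payload: str
    error: Exception
    attempts: int


def choose_remedy(ctx: ErrorContext) -> str:
    """Stand-in for the agent's assessment of the failure.

    A real agent would reason over far richer context; these rules are
    illustrative assumptions only.
    """
    if isinstance(ctx.error, TimeoutError):
        return "retry"
    if not ctx.payload.isprintable():
        return "sanitize"
    return "fail_gracefully"


def strip_problem_characters(payload: str) -> str:
    """Drop non-printable characters from the payload."""
    return "".join(ch for ch in payload if ch.isprintable())


def run_with_handling(operation: Callable[[str], str], payload: str,
                      max_attempts: int = 3) -> dict:
    """Run a deterministic operation, consulting the agent only on failure."""
    error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "result": operation(payload)}
        except Exception as exc:
            error = exc
            remedy = choose_remedy(ErrorContext(payload, exc, attempt))
            if remedy == "sanitize":
                payload = strip_problem_characters(payload)
            elif remedy == "fail_gracefully":
                break
            # "retry" simply falls through to the next attempt
    return {"status": "failed", "detail": f"{type(error).__name__}: {error}"}


def store_record(payload: str) -> str:
    """Hypothetical deterministic CRUD step that rejects control characters."""
    if not payload.isprintable():
        raise ValueError("payload contains control characters")
    return f"stored:{payload}"


print(run_with_handling(store_record, "abc\x00def"))
```

The happy path stays fully deterministic and testable; the agent is only asked for a decision once something has already gone wrong, and the attempt limit keeps a bad decision from looping forever.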

What the problem space demands is a layered operational model: orchestrated processes providing governed, repeatable workflows for known patterns; choreographed systems enabling resilient state management through contextual awareness and emergent behavior; and embedded AI capabilities transforming how we handle exceptions and edge cases within applications themselves. Each layer operates with different assumptions about knowability and control, yet they integrate to create operational environments that are simultaneously more reliable and more adaptive than what we can achieve with one paradigm alone.
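To make the state-management layer a little more concrete, the earlier example of an agent choosing a known, tested Ansible playbook rather than writing a script on the fly could be sketched roughly as follows. This is a sketch under stated assumptions: the catalog, signal names, and playbook paths are all hypothetical, and the selection rule is a simple stand-in for the contextual reasoning an actual agent would apply. The command is only assembled, not executed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Playbook:
    """A governed, pre-tested playbook the agent is allowed to choose."""
    name: str
    path: str              # hypothetical path in a playbook repository
    handles: frozenset     # signals that must all be present for it to apply


# Hypothetical catalog; in practice this would be curated and versioned.
CATALOG = [
    Playbook("rotate_logs", "playbooks/rotate_logs.yml",
             frozenset({"disk_usage_high", "log_volume_growing"})),
    Playbook("expand_volume", "playbooks/expand_volume.yml",
             frozenset({"disk_usage_high"})),
    Playbook("restart_service", "playbooks/restart_service.yml",
             frozenset({"health_check_failing"})),
]


def select_playbook(signals: set, catalog=CATALOG) -> Optional[Playbook]:
    """Pick the most specific playbook whose conditions are all observed.

    Returns None when nothing in the catalog fits, which is the cue to
    escalate rather than improvise.
    """
    candidates = [pb for pb in catalog if pb.handles <= signals]
    if not candidates:
        return None
    return max(candidates, key=lambda pb: len(pb.handles))


def build_command(playbook: Playbook, host: str) -> List[str]:
    """Assemble (but do not run) the ansible-playbook invocation."""
    return ["ansible-playbook", playbook.path, "--limit", host]


observed = {"disk_usage_high", "log_volume_growing"}
choice = select_playbook(observed)
if choice is not None:
    print(build_command(choice, "web-01"))
```

The agent's freedom is limited to choosing among governed options: the decision is contextual, while the action itself remains an orchestrated, auditable sub-process.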

Shifting to AI Enabled SRE isn't just about adding new tools to our existing practices; it's about fundamentally rethinking how we build resilience into complex systems. Our challenge is to translate the principles outlined above into enterprise platforms while maintaining the governance, auditability, and predictability that business operations require.

1 Comment

Kamesh singh

Jan 09

Very well articulated. My one cent - When things get too complicated, gauge how much technology you actually need. Let necessity be the guide. Thus AI driven Ops (AIOps) becomes part of business problem (operating model), after all its true value is supporting business outcomes.