
Asking the right question

  • Writer: Cynthia Unwin
  • Jan 4
  • 4 min read

"If I had an hour to solve a problem, I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions."

Albert Einstein


I've been thinking about Bas Pluim's comment on my post the other day. It isn't a new thought, but it is a really important one. As I look back on my career, it is clear how much time we (as an industry) spend solving the wrong problem, and Bas's comment about not just achieving a goal but taking the best path to that goal is a warning we need to heed. As we think about autonomous platforms and AIOps systems that maintain state in complex production environments, the question of what a good solution to a problem looks like is not easy to answer. You can resolve an issue in many ways, but how do we identify the best way? And how do we know when to settle for the "good enough" way?


We have a couple (a bunch?) of problems:


  1. The best course of action is not always knowable at the time of a problem, but action needs to be taken. When is it the right action to drop hundreds of users from an instance of a stateful service to save the thousands currently connected across the platform? When is just restarting it the right choice, and when is that just kicking the problem down the road to a possibly less advantageous time?

  2. Sometimes action needs to be taken to expose the real problem.

  3. Sometimes limping is better than not coming back up at all.

  4. There are often multiple ways to skin the proverbial outage cat. Sometimes you can get there from here through many paths, but some are easier than others. How do we define the best solution?

  5. Often multiple discrete actions need to be taken in different places to resolve an issue, and step three can't be known until the outcome of step one is available.

  6. Sometimes things that look like a positive action at the time turn out, after the fact, to have been terrible mistakes.


If we are looking at just restoring service, we are solving a different problem than resolving, or even understanding, root cause. Usually we need to do both, but not on the same timeline. What are the questions we need to ask to tease out the best way to proceed?


For a while now, I have been building AIOps agents that first predict what the outcome of their actions will be, then take the action, then compare the prediction to the outcome. If they match, great, we have a win. If they don't, we need to ask ourselves what information or instruction the agent was missing.
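
A minimal sketch of that loop in Python, assuming a hypothetical Outcome type and an arbitrary tolerance for what counts as a "match" (none of this is my production code):

```python
# Hypothetical sketch of the predict-act-compare loop described above.
# Outcome, predict, act, and the tolerance are illustrative assumptions,
# not a real AIOps API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Outcome:
    metric: str     # e.g. "free_disk_pct"
    value: float

def predict_act_compare(predict: Callable[[], Outcome],
                        act: Callable[[], Outcome],
                        tolerance: float = 0.05) -> bool:
    """Predict, act, then compare the prediction to the observed outcome."""
    predicted = predict()
    observed = act()
    matched = (predicted.metric == observed.metric
               and abs(predicted.value - observed.value)
                   <= tolerance * max(abs(predicted.value), 1.0))
    if not matched:
        # A miss means the agent was missing information or instruction;
        # log both values so the gap can be diagnosed.
        print(f"prediction miss: expected {predicted}, got {observed}")
    return matched

# Example: the agent predicts a restart will bring free disk back to 40%.
predict_act_compare(predict=lambda: Outcome("free_disk_pct", 40.0),
                    act=lambda: Outcome("free_disk_pct", 12.0))
```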


This tells us whether the agent is able to create a plan and achieve a predicted goal, but that is different from assessing expected behavior for the system from an external perspective. An agent that accurately predicts that wiping a drive will free space is great, unless you needed that data. An accurate prediction and the next best action are different things. It's an overly simplistic example, but you get my point, especially in a complex system.


For a system to be autonomous, it needs to understand what both a good action and a bad action are: a bad action might lead to a holistically undesirable outcome, or it might be a far more complicated solution than the problem requires, take too long, or be hard to reproduce.
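
One simplistic way to encode those criteria in code (the fields and limits here are illustrative, not a real rubric):

```python
# Illustrative encoding of "good action" vs "bad action"; every field and
# threshold below is an assumption to be tuned per environment.
from dataclasses import dataclass

@dataclass
class ActionAssessment:
    outcome_desirable: bool   # is the holistic outcome acceptable?
    complexity: int           # discrete steps required (lower is better)
    duration_s: float         # time the action takes to execute
    reproducible: bool        # can we reliably repeat it?

def is_good_action(a: ActionAssessment,
                   max_complexity: int = 5,
                   max_duration_s: float = 300.0) -> bool:
    """An action is "good" only if it passes on all four criteria."""
    return (a.outcome_desirable
            and a.complexity <= max_complexity
            and a.duration_s <= max_duration_s
            and a.reproducible)
```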


If we come back to the ant analogy (that simple agents acting on observations of specific local thresholds and (pheromone-)enriched data within their environments will produce a more scalable and resilient system than highly choreographed super ants), we see that agents need to be capable of adaptive outcomes. No individual agent has an overall strategic view of the situation, and just because an action worked in a similar situation and led to the stabilization of the system (colony) last time does not mean it will work this time. The agent needs to be able to tell the difference and adapt its behavior to achieve the goal.


Ants (and other distributed intelligences) resolve this with some basic rules: an ant usually performs action A when a threshold is crossed, but occasionally does something different to observe the outcome; threshold-based decisions (a quorum of ants sees the situation as requiring the same action); Markov-chain-like state transitions (the probability of moving to the next state depends only on the current state, not any previous state); and no critical dependence on any one ant. This basic pattern drives behavior at scale and the emergent properties of complex biological systems.
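
A toy sketch of those four rules, with invented states, probabilities, and a quorum size:

```python
# Toy model of the ant rules above: occasional exploration, quorum-based
# decisions, Markov-like transitions, and no critical dependence on any
# one agent. All states, rates, and thresholds are made up.
import random

EXPLORE_RATE = 0.05   # occasionally do something different to observe the outcome
QUORUM = 3            # how many agents must agree before an action is committed

# Markov-chain-like transitions: the next state depends only on the current one.
TRANSITIONS = {
    "stable":   {"stable": 0.9, "degraded": 0.1},
    "degraded": {"stable": 0.6, "degraded": 0.3, "failed": 0.1},
    "failed":   {"degraded": 0.5, "failed": 0.5},
}

def next_state(current: str) -> str:
    states, weights = zip(*TRANSITIONS[current].items())
    return random.choices(states, weights=weights)[0]

def agent_vote(threshold_crossed: bool) -> str:
    """Each agent acts on its local observation only, with rare exploration."""
    if random.random() < EXPLORE_RATE:
        return random.choice(["restart", "drain", "wait"])
    return "restart" if threshold_crossed else "wait"

def colony_decision(local_observations: list[bool]) -> str:
    """No single agent is critical: any quorum of matching votes commits."""
    votes = [agent_vote(obs) for obs in local_observations]
    for action in set(votes):
        if votes.count(action) >= QUORUM:
            return action
    return "wait"

print(colony_decision([True, True, True, False, True]))  # usually "restart"
```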


For us to replicate this complexity (or simplicity, depending on your perspective), we need to do much better at asking questions and understanding our problem space. While each environment is different, I think it is a mistake to assume there is nothing to learn from gaining a better understanding of the problem space in general (this is why ants are so successful: they understand the problem space well enough to apply local rules across dynamic environments to achieve complex outcomes).


To prototype and fail fast with this model, we need a low-stakes environment where our agents can test boundaries, build thresholds, and work autonomously to build the information analytics system necessary to evolve in this space, but that environment needs to be diverse enough to prepare our systems for the real world. For now we keep a human in the loop to reduce risk in live environments, but because of the nature of the problem area, it may be difficult to know whether the human in the loop still knows the next best action and has the best prediction of the outcome. How do we know when the ants have outgrown us?
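
One hypothetical way to frame that last question in code: track rolling prediction accuracy for both the agent and the human in the loop, and only consider loosening the loop when the agent has been reliably better over a meaningful window. The window size and margin below are placeholders to tune:

```python
# Hypothetical autonomy gate: compare agent vs. human prediction accuracy
# over a rolling window. The window size and margin are assumptions.
from collections import deque

class AutonomyGate:
    def __init__(self, window: int = 200, margin: float = 0.05):
        self.agent_hits = deque(maxlen=window)
        self.human_hits = deque(maxlen=window)
        self.margin = margin

    def record(self, agent_correct: bool, human_correct: bool) -> None:
        """Record whether each party's outcome prediction matched reality."""
        self.agent_hits.append(agent_correct)
        self.human_hits.append(human_correct)

    def agent_may_act_alone(self) -> bool:
        """The ants have outgrown us only on sustained, superior evidence."""
        if len(self.agent_hits) < self.agent_hits.maxlen:
            return False  # not enough evidence yet
        agent_acc = sum(self.agent_hits) / len(self.agent_hits)
        human_acc = sum(self.human_hits) / len(self.human_hits)
        return agent_acc >= human_acc + self.margin
```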


2026 is going to be a brilliantly interesting year.