Cynthia Unwin

The power of testing your understanding.

I've been waging war against technical debt for the better part of two decades. I've watched good teams drown slowly beneath waves of trivial issues that they knew how to fix but never found time to address. I've watched smart people burn themselves out getting up in the middle of the night for months on end to do something a script could do, if only someone would take the half day to write it and get it tested. I've seen teams run from fire to fire without ever catching a breath in between. I've watched all of this for years, but the chronic problem that I've seen grind more teams into the ground than anything else comes in the aftermath of those events we don't see coming.


It happens on every solution and to every team. We start feeling like we have a platform under control, and then we get blindsided by something that seems completely new and everything falls apart. The team pulls together and restores service; a new equilibrium is achieved. We add a new item to the list of things we know how to deal with, and a new entry goes into our playbooks. We do an RCA. We take action to avoid or manage the problem in the future, but technical debt continues to build. All too often we end up treating the symptoms of the problem rather than the true root cause, or we find something wrong and fix it but never verify that it actually caused our problem in the first place. We make a list of things we should look into and address, but these things end up in the ever-growing backlog. Now our technical debt isn't just trivial items we haven't had time to get to; now our technical debt is hiding unknown unknowns. These things don't just recede into the ether from whence they came. They fester. They crop up in weird places. We see them as an increasing number of tickets that get closed as "one-off incidents", and as subtle (or not so subtle) decreases in system performance and stability. Our customers feel it too: changes take longer and are less stable; we get less nimble.


Restoring service, completing a standard RCA, and creating a process to handle the situation are tactical and necessary. That is the standard process, and it doesn't meet our needs. We need to be strategic; we need to adapt. Platforms that run in complete isolation with no change in their functionality or environment will, in theory, run forever, but it is becoming increasingly rare to see a system like this. I'd argue that the whole concept of steady state is no longer useful, because even systems that are reasonably static in function are, in most cases, connected in some way to an external environment, and that environment is not static. We can't treat platforms like they exist independently; they are all part of a larger, ever-changing, complex system.


This is why we need to be diligent about updating our understanding of our systems by being demanding and thorough when investigating and resolving problems. Completing a comprehensive RCA is hard. It takes thought, and effort, and testing, but if done well it repays that effort many times over by both increasing system resilience and reducing technical debt.


We have all seen RCAs that tie a problem up with a neat bow only to be immediately followed by a recurrence of the original problem. We have all seen RCAs that raise questions that aren't answered. We have all seen RCAs that meet customer requirements by the letter of a contract but don't serve the interests of the platform. How do we avoid this? How do we balance the need to evolve our understanding of failures with the realities of our ongoing, day-to-day work?


There are many answers to this question, but the best one I have ever found is that method-based RCAs should be treated as tools, not as the final step of the problem-resolution process. An RCA isn't a document that you complete to close an issue, and an RCA is not a postmortem. A method-based RCA is a powerful tool for investigation. This seems like an obvious statement, but in practice many teams fail to see the truth of it. Used properly, an RCA organises our investigation and gives structure to our efforts. An RCA should start as soon as a problem is observed and be used to shape how we proceed. While a team's initial focus will always be on restoring service rather than on finding cause, following a logical process increases the chances that we will make good decisions at every stage, even under bad conditions.


However, not just any method will get a team where it needs to go. We need to consistently use a method that starts by stating the problem clearly, creates a hypothesis based on what we know, and then incorporates testing to prove that our understanding is correct before we move on to proposing a solution. While there are many different methods, Kepner-Tregoe is my favourite because it asks a team to really understand a problem by recording what is happening and what isn't before defining a resolution. The questions we ask in this first stage drive the quality of the outcome. Gathering information on what is happening and what isn't happening, what order things happen in, when the problem occurs and when it does not, and then organising that information logically allows a team to discard the initial ideas that miss a factor and so often lead investigations astray. The right question can lead to a good solution; a lack of questions leads to poor decisions.
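To make that concrete, here is a minimal sketch of an "is / is not" problem specification kept as a simple record. It follows the spirit of that first stage, but the field names and every detail in the example are my own hypothetical illustration, not part of the formal method or of any incident described here.

```python
# A rough sketch of an "is / is not" problem specification.
# All services, regions, and numbers below are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ProblemSpec:
    statement: str                                        # one clear sentence naming the problem
    is_observed: dict = field(default_factory=dict)       # what/where/when/extent we DO see
    is_not_observed: dict = field(default_factory=dict)   # what could plausibly be affected but ISN'T


spec = ProblemSpec(
    statement="Checkout API returns 504s for roughly 5% of requests",
    is_observed={
        "what": "504 responses from the checkout service",
        "where": "only the eu-west region",
        "when": "began after the 02:00 deploy window",
        "extent": "about 5% of requests, all POSTs",
    },
    is_not_observed={
        "what": "no 504s from the catalogue or login services",
        "where": "us-east is unaffected",
        "when": "no errors before the deploy window",
        "extent": "GET requests are unaffected",
    },
)

print(spec.statement)
```

Even something this small forces the team to write down what is not happening, which is usually the column that kills a bad first theory.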


Understanding the problem, in most cases, is by far the hardest part of resolving issues. Discerning what are causes, what are symptoms, and what is extraneous noise is the real challenge of troubleshooting and fixing problems.


But before we move from understanding a problem to resolving it there is another crucial step. Once we think we understand a situation we must create a hypothesis about it, and then test it. Confirm that the problem really is what we think it is. We need to take the time to poke holes in our theory, to establish what we are not seeing. We need to look for a single example that contradicts our theory and then expand our understanding to include it. We need to do this relentlessly. We need to ensure we have situational awareness. Complex systems are noisy. It is exceedingly difficult to filter out the noise of other problems, confounding accounts, and exhaustion without actively seeking to disprove our theories. We need to remember that a lack of evidence does not mean something isn't happening, but a single counterexample can prove a theory wrong.
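As a tiny illustration of what "look for a single counterexample" can mean in practice, here is a rough sketch, assuming the relevant observations have already been pulled into simple records; the fields, values, and hypothesis are all hypothetical. The hypothesis is written as a predicate, and one contradicting record is enough to send us back to the problem specification.

```python
# Hypothetical observations gathered during an investigation.
observations = [
    {"service": "checkout", "status": 504, "region": "eu-west", "during_backup": True},
    {"service": "checkout", "status": 504, "region": "eu-west", "during_backup": False},
]


def hypothesis(obs):
    """Hypothesis under test: the 504s only happen while the nightly backup runs."""
    return obs["during_backup"]


# One failing record is all it takes to disprove the theory.
counterexamples = [
    obs for obs in observations
    if obs["status"] == 504 and not hypothesis(obs)
]

if counterexamples:
    print(f"Hypothesis disproved by {len(counterexamples)} observation(s); revisit the spec.")
else:
    print("No counterexample found. Keep trying to break the theory before acting on it.")
```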


Testing your hypothesis can show whether your perceived cause is a true root cause, another symptom, or, better yet, a completely separate issue. I can clearly remember implementing one particularly complicated code fix that took almost two months to move through development, testing and deployment, only to find that while it solved "a" problem, it didn't solve "the" problem we were targeting. Now we were two months behind on finding a resolution. The team had spent two months mitigating a risk based on the understanding that it was a short-term measure. The business was livid. It was not a good look for the team.


Teams, almost without fail, test solutions, but few test their understanding of a problem. Adding this step ensures that our efforts are moving in the right direction and keeps us on track through issue resolution. In the case above, seeking and acknowledging one counterexample would have told us that we didn't have the whole story and would have saved us months of spinning.


By approaching a failure like a scientific phenomenon that needs to be understood, we open the door to managing the complexity of modern environments, and we move away from reacting and toward shaping the evolution of the platforms we manage.



