By Cynthia Unwin

What does it take to make technical operations teams resilient?

Over the past year of lockdowns and home schooling, of changing roles at work and at home, I have spent a lot of time thinking about what people need to be successful, what it takes to teach, and what it takes to learn. I have researched how people develop the independent opinions that empower them to make good decisions, and how we can empower teams (and children) to cope effectively with unpredictable situations.

Not surprisingly, lots of smart people have asked these questions too. Their answer seems to be: provide people with a scaffolding of base knowledge, and give them access to the tools and skills that allow them to build on that knowledge independently. With this as a foundation, people have what they need to incorporate new insights into their mental models and use them to build new ways of succeeding.

It strikes me that this applies directly to building resilient operational teams. If we ensure that teams have a framework of proven methods, have the tools to understand how their systems work, and have the skills and knowledge to build on, they become far more able to adapt to unexpected problems and change. Adaptability is what brings resilience to both platforms and the teams that keep them running.

Strong technical operations teams rarely struggle with problems and events that are expected; they stumble when they are faced with things they didn't expect. They fail to respond well to the black swans. In reality, we can't plan for everything: there will never be a runbook for every scenario, and there will always be some "new" circumstance that we haven't met before. We can't be ready for every event, but we can be ready for "types" of events. We can be prepared for events that interrupt communication, remove physical components, or make external systems respond strangely. To be truly resilient, we need to have the scaffolding already in place to minimise service disruption even when we don't know what has actually happened and have no formal plan for the scenario.

Following this logic, effective technical operations teams need to have deep knowledge of their systems (or systems of systems), they need to have the ability to observe what those systems do (or do not do), and they need to have the required skills to minimise impact. Resilience is about adapting to and coping with unforeseen events, not just responding to failure. It is about recovering and maintaining service when circumstances are not optimal, when the unexpected strikes and when components you count on have gone off the rails.

While every team and every platform needs different specific knowledge, there are general patterns that hold true across most systems.


Have a process and follow it:

Companies like IBM are big on formal processes, and for good reason. I know, this is counter-intuitive: imposing structure shouldn't encourage free thinking, but it does. Structure and process are tools that free the mind from figuring out what comes next, and let it get down to the real work of solving problems.


Tested, repeatable process provides a framework to hang new ideas on. Use of an effective process and runbooks builds on the experience of others and develops a team's skills, much like drills develop muscle memory in athletes. Having a set of processes for dealing with types of situations, that a team knows and can rely on, ensures that rigor remains in place even under difficult circumstances, at 3:00 AM, or after something has gone terribly wrong. Using a proven and understood method (theoretically) leads to consistent outcomes and lets everyone on a team know where to go next. Often in critical situations, having a ready-made next step breaks the paralysis so common in people under stress. Once a team is moving forward they gain momentum, but if they are stuck at the starting line they can stall before they make any progress.

Ideally, wherever possible, individual runbooks should be automated and run automatically in response to specific conditions. I am, however, a strong proponent of ensuring that everyone on a team knows exactly what the automation actually does and why, so that if it fails they have some idea what that means and what to do. This also gives people the background they need to assess which runbook or process to use when they are faced with a unique event that has no automated solution.

Using a formal process also works especially well for root cause analysis, where a deep understanding of an issue is critical to defining a solution, risk mitigation, or action plan. Stopping an investigation at an intermediate symptom instead of establishing the real root cause of an incident is all too common, and it results in wasted effort, unnecessary failures, and loss of reputation both for the operational team and for the platform as a whole.
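To make the automated-runbook idea concrete, here is a minimal Python sketch of a dispatcher that maps event *types* to runbook steps. Everything in it is hypothetical (the event names, the `check_disk` step, the escalation behaviour); the point it illustrates is that every step is logged so the team can see exactly what the automation did, and that an unknown event type is explicitly handed to a human rather than silently ignored.

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
log = logging.getLogger("runbooks")

def check_disk():
    # Placeholder step; a real runbook step would call real tooling.
    log.info("checking disk usage")
    return True

# Hypothetical mapping of event types to ordered runbook steps.
RUNBOOKS = {"disk_pressure": [check_disk]}

def run_runbook(event_type):
    """Run the runbook for a known event type, logging every step,
    and escalate when there is no matching runbook."""
    steps = RUNBOOKS.get(event_type)
    if steps is None:
        log.warning("no runbook for %r -- escalate to a human", event_type)
        return False
    for step in steps:
        log.info("running step %s", step.__name__)
        if not step():
            log.error("step %s failed; stopping and paging on-call", step.__name__)
            return False
    return True
```

Because the mapping and the steps are plain, readable code, everyone on the team can see what the automation does and why, which is exactly the background they need when it fails.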

Know your platform and understand your key risks:

Even the most resilient architectures and seasoned teams have failures. That is why we work with error budgets, why we practice restoring from backup, and why we do DR exercises. Because we will never eliminate failure, platform resilience hinges on managing risk by optimising our response when disruptive events inevitably occur. To do this, you need to know where your platform is likely to struggle, know how it behaves under pressure and, just as importantly, know what normal looks like. If you know how your applications, infrastructure, and operational teams typically respond to unexpected or unplanned disruptions before they happen, you will be halfway to understanding an issue before you even start restoring service. (Restore service first, investigate root cause second.) Failure testing, or Chaos Engineering, is invaluable in developing this knowledge.

This is why observability and traceability are so fundamental to maintaining a resilient platform. Without observability you cannot really know your system, and the option to be proactive is severely limited. Fundamentally, you cannot intimately know how a platform responds in the wild if you can't, or don't, watch it respond. Observability is the key to establishing event correlation, and without event correlation you start from the beginning with every failure.

Real-time observability is best presented through operational dashboards (in tandem with monitoring and alerting). These dashboards should be designed to show health at a macro level and should always reference back to what normal looks like. High-level dashboards should show everyone who wants to know basic information: how many users are on the system, how many records it is processing, and whether any major components or their dependencies are unavailable.
From there they should drill down into more specific detail on resource metrics (Utilization, Saturation, Errors) or application metrics (Rate of use, Errors, Duration of tasks/calls). Below that, detailed views of individual sub-components can surface targeted information. All too often, ambitious dashboard and monitoring projects completely miss the big picture and, because of this, do little more than provide noise when real problems occur.

Finally, it is not enough to have a team that knows a system well and understands its intricacies; the team needs to share this knowledge and ensure it doesn't become siloed. Resilient teams cross-train and have backups. Resilient teams don't work in a vacuum, and they have at least a cursory knowledge of the systems that share their environment. Every member of a resilient team can take vacation, sleep, or go to the dentist without putting the platform at risk.
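As a rough illustration of those application metrics, here is a small Python sketch that computes Rate, Errors, and Duration over a window of requests and flags metrics that have drifted from a recorded "normal" baseline. The data shape, the baseline values, and the 50% tolerance are all assumptions made for the example; a real dashboard would pull this from its metrics or tracing pipeline.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Request:
    duration_ms: float
    error: bool

def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, Duration over a window of observed requests."""
    count = len(requests)
    return {
        "rate_per_s": count / window_seconds,
        "error_ratio": sum(r.error for r in requests) / max(count, 1),
        "mean_duration_ms": mean(r.duration_ms for r in requests) if requests else 0.0,
    }

# A recorded picture of "what normal looks like" (assumed values).
BASELINE = {"rate_per_s": 100.0, "error_ratio": 0.01, "mean_duration_ms": 50.0}

def deviates(current, baseline, tolerance=0.5):
    """Flag any metric more than `tolerance` (here 50%) away from normal."""
    return {k: abs(current[k] - baseline[k]) > tolerance * baseline[k] for k in baseline}
```

The design choice worth noting is that the comparison is always relative to a stored baseline: the dashboard never asks "is this number big?", only "is this number unlike normal?".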

Invest in your team's skills:

If teams are going to adapt to new situations and problems, they need to be equipped and willing to do things they have never done before, with little warning or preparation. Teams need a well to go to that contains the skills required to solve new problems; if they have been doing the same things every day for the last decade, there is nothing in their well.


In addition, established teams can be very resistant to trying new things, even when that change would make their individual lives better. Using a solution that is known and understood, even one that is difficult, uncomfortable, and ill-suited to a new problem, is often preferable to a new solution because the team knows they can manage the current one. Doing something different always has the potential to make things worse, and in the face of this risk many teams resist adapting. Breaking down this resistance once a team has become set in its patterns can be the biggest challenge in increasing the resilience of a platform. However, adaptability is key to ensuring that teams continue to make incremental improvements as systems change and new problems appear.

Adaptability and resilience come from inside teams and can rarely be forced successfully from the outside. There are, however, ways to enable teams to seek positive change. One highly productive method is to actively increase team skills on a regular, ongoing basis. This can be done through formal training, incentive programs, job shadowing, or rotating people through different groups. Whatever method is chosen, the goal is to broaden people's understanding of how things can be done, to encourage them to experience different ideas and patterns, and to enable them to stretch what they can do. When they have real, lived experience of other ways to approach a problem, they begin to imagine how they could approach their own technical challenges differently. Once this process starts, it is self-reinforcing and self-sustaining: teams that are consistently accruing new skills shift their perspective and naturally see new ways to do things.


As the world changes, all systems and processes degrade over time and need to be renewed. Even the most careful planners and seasoned operational teams, whether traditional sysadmins or SREs, need to adapt to new and unexpected situations on an ongoing basis. To be really successful, teams need to be supported by a resilient architecture, but a resilient architecture cannot overcome a lack of adaptability in an operational team. Giving your operational teams the scaffolding of tools, knowledge, and skills to be resilient and adaptable will provide better outcomes both for the individuals involved and for the business depending on the services they support.
