There is no more "keeping the lights on"

Cynthia Unwin
Jun 5, 2023

Updated: Sep 19, 2023

Several years ago I came across a presentation by an Architect at IBM named Simon Grieg. I found it while I was looking for something else, but it stuck with me and I find myself coming back to it again and again. The presentation explains the concept of Architectural Entropy. His premise is that System Architectures, like all other systems in the universe, inevitably and steadily move towards disorder over time. Regardless of how well designed a computer system is, it will be subject to this law of Architectural Entropy. This is true because no system is completely static, nor does it exist in isolation, and this ongoing internal and external change drives "entropy gain" as the architectural and structural integrity of a system is eroded. The only way to fend off entropy is to add energy back into the system in the form of governance and maintenance to actively counter the descent into chaos.

This was true a decade ago when, despite a clear overall vision, ongoing tactical decisions made it harder and harder to reach a strategic goal. It was true a decade ago when external integrations, functional patches and enhancements, security patches, and Operating System upgrades continued to push forward, changing the Enterprise environment in both subtle and significant ways. Today, however, with our current industry focus on moving away from monolithic architectures to component-based systems, coupled with accelerated change velocity, the rate of entropy gain is accelerating. Now, more than ever, controlling the descent into architectural chaos caused by entropy creep is a critical factor in managing software life-cycles over even reasonably short time-frames.

As we manage the life-cycle of complex systems, we face complex problems that require preparation and strategy if we are going to deliver consistent service, reduce cost, and maintain stability over time. High rates of change and integration with services outside of the control of a single product owner allow for the delivery of a more robust feature set in a shorter period of time, but they come at the cost of making platform maintenance more complicated. Getting to the root of what has caused an error becomes less straightforward and requires an understanding of how a service traverses system components to provide specific behaviours, especially if consistent patterns are not used across an Enterprise. The cross-cutting overall vision of how a complex system fits together, how each portion interacts, how they operate independently, and how change in one area can produce unexpected results in another becomes critical to the operation of a platform. Each area of a complex system needs a technical owner, but the end-to-end platform itself also needs to be clearly understood and technically governed, and this is not trivial.

As we shift away from big bang monolithic software deployments to incrementally improved platforms, our challenge increases while at the same time our expectations of our software are changing. We no longer expect to build a system, have it remain reasonably static for the next 5-7 years, and then go through another big bang implementation of the next generation of software. There is no "steady state" in this model unless that state is one of constant change. Ideally, good architecture allows for the evolution of platforms, and it allows us to consume new services and change to new models within the life-cycle of a living system. However, architects can't see the future, and no system, however well designed, can manage this shift and change without the attention of an overall owner.

While, as an industry, we understand the importance of build architecture and are starting to recognise the need to attend to non-functional and operational requirements during the design and build phases of a project, we rarely address Architectural Entropy systematically. The natural accrual of technical debt, complex problems, and architectural drift, and what those mean for solutions over time, is managed only when critical issues occur. Most programs initially include the necessary build resources, but beyond that, when we move into so-called "steady-state", teams fail to see the ongoing necessity for skilled overall technical leadership. Individual teams work towards their individual goals, and often the "build" architecture team is relied on in emergencies. This strategy provides disjointed support and governance because it relies on the best efforts of otherwise occupied teams that find themselves focused on tactically resolving a problem to meet their own short-term goals.

I do a fair amount of work with struggling long-term accounts, and while the problems they have and the technologies they deal with are varied, there is a common factor that contributes in a surprising number of cases. Despite having fundamentally changed how services are delivered to our customers, we have not moved past the idea that an overall architect is only required to create an initial architecture, and that once it is up and running, work-stream-based architects and developers paired with a disjointed support team should be able to maintain it from that point forward. This wasn't true a decade ago, and it isn't true now.

This viewpoint is validated when I look at programs that are really successful. When I look at programs that are meeting the expectations of this era's users and delivering new software quickly and with stability over the long term, these platforms, without exception, have a clear technical owner who is focused on directing the evolution of the platform. Not a functional platform owner, but someone (or a team of someones) responsible for end-to-end technical delivery over time. While this person is not the expert in each of the components involved in creating a solution, this person has the experience, capacity, and authority to govern technical change and form a bridge between discrete areas of responsibility. Fundamentally, if nobody owns the technical coherence of a solution, short-term expedience and incomplete understanding will quickly erode the viability of a platform.

We know from experience that legacy solutions with ongoing architectural ownership perform better. Despite their loosely coupled nature, or possibly because of it, Hybrid cloud solutions have an even greater requirement for an ongoing caretaker of the environment at the conceptual, architectural, and operational levels to maintain solution coherence and to adhere to non-functional requirements over time. If this role isn't filled, unchecked entropy gain will render the solution unmanageable from a maintenance and/or cost perspective in a surprisingly short time.

As an industry we need to accept that long-term change requires governance and embrace the role of the Operational or Site Reliability Architect as a critical piece in the ongoing success of our customers. Over my career I have repeatedly been told that ongoing programs can't support the expense of senior technical owners. I think the situation is actually the opposite. Expecting that a solution can be left to run with no overarching technical ownership over the course of years while continuing to be efficient, flexible, and cost-effective is naive and short-sighted.

There is no more "keeping the lights on"
