What Is True Resilience? (Hint: It’s Not About Managing Risk)

Here are three guiding principles for becoming a more resilient, high-performing organization in 2021.

2020 accelerated the transformation of almost every industry. The backbone of retail, commerce, education, healthcare, travel, and most of our everyday lives has moved to the cloud. It’s not a far-off, theoretical destination; it’s where critical services are routinely delivered. Because these services are critical, most of the customers I work with in my capacity as the leader of Google Cloud’s Office of the CTO are now asking how those services stay resilient in the face of many unexpected, unpredictable events. The answer transcends traditional organizational, contractual, and cultural boundaries.

Just this week, unprecedented weather patterns across the U.S. pushed many IT and business leaders into virtual “war rooms” to ensure that capacity, networking, and applications stayed immediately and continuously available. But those rooms existed only for the moment, rapidly assembled and then rapidly disassembled, just like the technology that underpins the real-time applications and services we all depend on. This is the new normal, and it calls for a new model of operations. Rather than treating a fixed reliability figure as the basis for contracts and practices, the focus must be on resiliency under any number of conditions.

Cloud-era resilience assumes the world is always changing, and the winning strategy involves mission-focused teams leveraging the best possible information. That’s the basis of both risk management and growth, and that’s what Google builds for. This assumption was first embodied in our methods of Site Reliability Engineering (SRE), which are now considered essential for highly reliable production systems.

True resilience isn’t about managing a particular instance of risk, but being ready for anything through the way you operate. Today’s disasters may come from wild, unanticipated success (leading to traffic spikes) as much as devastating unforeseen failure (be that a natural disaster, a political event, or a system configuration error that cascades into a global outage).

To give you a better idea of how this works, here are three guiding principles for becoming a more resilient, high-performing organization.

1. Create maximum observability of the overall system

I can’t overstress the importance of observability and continuity in how Google sees healthy system performance. With our roots as a data-oriented information company, we know both the importance of gathering signals and their interactions, and the need to organize for flexibility and performance—all in the service of keeping the system running for continuous improvement.

To put this in context, we can look to our most recent Black Friday and Cyber Monday. Had we attempted to build “perfect” systems for every customer, we likely would have seen high failure rates. The high traffic levels now common in these times can take customers into unpredictable “black swan” situations, or random, unexpected incidents with widespread ramifications.

Instead, for six months ahead of Black Friday and Cyber Monday, teams from across Google worked with key customers to create maximum observability of the overall system. This awareness of system interactions meant that teams with clearly defined roles could respond in a coordinated way, prioritizing data flow and operational consistency, if something began to go amiss. This level of foresight and planning enabled us to work quickly with certain Google Cloud customers to resolve, for example, issues in data analytics pipelines in just minutes, with no system stoppage and no change in the customer experience.

2. Design for effectiveness, not perfection

Back in Google’s earliest days, our founding teams wanted to map the entire internet, but didn’t have enough money to do that with conventional computers. Rather than focusing on the perfect solution, they bought economical computer parts and hooked them together with software that was built for persistence. Likewise, our teams were organized for maximum awareness, flexibility, and sharing. This meant that when parts failed, we had systems and methods in place to keep the jobs going.

In other words, we treated a technical problem much in the way nature treats a successful ecosystem—by not designing for perfection. Perfection doesn’t happen; change is constant. Designing for persistent information and learning helps the ecosystem survive and grow by responding to external changes dynamically.

This practice continues today, and one example is our error budget-based approach. Google’s Site Reliability Engineers (and their counterparts at our customer sites, Customer Reliability Engineers) allow for a certain number of system errors before halting deployments. A system with a service level objective (SLO) of 99.9% has a 0.1% error budget, so for every million requests in the measurement period, there’s a budget of 1,000 errors. The errors are meant to be addressed and fixed, of course, but new deployments are halted only when the error budget nears exhaustion.
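To make the error budget arithmetic concrete, here is a minimal Python sketch. It is an illustration only, not Google's tooling: the `ErrorBudget` class, the `should_halt_deployments` function, and the 90% halt threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: tracks errors against a service level objective."""

    slo: float            # target success ratio, e.g. 0.999 for 99.9%
    total_requests: int   # requests served in the measurement period
    failed_requests: int  # requests that failed in the same period

    @property
    def allowed_errors(self) -> int:
        # The budget is the share of requests the objective permits to fail.
        return round((1 - self.slo) * self.total_requests)

    @property
    def remaining(self) -> int:
        return self.allowed_errors - self.failed_requests

    @property
    def fraction_consumed(self) -> float:
        # With no budget at all, treat it as fully consumed.
        if self.allowed_errors == 0:
            return 1.0
        return self.failed_requests / self.allowed_errors


def should_halt_deployments(budget: ErrorBudget, threshold: float = 0.9) -> bool:
    """Assumed policy: pause new releases once most of the budget is spent."""
    return budget.fraction_consumed >= threshold


budget = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(budget.allowed_errors)            # 1000
print(budget.remaining)                 # 750
print(should_halt_deployments(budget))  # False
```

In this sketch a service that has used a quarter of its budget keeps shipping, while one at 950 errors out of 1,000 would trip the assumed threshold. Where to set that threshold is a policy decision for each team.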

Another example is our internal management practice called “no-blame reviews” (also referred to as blameless postmortems), where people share issues they’ve encountered without assigning blame to any group or individual. The point is learning, which frankly happens more often when something goes wrong, and when the maximum amount of data is shared.

3. Learn and iterate as you go

Google has grown from one service, internet search, to many, creating and building out all kinds of computing infrastructure in the process. Yet our early lessons in system observability, change, and persistence have been core to the way we’ve successfully moved into many other world-scale platforms, like mobility, maps and navigation, commerce, entertainment, and cloud computing. It’s how we serve billions of people. It’s how we provide high-quality insight into a range of system performance attributes that matter for developers, apps, stores, marketplaces, billing, identity, and community involvement. High observability, besides helping performance, also teaches us about adjacencies and natural areas of growth, so much so that the process itself is part of our service.

Today, Google’s insights about observability and persistence, along with our tools and organizational methods, are helping many industries innovate and grow. Recent solutions for data and software access (like BigQuery), and management and deployment across multiple clouds and on-premises computers (Anthos) help customers realize this vision of observability and peak performance.

Increasingly, our “ecosystem approach,” one that designs for maximum observability, embraces constant change rather than perfection, and doesn’t assign blame, is becoming more critical and applicable across numerous industries. Take healthcare, where patients, providers, insurers, suppliers, and others are increasingly intertwined. Data sharing and data privacy are both essential to the future health of these systems, but doing both at once is not something these organizations have historically had to do. Together, we’re building secure systems that help the players transforming the healthcare industry work better together.

And as an added bonus for those taking on the hard task of leading successful transformations in rapidly changing markets and industries, the “resilient ecosystem, no-blame” approach also leads to better and more enjoyable management. There’s more shared success, less solitary failure, more trust, and better understanding. Ultimately this creates a more meaningful and fulfilling relationship among team members, far beyond anything I’ve seen come from the best chip, VM, database, visualization tool, build system, or algorithm. That’s because it serves what technology does at its best: helping humans create, relate, and inspire each other, to the greatest possible degree.
