Risky Business: How an Engineering Mindset Drives Innovation

Change your business mindset with these four engineering principles.

Software engineers are often advised to think like business people, but do engineers also have something to teach managers and entrepreneurs? Absolutely! In my role as technical director in Google Cloud’s Office of the CTO, I’ve met many senior business leaders who could have benefitted from adopting a software development mindset to help them take their organizations to the next level. To get there, they would have had to first get comfortable with a level of risk, speed, transparency, and uncertainty that can initially be hard to stomach.

Take a closer look at the principles that modern software engineering teams live by and you’ll see that they’re far from reckless—in fact, on balance, they’re rather logical and reasonable. For example, many engineering teams employ rapid iteration, an approach to problem solving that requires making a lot of changes. Of course, every time we make a change, we must accept related risks—anathema in most business contexts. The upside is that the more risks we take, the faster we learn from our failures, and the more quickly we make progress. This philosophy is widely useful, far beyond the discipline of software engineering.

The good news is that the fundamental enabler for this is cultural, not technical. Read on for some software engineering practices that we employ at Google that you can borrow from to help your own business generate new ideas.

1. Consider Incremental Risks

Many organizations wish they could innovate faster, but they are effectively paralyzed by fear: innovating requires that they accept risk, and when all risks seem critical, that’s unacceptable. Thus, building a culture of innovation usually starts with some form of risk management.

Today’s software risk management looks dramatically different than it did in the 80s and 90s. Back then, projects were huge, timelines were long, and failures were spectacular and common. To illustrate modern software risk management, let’s take this diagram from my article Think Like a Python, not a Tiger:

As you can see, each of these blocks represents the same total risk, but the left side spreads it over multiple stages, so each stage advances more quickly and carries less potential risk exposure. In contrast, managing all of the risk at once carries much more coordination overhead, so it takes longer to see results. Taking large risks also makes teams more reluctant to cancel and repeat failed attempts, leading to higher project abandonment rates. Today, the software world uses repeatable patterns for risk management. Adjusting how we manage risks improves our learning speed through the iterative no-regrets, fail-adjust-and-retry approach. But that’s only the beginning of the journey: there’s still a larger opportunity yet to uncover.

Risks and the learnings that come from them follow a self-reinforcing pattern that builds resiliency. As my colleague Will Grannis recently shared, true resilience is not about managing risk: Resilience is about being ready for any outcome. Grannis’ point is noteworthy, because if you aim to control risk and stop there, your natural tendency will be to resist change, and that’s wildly counterproductive. As leaders, we can encourage this discipline by repeatedly asking for small, measurable improvements. To paraphrase Grannis’ article: ensure observability, design for effectiveness rather than perfection, and learn and iterate as you go.

Make no mistake, taking incremental risks isn’t about thinking small. It’s about encouraging a bias for action. Let’s start learning faster. Then, we can take it to another level.

2. Start Rapid Prototyping

Let’s combine our incremental risk discipline with a design thinking concept called the Prototyping Effect that we use here at Google. Simply put, your success rate when innovating will be directly proportional to the number of attempts taken—not to the amount of time you spend on each attempt. You may recognize this concept:

All other things being equal, we can try more ideas and get many more successes through rapid iteration of build/fail/fix than the old take-your-sweet-time way. And in our experience, the best way to achieve more successes and build more confidence in your culture is to increase the number of ideas you try. At Google, we do this by promoting small incremental changes and aiming to improve with each iteration. Our learnings primarily come from the “ideas tried” category, so the more active we are there, the more quickly we learn and refine our actions to drive successes. 
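The arithmetic behind this claim can be sketched in a few lines. This is only an illustration, not a Google tool: the 48-week budget, the 25% chance per attempt, and the function names are all hypothetical, and the sketch assumes every attempt is independent and equally likely to succeed ("all other things being equal").

```python
def attempts_in_budget(total_weeks: int, weeks_per_attempt: int) -> int:
    """Number of complete attempts that fit in a fixed time budget."""
    return total_weeks // weeks_per_attempt


def expected_successes(attempts: int, success_probability: float) -> float:
    """Expected successes when every attempt has the same chance of working."""
    return attempts * success_probability


def chance_of_any_success(attempts: int, success_probability: float) -> float:
    """Probability that at least one of the independent attempts succeeds."""
    return 1 - (1 - success_probability) ** attempts


# Hypothetical numbers: a 48-week budget and a 25% chance per attempt.
TOTAL_WEEKS = 48
P_SUCCESS = 0.25

for label, weeks_per_attempt in [("rapid", 2), ("slow", 24)]:
    n = attempts_in_budget(TOTAL_WEEKS, weeks_per_attempt)
    print(label, n, expected_successes(n, P_SUCCESS),
          chance_of_any_success(n, P_SUCCESS))
```

With the same time budget, the rapid team gets 24 attempts against the slow team's 2, so its expected number of successes is twelve times higher under these assumptions.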

To encourage rapid iteration, ask your leaders to measure attempts, and highlight what they learned. It’s about lots of small course corrections and encouragement. 

Before this can fully work, your teams need to feel safe and secure while failing regularly at lots of quick attempts. That leads us to our next belief.

3. Assure Psychological Safety

At Google, we discovered that the number one success factor for our most effective teams is the level of psychological safety enjoyed by each team. In other words, if the team feels safe enough for interpersonal risk-taking, they function significantly better. This demonstrates a supercharged level of resiliency.

Achieving this goal requires deliberate work from leaders to constantly recognize and promote learning, so that each team member volunteers to take risks, makes their views heard, and bravely goes down uncharted paths. Our best leaders remove all blame from failures, and instead celebrate failure for the value of the lessons we learn from it. Google does this by holding blameless post-mortem reviews when things go awry. Every quarter, our Technical Infrastructure team publishes its Greatest Hits and Misses and shares it with all our engineers to constructively examine our most spectacular errors and near misses. Along the way, we celebrate those involved, and the steps taken together to prevent recurrences. When individuals take risks that don’t work out, we lift them up, and make them part of improving the systems and processes that allowed those failures to begin with. Widely sharing our lessons learned helps us to avoid repeating our mistakes.

4. Establish Error Budgets

Still, you might ask: how do we know how much risk is appropriate to take? Google’s Site Reliability Engineering principles employ a concept known as Error Budgets for this. It’s rooted in the idea that there is no such thing as a system that never fails. That’s true in both software and business. Ideally, we design our systems such that when errors happen, we gracefully retry, or continue in a reduced capacity until the error is resolved. This reduces risk through design, so even during times of failure, some level of functionality is still possible. An error budget is a way of quantifying the amount of time your system can exhibit errors without adverse business impact. Picking a budget that’s too small means spending too much time designing fault tolerance features, which takes time away from adding features your users care more about. Picking a value that’s too big means that you may lose your users’ trust, and they may go elsewhere.

When we find an ideal balance between these upper and lower limits for our error budget, we know that we can accept errors, perhaps 1% of the time, and still meet our business goals. We set our error budget at that level on a monthly basis. Next, we measure our error rate throughout the period to verify that we consistently meet our goal. In months where our error budget goes unused, we have wasted an opportunity to accept risk with no regrets. Take your calculated risks in those windows where you have an unused error budget, or by carving out carefully pre-selected risk zones we call Maintenance Windows. When we succeed, we learn from that. When we fail within our error budget or maintenance windows, we meet expectations and learn from that too. It’s a win-win.
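As a rough illustration of the bookkeeping described above, here is a minimal sketch of a time-based error budget. It is not Google's SRE tooling: the `ErrorBudget` class, its field names, and the 30-day month are hypothetical, and only the 1% figure comes from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Tracks how much of a period's error allowance has been used."""
    period_minutes: int    # length of the measurement period
    budget_percent: float  # share of the period that may exhibit errors

    @property
    def allowed_minutes(self) -> float:
        """Total minutes of errors the business can tolerate this period."""
        return self.period_minutes * self.budget_percent / 100

    def remaining_minutes(self, error_minutes: float) -> float:
        """Budget left after the errors observed so far."""
        return self.allowed_minutes - error_minutes

    def can_take_risk(self, error_minutes: float) -> bool:
        """True while some of the period's budget is still unspent."""
        return self.remaining_minutes(error_minutes) > 0


# Hypothetical 30-day month with the 1% budget mentioned in the text.
month = ErrorBudget(period_minutes=30 * 24 * 60, budget_percent=1.0)
print(month.allowed_minutes)         # 432.0 minutes, about 7.2 hours
print(month.remaining_minutes(120))  # 312.0 minutes still available
print(month.can_take_risk(120))      # True: room for a calculated risk
print(month.can_take_risk(500))      # False: budget already overspent
```

Measuring in minutes follows the definition of an error budget as an amount of time; a team that counts failed requests instead could apply the same structure to request counts.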

Skeptics may struggle to follow my reasoning about error budgets. It may seem radical. I hear you. You’re thinking, “Why would I ever intentionally do something that could cause a problem that affects my users?” It’s because your users expect you to continually improve and expand your products and services. The error budget is an acceptable level of risk that you should factor into your workstreams, so you can identify any potential impact on mission-critical functions. Allowing things to break on occasion, while minimizing serious harm to the business, ensures you can adapt quickly in times of pressure or competition.

It All Comes Back to Learning

Let’s review. Incremental risks, rapid prototyping, psychological safety, and error budgets are all tools that foster rapid learning and contribute to a quickly evolving culture of innovation. Combining these tools with the right mentality changes the game: Learning is essential to resilience as circumstances continually change. Resilience helps you navigate risk more effectively than simple risk avoidance, promoting progress at a faster pace. Lots of small failures are way better than one big slow failure. So, let’s set the conditions to take risks, fail fast, learn fast, and adapt quickly. Let’s not punish failure, but instead celebrate learning from our results.