How does your company measure great technology operations?
Website uptime?  Isn’t that table stakes?
Who isn’t up 99.9% of the time, 99.9% of the time?

In today’s fast-changing enterprise, velocity of software delivery, optimal customer experience, and efficiency are constant focuses.  To balance these critical goals, consider these three areas.

  1. Structure

Do you have software engineering and production engineering under the same leader?  You might reconsider. 

From Amazon to New Relic, the role of technical operations/production engineering continues to morph, and from a skill-set perspective it often looks a lot like software engineering.  Companies that get it right realize that software engineering and technical operations/production engineering are both changing, but that the distinction between the two remains critical.  Companies like my own, Google and others are increasingly selecting former software leaders with strong operational backgrounds to lead their technical operations/production engineering teams.  These leaders are breaking down the walls between software and operations, often by embedding half of the operations team inside software engineering teams.  This enables more autonomy within software teams and tight collaboration while protecting operational best practices.  Most technical operations/production engineering teams now focus on delivering a consistent transformation into the new world of cloud, CI/CD, autonomous teams, automation and instrumentation.

“The incentive of the development team is to get features launched and to get users to adopt the product. That’s it! … the incentives of a team with operational duties is to ensure that the thing doesn’t blow up on their watch. So these two would certainly seem to be the tension.” – Ben Treynor, Google operations.

It’s important that you evolve your operations team for the future while keeping the healthy tension between meeting launch deadlines and ensuring quality and reliability.

Why even bother with operations?  What about software as infrastructure or autonomous teams?

In my experience, if you don’t have a strong technical operations/production engineering team, you will often see defects handled under the radar (see metrics next).  At the enterprise level, this can also lead to redundant effort: each software team automates its own deployments, integrates its own monitoring tools and builds its own software as infrastructure instead of focusing on features that customers want.  It’s why most large web-scale companies, including AOL, Yahoo, Google and Amazon, have an empowered, well-invested technical operations/production engineering team that is distinct in leadership from software engineering.  In all these companies, technical operations/production engineering has a critical role supporting autonomous teams or embedding with them.

  2. Metrics for customer success and operational success

If you don’t measure and track it, you don’t know how you are doing.

When undertaking a web-scale project, it is critical that you capture and track the right metrics.  It is important to use the data to evaluate the customer experience, operational consistency and velocity.

Hidden customer impacts are the enemy of continuous improvement.

From my perspective, reporting on website uptime is very 1999.   

Key things to consider for reactive improvement:

  • Towers of Responsibility: Review your organization and group it into no more than 10 towers of responsibility, such as application, technical operations/production engineering, core infrastructure and financial systems.
  • Number of Incidents by Tower of Responsibility: Track anything that impacts more than 5 customers, along with both enterprise severity and response priority.
  • Mean Time to Repair: Track, by tower of responsibility, the time to repair any customer impact.
  • Tasks: Capture tasks in a root cause meeting and track their age, so the same impact does not recur. My favorite line is “who is going to do what so that given the same set of circumstances we have a different outcome next time?”
  • The Towers of Reality: Three key areas cause impacts. You need to track and trend them by tower of responsibility.
    • Third party: Someone you do business with caused the impact.
    • Failure: You didn’t touch anything; it just broke. This often means the systems need investment or prioritization.
    • Change: You changed something and broke it.
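To make the list above concrete, here is a minimal Python sketch of how incidents might be rolled up by tower of responsibility. The `Incident` record, its field names and the cause labels are assumptions made up for illustration; they are not taken from any particular ticketing tool, so map them onto whatever your incident system actually exports.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical record shape; field names are illustrative only.
@dataclass
class Incident:
    tower: str               # tower of responsibility, e.g. "application"
    cause: str               # "third_party", "failure" or "change"
    customers_impacted: int
    started: datetime
    repaired: datetime

# Track anything that impacts more than 5 customers.
CUSTOMER_THRESHOLD = 5

def summarize(incidents):
    """Return per-tower incident count, mean time to repair and cause counts."""
    grouped = defaultdict(list)
    for inc in incidents:
        if inc.customers_impacted > CUSTOMER_THRESHOLD:
            grouped[inc.tower].append(inc)

    summary = {}
    for tower, items in grouped.items():
        # Total repair time across the tower's incidents.
        total = sum((i.repaired - i.started for i in items), timedelta())
        by_cause = defaultdict(int)
        for i in items:
            by_cause[i.cause] += 1
        summary[tower] = {
            "incidents": len(items),
            "mttr": total / len(items),
            "by_cause": dict(by_cause),
        }
    return summary
```

Running `summarize` over each reporting period and comparing the results is one simple way to trend both mean time to repair and the third party/failure/change split per tower.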
  3. Monitoring

Anytime your customers tell you that your system is down, you need to add monitoring.

You must also:

 

  1. Create synthetic monitoring that automatically uses your application as customers do. These tests should run in production and in your software team’s automated test suite.
  2. Purchase and install a log aggregator like Splunk. Add alerts that search your logs for anomalies.
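As a rough illustration of the first item, a synthetic check can be as small as the Python sketch below: request a page the way a customer would, then verify the status, the content and the response time. The function names, the `expected_text` marker and the two-second threshold are all assumptions for the example; a real check would walk through a full customer journey such as login or checkout.

```python
import time
import urllib.request
from dataclasses import dataclass

@dataclass
class CheckResult:
    ok: bool
    status: int
    seconds: float
    detail: str

def http_get(url, timeout=10):
    """Default fetcher: return (status_code, body_text) for a URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")

def run_check(url, expected_text, max_seconds=2.0, fetch=http_get):
    """Fetch a page and verify status, content and latency.

    `fetch` is injectable so the same check can run against a fake
    fetcher in automated tests and against the real site in production.
    """
    start = time.monotonic()
    try:
        status, body = fetch(url)
    except Exception as exc:  # network errors, timeouts, HTTP 4xx/5xx
        return CheckResult(False, 0, time.monotonic() - start, f"request failed: {exc}")
    elapsed = time.monotonic() - start

    if status != 200:
        return CheckResult(False, status, elapsed, "unexpected status")
    if expected_text not in body:
        return CheckResult(False, status, elapsed, "expected text missing")
    if elapsed > max_seconds:
        return CheckResult(False, status, elapsed, "too slow")
    return CheckResult(True, status, elapsed, "ok")
```

Scheduling something like `run_check("https://www.example.com/", "Sign in")` every minute and alerting when `ok` is false is the basic idea; the URL and marker text here are placeholders.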

 

Meet with your product and business partners to set error budgets.  With these basic steps, your organization will continuously improve the customer experience.  Make sure you have separation between software engineering and technical operations, and ensure that the technical operations teams know their goal is to improve the overall customer experience through automation and monitoring while hitting the key goals.  Their job is not to assess or judge, but to make sure the Service Level Objective is at or above target.
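For anyone new to error budgets, the arithmetic is simple enough to sketch in a few lines of Python. This assumes a time-based availability objective measured over a rolling 30-day window; the function names are made up for the example, and your own objective may be based on requests rather than minutes.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of customer impact allowed in the window for a given SLO.

    `slo` is a fraction, e.g. 0.999 for a 99.9% availability target.
    """
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo, bad_minutes, window_days=30):
    """Budget left after subtracting the impact already incurred.

    A negative result means the objective has been missed for the window.
    """
    return error_budget_minutes(slo, window_days) - bad_minutes
```

Under these assumptions, a 99.9% target allows roughly 43 minutes of impact in 30 days, so a single 30-minute incident uses up most of the month’s budget.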

After all, it’s all about the customer.

 

 
