If you're a frequent reader of the Cortex blog, you know that we care deeply about empowering Site Reliability Engineering (SRE) teams to adopt and manage microservices architecture. In organizations of all sizes and industries, SRE teams ultimately own the responsibility of keeping systems up and running and putting in place systems that mitigate risk, automate manual operations, and integrate alerting. Beyond that, successful SRE teams maintain clearly defined criteria for production, ensure developer accountability, and diligently measure success against availability targets (e.g. SLOs).
For organizations with maturing SRE teams, we've found that it's worth asking how those teams might garner influence beyond that function. What SRE principles can we evangelize and adopt across the wider engineering organization? What might other engineers gain by thinking like an SRE?
In this article, we'll provide a brief history of the SRE role and identify a number of key SRE principles that we've found to be impactful across engineering functions.
The SRE function: a history
The ethos of the Site Reliability Engineer (SRE) was first defined in 2003 by Ben Treynor Sloss, a VP of Engineering at Google. As he puts it in an interview,
"SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor."
The root of the SRE function came from the need to build and automate software to solve operational problems. It was in large part a response to the dysfunctional model where a development team might be responsible for writing code, and a separate operations team might be responsible for maintaining that code in production. In this model, developers are incentivized towards development velocity whilst operations teams are incentivized against change. This system not only hinders feature development and innovation, but also fails to optimize for investment in automation that can bring significant gains in the long term.
Since the inception of the role in the early 2000s, Google has published a widely respected SRE book that documents both the SRE mission and its best practices. In 2019, LinkedIn ranked the SRE role as the second-most promising job in the U.S.
The SRE mindset
Given the explosive growth of the SRE role in modern software development teams, it's worth noting the qualities of the role that make it so impactful. And perhaps beyond that — what might it mean to adopt an SRE mindset if you're not an SRE? Might there be an opportunity in evangelizing those qualities across an organization? We think so.
Below, we'll start with a few key SRE principles that we believe apply strongly across functions.
Encourage ownership
In the context of microservices, service ownership is critical. As we wrote in "How to Drive Ownership in Microservices",
"Service ownership means that there is a clear person or group of people who are ultimately held accountable for the success of each service."
For SRE teams managing dozens of services across multiple applications, a failure to track and assign service ownership can make it extraordinarily difficult to diagnose and address outages. For that reason, they might:
Regularly audit their microservices architecture to make sure every single service has a clear owner.
Define SLOs for each service and make sure that owners of that service are held accountable for reaching those SLOs.
Assign SREs to on-call rotations and support escalations for the services they own.
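As a rough illustration of the audit described above, here is a minimal Python sketch. The `Service` record, its fields, and the example catalog are all hypothetical; in practice this data would come from wherever your team tracks its services.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    # Hypothetical service record; field names are assumptions for this sketch.
    name: str
    owner: Optional[str] = None         # team or person accountable for the service
    slo_target: Optional[float] = None  # e.g. 0.999 for 99.9% availability


def audit_ownership(services: list[Service]) -> dict[str, list[str]]:
    """Return the names of services missing an owner or an SLO target."""
    return {
        "missing_owner": [s.name for s in services if not s.owner],
        "missing_slo": [s.name for s in services if s.slo_target is None],
    }


# Example catalog with made-up services.
catalog = [
    Service("checkout", owner="payments-team", slo_target=0.999),
    Service("search", owner="discovery-team"),
    Service("legacy-batch"),
]
report = audit_ownership(catalog)
print(report)
```

Running a check like this on a schedule is one way to keep gaps in ownership from going unnoticed.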
The developer accountability that comes with service ownership, however, can and should apply well beyond the SRE function. Engineers across an organization should always strive to assign clear ownership within their team, whether that means requiring every GitHub issue and Pull Request to have an owner or making a single person accountable for each step of a product release. Ultimately, engineers in high-ownership environments feel more empowered to solve challenging problems and are more likely to deliver high-quality output at a predictable rate.
Build for scale and invest in automation
As noted above, SREs are responsible for keeping systems up and running — and building processes to ensure that those systems can handle scale. For example, an SRE might:
Identify key availability metrics and build monitoring tooling around those metrics to make sure irregularities are quickly identified. If an automated process can detect an issue and correct it, even better.
Enable auto-scaling to make sure all environments running a production application can scale memory and compute alongside a growth in volume or user count.
Integrate with a CI/CD tool to automate various steps within the product release process.
Beyond the SRE function, however, the principle across these three responsibilities is simple: create reliable systems that can scale. For engineers outside of the SRE function, adopting this principle can be incredibly impactful.
For example, consider a QA Engineering Lead who's responsible for writing test scenarios intended to produce performance benchmarks for each component of an application. Suppose she can hire 10 other engineers onto her team. If she adopted the SRE mindset of creating reliable systems that scale, she might write a test scenario template for each application component, write a script that outputs performance benchmarks whenever a test scenario is run, and empower her team to grow those test scenarios to incorporate more complexity over time.
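A script like the one the QA lead might write could look roughly like the sketch below. It assumes each test scenario is a plain Python callable; `run_benchmark` and `sort_scenario` are hypothetical names invented for this example.

```python
import statistics
import time
from typing import Callable


def run_benchmark(name: str, scenario: Callable[[], None], repeats: int = 5) -> dict:
    """Run a scenario several times and summarize wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        scenario()
        durations.append(time.perf_counter() - start)
    return {
        "scenario": name,
        "runs": repeats,
        "mean_s": statistics.mean(durations),
        "max_s": max(durations),
    }


# Hypothetical scenario: stands in for exercising one application component.
def sort_scenario() -> None:
    sorted(range(10_000, 0, -1))


result = run_benchmark("sort-10k", sort_scenario, repeats=3)
print(result)
```

Because every scenario goes through the same function, new team members can add scenarios without rewriting the measurement logic.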
Be proactive and be prepared
As much as successful SRE teams create systems to maintain reliability, outages are inevitable. SREs are therefore also responsible for building systems that make it easier to detect, diagnose, and mitigate the impact of an incident when one occurs. To this end, SRE teams might:
Document a customer-facing Disaster Recovery plan that informs customers of the steps they can take to prepare against permanent data loss in case of an incident.
Create incident run books so that engineers across teams know what to do in case of an outage. Among other things, a run book might define incident criteria, list primary sources of information, and clearly state team owners for each step of the response.
Enforce the Postmortem practice, where information and evidence for each incident are collected and specialists from across teams gather to document root cause, discuss lessons learned, and propose future preventative measures.
While the initiatives above are specific to the SRE function, the principle is once again simple: build systems that make it easy to detect, mitigate, and prevent problems. A Support Engineer, for example, would benefit from enforcing a postmortem on high-priority tickets to better understand root cause and potentially identify a gap in the product. If the Support Engineer uses a tool like Zendesk, she might create a Slack integration that alerts the Customer Success team if a P1 ticket breaches an SLA. Support teams might already be working on a process of this nature, but adopting an SRE mindset can certainly strengthen the momentum behind those processes and cultivate a culture of preparedness.
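To make the SLA alert idea concrete, here is a minimal sketch of the breach check itself. It deliberately does not call the Zendesk or Slack APIs; the `Ticket` fields, the one-hour SLA window, and the message format are all assumptions for illustration, and the resulting messages would be handed to whatever notification integration your team uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    # Hypothetical ticket record; not the schema of any real ticketing tool.
    ticket_id: int
    priority: str                        # e.g. "P1", "P2"
    opened_at: datetime
    first_response_at: Optional[datetime] = None


def breached_tickets(tickets, now, sla=timedelta(hours=1)):
    """Return P1 tickets whose first response took (or is taking) longer than the SLA."""
    breached = []
    for ticket in tickets:
        if ticket.priority != "P1":
            continue
        # Unanswered tickets are measured against the current time.
        responded = ticket.first_response_at or now
        if responded - ticket.opened_at > sla:
            breached.append(ticket)
    return breached


def format_alert(ticket):
    """Build the message text that would be posted to a team channel."""
    return f"P1 ticket #{ticket.ticket_id} has breached its response SLA"


now = datetime(2024, 1, 1, 12, 0)
tickets = [
    Ticket(1, "P1", opened_at=datetime(2024, 1, 1, 10, 0)),   # 2h without response
    Ticket(2, "P1", opened_at=datetime(2024, 1, 1, 11, 30)),  # still within SLA
    Ticket(3, "P2", opened_at=datetime(2024, 1, 1, 8, 0)),    # not a P1
]
alerts = [format_alert(t) for t in breached_tickets(tickets, now)]
print(alerts)
```

Keeping the breach logic separate from the notification step makes it easy to test and to reuse if the team changes tools.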
The best place to get started
Here at Cortex, we've helped teams of all sizes take steps towards widely adopting an SRE mindset. Here are a few places to get started.
Audit alerts. Collect all of the alerts your team has received in the last 30 days. Do you take action on every single one? Should higher-priority alerts be more clearly surfaced? Should someone have been paged? Consider "noisy" alerts and decide which ones might be better off removed or modified.
Encourage a Postmortem culture. Whether it's outages or P1 support tickets, encourage your team to write postmortems. Failure is to be expected, but make sure you're learning from it. Google has published a blog post about postmortem culture that covers the practice in more detail.
Assign ownership. A high-ownership environment signals that your team trusts each other, that engineers feel like they have what they need to solve a problem, and that they'll take accountability for failure if it happens.
Automate what you can. Invest in tools that automate repetitive tasks. If you don't know where to start, check out Zapier to see if you can integrate the tools you already use. Time is your most valuable asset — reduce as much mindless work as you can.
Create checklists. If there's a process you or your team performs regularly, create a checklist for it. For example, what steps are required to promote code from staging to production? Write them down and use the list as a template every time you run the process. If the engineering lead who typically owns that process is out of office, it'll be easy for someone else to pick it up.
Define target metrics. It's easy to associate SLOs with the SRE function, but all teams should target meaningful metrics to track and work towards. Consider measuring Pull Request throughput per week, customer support tickets answered, or story points per sprint — whatever might best incentivize behaviors that correlate with your definition of success.
This list is certainly not exhaustive, but we're confident it's a good place to start.
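For the alert audit in the list above, a first pass can be as simple as counting how often each alert fired and how often someone acted on it. The sketch below assumes you can export alert history as records with a `name` and an `actioned` flag; that format, the thresholds, and the alert names are all hypothetical.

```python
from collections import defaultdict


def find_noisy_alerts(alerts, min_count=5, max_action_rate=0.2):
    """Flag alert names that fire often but are rarely acted on."""
    fired = defaultdict(int)
    actioned = defaultdict(int)
    for alert in alerts:
        fired[alert["name"]] += 1
        if alert["actioned"]:
            actioned[alert["name"]] += 1
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_count and actioned[name] / count <= max_action_rate
    )


# Made-up 30-day history: one chatty alert, one useful alert, one rare alert.
history = (
    [{"name": "disk-usage-warning", "actioned": False}] * 9
    + [{"name": "disk-usage-warning", "actioned": True}]
    + [{"name": "checkout-5xx", "actioned": True}] * 6
    + [{"name": "cert-expiry", "actioned": False}] * 2
)
noisy = find_noisy_alerts(history)
print(noisy)
```

Alerts flagged this way are candidates for review, not automatic removal; a rarely actioned alert may still be worth keeping at a lower priority.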
The risk of losing the SRE mindset
While the steps above might sound intuitive, we know that putting them into practice is hard and takes both time and commitment. Here at Cortex, however, we've learned that not investing in the SRE mindset risks:
Developer morale. If your team spends a lot of time "keeping the lights on" and not investing in the future, developer morale takes a hit.
Ability to meet SLOs. If your team isn't targeting the right metrics and has a hard time making concrete progress towards those, your customers and your product will suffer.
Alert fatigue. If you don't invest in meaningful alerts, it becomes difficult for engineers to distinguish between severe problems and expected behavior. Issues will be ignored.
SREs have significantly strengthened software development teams in the last decade, and the SRE mindset is just as powerful. We encourage you to embrace it. If you're looking for help or additional tips, don't hesitate to reach out to us at team@getcortexapp.com. We'd love to learn from you.