4 Essential Steps to Build a Culture of Reliability Across Engineering

At Cortex, our mission is to help engineering organizations foster a culture of reliability and ownership. We’ve personally experienced the challenges posed by a sprawling microservice architecture — from dealing with service maturity standards to working through production readiness checklists — and we incorporate our firsthand knowledge into everything we do.

The most important insight we’ve gained along the way is that ownership is crucial to success with microservices. Recently, I hosted a webinar to discuss just how important defining clear ownership is, and how leaders can foster a culture of accountability at their organizations with the right approach.

In this ebook, we’ll dive into the challenges commonly faced by engineering organizations working with microservices, the core principles of establishing ownership, and how Cortex can help you successfully create a culture of reliability.

Common challenges

Institutional knowledge

Many organizations transition to microservices with a focus on enablement and flexibility. The goal is to build fast and increase velocity, sometimes at a cost. Commonly, the proliferation of tools and technologies becomes unsustainable. Depending on the application, engineers may use different languages and frameworks, or different versions of the same framework. Teams may ultimately diverge in how they structure CI pipelines, format logs, and track metrics.

A lack of standardization will inevitably slow velocity. As institutional knowledge fragments across teams, issues become harder to triage when they arise. Communication overhead grows significantly, so team members spend more time figuring out what happened than devising and implementing a solution. Plus, it becomes more challenging and costly for individuals to join new teams, since they need to spend more time upfront becoming familiar with new systems.

Organizational structure

In accordance with Conway’s law, many organizations move to microservices because their teams have become distributed over time — it makes more sense for small components to communicate with one another like small, distributed teams do. Reorganization isn’t inherently bad, but when it’s coupled with siloed knowledge, teams may ultimately find it difficult to collaborate with one another.

Unclear ownership protocols and a general lack of accountability are the trademarks of this situation. You may want to address a problem at your organization, like the proliferation of tools, but if ownership isn’t established, you can’t know where to begin or who is responsible for overseeing the process.

The problem is exacerbated when it comes to common services, like shared libraries. When the collective owns an asset, simple changes can wreak havoc on your whole organization. This is especially true if those changes aren’t communicated to the right people, but without ownership, you can’t know who the right people are.

SRE overload

Typically, the SRE team is tasked with establishing some kind of order within this chaos. For a single group, this is a mammoth responsibility, especially because an SRE team usually has only a fraction of the staff that engineering teams have.

On the other end, developers usually don’t have visibility into SRE efforts, so they can’t see where their services are or aren’t making progress. If ownership isn’t in place, the problem becomes even worse: neglected services can easily fall out of compliance. Ultimately, without the accountability that visibility and ownership provide, no real progress can be made.

4 Steps to Build a Culture of Reliability

There are a few essential steps that your organization will need to take to create a culture of accountability: align on an outcome, define ownership, set clear standards to achieve your outcome, and provide visibility throughout the process.

1. Audit and align on an outcome

Trying to make improvements without a clear goal in mind is like wandering into the woods on a cloudy night — you’ll end up walking in circles unless you have your eyes on the North Star.

Team members need to agree on what your organization’s north star will be — if you can’t agree on a common goal, then you’ll continue running into issues caused by a lack of standardization. Consider what’s important to your organization. Are you focused on ensuring that your incident response process is streamlined? Has your velocity dropped because of a lack of innersourcing?

Whether your north star is reliability or standardization, this audit phase will help you and your teams decide on a clear vision of the future. Without a north star, it’s incredibly difficult for anyone to know what they’re accountable for. Outcome is a critical part of accountability, and this audit phase enables leaders to figure out how to achieve that goal.

2. Define ownership

Once you’ve found your north star and have a sense of where you’re heading, it’s time to define ownership. You can do this with a spreadsheet, database, or service catalog — what matters is that you have a centralized system to keep track of this crucial information. Individuals need to be clear on their responsibilities, and they need to know who to reach out to if an issue arises.

With a system in place, you’ll audit every component you have, from libraries to data pipelines. Every service and resource should have an owner. Ideally, owners will be teams, rather than individuals — this way, you don’t risk losing key information when folks leave your organization or switch teams.
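To make the audit concrete, here is a minimal sketch of what a first pass over an ownership inventory could look like. Everything in it is hypothetical: the `Component` record, the `audit_ownership` helper, and the sample team and component names are illustrations, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    kind: str                    # e.g. "service", "library", "data-pipeline"
    owner: Optional[str] = None  # team name, or None if nobody owns it


def audit_ownership(components, known_teams):
    """Return (names with no owner, names whose owner is not a known team)."""
    unowned = [c.name for c in components if not c.owner]
    not_a_team = [
        c.name for c in components
        if c.owner and c.owner not in known_teams
    ]
    return unowned, not_a_team


# Hypothetical inventory, e.g. exported from a spreadsheet.
catalog = [
    Component("checkout-api", "service", owner="payments"),
    Component("auth-lib", "library"),                            # shared, no owner
    Component("events-etl", "data-pipeline", owner="jane.doe"),  # an individual
]

unowned, not_a_team = audit_ownership(catalog, known_teams={"payments", "platform"})
print("No owner:", unowned)
print("Owner is not a team:", not_a_team)
```

The second list matters as much as the first: a component owned by a single person looks covered on paper, but that coverage disappears when the person changes teams.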

Once a baseline is established, work with tech leads and managers to audit ownership definitions. Make sure everyone is clear that future initiatives will be assigned to specific owners. This will enable leaders to make a more informed decision about whether any components are incorrectly assigned.

Be prepared to spend a good amount of time on this step — everyone at your organization will need to come to a consensus around ownership for this to work. Without a consensus, you’ll inevitably wind up with teams developing divergent processes again. Trust that the time investment is worth it: when everyone is on the same page, it becomes immensely easier to produce high-quality code.

3. Set clear standards

Once ownership is defined, you can set clear standards and determine a path toward your north star. If that north star is reliability, for example, then you need to decide if you’re tracking MTTR, incident response time, number of incidents, or a combination of these metrics. Codifying standards contextualizes progress and allows everyone to clearly see how their services are improving.
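As an illustration of codifying one such standard, the sketch below computes MTTR from a list of incidents and compares it with a target. The incident data and the one-hour `MTTR_TARGET` are assumptions made up for the example; your own threshold and data source will differ.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - opened) across incidents; None if there are none."""
    if not incidents:
        return None
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log for one service: (opened, resolved) pairs.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 9, 22, 0), datetime(2024, 1, 9, 23, 30)),
]

MTTR_TARGET = timedelta(hours=1)  # assumed standard, pick your own

mttr = mean_time_to_recovery(incidents)
# A service with no incidents is treated as meeting the target.
meets_standard = mttr is None or mttr <= MTTR_TARGET
print(f"MTTR: {mttr}, meets target: {meets_standard}")
```

Writing the rule down this way forces the decisions that otherwise stay implicit, such as which timestamps count and how a service with no incidents is scored.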

Aligning on key standards from the start will enable folks to collaborate on achieving a goal, rather than talking past one another. Having clear standards also empowers employees to petition leadership to make changes that will help them make progress, like improving the organization’s tech stack.

Your standards will change over time, especially as you learn more about your services and teams. You may learn, for instance, that the standards that apply to your tier-1 services don’t quite fit tier-0 services. With the right tools in place, you can readily communicate changing standards and ensure that everyone impacted remains on the same page.

4. Provide visibility

Setting clear standards also enables you to provide greater visibility, which is the most important part of creating a culture of accountability. Visibility is the greatest driver of cultural change. Developers are craftspeople who care deeply about building high-quality software and following best practices, but that is difficult to do when a clear definition of “high-quality” doesn’t exist within your organization.

Defining standards is one way to provide that visibility. Frequent updates and clear communication help everyone remain in agreement about standards, and they keep your organization’s north star front of mind for everyone. You can accomplish this through Slack notifications, tech ops reviews, management meetings, or all-hands; the method doesn’t matter as much as the communication itself does.

Visibility is needed at every level. Developers need insight into how their services are performing against standards; when they’re underperforming, they need a clear sense of what’s wrong and how to improve it. Leadership needs a clear understanding of how metrics are being tracked and where the risks are — without that visibility, they can’t know whether progress is being made, or whether the standards used are still appropriate.

Visibility from leadership can be a powerful way to drive action. Leaders can highlight improvements in specific areas at all-hands meetings, and compare them against the clear standards that your organization has defined. This kind of visibility emphasizes the importance of your goals, and it drives buy-in across your organization.

Not every standard needs firm deadlines or hard-and-fast goals — in many cases, driving accountability is about providing visibility and reminding folks where you’re headed. The gamification of standards can do a lot in this regard, increasing engagement and creating visibility, without much action needed from leaders. Consider the benefits of an “achievement unlocked” system, where services can earn a bronze, silver, or gold status as they improve along maturity standards.
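A bronze, silver, and gold scheme like this can be sketched in a few lines. The level definitions and check names below are hypothetical; the point is only that each level builds on the one before it, and that a service can always see what it needs next.

```python
# Hypothetical maturity levels, ordered from lowest to highest. Each level
# lists the checks a service must pass in addition to all lower levels.
LEVELS = [
    ("bronze", {"has_owner", "has_runbook"}),
    ("silver", {"has_oncall_rotation", "has_dashboard"}),
    ("gold", {"meets_mttr_target"}),
]


def achieved_level(passed_checks):
    """Return the highest level reached, or None if bronze is not met."""
    level = None
    for name, required in LEVELS:
        if not required <= passed_checks:  # some required check is missing
            break
        level = name
    return level


def next_steps(passed_checks):
    """Return (next level to aim for, checks still missing for it)."""
    for name, required in LEVELS:
        missing = required - passed_checks
        if missing:
            return name, sorted(missing)
    return None, []


passed = {"has_owner", "has_runbook", "has_dashboard"}
print(achieved_level(passed))  # bronze
print(next_steps(passed))      # ('silver', ['has_oncall_rotation'])
```

Because levels are cumulative, passing a gold-level check early does not skip a service ahead; it still has to close the silver gaps first.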

Positive reinforcement, like leaderboards and metrics that highlight teams with significant improvement, is a great way to drive visibility. It motivates the folks at your organization to take initiative and search for this information, creating a self-reinforcing cycle of visibility and improvement.

What is Cortex?

Cortex is a developer portal that helps you accomplish everything we’ve covered. It’s designed to make it easy to gain visibility into services and deliver high-quality software. Cortex is equipped with a suite of tools, from a Resource Catalog to a built-in leaderboard, that will help you create a culture of accountability at your organization.

Service and Resource catalogs

The Service Catalog and Resource Catalog are two essential tools for creating a culture of reliability. Through integrations, Cortex will pull all of your services and every single component of your infrastructure into a single platform. All of these distinct entities are tagged with ownership, so it’s easy to identify the people accountable for a service or resource.

These catalogs act as a single source of truth, so your developers don’t have to waste valuable time trying to figure out which spreadsheet has the info they need. Each service and resource has its own details page where you can quickly find everything you need: owners, Slack channels, current on-call information, API documentation, event timelines, links to runbooks and dashboards, and more.

Ownership is the most critical aspect of the catalogs, allowing your organization to efficiently and confidently move through the first steps of creating a culture of accountability. Cortex can connect with identity and team-management tools, like Okta and OpsGenie, and automatically sync team information, so once you tag a service with owner info, it’ll stay up to date as people leave and join the team.

Scorecards and Initiatives

Scorecards and Initiatives allow you to easily establish best practices and keep your teams accountable. While we can’t find your north star for you, Cortex can automate the process of codifying standards and tracking progress, providing the visibility that developers need. Not only can leaders enforce standards, but team members can hold themselves accountable for their work.

Cortex comes with a number of integrations out of the box, so you can automatically pull information from SonarQube, New Relic, or Datadog, for example.

You even have the flexibility to push in your own data and write rules with the power of Cortex Query Language (CQL).

Scorecards offer an easy way to codify standards as a set of rules and measure all of your services against them. At a glance, you can see exactly where a service is or is not in compliance. Gamification is built into Scorecards with Levels, so teams can quickly understand what the highest priority rules are, making it even easier to drive progress on the most important standards.

When it comes to looking at the big picture, Cortex can help with that, too. The Bird’s Eye report is an invaluable tool for leaders: this report visualizes a Scorecard as a heat map, so you can quickly understand performance across teams, groups, and rules.

This kind of visibility allows leaders to identify bottlenecks and risks. With the ability to group by team, leaders can gain insight into where teams are succeeding and struggling. When you can quickly see the challenges a team is facing, it becomes easier to provide the resources they need to meet service standards.
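To show the idea behind grouping by team, here is a small sketch that turns per-service rule results into a team-by-rule grid of pass rates, the kind of data a heat map would color. It is not how the Bird’s Eye report is implemented; the data, rule names, and the `pass_rates_by_team` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical scorecard results: (service, team, rule, passed).
results = [
    ("checkout-api", "payments", "has_runbook", True),
    ("checkout-api", "payments", "has_dashboard", False),
    ("ledger", "payments", "has_runbook", True),
    ("ledger", "payments", "has_dashboard", True),
    ("search", "discovery", "has_runbook", False),
    ("search", "discovery", "has_dashboard", True),
]


def pass_rates_by_team(rows):
    """Return {team: {rule: fraction of that team's services passing}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passed, total]
    for _service, team, rule, passed in rows:
        cell = counts[team][rule]
        cell[0] += int(passed)
        cell[1] += 1
    return {
        team: {rule: p / t for rule, (p, t) in per_rule.items()}
        for team, per_rule in counts.items()
    }


grid = pass_rates_by_team(results)
rule_names = sorted({rule for _, _, rule, _ in results})

# Print a plain-text grid: one row per team, one column per rule.
print("team".ljust(12) + "".join(r.ljust(16) for r in rule_names))
for team in sorted(grid):
    cells = "".join(f"{grid[team].get(r, 0.0):.0%}".ljust(16) for r in rule_names)
    print(team.ljust(12) + cells)
```

Read across a row to see where one team is struggling; read down a column to spot a rule that most teams fail, which usually points to a gap in tooling or documentation rather than in any single team.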

The insights that Scorecards provide, in particular, empower leaders to make informed decisions about the future of their organizations.

With the Initiatives tool, leaders can easily drive tactical progress on key campaigns by setting clear goals and deadlines. There’s no need to build manual reports — all the information you need exists within Cortex.

Plus, Initiatives can make life a lot easier for the SRE team. They can extend their reach by setting an automated Initiative and tracking progress within Cortex. Whether you’re trying to get everyone to adopt a specific package version or ensure all services have runbooks, Initiatives can help you drive progress toward that goal.
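Conceptually, a campaign like this boils down to a rule, a deadline, and a list of services that still have work to do. The sketch below models that idea in plain Python; the `Initiative` class, the `outstanding` helper, and the sample data are hypothetical and are not Cortex’s API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Initiative:
    name: str
    rule: str       # name of the check every service must pass
    deadline: date


def outstanding(initiative, service_checks, today):
    """Return (services still failing the rule, days left until the deadline)."""
    failing = sorted(
        service for service, passed in service_checks.items()
        if initiative.rule not in passed
    )
    days_left = (initiative.deadline - today).days
    return failing, days_left


# Hypothetical data: the set of checks each service currently passes.
service_checks = {
    "checkout-api": {"has_runbook", "has_dashboard"},
    "search": {"has_dashboard"},
    "ledger": set(),
}

runbooks = Initiative("Runbooks everywhere", "has_runbook", date(2024, 3, 31))
failing, days_left = outstanding(runbooks, service_checks, today=date(2024, 3, 1))
print(f"{runbooks.name}: {len(failing)} services to go, {days_left} days left")
print("Still missing:", failing)
```

Passing `today` in explicitly keeps the example deterministic; a real report would use the current date and notify each failing service’s owner.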

With these tools, you can keep your organization’s north star front and center. Scorecards and Initiatives are powerful tools that enable you to both set clear standards and provide visibility for everyone involved.

Scaffolder

Cortex’s built-in templating tool, the Scaffolder, makes it easy for developers to follow best practices in their code and enables you to achieve standardization across your organization. Powered by Cookiecutter, the Scaffolder allows you to create custom templates configured to follow all of your standards, empowering devs to spin up a new service in minutes.

The Scaffolder frees developers up to get more creative, knowing that the provided template will keep them in compliance. By reducing the overhead for developers, it becomes even easier for them to care about service quality. Plus, with tight integration with the Service Catalog, key information is tracked from day one.

How can Cortex help?

Cortex can transform your unwieldy tracking spreadsheet into meaningful, actionable information by helping you establish ownership and define standards across your organization. Cortex is a single source of truth for all of your services and assets, as well as a powerful tool that helps you drive progress toward your goals.

There are a number of other Cortex tools that can help you gain the visibility you need into your application, from the powerful Query Builder to the Dependency Graph, which allows you to quickly visualize how all of your services and infra components interact. Every tool you’ll find in this platform is designed to facilitate the process of creating a culture of accountability and establishing standardization across your organization.

If you’re interested in seeing how Cortex can create visibility and help you foster a culture of reliability, book a demo today. Make sure to subscribe to our blog and sign up for a future webinar to learn more about fostering accountability and empowering engineers to produce high-quality code.
