At DockerCon 2022, the CTO of Cortex, Ganesh Datta, had a conversation with Jonatan Ponzo from Rappi and Mike McClannahan from EquipmentShare about how they use and maintain microservices.
Jonatan has been at Rappi, an on-demand delivery company in Latin America, for almost four years. He has worked on multiple teams, including DevOps and SRE, and his current role is Core Platform Architect on the Developer team. Currently, Rappi has more than 2,000 services that support many different platforms such as RappiPay, an online neobank.
Mike has been Director of Core Engineering at EquipmentShare, an online platform to improve construction workers’ productivity, for almost seven years. The platform started as a monolith; as the company grew past 100 software engineers, they scaled the architecture, resulting in more than two dozen microservices.
You can read about the highlights of the session below.
How important is it to have ownership of services?
Both Mike and Jonatan stressed the importance of tracking service ownership as their organizations grew in size and complexity. Mike said that EquipmentShare’s original target customer was small businesses. As they moved upmarket, there was an increased focus on scale: “Larger customers had additional requirements such as SLAs, which made reliability a business requirement rather than an engineering afterthought.”
A simple view of load balancing and 500 errors became hard to keep track of, so EquipmentShare needed a better way of identifying and tracking services. In addition, it was hard to determine direct customer impact from a single system’s alert, so a more integrated solution was useful. With a centralized service catalog, they can make junior developers the owners of services so that those engineers learn and become more self-sufficient over time, increasing the longevity of the platform.
Jonatan echoed this, saying that Rappi wanted to track ownership of each service independently and have a clear view of which services were reliable. When a team builds a service, it is wholly responsible for its development, from the design process all the way through to maintenance via an on-call rotation and reliability via application performance monitoring. Team ownership is more important than individual ownership because engineers will work on different teams over time.
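As a rough illustration of the team-level ownership described above, a service catalog can be sketched as a mapping from each service to an owning team rather than to an individual. Everything in this sketch is hypothetical: the `Team` and `ServiceCatalog` types, the service names, and the on-call rotation are invented for illustration and are not taken from Rappi's or EquipmentShare's systems or from Cortex's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Team:
    """A hypothetical owning team with its on-call rotation."""
    name: str
    on_call: tuple[str, ...]  # illustrative rotation; first entry is the primary


@dataclass
class ServiceCatalog:
    """Maps each service name to the team that owns it (names are made up)."""
    owners: dict[str, Team] = field(default_factory=dict)

    def register(self, service: str, team: Team) -> None:
        # Ownership is recorded per team, not per engineer.
        self.owners[service] = team

    def owner_of(self, service: str) -> Team:
        return self.owners[service]

    def unowned(self, services: list[str]) -> list[str]:
        # Services that exist but have no registered owner.
        return [s for s in services if s not in self.owners]


payments = Team("payments", on_call=("alice", "bob"))
catalog = ServiceCatalog()
catalog.register("checkout-api", payments)
catalog.register("ledger", payments)
```

Because each entry points at a team, an engineer moving to another team only changes that team's rotation; the service entries themselves stay the same.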
How do you measure service reliability? Reliability metrics, postmortems, or something else?
Jonatan noted that Rappi has seen exponential growth over the past four years, with bottlenecks on popular days such as Mother’s Day. “We view reliability as akin to platform stability: fewer big incidents that can be handled smoothly is the goal,” he said. The focus is less on specific response codes and more on making sure incident response proceeds in a controlled fashion, which reduces fatigue for engineers.
Over the past year, Rappi has been ramping up measurement to figure out what teams are best at and what needs to improve. Some indicators that they look at (which are tracked in both development and production environments) are postmortems, outages, successful deployment rollbacks, and how mature each service is. They also think about ways to increase each team’s impact and how to provide help effectively.
At EquipmentShare, Mike said, “We have internal SLOs and rapid iteration phases. As our architecture changes, the values that we track also evolve.” Each team helps define which metrics matter most, giving leads responsibility for what to pursue next.
How do you ensure developers care about service quality while maintaining autonomy of their work?
It can be difficult to push for a focus on service quality. It’s easy to align on the goals at a high level, but it can be harder for engineers to see the value in specific projects (compared to feature work).
At Rappi, there’s a culture where developers care a lot about service quality and bring a high degree of urgency to quality fixes. Once a service is mature enough to focus on reliability, this becomes a persistent touchpoint. For all of their services, “it’s necessary to have 100% coverage before going to production. Generally, developers should be able to build quickly, but also be flexible.”
EquipmentShare maintains a balance of gatekeeping and visibility, with clear expectations of what is needed. Teams agree on principles, and then engineers trade off feature work against addressing any edge cases that come up. Mike sees engineering leadership as key: “It is also important for us to communicate upwards to executives so that service quality is part of topline business goals, not just part of technical debt.”
How have you used Cortex?
Mike said that Cortex maps to their remote teams quite well: “It helps us see where developers are needed, which services need adoption, and how each service is configured. It is also helpful from a leadership point of view to see how the organization is focused and how services are changing over time.”
Jonatan emphasized that the most valuable part of Cortex is in making sure continuous rollouts are reliable. “Previously, it was much more difficult to measure quality or reliability across teams and initiatives, but Cortex allows us to track this easily. Exposing all of the results to developers and having quality and reliability visible in one place adds a lot of clarity.”
A full recording of the fireside chat can be found here.
Cortex helps organizations with service ownership, reliability, and scaffolding with microservices. If this could improve your organization’s workflow, learn more and get a demo here.