SRE
Best Practice
code quality
Development
incident management
Production Readiness
reliability
SDLC
Testing
Security

Pocket Guide to Production Readiness (plus bonus framework!)

Discover what production readiness is, how to conduct efficient reviews, and why it’s time to elevate your production readiness checklist to a production readiness framework.

By
Lauren Craigie
-
April 19, 2024

Faster software development cycles mean greater reward, and greater risk. Organizations that lose sight of continuous alignment to standards risk delayed launches, higher churn, and higher costs, not to mention unproductive and unhappy developers.

Building a strong production readiness review process can help, but existing tools and frameworks haven’t made it easy to keep up to date with the increased velocity at which software is expected to ship. In this blog we’ll look at what production readiness is, why it matters, and what's new in best practice.

What is production readiness?

Production readiness is the process by which software is made secure, performant, reliable, and observable enough for on-going operational use. Achieving this state minimizes downtime, improves user experiences, and reduces critical failures in a live environment.

The “production readiness review” (PRR) process requires stakeholders to align on many factors that contribute to that end-state, and will necessarily look different for organizations in different markets, with different products, organizational structures, and technographic landscapes, as evidenced by our recent survey of the PRR process for engineering leaders.

Achieving production readiness requires an aggregation of cross-cutting standards like testing, monitoring, documentation, code reviews, observability, security controls, deployment workflows, and alignment across infrastructure like AWS, GitHub and Kubernetes. The process touches everything from onboarding developers and users to ensuring appropriate backup and disaster recovery. While it overlaps with, and is often partially under the remit of, site reliability engineering (SRE), it is more narrowly concerned with readiness prior to production.

In fewer words: it comes down to the definition of done. Of course, we know that software is ever-changing, which means “doneness” must now be continually re-evaluated as the code and the environment it lives in evolve. This requires stakeholders to be laser focused on creating shared definitions that can be automatically enforced, and routinely revisited.

Production readiness vs product readiness

It’s important to note a distinction that can be a bit confusing when first learning about this space. Production readiness is the process by which any software component (services, resources, APIs, infrastructure, etc.) is made ready for operational use, whereas product readiness is particular to a bundle of software components that comprise a product. The term product readiness often refers to the process preceding “launch” of a new multi-faceted capability, whereas production readiness should refer to an on-going effort to ensure each component remains adequately secure, performant, reliable, and observable. For a product to be “ready” means all of its components are production ready.

Why is production readiness important?

Production readiness reviews vary by industry, often influenced by each sector's operational requirements, regulatory frameworks, and user expectations. For example, in the United States, production readiness for healthcare companies will often have to take into account HIPAA (the Health Insurance Portability and Accountability Act), while defense companies are held to more stringent cybersecurity requirements and quality controls for selling to the federal government.

Sticking with the healthcare example, consider building an Electronic Health Records (EHR) system for deployment in a hospital. Production readiness would require the EHR software to demonstrate high reliability and user-friendly interfaces, but also robust security measures to protect sensitive patient data, while being flexible to each patient’s needs.

In one example of poor standards alignment, doctors were unable to track complex medical needs on an EHR and supplemented it with a paper chart. When a cancer patient received a dose of chemotherapy it was documented in the EHR, but not the paper chart. This avoidable workaround led to an additional unnecessary dose being given, and shows the importance of production readiness in software engineering.

The benefits of getting production readiness right for development teams include:

Reliability

Reliability ensures that a software system consistently performs according to its specifications and user expectations under normal and anticipated stress conditions. By making connection to monitoring and logging tools part of their production readiness review, developers can identify and rectify potential points of failure, leading to a decrease in downtime and fewer disruptions for users. This reliability fosters trust in the software, encouraging its adoption and continued use.

Observability

Observability refers to the ability to monitor and understand the internal state of a system based on its external outputs. A system prepared for production is equipped with comprehensive logging, monitoring, and alerting tools that provide real-time insights into its performance and health. This level of observability enables quick identification and resolution of issues before they affect users, enhancing the overall stability and reliability of the system.

Security

Security is a crucial consideration for any software system, especially those handling sensitive data or operating in regulated industries. But ensuring connection to things like vulnerability scanners or code coverage tools can be difficult at scale. Wrapping these into a unified production readiness process can help reduce risk or time spent resolving issues.

Incident Response and Management

If software is missing documentation, runbooks, roll-back, or on-call protocol, it becomes extremely difficult to ensure expedient incident response and remediation. Expectations for these things can be incorporated into a production readiness review, and should be continuously checked in order to ensure up-to-date information.

Scalability

Scalability ensures that a system can handle growth without compromising performance, whether it's in the form of increased data volume, user traffic, or transaction rates. Production readiness practices include designing systems with scalability in mind, such as using microservices architectures or scalable cloud resources. This ensures that the system can accommodate growth seamlessly, maintaining performance levels and providing a consistent user experience even as demand increases.

Information Discovery

Comprehensive documentation is an often overlooked but critical aspect of production readiness. Well-documented systems facilitate easier maintenance, troubleshooting, and future development efforts. Documentation should cover the architecture, codebase, deployment processes, and operational procedures, providing a valuable resource for both new and existing team members. This enhances team efficiency and ensures that the system can be effectively managed and evolved over time. By adding documentation to your production readiness framework, you're ensuring everyone who needs to respond to or improve a given software component has all the information necessary to do so.

Moving from a static production readiness checklist to a dynamic framework

The ideal production readiness checklist is one that doesn’t end when all the boxes are ticked. That is to say, production readiness now needs to be as dynamic as the software being evaluated. It’s no longer good enough to ensure bugs and vulnerabilities have been addressed before initial deployment; teams need to ensure continuous connection to code coverage tools and vulnerability scanners, while enforcing some standard of resolution time.
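As a minimal sketch of what an automatically checked "standard of resolution time" could look like, the Python below flags open vulnerabilities that have been open longer than a per-severity window. The `Vulnerability` record, the `RESOLUTION_SLO_DAYS` thresholds, and the sample findings are all hypothetical; in practice the findings would come from whichever scanner your team already uses.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical resolution windows, in days, per severity level.
RESOLUTION_SLO_DAYS = {"critical": 7, "high": 30, "medium": 90}


@dataclass(frozen=True)
class Vulnerability:
    """Illustrative record for one open finding reported by a scanner."""
    identifier: str
    severity: str
    opened_on: date


def overdue(findings, today):
    """Return findings that have been open longer than their severity's window."""
    late = []
    for finding in findings:
        limit = RESOLUTION_SLO_DAYS.get(finding.severity)
        if limit is None:
            continue  # No window is defined for this severity in this sketch.
        if (today - finding.opened_on).days > limit:
            late.append(finding)
    return late


findings = [
    Vulnerability("VULN-1", "critical", date(2024, 4, 1)),
    Vulnerability("VULN-2", "high", date(2024, 4, 1)),
    Vulnerability("VULN-3", "low", date(2023, 1, 1)),
]
overdue_ids = [f.identifier for f in overdue(findings, today=date(2024, 4, 19))]
print(overdue_ids)
```

Running a check like this on a schedule, rather than once before launch, is what turns the checklist item into a continuously enforced standard.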

Some production readiness frameworks are hundreds of lines long, but with the right integrations, automations, and continuous monitoring, even something of this size can be incredibly effective without being equally burdensome to developers. Here are a few dimensions to consider for your own software, though the below will necessarily change if you’re thinking about production readiness for an API vs a service, for example.

1. Ownership

  • Every component has an owner.
  • Ownership is synced with your Identity Provider.
  • Related team members and teams are identifiable.
  • Up and downstream dependencies are mapped.

2. Code Quality and Completeness

  • Connection to code coverage tools with acceptable rate of coverage.
  • Code review completed and approved by peers.
  • Unit tests cover critical functions and paths.
  • Integration tests validate interactions between components.
  • End-to-end tests simulate user scenarios.

3. Performance and Scalability

  • Connection to performance monitoring tools where SLOs are met.
  • Load testing to understand the application's behavior under expected traffic.
  • Stress testing to find the limits of the application components.
  • Scalability testing to ensure the application can handle growth in users/data.
  • Performance benchmarks established and met.

4. Security

  • Connection to relevant security and vulnerability scanners.
  • Security audit conducted to identify vulnerabilities.
  • Maximums set for known vulnerabilities.
  • SLOs set for vulnerability mitigation.
  • Penetration testing performed to assess defenses.
  • Data encryption implemented for data at rest and in transit.
  • Compliance with relevant industry security standards verified.

5. Observability

  • Connection to logging and monitoring tools.
  • Logging implemented for critical application events and errors.
  • Monitoring set up for key performance indicators (KPIs) and health metrics.
  • Alerting configured for system anomalies and thresholds.
  • Dashboard created for real-time system status visibility.

6. Reliability

  • SLOs defined and met.
  • Redundancy and failover mechanisms in place.
  • Backup and restore procedures tested.
  • Disaster recovery plan documented and rehearsed.
  • Dependency checks completed to ensure external services are reliable.

7. Infrastructure and Environment

  • Infrastructure as Code (IaC) used for consistent environment setup.
  • Separate environments established for development, testing, and production.
  • Resource limits and quotas reviewed to prevent resource exhaustion.
  • Network configurations validated (firewalls, load balancers).

8. Operational Readiness

  • Runbooks or operational documentation prepared for common tasks and incidents.
  • On-call rotation and escalation policies established.
  • Incident response drill conducted to test the team's readiness.
  • Post-mortem process defined for learning from incidents.

9. Regulatory and Compliance Checks

  • Data privacy policies reviewed for compliance with regulations (GDPR, HIPAA).
  • License audits completed for open-source components.
  • Accessibility standards compliance verified.
  • Data retention and deletion policies established.

10. User Documentation and Training

  • User documentation completed and accessible.
  • API documentation reviewed and updated.
  • Training materials prepared for end-users and support staff.
  • Feedback mechanisms in place for continuous improvement.
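One way to keep a checklist like the one above from going stale is to express each item as a rule that can be evaluated automatically, scoped to the component types it applies to. The sketch below assumes a simple in-memory catalog; the `Component` and `Rule` types, the metadata keys, and the 80% coverage threshold are illustrative assumptions, not a description of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Component:
    """Illustrative catalog entry; every field name here is hypothetical."""
    name: str
    kind: str  # e.g. "service" or "api"
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Rule:
    """One checklist item expressed as a predicate over a component."""
    description: str
    applies_to: List[str]
    check: Callable[[Component], bool]


RULES = [
    Rule("Has an owner", ["service", "api"],
         lambda c: bool(c.metadata.get("owner"))),
    Rule("Test coverage at or above 80%", ["service", "api"],
         lambda c: float(c.metadata.get("coverage", 0)) >= 80),
    Rule("On-call rotation configured", ["service"],
         lambda c: bool(c.metadata.get("on_call"))),
    Rule("API documentation published", ["api"],
         lambda c: bool(c.metadata.get("api_docs_url"))),
]


def evaluate(component, rules=RULES):
    """Return (passed, failed) rule descriptions for the rules that apply."""
    passed, failed = [], []
    for rule in rules:
        if component.kind not in rule.applies_to:
            continue  # Skip rules written for other component types.
        (passed if rule.check(component) else failed).append(rule.description)
    return passed, failed


service = Component("billing", "service",
                    {"owner": "payments-team", "coverage": 72, "on_call": True})
passed, failed = evaluate(service)
print(f"{service.name}: {len(passed)}/{len(passed) + len(failed)} rules passing")
print("Needs attention:", failed)
```

Because each rule declares the component types it applies to, an API and a service can share one framework while being held to different standards.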

Measuring production readiness

Production readiness requires the convergence of many data sources and stakeholders. Test coverage, defect density, security status, SLOs, documentation, and observability are all managed by different tools and teams, which is why it might be surprising that production readiness is still most often managed via some combination of spreadsheet, wiki, Git, and generic project management software.

The below image shows what the spreadsheet approach most often looks like—questionable information, out of date ownership, and uncertainty around package versions or connection to critical tools.

Typical production readiness system when handled via spreadsheet

This approach is often why organizations fall into the trap of assuming production readiness is a “one-time” process. But we know that’s no longer realistic—not when code, tools, and devs themselves change frequently at organizations, and software components balloon from dozens into the thousands.

Production readiness frameworks must contain some understanding of on-going health, via tools that check for functionality, latency, and change error rate. In order to make production readiness a living standard—as dynamic as the software being assessed—teams need a way to ensure continuous alignment to standards, in a way that removes friction for developers. They need an Internal Developer Portal.
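To make one of those on-going health signals concrete, here is a small sketch that treats change error rate as the share of recent deployments that caused a failure and compares it to a threshold. That definition, the `caused_failure` field, the sample deployments, and the 15% limit are assumptions chosen for illustration only.

```python
# Hypothetical upper bound on the acceptable share of failed changes.
MAX_CHANGE_ERROR_RATE = 0.15


def change_error_rate(deployments):
    """Fraction of deployments flagged as having caused a failure."""
    if not deployments:
        return 0.0  # No deployments means nothing has failed yet.
    failures = sum(1 for d in deployments if d["caused_failure"])
    return failures / len(deployments)


recent_deployments = [
    {"id": "deploy-101", "caused_failure": False},
    {"id": "deploy-102", "caused_failure": True},
    {"id": "deploy-103", "caused_failure": False},
    {"id": "deploy-104", "caused_failure": False},
]

rate = change_error_rate(recent_deployments)
within_limit = rate <= MAX_CHANGE_ERROR_RATE
print(f"Change error rate: {rate:.0%} (within limit: {within_limit})")
```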

The below image shows a typical production readiness checklist in Cortex, where multiple different integrated tools are referenced for on-going alignment to standards.

Screenshot of a Production Readiness Scorecard in Cortex

Best practice tips for assessing, tracking, and improving production readiness

Establishing your production readiness framework can be intimidating, as you will need to operate at an enterprise-wide scale. You may struggle with transparency and visibility into every dependency as your organization grows. Simple things like ownership are not guaranteed, and complexity creeps in even for well-run engineering teams.

Here are some tips for embedding a culture that prizes production readiness in your team.

Ensure ownership

Understanding who owns what is often the most challenging part of ensuring production readiness for all the software your team has developed. Without this, it’s impossible to drive action and attention to readiness gaps on any meaningful timeline. Ensure connection to your Identity Provider and set up a mechanism for understanding exactly when software loses or changes owners.
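A mechanism for spotting lost or changed ownership can be as simple as reconciling the catalog against an export from the Identity Provider. The sketch below assumes two hypothetical inputs, a mapping of component names to owner ids and a set of currently active ids; how those are fetched depends entirely on your own tooling.

```python
def find_ownership_gaps(component_owners, active_identities):
    """Split components into unowned ones and ones whose owner is no longer active.

    component_owners: mapping of component name -> owner id (or None).
    active_identities: set of user or team ids currently present in the
    identity provider export.
    """
    unowned = sorted(name for name, owner in component_owners.items() if not owner)
    stale = sorted(
        name for name, owner in component_owners.items()
        if owner and owner not in active_identities
    )
    return {"unowned": unowned, "stale_owner": stale}


# Illustrative data only.
component_owners = {
    "checkout-api": "team-payments",
    "search-service": "team-discovery",
    "legacy-reports": None,
}
active_identities = {"team-payments", "team-platform"}

gaps = find_ownership_gaps(component_owners, active_identities)
print(gaps)
```

Separating "never had an owner" from "owner no longer exists" matters because the two usually need different follow-up: the first is a catalog gap, the second is typically the result of a reorganization or departure.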

Shift to continuous evaluations

To ensure software stays aligned with standards met upon initial deployment, it's necessary to shift into a continuous evaluation framework. Tools like Internal Developer Portals (IDPs) can help, but only if you have all of the right data and integrations to tools where disparate standards already live.
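In practice, continuous evaluation means comparing each run against the last one, so that a component that quietly drifts out of alignment is surfaced. The following sketch assumes each run produces a snapshot keyed by a hypothetical (component, check) pair; the check names and data are made up for illustration.

```python
def diff_evaluations(previous, current):
    """Compare two evaluation snapshots keyed by (component, check) -> bool.

    Returns the checks that regressed (passing before, failing now) and the
    checks that were fixed (failing before, passing now).
    """
    regressed = sorted(
        key for key, ok in current.items() if not ok and previous.get(key) is True
    )
    fixed = sorted(
        key for key, ok in current.items() if ok and previous.get(key) is False
    )
    return regressed, fixed


last_week = {
    ("billing", "has_owner"): True,
    ("billing", "runbook_linked"): False,
    ("search", "has_owner"): True,
}
today = {
    ("billing", "has_owner"): False,  # Owner left the team since the last run.
    ("billing", "runbook_linked"): True,
    ("search", "has_owner"): True,
}

regressed, fixed = diff_evaluations(last_week, today)
print("Regressed:", regressed)
print("Fixed:", fixed)
```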

Centralize standards

You will have different standards for the observability, security, reliability, and usability of your software, and each set of standards will change by software type. It’s not necessary to reduce your list to the lowest common denominator, but it is critical to ensure your standards are centrally located and managed, and that the underlying data shares a common information architecture.

Bias towards automation where possible

Production readiness is a lengthy process for initial deployment, but a never-ending one once software is operational. Incorporating automation to check for alignment to standards and drive the required action is crucial for scaling your program, and your team. Automation also helps make data more accessible, and next steps more obvious for developers who can’t risk excessive context-switching throughout their day.

Establish clear readiness criteria

Establishing clear readiness criteria ensures that all stakeholders have a shared understanding of what it means for the software to be ready for production. This clarity helps in aligning efforts towards meeting predefined standards of performance, security, and usability, thus reducing the likelihood of production issues.

Adopt agile and DevOps methodologies

Agile and DevOps methodologies facilitate continuous integration and delivery, as well as an emphasis on transparency and production standards. This approach encourages iterative improvements, higher standards and closer collaboration between development, operations, and other teams, leading to more reliable and quicker deployments.

Use automated testing frameworks

Automation increases the efficiency and coverage of testing processes, allowing for more consistent and thorough validation of code changes. This ensures that regressions and bugs are caught early, significantly improving the reliability and stability of the software before it reaches production.

Prioritize risk-based testing

Risk-based testing focuses on identifying and testing the most critical components and functionalities, especially those related to security and compliance. By working closely with security teams, developers can institute a testing process that addresses vulnerabilities promptly, mitigating potential risks before they impact the production environment.

Encourage cross-functional collaboration

Collaboration breaks down silos between departments, and requires a culture of open communication and shared objectives. By working together more closely, teams can tap into diverse expertise to address challenges more effectively, enhancing the overall quality and readiness of the software.

Promote accountability

Production environments are often decentralized, and working within them effectively requires individuals and teams across the development and deployment process to have a high degree of ownership and accountability for their roles. Insisting on responsibility drives higher standards of work and ensures a proactive approach to addressing issues, contributing to the successful launch and operation of software in a production environment.

How can Cortex help?

Production readiness is based on defining best practice, then tracking, quantifying and assessing indicators to hit this target. Getting it right means drawing data from disparate sources, crunching and analyzing it before sharing insights with relevant stakeholders. Moving your Production Readiness Review to an Internal Developer Portal to ensure centralized, continuous alignment will be a crucial step as your team grows.

As we learned in the 2024 State of Production Readiness:

  1. Alignment is hard. As organizations adopt frameworks designed to help developers ship faster, an inability to manage the knock-on effect of information entropy has introduced new risk, hampered velocity, and degraded productivity.
  2. Automation is key. Most leaders identified activities in their production readiness checklist that they have not been able to automate. This has led to either lack of attention to these tasks, lack of continuity, or lack of efficient management.
  3. Assessment must be continuous. Leaders that reported the highest levels of confidence spoke of the importance of continuous standards enforcement, and have taken steps to make this a regular part of their production readiness lifecycle.

Cortex helps by:

Centralizing data and standards across all tools

Cortex’s fully custom catalogs, 50+ out of the box integrations, the ability to bring in custom data, and a rich plugin architecture enable teams to build new data experiences that best fit developer workflows. Any standard that governs how code is written, tested, secured, reviewed, or deployed can be unified, and even re-segmented by team, type, domain, or exemption status, to ensure all components are managed in context.

Applying always-on standards for continuous alignment

Because Cortex is a one-stop-shop for all your data and documentation, it can also serve as a means of continuously monitoring alignment to all software standards you define in a Scorecard. If code is updated, ownership changes, new tools are adopted, or old packages hit end-of-life, Cortex makes it easy to see what needs attention.

Providing approved templates to ship quality code quickly

Cortex enables teams to not only build production readiness standards into templates with boilerplate code, but also initiate any workflow from within the platform using a customizable HTTP request. So developers can do things like:

  • Make an API call to AWS
  • Deploy a service
  • Provision a resource
  • Assign a temporary access key
  • Create a JIRA ticket, or take any other action that would help the developer complete a task or, in this case, satisfy their production readiness framework.

If you’re interested in turning your production readiness checklist into a continuous framework, check out our production readiness product tour. Or try out Cortex for yourself by booking a custom demo!
