From your software projects you already know that even with checklists, peer reviews, controls, audits and compliance mechanisms in place, you still have problems. This is inevitable. It is time for your DevOps team and organization to build a culture of self-diagnostics, self-learning and self-improvement. Such a culture accepts problems, and its teams are ready when problems occur. Solving problems is not an exceptional state of work; it must be part of your daily work so that it contributes to the continuous learning and improvement journey of your organization. And you multiply the effect of every solution by making it transparent, available and easily accessible across your entire DevOps organization.
One of the prominent DevOps organizations, Netflix, has built an in-house tool (Chaos Monkey) to simulate catastrophic events in its cloud-based data centers. Chaos Monkey randomly destroys servers in production systems, so the Netflix team can gain additional assurance about its operational resilience, stability and uninterrupted service quality for its clients. From each failure they learn new lessons, and they use these lessons to make their systems even more stable and resilient.
If there is finger-pointing after incidents in your organization, it creates a culture of fear among engineers. Your organization then becomes slow, bureaucratic and a politically slippery landscape. Instead of consciously learning from errors, becoming organically more resilient against them and being more mindful and careful to prevent them, everyone in such an organization cares about self-protection. Work, problems and even solutions are never fully transparent.
Because problems are inevitable in complex systems, instead of finger-pointing, blaming and shaming those who cause problems, your organization should value actions that make problems visible in your daily work. It should encourage organizational learning from errors and inefficiencies, so everyone in your DevOps organization can learn and profit from these problems, solutions and knowledge. When engineers in your DevOps organization feel safe about giving details of mistakes, they voluntarily go the extra mile and spend a lot of energy to make sure that a similar problem will not happen again in their own work center or in other work centers in your organizational value stream. If engineers are punished, or even feel that they are punished, when they make mistakes, then they will be afraid of making mistakes, so
The goals of a post-mortem review are very simple:
Exploring what you did wrong is frightening, and in some organizations it is dangerous. If admitting mistakes opens you to criticism or discipline, you are unlikely to make such admissions. This strategy is ultimately self-defeating, since failing to understand a past mistake usually condemns you to repeating it in the future. Organizations that are serious about improvement understand this, and take the trouble to create a process and culture wherein it is safe to explore mistakes.
When you enter into a post-mortem review process, you must accept a few basic premises:
It is absolutely essential that everyone involved completely accept this "no blame, we are here to learn" model. Many organizations go to great trouble to create such safe environments. The FAA, for instance, has an Aviation Safety Reporting System, whereby pilots who make "mistakes" can gain immunity from regulatory discipline if they report those incidents.
Post-mortem reviews must always define actionable measures to prevent the incident from happening again. Examples of such preventive measures include new telemetry metrics, new automated test cases, identifying the types of changes that require additional code reviews, refactoring code, and decoupling complex system components that cause frequent problems.
Publish post-mortem review protocols and lessons learnt widely in your organization. This helps you convert local learnings from one work center in your value stream into organization-wide, global learnings. It also sends a clear message that your DevOps organization nurtures a culture of transparency, openness and learning.
A game day is not one of your typical boring team events where extraverts enjoy the show and introverts play with their mobile phones to speed up the flow of time.
On a game day, catastrophic failures are simulated in your test systems, and DevOps teams work on fixing and learning from these failures.
For instance, a critical server is terminated to validate that the failover mechanism operates successfully without service interruption. Then your DevOps team validates whether and how your recovery mechanism, from backups or from your Infrastructure as Code (IaC), works. Identifying problems in these failure scenarios helps your DevOps team build resilient, fault-tolerant systems and create learnings.
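A game-day drill of this kind can be rehearsed on a toy model before touching real test systems. The sketch below is only an illustration under stated assumptions: the Server and Cluster classes, the run_game_day function and the server names are hypothetical stand-ins, not a real cloud or Chaos Monkey API. It terminates one randomly chosen healthy server and checks whether the service still answers.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Hypothetical stand-in for a test environment; not a real cloud API."""
    servers: List[Server] = field(default_factory=list)

    def healthy_servers(self) -> List[Server]:
        return [s for s in self.servers if s.healthy]

    def handle_request(self) -> str:
        """Route a request to the first healthy server (a simplistic failover)."""
        healthy = self.healthy_servers()
        if not healthy:
            raise RuntimeError("service interruption: no healthy servers left")
        return healthy[0].name


def run_game_day(cluster: Cluster, rng: random.Random) -> Dict[str, object]:
    """Terminate one random healthy server, then check the service still answers."""
    victim = rng.choice(cluster.healthy_servers())
    victim.healthy = False  # simulated termination, nothing real is destroyed
    served_by: Optional[str]
    try:
        served_by = cluster.handle_request()
        survived = True
    except RuntimeError:
        served_by = None
        survived = False
    return {"terminated": victim.name, "served_by": served_by, "survived": survived}


cluster = Cluster([Server("app-1"), Server("app-2"), Server("app-3")])
report = run_game_day(cluster, random.Random(42))
print(report)
```

With three servers the drill should report that the service survived. Running the same function against a single-server cluster shows the interruption that the game day is meant to surface.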
While solving these problems, your DevOps team builds relationships with other departments as they rehearse failure events under non-stress conditions. You will test, and have a visible chance to improve, communication and troubleshooting processes within your larger global organization.
Furthermore, you will be able to observe weak signals of potentially larger issues that may reveal themselves in the future. Low-priority incidents that occur frequently during these failure scenarios, or a small side effect that came close to crashing another critical component in your architecture, are important weak signals that you should take into account and work on to improve your systems.
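One simple way to spot frequently recurring low-priority incidents is to count them per component. The sketch below assumes a hypothetical incident log of (component, priority) pairs and an arbitrary threshold of three. Both are illustrative choices, not a prescribed format or rule.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def find_weak_signals(
    incidents: Iterable[Tuple[str, str]], threshold: int = 3
) -> Dict[str, int]:
    """Return components whose low-priority incident count reaches the threshold."""
    counts = Counter(
        component for component, priority in incidents if priority == "low"
    )
    return {component: n for component, n in counts.items() if n >= threshold}


# Hypothetical incidents recorded during a game day.
game_day_log = [
    ("cache", "low"), ("cache", "low"), ("queue", "high"),
    ("cache", "low"), ("db", "low"), ("queue", "low"),
]
print(find_weak_signals(game_day_log))  # {'cache': 3}
```

Here the cache component stands out even though none of its incidents was urgent on its own, which is the kind of signal worth raising in the next review.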
Encourage calculated risk-taking in your DevOps team. High-performing DevOps organizations like yours make errors more often. This is not only OK; it is what your organization needs in order to learn and perform better.
Compared with typical organizations, high performers have 80% fewer critical failures in their production systems. In other words, they have one fifth as many incidents that impact their clients. This is why the engineers in your DevOps organization need to feel free to make errors and learn from them.