By analyzing logging data streams, DevOps teams can detect sometimes-subtle patterns that indicate changes in a system.
A typical DevOps environment employs continuous integration and continuous delivery (CI/CD), where developers commit code changes and put them into production as soon as they’re ready, instead of holding them back until the next update is released. In addition to speeding up the development process, CI/CD helps ensure that problems are detected and fixed quickly.
However, a purely reactive approach to the software development lifecycle (SDLC) is woefully insufficient in an increasingly digitized world where software failures can have catastrophic repercussions on businesses and, in the case of critical infrastructure, endanger human lives and health. Even short of a full-blown crisis, software bugs can degrade performance, waste resources, and cause security vulnerabilities, including vulnerabilities that can lead to costly and damaging data breaches.
That’s why DevOps teams are moving towards resilience engineering, where developers seek to build systems that can continue performing their core functions and avoid data loss even when unforeseen problems occur. Resilience engineering is a natural fit for DevOps, which promotes taking a systemic view following IT incidents, examining the factors that contributed to them, and using this information to prevent the incident from recurring. Resilience engineering takes this principle a step further by seeking to redirect DevOps teams’ focus away from pure incident response and towards proactive incident prevention.
Resilience engineering can’t work without access to data, not just real-time dashboards of a system’s current state but historical telemetry that paints a picture of a system’s behavior over time. Logs are the best source of information, but many DevOps teams still overlook them. Here are a few ways in which logging and log analysis promote resilience engineering and the development of resilient systems.
In a microservices architecture, “normal” is relative. A single service running slowly may not be cause for alarm, but in combination with other factors, it could be an early warning of a catastrophe. By recording what the behavior of an application or service looks like over time, logging enables developers to establish a baseline of normal behavior. Then, after each code commit, they can compare the application’s post-change behavior against that baseline and determine whether further action is needed. This process can be extrapolated across the organization’s entire data environment, enabling developers to identify trends and stay one step ahead of problems.
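As a rough illustration, the baselining idea above might be sketched in Python like this. The latency samples and the three-standard-deviation threshold are purely illustrative, not a prescribed method:

```python
import statistics

# Hypothetical per-request latencies (ms) parsed from service logs:
# a pre-change baseline window and a post-deploy window.
baseline = [102, 98, 110, 105, 99, 101, 97, 108]
post_deploy = [140, 152, 138, 149, 145, 151, 143, 148]

def deviates_from_baseline(baseline, current, threshold=3.0):
    """Flag the current window if its mean sits more than
    `threshold` standard deviations above the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return statistics.mean(current) > mean + threshold * stdev

print(deviates_from_baseline(baseline, post_deploy))  # True for these samples
```

In practice the baseline would be computed continuously over a rolling window per service, but the comparison step is the same: post-change behavior checked against pre-change normal.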
While unit and user interface testing are important, test results are snapshots of how an application or service behaved at a moment in time under specific circumstances. As a result, not all bugs show up during testing. Additionally, many problems are difficult to trace, especially in today’s highly complex, distributed data environments. By analyzing log outputs, developers can dig deeper, see what was happening immediately prior to the issue occurring, and put it in context.
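A minimal sketch of that “dig deeper” step: pulling the log lines immediately preceding each error so the lead-up can be read in context. The log format and marker strings here are invented for illustration:

```python
def context_before_error(log_lines, error_marker="ERROR", context=3):
    """Return each error line together with the `context` lines that
    immediately preceded it, so the lead-up can be inspected as a unit."""
    windows = []
    for i, line in enumerate(log_lines):
        if error_marker in line:
            windows.append(log_lines[max(0, i - context):i + 1])
    return windows

logs = [
    "INFO  request received",
    "WARN  cache miss, falling back to DB",
    "WARN  DB connection pool at 95% capacity",
    "ERROR DB connection timed out",
]
for window in context_before_error(logs):
    print("\n".join(window))
```

Even this toy version shows the value: the two warnings before the timeout point at pool exhaustion, which the error line alone does not.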
End users may be oblivious to non-fatal errors such as missing parameters, attempts to reconnect, and exceptions that are caught and handled, but if they’re not addressed, they can cause serious problems down the line, including security breaches and outages. Automated log analysis alerts developers to these otherwise invisible errors.
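One way such automated scanning might look, assuming a plain-text log and a few invented signature patterns for the non-fatal errors mentioned above:

```python
import re
from collections import Counter

# Illustrative signatures for non-fatal errors; real patterns would
# depend on the application's actual log format.
NON_FATAL_PATTERNS = {
    "retry": re.compile(r"retrying|attempting reconnect", re.I),
    "handled_exception": re.compile(r"caught \w+Error", re.I),
    "missing_param": re.compile(r"missing parameter", re.I),
}

def count_non_fatal(log_lines):
    """Tally occurrences of each non-fatal error signature."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in NON_FATAL_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

logs = [
    "WARN missing parameter 'timeout', using default",
    "INFO retrying request (attempt 2)",
    "WARN caught ValueError in parser, skipping record",
    "INFO retrying request (attempt 3)",
]
print(count_non_fatal(logs))
```

A real pipeline would run this continuously and alert when a count crosses a threshold, surfacing errors users never see.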
Part of resilience engineering is admitting that despite a team’s best efforts to bake resiliency into a system, occasionally things will break anyway. DevOps teams shouldn’t see IT incidents as “failures” but as normal and expected (if unpleasant) events. Incident response (IR) teams must be able to troubleshoot issues in a systematic way, not only to fix the problem at hand but also to glean actionable insight into preventing it from happening again. Log analysis is key to effective incident response. Without it, IR teams could miss a correlated event, perhaps something that happened early on in the SDLC, that is responsible for the current incident and could cause future issues.
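The correlated-event hunt can be sketched simply if log events carry a shared identifier. This assumes structured log events with a trace ID, which is one common convention, not a requirement of any particular tool:

```python
def correlated_events(log_events, incident_trace_id):
    """Pull every event sharing the incident's trace ID, so responders
    see the full chain of events rather than the failure alone."""
    return [e for e in log_events if e["trace_id"] == incident_trace_id]

# Hypothetical structured log events, oldest first.
events = [
    {"ts": 1, "trace_id": "a1", "msg": "config reloaded with stale value"},
    {"ts": 2, "trace_id": "b2", "msg": "healthy request"},
    {"ts": 3, "trace_id": "a1", "msg": "downstream call rejected"},
    {"ts": 4, "trace_id": "a1", "msg": "service crashed"},
]
for e in correlated_events(events, "a1"):
    print(e["ts"], e["msg"])
```

Here the stale config reload at the start of the chain, not the crash at the end, is the event worth preventing next time.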
Automation is a key DevOps principle. In addition to accelerating systems deployment, automation, when applied correctly, can be used for remediation of known issues that can’t be fixed with code patches and that will inevitably happen again. The data contained within incident response logs is invaluable to this process.
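A toy version of log-driven remediation might map known log signatures to runbook actions. Every signature and action name here is hypothetical:

```python
# Illustrative runbook actions; real ones would call orchestration APIs.
def restart_worker():
    return "worker restarted"

def clear_cache():
    return "cache cleared"

# Known log signatures mapped to their remediation (all invented).
RUNBOOK = {
    "OOMKilled": restart_worker,
    "stale cache entry": clear_cache,
}

def auto_remediate(log_line):
    """Trigger the runbook action for the first known signature found."""
    for signature, action in RUNBOOK.items():
        if signature in log_line:
            return action()
    return None

print(auto_remediate("pod api-7 OOMKilled, exit code 137"))  # worker restarted
```

The runbook itself is built from past incident logs: each recurring signature that once required manual intervention becomes an automated entry.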
It’s no longer enough for applications and systems to be available; they must be highly performant. A 100-millisecond delay in website load time can reduce ecommerce conversion rates by 7%, and 53% of mobile site visitors will leave a page that takes longer than three seconds to load. High latency also harms organizational productivity and efficiency. Log analysis enables DevOps teams to identify opportunities to shift load automatically during incidents, uncover patterns that warn of errors when load rises in an application, and map their data flows using their logs to reduce bottlenecks.
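Spotting latency bottlenecks in logs often starts with percentiles, since averages hide the slow tail. A minimal sketch, with invented latency samples and a simple nearest-rank percentile:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n).
    Crude, but enough to expose tail latency."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[max(k - 1, 0)]

# Hypothetical per-request latencies (ms) parsed from access logs.
latencies = [45, 52, 48, 51, 47, 420, 49, 50, 46, 380]
print(percentile(latencies, 50))  # 49
print(percentile(latencies, 95))  # 420
```

The median looks healthy while the 95th percentile is nearly an order of magnitude worse, which is exactly the kind of pattern that flags a bottleneck worth tracing through the logs.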
Building resilient systems isn’t just about looking for the patterns that led up to system crashes and other IT incidents, but also about finding the patterns that show how a system self-corrects in the face of an error. Analysis of incident response logs uncovers valuable information about fault tolerance, component redundancy, automatic safety mechanisms, capacity planning, and other features of resilient systems. This information can be used to improve current systems and build new ones.
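One self-correction pattern worth mining is how long the system takes to recover after each failure. A minimal sketch, assuming timestamped log events with invented "failed"/"recovered" markers:

```python
def recovery_times(events):
    """For each failure, measure the gap (in seconds) until the next
    recovery event, i.e. how quickly the system self-corrected."""
    recoveries = []
    failed_at = None
    for ts, msg in events:
        if "failed" in msg and failed_at is None:
            failed_at = ts
        elif "recovered" in msg and failed_at is not None:
            recoveries.append(ts - failed_at)
            failed_at = None
    return recoveries

# Hypothetical (timestamp, message) pairs from incident logs.
events = [
    (10, "primary DB failed"),
    (12, "failover to replica, recovered"),
    (40, "worker failed"),
    (47, "worker restarted, recovered"),
]
print(recovery_times(events))  # [2, 7]
```

A shrinking recovery-time trend is evidence that failover and retry mechanisms are working; a growing one is an early signal that resilience is eroding even if no outage has occurred yet.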