This was originally posted on The New Stack.
Once upon a time, log management was relatively straightforward. The volume, types, and structures of logs were simple and manageable.
However, over the past few years, all of this simplicity has gone out the window. Thanks to the shift toward cloud native technologies, such as loosely coupled microservices architectures, containers, and orchestrators like Kubernetes, the log management strategies of the past no longer suffice. Managing logs successfully in a cloud native world requires fundamental changes to the way logs are aggregated, analyzed, and stored.
Here’s how the cloud native revolution has changed the nature of log management and what IT and DevOps teams can do to continue managing logs effectively.
At first glance, log management in a cloud native world may not seem that different from conventional logging. Cloud native infrastructure and applications still generate logs, and the fundamental steps of the log management process—collection, aggregation, analysis, rotation—still apply.
But if you start trying to monitor a cloud native environment, it quickly becomes clear that managing logs efficiently and effectively is much more difficult. There are four main reasons why.
First and foremost, there are simply more logs to contend with.
Before the cloud native era, most applications were monoliths that ran on individual servers. Each application typically generated only one log (if it even created its own log at all; sometimes, applications logged data to syslog instead). Each server also typically generated only a handful of logs, with syslog and auth being the main ones. Thus, managing logs for an entire environment meant contending with only a few files.
In cloud native environments, by contrast, you typically work with microservices architectures, in which a dozen or more services may run, each providing a different piece of the functionality that composes the entire application. Every microservice may generate its own log.
Not only that, but there are more layers of infrastructure and, by extension, more logs. You have not only the underlying host servers and the logs they generate, but also logs created by the abstraction layer that sits between the application and the underlying infrastructure, such as Docker or Kubernetes or both, depending on how you use them.
In short, the shift to cloud native means that IT teams have gone from contending with a handful of separate logs for each application they support, to a dozen or more.
Not only are there more logs overall, but there are more types of logs. Instead of just having server logs and application logs, you have logs for your cloud infrastructure, logs for Kubernetes or Docker, authentication logs, logs for both Windows and Linux (because it’s more common now to be using both types of operating systems in the same shop), and more.
This variety adds complexity not only because there are more distinct types of log data to manage, but also because these various types of logs are often formatted in different ways. As a result, it is harder to parse all logs at once using regex matching or other types of generic queries.
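To make the parsing problem concrete, here is a small Python sketch using two hypothetical log lines. A regex written for a syslog-style format handles that format fine, but fails silently on a JSON-formatted container log, which needs an entirely different parser:

```python
import json
import re

# Two hypothetical log lines from the same environment: a classic
# syslog-style entry and a JSON-formatted container log.
syslog_line = "Jan 12 06:25:43 web01 sshd[4721]: Accepted publickey for deploy"
container_line = '{"time": "2023-01-12T06:25:43Z", "level": "info", "msg": "request served"}'

# A regex written for the syslog format...
syslog_pattern = re.compile(
    r"^(?P<month>\w{3}) (?P<day>[ \d]\d) (?P<time>[\d:]{8}) "
    r"(?P<host>\S+) (?P<process>\S+): (?P<message>.*)$"
)

# ...matches the syslog line but tells you nothing about the JSON line.
print(bool(syslog_pattern.match(syslog_line)))     # True
print(bool(syslog_pattern.match(container_line)))  # False

# The JSON line needs a different parser altogether:
print(json.loads(container_line)["msg"])           # request served
```

Multiply this by every distinct format in your environment, and a single generic query across all logs quickly becomes impractical.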
Along with the increase in the number and types of logs, there is now more complexity and variation in the way log data is exposed within application environments.
Kubernetes is a prime example. Kubernetes provides some built-in functionality for collecting logs at the node level, but the exact way it does that collection depends on how the node is configured. For example, on nodes running systemd, system components log to journald; otherwise, they write directly to .log files inside /var/log.
To make matters more complicated, Kubernetes has no native support for cluster-level logging, although, again, multiple approaches are possible. You could run a node-level logging agent on each Kubernetes node (typically deployed as a DaemonSet) to collect log data for the cluster, or you could run a logging agent in a sidecar container alongside each application. Alternatively, you could push log data to a logging backend directly from the application, provided your cluster architecture and application make this practical.
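As a sketch of the sidecar approach, the hypothetical pod spec below shares a log directory between the application container and a sidecar that streams the log file to its own stdout, where node-level tooling can collect it. The image names and file paths are illustrative, not prescribed:

```yaml
# Hypothetical sidecar logging pod: the app writes to a shared emptyDir
# volume, and the sidecar tails that file to stdout so a node-level
# agent (or `kubectl logs`) can pick it up.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-logging-sidecar
spec:
  containers:
  - name: app
    image: example.com/my-app:latest   # placeholder image
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-sidecar
    image: busybox:1.36
    args: [/bin/sh, -c, 'tail -n+1 -F /var/log/app/app.log']
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  volumes:
  - name: app-logs
    emptyDir: {}
```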
The bottom line here is that there is considerable variability in the way logging architectures are set up, even within the same platforms. As a result, it has become more difficult in cloud native environments to devise a uniform log management process that works consistently across all of the applications or platforms it needs to support.
A final challenge in cloud native logging arises from the fact that some cloud native applications lack persistent data storage. Containers are the prime example.
When a container instance stops running, all data stored inside the container is permanently destroyed. Thus, if log data is stored inside the container (as it often is, by default), it disappears along with the container. Because containers are ephemeral, with instances halting and being removed while new ones spin up automatically, admins are never prompted to save log data before a container shuts down. The container simply shuts down and is removed, taking your log data with it unless you have moved that data somewhere else beforehand.
This transience may be okay if you only care about working with log data in real time. However, if you need to keep historical logs available for a certain period of time, losing log data when containers stop running is unacceptable.
To respond to these challenges of logging in a cloud native world, teams can use the following guidelines.
With so many different log formats and logging architectures to support, trying to manage the logs for each system separately is not feasible.
Instead, implement a unified, centralized log management solution that automatically collects data from all parts of your environment and aggregates it into a single location.
Your log management tools and processes should be able to support any type of environment without you having to reconfigure the environment.
If you have, for example, one Kubernetes cluster that exposes log data in one way and a second cluster that logs in a different way, you should be able to collect and analyze logs from both clusters without having to change the way either cluster deals with logs. Likewise, if you have one application running on one public cloud and another one on a different cloud, you shouldn’t have to modify the default logging behavior of either cloud environment in order to manage its logs from a central location.
One way to ensure that logs from environments without persistent storage don't disappear is to collect log data in real time and aggregate it in an independent location. That way, log data is preserved in a persistent log manager as soon as it is generated and remains available even if the container shuts down.
This approach is preferable to trying to collect log data only at fixed periods from inside containers, which leaves you at risk of missing some logs if the containers shut down earlier than you expected.
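A minimal sketch of this real-time approach in Python (the function and path names are hypothetical, not from any particular logging tool): each line is flushed to persistent storage the moment it arrives, so nothing is lost if the source container disappears between collection intervals.

```python
from pathlib import Path

def ship_lines(source, sink_path):
    """Forward each log line to persistent storage as soon as it appears.

    `source` is any iterable of log lines (e.g. a container's stdout
    stream); `sink_path` is a file on storage that outlives the container.
    Flushing after every line means nothing is lost if the source dies
    between writes.
    """
    with Path(sink_path).open("a", encoding="utf-8") as out:
        for line in source:
            out.write(line.rstrip("\n") + "\n")
            out.flush()  # persist immediately; don't wait for a batch

# Simulated run: even if the "container" producing these lines vanished
# right after the second line, both lines would already be on disk.
if __name__ == "__main__":
    fake_container_stdout = [
        "2023-01-12T06:25:43Z info request served",
        "2023-01-12T06:25:44Z warn cache miss",
    ]
    ship_lines(fake_container_stdout, "/tmp/aggregated.log")
```

A production log shipper does far more (buffering, retries, backpressure), but the core guarantee is the same: log data leaves the ephemeral environment as soon as it is written.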
Instead of ignoring logs that are structured in ways that conventional analytics tools can’t support, take advantage of custom log parsers to work with data in any format. That way, you don’t risk missing out on important insights from non-standard logs.
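As an illustration, a custom parser layer can be as simple as a chain of per-format parsers that normalize every line into a common structure; the formats and function names below are illustrative:

```python
import json
import re

# One parser per known format, each normalizing to a common dict so that
# downstream analysis never has to care which system produced the line.
APACHE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3})'
)

def parse_json_line(line):
    record = json.loads(line)
    return {"time": record.get("time"), "message": record.get("msg", "")}

def parse_apache_line(line):
    m = APACHE_RE.match(line)
    if not m:
        raise ValueError("not an Apache access-log line")
    return {"time": m.group("time"), "message": m.group("request")}

def parse_any(line):
    """Try each known format in turn; extend with one function per format."""
    for parser in (parse_json_line, parse_apache_line):
        try:
            return parser(line)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            continue
    # Fall back to keeping the raw line rather than dropping it.
    return {"time": None, "message": line}
```

Because unparseable lines fall through to a raw-text record instead of being discarded, non-standard logs stay searchable even before you write a dedicated parser for them.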
Cloud native log management is fundamentally different from managing log data for conventional, monolithic applications. It’s not just that the scale of log data has increased (though it has), but also that there is much greater diversity when it comes to the way log data is recorded, structured, and exposed. Managing logs effectively in the face of these challenges requires a log management solution that fully centralizes and unifies log data from any and all systems that you support, while also providing the power to derive insights from non-standard log types.
This post is part of a larger series called Logging in the Age of DevOps: From Monolith to Microservices and Beyond. Download the full eBook here.