Kubernetes clusters produce a lot of data. On one hand, that’s a good thing because more data means more opportunity to gain visibility into the environment.
On the other hand, trying to manage all of the data generated by a Kubernetes cluster can feel like an overwhelming task. Without a solid plan in place for efficiently collecting, aggregating, and analyzing Kubernetes’s multiple data sources, engineers may quickly succumb to information overload.
Fortunately, avoiding information overload in a Kubernetes environment is possible. This article explains what DevOps and IT teams need to know about the types of data sources in Kubernetes and how to work with them so that engineers gain actionable visibility into the cluster rather than drowning in data they can’t use. It discusses Kubernetes logging fundamentals, explains why Kubernetes log management is uniquely challenging, and identifies best practices for devising an effective Kubernetes log management strategy.
Kubernetes data can feel overwhelming for several reasons.
The first reason is that there are multiple sources of data. At a high level, these sources can be broken down into three categories:
Within each of these categories, however, there are multiple sources of individual log data. A single application could produce several log streams or log files, for example. Cluster-level logs also come in multiple forms: operating system logs from nodes, the Kubernetes API server log, the Kubernetes scheduler log, and more.
The point here is that managing Kubernetes data requires working with a diverse set of data sources.
A second complicating factor is that Kubernetes clusters store data in a variety of locations. Engineers can’t run a simple journalctl or kubectl command or tail a few files in a central /var/log directory to get all of the data. Instead, they need to collect logs from the disparate locations where they are stored: containers in the case of application logs, worker node file systems for node operating system logs, master node file systems for logs related to Kubernetes itself, and so on.
A third challenge is that data in Kubernetes constantly changes and may even disappear.
Because Kubernetes continuously spins container instances up and down, terminates pods, and restarts them on different nodes, the data that engineers collect at one point in time may not accurately reflect the cluster state even a few minutes later.
In addition, because logs stored in the containers themselves disappear when the container shuts down, teams must pull log data out of containers before that happens, or it will be lost forever.
By default, Kubernetes restricts the amount of log data that can be stored for some types of logs. In particular, the container logs that the kubelet manages on each node are rotated once a log file reaches 10 MiB by default (the kubelet’s containerLogMaxSize setting), and only a limited number of rotated files is kept. Once that limit is reached, the oldest log data is discarded to make room for new data.
This rotation is a challenge because it adds further pressure to collect log data in real time or risk losing it permanently. Teams can change the default log size, but that’s not an ideal solution; it wastes disk space, and sooner or later, logs will need to be rotated no matter how large Kubernetes admins allow them to be.
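The effect of size-based rotation can be illustrated with a toy model. The sketch below is not how Kubernetes implements rotation; the class name SizeCappedLog and its max_bytes and max_files parameters are hypothetical, chosen only to show that once a size cap and a file cap are both in place, older lines are eventually discarded regardless of how large the caps are.

```python
from collections import deque
from typing import List


class SizeCappedLog:
    """Toy model of size-based rotation: a bounded number of bounded files."""

    def __init__(self, max_bytes: int, max_files: int) -> None:
        self.max_bytes = max_bytes
        # Oldest file sits on the left; maxlen makes the deque evict it on overflow.
        self.files = deque([[]], maxlen=max_files)
        self.current_size = 0
        self.dropped = 0  # number of lines lost to rotation so far

    def write(self, line: str) -> None:
        size = len(line.encode("utf-8")) + 1  # +1 accounts for the newline
        if self.current_size + size > self.max_bytes and self.files[-1]:
            if len(self.files) == self.files.maxlen:
                # The oldest file is about to be evicted, so its lines are lost.
                self.dropped += len(self.files[0])
            self.files.append([])  # start a fresh file
            self.current_size = 0
        self.files[-1].append(line)
        self.current_size += size

    def retained(self) -> List[str]:
        """Return every line that is still on disk, oldest first."""
        return [line for log_file in self.files for line in log_file]
```

Raising max_bytes or max_files in this model only delays the point at which dropped starts to grow; it does not remove it, which is why collecting the data elsewhere before rotation matters.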
Faced with this dizzying assortment of challenges from Kubernetes log data sources, storage locations, lifecycle timelines, and sizes, admins need a systematic plan for managing it all coherently. Otherwise, they risk overlooking or even permanently losing critical data, which in turn means the loss of important visibility into their clusters.
The following factors are the foundations of an effective Kubernetes log data management plan.
Kubernetes log data should be streamed in real time to a central location where it can be stored persistently. Without real-time log collection, teams run the risk that some log data may disappear when the container in which it lives shuts down or when the log file to which it is written becomes too large. Periodic collection of logs is not enough to guarantee full visibility.
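As a rough illustration of the difference between periodic and continuous collection, the sketch below forwards each complete line from a log file as soon as it appears. It is a minimal sketch, not a production shipper: the function name forward_new_lines and the ship callback (standing in for whatever sends data to the central location) are assumptions made for this example.

```python
from typing import Callable, TextIO


def forward_new_lines(source: TextIO, ship: Callable[[str], None]) -> int:
    """Forward every complete line written since the last call.

    Returns the number of lines shipped. A trailing partial line (no newline
    yet) is left in place so that it is picked up once the writer finishes it.
    """
    shipped = 0
    while True:
        position = source.tell()
        line = source.readline()
        if not line.endswith("\n"):
            # End of file or an unfinished line: rewind and try again later.
            source.seek(position)
            return shipped
        ship(line.rstrip("\n"))
        shipped += 1
```

A collector would call this function in a tight loop (or in response to file-change notifications) so that data leaves the node within moments of being written, rather than waiting for a scheduled batch job.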
Along similar lines, aggregating log data from the various data sources that exist in Kubernetes is critical. It would be wildly inefficient to attempt to analyze and manage each log file individually, given that a production cluster could easily contain several dozen or more individual logs (one for each node and container, not to mention for Kubernetes’s various services).
The best way to aggregate logs in Kubernetes is to deploy a single log aggregator like LogDNA within the cluster that can collect logs from all relevant sources and aggregate them to a central location specified by admins. Attempting to deploy a separate log aggregation agent for each node or container would be very tedious and inefficient. It would also increase the resource utilization of the cluster.
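To make the idea of aggregation concrete, the sketch below merges several per-source log streams into a single stream ordered by time, with each record tagged by where it came from. This is an illustration under stated assumptions, not a description of how LogDNA or any other aggregator works internally: the function names tag and aggregate are hypothetical, each input stream is assumed to be already in timestamp order, and timestamps are assumed to be ISO 8601 strings in UTC so that they sort correctly as plain text.

```python
import heapq
from typing import Dict, Iterable, Iterator, Tuple

LogRecord = Tuple[str, str, str]  # (timestamp, source, message)


def tag(source: str, lines: Iterable[Tuple[str, str]]) -> Iterator[LogRecord]:
    """Attach the source name to each (timestamp, message) pair."""
    for timestamp, message in lines:
        yield (timestamp, source, message)


def aggregate(streams: Dict[str, Iterable[Tuple[str, str]]]) -> Iterator[LogRecord]:
    """Merge per-source streams into one stream ordered by timestamp.

    heapq.merge reads the inputs lazily, so the streams do not need to fit
    in memory; it only requires that each input is already sorted.
    """
    tagged = [tag(name, lines) for name, lines in streams.items()]
    return heapq.merge(*tagged)
```

With the source name carried on every record, an engineer can read one interleaved timeline for the whole cluster and still filter down to a single node, container, or Kubernetes service when needed.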
Because Kubernetes takes a rather crude approach to rotating some logs by deleting old log data when logs reach a preset size, relying on Kubernetes to rotate logs is not typically a good idea. Kubernetes may remove historical log data that teams still need. Or, it may not rotate out old logs quickly enough. Plus, some logs aren’t rotated automatically by Kubernetes at all.
Instead of entrusting log rotation to Kubernetes, then, teams should aggregate logs in a central location and then rotate them as needed there. They should retain logs for whichever period is appropriate according to their storage capacities and business requirements.
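Once logs live in a central location, retention can be driven by age rather than by file size. The sketch below shows one simple way to express such a policy; the function name apply_retention and the keep_for parameter are hypothetical, and a real log store would normally enforce retention through its own settings rather than through application code like this.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Entry = Tuple[datetime, str]  # (timestamp, message)


def apply_retention(entries: List[Entry], keep_for: timedelta, now: datetime) -> List[Entry]:
    """Return only the entries that fall inside the retention window."""
    cutoff = now - keep_for
    return [entry for entry in entries if entry[0] >= cutoff]
```

Because the cutoff is a time span chosen by the team, it can be matched directly to storage capacity and business requirements, for example 14 days for debugging data and longer for audit logs.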
Because Kubernetes’s logging and data architecture differs so much from that of conventional application environments, managing Kubernetes data can be a tremendous challenge. However, by collecting logs in real time through a lightweight log aggregator and implementing effective log rotation, Kubernetes admins can avoid information overload and maintain visibility despite the platform’s unique logging challenges.