Kubernetes makes life as a DevOps professional easier by creating levels of abstractions like pods and services that are self sufficient. Now, though this means we no longer have to worry about infrastructure and dependencies, what doesn’t change is the fact that we still need to monitor our apps, the containers they’re running on, and the orchestrators themselves. What makes things more interesting, however, is that the more Kubernetes piles on levels of abstraction to “simplify” our lives, the more levels we have to see through to effectively monitor the stack. Across the various levels you need to monitor resource sharing, communication, application deployment and management, and discovery. Pods are the smallest deployable units created by Kubernetes that run on nodes which are grouped into clusters. This means that when we say “monitoring” in Kubernetes, it could be at a number of levels—the containers themselves, the pods they’re running on, the services they make up, or the entire cluster. Let’s look at the key metrics and log data that we need to analyze to achieve end-to-end visibility in a Kubernetes stack.
Performance issues generally arise from CPU and memory usage and are likely the first resource metrics users would want to review. This brings us to cAdvisor, an open source tool that automatically discovers every container and collects CPU, memory, filesystem, and network usage statistics. Additionally, cAdvisor also provides the overall machine usage by analyzing the ‘root’ container on the machine. Sounds too good to be true, doesn’t it? Well it is, and the catch is that cAdvisor is limited in a sense that It only collects basic resource utilization and doesn't offer any long term storage or analysis capabilities.
Why is this important? With traditional monitoring, we’re all used to monitoring actual resource consumption at the node level. With Kubernetes, we’re looking for the sum of the resources consumed by all the containers across nodes and clusters (which keeps changing dynamically). Now, if this sum is less than your node’s capacity, your containers have all the resources they need, and there’s always room for Kubernetes to schedule another container if load increases. However, If it goes the other way around and you have too few nodes, your containers might not have enough resources to meet requests. This is why making sure that requests never exceed your collective node capacity is more important than monitoring simple CPU or memory usage. With regards to disk usage and I/O, with Kubernetes we’re more interested in the percentage of disk in use as opposed to the size of our clusters, so graphs are wired to trigger alerts based on the percentage of disk size being used. I/O is also monitored in terms of Disk I/O per node, so you can easily tell if increased I/O activity is the cause for issues like latency spikes in particular locations.
There are a number of ways to collect metrics from Kubernetes, although Kubernetes doesn’t report metrics and instead relies on tools like Heapster instead of the cgroup file. This is why a lot of experts say that container metrics should usually be preferred to Kubernetes metrics. A good practice however, is to collect Kubernetes data along with Docker container resource metrics and correlate them with the health and performance of the apps they run. That being said, while Heapster focuses on forwarding metrics already generated by Kubernetes, kube-state-metrics is a simple service focused on generating completely new metrics from Kubernetes.These metrics have really long names which are pretty self explanatory; kube_node_status_capacity_cpu_cores and kube_node_status_capacity_memory_bytes are the metrics used to access your node’s CPU and memory capacity respectively. Similarly, kube_node_status_allocatable_cpu_cores tracks CPU resources currently available and kube_node_status_allocatable_memory_bytes does the same for memory. Once you get the hang of how they’ve been named, it’s pretty easy to make out what the metric keeps track of.
These metrics are designed to be consumed either by Prometheus or a compatible scraper, and you can also open /metrics in a browser to view them raw. Monitoring a Kubernetes cluster with Prometheus is becoming a very popular choice as both Kubernetes & Prometheus have similar origins and are instrumented with the same metrics in the first place. This means less time and effort lost in “translation” and more productivity. Additionally, Prometheus also keeps track of the number of replicas in each deployment, which is an important metric. Pods typically sit behind services that are scaled by “replica sets” which create or destroy pods as needed and then disappear. ReplicaSets are further controlled by “declaring state” for a number of running ReplicaSets (done during deployment). This is another example of a feature built to improve performance that makes monitoring more difficult. Replica sets need to be monitored and kept track of just like everything else if you want to continue to make your applications perform better and faster.
Now, like with everything else in Kubernetes, networking is about a lot more than network in, network out and network errors. Instead you have a boatload of metrics to look out for which include request rate, read IOPS, write IOPS, error rate, network traffic per second and network packets per second. This is because we have new issues to deal with as well, like load balancing and service discovery and where you used to have network in and network out, there are thousands of containers. These thousands of containers make up hundreds of microservices which are all communicating with each other, all the time. A lot of organizations are turning to a virtual network to support their microservices as software-defined networking gives you the level of control you need in this situation. That’s why a lot of solutions like Calico, Weave, Istio and Linkerd are gaining popularity with their tools and offerings. SD-WAN especially is becoming a popular choice to deal with microservice architecture.
Everything a containerized application writes to stdout and stderr is handled and redirected somewhere by a container engine and, more importantly, is logged somewhere. The functionality of a container engine or runtime, however, is usually not enough for a complete logging solution because when a container crashes, for example, it takes everything with it, including the logs. Therefore, logs need a separate storage, independent of nodes, pods, or containers. To implement this cluster-level, logging is used, which provides a separate backend to store and analyze your logs. Kubernetes provides no native storage solution but you can integrate quite a few existing ones.
Kubectl is the logging command to see logs from the Kubernetes CLI and can be used as follows:
This is the most basic way to view logs on Kubernetes and there are a lot of operators to make your commands even more specific. For example, “$ kubectl logs pod1” will only return logs from pod1. “$ kubectl logs -f my-pod” streams your pod logs, and “kubectl logs job/hello” will give you the logs from the first container of a job named hello.
Logs are particularly useful for debugging problems and troubleshooting cluster activity. Some variations of kubectl logs for troubleshooting are:
To get the most out of your log data, you can export your logs to a log analysis service like LogDNA and leverage its advanced logging features. LogDNA’s Live Streaming Tail makes troubleshooting with logs even easier since you can monitor for stack traces and exceptions in real time, in your browser. It also lets you combine data from multiple sources with all related events so you can do a thorough root cause analysis while looking for bugs.
Additionally, there are different logging levels depending on how deep you want to go; if you don't see anything useful in the logs and want to dig deeper, you can select a level of verbosity. To enable verbose logging on the Kubernetes component you are trying to debug, you just need to use --v or --vmodule, to at least level 4, though it goes up all the way to level 8. While level 3 gives you a reasonable amount of information with regards to recent changes made, level 4 is considered debug-level verbosity. Level 6 is used to display requested resources while level 7 displays HTTP request headers and 8 HTTP request contents. The level of verbosity you choose will depend on the task at hand, but it’s good to know that Kubernetes gives you deep visibility when you need it.Kubernetes monitoring is changing and improving every day because at the end of the day, that’s the name of the new game. The reason monitoring is so much more “proactive” now is because everything rests on how well you understand the ins and outs of your containers. The better the understanding, the better the chances of improvement, the better the end user experience. So in conclusion, literally everything depends on how well you monitor your applications.