Your phone rings at 2:00 AM on a holiday weekend. Someone has paged you to a sev 1 bridge where twenty other people are waiting for you to help find the root cause of a production-impacting incident. Your mind may be hazy, but you instinctively check the performance monitoring dashboard. There, you see a clear CPU spike on a cluster of servers. The spike started at 3:50 PM yesterday and triggered fatal errors that eventually stopped your application from processing transactions. You take a quick look at your Git repository, identify a pull request that was merged at the exact time the CPU spike began, and roll back the change.
This scenario is common in today’s IT environments. Without a proper logging system, it would be nearly impossible to correlate an incident like this with a recent change in your environment. Event logs serve an essential purpose because they record actions taken by an operating system or an application on a server or a device. As systems become more complex and more distributed, so does the troubleshooting required to fix them. For this reason, you should never treat logging as an afterthought. Instead, you should regard it as a necessity.
Event logs can mean different things depending on whom you ask. From an operations perspective, event logs are critical for monitoring how the application behaves and for quantifying the environment’s overall health. From a security perspective, event logs are crucial because they enable security analysts to review audit trails and correlate actions that could indicate threats or breaches.
Among the most familiar event logs are Windows Performance Monitor logs, also known as perfmon counters. These counters can include information about CPU, memory, and I/O activity. It’s common to graph these metrics to monitor system health visually and to build alerts around them so that you can respond in real time to known conditions that could indicate a problem. For example, you can track CPU percentage broken down by host over a specific time range, or alert whenever one of these metrics exceeds a hard threshold.
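As a rough illustration of the hard-threshold approach, the Python sketch below groups CPU samples by host within a time range and reports each host that crossed a limit. The `CpuSample` type, the `hosts_over_threshold` function, and the 90 percent default are hypothetical names and values chosen for this example; they do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CpuSample:
    """One CPU reading taken from a host (hypothetical record shape)."""
    host: str
    timestamp: datetime
    cpu_percent: float


def hosts_over_threshold(samples, start, end, threshold=90.0):
    """Return {host: peak CPU percent} for hosts that exceeded the
    threshold at least once in the half-open range [start, end)."""
    peaks = {}
    for sample in samples:
        in_range = start <= sample.timestamp < end
        if in_range and sample.cpu_percent > threshold:
            # Keep the highest reading seen for this host.
            peaks[sample.host] = max(peaks.get(sample.host, 0.0),
                                     sample.cpu_percent)
    return peaks
```

A real alerting rule would usually also require the condition to hold for several consecutive samples, so that a single momentary spike does not page anyone.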
If you are a security analyst, it’s vital to understand common indicators of compromise such as unusual network traffic, an abnormal number of user lockouts, and repeated failed password attempts. Most modern logging systems allow you to track these numbers over time to distinguish between regular “business as usual” activity and malicious attempts to access the system. In addition to letting you analyze security event log messages across thousands of servers at a particular point in time, a sound logging system will also alert you in real time about predefined situations that represent a threat to the environment.
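One simple way to separate routine mistyped passwords from a likely attack is to count failed logins per user inside a sliding time window. The sketch below is a minimal example of that idea. The event tuple layout `(timestamp, user, outcome)`, the `"failure"` label, the function name, and the limits of five failures in five minutes are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def flag_failed_login_bursts(events, window=timedelta(minutes=5),
                             max_failures=5):
    """Return the set of users with more than max_failures failed
    logins inside any sliding window of the given length.

    events: iterable of (timestamp, user, outcome) tuples, sorted by
    timestamp (an assumed layout, not a real log format).
    """
    recent = defaultdict(deque)  # user -> timestamps of recent failures
    flagged = set()
    for timestamp, user, outcome in events:
        if outcome != "failure":
            continue
        failures = recent[user]
        failures.append(timestamp)
        # Drop failures that have fallen out of the window.
        while timestamp - failures[0] > window:
            failures.popleft()
        if len(failures) > max_failures:
            flagged.add(user)
    return flagged
```

The right limits depend on what “business as usual” looks like in your environment, which is why tracking these counts over time matters.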
Monitoring event logs helps you understand how the system is operating and gives you insight into its health. Event logs are like the diagnostic records that tell you what’s happening under the hood of your car when the check engine light comes on. Many monitoring dashboards use event logs to provide a centralized view of your system. A centralized view also speeds up troubleshooting because many users can access the same data concurrently, which is necessary when dozens of people are looking for answers to resolve an ongoing problem quickly.
Event logs are most helpful for gaining real-time or near-real-time insight into the health of your system. Many people see older event logs as low-quality data and set short retention policies to save room for what they view as higher-quality data. However, older event logs offer a lot of value because you can use them to construct baselines of normal behavior and identify what to expect at specific times of day and on particular days of the week. Knowing these standard behavior patterns can reduce the number of false alerts that require you to climb out of bed or drop what you’re doing to fix a non-issue.
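To make the baseline idea concrete, the sketch below buckets historical metric values by day of week and hour, then flags a new reading only if it is far from what is normal for that slot. The function names, the tolerance of three standard deviations, and the minimum spread of 1.0 are assumptions chosen for this example, not a recommended configuration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def build_baseline(history):
    """Group (timestamp, value) pairs by (weekday, hour) and return
    {(weekday, hour): (mean, standard deviation)}."""
    buckets = defaultdict(list)
    for timestamp, value in history:
        buckets[(timestamp.weekday(), timestamp.hour)].append(value)
    return {slot: (mean(values), pstdev(values))
            for slot, values in buckets.items()}


def is_anomalous(baseline, timestamp, value, tolerance=3.0):
    """Return True if value deviates from the baseline for its
    weekday and hour by more than tolerance standard deviations."""
    slot = (timestamp.weekday(), timestamp.hour)
    if slot not in baseline:
        return False  # assumption: no history means no alert
    average, spread = baseline[slot]
    # Assumed floor of 1.0 so a perfectly flat history does not
    # turn every tiny change into an alert.
    return abs(value - average) > tolerance * max(spread, 1.0)
```

With a baseline like this, a CPU reading of 80 percent during a nightly batch job that always runs hot would not raise an alert, while the same reading on a normally quiet afternoon would.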
Event logs serve an essential purpose because they provide critical metrics for tracking overall system health. You can aggregate and graph these metrics over time to get visual insight into the system’s health across many data points. In addition, you can configure alerts on event logs to notify you in real time about possible problems within the system. In short, monitoring event logs can vastly improve your overall monitoring strategy.