Monitoring solutions are a vital component in managing an application’s environment. From the systems layer all the way up to the end user’s connection to the app, you want to know how the platform is performing. Indicators like CPU, memory, the number of connections, and overall health help teams make informed decisions that keep the platform up.
Teams monitor metrics (short-term information) and logs (long-term information) mainly from a reactive perspective. All the data is in the logging system and can be queried when needed. The most relevant example is when an application is down and the application owner looks for indicators to find the cause of the incident.
Alerting enables teams to identify problems quickly rather than watching the monitoring solutions’ dashboards all day and all week, waiting for something to happen. Instead, alerts respond to metrics hitting certain thresholds by notifying teams, letting those teams know when something is wrong with an app or service so they can get the system fixed quickly. However, in today’s systems, alerts can come from many different communication tools — from email to chat to SMS — and these streams of communication can be constant, leading to alert fatigue.
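The threshold-driven alerting described above can be sketched in a few lines. This is a minimal illustration, not a real monitoring system; the metric names and threshold values are hypothetical.

```python
# Minimal sketch of threshold-based alerting: a metric sample triggers an
# alert message once it crosses a configured threshold.
# The metric names and thresholds below are illustrative assumptions.

THRESHOLDS = {"cpu_percent": 90, "memory_percent": 85}

def check_thresholds(sample: dict) -> list[str]:
    """Return alert messages for every metric that exceeds its threshold."""
    return [
        f"ALERT: {metric}={value} exceeds threshold {THRESHOLDS[metric]}"
        for metric, value in sample.items()
        if metric in THRESHOLDS and value > THRESHOLDS[metric]
    ]
```

In a real monitoring stack, the same idea runs continuously against incoming metric streams rather than a single sample.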
Instead of being more efficient, having many streams of notifications can be a productivity killer because teams may become overwhelmed by an overload of notifications in their email inbox or repeated message streams in a Slack or Teams channel. With many notifications that don’t always include all the right information, teams may start to skim or overlook alerts, which can lead to missing important information.
Apart from the overload of notifications and alerts, another point to watch out for when discussing alert fatigue is making sure notifications and corresponding alerts are routed to the correct triage and support teams.
No one wants to receive unnecessary alert notifications, but there is no way around them. Systems and applications go down at some point, and teams must take the necessary actions to check and investigate incidents. An essential solution here is to optimize the handling of alerts instead of neglecting them.
While it might seem daunting at first, there are a few basic steps you can take to make your alert notifications much more efficient. First, make sure you build out different notification paths based on the severity of the incident. Alerts related to a business-critical workload outage could be routed as a text message (SMS) to mobile phones or via a mobile app notification, whereas traditional email might be the least efficient channel for this kind of business-critical scenario.
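Severity-based routing can be sketched as a simple mapping from alert severity to notification channel. The channel names and the `print` stand-in below are hypothetical placeholders, not any specific vendor’s API.

```python
# A minimal sketch of severity-based alert routing. The channel names and
# the routing table are illustrative assumptions; a real system would call
# into SMS, push, chat, or email providers instead of printing.

SEVERITY_ROUTES = {
    "critical": ["sms", "mobile_push"],  # business-critical: fastest channels
    "medium": ["chat"],                  # team chat (Slack, Teams, etc.)
    "low": ["email"],                    # visibility and archiving
}

def route_alert(severity: str, message: str) -> list[str]:
    """Send an alert to the channels configured for its severity."""
    channels = SEVERITY_ROUTES.get(severity, ["email"])  # default to email
    for channel in channels:
        print(f"[{channel}] {message}")  # stand-in for a real send call
    return channels
```

The key design point is that the severity, not the sender, decides the channel, so critical outages never end up buried in an inbox.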
Next, make sure you send the notifications to the correct people or teams to get involved. Too many cooks in the kitchen is a valid statement for IT scenarios as well. If all IT team members receive the alert, it’s harder to direct the proper responders to the incident.
Finally, avoid information overload by including only the necessary information in alerts, and by making sure that information is correct. As with the previous item, alerts should not only be routed to the proper team to respond but should also be concise, with all the data needed to start taking action and without unnecessary details or repetitive information. Also, avoid generating a multitude of alerts related to the same root cause. For example, when a server is down, you shouldn’t receive additional alerts about its network and storage components being down too. You already know they’re down.
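The root-cause suppression in the server example above can be sketched with a dependency map: if a component’s parent resource is already firing, its own alert is dropped. The resource names and dependency map are hypothetical.

```python
# A minimal sketch of suppressing child alerts when their parent resource is
# already known to be down (same root cause). The resource names and the
# dependency map are illustrative assumptions.

DEPENDS_ON = {
    "server1/network": "server1",
    "server1/storage": "server1",
}

def filter_alerts(firing: set[str]) -> set[str]:
    """Drop alerts whose parent resource is also firing."""
    return {
        alert for alert in firing
        if DEPENDS_ON.get(alert) not in firing
    }
```

With this filter, a downed server produces one alert instead of three, and responders see the root cause rather than its symptoms.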
For more on optimizing team notifications, consider reading the article How to Notify your Team of Errors: Email vs. Slack vs. PagerDuty.
For your business-critical workloads and their corresponding critical alerts, you should look at a solution that’s capable of sharing real-time, high-risk, business-critical incident information as the first layer of communication.
Besides sending alerts, this tool should also help with incident management, shortening the mean time to detection (MTTD) and mean time to resolution (MTTR) by alerting only the necessary teams and providing automation workflows that help mitigate the incident.
As a second layer, broader team communication tools (think of Slack, Teams, Discord, and many other channel-based platforms) can be helpful for medium-risk alerts. The benefit of this kind of communication is that information coming from the first layer (the critical alert notification system) can be filtered into the correct channel, depending on the root cause or the affected application platform. Another benefit is that the history of what happened with the platform involved is also typically available in those channels, making it easier to retrieve information and act, especially when the alert is sent to the correct channel members.
Last, email is a good method for low-risk alerts, visibility, and archiving or compliance purposes. It’s also useful for sharing informational updates about the application workload with stakeholders within the business. But keep in mind that email is less efficient when it comes to high- or medium-risk alerting.
Now that you know how to prevent noisy alerts by directing the alert notifications to the appropriate communication tools, it is equally important to optimize error messages.
Instead of sharing all error-related messages as part of the alert, you should at least tune the logs within the application and system to avoid information overload. An easy example is Windows server event logs, which classify every message as informational, warning, or error. From there, you can drill down to more specific message details using powerful filters such as the error timestamp, the service involved, the event ID number, or a plain-text string search in the body of the error message.
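The same informational/warning/error split exists in most logging frameworks. As an illustrative sketch (using Python’s standard logging module rather than Windows event logs), a handler threshold keeps informational noise out of the alert stream while warnings and errors still get through:

```python
# Sketch of tuning log verbosity: only records at or above the handler's
# severity threshold reach the alerting side, analogous to filtering the
# Windows informational/warning/error categories.

import logging

class AlertHandler(logging.Handler):
    """Collects the messages that pass the severity threshold."""
    def __init__(self, level=logging.WARNING):
        super().__init__(level)
        self.alerts = []

    def emit(self, record):
        self.alerts.append(record.getMessage())

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
handler = AlertHandler()               # only WARNING and above are collected
logger.addHandler(handler)

logger.info("cache refreshed")         # informational: filtered out
logger.warning("disk usage at 85%")    # warning: collected
logger.error("service unreachable")    # error: collected
```

The logger still records everything at INFO and above for troubleshooting, while the alert-facing handler only sees what is worth waking someone up for.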
Similar to alert fatigue, system administrators get overloaded with error messages, and frankly, not all of that information is useful for troubleshooting. Yet, at the same time, when an incident is very specific, you may need much more verbose and detailed error messages and logs to help in troubleshooting. Finding the right balance between capturing too much and too little information is a key factor in overall IT operations and development cycles.
Systems and applications generate a lot of metrics, logs, and other information, which often becomes too much to handle during a critical outage. As IT teams get bombarded with alerts, both critical and false positives, it is tempting to neglect them. By routing the right alerts to the correct team of responders and using the appropriate tool for the level of severity and information you want to share, this alert fatigue can be reduced dramatically.
LogDNA understands the need for teams to have different notifications for different types of issues and has plenty of configuration options to handle alerts. Learn more in our docs about LogDNA’s integrations with tools that can help at each point in the notification and communication stream, from funneling critical alerts into channel-based communication and email to custom webhooks.