How to Notify Your Team of Errors: Email vs. Slack vs. PagerDuty

4 MIN READ


Site Reliability Engineering (SRE) and Operations (Ops) teams rely heavily on notifications. We use them to know what’s going on with application workloads and how applications are performing. Notifications are critical to ensuring SREs and Ops teams can resolve errors and reduce downtime. They’re also crucial when monitoring environments, not only in production but also during the dev-test or staging phase.

Monitoring tools are essential. They collect and provide the data that teams need to investigate and mitigate issues. Teams no longer have to watch complex monitoring dashboards, though. We all remember the 40-inch flat screens hanging in the IT helpdesk office. These dashboards mainly showed reactive information: when an incident occurred, the screen would switch an object from green (healthy) to amber (warning) or red (alert). Since then, many other ways of notifying people have evolved. This article discusses three of them: email, Slack, and PagerDuty.

The Importance of Error Notifications

The pandemic has forced many people to work remotely. With so many of us working from home, a big screen in the IT helpdesk office no longer helps. But even when SRE and Ops engineers were still working in the office, the big flat-screen monitor had already lost its charm.

The color of an object on a dashboard doesn’t reliably identify the root cause or the full impact of an outage. Some alerts, like a partial or complete outage, need our immediate attention and must be resolved as soon as possible. Others, such as an application occasionally throwing a non-critical exception, can wait until we have time available.

When it comes to IT incident management, there are two important definitions to know:

Mean Time to Detection (MTTD) is the average time it takes to detect and identify an incident. The shorter the MTTD, the faster DevOps teams can start investigating and fixing the incident.
Mean Time to Resolution (MTTR) is the average time it takes to mitigate an alert or incident. Again, the shorter, the better.
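
As a rough illustration of how these two metrics could be computed from incident records, here is a minimal Python sketch. The `Incident` record and its field names are hypothetical, and measuring MTTR from the moment of detection (rather than from the start of the fault) is an assumption; teams define the starting point differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; the field names are illustrative only.
    started: datetime    # when the fault actually began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when the incident was mitigated


def mttd(incidents):
    """Mean time to detection: average of (detected - started)."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents):
    """Mean time to resolution, measured here from detection to mitigation."""
    seconds = mean((i.resolved - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)
```

For example, two incidents detected after 4 and 10 minutes give an MTTD of 7 minutes.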

Unfortunately, there is no single target for these two metrics, because each incident might require a different response time depending on its root cause and criticality. A low-criticality issue can tolerate a lengthy resolution time, while a high-criticality one needs to be resolved as quickly as possible.

A recommended way to triage alerts and incidents is to set up proper notification channels. Coordinating your notification channels well lets you identify the severity and frequency of a situation and route it to the right person through a channel that they’re comfortable with and, more importantly, will acknowledge.

Error Notification Channels

Let’s compare some common uses of notification channels:

High-risk (critical and real-time) notifications using PagerDuty
Medium-risk alerts on Slack
Low-risk notifications using email
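
The mapping above can be sketched as a small routing table. This is only an illustration: the `Severity` enum and the channel labels are hypothetical names, not part of any product’s API.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical severity levels matching the three risk tiers above.
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Channel labels are illustrative; a real setup would map to integrations.
CHANNEL_BY_SEVERITY = {
    Severity.HIGH: "pagerduty",
    Severity.MEDIUM: "slack",
    Severity.LOW: "email",
}


def route(severity: Severity) -> str:
    """Return the notification channel for an alert of the given severity."""
    return CHANNEL_BY_SEVERITY[severity]
```

Keeping the mapping in one table makes it easy to review who gets interrupted for what, and to change it without touching the alerting code.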

PagerDuty is an incident management tool. It analyzes signals from our IT environment, whether it runs on-premises, in a public cloud, or in a hybrid cloud. PagerDuty recognizes alerts coming from monitoring tools and identifies similar incidents. It then helps on-call teams run automated playbooks and keeps them up to date with relevant information. PagerDuty also relies on machine learning to recognize possibly related incidents, which enables detailed analysis using public and in-house information sources. These capabilities make PagerDuty a strong fit for handling high-risk, business-critical outages, and they help reduce both MTTD and MTTR.
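
As a hedged sketch of what sending a critical alert to PagerDuty might look like, the following Python builds a trigger event in the shape used by the PagerDuty Events API v2. The routing key, summary, and source values are placeholders, and the helper names are hypothetical; check the current PagerDuty documentation before relying on the details.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build the JSON body for a 'trigger' event (helper name is illustrative)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event, timeout=10):
    """POST the event and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Reusing one `dedup_key` per outage keeps repeated triggers from paging the on-call engineer multiple times for the same problem.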

Slack is a collaboration and communication platform that uses channels. In IT incident management, we can use Slack to contact the right people through a specific channel. We can integrate it with a DevOps pipeline that deploys application workloads and use it for channel notifications, such as updates about failed pipeline runs. The team working on pipeline deployment will immediately see the notifications and act. In this case, that could mean reverting to the last pipeline run. In the event of outages and larger incidents, Slack is helpful as a forum for teams to discuss ideas. Organizing application landscape components in dedicated channels can shorten both MTTD and MTTR, even if the issues are not business-critical. Plus, these channels are accessible to developers, infrastructure teams, and any other required stakeholders.
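
A failed pipeline run could be reported to a channel through a Slack incoming webhook, roughly as sketched below. The pipeline name, run ID, and helper names are hypothetical, and the webhook URL is something you would create in your own Slack workspace.

```python
import json
import urllib.request


def build_pipeline_failure_message(pipeline, run_id, error):
    """Build an incoming-webhook body; the wording is illustrative."""
    text = f":x: Pipeline `{pipeline}` failed (run {run_id}): {error}"
    return {"text": text}


def post_to_slack(webhook_url, message, timeout=10):
    """POST the message to an incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,  # assumed to be an incoming-webhook URL you created
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Because each webhook is tied to one channel, using a separate webhook per application component is one way to keep notifications in the dedicated channels described above.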

While email used to be the typical notification medium, it’s lost momentum as a business-critical notification method. One reason for this is that it leads to information overload, known as incident information fatigue. Incident information fatigue is often the product of receiving hundreds or thousands of emails sent to multiple distribution groups. As a result, nobody acts or knows what to do.

Though it doesn’t always reduce MTTD and MTTR, email is still a viable solution for errors that teams will need to solve eventually. Email also helps with archiving and compliance. For example, we might summarize the incident email traffic from the last few days or weeks and use that summary to inform sprint planning. Since most systems can communicate with email, it can also be a useful last resort to inform teams using legacy systems.
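
One way to avoid flooding inboxes is to batch low-risk alerts into a single digest. The sketch below uses Python’s standard `email` package; the alert names and addresses are made up, and actually sending the message (for example with `smtplib`) is left out.

```python
from collections import Counter
from email.message import EmailMessage


def build_digest(alerts, sender, recipient):
    """Summarize a list of alert names into one email (illustrative helper)."""
    counts = Counter(alerts)
    # One line per alert type, most frequent first.
    lines = [f"{count:>4}  {name}" for name, count in counts.most_common()]
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"Low-priority alert digest: {len(alerts)} alerts"
    message.set_content("\n".join(lines))
    return message
```

A digest like this turns hundreds of individual emails into one message that is easy to archive and to review during sprint planning.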

Conclusion

Mezmo, formerly known as LogDNA, understands the need for different notifications for different types of issues and offers you plenty of flexibility. If you’re not a Mezmo customer yet, you can sign up for a free trial. If you’re already using Mezmo as your trusted centralized log management solution, you can review the documentation on configuring the proper notifications for your organization at any time.

July 26, 2021

@2022 Copyright Mezmo Inc.
