How Prometheus made monitoring alerting better

Prometheus changed monitoring architecture years ago when it introduced a decoupled alert notification engine as a separate program named Alertmanager.

Sebastian Kiljan

Jun 17, 2024

Prometheus changed monitoring architecture years ago when it introduced a decoupled alert notification engine as a separate program named Alertmanager. Before that, most available solutions took a monolithic approach in which all components were deployed together as a single process. When the metrics and notification engines run as a single process, they can quickly become a bottleneck: the configuration is much more complicated, and there is no easy way to lower the volume of notifications when alerts fire.

Separating metrics from alert notifications makes it possible to tailor Prometheus instances to the needs of the organisation and its teams while keeping the number of Alertmanagers low. With only a few Alertmanagers, the volume of notifications can be reduced by grouping alert notifications at the Alertmanager level. This is especially useful for critical systems where on-call duty must be performed by humans, because every extra notification eventually leads to fatigue and exhaustion.

Alertmanager is a component responsible only for sending alert notifications through different channels such as email or instant messaging.

Alertmanager supports a simplified form of high availability that requires only two members. In contrast, most high availability solutions require at least three nodes in a cluster to avoid split brain. The number of members in an Alertmanager cluster is limited in practice and should not exceed three for performance reasons: the gossip protocol that members use to communicate becomes more expensive with every member added to the cluster. In general, Alertmanager is a lightweight process, and there is usually no need to add replicas for performance reasons, even in large Prometheus deployments.

The most challenging issue for a safe Alertmanager deployment is providing hard isolation in multi-tenant environments, because Alertmanager supports only soft isolation. Stock Alertmanager does not provide any hard isolation between tenants, so a different approach is needed: deploying a separate Alertmanager cluster for each tenant, isolated at the network level. Other Prometheus-compatible solutions such as Cortex and Mimir have extended Alertmanager to support hard isolation for multiple tenants.

The main reason stock Alertmanager does not support hard isolation for multi-tenancy is that Prometheus itself does not support it either. To make it work, both Prometheus and Alertmanager would need to support hard multi-tenant isolation, but that never became part of the Prometheus project scope. For years there were requests to introduce it into the Prometheus ecosystem, but it never happened. Fortunately, the community came up with Prometheus-compatible solutions, Cortex and later Mimir, that support hard isolation for multi-tenancy at every level needed. History has shown that the Prometheus developers made the right choice in limiting the scope of the project, because Cortex and Mimir required years of development to mature.

Another important Alertmanager feature is the alert routing tree, a flexible way to cover complex notification scenarios in which a single Alertmanager cluster is shared by multiple teams inside an organisation and notifications are routed through different channels.
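To make the idea of a routing tree more concrete, here is a minimal Python sketch. It is not Alertmanager's implementation: the `Route` class, the `pick_receiver` function and all receiver and label names are hypothetical, and the sketch assumes plain equality matching with the first matching child winning (it ignores options such as continuing to sibling routes).

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """One node of a simplified routing tree (hypothetical model)."""
    receiver: str
    match: dict = field(default_factory=dict)   # label name -> required value
    routes: list = field(default_factory=list)  # child routes, checked in order

    def matches(self, labels: dict) -> bool:
        # A route matches when every required label has the required value.
        return all(labels.get(k) == v for k, v in self.match.items())


def pick_receiver(route: Route, labels: dict) -> str:
    """Return the receiver of the deepest matching route; first match wins."""
    for child in route.routes:
        if child.matches(labels):
            return pick_receiver(child, labels)
    # No child matched, so this node handles the alert.
    return route.receiver


# One shared tree for several teams, each with its own channel.
tree = Route(
    receiver="default-email",
    routes=[
        Route(receiver="db-team-chat", match={"team": "database"}, routes=[
            Route(receiver="db-team-pager", match={"severity": "critical"}),
        ]),
        Route(receiver="web-team-chat", match={"team": "web"}),
    ],
)

print(pick_receiver(tree, {"team": "database", "severity": "critical"}))
print(pick_receiver(tree, {"team": "web", "severity": "critical"}))
print(pick_receiver(tree, {"team": "storage"}))
```

Under these assumptions, a critical database alert reaches the pager, a web alert goes to the web team's chat, and anything that no team claims falls back to the default receiver at the root.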

Out of the box, Alertmanager supports most of the popular vendors for notification channels such as email, SMS and instant messaging.

Alertmanager has several advanced features that can lower the volume of notifications, such as grouping and inhibition rules. Grouping is driven by user-defined rules that collect similar alerts sharing a common set of labels and send them as a single notification. Inhibition rules, in contrast, suppress floods of alert notifications when a disaster happens, such as a broken switch or a datacenter rack becoming unavailable. During a disaster the most important task is locating the failure, so lower-priority notifications that do not point to the root of the problem should be suppressed.
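The effect of grouping and inhibition can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not Alertmanager's code: `group_alerts`, `inhibit`, and the sample labels (`rack`, `host`, `severity`) are hypothetical names chosen for the rack failure scenario described above.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (one notification per bucket)."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def inhibit(alerts, source_match, target_match, equal):
    """Drop target alerts while a source alert with the same `equal` labels is firing."""
    def has(alert, match):
        return all(alert.get(k) == v for k, v in match.items())

    sources = [a for a in alerts if has(a, source_match)]
    kept = []
    for alert in alerts:
        suppressed = has(alert, target_match) and any(
            all(alert.get(name) == src.get(name) for name in equal)
            for src in sources
        )
        if not suppressed:
            kept.append(alert)
    return kept


alerts = [
    {"alertname": "RackDown", "severity": "critical", "rack": "r1"},
    {"alertname": "HostDown", "severity": "warning", "rack": "r1", "host": "a"},
    {"alertname": "HostDown", "severity": "warning", "rack": "r1", "host": "b"},
    {"alertname": "HostDown", "severity": "warning", "rack": "r2", "host": "c"},
]

# Grouping alone: the two HostDown alerts on rack r1 share one notification.
grouped = group_alerts(alerts, ["alertname", "rack"])

# Inhibition first: warnings on a rack that is already reported as down are dropped.
remaining = inhibit(alerts, {"severity": "critical"}, {"severity": "warning"}, ["rack"])
after_inhibition = group_alerts(remaining, ["alertname", "rack"])

print(len(alerts), "alerts,", len(grouped), "groups,",
      len(after_inhibition), "groups after inhibition")
```

In this sketch the host alerts on the failed rack are suppressed because the rack alert already identifies the root of the problem, while the unrelated host alert on the other rack still gets through.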

Sometimes alert notifications need to be muted for a given time interval because of maintenance work or other circumstances. Alertmanager supports silences, created through the web UI or the CLI, whose matchers can use regular expressions so that a single expression covers many alerts.
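As an illustration of how a silence with a regular expression might decide whether an alert is muted, consider the following Python sketch. The `is_silenced` function, the tuple layout of the matchers and the host names are assumptions made for this example; the regular expression is applied to the whole label value, which is how Alertmanager's regex matchers are understood to behave.

```python
import re
from datetime import datetime, timedelta, timezone


def is_silenced(labels, matchers, starts_at, ends_at, now):
    """True if `now` is inside the silence window and every matcher accepts the labels."""
    if not (starts_at <= now < ends_at):
        return False
    for name, pattern, is_regex in matchers:
        value = labels.get(name, "")
        if is_regex:
            # The pattern must match the full label value, not just a part of it.
            if re.fullmatch(pattern, value) is None:
                return False
        elif value != pattern:
            return False
    return True


# A two-hour maintenance window for all production database hosts.
start = datetime(2024, 6, 17, 22, 0, tzinfo=timezone.utc)
end = start + timedelta(hours=2)
matchers = [
    ("env", "prod", False),
    ("instance", r"db-\d+\.example\.com", True),
]
during = start + timedelta(minutes=30)

print(is_silenced({"env": "prod", "instance": "db-3.example.com"}, matchers, start, end, during))
print(is_silenced({"env": "prod", "instance": "web-1.example.com"}, matchers, start, end, during))
```

A single regular expression here replaces what would otherwise be one silence per database host, and the silence stops applying once the window ends.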

Alertmanager can be accessed through a browser-based web UI, a CLI command, or an API. The web UI is used to view active alerts and add new silences. The CLI is mostly used in scripts and other non-interactive contexts.
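For scripted access through the API, a request to create a silence could be prepared roughly as follows. This is a sketch under assumptions: the host name `alertmanager.example` is hypothetical, `build_silence_request` is a helper invented for this example, and the endpoint path and field names follow the Alertmanager v2 HTTP API as commonly documented, so they should be checked against the version you run. The request is only built here, not sent.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence_request(base_url, matchers, hours, created_by, comment, now=None):
    """Prepare (but do not send) a POST request that would create a silence."""
    now = now or datetime.now(timezone.utc)
    payload = {
        "matchers": [
            {"name": name, "value": value, "isRegex": is_regex, "isEqual": True}
            for name, value, is_regex in matchers
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }
    return urllib.request.Request(
        f"{base_url}/api/v2/silences",  # assumed v2 API path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_silence_request(
    "http://alertmanager.example:9093",  # hypothetical address
    [("instance", r"db-\d+\.example\.com", True)],
    hours=2,
    created_by="ops-script",
    comment="planned database maintenance",
    now=datetime(2024, 6, 17, 22, 0, tzinfo=timezone.utc),
)
print(req.get_method(), req.full_url)
# To actually create the silence: urllib.request.urlopen(req)
```

Keeping the request construction separate from sending makes the script easy to dry-run before it touches a live Alertmanager.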

Alertmanager, with its well-designed architecture and flexible options, became so successful that it has been incorporated into several projects such as Cortex, Mimir, and even Grafana.

Alertmanager supports a variety of notification vendors, which makes it usable in almost any environment a user might need.

What makes Alertmanager superior to other alert notification software is its flexible and efficient architecture, which can lower the volume of notifications and even suppress them when needed.