Should operators be a desirable architecture for Kubernetes applications?

Krzysztof Grabowski

Jun 3, 2024

From the beginning, Kubernetes operators were designed to be as flexible as possible, so that they could cover the use cases that built-in Kubernetes objects could not. One of their most important goals was to let users create native Kubernetes applications without the constraints that apply when only built-in objects are used for deployment. Before operators, the only option was to combine existing components such as Deployments, StatefulSets, and other Kubernetes objects into a working application. That approach left no room for advanced logic, and the application could not adjust dynamically to events it generated itself or to changes in its state in the cluster.

Operators offered more than a flexible way to define new objects. They also brought customizable logic, which made it possible to run applications with complex behavior, such as databases or distributed storage, on Kubernetes clusters. The operator architecture lowers the experience human operators need to manage stateful applications reliably inside Kubernetes: it automates the most common maintenance tasks and exposes only the settings that matter most for keeping the application working properly.
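
The combination of a custom object and custom logic can be illustrated with a minimal sketch of a reconcile loop. This is not code from any real operator: the `DesiredState` and `ObservedState` classes and their fields are hypothetical, and the cluster is simulated in memory. It only shows the general idea of repeatedly comparing what the user asked for with what is running and taking one corrective step at a time.

```python
from dataclasses import dataclass, field


@dataclass
class DesiredState:
    """Hypothetical custom object: what the user asks for."""
    replicas: int
    version: str


@dataclass
class ObservedState:
    """Hypothetical, in-memory view of what is currently running."""
    replicas: int = 0
    version: str = ""
    log: list = field(default_factory=list)


def reconcile(desired: DesiredState, observed: ObservedState) -> bool:
    """Apply at most one corrective step; return True when nothing is left to do."""
    if observed.version != desired.version:
        observed.log.append(f"upgrade {observed.version or 'none'} -> {desired.version}")
        observed.version = desired.version
        return False
    if observed.replicas < desired.replicas:
        observed.replicas += 1
        observed.log.append(f"scale up to {observed.replicas}")
        return False
    if observed.replicas > desired.replicas:
        observed.replicas -= 1
        observed.log.append(f"scale down to {observed.replicas}")
        return False
    return True


def run_until_converged(desired: DesiredState, observed: ObservedState, max_steps: int = 100) -> int:
    """Call reconcile repeatedly and return how many corrective steps were needed."""
    for step in range(max_steps):
        if reconcile(desired, observed):
            return step
    raise RuntimeError("did not converge")
```

Calling `run_until_converged(DesiredState(3, "1.2"), ObservedState())` walks the simulated state toward the requested one step at a time. Taking small steps and re-checking after each one is a common design choice, because the loop can then be interrupted and resumed safely.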

Because this architecture brought so much value to users, it quickly became a trend. It grew so popular that many companies and developers decided to write their own operators for existing and new stateful applications. It is important to note that operators were not the best choice for every scenario, and only time could tell which operators would succeed and which would not.

The best-known early adopter of the operator architecture was the Prometheus monitoring software. Running Prometheus inside Kubernetes clusters had always been a challenging task: users had to prepare a complex configuration, which required advanced knowledge of Prometheus itself. The Prometheus operator simplified installation and configuration by exposing only the settings that a Kubernetes environment needs and by setting reasonable defaults for most of the remaining options. It also adds extra Kubernetes objects for defining Prometheus recording and alerting rules, and it supports Alertmanager out of the box as part of the operator bundle. The community quickly adopted the Prometheus operator as the best way to run Prometheus inside Kubernetes clusters. The operator architecture improved the Prometheus user experience on Kubernetes so much that Prometheus, and later its derivatives, became the most popular choice for monitoring in Kubernetes environments.

Prometheus is great software on its own, and it supports Kubernetes through its dynamic inventory mechanism, called service discovery. However, it required users to write a complex configuration to adapt it to a Kubernetes cluster and to provide a common interface that Grafana dashboards and alerting and recording rules could consume. What makes this challenging is the time-consuming work of designing all the components and the interactions between them, which demands detailed knowledge of several different areas before the software can offer reasonable common standards. The Prometheus processing pipeline is the most complex part of the configuration: it operates on service discovery results, but the user has to define it before it can be used.

The Prometheus operator developers prepared a generic configuration that any Kubernetes cluster can reuse, and they also offer a way to extend the defaults with fragments of static configuration when the generic configuration is not enough. The Prometheus operator makes it possible to run multiple independent Prometheus instances in the same Kubernetes cluster without worrying that maintaining them all will be too expensive, as was the case before the operator. It also proposes common naming and configuration standards that can be shared between components, which lets community members independently deliver everything else that is needed, such as Grafana dashboards and Prometheus recording and alerting rules. The work done by the community goes well beyond the operator's initial scope and delivers an enterprise-grade monitoring suite that users can easily run inside Kubernetes clusters and that covers the most common scenarios.
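
The idea of "few exposed settings, reasonable defaults, optional static extras" can be sketched as a small rendering function. This is an illustration only and does not reproduce the Prometheus operator's real object schema or generated configuration: the `DEFAULTS` values, the `overrides` and `selector` keys, and both function names are assumptions made for the example.

```python
import copy

# Hypothetical defaults; not the real Prometheus operator schema or values.
DEFAULTS = {
    "scrape_interval": "30s",
    "metrics_path": "/metrics",
    "scheme": "http",
}


def render_scrape_job(monitor: dict) -> dict:
    """Expand a short, user-facing object into a full scrape job."""
    job = copy.deepcopy(DEFAULTS)
    # Only the settings the user chose to override replace the defaults.
    job.update(monitor.get("overrides", {}))
    job["job_name"] = f'{monitor["namespace"]}/{monitor["name"]}'
    job["selector"] = dict(monitor["selector"])
    return job


def render_config(monitors: list, extra_jobs: list = ()) -> dict:
    """Build the whole configuration: generated jobs first, then user-supplied static ones."""
    ordered = sorted(monitors, key=lambda m: (m["namespace"], m["name"]))
    jobs = [render_scrape_job(m) for m in ordered]
    # Static fragments are appended unchanged, for cases the generic part does not cover.
    jobs.extend(copy.deepcopy(list(extra_jobs)))
    return {"scrape_configs": jobs}
```

A user would describe a target with only a name, a namespace, and a selector, and the function fills in everything else. Sorting the input keeps the rendered configuration stable between runs, which avoids needless reloads when nothing has changed.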

Operators also provide a framework for coordinating communication between the components of a distributed system, so that the components can react to each other based on events. The best example of this approach is Rook, whose role is to run existing distributed storage systems inside Kubernetes clusters. Ceph is one of the supported storage options. Ceph was created before Kubernetes existed, so it had to implement all of its distributed mechanisms itself. It supports block, object, and filesystem storage in a single installation, which makes it a flexible and scalable solution. However, Ceph requires a complicated configuration and multiple components that must be installed in the proper order, and it can require time-consuming maintenance when installed without any automation tool. Ceph provides tooling that improves installation and maintenance, but running and maintaining a healthy Ceph cluster still requires a lot of knowledge about its internal operation.

Because of this complexity, running Ceph in Kubernetes was not a feasible option before the Rook operator was introduced. Rook simplified installation and maintenance by reusing Kubernetes resources and mapping them to the resources that a Ceph cluster uses. During installation, Rook can detect whether all required components were bootstrapped correctly; if they were not, it tries to recover where possible and otherwise helps the human operator resolve the problem. Rook also makes Ceph easier to operate by abstracting Ceph objects and presenting them as Kubernetes objects with reasonable default values, which limits the number of options needed to run a Ceph cluster. The operator also simplified upgrades: it runs extensive checks before, during, and after an upgrade to reduce the risk that a failed upgrade leaves the Ceph cluster inoperable. There are also cases in which Rook detects a problem but, because of its complexity or the risk of data loss, only reports it to the human operator and waits for intervention.
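
The "check before, during, and after, and hand over to a human when unsure" pattern can be sketched as a gated, step-by-step upgrade. This is not Rook's implementation: `gated_upgrade`, its callbacks, and the status strings are hypothetical names chosen for the example, and the health check is whatever function the caller supplies.

```python
from typing import Callable, List, Tuple


def gated_upgrade(
    components: List[str],
    is_healthy: Callable[[], bool],
    upgrade_one: Callable[[str], None],
) -> Tuple[str, List[str]]:
    """Upgrade components one at a time, checking health before and after each step.

    Returns (status, upgraded), where status is "done" or "needs-human".
    """
    upgraded: List[str] = []
    # Never start an upgrade on a cluster that is already unhealthy.
    if not is_healthy():
        return "needs-human", upgraded
    for name in components:
        upgrade_one(name)
        upgraded.append(name)
        # Stop at the first sign of trouble instead of pushing on and risking data.
        if not is_healthy():
            return "needs-human", upgraded
    return "done", upgraded
```

The function returns the list of components it has already touched, so that a human operator who takes over knows exactly where the process stopped.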

In general, stateful workloads that run in Kubernetes without an operator are not reliable enough for production-grade environments. The most prominent stateful workloads are databases of various kinds. Most production-grade databases require high availability, multiple replicas, regular backups, and recovery when disaster strikes. Running a database without an operator involves several extra, time-consuming steps during installation and later during maintenance. An operator simplifies the process by adding and removing replicas safely and automatically, switching traffic between database nodes, offering simple ways to define backups and restore them after a disaster, and much more.

Some database operators can compete with cloud vendors' managed offerings of the same database product. The main advantage of database operators is that they are more responsive and generally apply changes faster, because they work with lightweight containers rather than virtual machines. Cloud providers charge an extra fee for the virtual machines that run managed databases, and those machines must be of a specific type. With operators this is not the case: any type of virtual machine can be used as long as the database supports it, and it costs the same as a standard virtual machine.
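
As one illustration of what "switching traffic between database nodes in a safe way" can mean, the sketch below picks a replica to promote when the primary is lost. It is an assumption-laden example, not the logic of any particular database operator: the `Replica` fields, the `max_lag_bytes` threshold, and the rule of preferring the least-lagged healthy replica are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    healthy: bool
    lag_bytes: int  # how far the replica is behind the primary


def pick_new_primary(replicas: List[Replica], max_lag_bytes: int = 0) -> Optional[Replica]:
    """Return the healthy replica with the least lag, or None if none is safe to promote."""
    candidates = [r for r in replicas if r.healthy and r.lag_bytes <= max_lag_bytes]
    if not candidates:
        # Nothing qualifies: leave the decision to a human rather than risk losing data.
        return None
    # Ties are broken by name so that the choice is deterministic.
    return min(candidates, key=lambda r: (r.lag_bytes, r.name))
```

Returning `None` instead of promoting the "least bad" replica reflects a cautious design choice: when no candidate is within the allowed lag, an automated promotion could silently drop recent writes.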

A production-grade Kubernetes operator can take years of maturation and proper testing procedures. It is important to note that any Kubernetes operator should be checked and tested to confirm that it is mature and reliable enough before it is used in a production-grade environment.

Kubernetes operators became successful because they lowered the barrier to entry for human operators and automated the installation and maintenance process. Even complex stateful and distributed applications are exposed as Kubernetes objects, with only a reasonable number of options that the user needs to set during installation.