Could stateful applications be safely deployed to Kubernetes clusters?

In the past, running a stateful application on a Kubernetes cluster was not recommended at all for production-grade environments.

Sebastian Kiljan

Sep 16, 2024

4 min.

In the past, running a stateful application on a Kubernetes cluster was not recommended at all for production-grade environments. Running stateful applications in Kubernetes could lead to several issues, from application unavailability to, in some cases, partial or even full data loss. By contrast, in environments where occasional unavailability, data loss, and manual recovery by a cluster operator are acceptable, such as development, testing, and staging, running stateful applications on Kubernetes became quite common. It also became acceptable to run stateful applications in a hybrid model, where development, testing, and staging instances run on a Kubernetes cluster while production-grade applications run as a cloud service or on dedicated virtual or physical servers.

There are several reasons why, in non-production environments, running stateful applications on Kubernetes is much more beneficial than running them as a cloud provider service or on a dedicated server. The most important factor is cost: non-production environments usually require far fewer resources than production ones, and stateless and stateful applications can share the same resources for additional savings. Non-production work also often requires multiple replicas of the same environment, in the same or different versions, operated by different teams or even individual team members; without resource sharing this quickly becomes expensive. Kubernetes excels when resources need to be shared and offers an out-of-the-box, safe way to run multi-tenant workloads in the same cluster. It also provides a fully automated way to quickly scale cluster resources up when they are needed and down when they are released. Cloud services, which are great for production workloads because they are reliable and performant, are quite slow to provision and remove compared to their Kubernetes equivalents. Cloud vendors also charge extra fees for managed stateful services, which cost much more than the same raw resources without them.

Today it is possible to run stateful applications in a Kubernetes cluster safely, even for production-grade environments. To make this happen, changes were needed in the way Kubernetes manages stateful applications, and extra support is required on the application side so that it is ready to run in a Kubernetes cluster. When all pieces are in place, databases and other stateful applications can run safely and reliably, handling failures when they operate in high-availability mode.

In the past, the only way to run stateful applications on Kubernetes was the built-in StatefulSet object. Unfortunately, deploying a production-grade stateful application requires much more complexity and logic to run safely on a distributed Kubernetes cluster. The Kubernetes cluster is not aware of the application's internal state and cannot detect an application cluster partition or the addition of new replicas. When an application breaks for any reason, the cluster will not do anything beyond restarting the application pod or cutting off network traffic, and then waits for a human operator to act.
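
As an illustration, a minimal StatefulSet gives each replica a stable identity and its own persistent volume, but it encodes no application-specific logic such as cluster membership or failover. The names, image, and sizes below are purely illustrative:

```yaml
# Illustrative StatefulSet: stable pod names and per-replica storage,
# but no knowledge of the application's internal clustering or failover.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-db
spec:
  serviceName: example-db          # assumes a headless Service with this name exists
  replicas: 3
  selector:
    matchLabels:
      app: example-db
  template:
    metadata:
      labels:
        app: example-db
    spec:
      containers:
        - name: db
          image: example/db:1.0    # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/db
  volumeClaimTemplates:            # each replica gets its own PersistentVolumeClaim
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```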

Production-grade workloads need to run in high-availability mode: when a failure occurs on one of the nodes, it must be detected and traffic must be redirected to another healthy member of the application cluster.

For years, cloud providers have successfully offered stateful application services that run reliably, scale well, and support common administrative and maintenance tasks such as backup and restore, managing multiple replicas, and high availability. The Kubernetes community has spent a lot of time and effort to provide similar functionality natively in the cluster, without depending on cloud providers.

In the end, the Kubernetes community extended the cluster API with support for customizable objects that can represent any kind of application (including stateful ones) and are exposed as native Kubernetes objects. In addition, programmable services are needed that implement the logic the application requires, triggered by application state or by generated events.

Custom Resource Definitions (CRDs) and Custom Resources (CRs) are the structures that represent an application definition and an application instance inside the Kubernetes cluster. These structures are customizable and provide basic validation defined by the developers responsible for making the application run on Kubernetes. Instance objects are treated by Kubernetes just like built-in objects such as Deployment or StatefulSet, with the differences supplied by the user. The operator architecture requires a specialized service, the operator, provided by the application vendor; it follows the Kubernetes requirements for code structure and contains custom logic that replicates the actions of a human operator in an automated way. What an operator can offer is mostly limited by how much development time the application vendor has invested in it. In most cases writing an operator is quite expensive, because a reliable and safe operator takes years of continuous development and testing.
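
A rough sketch of what this looks like in practice: a CRD teaches the API server a new kind (here a hypothetical DatabaseCluster), and a CR then declares one concrete instance that the vendor's operator reconciles. All names and fields below are invented for illustration and do not match any specific operator:

```yaml
# Hypothetical CRD: registers a new "DatabaseCluster" kind with the API server.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databaseclusters.example.com
spec:
  group: example.com
  names:
    kind: DatabaseCluster
    plural: databaseclusters
    singular: databasecluster
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:           # basic validation supplied by the developer
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                version:
                  type: string
---
# Hypothetical CR: one application instance; the vendor's operator watches
# objects of this kind and performs the automated administrative work.
apiVersion: example.com/v1
kind: DatabaseCluster
metadata:
  name: my-database
spec:
  replicas: 3
  version: "16.1"
```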

It is important to note that operators for complex stateful applications such as databases are particularly demanding, because every action they take is expected to avoid data corruption or data loss.

Because operators are so challenging and time-consuming for vendors to develop and maintain, the quality and functionality of different operators for the same application vary from vendor to vendor. The most important task for end users is therefore to pick the operator that best fulfils their requirements. The Kubernetes community does not impose any rules or guidelines on what an operator should deliver and how, because operators are entities external to the cluster. Many users find it hard to select a production-grade operator: for many applications the choice is limited to a single vendor, and functionality and maturity differ even when multiple vendors are available.

It is important to note that any operator intended for production-grade workloads needs to be checked and tested, from basic functionality such as installation and upgrade to more advanced scenarios such as adding new replicas or switchover and failover, including the backup and restore process. The backup and restore process is specific to each vendor's operator and needs to be verified carefully before any important workload is provisioned.
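
Many operators, for example, expose backup as its own custom resource that the operator reconciles; the exact shape is entirely vendor-specific, so the following is only an invented sketch of the general pattern:

```yaml
# Hypothetical, vendor-specific backup request handled by the operator.
apiVersion: example.com/v1
kind: DatabaseBackup
metadata:
  name: my-database-nightly
spec:
  clusterRef: my-database                   # the DatabaseCluster instance to back up
  destination: s3://backups/my-database     # illustrative storage target
  schedule: "0 2 * * *"                     # illustrative cron schedule
```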

Kubernetes operators make it possible to run stateful applications safely in production workloads on a Kubernetes cluster. They simplify the user experience of provisioning and maintaining stateful applications, and they automate error handling for many possible failures.

© 2024 QualityMinds, All rights reserved
