The case for running data apps on Kubernetes

Kubernetes is the de facto standard today for cloud-native development. For a long time, Kubernetes was mostly associated with stateless workloads such as web and batch applications. However, Kubernetes is constantly evolving, and these days we are seeing an exponential increase in the number of stateful apps running on it. In fact, the number of clusters running stateful apps on Google Kubernetes Engine (GKE) has doubled every year since 2019.

Today, Kubernetes is increasingly used to run stateful and data applications such as databases and streaming platforms (MySQL, PostgreSQL, MongoDB, and Kafka), big data (Hadoop and Spark), data analytics (Hive and Pig), and machine learning (TensorFlow and PyTorch). Modern data engineering tools such as Airbyte, vector databases such as Qdrant and Weaviate, and feature stores such as Feast use containers and Kubernetes as their default self-managed compute deployment option.

Meanwhile, Kubernetes platform engineers are becoming more conversant with these data tools, while data engineers are familiarizing themselves with Kubernetes. We covered this in the 2022 Data on Kubernetes (DoK) report, in which customers described a 3x increase in productivity from running data applications on Kubernetes. Additionally, over 41% of respondents said they plan to reskill or hire for data-on-Kubernetes talent. The push to run data workloads on Kubernetes is only going to grow.