Natalie Pistunovich

Lead Developer Advocate @Aerospike

Bio

Natalie is a learner, a Google Developer Expert for Go, an OpenAI Developer Ambassador, a public speaker and a sailor. When she's not working on robust systems at Aerospike, she organizes the conferences GopherCon Europe, Cloud Nein, HeyAI and BSides Berlin, as well as the Berlin chapters of the Go and Women Techmakers user groups. Previously, she was an Engineering Manager, a Software and Hardware Engineer, and a Co-Founder of a mobile start-up. In her free time, she wonders whether there is life on Mars.

Abstract

Building a K8s Operator for a Distributed Database

How did we build a K8s operator that allows 100% uptime for a high-availability, high-workload database? Operating a distributed, high-load, high-throughput database in the cloud comes with several interesting challenges. To serve mission-critical workloads in real time at 100% availability, we developed a Kubernetes operator that handles the operational complexities.

We needed to handle the following requirements:

Apply live patches
Replace a live cluster with tens of nodes
Handle degraded/crashed nodes

Under these conditions:

High availability: remain 100% online with no downtime
Operate under very high workloads and traffic
Manage replicated records across different hardware failure groups (rack awareness)

Because the database is stateful and typically serves demanding workloads, cluster management and recovery are non-trivial. We use the Operators API to handle that complexity and control the clusters from within Kubernetes. In this talk we’ll cover the steps we took to plan and execute the project, the challenges we faced, and the best practices we learned along the way.
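
To make the rack-awareness requirement a little more concrete, here is a minimal Go sketch of one way an operator might order a rolling restart: one node at a time, finishing one rack before starting the next, so that replicas kept in other racks stay online. It is an illustration only and not the operator described in the talk. The Node type, the rollingOrder function and the node and rack names are all hypothetical, and the sketch assumes every record has a replica in at least one other rack.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified view of one database pod.
type Node struct {
	Name string
	Rack string // hardware failure group the node belongs to
}

// rollingOrder returns node names in the order they would be restarted:
// grouped by rack, then by name. Restarting one node at a time in this
// order means that only a single rack is ever being disturbed.
// Assumption: each record has a replica in at least one other rack.
func rollingOrder(nodes []Node) []string {
	// Sort a copy so the caller's slice is left untouched.
	sorted := make([]Node, len(nodes))
	copy(sorted, nodes)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Rack != sorted[j].Rack {
			return sorted[i].Rack < sorted[j].Rack
		}
		return sorted[i].Name < sorted[j].Name
	})

	order := make([]string, 0, len(sorted))
	for _, n := range sorted {
		order = append(order, n.Name)
	}
	return order
}

func main() {
	nodes := []Node{
		{Name: "db-2", Rack: "rack-b"},
		{Name: "db-0", Rack: "rack-a"},
		{Name: "db-3", Rack: "rack-b"},
		{Name: "db-1", Rack: "rack-a"},
	}
	for _, name := range rollingOrder(nodes) {
		// A real operator would restart the pod here and wait until the
		// node has rejoined the cluster before moving on.
		fmt.Println("restart", name)
	}
}
```

A real operator would also have to wait for data migrations to settle between steps and handle nodes that fail to come back, which this sketch leaves out.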