The disastrous fire at OVHcloud data centers this March hit our monitoring system hard. Here is how it challenged us and what we did to keep everything working smoothly.
Here is another failure experience from our SREs that is worth sharing. It involves the migration of an Elasticsearch cluster from one storage to another inside a Kubernetes cluster.
We're starting a special series of articles dedicated to our… failures and the lessons we've learned from them. This story happened with a ClickHouse + ZooKeeper setup and was caused by miscommunication.
This article reviews existing tools for implementing chaos engineering in K8s, including kube-monkey, chaoskube, Chaos Mesh, Litmus Chaos, Chaos Toolkit, some games, and more.
When something prevents web applications from functioning properly, you have to investigate at all levels: your infrastructure (K8s-based in our case), third-party services, or the code itself.