Our internal production infrastructure includes a not-too-critical part that we use for testing various technical solutions, including different Rook versions for stateful applications. At the time of the events described in this article, this part of the infrastructure was running Kubernetes 1.15, and we decided to upgrade it.
The Rook operator v0.9 provisioned persistent volumes in the cluster. What made matters worse was that the Helm release of this old operator contained resources with deprecated API versions, holding us back from upgrading the cluster. We didn’t want to upgrade Rook in the running cluster, so we decided to “dismantle” it manually.
Caution! This is a failure story: do not repeat the steps described below in production without reading carefully to the very end!
Well, for a few hours, we were successfully moving data to storage with StorageClasses not managed by Rook…
Migrating Elasticsearch data “without” downtime
… and then it was the turn of the three-node Elasticsearch cluster deployed in Kubernetes:
~ $ kubectl -n kibana-production get po | grep elasticsearch
elasticsearch-0 1/1 Running 0 77d2h
elasticsearch-1 1/1 Running 0 77d2h
elasticsearch-2 1/1 Running 0 77d2h
We decided to move it to new PVs without downtime. We had thoroughly verified the ConfigMap configuration and did not expect any surprises. True, our migration plan involved a few potentially dangerous twists that could lead to an incident if some of the K8s cluster nodes became unreachable… But those nodes were running fine, and I had done this a zillion times before, hadn’t I? So let’s dive in!
Here are our steps to reach the goal.
1. Make changes to the StatefulSet in the Elasticsearch Helm chart (es-data-statefulset.yaml):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    component: {{ template "fullname" . }}
    role: data
  name: {{ template "fullname" . }}
spec:
  serviceName: {{ template "fullname" . }}-data
  …
  volumeClaimTemplates:
  - metadata:
      name: data
      annotations:
        volume.beta.kubernetes.io/storage-class: "high-speed"
Note that the last line has the high-speed value, which was rbd before.
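Before relying on that annotation, it is worth making sure that the target StorageClass actually exists and is backed by a provisioner other than Rook; a quick sanity check looks like this:
~ $ kubectl get storageclass
~ $ kubectl describe storageclass high-speed | grep -i provisioner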
2. Delete the existing StatefulSet (do not forget to supply the --cascade=false parameter). This is one of the potentially dangerous twists, since the StatefulSet no longer controls the number of ES pods: if a K8s node running an ES pod suddenly fails, that pod will not be restarted automatically. Still, the non-cascading deletion of the StatefulSet and its subsequent redeployment with new parameters takes only a few seconds, so the risks are relatively low (obviously, they depend on the specific environment).
Let’s do it:
$ kubectl -n kibana-production delete sts elasticsearch --cascade=false
statefulset.apps "elasticsearch" deleted
3. Re-deploy Elasticsearch and scale the StatefulSet to 6 replicas.
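The redeploy command itself isn’t shown here; assuming the chart lives in a local elasticsearch/ directory and the Helm release is also named elasticsearch (both names are assumptions for illustration), it would look something like this:
~ $ helm upgrade --install elasticsearch ./elasticsearch --namespace kibana-production
Then scale the StatefulSet: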
~ $ kubectl -n kibana-production scale sts elasticsearch --replicas=6
statefulset.apps/elasticsearch scaled
… and check the result:
~ $ kubectl -n kibana-production get po | grep elasticsearch
elasticsearch-0 1/1 Running 0 77d2h
elasticsearch-1 1/1 Running 0 77d2h
elasticsearch-2 1/1 Running 0 77d2h
elasticsearch-3 1/1 Running 0 11m
elasticsearch-4 1/1 Running 0 10m
elasticsearch-5 1/1 Running 0 10m
~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.33.142 8 98 49 7.89 4.86 3.45 dim - elasticsearch-4
10.244.33.118 26 98 35 7.89 4.86 3.45 dim - elasticsearch-2
10.244.33.140 8 98 60 7.89 4.86 3.45 dim - elasticsearch-3
10.244.21.71 8 93 58 8.53 6.25 4.39 dim - elasticsearch-5
10.244.33.120 23 98 33 7.89 4.86 3.45 dim - elasticsearch-0
10.244.33.119 8 98 34 7.89 4.86 3.45 dim * elasticsearch-1
Here is what our data storage looks like:
~ $ kubectl -n kibana-production get pvc | grep elasticsearch
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-elasticsearch-0 Bound pvc-a830fb81-... 12Gi RWO rbd 77d
data-elasticsearch-1 Bound pvc-02de4333-... 12Gi RWO rbd 77d
data-elasticsearch-2 Bound pvc-6ed66ff0-... 12Gi RWO rbd 77d
data-elasticsearch-3 Bound pvc-74f3b9b8-... 12Gi RWO high-speed 12m
data-elasticsearch-4 Bound pvc-16cfd735-... 12Gi RWO high-speed 12m
data-elasticsearch-5 Bound pvc-0fb9dbd4-... 12Gi RWO high-speed 12m
Great!
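At this point, it also makes sense to confirm that the cluster is healthy before shuffling data around; the standard health endpoint (same credentials as above) gives a quick overview, and a green status with six data nodes is what we want to see:
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk 'https://localhost:9200/_cluster/health?pretty'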
4. Speed up the data transfer.
If you are bored and irresistibly drawn to adventures (and your data is not that important), you can speed up the process by keeping just a single copy of each index (zero replicas):
~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -H "Content-Type: application/json" -X PUT -sk https://localhost:9200/my-index-pattern-*/_settings -d '{"number_of_replicas": 0}'
{"acknowledged":true}
… but that isn’t our way, of course:
~ $ ^C
~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -H "Content-Type: application/json" -X PUT -sk https://localhost:9200/my-index-pattern-*/_settings -d '{"number_of_replicas": 2}'
{"acknowledged":true}
That is because, with no replicas, the loss of a pod leads to data inconsistency until it is restored, and an error-induced loss of a PV leads to outright data loss.
Let’s increase the rebalancing limits:
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -XPUT -H 'Content-Type: application/json' -sk https://localhost:9200/_cluster/settings?pretty -d '{
> "transient" :{
> "cluster.routing.allocation.cluster_concurrent_rebalance" : 20,
> "cluster.routing.allocation.node_concurrent_recoveries" : 20,
> "cluster.routing.allocation.node_concurrent_incoming_recoveries" : 10,
> "cluster.routing.allocation.node_concurrent_outgoing_recoveries" : 10,
> "indices.recovery.max_bytes_per_sec" : "200mb"
> }
> }'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "10",
          "cluster_concurrent_rebalance" : "20",
          "node_concurrent_recoveries" : "20",
          "node_concurrent_outgoing_recoveries" : "10"
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "200mb"
      }
    }
  }
}
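While shards are being moved around, the progress can be watched via the _cat APIs; for example (run from any ES pod, same credentials as above):
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk 'https://localhost:9200/_cat/recovery?v&active_only=true'
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk 'https://localhost:9200/_cat/health?v'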
5. Evict shards from three old ES nodes:
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -XPUT -H 'Content-Type: application/json' -sk https://localhost:9200/_cluster/settings?pretty -d '{
> "transient" :{
> "cluster.routing.allocation.exclude._ip" : "10.244.33.120,10.244.33.119,10.244.33.118"
> }
> }'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "10.244.33.120,10.244.33.119,10.244.33.118"
          }
        }
      }
    }
  }
}
Soon there will be no data on them:
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/shards | grep 'elasticsearch-[0-2]' | wc -l
0
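Another way to watch the drain (rather than just counting shards) is the allocation view, which shows the shard count and disk usage per node:
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk 'https://localhost:9200/_cat/allocation?v'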
6. We are ready to kill old ES nodes one by one.
Prepare three PersistentVolumeClaims of the following type:
~ $ cat pvc2.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-elasticsearch-2
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 12Gi
  storageClassName: "high-speed"
Delete the PVCs and pods related to replicas 0, 1, and 2, one at a time. After each deletion, manually create the new PVC and make sure that the ES instance in the new pod created by the StatefulSet successfully joins the ES cluster:
~ $ kubectl -n kibana-production delete pvc data-elasticsearch-2
persistentvolumeclaim "data-elasticsearch-2" deleted
^C
~ $ kubectl -n kibana-production delete po elasticsearch-2
pod "elasticsearch-2" deleted
~ $ kubectl -n kibana-production apply -f pvc2.yaml
persistentvolumeclaim/data-elasticsearch-2 created
~ $ kubectl -n kibana-production get po | grep elasticsearch
elasticsearch-0 1/1 Running 0 77d3h
elasticsearch-1 1/1 Running 0 77d3h
elasticsearch-2 1/1 Running 0 67s
elasticsearch-3 1/1 Running 0 42m
elasticsearch-4 1/1 Running 0 41m
elasticsearch-5 1/1 Running 0 41m
~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.21.71 21 97 38 3.61 4.11 3.47 dim - elasticsearch-5
10.244.33.120 17 98 99 8.11 9.26 9.52 dim - elasticsearch-0
10.244.33.140 20 97 38 3.61 4.11 3.47 dim - elasticsearch-3
10.244.33.119 12 97 38 3.61 4.11 3.47 dim * elasticsearch-1
10.244.34.142 20 97 38 3.61 4.11 3.47 dim - elasticsearch-4
10.244.33.89 17 97 38 3.61 4.11 3.47 dim - elasticsearch-2
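It also doesn’t hurt to check that the re-created pod has actually claimed a volume from the new StorageClass before moving on:
~ $ kubectl -n kibana-production get pvc data-elasticsearch-2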
Finally, it is ES node 0’s turn: delete the elasticsearch-0 pod and wait until it restarts and claims its PV with the new StorageClass. Here is the result:
~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.33.151 17 98 99 8.11 9.26 9.52 dim * elasticsearch-0
At the same time, another ES pod sees the following nodes:
~ $ kubectl -n kibana-production exec -ti elasticsearch-1 bash
[user@elasticsearch-1 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.21.71 16 97 27 2.59 2.76 2.57 dim - elasticsearch-5
10.244.33.140 20 97 38 2.59 2.76 2.57 dim - elasticsearch-3
10.244.33.35 12 97 38 2.59 2.76 2.57 dim - elasticsearch-1
10.244.34.142 20 97 38 2.59 2.76 2.57 dim - elasticsearch-4
10.244.33.89 17 97 98 7.20 7.53 7.51 dim * elasticsearch-2
Congratulations: we’ve got a split-brain in production! And new data is being randomly written to two separate ES clusters! (Well, luckily, it was not a real production environment in our case.)
Downtime and data loss
In the previous section, we abruptly switched from planned maintenance to recovery work. Before anything else, we had to stop the flow of data into the empty, “incomplete” ES cluster consisting of a single node.
What if we just remove a label from the elasticsearch-0 pod? This way, it would be excluded from load balancing at the Service level. Unfortunately, once the pod is excluded, you cannot return it to the ES cluster since cluster members are discovered via the same Service during the cluster formation.
The following environment variable is responsible for this:
env:
- name: DISCOVERY_SERVICE
  value: elasticsearch
And here is how it is used in the elasticsearch.yaml ConfigMap (you can learn more in the documentation):
discovery:
  zen:
    ping.unicast.hosts: ${DISCOVERY_SERVICE}
Well, that isn’t our way either… A better approach is to immediately stop the workers that write data to the ES cluster in real time. To do so, we scale all three of their Deployments down to zero. (Fortunately, the application follows a microservice architecture, so we do not have to stop the entire service.)
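In kubectl terms, that is just scaling the writer Deployments to zero; the Deployment names below are hypothetical and only illustrate the idea:
# hypothetical Deployment names; substitute the real writers in your environment
~ $ kubectl -n kibana-production scale deployment log-writer event-ingester metrics-indexer --replicas=0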
Well, downtime during the day is probably better than ever-increasing data loss. Now, let’s find out the reasons for our incident and get the result we want.
Causes of the incident and recovery
So what happened here? Why didn’t node 0 join the cluster? Let’s check the configs once again… Nope, they seem fine.
Now let’s examine the Helm charts… And here it is! The problem is hidden in es-data-statefulset.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    component: {{ template "fullname" . }}
    role: data
  name: {{ template "fullname" . }}
…
      containers:
      - name: elasticsearch
        env:
        {{- range $key, $value := .Values.data.env }}
        - name: {{ $key }}
          value: {{ $value | quote }}
        {{- end }}
        - name: cluster.initial_master_nodes      # !!!!!!
          value: "{{ template "fullname" . }}-0"  # !!!!!!
        - name: CLUSTER_NAME
          value: myesdb
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: DISCOVERY_SERVICE
          value: elasticsearch
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: ES_JAVA_OPTS
          value: "-Xms{{ .Values.data.heapMemory }} -Xmx{{ .Values.data.heapMemory }} -Xlog:disable -Xlog:all=warning:stderr:utctime,level,tags -Xlog:gc=debug:stderr:utctime -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.host=127.0.0.1 -Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote.port=9099 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
...
Why is the cluster.initial_master_nodes variable defined that way? The thing is that when the ES cluster starts for the very first time, this setting determines the set of master-eligible nodes used for bootstrapping (node 0 only, in our case). So when the elasticsearch-0 pod started with an empty PV, a brand-new cluster bootstrapping process began, and the existing master in the elasticsearch-2 pod was simply ignored.
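A quick way to see the split from the inside is to ask each half who it considers the master; the answers from the two halves will differ (a diagnostic sketch using the same credentials as above):
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk 'https://localhost:9200/_cat/master?v'
[user@elasticsearch-1 elasticsearch]$ curl --user admin:********** -sk 'https://localhost:9200/_cat/master?v'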
OK, let’s edit ConfigMap:
~ $ kubectl -n kibana-production edit cm elasticsearch
apiVersion: v1
data:
  elasticsearch.yml: |-
    cluster:
      name: ${CLUSTER_NAME}
      initial_master_nodes:
        - elasticsearch-0
        - elasticsearch-1
        - elasticsearch-2
...
… and remove the environment variable in question from StatefulSet:
~ $ kubectl -n kibana-production edit sts elasticsearch
...
- env:
  - name: cluster.initial_master_nodes
    value: "elasticsearch-0"
...
The StatefulSet starts updating all pods sequentially according to the RollingUpdate strategy. Of course, it does so in reverse order, that is, from the 5th pod to the 0th:
~ $ kubectl -n kibana-production get po
NAME READY STATUS RESTARTS AGE
elasticsearch-0 1/1 Running 0 11m
elasticsearch-1 1/1 Running 0 13m
elasticsearch-2 1/1 Running 0 15m
elasticsearch-3 1/1 Running 0 67m
elasticsearch-4 1/1 Running 0 67m
elasticsearch-5 0/1 Terminating 0 67m
What will happen when the rolling update is over? Will the cluster bootstrapping process run fine? After all, the rolling update of a StatefulSet is swift… Will the elections succeed under such conditions, given that the documentation states that «auto-bootstrapping is inherently unsafe»? What if we end up with a cluster bootstrapped from node 0 that contains only a tiny part of the index? Those were the thoughts that plagued my mind during the process.
Flash forward: no, everything turns out fine under the given conditions. However, I was not 100% sure at the time. Just imagine this happening in production with a lot of business-critical data… So creepy! And you end up messing around with backups.
Therefore, while the rolling update is running, let’s save and kill the service responsible for discovery:
~ $ kubectl -n kibana-production get svc elasticsearch -o yaml > elasticsearch.yaml
~ $ kubectl -n kibana-production delete svc elasticsearch
service "elasticsearch" deleted
… and delete PVC for the pod 0:
~ $ kubectl -n kibana-production delete pvc data-elasticsearch-0
persistentvolumeclaim "data-elasticsearch-0" deleted
^C
Now that the rolling update is over, elasticsearch-0 is Pending due to the missing PVC, and the cluster is fragmented (the ES nodes have lost each other):
~ $ kubectl -n kibana-production exec -ti elasticsearch-1 bash
[user@elasticsearch-1 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
Open Distro Security not initialized.
Let’s edit ConfigMap as follows (just in case):
~ $ kubectl -n kibana-production edit cm elasticsearch
apiVersion: v1
data:
  elasticsearch.yml: |-
    cluster:
      name: ${CLUSTER_NAME}
      initial_master_nodes:
        - elasticsearch-3
        - elasticsearch-4
        - elasticsearch-5
...
Then let’s create an empty PV for elasticsearch-0 by creating the appropriate PVC:
$ kubectl -n kibana-production apply -f pvc0.yaml
persistentvolumeclaim/data-elasticsearch-0 created
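The pvc0.yaml manifest is not shown in the transcript, but it presumably mirrors the pvc2.yaml above with only the name changed:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-elasticsearch-0
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 12Gi
  storageClassName: "high-speed"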
And restart the nodes to apply ConfigMap changes:
~ $ kubectl -n kibana-production delete po elasticsearch-0 elasticsearch-1 elasticsearch-2 elasticsearch-3 elasticsearch-4 elasticsearch-5
pod "elasticsearch-0" deleted
pod "elasticsearch-1" deleted
pod "elasticsearch-2" deleted
pod "elasticsearch-3" deleted
pod "elasticsearch-4" deleted
pod "elasticsearch-5" deleted
Finally, you can start the service using the YAML manifest we saved above:
~ $ kubectl -n kibana-production apply -f elasticsearch.yaml
service/elasticsearch created
Let’s see what we’ve got:
~ $ kubectl -n kibana-production exec -ti elasticsearch-0 bash
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/nodes
10.244.98.100 11 98 32 4.95 3.32 2.87 dim - elasticsearch-0
10.244.101.157 12 97 26 3.15 3.00 2.10 dim - elasticsearch-3
10.244.107.179 10 97 38 1.66 2.46 2.52 dim * elasticsearch-1
10.244.107.180 6 97 38 1.66 2.46 2.52 dim - elasticsearch-2
10.244.100.94 9 92 36 2.23 2.03 1.94 dim - elasticsearch-5
10.244.97.25 8 98 42 4.46 4.92 3.79 dim - elasticsearch-4
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -sk https://localhost:9200/_cat/indices | grep -v green | wc -l
0
Hooray! The elections went smoothly, the cluster runs as expected, indexes are in place.
Now you just have to:
- Return the original initial_master_nodes values to the ConfigMap;
- Restart all pods once again;
- Move all shards back to nodes 0, 1, 2 and scale the cluster down from 6 to 3 nodes, similarly to the step described at the beginning of the article (see the sketch below);
- Commit all manual changes to the repository.
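For the scale-down, a rough sketch would be to exclude nodes 3-5 from allocation, wait for their shards to drain, shrink the StatefulSet, and clean up the leftover PVCs (the IPs are placeholders; take the real ones from _cat/nodes):
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -H 'Content-Type: application/json' -X PUT -sk https://localhost:9200/_cluster/settings -d '{"transient": {"cluster.routing.allocation.exclude._ip": "<ip-of-es-3>,<ip-of-es-4>,<ip-of-es-5>"}}'
# once _cat/shards shows nothing left on elasticsearch-3..5:
~ $ kubectl -n kibana-production scale sts elasticsearch --replicas=3
~ $ kubectl -n kibana-production delete pvc data-elasticsearch-3 data-elasticsearch-4 data-elasticsearch-5
# and clear the exclusion afterwards:
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -H 'Content-Type: application/json' -X PUT -sk https://localhost:9200/_cluster/settings -d '{"transient": {"cluster.routing.allocation.exclude._ip": null}}'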
Conclusion
What lessons can be learned from our case?
When migrating data in production, you always have to keep in mind that something might go wrong: an error in the configuration of a service or an application, a sudden data center incident, a loss of network connectivity, and so on. Therefore, before starting the migration, you have to take measures to prevent an incident or at least minimize its consequences, and you must have a plan B prepared and ready beforehand.
The procedure we used here is vulnerable to sudden and unexpected problems. Before starting such a migration in a more important environment, you need to:
- Perform the migration in a testing environment with the same configuration as that of the production ES cluster.
- Schedule a service downtime. Or switch the load to another cluster temporarily. (The exact method depends on the availability requirements.) As for the approach that involves downtime, you should first stop the workers writing data to Elasticsearch, take a fresh backup, and start transferring the data to the new storage.
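As for the “fresh backup” part, Elasticsearch’s snapshot API is the natural tool. A minimal sketch, assuming a filesystem snapshot repository whose path is already whitelisted in path.repo on every node (the repository name and location are assumptions):
# register the repository
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -H 'Content-Type: application/json' -X PUT -sk https://localhost:9200/_snapshot/migration_backup -d '{"type": "fs", "settings": {"location": "/backups/es"}}'
# snapshot all indices and wait for completion
[user@elasticsearch-0 elasticsearch]$ curl --user admin:********** -X PUT -sk 'https://localhost:9200/_snapshot/migration_backup/pre-migration?wait_for_completion=true'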