Migrating etcd between cloud Kubernetes clusters with no downtime

Have you ever needed to move etcd storage from one K8s cluster to another? Most likely, simply turning it off is not an option, since that could lead to a minor (or major) collapse of the services that rely on it. This article discusses a not-so-obvious and not particularly popular way to migrate etcd* from a Kubernetes cluster hosted in one cloud to another. The method outlined below can help you avoid downtime and the consequences that come with it. Both clusters are in the cloud, which means we are sure to encounter some limitations and difficulties (we will be discussing them in detail).

* We are not migrating etcd in which Kubernetes stores the whole cluster state. What we cover is a standalone etcd installation that is used by third-party applications and that runs in a K8s cluster.

So, there are two ways to migrate etcd:

  1. The most obvious one is to take an etcd snapshot and restore it at a new location. But that way involves downtime and therefore is not in the cards for us.
  2. The second method is to propagate etcd over two Kubernetes clusters. To do so, you have to create independent StatefulSets in each of the K8s clusters and then combine them into a single etcd cluster. This method comes with risks: a mistake could affect the existing etcd cluster. In return, it allows you to migrate etcd between clusters with no downtime. This is the method we will focus on below.

Note: In this article, we use AWS, but the process is almost identical for any other cloud provider. The K8s clusters in the examples below are managed using the Deckhouse Kubernetes platform. This means that some functionality may be platform-specific. For cases such as these, we will provide alternative ways of doing things.

We assume you have a basic understanding of etcd and have worked with this database before. We also recommend checking the official etcd documentation.

Step 1: Reducing the size of the etcd database

Note: If your etcd cluster has a limit on the number of revisions of each key, feel free to skip this section.

The first thing to consider before starting the migration is the etcd database size. A large database will increase the bootstrap time for new nodes and may potentially lead to problems. So let’s look at how you can reduce the database size.
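You can check the current database size with etcdctl; it is reported in the DB SIZE column:

# etcdctl endpoint status -w table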

To find out the current revision, start by getting the list of keys:

# etcdctl get / --prefix --keys-only
/main_production/main/config
/main_production/main/failover
/main_production/main/history
…

Now let’s check a random key in the JSON format:

# etcdctl get /main_production/main/history -w=json
{"header":{"cluster_id":13812367153619139789,"member_id":7168735187350299418,"revision":5828757,..

You will see the current cluster revision (5828757 in our case).

Subtract from it the number of recent revisions you want to keep (in our experience, a thousand is sufficient) and run etcdctl compaction with the resulting value:

# etcdctl compaction 5827757

This command applies to the entire etcd cluster, so you only need to run it once on any of the nodes. Please refer to the official documentation for details on how compaction (and other etcdctl commands) works.
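If you prefer not to compute the target revision by hand, here is a minimal sketch that automates the calculation (assuming etcdctl v3 and jq are available in the pod):

# Read the current revision from the endpoint status and compact,
# keeping the last 1000 revisions.
REV=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
etcdctl compaction $((REV - 1000))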

Next, defragment the database to release the freed-up space:

# etcdctl defrag --command-timeout=90s

This command must be run on each node sequentially. We recommend defragmenting all nodes except the leader first, then transferring leadership to one of the defragmented nodes using etcdctl move-leader, and finally defragmenting the last node.
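A rough sketch of that sequence, assuming three pods named etcd-0 to etcd-2 with etcd-2 as the current leader (get the actual member ID from etcdctl member list):

# Defragment the followers first...
kubectl exec etcd-0 -- etcdctl defrag --command-timeout=90s
kubectl exec etcd-1 -- etcdctl defrag --command-timeout=90s
# ...then move leadership to an already defragmented node
# and defragment the remaining one.
kubectl exec etcd-2 -- etcdctl move-leader <etcd-0-member-id>
kubectl exec etcd-2 -- etcdctl defrag --command-timeout=90s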

In our case, the procedure reduced the database size from 800 MB to ~700 KB, significantly cutting down the time required for the subsequent steps.

The etcd chart to use

etcd runs as a StatefulSet. Below is an example of a StatefulSet used in a cluster:

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
  labels:
    app: etcd
spec:
  serviceName: etcd
  selector:
    matchLabels:
      app: etcd
  replicas: 3
  template:
    metadata:
      labels:
        app: etcd
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - etcd
            topologyKey: kubernetes.io/hostname
      imagePullSecrets:
      - name: registrysecret
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.4.18
        command:
        - sh
        args:
        - -c
        - |
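          # Forward SIGTERM/SIGINT to the etcd child process so the pod
          # can shut down gracefully instead of being killed abruptly.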
          stop_handler() {
              >&2 echo "Caught SIGTERM signal!"
              kill -TERM "$child"
          }

          trap stop_handler SIGTERM SIGINT

          etcd \
          --name=$HOSTNAME \
          --initial-advertise-peer-urls=http://$HOSTNAME.etcd:2380 \
          --initial-cluster-token=etcd-cortex-prod \
          --initial-cluster etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380 \
          --advertise-client-urls=http://$HOSTNAME.etcd:2379 \
          --listen-client-urls=http://0.0.0.0:2379 \
          --listen-peer-urls=http://0.0.0.0:2380 \
          --auto-compaction-mode=revision \
          --auto-compaction-retention=1000 &
          child=$!
          wait "$child"
        env:
        - name: ETCD_DATA_DIR
          value: /var/lib/etcd
        - name: ETCD_HEARTBEAT_INTERVAL
          value: "200"
        - name: ETCD_ELECTION_TIMEOUT
          value: "2000"
        resources:
          requests:
            cpu: 50m
            memory: 1Gi
          limits:
            memory: 1Gi
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd
        ports:
        - name: etcd-server
          containerPort: 2380
        - name: etcd-client
          containerPort: 2379
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /usr/local/bin/etcdctl endpoint health
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 10
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: etcd
spec:
  clusterIP: None
  ports:
  - name: etcd-server
    port: 2380
  - name: etcd-client
    port: 2379
  selector:
    app: etcd

We will cover the essential elements of the above chart as we proceed through the article.

Step 2: Making etcd nodes accessible from the outside

If clients outside the Kubernetes cluster are using etcd, chances are you already have some means of routing traffic to the pods. However, in order to bootstrap new nodes, each etcd node must be individually accessible from the outside at a known, fixed address. This is the main challenge when working with a cloud cluster.

In a static Kubernetes cluster, each etcd node is easy to expose: all you need is a NodePort-type service coupled with a hard nodeSelector for the pods. In the cloud, where a pod can move to another node at any time and its IP address is not known in advance, that approach does not apply.
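For reference, in a static cluster a per-node service might look something like this (a sketch; the nodePort values are arbitrary):

apiVersion: v1
kind: Service
metadata:
  name: etcd-0
spec:
  type: NodePort
  ports:
  - name: etcd-server
    port: 2380
    nodePort: 32380
  - name: etcd-client
    port: 2379
    nodePort: 32379
  selector:
    statefulset.kubernetes.io/pod-name: etcd-0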

The solution is to create three separate LoadBalancer services: we need three of them because our etcd cluster is “three-headed”. This will result in LBs being automatically provisioned by the cloud provider. Here is a chart example:

---
apiVersion: v1
kind: Service
metadata:
  name: etcd-0
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-subnets: id
spec:
  externalTrafficPolicy: Local
  loadBalancerSourceRanges:
  - 0.0.0.0/0
  ports:
  - name: etcd-server
    port: 2380
  - name: etcd-client
    port: 2379
  selector:
    statefulset.kubernetes.io/pod-name: etcd-0
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-1
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-subnets: id
spec:
  externalTrafficPolicy: Local
  loadBalancerSourceRanges:
  - 0.0.0.0/0
  ports:
  - name: etcd-server
    port: 2380
  - name: etcd-client
    port: 2379
  selector:
    statefulset.kubernetes.io/pod-name: etcd-1
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-2
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-subnets: id
spec:
  externalTrafficPolicy: Local
  loadBalancerSourceRanges:
  - 0.0.0.0/0
  ports:
  - name: etcd-server
    port: 2380
  - name: etcd-client
    port: 2379
  selector:
    statefulset.kubernetes.io/pod-name: etcd-2
  type: LoadBalancer

The service.beta.kubernetes.io/aws-load-balancer-internal: "true" annotation requests an internal load balancer (one with a private IP). The service.beta.kubernetes.io/aws-load-balancer-subnets: id annotation specifies the subnet to be used by the LB (replace id with your subnet ID). Most cloud providers offer similar functionality; only the annotations differ.
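For instance, on other providers an internal load balancer is typically requested with annotations along these lines (double-check the exact names against your provider's documentation):

# Azure
service.beta.kubernetes.io/azure-load-balancer-internal: "true"
# GKE
networking.gke.io/load-balancer-type: "Internal"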

Once the services are created, the cloud provider provisions a load balancer for each of them. Let’s check whether the etcd client port is reachable through one of the resulting addresses:

# telnet 10.100.0.47 2379
Trying 10.100.0.47...
Connected to 10.100.0.47.
Escape character is '^]'.

Great: our etcd nodes are now accessible from the outside!

Now let’s create the same services in the new cluster. Note that at this point, we are only creating services in the new K8s cluster, not StatefulSets. (You will have to create a StatefulSet in the new cluster with a name that differs from the one in the existing cluster. The hostname in the pods must be different because we are using it as the etcd node name.)

Later, we will call our StatefulSet in the new cluster etcd-main (you may use any other name you wish), so let’s change the selectors and service names to match this new name:

…
name: etcd-main-0
…
  selector:
    statefulset.kubernetes.io/pod-name: etcd-main-0
…

On top of that, you have to change the values in the service.beta.kubernetes.io/aws-load-balancer-subnets: id annotation to the corresponding network ID in the new Kubernetes cluster. No changes are required in any other service resources.

The services are now in place in the new cluster as well. However, there is no point in checking their availability yet, since there are no pods for them to point to.

Step 3: DNS magic

So we have made the new nodes reachable at the IP address level. Now it’s time to look at how etcd nodes identify each other. Below are the relevant startup parameters:

--name=$HOSTNAME \
--initial-advertise-peer-urls=http://$HOSTNAME.etcd:2380 \
--advertise-client-urls=http://$HOSTNAME.etcd:2379 \

We won’t elaborate on what each parameter does: you can learn more about them in the official documentation. What is important here is that the name of the node is the pod’s hostname, and nodes connect to each other using a <hostname>.<service> FQDN, where <service> is the headless Service named in the StatefulSet. In order for a new node to work, you have to ensure that this FQDN is resolvable from the pods. There are several ways to do that:

  1. The easiest way is to add static records to the pods’ /etc/hosts (e.g., using hostAliases) by editing the StatefulSet. The downside is that it requires the pods to be restarted.
  2. Another way is to resolve names at the kube-dns level. That’s the one we’re going to be using in our case. In the example below, static records are added using the Deckhouse kube-dns module:
spec:
  settings:
    hosts:
    - domain: etcd-main-0
      ip: 10.106.0.34
    - domain: etcd-main-1
      ip: 10.106.0.42
    - domain: etcd-main-2
      ip: 10.106.0.47
    - domain: etcd-main-0.etcd-main
      ip: 10.106.0.34
    - domain: etcd-main-1.etcd-main
      ip: 10.106.0.42
    - domain: etcd-main-2.etcd-main
      ip: 10.106.0.47
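If you are not running Deckhouse, a similar effect can be achieved with the CoreDNS hosts plugin by adding the records to the Corefile, roughly like this (a sketch of the relevant fragment):

.:53 {
    hosts {
        10.106.0.34 etcd-main-0 etcd-main-0.etcd-main
        10.106.0.42 etcd-main-1 etcd-main-1.etcd-main
        10.106.0.47 etcd-main-2 etcd-main-2.etcd-main
        fallthrough
    }
    # ... the rest of your usual Corefile (kubernetes, forward, cache, etc.)
}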

Let’s check name resolution from inside a pod:

# host etcd-main-0
etcd-main-0 has address 10.106.0.34
# host etcd-main-0.etcd-main
etcd-main-0.etcd-main has address 10.106.0.34

It’s all okay now! Let’s do the same trick in the new cluster and add the static records for the etcd nodes from the old cluster:

spec:
  settings:
    hosts:
    - domain: etcd-0
      ip: 10.100.0.47
    - domain: etcd-1
      ip: 10.100.0.46
    - domain: etcd-2
      ip: 10.100.0.37
    - domain: etcd-0.etcd
      ip: 10.100.0.47
    - domain: etcd-1.etcd
      ip: 10.100.0.46
    - domain: etcd-2.etcd
      ip: 10.100.0.37

Well, this magic was quite easy, wasn’t it?

Step 4: Adding new nodes to the etcd cluster

At last, we have reached the point where we can add new nodes to the etcd cluster, propagating it over our two Kubernetes clusters. To do so, run the command below in any of the active etcd pods:

etcdctl member add etcd-main-0 --peer-urls=http://etcd-main-0.etcd-main:2380

Since we already know the name of the StatefulSet (etcd-main), we also know the names of the new pods.

Important note: You may ask, “Why not add all the new nodes at once?”. The thing is, once all three new members are registered, we end up with a 6-member cluster and a quorum of 4, while only the three old nodes are actually running. Until the new nodes start, the quorum would be lost, causing the existing nodes to fail.

Now let’s edit the chart for deployment to the new cluster:

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-main
  labels:
    app: etcd-main
spec:
  serviceName: etcd
  selector:
    matchLabels:
      app: etcd
  replicas: 1
  template:
    metadata:
      labels:
        app: etcd
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - etcd
            topologyKey: kubernetes.io/hostname
      imagePullSecrets:
      - name: registrysecret
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.4.18
        command:
        - sh
        args:
        - -c
        - |
          stop_handler() {
              >&2 echo "Caught SIGTERM signal!"
              kill -TERM "$child"
          }

          trap stop_handler SIGTERM SIGINT

          etcd \
          --name=$HOSTNAME \
          --initial-advertise-peer-urls=http://$HOSTNAME.etcd-main:2380 \
          --initial-cluster-state existing \
          --initial-cluster-token=etcd-cortex-prod \
          --initial-cluster etcd-main-0=http://etcd-main-0.etcd-main:2380,etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380 \
          --advertise-client-urls=http://$HOSTNAME.etcd-main:2379 \
          --listen-client-urls=http://0.0.0.0:2379 \
          --listen-peer-urls=http://0.0.0.0:2380 \
          --auto-compaction-mode=revision \
          --auto-compaction-retention=1000 &
          child=$!
          wait "$child"
        env:
        - name: ETCD_DATA_DIR
          value: /var/lib/etcd
        - name: ETCD_HEARTBEAT_INTERVAL
          value: "200"
        - name: ETCD_ELECTION_TIMEOUT
          value: "2000"
        resources:
          requests:
            cpu: 50m
            memory: 1Gi
          limits:
            memory: 1Gi
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd
        ports:
        - name: etcd-server
          containerPort: 2380
        - name: etcd-client
          containerPort: 2379
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /usr/local/bin/etcdctl endpoint health
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 10
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 2Gi

In it, both the name and the start command have changed. Let’s look at the latter in greater detail:

  • The command now has the --initial-cluster-state existing flag. It indicates that the node is joining an existing cluster rather than bootstrapping a new one (see the documentation for details).
  • The --initial-advertise-peer-urls and --advertise-client-urls parameters are different because the name of the StatefulSet has changed.
  • Most importantly, the --initial-cluster flag has changed. It lists all the existing cluster members, including the new etcd-main-0 node.

The nodes are added one by one, so the replicas key must be set to 1 for the first deployment.

Verify that the new node has successfully joined the cluster using etcdctl endpoint status.
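For example, you can query all the known endpoints from any etcd pod (the endpoint list below follows the service names used throughout this article):

etcdctl endpoint status \
  --endpoints=etcd-0:2379,etcd-1:2379,etcd-2:2379,etcd-main-0:2379 \
  -w table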

Add two more nodes (the steps are exactly the same as those described above):

  • Add a new node to the cluster using the etcdctl member add command.
  • Edit the new StatefulSet: increase the replica count by one and add the new node to the --initial-cluster flag.
  • Wait for the node to successfully join the etcd cluster.

Note that the kubectl scale statefulset command alone will not do here, because in addition to the replica count you also have to change the --initial-cluster parameter in the start command of the new StatefulSet.
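For example, when adding the second new node, the flag grows to something like this (following the naming used above):

--initial-cluster etcd-main-0=http://etcd-main-0.etcd-main:2380,etcd-main-1=http://etcd-main-1.etcd-main:2380,etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380 \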

Check the cluster status once again. If everything looks good, you can switch the etcd leader to one of the new nodes using etcdctl:

etcdctl move-leader 60ce6ed30863955f --endpoints=etcd-0:2379,etcd-1:2379,etcd-2:2379,etcd-main-0:2379,etcd-main-1:2379,etcd-main-2:2379

Step 5: Rerouting etcd clients

Now you need to reroute the etcd clients to the new endpoints.

In our case, the client was a PostgreSQL cluster running Patroni. A detailed description of relevant changes in its configuration would be beyond the scope of this article.
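For reference only, in Patroni this boils down to pointing the DCS at the new endpoints in its configuration, roughly like this (a hypothetical snippet; adapt it to your setup):

etcd3:
  hosts:
  - etcd-main-0:2379
  - etcd-main-1:2379
  - etcd-main-2:2379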

Step 6: Deleting the old nodes from the etcd cluster

It’s time to delete the old nodes. Keep in mind that these should be deleted one at a time to avoid losing the cluster quorum. Let’s look at the process step by step:

  • Delete one of the old pods by scaling the StatefulSet in the old K8s cluster:
    kubectl scale sts etcd --replicas=2
  • Delete a member from the etcd cluster:
    etcdctl member remove e93f626220dffb --endpoints=etcd-0:2379,etcd-1:2379,etcd-main-0:2379,etcd-main-1:2379,etcd-main-2:2379
  • Check the status of the etcd cluster:
    etcdctl endpoint health
  • Repeat the process for the remaining nodes.

We recommend keeping the Persistent Volumes of old pods, if possible. They may come in handy should you need to roll back to the original state.
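One way to do that is to switch the reclaim policy of the corresponding volumes to Retain before scaling the old StatefulSet down (replace <pv-name> with the name of the volume bound to the old pod's PVC):

kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'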

After removing all the old nodes, edit the etcd command in the StatefulSet in the new Kubernetes cluster (remove the old nodes from it):

…
--initial-cluster etcd-main-0=http://etcd-main-0.etcd-main:2380,etcd-main-1=http://etcd-main-1.etcd-main:2380,etcd-main-2=http://etcd-main-2.etcd-main:2380
…

Step 7: Deleting the remaining etcd resources in the old Kubernetes cluster

Once the new etcd cluster “settles down” and you are confident that it works as expected, delete the resources left over from the old etcd cluster (Persistent Volumes, Services, etc.). With that, your migration is complete — congratulations!

Conclusion

The method of migrating etcd between Kubernetes cloud clusters described above is not the most obvious one. However, it can help move etcd from one cluster to another rather quickly and without any downtime.
