We have already explained how and why we like Rook: working with some kinds of storage in a Kubernetes cluster becomes a lot easier. However, this simplicity brings certain complexities with it. We hope this article will help you avoid many of those complexities before they manifest themselves.
To add some spice to this story, let us suppose that we have just experienced a (hypothetical) problem with the cluster…
Skating on thin ice
Imagine that you have configured and started Rook in your K8s cluster. You’ve been pleased with its operation, and then at some point, this is what happens:
- New pods cannot mount RBD images from Ceph;
- Commands like lsblk and df do not work on the Kubernetes nodes. This suggests that something is wrong with the RBD images mounted on the nodes: you cannot read them, which means that the monitors are unavailable;
- Neither monitors nor OSD/MGR pods are operational in the cluster.
Now it’s time to answer the question: when was the rook-ceph-operator pod last started? It turns out this happened quite recently. Why? The Rook operator had suddenly decided to make a new cluster! So, how do we restore the old cluster and its data?
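A quick way to check this (a sketch; the rook-ceph namespace and the app=rook-ceph-operator label are assumptions that may differ in your setup):

# The AGE column shows when the operator pod was (re)started
kubectl -n rook-ceph get pod -l app=rook-ceph-operator -o wide
# The recent log lines usually reveal what it has been up to
kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=50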
Let’s start with a longer and more entertaining way by investigating Rook internals and restoring its components manually step by step. Obviously, there is a shorter and proper way: you can use backups. As you know, there are two types of administrators: those who do not yet use backups, and those who have painfully learned to use them always (we’ll talk about this in a bit).
A bit of Rook internals, or The long way
Looking around and restoring Ceph monitors
First, we have to examine the list of ConfigMaps: the required rook-ceph-config and rook-config-override ones should be there. They are created upon the successful deployment of the cluster.
NB: In newer versions of Rook (after this PR was accepted), ConfigMaps have ceased to be an indicator of successful cluster deployment.
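Checking for their presence is a one-liner (the rook-ceph namespace is again an assumption; older setups may use a different one):

kubectl -n rook-ceph get configmap rook-ceph-config rook-config-override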
To proceed, we have to do a hard reboot of all servers with mounted RBD images (ls /dev/rbd*). You can do it with sysrq (or “manually” in your data center). This step is necessary to unmount all the mounted RBD images, since a regular reboot will not work in this case (the system will keep trying, and failing, to unmount the images normally).
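For reference, a sysrq-triggered hard reboot from a root shell on the node looks something like this (use with care: the machine reboots immediately, without syncing or unmounting anything):

# enable the sysrq interface if it is not enabled already
echo 1 > /proc/sys/kernel/sysrq
# “b” reboots the machine immediately, skipping unmounts and sync
echo b > /proc/sysrq-trigger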
As you know, the running monitor daemon is the prerequisite for any Ceph cluster. Let’s take a look at it.
Rook mounts the following components into the monitor’s pod:
Volumes:
rook-ceph-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rook-ceph-config
rook-ceph-mons-keyring:
Type: Secret (a volume populated by a Secret)
SecretName: rook-ceph-mons-keyring
rook-ceph-log:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/kube-rook/log
ceph-daemon-data:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/mon-a/data
Mounts:
/etc/ceph from rook-ceph-config (ro)
/etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
/var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw)
/var/log/ceph from rook-ceph-log (rw)
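By the way, this listing comes from kubectl describe on the monitor pod; the label selector below is an assumption based on common Rook conventions:

kubectl -n rook-ceph describe pod -l app=rook-ceph-mon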
Let’s take a closer look at the contents of the rook-ceph-mons-keyring secret:
kind: Secret
data:
keyring: LongBase64EncodedString=
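One way to pull the value out and decode it (a sketch; again, the namespace is an assumption):

kubectl -n rook-ceph get secret rook-ceph-mons-keyring \
  -o jsonpath='{.data.keyring}' | base64 -d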
Upon decoding it, we’ll get the regular keyring with permissions for the administrator and monitors:
[mon.]
key = AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==
caps mon = "allow *"
[client.admin]
key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
caps mds = "allow *"
caps mon = "allow *"
caps osd = "allow *"
caps mgr = "allow *"
Okay. Now let’s analyze the contents of the rook-ceph-admin-keyring secret:
kind: Secret
data:
keyring: anotherBase64EncodedString=
What do we have here?
[client.admin]
key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
caps mds = "allow *"
caps mon = "allow *"
caps osd = "allow *"
caps mgr = "allow *"
Same. Let’s keep looking… For example, here is the rook-ceph-mgr-a-keyring secret:
[mgr.a]
key = AQBZR19dbVeaIhBBXFYyxGyusGf8x1bNQunuew==
caps mon = "allow *"
caps mds = "allow *"
caps osd = "allow *"
Eventually, we discover even more keys in the rook-ceph-mon secret:
kind: Secret
data:
admin-secret: AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
cluster-name: a3ViZS1yb29r
fsid: ZmZiYjliZDMtODRkOS00ZDk1LTczNTItYWY4MzZhOGJkNDJhCg==
mon-secret: AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==
It contains the original list of keyrings and is the source of all the keyrings described above.
As you know (according to dataDirHostPath in the docs), Rook stores this data in two different places. So let’s take a look at the keyrings in the host directories that are mounted into the pods with monitors and OSDs. To do so, we have to find the /var/lib/rook/mon-a/data/keyring file on the node and check its contents:
# cat /var/lib/rook/mon-a/data/keyring
[mon.]
key = AXAbS19d8NNUXOBB+XyYwXqXI1asIzGcGlzMGg==
caps mon = "allow *"
Surprise! The key here differs from the one in the rook-ceph-mon secret.
And what about the admin keyring? It is also present:
# cat /var/lib/rook/kube-rook/client.admin.keyring
[client.admin]
key = AXAbR19d8GGSMUBN+FyYwEqGI1aZizGcJlHMLgx=
caps mds = "allow *"
caps mon = "allow *"
caps osd = "allow *"
caps mgr = "allow *"
Here is the problem: there was a failure, and now everything looks as if the cluster had been recreated, when in fact it was not.
Obviously, the secrets now contain new keyrings, and they don’t match our old cluster. That’s why we have to (a kubectl sketch follows the list):
- take the monitor keyring from the /var/lib/rook/mon-a/data/keyring file (or from a backup);
- replace the keyring in the rook-ceph-mons-keyring secret;
- set the admin and monitor keys in the rook-ceph-mon secret;
- delete the controllers (Deployments) of the pods with monitors.
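Expressed in kubectl terms, that might look roughly as follows. This is a sketch under assumptions: the rook-ceph namespace, the app=rook-ceph-mon label, and all placeholder values need to be adapted to what you actually find on the node or in your backup.

# the old monitor keyring, copied from the node, goes back into the secret
OLD_MON_KEYRING_B64=$(base64 -w0 mon-a-data-keyring)   # file copied from /var/lib/rook/mon-a/data/keyring
kubectl -n rook-ceph patch secret rook-ceph-mons-keyring \
  --type merge -p "{\"data\":{\"keyring\":\"${OLD_MON_KEYRING_B64}\"}}"

# same idea for the mon-secret / admin-secret keys (values base64-encoded, as usual for Secret data)
kubectl -n rook-ceph patch secret rook-ceph-mon \
  --type merge -p '{"data":{"mon-secret":"<old-mon-key-b64>","admin-secret":"<old-admin-key-b64>"}}'

# recreate the monitor pods so they pick up the old keys
kubectl -n rook-ceph delete deployment -l app=rook-ceph-mon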
After a short wait, the monitors are up and running again. Well, that’s a good start!
Restoring OSDs
Now we need to enter the rook-operator pod. Executing ceph mon dump shows that all monitors are in place, and ceph -s says that they are in quorum. However, if we look at the OSD tree (ceph osd tree), we notice something strange: OSDs are starting to appear, but they are empty. It looks like we have to restore them somehow. But how?
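For reference, checking this from inside the operator pod (or a toolbox pod, if you have one) could look like the following, assuming the image ships a shell and the ceph CLI, as older Rook operator images did:

kubectl -n rook-ceph exec -it deploy/rook-ceph-operator -- bash
# ...and inside the pod:
ceph mon dump    # all monitors are listed
ceph -s          # ...and they are in quorum
ceph osd tree    # but the OSDs are empty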
Meanwhile, the much-needed rook-ceph-config and rook-config-override ConfigMaps have finally appeared, along with many other ConfigMaps named rook-ceph-osd-$nodename-config. Let’s take a look at them:
kind: ConfigMap
data:
osd-dirs: '{"/mnt/osd1":16,"/mnt/osd2":18}'
They are all jumbled up!
Let’s scale the operator deployment down to zero, delete the generated Deployments for the OSD pods, and fix these ConfigMaps. But where do we get the correct map of OSD distribution between the nodes?
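The first part of that is straightforward (the labels below are assumptions based on common Rook conventions); finding the correct OSD layout is the harder part:

# stop the operator so it does not interfere while we edit things
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
# remove the generated OSD deployments; they will be recreated later
kubectl -n rook-ceph delete deployment -l app=rook-ceph-osd
# then fix the per-node ConfigMaps by hand
kubectl -n rook-ceph edit configmap rook-ceph-osd-<nodename>-config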
- What if we dig into the /mnt/osd[1–2] directories on the nodes? Maybe we can find something there.
- There are two subdirectories in /mnt/osd1: osd0 and osd16. The second one matches the ID defined in the ConfigMap (16).
- Looking at their sizes, we see that osd0 is much larger than osd16.
We conclude that osd0 is the “old” OSD we need. It was defined as /mnt/osd1 in the ConfigMap (since we use directory-based OSDs).
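Comparing the directory sizes on the node makes this obvious, e.g.:

# osd0 holds the real data; the freshly created osd16 is nearly empty
du -sh /mnt/osd1/*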
Step by step, we dig into the nodes and fix the ConfigMaps. Once we’re done, we can run the rook-operator pod and analyze its logs. And they paint a rosy picture:
- “I am the operator of the cluster”;
- “I have found disk drives on nodes”;
- “I have found monitors”;
- “Monitors are in the quorum, good!”;
- “I am starting OSD deployments…”.
Let’s check the cluster liveness by entering the Rook operator pod. Well, it looks like we have made some mistakes with OSD names on several nodes! No big deal: we fix the ConfigMaps, delete the redundant directories of the new OSDs, and voilà: our cluster finally becomes HEALTH_OK!
Let’s examine images in the pool:
# rbd ls -p kube
pvc-9cfa2a98-b878-437e-8d57-acb26c7118fb
pvc-9fcc4308-0343-434c-a65f-9fd181ab103e
pvc-a6466fea-bded-4ac7-8935-7c347cff0d43
pvc-b284d098-f0fc-420c-8ef1-7d60e330af67
pvc-b6d02124-143d-4ce3-810f-3326cfa180ae
pvc-c0800871-0749-40ab-8545-b900b83eeee9
pvc-c274dbe9-1566-4a33-bada-aabeb4c76c32
…
Everything is in place now — the cluster is rescued!
A lazy man’s approach, or The quick way
For backup devotees, the rescue procedure is simpler and boils down to the following (a rough kubectl sketch follows the list):
- Scale the Rook operator’s deployment down to zero;
- Delete all deployments except the Rook operator’s;
- Restore all secrets and ConfigMaps from a backup;
- Restore the contents of the /var/lib/rook/mon-* directories on the nodes;
- Restore the CephCluster, CephFilesystem, CephBlockPool, CephNFS, and CephObjectStore custom resources (if they were lost somehow);
- Scale the Rook operator’s deployment back to 1.
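A sketch of what such a backup and restore could look like (the namespace and resource lists are assumptions; the point is simply “dump everything Rook keeps in the API and on disk, then put it back”):

# backup, done regularly while the cluster is healthy:
kubectl -n rook-ceph get secret,configmap -o yaml > rook-secrets-cm.yaml
kubectl get cephcluster,cephblockpool,cephfilesystem -A -o yaml > rook-crs.yaml
# ...plus /var/lib/rook/mon-* from the nodes (tar, rsync, whatever you prefer)

# restore:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
kubectl -n rook-ceph delete deployment -l 'app!=rook-ceph-operator'
kubectl apply -f rook-secrets-cm.yaml -f rook-crs.yaml
# put /var/lib/rook/mon-* back on the nodes, then:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1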
Hints and tips
Always make backups!
And here are a few tips on how to avoid situations where you will desperately need those backups:
- If you’re planning some large-scale manipulations of your cluster that involve server restarts, we recommend scaling the rook-operator deployment down to zero to prevent it from “doing stuff”;
- Specify nodeAffinity for the monitors in advance;
- Pay close attention to preconfiguring the ROOK_MON_HEALTHCHECK_INTERVAL and ROOK_MON_OUT_TIMEOUT values (see the example after this list).
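In the Rook versions of that era these were environment variables on the operator deployment (newer releases move such settings into the operator ConfigMap), so one way to set them is the following; the values are purely illustrative, not recommendations:

kubectl -n rook-ceph set env deployment/rook-ceph-operator \
  ROOK_MON_HEALTHCHECK_INTERVAL=45s \
  ROOK_MON_OUT_TIMEOUT=600s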
Conclusion
There is no denying that Rook, as an additional layer [in the overall structure of Kubernetes storage], simplifies many things in the infrastructure while complicating others. All you need to do is make a well-considered, informed choice about whether the benefits outweigh the risks in each particular case.
By the way, a new section, “Adopt an existing Rook Ceph cluster into a new Kubernetes cluster”, was recently added to the Rook documentation. It describes in detail the steps required to adopt an existing Rook Ceph cluster into a new Kubernetes cluster, as well as how to recover a cluster that has failed for some reason.