When running a Ceph cluster with Rook in Kubernetes, replacing nodes and volumes is a common operation, as it is in any cloud-native environment. A Ceph cluster provides redundancy and the ability to self-heal.
Unfortunately, when a drive fails, there is still some manual work to be done to remove the old drive's metadata. This is because Ceph doesn't know whether a drive is just out for a little while or gone forever.
After an OSD drive failure, we start with a blank disk that has none of the previous disk's metadata or files. The only way to
OSD Failure Message
Since the previous OSD deployment is tied to a specific volume, it fails to initialize with this error:
RuntimeError: could not find osd.1 with osd_fsid 8906b2bc-f339-4d7d-b2e6-74eb6e6348b2
The OSD pod will remain in the CrashLoopBackOff state until we fix it.
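To see which OSD pod is the one crash-looping, listing the OSD pods is usually enough. The helper below is a minimal sketch: the function name is made up for illustration, and the namespace rook-ceph and the label app=rook-ceph-osd are assumptions based on Rook's usual defaults, so adjust them if your install differs.

```shell
# Sketch: list the OSD pods so the one stuck in CrashLoopBackOff stands out.
# Assumptions: namespace "rook-ceph" and label "app=rook-ceph-osd" are the
# usual Rook defaults; pass a different namespace as the first argument.
list_osd_pods() {
  kubectl -n "${1:-rook-ceph}" get pods -l app=rook-ceph-osd -o wide
}
```

Running list_osd_pods with no arguments checks the default namespace; the -o wide output also shows which node each OSD pod is scheduled on.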
This is the status of Ceph in this state:
$ ceph status
  cluster:
    id:     7a16c29c-52a0-49bc-ad89-7b7d69343b4e
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 150/450 objects degraded (33.333%), 84 pgs degraded, 208 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 3m)
    mgr: a(active, since 3h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 2 up (since 8m), 3 in (since 3h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 208 pgs
    objects: 150 objects, 133 MiB
    usage:   587 MiB used, 383 GiB / 384 GiB avail
    pgs:     150/450 objects degraded (33.333%)
             124 active+undersized
             84  active+undersized+degraded

  io:
    client:   852 B/s rd, 3.0 KiB/s wr, 2 op/s rd, 0 op/s wr
SOP: Recreate the OSD
Assuming the new drive and node have been booted, we are ready to fix the cluster by removing the old OSD references and creating a new OSD on the new node.
Remove OSD From Ceph Cluster
The first thing to do is remove the old OSD reference from the Ceph cluster. Use the following commands, replacing osd.1 with the appropriate OSD.
$ ceph osd out osd.1
osd.1 is already out.
$ ceph osd crush remove osd.1
removed item id 1 name 'osd.1' from crush map
$ ceph auth del osd.1
updated
$ ceph osd rm osd.1
removed osd.1
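Since the same four commands apply to whichever OSD failed, they can be wrapped in a small helper that takes the OSD id. This is only a sketch: the function name remove_osd is hypothetical, and it assumes the ceph CLI is reachable from the shell you run it in (for example, from inside a Rook toolbox pod).

```shell
# Sketch: remove every reference to a failed OSD from the Ceph cluster.
# Assumes the `ceph` CLI works in this shell. The function name is
# illustrative; the four ceph commands are the ones from the walkthrough.
remove_osd() {
  if [ -z "${1:-}" ]; then
    echo "usage: remove_osd <osd-id>" >&2
    return 1
  fi
  # Stop at the first failing step instead of continuing blindly.
  ceph osd out "osd.$1" &&
  ceph osd crush remove "osd.$1" &&
  ceph auth del "osd.$1" &&
  ceph osd rm "osd.$1"
}
```

With this defined, remove_osd 1 runs the sequence for osd.1. Chaining with && means a failure in one step leaves the later steps untouched, which makes it easier to see where things went wrong.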
Delete Old OSD Deployment
In the Rook namespace, you should see the failing deployment for the bad OSD. Delete it so that the operator can create a replacement.
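Deleting the deployment is a single kubectl call. The sketch below assumes Rook's usual naming, where the namespace is rook-ceph and the deployment for OSD 1 is called rook-ceph-osd-1; both are assumptions, so check the output of kubectl get deployments in your Rook namespace before deleting anything. The function name is hypothetical.

```shell
# Sketch: delete the Deployment that belongs to the failed OSD.
# Assumptions: deployments are named "rook-ceph-osd-<id>" and live in the
# "rook-ceph" namespace (override the namespace with a second argument).
delete_osd_deployment() {
  if [ -z "${1:-}" ]; then
    echo "usage: delete_osd_deployment <osd-id> [namespace]" >&2
    return 1
  fi
  kubectl -n "${2:-rook-ceph}" delete deployment "rook-ceph-osd-$1"
}
```

For the failure in this post, that would be delete_osd_deployment 1.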
The Rook Operator Takes Over
The Rook operator does the rest of the work. It reuses the previous configuration, recognizes that OSD.x no longer exists, and initializes a new OSD in the old one's place.
If you are impatient, you can restart the Rook operator pod, but I've found it fairly responsive if I just wait a minute or so.
The operator will provision an OSD and replace the deployment we removed earlier with a new one.
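If you do want to nudge the operator, one way to restart its pod is a rollout restart of the operator deployment. This is a sketch rather than the only option: it assumes the operator runs as a deployment named rook-ceph-operator in the rook-ceph namespace, which is the usual default but may not match your install, and the function name is made up for illustration.

```shell
# Sketch: restart the Rook operator so it reconciles the cluster right away.
# Assumptions: the operator Deployment is named "rook-ceph-operator" and
# lives in "rook-ceph" (override the namespace with the first argument).
restart_rook_operator() {
  kubectl -n "${1:-rook-ceph}" rollout restart deployment rook-ceph-operator
}
```

A rollout restart replaces the operator pod with a fresh one, which has the same effect as deleting the pod by hand and letting the deployment recreate it.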
Ceph Repairs Itself
Once the new OSD is initialized and added to the cluster, Ceph goes to work rebuilding that OSD from the other replicas.
$ ceph status
  cluster:
    id:     7a16c29c-52a0-49bc-ad89-7b7d69343b4e
    health: HEALTH_WARN
            Degraded data redundancy: 374/450 objects degraded (83.111%), 65 pgs degraded

  services:
    mon: 3 daemons, quorum a,b,c (age 91s)
    mgr: a(active, since 3h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 28s), 3 in (since 41s); 66 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 208 pgs
    objects: 150 objects, 133 MiB
    usage:   431 MiB used, 383 GiB / 384 GiB avail
    pgs:     374/450 objects degraded (83.111%)
             141 active+clean
             56  active+recovery_wait+undersized+degraded+remapped
             8   active+undersized+degraded+remapped+backfill_wait
             1   active+recovering+undersized+remapped
             1   active+recovering+undersized+degraded+remapped
             1   active+undersized+remapped+backfill_wait

  io:
    client:   2.0 KiB/s rd, 2.2 KiB/s wr, 1 op/s rd, 0 op/s wr
    recovery: 682 KiB/s, 0 objects/s
Continue to monitor Ceph until it is healthy again, and you're all done.
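Rather than re-running the status command by hand, the monitoring step can be left to a small polling loop. The sketch below is one possible approach, not something Rook provides: the function name and the 10-second default interval are arbitrary choices of mine, and it assumes the ceph CLI is reachable from the current shell. It relies on ceph health printing a line that starts with HEALTH_OK once recovery has finished.

```shell
# Sketch: poll `ceph health` until the cluster reports HEALTH_OK.
# The first argument is the polling interval in seconds (default 10,
# an arbitrary choice). Assumes the `ceph` CLI works in this shell.
wait_for_health_ok() {
  interval="${1:-10}"
  until ceph health | grep -q '^HEALTH_OK'; do
    sleep "$interval"
  done
  echo "cluster is healthy"
}
```

Note that the loop has no timeout, so it will keep waiting for as long as the cluster stays in HEALTH_WARN or HEALTH_ERR; interrupt it with Ctrl-C if recovery stalls and you need to investigate.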
Not Terribly Painful
While I would love for this to be streamlined even further, the process isn't horrible, and it is in fact less painful than I expected, thanks to Rook.