Replacing a Failed OSD in a Rook Ceph Cluster
When running a Ceph cluster with Rook in Kubernetes, replacing nodes and volumes is a common operation in a cloud-native environment. A Ceph cluster provides redundancy and the ability to self-heal.
Unfortunately, when a drive fails, some manual work is still required to remove the old drive's metadata. This is because Ceph doesn't know whether a drive is just out for a little while or gone forever.
Clean Disk
After an OSD drive failure, we are starting with a blank disk that has none of the metadata or files of the previous disk. The only way to bring the OSD back is to remove the old OSD from the cluster and let Rook provision a new one on the blank disk.
OSD Failure Message
Since the previous OSD deployment is tied to a specific volume, it fails to initialize with this error:
RuntimeError: could not find osd.1 with osd_fsid 8906b2bc-f339-4d7d-b2e6-74eb6e6348b2
The OSD pod will remain in the CrashLoopBackOff state until we fix it.
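To confirm which OSD pod is failing, you can list the OSD pods and check their logs. This sketch assumes the default `rook-ceph` namespace, Rook's standard `app=rook-ceph-osd` label, and its `rook-ceph-osd-<id>` deployment naming; adjust for your cluster.

```shell
# List OSD pods and their states; the failed OSD shows CrashLoopBackOff.
# Assumes the default "rook-ceph" namespace; change -n if yours differs.
kubectl -n rook-ceph get pods -l app=rook-ceph-osd

# Inspect the failing OSD's log for the osd_fsid error
# (osd.1 in this walkthrough, so deployment rook-ceph-osd-1).
kubectl -n rook-ceph logs deploy/rook-ceph-osd-1 --tail=20
```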
Ceph Status
This is the status of ceph in this state:
$ ceph status
  cluster:
    id:     7a16c29c-52a0-49bc-ad89-7b7d69343b4e
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 150/450 objects degraded (33.333%), 84 pgs degraded, 208 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 3m)
    mgr: a(active, since 3h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 2 up (since 8m), 3 in (since 3h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 208 pgs
    objects: 150 objects, 133 MiB
    usage:   587 MiB used, 383 GiB / 384 GiB avail
    pgs:     150/450 objects degraded (33.333%)
             124 active+undersized
             84  active+undersized+degraded

  io:
    client: 852 B/s rd, 3.0 KiB/s wr, 2 op/s rd, 0 op/s wr
SOP: Recreate the OSD
Assuming the new drive and node have been booted, we are ready to fix the cluster by removing the old OSD references and recreating a new OSD on the new node.
Remove OSD From Ceph Cluster
The first thing to do is remove the old OSD reference from the Ceph cluster. Use the following commands (replacing osd.1 with the appropriate OSD).
$ ceph osd out osd.1
osd.1 is already out.
$ ceph osd crush remove osd.1
removed item id 1 name 'osd.1' from crush map
$ ceph auth del osd.1
updated
$ ceph osd rm osd.1
removed osd.1
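As an aside, on Ceph Luminous (12.2) and later, the crush remove, auth del, and osd rm steps can be collapsed into a single purge command. Note that purge takes the numeric OSD ID:

```shell
# Purge OSD 1: removes it from the CRUSH map, deletes its auth key,
# and removes the OSD entry in one step. Destructive, so the safety
# flag is required.
ceph osd purge 1 --yes-i-really-mean-it
```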
Delete Old OSD Deployment
In the rook namespace, you should see the failing deployment for the bad OSD. Delete it.
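Assuming the default `rook-ceph` namespace and Rook's standard `rook-ceph-osd-<id>` deployment naming, that looks something like:

```shell
# Find the deployment backing the failed OSD.
kubectl -n rook-ceph get deployments -l app=rook-ceph-osd

# Delete the deployment for osd.1 so the operator can recreate it.
kubectl -n rook-ceph delete deployment rook-ceph-osd-1
```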
The Rook Operator Takes Over
The Rook operator will do the remainder of the work to recreate a new OSD in the old one's place: it will use the previous configuration, recognize that osd.1 is missing, and initialize a new one.
If you are impatient, you can restart the rook operator pod, but I've found it fairly responsive if I just wait a minute or so.
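If you do want to nudge it along, deleting the operator pod forces a reconcile; the pod is managed by a deployment and will be recreated immediately. This assumes Rook's standard `app=rook-ceph-operator` label:

```shell
# Restart the operator to trigger an immediate reconcile.
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
```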
The operator will provision an OSD and replace the deployment we removed earlier with a new one.
Ceph Repairs Itself
Once the new OSD is initialized and added to the cluster, ceph goes to work rebuilding that OSD from the other replicas.
$ ceph status
  cluster:
    id:     7a16c29c-52a0-49bc-ad89-7b7d69343b4e
    health: HEALTH_WARN
            Degraded data redundancy: 374/450 objects degraded (83.111%), 65 pgs degraded

  services:
    mon: 3 daemons, quorum a,b,c (age 91s)
    mgr: a(active, since 3h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 28s), 3 in (since 41s); 66 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 208 pgs
    objects: 150 objects, 133 MiB
    usage:   431 MiB used, 383 GiB / 384 GiB avail
    pgs:     374/450 objects degraded (83.111%)
             141 active+clean
             56  active+recovery_wait+undersized+degraded+remapped
             8   active+undersized+degraded+remapped+backfill_wait
             1   active+recovering+undersized+remapped
             1   active+recovering+undersized+degraded+remapped
             1   active+undersized+remapped+backfill_wait

  io:
    client:   2.0 KiB/s rd, 2.2 KiB/s wr, 1 op/s rd, 0 op/s wr
    recovery: 682 KiB/s, 0 objects/s
Continue to monitor ceph until it is healthy again, and you're all done.
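Rather than re-running ceph status by hand, you can watch the cluster continuously:

```shell
# Follow the cluster log live as recovery proceeds; Ctrl-C to stop.
ceph -w

# Or poll the health summary until it returns HEALTH_OK.
ceph health
```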
Not Terribly Painful
While I would love for this to be streamlined even further, the process isn't horrible; in fact, thanks to Rook, it was less painful than I expected.