Once the failed drive was replaced, the cluster rebalanced and repaired the inconsistent PGs. ceph status is back to HEALTH_OK:
  cluster:
    id:     105370dd-a69b-4836-b18c-53bcb8865174
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-mon00,ceph-mon01,ceph-mon02 (age 33m)
    mgr: ceph-mon02(active, since 13d), standbys: ceph-mon00, ceph-mon01
    mds: cephfs:2 {0=ceph-mon00=up:active,1=ceph-mon02=up:active} 1 up:standby
    osd: 30 osds: 30 up (since 9d), 30 in (since 9d)

  data:
    pools:   5 pools, 385 pgs
    objects: 8.42M objects, 29 TiB
    usage:   41 TiB used, 14 TiB / 55 TiB avail
    pgs:     385 active+clean

  io:
    client:   7.7 KiB/s wr, 0 op/s rd, 0 op/s wr
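For anyone following along at home, the recovery boils down to a handful of standard Ceph commands. This is a sketch of the typical drive-swap workflow rather than an exact transcript; the OSD ID (osd.7) and PG ID (2.1f) below are placeholders:

    # Identify the inconsistent PGs and which OSD backs the bad drive
    ceph health detail | grep inconsistent

    # Take the failing OSD out so data rebalances off of it
    ceph osd out osd.7

    # After swapping the hardware, remove the old OSD record
    ceph osd purge osd.7 --yes-i-really-mean-it

    # Ask Ceph to re-check and repair an inconsistent placement group
    ceph pg repair 2.1f

    # Watch progress until everything is active+clean
    ceph -s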
After poking around on the failed drive, it looks like the actual 2.5" drive itself is fine. The USB-to-SATA controller seems to be the culprit: it randomly garbles data over the USB interface, and I was also able to observe it fail to enumerate on the USB bus. A failure rate of 1 in 30 isn't bad considering the cost of the drives.
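If you want to run the same kind of check yourself, SMART data over a direct SATA connection plus watching the kernel log during USB hotplug is enough to separate a bad disk from a bad bridge. A rough sketch, assuming the drive shows up as /dev/sda:

    # Query SMART health with the drive attached directly over SATA
    # (many USB-SATA bridges don't pass SMART commands through reliably)
    smartctl -a /dev/sda

    # Watch the kernel log while plugging the drive in over USB;
    # enumeration failures show up as "device descriptor read" errors
    dmesg -w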