For the last day I thought the disks in my home server sounded a bit busy. Usually they’re so idle they spin down for much of the day, but they were clicking away when they shouldn’t be. Later the same day I got delightful logcheck mails saying things like “I/O error, dev sda, sector 976755863”.
Since there was already a spare disk in the machine, I ran “zfs replace tank dying-disk spare-disk“. It seems the sick disk was actually nearly dead, and output of zpool status showed the resilver was running at 1.4KB/s, and slowing down. It would have taken about 7 years to complete at that rate. Clearly it was going to take a while, even if it did improve. Bear in mind, it took about seven or eight minutes for the zpool status to complete, instead of being instant.
In the end, I ran shutdown -h, and when that got stuck just halt (which took a minute to respond). I pulled the disk out, powered the machine back up and the resilver had continued from where it was. ZFS has noticed the disk is missing, but it’s just using the remaining good disk in the mirror to complete.
Interestingly, despite passing a scrub two days ago there are now data checksum errors on the remaining disk. zpool status -v shows me the filename, and fortunately it’s a file I can trivially recreate.
I wonder if I’d manually used zpool attach basing it on the surviving disk, would it have resilvered normally, maybe falling back to the dying drive when it found these errors, possibly without losing any data? Then I’d be able to detach the dying disk. I’ll have to wait until the next disk dies to try that…