“zpool replace” on a dying disk

For the last day I thought the disks in my home server sounded a bit busy. Usually they’re so idle they spin down for much of the day, but they were clicking away when they shouldn’t have been. Later the same day I got delightful logcheck mails saying things like “I/O error, dev sda, sector 976755863”.

Since there was already a spare disk in the machine, I ran “zpool replace tank dying-disk spare-disk”. It seems the sick disk was actually nearly dead: zpool status showed the resilver running at 1.4KB/s and slowing down. At that rate it would have taken about 7 years to complete, so clearly it was going to take a while even if it did improve. Bear in mind that zpool status itself took seven or eight minutes to return, instead of being instant.
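For a sense of scale, the 7-year estimate can be turned around to see how much data it implies. This is only a rough sketch: it assumes 1 KB = 1000 bytes and 365-day years, and the eta_seconds helper is just a name made up for the example.

```shell
#!/bin/sh
# eta_seconds BYTES RATE: whole seconds needed to copy BYTES at RATE bytes/s.
# (Hypothetical helper for illustration only.)
eta_seconds() {
    echo $(( $1 / $2 ))
}

rate=1400                            # 1.4KB/s from zpool status, assuming 1 KB = 1000 B
seven_years=$(( 7 * 365 * 86400 ))   # seconds in seven 365-day years
implied_bytes=$(( rate * seven_years ))

echo "Implied data to resilver: $(( implied_bytes / 1000000000 )) GB"
echo "Days at 1.4KB/s: $(( $(eta_seconds "$implied_bytes" "$rate") / 86400 ))"
```

Under those assumptions the estimate corresponds to roughly 309 GB left to copy, which is 2555 days at the reported rate.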

In the end, I ran shutdown -h, and when that got stuck, just halt (which took a minute to respond). I pulled the disk out, powered the machine back up, and the resilver continued from where it had left off. ZFS has noticed the disk is missing, but it’s simply using the remaining good disk in the mirror to complete the resilver.

Interestingly, despite passing a scrub two days ago, there are now data checksum errors on the remaining disk. zpool status -v shows me the affected filename, and fortunately it’s a file I can trivially recreate.

I wonder: if I had instead manually run zpool attach against the surviving disk, would it have resilvered normally, perhaps falling back to the dying drive when it hit these errors, and possibly without losing any data? I could then have detached the dying disk afterwards. I’ll have to wait until the next disk dies to try that…
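As a note to my future self, the zpool attach idea would look something like the sketch below. It is untested: the device names are the same placeholders used above, attach_plan is a made-up helper, and it deliberately prints the commands instead of running them, since on a live pool I would issue them by hand one at a time. Whether ZFS would really fall back to the dying drive is exactly the part I don’t know.

```shell
#!/bin/sh
# attach_plan POOL SURVIVOR DYING SPARE
# Prints (does not run) the commands for the attach-then-detach approach.
# Hypothetical helper; all device names are placeholders.
attach_plan() {
    pool=$1; survivor=$2; dying=$3; spare=$4
    # 1. Add the spare as an extra side of the mirror, keyed on the healthy disk.
    echo "zpool attach $pool $survivor $spare"
    # 2. Check progress; repeat until the resilver is reported as finished.
    echo "zpool status -v $pool"
    # 3. Only after the resilver has finished, remove the failing disk.
    echo "zpool detach $pool $dying"
}

attach_plan tank surviving-disk dying-disk spare-disk
```

The important difference from what I actually did is the ordering: the dying disk stays in the mirror until the new one is fully resilvered, and only then gets detached.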

RTNETLINK answers: File exists

While setting up a second bridge for virtual machines on an Ubuntu server, I ended up with this error:

# ifup br1

Waiting for br1 to get ready (MAXWAIT is 3 seconds).
RTNETLINK answers: File exists
Failed to bring up br1.

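I found lots of different explanations and possibilities described, but in my case it was simply that when I copied the previous bridge definition, I had kept the gateway line. Each gateway line makes ifup try to add a default route, and the second attempt most likely fails with “File exists” because a default route is already in place. It seems you can only have one gateway defined in your interfaces file, so removing the copied gateway line from br1 made it all work for me.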
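A quick way to check for the same mistake is to count the gateway lines. This is a minimal sketch: count_gateways is a name invented for the example, it assumes the usual /etc/network/interfaces location, and it does not follow source lines into other files. A setup with one IPv4 and one IPv6 gateway would also show more than one line and may be perfectly fine.

```shell
#!/bin/sh
# count_gateways FILE: number of "gateway" lines in an interfaces-style file.
# (Hypothetical helper for illustration only.)
count_gateways() {
    grep -cE '^[[:space:]]*gateway[[:space:]]' "$1"
}

# Default path is an assumption; pass another file as the first argument.
file=${1:-/etc/network/interfaces}
if [ -r "$file" ] && [ "$(count_gateways "$file")" -gt 1 ]; then
    echo "More than one gateway line in $file:"
    # Show each gateway line with its line number.
    grep -nE '^[[:space:]]*gateway[[:space:]]' "$file"
fi
```

If it reports two lines in the same address family, the one under the newly copied bridge stanza is the likely culprit.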

Hopefully somebody finds this and it turns out to be the same fix they need.