The good news: Charlie’s Server’s FTP service is back online, with all of the old data to boot!
The bad news: One of the drives in the FTP array died.
The ugly news: LVM didn’t want to activate the array in partial mode.
Read on for technical details and how I revived a dead LVM array with a missing drive.
About a week ago, I realized that one of the disks in my LVM array was dying, with I/O errors on certain sectors of the drive. Not wanting to take too much of a risk (of course, no backups and no mirroring, ’cause the cool kids live on the edge like that), I decided to initiate a pvmove immediately. Trying to be smart about it, I decided to do the pvmove in chunks rather than have it process the entire drive at once. Of the 3500 or so physical extents of 32MB each, I did my moves in the following fashion and order.
| Physical Extents | Result |
|---|---|
| 1-1000 | Success |
| 1001-2000 | Success |
| The Rest | FAILURE |
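For anyone wanting to try the same chunked approach, here is a rough sketch of how the extent ranges could be generated. It assumes LVM's zero-based extent numbering and the `PV:first-last` range syntax that `pvmove` accepts. The device name `/dev/sdX1` is a placeholder, and the uniform 1000-extent chunks are a simplification (I moved everything after the first 2000 in one go).

```python
EXTENT_SIZE_MB = 32    # extent size quoted in the post
TOTAL_EXTENTS = 3500   # approximate extent count quoted in the post


def chunk_ranges(total, chunk):
    """Split extents 0..total-1 into inclusive (first, last) ranges."""
    return [(start, min(start + chunk, total) - 1)
            for start in range(0, total, chunk)]


def pvmove_commands(source_pv, total, chunk):
    """Build one pvmove command line per extent range (strings only, nothing is run)."""
    return [f"pvmove {source_pv}:{first}-{last}"
            for first, last in chunk_ranges(total, chunk)]


# /dev/sdX1 is a hypothetical device name; substitute the failing PV.
for cmd in pvmove_commands("/dev/sdX1", TOTAL_EXTENTS, 1000):
    print(cmd)

# Rough amount of data involved, treating the extent size as 32 MiB.
print(f"about {TOTAL_EXTENTS * EXTENT_SIZE_MB / 1024:.1f} GiB to move")
```

The script only prints the commands, so each one can be run by hand and checked before moving on to the next range.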
Amazingly enough, the drive died completely during the metadata update phase of the move. This means that all of the data was actually mirrored by that point (or at least as much of it as could be read from the device without error). What threw me off was that LVM simply refused to activate the volume group and logical volume(s), even in --partial mode.
After failing to bring back the volumes using partial mode, I started to try a bunch of random, even crazy, solutions. Word on the grapevine is that you can sometimes revive a dead drive by freezing it (literally, in the freezer). So I left the culprit in the freezer for ninety minutes, and then gave it a shot. Still no dice, though this technique has apparently worked for others.
Just before deciding to completely scrap the entire volume group, I thought of something only half-crazy. Why not complete pvmove's job by hand? It should be fairly straightforward to manually modify the volume group configuration and overwrite the currently loaded configuration with my modified version. I was able to do just that, with some calculations as to which physical extents belong in which sections of which logical volumes.
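The extent bookkeeping is the fiddly part, so here is a minimal sketch of that kind of calculation. It is an illustration rather than the exact procedure I followed: the `Segment` and `Move` types, the PV names, and the numbers are all hypothetical, and it assumes simple linear segments and non-overlapping moves. Given a logical volume segment and the ranges that were moved, it works out which pieces should now point at the destination PV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    lv_start: int   # first logical extent covered by this segment
    count: int      # number of extents in the segment
    pv: str         # physical volume holding the data
    pv_start: int   # first physical extent on that PV


@dataclass(frozen=True)
class Move:
    src_pv: str
    src_start: int
    count: int
    dst_pv: str
    dst_start: int


def remap(seg, moves):
    """Split seg so that any part inside a moved range points at the destination PV."""
    out = []
    pos = seg.pv_start
    end = seg.pv_start + seg.count
    relevant = sorted((m for m in moves if m.src_pv == seg.pv),
                      key=lambda m: m.src_start)
    for m in relevant:
        lo = max(pos, m.src_start)
        hi = min(end, m.src_start + m.count)
        if lo >= hi:
            continue  # this move does not touch the segment
        if lo > pos:
            # Unmoved piece before the moved range stays on the original PV.
            out.append(Segment(seg.lv_start + (pos - seg.pv_start),
                               lo - pos, seg.pv, pos))
        # Moved piece: same logical position, new physical location.
        out.append(Segment(seg.lv_start + (lo - seg.pv_start),
                           hi - lo, m.dst_pv, m.dst_start + (lo - m.src_start)))
        pos = hi
    if pos < end:
        # Unmoved tail after the last moved range.
        out.append(Segment(seg.lv_start + (pos - seg.pv_start),
                           end - pos, seg.pv, pos))
    return out


# Hypothetical numbers: one 3500-extent segment on pv1, with the last
# 1500 extents relocated to pv2 starting at physical extent 100.
for piece in remap(Segment(0, 3500, "pv1", 0),
                   [Move("pv1", 2000, 1500, "pv2", 100)]):
    print(piece)
```

Each resulting piece corresponds to one segment entry that would have to be written into the edited volume group configuration, which is why getting the offsets right matters so much.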
Somehow, I ended up getting things right on the first try (if I had a dime every time that happened, I’d still be dirt poor), and am pleased to announce that I don’t see any data integrity loss! After running a routine check of all of the remaining devices for bad blocks (lest this happen again right away) and then a routine e2fsck -y to fix any errors on the filesystem, all was well! Thankfully the aforementioned dead sectors (on which those I/O errors occurred) stored blank nodes on the filesystem anyways, so things just worked themselves out.
That having been said, I’m definitely keeping an anxious eye on ZFS and specifically its port to Linux. Looks like it might solve some of my problems!