Tuesday, September 9, 2014

on the mortality of SSDs

One of my servers, lil turbo, was booting from one of those bottom-of-the-barrel ADATA 32GB SSDs.  There are tons of reviews out there saying that these things are little turds, but I was feeling ballsy.  Then, one day, the server wasn't on the network any more.  I went into the closet, where lil turbo lives, to see what was the matter.

One of the non-boot drives was locked in a death grip on the sector it had been reading when it was interrupted, and fractured, seemingly non-Latin characters were bleeding all over the display.  Fuck.

Rebooted, and no dice.  Neither SSD was even seen in POST, not the boot drive and not the one I bought a year ago to mirror the boot drive with.

That was three months ago.

Last week, I decided to take a crack at reviving the comatose lil turbo.  Thinking either the SSD hot swap module or the SATA controller had died, I tried replacing both parts.  Still no dice.

So I started working on something else, and needed a spare 3.5" HDD to test a bus on a different server (vault 101).  So I pulled one of the RAID drives from lil turbo to use.  Then, forgetting that lil turbo was missing a drive, I booted it again, and the SSDs showed up!  However, they didn't boot - the screen came up with "Missing boot drive" or some shit.

I was thinking that the hot swap enclosure must be loose, and the drive was making a connection and then losing it.  But several subsequent boots failed the same way.

Then it hit me.  I grabbed the RAID disk back from vault 101 and inserted it in lil turbo's yawning, empty bay, but not all the way.  Then I went down the front and opened all the hot swap bays for the RAID disks, nine in all, so none of them would be seen or spun up when I next booted lil turbo.

When lil turbo booted, both SSDs were seen, and once it got to grub, I slowly began closing all the RAID drive bays.  Once the system had booted, I issued an mdadm --assemble --verbose /dev/md0 /dev/sd[abcdehijk] and a mount /dev/md0 /mnt/store, and watched the drive lights flicker as my data, marooned for three months, finally came back to me.
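For anyone following along, here is a rough sketch of that last step wrapped in a shell function, so the mount only happens if the assemble worked. The function name assemble_and_mount is made up for illustration, and the extra mdadm --detail call is my addition, not something from the original session.

```shell
# Sketch: assemble an md array from its member disks, show its state,
# then mount it. Needs root. assemble_and_mount is a hypothetical helper.
# Usage: assemble_and_mount MD_DEVICE MOUNT_POINT MEMBER...
assemble_and_mount() {
    if [ "$#" -lt 3 ]; then
        echo "usage: assemble_and_mount MD_DEVICE MOUNT_POINT MEMBER..." >&2
        return 2
    fi
    md_dev="$1"
    mount_point="$2"
    shift 2
    # Remaining arguments are the member disks.
    mdadm --assemble --verbose "$md_dev" "$@" || return 1
    # Print the array state so a degraded array is obvious before mounting.
    mdadm --detail "$md_dev" || return 1
    mount "$md_dev" "$mount_point"
}
```

With the device names from this box, the call would be assemble_and_mount /dev/md0 /mnt/store /dev/sd[abcdehijk], run as root once all the bays are closed.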

* * *

Later I learned that the ADATA was a turd after all - the SMART log showed two critical-looking errors from around the time that the server would have crashed.
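If you want to check a drive's SMART log yourself, something like the following should do it. This is a minimal sketch assuming smartctl from smartmontools is installed; the post doesn't record which tool was actually used, and smart_report is a made-up name.

```shell
# Sketch: print a drive's SMART health verdict and its error log.
# Assumes smartmontools is installed; needs root.
# smart_report is a hypothetical helper name.
smart_report() {
    disk="$1"
    if [ -z "$disk" ]; then
        echo "usage: smart_report DEVICE" >&2
        return 2
    fi
    # -H prints the drive's overall health self-assessment.
    smartctl -H "$disk"
    # -l error prints the drive's logged errors. smartctl's exit status
    # is a bitmask, so it is not treated as simple pass/fail here.
    smartctl -l error "$disk"
    return 0
}
```

Run it as root with the SSD's device node as the argument. The error log entries carry power-on-hour stamps, which is how you can line them up with a crash.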

Next step: turn the root into a btrfs RAID1 and mirror it across both drives, finally!
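The usual way to do this on a live system looks roughly like the sketch below. It assumes the root filesystem is already btrfs on one SSD and that the second SSD is blank; if root is still on another filesystem, it has to be converted or reinstalled first. The function name is made up, and this is not a record of what was run on lil turbo.

```shell
# Sketch: add a second device to a mounted btrfs filesystem and convert
# data and metadata to the raid1 profile. Needs root and a backup.
# convert_to_btrfs_raid1 is a hypothetical helper name.
# Usage: convert_to_btrfs_raid1 NEW_DEVICE [MOUNT_POINT]
convert_to_btrfs_raid1() {
    new_dev="$1"
    fs_mount="${2:-/}"
    if [ -z "$new_dev" ]; then
        echo "usage: convert_to_btrfs_raid1 NEW_DEVICE [MOUNT_POINT]" >&2
        return 2
    fi
    # Grow the filesystem onto the second drive.
    btrfs device add "$new_dev" "$fs_mount" || return 1
    # Rewrite existing data (-d) and metadata (-m) chunks as raid1.
    btrfs balance start -dconvert=raid1 -mconvert=raid1 "$fs_mount" || return 1
    # Show the resulting profiles; both should now read RAID1.
    btrfs filesystem df "$fs_mount"
}
```

The bootloader still has to be installed on both drives separately, otherwise the box only boots while the original SSD is alive.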

(Edit: So I ended up trying various things live and borked the install.  Rather than fixing it or restoring from backup, I decided it was time for a fresh start.  Read about how I reinstalled lil turbo to boot from a raid1 btrfs root here.)
