SHA-1 Broken
I’m not an expert in cryptography, but I try to pay attention to what’s happening in the crypto world. Today, Bruce Schneier announced:
SHA-1 has been broken. Not a reduced-round version. Not a simplified version. The real thing.
The research team of Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu (mostly from Shandong University in China) have been quietly circulating a paper announcing their results:
- collisions in the the full SHA-1 in 269 hash operations, much less than the brute-force attack of 280 operations based on the hash length.
- collisions in SHA-0 in 239 operations.
- collisions in 58-round SHA-1 in 233 operations
So, unless I’m mis-reading this, SHA-1 lost a factor of 2,048; that’s enough to start moving away from SHA-1, but not enough to run screaming in the streets. The last time that SHA-1 attacks showed up, similar attacks were possible against MD5 and possibly also the newer SHA family members; I’m not really sure if there are cryptographic hashes in common use that aren’t at least slightly tainted right now.
Upgrading Asterisk can be a whole lot of fun
It all started on Wednesday night, when I discovered that my Asterisk VoIP server couldn’t make long-distance phone calls any more. I could dial, and the phone would ring, but as soon as the other end answered, the line went dead and the logs started filling up with error messages:
Feb 2 20:24:03 WARNING[-1250968656]: Huh? An ilbc frame that isn't a multiple of 50 bytes long from IAX2 (16)?
That’s kind of weird looking. Google wasn’t particularly helpful, so I decided that this was probably a VoIP provider problem and sent mail to NuFone’s support people. Historically, NuFone’s support department has a reputation of being a bit spotty–either they handle your problem right off the bat, or you never hear from them again, but that was probably a combination of broken email servers and staffing shortages due to rapid growth. In this case, I got a response back from NuFone’s president first thing in the morning. We’ve exchanged a bit of mail in the past, and in this case he agreed that it was a weird error, and the first thing to try was upgrading. I was running Asterisk 1.0.2, while the current version is 1.0.5. He suggested that I might be happier with checking a development version of out CVS–they’ve been more stable for him, anyway, and there are a ton of new features and bug fixes that aren’t in 1.0.5.
What the heck, I said. Sounds like fun.
So, I checked a copy of Asterisk and several of its supporting libraries and drivers out of CVS, built them, installed them, and tried restarting Asterisk. First problem: I’d built my kernel without support for module unloading, somehow. So, I had to reboot to upgrade the drivers for my two PCI VoIP cards. Grumble, grumble. Oh well, I’d been meaning to shut it down and add more RAM anyway; the RAM was sitting right there, and it shouldn’t take long to slide the server out from under its desk and install the RAM. Five minutes later, the box was booting back up with the new drivers and twice as much memory.
Problem 2: As soon as I started Asterisk, I discovered that the X101P PCI card that handles my regular POTS line was reporting a ‘red alarm’ on the line, suggesting that it wasn’t connected right. I crawled under the desk, unplugged the phone line, wiggled connectors, and tested it with a $10 phone that I keep on hand for situations like this. I got a dial tone just fine with the phone, so I plugged the line back into the PC and crawled back out from under the desk to discover…
Problem 3: the server locked up while I was fiddling with the phone line. So I rebooted it, only to discover…
Problem 4: it couldn’t load the drivers for the PCI card anymore. The driver spit out a number of weird errors:
kernel: wcfxo: Out of space to write register 06 with e0 kernel: wcfxo: Out of space to write register 0f with 00
That’s ugly. I developed a new theory: the card wasn’t getting reset right on reboot. So, I powered the box completely down, pulled the power cord for a minute, and then tried again. It worked this time; everything came up right, and I was able to verify that everything was working correctly. Incoming calls worked right, and outgoing calls worked, via both POTS and VoIP.
So fast-forward a few hours. I try to call out, and nothing happens–I just get a dead line. Checking Asterisk’s console, it looks like it dialed out on the POTS line, but I wasn’t hearing ringing or anything. The driver for the X101P PCI card had apparently choked again. I tried stopping and restarting Asterisk, but it didn’t help. I didn’t worry too much, though–incoming calls on my POTS line get forwarded to VoIP after about 15 seconds worth of ringing. Somewhere in the middle of this, a couple calls came in from family members, but I wasn’t close enough to a phone to answer them, so I let them go to voice mail. Except neither call left a message. That seemed strange (my family is big on leaving messages), so I called back, and they said that the line had went dead right after it answered–they never heard the “leave a message” message.
Problem 5: incoming VoIP calls weren’t working, either. The connection died in the middle of a Playback instruction. I tried changing codecs, restarting Asterisk, and even rebooting, all without success. My frustration level was rising. Quickly. At this point, almost nothing was working right–POTS was dead, and incoming VoIP calls only worked right if I answered the phone. I took a break, put the kids to bed, and resisted the urge to scream. I mean, I debug problems like this for a living, which means that I really hate to do it at home.
So I spent a minute thinking–it was really weird that rebooting didn’t fix the VoIP Voicemail problem. I’d tested it earlier in the day, and it had worked after the upgrade to CVS Asterisk. And I hadn’t changed any other software since then. I rebooted again, and this time the X101P PCI VoIP card disappeared completely. Asterisk wouldn’t even start now–my config files are sprinkled with references to this card, and Asterisk was aborting with “can’t find card” errors. So I thought a bit more and came up with a new theory–when I’d pulled the box out to install more RAM, maybe I’d pulled on the phone cable that goes into the X101P a bit too hard, and the card had popped partway out of the slot. It’s easy to test, so I shut the server down again, slid it out from under the desk, popped the PCI card out, checked it for obvious problems, and put it back in firmly. I then plugged everything back in, rebooted, and discovered that everything was working perfectly. POTS worked, incoming voicemail worked via POTS or VoIP, outgoing VoIP calls worked.
I checked again this morning, and everything’s still working fine.
So, did I really pull the card halfway out, or was yesterday’s Asterisk CVS tree just really unstable? And what sort of person goes through this just to save $15 per month on phone calls? And speaking of that–if I’d had a second VoIP provider set up, then I could have shunted calls to them when outgoing calls to NuFone failed, and I would have been able to avoid part of the mess. Hey, yeah, and I still need to finish the Asterisk-caller-ID-on-MythTV project, so I can see who’s calling on the TV while I’m watching movies. Oohh, and my shiny new internal PCI ADSL card is supposed to ship soon. And I need to mod an Xbox so I can run MythTV on it downstairs…
I guess that should answer the “what sort of person” question. A person with incurable gadget lust. I’m getting better. Really.
Use Lynx, go to jail
This is almost too much to believe:
A Londonder made a tsnuami-relief donation using lynx – a text-based browser used by the blind, Unix-users and others – on Sun’s Solaris operating system. The site-operator decided that this “unusual” event in the system log indicated a hack-attempt, and the police broke down the donor’s door and arrested him.
(From Boing Boing)
A fun little DSL outage
Yesterday morning, when I arrived at work, I noticed that my laptop couldn’t connect to my home email server for some reason. Attempts to ping my home web server showed 90% packet loss. That’s kind of an unusual situation for a home network–I’ve had DSL go out quite a few times over the past 5 years, and I’ve had routers crash, but this is the first time that I’ve seen crippling packet loss. My best guess was that something strange had happened with a VPN that I’d set up between home and work, and that something was flinging non-rate-limited UDP or ESP packets at an insane rate.
Since 90% packet loss effects our home VoIP service as well, I had my wife hit reset on our home router/firewall PC. That fixed the unusual 90% packet loss, replacing it with 100% packet loss. When I got home last night, I found that the system had dropped into the BIOS setup screen with a “the previous boot didn’t complete right, so you probably want to change some BIOS settings” error. Grr. I changed the boot time error settings to tell it to ignore all errors, but this will probably happen the next time I need to do an emergency reboot.
Of course, it goes without saying that yesterday was the slowest day of the month, in terms of traffic to my blog. Traffic is way up this month–my average number of visits so far this month is only slightly behind the single best day from last year. My previous high was 595 hits, followed by 524 hits in second place. This month, I’m averaging around 540 hits per day, with a high of almost 800. Yeah, except for yesterday, which was barely 350.
Disks are fun
Okay, so my RAID array died because I wasn’t paying enough attention and my 3ware card had already kicked out one perfectly good drive for no obvious reason. No sweat, I can handle that. I as I mentioned before, I took me most of a day, but I recovered almost all of the data off of the failed 4-drive array onto a new 2-drive RAID-0 array. Once the copy was complete, the goal was to destroy the old, broken RAID-5 array, create a new, working RAID-5 array, and then copy all of the data off of the RAID-0 array onto the new RAID-5 array. Then, when everything was complete, I was planning on using the RAID-0 disks as parity and spare drives for the RAID-5 set. Nice and simple, right?
So, by Friday night, I had 6 drives in front of me. One was bad, three were good, but part of the broken RAID array, and two held the data that had been on the RAID array. My goal was to take the 3 good drives and use them to build a new 4-drive RAID-5 array, so I built a software RAID-5 array in degraded mode–that way, I could get away with leaving out the 4th drive at the beginning. Once I copied the data off of the 5th and 6th drives, I was planning on adding them to the RAID-5 array so I’d have a 4th disk plus a spare.
I was very careful not to re-use the broken drive–it was on 3ware channel #2, so I cleverly built my new array using Linux’s sda, sdc, and sdd devices, skipping sdb. Once RAID-5 was running, I formatted the new array, copied everything from the RAID-0 set, broke down the RAID-0 set, and added the drives to the RAID-5 array. And promptly watched everything crumble to dust. My RAID-5 array started out in degraded mode, with 3 of 4 drives active. I then added 2 additional drives, and instead of watching it rebuild to 4 of 4 plus 1 spare, it went to 2 of 4 active. It even sent me this helpful email:
From: scott@mail.sigkill.org Subject: Fail event on /dev/md1:nfs Date: January 8, 2005 8:16:43 AM PST To: scott@sigkill.org This is an automatically generated mail message from mdadm running on nfs A Fail event had been detected on md device /dev/md1. Faithfully yours, etc.
Although the array was still mounted, any attempt to access it generated a steady stream of I/O errors. What happened, you ask?
Basically, I was an idiot. Like I said, the drive on 3ware channel #2 failed, so I didn’t use drive sdb. Except that 3ware numbers their channels starting with 0. So channel #2 was drive number 3—sdc, not sdb. So I’d rebuilt by array using the bad drive, then copied my data onto the broken disk, and destroyed all of my good copies. I spent all morning Saturday trying to fix things, but I couldn’t even get the kernel to acknowledge that the RAID array existed. I finally gave up and tried cloning sdb onto sdc, to see if that’d work, but it didn’t make a bit of difference–I could at least get mdadm to tell me that sdb had once been a part of a RAID array, but it didn’t recognize any of the data on sdc as any part of anything.
In desperation, I tried re-creating the RAID array exactly as I’d first built it, using sda, sdc, and sdd. Amazingly enough, that worked, and I was able to mount the drive. I then carefully added sdc into the array, watched it rebuild the first 20% of the array, and then fail sdc back out of the array, leaving me back where I started. I finally turned off the computer in disgust and went and played with what was left of our snow.
Sunday was more snow, so I played with the kids, and then finally took one last swing at the computer. I re-built the RAID array again, and then built a RAID-0 array from sde and sdf. I then tried to copy anything that was salvageable off of the broken RAID-5 array. I figured that I’d be able to copy something before it croaked again. I checked back a couple hours later to discover that it’d copied all 216 GB without error. I was stunned–apparently the drive’s problem was really just corruption of a few sectors–writing new data back onto the drive overwrote the weak parts with a new, strong signal, and it was able to read them back safely. Ugh. It wouldn’t resync right because there were still a number of old sectors with old data on them–if I’d zeroed out the whole drive, it’d probably have worked right from the start, for at least a couple months, until it failed again.
So, I went back through the process again, destroying the array built from sda, sdc, and sdd, and then building a new one with sdb this time. There’s no way I’m going to trust the failing drive, even if it did work this time. I copied everything off of the little RAID-0 array, then carefully tore it apart and used its drives to rebuild the big array into its full RAID-5 glory. And it actually worked this time, without errors. Everything was finally finished around midnight last night, and I was able to reboot without problems.
All done, right?
Ha.
This morning I got up to find the screen full of syslogged Ethernet problems–apparently the network card had locked up. I could log in on the console, but I couldn’t ping anything. I rebooted, everything came up okay, and I tried copying a bunch of stuff onto the new RAID array. It copied just fine for about 5 minutes, and then the box locked up hard. No kernel panic or anything, just a dead box. The reset button didn’t help, and it ignored the soft power button, so I had to do the hold-the-power-button-for-5-seconds trick. After that, it didn’t boot right–there were 3ware card errors everywhere–timeouts, not drive problems. It locked up again halfway through booting.
So, practically speaking, I’m right back where I started on Friday morning–my box is dead, but the data is probably fine. I’m going to pop the box open and wiggle some cables, but I probably have bad hardware somewhere in the box–motherboard, 3ware card, or power supply. If this had happened at work, I’d just RMA the whole mess and let the vendor sort it out, but that’s not very useful at home, especially when dealing with a 4-year-old system with a second-hand RAID card. Ugh.
Update: I powered it off for a while, wiggled cables, removed spare hardware, rebooted, and found a nice kernel bug. If you have a RAID array with 4 drives plus a spare, and for some reason the spare’s RAID superblock has a higher timestamp then the 4 data drives, then the kernel’s RAID code will gladly kick the 4 good drives out of the array and keep just the spare. I sense a bug report in my near future.
Anatomy of a Drive Failure
I’ve lost a lot of hard drives over the years, but I’ve never really had the ability to put one under the microscope, so to speak, to see what happened and what I could have done to detect the failure before it became a problem. In generally, even an extra 24 hours’ notice would greatly reduce the amount of data lost and reduce the pain involved in replacing failed drives. Drive makers understand this, and added the S.M.A.R.T. drive monitoring standard to drives years ago. Under Linux, the smartmontools package provides a number of tools for monitoring drives’ SMART status; I’ve been increasingly vigilant about running it on all of my systems, hoping that it’ll let me spot drive failures before data loss occurs.
I lost another drive this week. This is the first drive that I’ve lost that has been actively monitored by smartmontools the entire time, and the logs produced are instructive. Unfortunately, I didn’t pay close enough attention to SMART to prevent data loss, but there are a number of lessons contained in the logs produced. By understanding what the precursors of this drive failure, we should be able to be more reactive when faced with future failures.
First, here are the basic specs on the system and drives involved:
- Athlon 700 (slot A)
- 384 MB RAM (PC133)
- Via KT133 chipset (Asus K7A MB, I think)
- 3ware 7500-8 8-channel IDE RAID controller
- 3 Maxtor 160 GB drives, 1 Hitachi 160 GB drive
The drive that failed was a Maxtor, on channel #2. Here’s what smartmontools 5.30 has to say about the drive in its current condition:
Device Model: Maxtor 4A160J0
Serial Number:A608B7WE
Firmware Version: RAMB1TU0
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is:Fri Jan 7 11:47:02 2005 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was
completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 24) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: ( 243) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 99) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0027 214 214 063Pre-fail Always - 11805
4 Start_Stop_Count 0x0032 253 253 000Old_age Always - 73
5 Reallocated_Sector_Ct 0x0033 249 249 063Pre-fail Always - 41
6 Read_Channel_Margin 0x0001 253 253 100Pre-fail Offline - 0
7 Seek_Error_Rate 0x000a 253 252 000Old_age Always - 0
8 Seek_Time_Performance 0x0027 252 244 187Pre-fail Always - 34394
9 Power_On_Hours 0x0032 224 224 000Old_age Always - 24560
10 Spin_Retry_Count 0x002b 253 252 157Pre-fail Always - 0
11 Calibration_Retry_Count 0x002b 253 252 223Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 253 253 000Old_age Always - 76
192 Power-Off_Retract_Count 0x0032 253 253 000Old_age Always - 0
193 Load_Cycle_Count 0x0032 253 253 000Old_age Always - 0
194 Temperature_Celsius 0x0032 253 253 000Old_age Always - 38
195 Hardware_ECC_Recovered 0x000a 253 252 000Old_age Always - 43456
196 Reallocated_Event_Count 0x0008 251 251 000Old_age Offline - 2
197 Current_Pending_Sector 0x0008 249 249 000Old_age Offline - 41
198 Offline_Uncorrectable 0x0008 253 252 000Old_age Offline - 0
199 UDMA_CRC_Error_Count0x0008 199 199 000Old_age Offline - 0
200 Multi_Zone_Error_Rate 0x000a 253 252 000Old_age Always - 0
201 Soft_Read_Error_Rate0x000a 253 216 000Old_age Always - 37
202 TA_Increase_Count 0x000a 253 248 000Old_age Always - 0
203 Run_Out_Cancel 0x000b 253 245 180Pre-fail Always - 19
204 Shock_Count_Write_Opern 0x000a 253 252 000Old_age Always - 0
205 Shock_Rate_Write_Opern 0x000a 253 252 000Old_age Always - 0
207 Spin_High_Current 0x002a 253 252 000Old_age Always - 0
208 Spin_Buzz 0x002a 253 252 000Old_age Always - 0
209 Offline_Seek_Performnce 0x0024 154 148 000Old_age Offline - 0
99 Unknown_Attribute 0x0004 253 253 000Old_age Offline - 0
100 Unknown_Attribute 0x0004 253 253 000Old_age Offline - 0
101 Unknown_Attribute 0x0004 253 253 000Old_age Offline - 0smartctl also reports a bunch of event log results after this, but they’re not completely relevant right now–the events in question didn’t occur until things started failing.
Looking at the results that smartctl reports, it doesn’t look like anything is particularly wrong. None of the pre-fail statistics are outside of their ideal range, and then old-age statistics make the drive look nearly new. Just looking at these numbers wouldn’t give you any indication that the drive was throwing uncorrectable read errors every few minutes.
So, let’s move on to the syslog results. The smartmontools package actively monitors each of these parameters and logs changes to syslog from time to time. You can look at the raw logs if you want to see the whole picture, but it’s way too long to include in its entirety here. The short version goes like this:
Dec 5 07:31:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252
Dec 5 15:01:06 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 5 15:31:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 5 20:01:04 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 This pattern continues on like this the whole time, with Seek_Time_Performance wandering from 251 to 253 and back. All 3 of my Maxtor drives do this all the time, and have since they were brand-new. It’s just noise in the logs, not a real problem. Next:
Dec 8 01:31:06 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 8 02:01:05 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 This is the first indication of trouble. Notice that it’s not very threatening–Hardware_ECC_Recovered just barely changed and it immediately flipped back to its old value. Plus, it’s marked as a “usage attribute,” which indicates that it’s non-threatening. Continuing:
Dec 13 04:50:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
Dec 13 06:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 13 07:20:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 13 09:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 11:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
Dec 13 13:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 252
Dec 13 21:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252
Dec 13 21:50:57 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Dec 13 21:50:57 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253 This is the first time that Hardware_ECC_Recovered reoccurred after the first occurrence on the 8th. I left the Seek_Time_Performance lines in, just to show that the ECC lines aren’t particularly common–the Seek Time lines show up every couple hours, day in, day out.
The ECC notices continue, showing up again on the 16th, 18th, 25th, and again at 5:20 AM on the 1st. That’s where things start getting interesting:
Jan 1 03:20:57 starting scheduled Long Self-Test. Jan 1 03:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 253 to 251 Jan 1 05:20:56 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 253 to 252 Jan 1 05:50:56 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252 Jan 1 05:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 253 to 252 Jan 1 05:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252 Jan 1 05:50:56 Self-Test Log error count increased from 0 to 1 Jan 1 06:20:55 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253 Jan 1 06:20:55 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 252 to 253
At this point, I hadn’t seen any actual errors yet, but the drive’s SMART self-test had spotted a bad sector. The 2nd and 3rd were basically the same–their self test reported that the same sector was still bad. All hell started to break lose on the 4th:
Jan 4 02:50:56 SMART Usage Attribute: 196 Reallocated_Event_Count changed from 252 to 251
Jan 4 02:50:56 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 252 to 253
Jan 4 07:35:40 ATA error count increased from 980 to 981
Jan 4 08:35:40 SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 252 to 253
Jan 5 02:05:42 starting scheduled Short Self-Test.
Jan 5 02:35:40 SMART Usage Attribute: 198 Offline_Uncorrectable changed from 253 to 252
Jan 5 02:35:40 Self-Test Log error count increased from 3 to 4
Jan 5 06:36:08 SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 253 to 252
Jan 5 06:36:08 SMART Usage Attribute: 197 Current_Pending_Sector changed from 253 to 252
Jan 5 06:36:10 ATA error count increased from 981 to 1293
Jan 5 07:14:45 ATA error count increased from 1293 to 2377 By this point, I was seeing errors in the filesystem. Syslog was filling up with 3ware and XFS errors about disk problems. Things were starting to suck. On the 6th, I ordered new drives, and this morning I started installing them. I’m currently attempting to recover whatever data I can off of the bad disk.
So, there are a couple things that we can learn from this. First, if I’d been paying attention and immediately migrated data off of the failing disk as soon as SMART told me that it had developed a bad sector, then I’d probably have been okay. It took 2 or 3 days before the problem got bad enough to be visible at the filesystem level. Second, if I’d had enough familiarity with this particular Maxtor drive, then I should have noticed that something weird was happening when the ECC errors started climbing. None of my other Maxtor drives have ever logged an ECC message; that makes the Hardware_ECC_Recovered message look kind of suspicious, but that probably only holds for this exact family of Maxtor drives. In a commercial environment, where I had dozens or hundreds of similar drives, I’d want to tell my log monitoring software to pay special attention to that message, because it looks like a good indicator of drive failure.
More importantly, though–if I’d been paying closer attention to my 3ware card, I would have noticed that this 4-drive RAID 5 array was running in degraded mode before the drive failed. If I’d fixed that then, then the drive failure wouldn’t have cost me any data–the array would have dropped the failing drive and warned me, and that would have been that. Instead, I’m looking at a weekend’s worth of hassle as well as some data loss. When I get everything back up and running, I’m probably going to switch from using the 3ware card’s hardware RAID 5 to software RAID 5–I trust Linux’s RAID monitoring tools more then I trust 3ware’s. Also, I was only getting ~25 MB/sec writing with the 3ware’s hardware RAID 5, while I should get closer to 100 MB/sec with software RAID 5.
Drive failure
Well, this isn’t looking promising–my new drives arrived, but the big RAID array is throwing errors left and right. I’m not sure how much data I’ll be able to recover off the thing. Most of the data on the drive is reconstructible, but not everything. Most of this contents are old digital pictures, but I’ve tried to write them all to DVD before throwing them onto the RAID array. Odds are I missed some stuff, though.
Amazingly enough, the system logs seem to have survived unscathed, so I’ll write up a “anatomy of a drive failure” article later, showing what this looked like from a SMART perspective. Since I’ve never actually seen a SMART-monitored drive failure before, it should be somewhat educational.
One step forward, one step back
As mentioned earlier, I spent part of the long weekend cleaning up home theater stuff. Part of this involved migrating files onto my home file server, which is an old Athlon 700 with an 8-channel 3ware RAID card and 4 160 GB drives in a 450 GB RAID 5 array.
So what happens as soon as I finish copying stuff onto the array? A drive starts failing on the RAID array, and I discover that it was already running in degraded mode. Now I’m in danger of losing all 200 GB on the array. Most likely, it won’t come to that, but it’s still fantastically irritating. Of the 4 160 GB drives that I bought last year, 2 of them have now failed.
To make sure that this doesn’t happen again, I just ordered 2 more 160 GB drives from NewEgg (only $76 each), along with a 3-in-2 style drive cooler. Assuming that it all arrives tomorrow, I should be able to rebuild the array, including a spare drive this time, and hopefully I won’t have to worry about it failing again.
RMS on the answering machine?
I need a new message on my home answering machine, but I really don’t think this is what I’m looking for:
If you sign up three people to donate to and join the Free Software Foundation, geek hero/Free Software inventor Richard Stallman or FSF copyfight lawyer superstar Eben Moglen will record an answering machine message for you to use (Boing Boing)
This is deeply wrong. Deeply wrong.
Where in the US?
One of this morning’s blog reading discoveries was a post by Kellan on Find Your Spot, a “where in the US you should live” survey. It asks you a ton of questions about weather, cultural activities, transportation, and recreation and then recommends the 24 best places in the US for you to live. Sounds like fun, right?
Unfortunately, it seems to be dreadfully wrong. My top 3 results:
- Kent, Washington
- Provo/Orem, Utah
- Tacoma, Washington
I mean, I guess it’s nice that 2 out of the top 3 are within an hour of where I live, but Kent? I mean, I guess it’s close to Ikea, and it’s easy to navigate, with all of the big wide streets full of warehouses, but other then that, it’s the butt of most “Seattle suburbs” jokes. Go watch reruns of Almost Live for edification. Tacoma’s supposed to be better then it was, but it’s not exactly what I’m looking for, either.
I’m going to resist commenting on Provo. I’ve been there a couple times, once driving through as part of a road trip, the other on work, and I have *no* desire to go back, much less live there.
The first batch of questions were on weather–I thought that I said that I don’t like cold streaks, warm is good, moderate weather is best, sun is good, but a bit of snow is nice. I suppose that explains number 5 on the list: Anchorage, Alaska.
Here’s the full list:
- Kent, WA
- Provo-Orem, UT
- Tacoma, WA
- Ann Arbor, MI
- Anchorage, AK
- Fort Collins, CO
- Olympia, WA
- Hampton, VA
- Bellingham, WA (Hey! I’m from there)
- Hickory, NC
- Tulsa, OK
- Manchester, NH
- Bismark, ND
- Spokane, WA
- Sioux Falls, SD
- Chapel Hill, NC
- Chattanooga, TN
- Colorado Springs, CO
- Portland, ME
- Greenville, SC
- Loveland, CO
- Amarillo, TX
- Ogden, UT
- Clarksville, TN
I’ve been in 6 out of the top 10, so I guess the list isn’t completely worthless, but in many ways it reminds me of one of those career placement tests that I took in high school. It told me that I was lousy at organizational skills, would average a ‘D’ in college accounting classes, and should consider a career in accountancy. It also claimed that there was no hope that I’d pass a college zoology class, but the rest of the sciences would be straight ‘A’s for me.