home sick

Jesse Morgan

March 24, 2009

So, I’m home sick again. Fourth time this year I’ve been sick. Cold, flu, cold, Bronchitis. Awesome. Will I get to rest today? No, of course not.

My server (Unicron) has been up and running for 2 years now- I got the parts right after Ian was born. I set up a nice software raid array at the time that’s served me well. I’d never set up a raid array like this before, so I wasn’t really sure how to monitor it. The raid array has been running fine for 2 years, so I just sorta let it slide.

Over this past weekend, I did some work resizing the lvm partitions and had to poke around with the raid stuff. I found not one, but two ways to monitor it one was to set up a monitor with the mdadm tools and have it email me if there was a problem, and the other lead me to create a simple nagios monitor. I set both up sunday night.

Flash forward to this morning- jackie wakes me up, asks me if I’m going to work (I’d come home sick the day before and was still out of it.) My intention was to wake up long enough to IM my manager and supervisor and let them know I was gonna be sick. I do so, then minimize the im stuff. Staring me in the face was the following:
raidfailure

Wait, wait, wait- my script must be crappy, there’s no way the raid array choked right after setting up the monitoring. I sorta go into denial and check my email:

This is an automatically generated mail message from mdadm
running on unicron
A Fail event had been detected on md device /dev/md2.
It could be related to component device /dev/sdc2.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid5 sda3[0] sdd2[3] sdc2[4](F) sdb3[1]
1461633600 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
md3 : active raid1 sdc1[0] sdd1[1]
979840 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
979840 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
192640 blocks [2/2] [UU]
unused devices:

Frak.

So here I am, praying it was a hiccup while I reboot and rebuild my raid array. it looks like sdc2 went nutty about 45 minutes after I went to bed. I restarted the server and sdc2 reappeared, and I’m rebuilding now to see what happens,

md2 : active raid5 sdc2[4] sda3[0] sdd2[3] sdb3[1]
1461633600 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
[=================>…] recovery = 85.0% (414436248/487211200) finish=26.8min speed=45182K/sec

One thing is for sure, I need to get a auxiliary drive in case this one goes kaput for real. I said I’d buy one in june… 2007. suppose I better get one that, huh?

My apologies of this was nonsensical, I’m really tired.