One of the fun things about my job is that I get to work with some pretty cool stuff on a daily basis.
One of those things is servers. Today, while my boss and I were driving to a convention in Georgia (a whole different story, the 12 hour drive SUCKED), one of our clients' servers crashed, for the second day in a row.
The kicker is this: after fighting with Dell to RMA a 600GB SAS drive and get it next-day'ed to the site, a tech had, just 3 hours earlier, replaced the known bad drive with the new working one. So our first response was: WTF?! We just replaced the bad drive!
One of our techs ran onsite, and working with him we found that the new drive was marked Offline in the Dell H700 BIOS (uh oh..), and the other drive in the RAID 1 was showing as failed (SHIT!). I had him force Drive 0:1 back online; naturally that should be a fruitless effort that does nothing more than flip the drive's state back on.
The back story for my decision is that one of my own servers, a Dell R410 that also runs on an H700, had a similar issue: the RAID 5 lost two drives due to excessive heat, and the fix, with NO issues, was to set the offline drive back to online and re-establish the RAID. With that thinking, I had the tech try just turning the new, offline drive back on.
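(Side note for the curious: we did this from the H700 BIOS, but if you'd rather poke at the controller from inside the OS, the rough idea looks like the sketch below. It's only a sketch using Dell/LSI's MegaCLI tool, and the enclosure:slot number is a made-up example, not this client's actual box.)

# Rough sketch only -- assumes the MegaCli64 (LSI MegaRAID CLI) binary is on the PATH,
# and the [enclosure:slot] below is a made-up example, not the real drive.
import subprocess

ENCLOSURE_SLOT = "32:1"  # hypothetical enclosure:slot of the drive marked Offline
ADAPTER = "-a0"          # first (and only) H700 adapter in the box

# 1. List all physical drives so we can see which one the controller flags Offline/Failed.
subprocess.run(["MegaCli64", "-PDList", "-aALL"], check=True)

# 2. Force the offline drive back online -- the CLI equivalent of what we did in the BIOS.
subprocess.run(["MegaCli64", "-PDOnline", f"-PhysDrv[{ENCLOSURE_SLOT}]", ADAPTER], check=True)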
When he rebooted the server, he got a Windows prompt that the server had shut down incorrectly (YESSS!!). So far so good; it meant the rebuild of the RAID 1 array was at least somewhat of a success (because if it wasn't, it shouldn't have booted at all, right?). When the system finished booting, there were no errors and no signs of applications failing to start… So basically, a 600GB (SAS 6Gbps) drive somehow managed to rebuild its ENTIRE array in just under 3 hours.. bullshit, I say, but then how do we explain this system being online without the RAID array fully rebuilt..
Well, here is where it gets interesting.. After my tech confirmed with the client that they could get in and work properly, one of my in-office techs called Dell, and after a lengthy chat with their tech found that the RAID 1 that made up Array 0 of the RAID 10 configuration had actually NOT finished rebuilding; the logs from the controller told us it got to 66.7% of the rebuild when the first drive in the array failed.. wait.. 67%?! Yeah, my thoughts exactly. A drive in Array 0 of a RAID 10 somehow got to 67% completion, lost its master drive, and somehow still booted..? Yeah. It makes no sense to us either; from my boss to the other techs, everyone says the system should be DITW (Dead In The Water), yet here it was, alive and kicking.
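(If you ever want to pull that info yourself instead of sitting on the phone with Dell, the controller keeps an event log you can dump and grep for the rebuild entries. Another minimal sketch, same assumptions as above, with a made-up log filename:)

# Sketch: dump the controller's event log and fish out the rebuild-related entries.
# Assumes MegaCli64 again; "h700_events.log" is just an example output filename.
import subprocess

subprocess.run(["MegaCli64", "-AdpEventLog", "-GetEvents", "-f", "h700_events.log", "-aALL"], check=True)

with open("h700_events.log") as log:
    for line in log:
        # Progress ticks and the drive-failure event both mention "Rebuild"
        if "Rebuild" in line:
            print(line.rstrip())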
With that bit of fun, we have to be careful. I suspect there HAS to be some data missing, anything from a Windows update file to some system file that hasn't been accessed yet, and we'll see it down the line. Zombie Server. Ugh. Thankfully, we have a contingency plan in place in case the system does decide to die for good..
And that's all there is for today; nothing better than driving for 6 hours and having a server go belly up. Awesome stuff..
We are in Atlanta now and I'm exhausted. Hopefully tomorrow isn't so exciting, but I'll post back 😀
-B