I had one of those “aw nuts” moments last night.
At some point yesterday afternoon, I noticed that none of the computers in the house could get out to the Internet. After verifying that my wireless network was fine and that internal DNS was up and running, I traced the problem back to my Forefront Threat Management Gateway (TMG) firewall. Attempting to RDP into it proved fruitless, and when I went downstairs and looked at the front of the server, I noticed the hard drive activity light was constantly lit.
So, I powered the server off and brought it back on. Problem solved … well, not really. It happened again a couple of hours later, so I repeated the process and made a mental note that I was going to have to look at the server when I had a chance.
Demanding My Attention
Well, things didn’t “stay fixed.” Later in the evening, the same lack of connectivity surfaced again. I went to the basement, powered the server off, and brought it back up. That time, though, the server wouldn’t start and complained about having nothing to boot from.
As I did a reset and watched it boot again, I could see the problem: although the server knew that something was plugged in for boot purposes, it couldn’t tell that what was plugged in was a 250GB SATA drive. Ugh.
When I run into those types of situations, the remedy is pretty clear: a new hard drive. I always have a dozen or more hard drives sitting around (comes from running a server farm in the basement), and I grabbed a 500GB Hitachi drive that I had left over from another machine. Within five minutes, the drive was in the server and everything was hooked back up.
Down to the Metal
Of course, a new hard drive was only half of the solution. The other half of the equation involved restoring from backup. In this case, a bare metal restore from backup was the most appropriate course of action since I was starting with a blank disk.
For those who may not be familiar with the concept of bare metal restoration, you can get a quick primer from Wikipedia. I use Microsoft’s System Center Data Protection Manager 2010 (DPM) to protect the servers in my environment, so I knew that I had an image from which I could restore my TMG box. I just dreaded the thought of doing so.
Why the worry? Well, I think Arthur C. Clarke summed it up best with the following quote:
Any sufficiently advanced technology is indistinguishable from magic.
The Cold Sweats
Now bare metal restore isn’t “magic,” but it is relatively sophisticated technology … and it’s still an area that seems plagued with uncertainties.
I have to believe that I’m not the only one who feels this way. I’ve co-authored two books on SharePoint disaster recovery, and the second book includes a chapter I wrote that covers bare metal restore on a Windows 2008 server. My experience with bare metal restores can be summarized as follows: when it works, it’s awesome … but it doesn’t always work as we’d want it to. When it doesn’t work, it’s plain ol’ annoying in that it doesn’t explain why.
So, it’s with that mindset that I started the process of trying to clear away my server’s lobotomized state. These are the steps I carried out to get ready for the restore:
- I went into the DPM console, selected the most recent bare metal restore recovery point available to me (as shown on the right), and restored the contents of the folder to a network file share – in my case, \\VMSS-FILE1\RESTORE. Note: you’ll notice a couple of restore points available after the one I selected; those were created in the time since I did the restore but before I wrote this post.
- The approximately 21GB bare metal restore image was created on the share. I do have gigabit Ethernet on my network, and since I recently built out a new DPM server with faster hardware, it really didn’t take too long to get the image restored to the designated file share – maybe five minutes or so. The result was a single folder in the designated file share.
- I carried out a little manipulation on the folder that DPM created; specifically, I cut out two levels of sub-folders and made sure that the WindowsImageBackup folder was available directly from the top of the share as shown at the left. The Windows Recovery Environment (or WinRE) is picky about this detail; if it doesn’t see the folder structure it expects when restoring from a network share, it will declare that nothing is available for you to restore from – even though you know better.
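The folder shuffle in the last step can be scripted. What follows is a minimal sketch, not the procedure actually used for this restore (that was done by hand). It assumes the share is reachable as a normal path and that the image folder is named WindowsImageBackup somewhere below the top level; the function name `hoist_image_folder` is made up for illustration.

```python
import shutil
from pathlib import Path


def hoist_image_folder(share_root, folder_name="WindowsImageBackup"):
    """Move a nested image folder up to the top level of share_root.

    Returns the final path of the folder. Raises FileNotFoundError if no
    folder with that name exists anywhere under share_root.
    """
    root = Path(share_root)
    target = root / folder_name
    if target.is_dir():
        # Already where WinRE expects to find it; nothing to do.
        return target

    # Look through the nested folders that the restore created.
    matches = sorted(p for p in root.rglob(folder_name) if p.is_dir())
    if not matches:
        raise FileNotFoundError(f"No {folder_name} folder under {root}")

    # Move the first match to the top of the share. The now-empty parent
    # folders are left in place; remove them by hand if desired.
    shutil.move(str(matches[0]), str(target))
    return target
```

With the share from this post, the call would look something like `hoist_image_folder(r"\\VMSS-FILE1\RESTORE")`. Run it against a copy first if the image is the only one available.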
With my actual restore image ready to go on the file share, I booted into the WinRE using a bootable USB memory stick with Windows Server 2008 R2 on it. I walked through the process of selecting Repair your computer, navigating out to the file share, choosing my restore image, etc. The process is relatively easy to stumble through, but if you want it in a lot of detail, I’d encourage you to read Chapter 5 (Windows Server 2008 Backup and Restore) in our SharePoint 2010 Disaster Recovery Guide. In that chapter, I walk through the restore process in step-by-step fashion with screenshots.
I got to the point in the wizard where I was prompted to select additional options for restore as shown on the left. By default, the WinRE will format and repartition disks as needed. In my case, that’s what I wanted; after all, I was putting a brand new drive in (one that was larger than the original), so formatting and partitioning were just what the doctor ordered. I also had the ability to exclude some drives (through Exclude disks) from the recovery process – not something I had to worry about given that my system image only covered one hard drive. If my hard drive required additional drivers (as might be needed with a drive array, RAID card, or something equivalent), I also had the opportunity to supply them with the Install drivers option. Again, this was a basic in-place restore; the only thing I needed was a clean-up of the hard drive I supplied, so I clicked Next.
In my experience, the dialog on the left is the last point of easily grasped normal wizard activity before the WinRE restore wizard takes off and we enter “magic land.” As I mentioned, when restores work … they just chug right along and it looks easy. When bare metal and system state restores don’t work, though, the error messages are often unintelligible and downright useless from a troubleshooting and remediation perspective. I hoped that my restore would be one of the happy restores that chugged right along and made me proud of my backup and restore prowess.
I crossed my fingers and clicked the Next button.
<Insert Engine Dying Noises Here>
Well, you knew this blog post would be a whole lot less interesting if everything went according to plan.
Once I worked through my panic and settled down, I looked a little closer. I understood “The system image restore failed” without much interpretation, but I had no idea what to make of this:
Error details: The parameter is incorrect. (0x80070057)
That was the extent of what I had to work with. All I could do was close out and try again. Sheesh.
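For the curious, the hex value in that message is a standard Windows HRESULT, and it can be unpacked into its documented fields: a failure bit, a facility number, and a 16-bit code. The sketch below does that; the function names and the two small lookup tables are illustrative only and cover just the values in this error. The result (facility 7, Win32; code 87, ERROR_INVALID_PARAMETER) is the same "parameter is incorrect" message in numeric form, so it confirms the error is generic and says nothing about which parameter was the problem.

```python
# Illustrative lookup tables; real Windows headers define many more values.
KNOWN_FACILITIES = {7: "FACILITY_WIN32"}
KNOWN_WIN32_CODES = {87: "ERROR_INVALID_PARAMETER"}


def decode_hresult(hresult):
    """Split a 32-bit HRESULT into its failure bit, facility, and code."""
    value = hresult & 0xFFFFFFFF
    return {
        "failure": bool(value >> 31),       # bit 31: 1 means failure
        "facility": (value >> 16) & 0x7FF,  # bits 16-26: source of the error
        "code": value & 0xFFFF,             # bits 0-15: the error code itself
    }


def describe_hresult(hresult):
    """Return a short, human-readable summary of an HRESULT."""
    parts = decode_hresult(hresult)
    facility = KNOWN_FACILITIES.get(parts["facility"], "unknown facility")
    if parts["facility"] == 7:
        name = KNOWN_WIN32_CODES.get(parts["code"], "unknown code")
    else:
        name = "unknown code"
    return f"{facility} ({parts['facility']}), code {parts['code']}: {name}"


print(describe_hresult(0x80070057))
# prints "FACILITY_WIN32 (7), code 87: ERROR_INVALID_PARAMETER"
```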
Let’s face it: there aren’t a whole lot of options to play with in the WinRE when it comes to bare metal restore. The screenshot on the left shows the Advanced options you have available to you, but there really isn’t much to them. I experimented with the Automatically check and update disk error information checkbox, but it really didn’t have an effect on the process. Nevertheless, I tried restores with all combinations of the checkboxes set and cleared. No dice.
With the Advanced options out of the way, there was really only one other place to look: the Exclude disks dialog. I knew Install drivers wasn’t needed, because I had no trouble accessing my disks and wasn’t using anything like a RAID card or some other advanced disk configuration.
I popped open the disk exclusion dialog (shown on the right) and tried running a restore after excluding all of the disks except the Hitachi disk to which I would be writing data (Disk 2). Again, no dice – I continued to get the aforementioned error and couldn’t move forward.
I knew that DPM created usable bare metal images, and I knew that the WinRE worked when it came to restoring those images, so I knew that I had to be doing something wrong. After another half an hour of goofing around, I stopped my thrashing and took stock of what I had been doing.
My Inner Archimedes
My eureka moment came when I put a few key pieces of information together:
- While writing the chapter on Windows Server 2008 Backup and Restore for the SharePoint 2010 DR book, I’d learned that image restores from WinRE are very persnickety about the number of disks you have and the configuration of those disks.
- When DPM was creating backups, only three hard drives were attached to the server: the original 250GB system drive and two 30GB SSD caching drives.
- Booting into WinRE from a memory stick was causing a distinctly visible fourth “drive” to show up in the list of available disks.
The bootable USB stick had to be a factor, so I put it away and pulled out a Windows Server 2008 R2 installation disc. I then booted into the WinRE from the DVD and walked through the entire restore process again. When I got to the confirmation dialog and pressed the Next button this time around, I received no “The parameter is incorrect” errors – just a progress bar that tracked the restore operation.
The one point that’s going to stick with me from here on out is this: if I’m doing a bare metal restore, I need to be booting into the WinRE from a DVD or from some other source that doesn’t affect my drives list. I knew that the disks list was sensitive on restore, but I didn’t expect USB drives to have any sort of effect on whether or not I could actually carry out the desired operation. I’m glad I know better now.
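That lesson can be turned into a quick pre-flight check before starting a restore. The sketch below is only an illustration of the reasoning in this post, not a documented WinRE rule: it flags removable disks and any mismatch between the number of disks present at backup time and the number visible at restore time. The `Disk` class, the disk names, and the 8GB size for the USB stick are all hypothetical; the other sizes come from the story above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    name: str
    size_gb: int
    removable: bool = False


def check_disk_list(backup_disk_count, restore_disks):
    """Return warnings about disks that may confuse a bare metal restore."""
    warnings = []

    # Removable media (such as a bootable USB stick) shows up as a disk.
    for disk in restore_disks:
        if disk.removable:
            warnings.append(f"Removable disk visible: {disk.name}")

    # The backup was taken with a specific number of disks attached.
    if len(restore_disks) != backup_disk_count:
        warnings.append(
            f"Backup saw {backup_disk_count} disks, "
            f"restore environment sees {len(restore_disks)}"
        )
    return warnings


# Three disks at backup time: the system drive and two SSD caching drives.
visible = [
    Disk("Hitachi replacement", 500),
    Disk("SSD cache 1", 30),
    Disk("SSD cache 2", 30),
    Disk("USB boot stick", 8, removable=True),  # hypothetical size
]
for warning in check_disk_list(3, visible):
    print(warning)
```

Booting from a DVD instead corresponds to dropping the last entry from `visible`, at which point the function returns an empty list.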