Bare Metal Bugaboos

Having recently recovered from a firewall outage using Windows Server 2008’s bare metal restore capabilities, I figured I’d write a quick post to cover how I did it. I also cover one really big learning I picked up as a result of the process.

I had one of those “aw nuts” moments last night.

At some point yesterday afternoon, I noticed that none of the computers in the house could get out to the Internet.  After verifying that my wireless network was fine and that internal DNS was up-and-running, I traced the problem back to my Forefront Threat Management Gateway (TMG) firewall.  Attempting to RDP into it proved fruitless, and when I went downstairs and looked at the front of the server, I noticed the hard drive activity light was constantly lit.

So, I powered the server off and brought it back on.  Problem solved … well, not really. It happened again a couple of hours later, so I repeated the process and made a mental note that I was going to have to look at the server when I had a chance.

Demanding My Attention

Well, things didn’t “stay fixed.”  Later in the evening, the same lack of connectivity surfaced again.  I went to the basement, powered the server off, and brought it back up.  That time, though, the server wouldn’t start and complained about having nothing to boot from.

As I did a reset and watched it boot again, I could see the problem: although the server knew that something was plugged in for boot purposes, it couldn’t tell that what was plugged in was a 250GB SATA drive.  Ugh.

When I run into those types of situation, the remedy is pretty clear: a new hard drive.  I always have a dozen or more hard drives sitting around (comes from running a server farm in the basement), and I grabbed a 500GB Hitachi drive that I had leftover from another machine.  Within five minutes, the drive was in the server and everything was hooked back up.

Down to the Metal

Of course, a new hard drive was only half of the solution.  The other half of the equation involved restoring from backup.  In this case, a bare metal restore from backup was the most appropriate course of action since I was starting with a blank disc.

For those who may not be familiar with the concept of bare metal restoration, you can get a quick primer from Wikipedia.  I use Microsoft’s System Center Data Protection Manager 2010 (DPM) to protect the servers in my environment, so I knew that I had an image from which I could restore my TMG box.  I just dreaded the thought of doing so.

Why the worry?  Well, I think Arthur C. Clarke summed it up best with the following quote:

Any sufficiently advanced technology is indistinguishable from magic.

The Cold Sweats

Now bare metal restore isn’t “magic,” but it is relatively sophisticated technology … and it’s still an area that seems plagued with uncertainties.

I have to believe that I’m not the only one who feels this way.  I’ve co-authored two books on SharePoint disaster recovery, and the second book includes a chapter I wrote that covers bare metal restore on a Windows 2008 server.  My experience with bare metal restores can be summarized as follows: when it works, it’s awesome … but it doesn’t always work as we’d want it to.  When it doesn’t work, it’s plain ol’ annoying in that it doesn’t explain why.

So, it’s with that mindset that I started the process of trying to clear away my server’s lobotomized state.  These are the steps I carried out to get ready for the restore:

  1. DPM consoleI went into the DPM console, selected the most recent bare metal restore recovery point available to me (as shown on the right), and restored the contents of the folder to a network file share– in my case, \\VMSS-FILE1\RESTORENote: you’ll notice a couple of restore points available after the one I selected; those were created in the time since I did the restore but before I wrote this post.
  2. The approximately 21GB bare metal restore image was created on the share.  I do have gigabit Ethernet on my network, and since I recently built-out a new DPM server with faster hardware, it really didn’t take too long to get the image restored to the designated file share – maybe five minutes or so.  The result was a single folder in the designated file share.
  3. Folder structure for restore shareI carried out a little manipulation on the folder that DPM created; specifically, I cut out two levels of sub-folders and made sure that the WindowsImageBackup folder was available directly from the top of the share as shown at the left.  The Windows Recovery Environment (or WinRE) is picky about this detail; if it doesn’t see the folder structure it expects when restoring from a network share, it will declare that nothing is available for you to restore from – even though you know better.

In Recovery

With my actual restore image ready to go on the file share, I booted into the WinRE using a bootable USB memory stick with Windows 2008 R2 Server on it.  I walked through the process of selecting Repair your computer, navigating out to the file share, choosing my restore image, etc.  The process is relatively easy to stumble through, but if you want it in a lot of detail, I’d encourage you to read Chapter 5 (Windows Server 2008 Backup and Restore) in our SharePoint 2010 Disaster Recovery Guide.  In that chapter, I walk through the restore process in step-by-step fashion with screenshots.

Additional restore options dialogI got to the point in the wizard where I was prompted to select additional options for restore as shown on the left.  By default, the WinRE will format and repartition discs as needed.  In my case, that’s what I wanted; after all, I was putting a brand new drive in (one that was larger than the original), so formatting and partitioning was just what the doctor ordered.  I also had the ability to exclude some drives (through Exclude disks) from the recovery process – not something I had to worry about given that my system image only covered one hard drive.  If my hard drive required additional drivers (as might be needed with a drive array, RAID card, or something equivalent), I also had the opportunity to supply them with the Install drivers option.  Again, this was a basic in-place restore; the only thing I needed was a clean-up of the hard drive I supplied, so I clicked Next.

Confirmation dialogI confirmed the details of the operation on the next page, and everything looked right to me.  I then paused to mentally double-check everything I was about to do.

In my experience, the dialog on the left is the last point of easily grasped normal wizard activity before the WinRE restore wizard takes off and we enter “magic land.”  As I mentioned, when restores work … they just chug right along and it looks easy.  When bare metal and system state restores don’t work, though, the error messages are often unintelligible and downright useless from a troubleshooting and remediation perspective.  I hoped that my restore would be one of the happy restores that chugged right along and made me proud of my backup and restore prowess.

I crossed my fingers and clicked the Next button.

<Insert Engine Dying Noises Here>

A picture of the restore going belly-upThe screenshot on the right shows what happened almost immediately after I clicked next.

Well, you knew this blog post would be a whole lot less interesting if everything went according to plan.

Once I worked through my panic and settled down, I looked a little closer.  I understood The system image restore failed without much interpretation, but I had no idea what to make of

Error details: The parameter is incorrect. (0x80070057)

That was the extent of what I had to work with.  All I could do was close out and try again.  Sheesh.

Head Scratching

Advanced options dialogLet’s face it: there aren’t a whole lot of options to play with in the WinRE when it comes to bare metal restore.  The screenshot on the left shows the Advanced options you have available to you, but there really isn’t much to them.  I experimented with the Automatically check and update disk error information checkbox, but it really didn’t have an effect on the process.  Nevertheless, I tried restores with all combinations of the checkboxes set and cleared.  No dice.

With the Advanced options out of the way, there was really only one other place to look: the Exclude disks dialog.  I knew Install drivers wasn’t needed, because I had no trouble accessing my disks and wasn’t using anything like a RAID card or some other advanced disk configuration.

Disk exclusion dialogI popped open the disk exclusion dialog (shown on the right) and tried running a restore after excluding all of the disks except the Hitachi disk to which I would be writing data (Disk 2).  Again, no dice – I still continued to get the aforementioned error and couldn’t move forward.

I knew that DPM created usable bare metal images, and I knew that the WinRE worked when it came to restoring those images, so I knew that I had to be doing something wrong.  After another half an hour of goofing around, I stopped my thrashing and took stock of what I had been doing.

My Inner Archimedes

My eureka moment came when I put a few key pieces of information together:

  • While writing the chapter on Windows Server 2008 Backup and Restore for the SharePoint 2010 DR book, I’d learned that image restores from WinRE are very persnickety about the number of disks you have and the configuration of those disks.
  • When DPM was creating backups, only three hard drives were attached to the server: the original 250GB system drive and two 30GB SSD caching drives.
  • Booting into WinRE from a memory stick was causing a distinctly visible fourth “drive” to show up in the list of available disks.
    The bootable USB stick had to be a factor, so I put it away and pulled out a Windows Server 2008 R2 installation disk.  I then booted into the WinRE from the DVD and walked through the entire restore process again.  When I got to the confirmation dialog and pressed the Next button this time around, I received no The parameter is incorrect errors – just a progress bar that tracked the restore operation.

Takeaway

The one point that’s going to stick with me from here on out is this: if I’m doing a bare metal restore, I need to be booting into the WinRE from a DVD or from some other source that doesn’t affect my drives list.  I knew that the disks list was sensitive on restore, but I didn’t expect USB drives to have any sort of effect on whether or not I could actually carry out the desired operation.  I’m glad I know better now.

Additional Reading and References

  1. Product Overview: Forefront Threat Management Gateway 2010
  2. Wikipedia: Bare-metal restore
  3. Product Overview: System Center Data Protection Manager 2010
  4. Book: SharePoint 2010 Disaster Recovery Guide

DPM, RPC, and DCOM with Forefront TMG 2010

In this post, I discuss a couple of DCOM/RPC snags I ran into while configuring Microsoft’s Data Protection Manager (DPM) 2007 client protection agent to run on my new Forefront Threat Management Gateway (TMG) 2010 server. I also walk through the troubleshooting approach I took to resolve the issues that appeared.

In a recent post, I was discussing my impending move to Microsoft’s Forefront Threat Management Gateway (TMG) 2010 on my home network.  As part of the move, I was going to decommission two Microsoft Internet Security and Acceleration (ISA) 2006 servers and an old Windows Server 2008 remote access services (RAS) box and replace them with a single TMG 2010 server – a big savings in terms of server maintenance and power consumption.

I completed the move about a week ago, and I’ve been very happy with TMG thus far.  TMG’s ISP redundancy and load balancing features have been fantastic, and they’ve allowed me to use my Internet connections much more effectively.

As a user of ISA since its original 2000 release, I also had no problem jumping in and working with the TMG management GUI.  It was all very familiar from the get-go.  Call me “very satisfied” thus far.

This afternoon, I took a few moments to un-join the old ISA servers from the domain, power them down, and clean things up.  I had also planned to take a little time integrating the new TMG box into my Data Protection Manager (DPM) 2007 backup rotation.  Unfortunately, though, the DPM integration took a bit longer than expected …

Backup Brigade

For those unfamiliar with the operation of DPM, I’ll take a couple of moments to explain a bit about how it works.  In order for DPM to do its thing, any computer that is going to be protected must have the DPM 2007 Protection Agent installed on it.  Once the DPM Protection Agent is installed and configured, the DPM server communicates through the agent (which operates as a Windows service) to leverage the protected system’s Volume Shadow Copy Service (VSS) for backups.

Installing the DPM agent typically isn’t a big challenge for common client computers, and it can be accomplished directly from within the DPM management GUI itself.  When the agent is installed through the GUI, DPM connects to the computer to be protected, installs the agent, and configures it to point back to the DPM server.  No manual intervention is required.

On some systems, though, it’s simply easier to install and configure the agent directly from the to-be-protected system itself.  A locked-down server (like a TMG box) falls into this category, so I manually put the agent installation package on the TMG server, ran it, and carried out the follow-up PowerShell Attach-ProductionServer (from the DPM Management Shell) on the DPM server.  The install proceeded without issue, and the associated attach (on the DPM server) went off without a hitch.  I thought I was good to go.

I fired up the management GUI on the DPM Server, went into the Agents tab under Management, and discovered that I couldn’t connect to the TMG server.

DPM Agent Issue

The Checklist

The fact that I couldn’t connect to the TMG server (SS-TMG1) from my DPM box was a bit of an eyebrow lifter, but it wasn’t entirely unexpected.  Communication between a DPM server and the DPM agent leverages DCOM, and I’d had to jump through a few hoops to ensure that the DPM server could communicate with the ISA boxes previously.

I suspected that an RPC/DCOM issue was in play, but I was having trouble seeing where the problem might be. So, I reviewed where I was at.

  • Without an exception, Windows Firewall will block communication between a DPM server and its agents.  I confirmed that Windows Firewall wasn’t in play and that TMG itself was handling all of the firewall action.
  • Examining TMG, I confirmed that I had a rule in place that permitted all traffic between my DPM server (SS-TOOLS1) and the TMG box itself.
    DPM-TMG Access Rule 
  • DPM-TMG Access Rule (with RPC)Strict RPC compliance is another potential problem with DPM on both ISA and TMG, as requiring strict compliance blocks any DCOM traffic.  DCOM (and any other traffic that doesn’t explicitly begin RPC exchanges by communicating with the RPC endpoint mapper on the target server) gets dropped by the RPC Filter unless the checkbox for Enforce strict RPC compliance is unchecked.  I confirmed that my rule wasn’t requiring strict compliance (as shown on the right).
  • I made sure that my DPM server wasn’t listed as a member of either the Enterprise Remote Management Computers or Remote Management Computers Computer Sets in TMG.  These two Computer Sets are specially impacted by a couple of TMG System Policy rules that can impact their ability to call into TMG via RPC and DCOM.
  • I reviewed all System Policy rules that might impact inbound RPC calls to the TMG server, and I couldn’t find any that would (or should) be influencing DPM’s ability to connect to its agent.

I also went out to the Forefront TMG Product Team’s Blog to see what advice they might have to offer, and I found this excellent article on RPC and TMG – well worth a read if you’re trying to troubleshoot RPC problems.  Unfortunately, it didn’t offer me any tips that would help in my situation.

Watching The Traffic

I may have simply had tunnel vision, but I was still obsessed with the notion that strict RPC checking was causing my problems.  To see if it was, I decided to fire-up TMG’s live logging and see what was happening when DPM tried to connect to its agent.  I set the logging to display only traffic that was originated from the DPM box, and this is what I saw.

Watching Traffic from DPM

There was nothing wrong that I could see.  My access rule was clearly being utilized (the one that doesn’t enforce strict RPC checking), and I wasn’t seeing any errors – just connection initiations and closes.  Traffic from DPM to TMG looked clean.

Taking A Step Back

I was frustrated, and I clearly needed to consider the possibility that I didn’t have a good read on the problem.  So, I went to the Windows Application event log to see if it might provide some insight.  I probably should have started with the event logs instead of TMG itself and firewall rules … but better late than never, I figured.

DPM Agent Can't CommunicatePopping open Event Viewer, I was greeted with the image you see on the left.  What I saw was enlightening, for I did have a problem with communication between the DPM agent and the DPM server.  The part that intrigued me, though, was the fact that the problem was with outbound communication (that is, from TMG server to DPM server) – not the other way around as I had originally suspected.  All of my focus had been on troubleshooting traffic coming into TMG because I’d been interpreting the errors I’d seen to mean that the DPM server couldn’t reach the agent – not that the agent couldn’t “phone home,” so to speak.

I knew for a fact that the DPM Server, SS-TOOLS1, didn’t have the Windows Firewall service running.  Since the service wasn’t running, there was no way that the agent’s attempts to communicate with DPM could (or rather, should) be getting blocked at the destination.  That left the finger of blame pointing at TMG.

On The Way Out

I decided to repeat the traffic watching exercise I’d conducted earlier, but instead of watching traffic coming into the TMG box from my DPM server, I elected to watch traffic going the other direction – from TMG to DPM.  Here’s what I saw:

Watching Traffic to DPM

The “a-ha” moment for me came when I saw the firewall rule that was actually governing RPC traffic to the DPM box from TMG.  It wasn’t the DPM All <=> SS-TMG1 rule I’d established — it was a system policy rule called [System] Allow RPC from Forefront TMG to trusted servers.

System policy rules are normally hidden in the firewall policy tab, so I had to explicitly show them to review them.  Once I did, there it was – rule 22.

System Policy Rule 22

Note that this rule applies to all traffic from the TMG server to the Internal network; I’ll be talking about that more in a bit.

System Policy EditorI couldn’t edit the rule in-place; I needed to use the System Policy editor.  So, I fired up the System Policy Editor and traced the rule back to its associated configuration group.  As it turned out, the rule was tied to the Active Directory configuration group under Authentication Services.

As the picture on the left clearly shows, the Enforce strict RPC compliance checkbox was checked.  Once I unchecked it and applied the configuration change, the DPM agent began communicating with the DPM server without issue.  Problem solved.

What Happened?

I was fairly sure that I hadn’t experienced this sort of trouble installing the DPM Protection Agent under ISA Server 2006, so I tried to figure out what might have happened.

I hadn’t recalled having to adjust the target system policy under ISA when installing the DPM agent originally, but a quick boot and check of my old ISA server revealed that the checkbox was indeed unchecked (meaning that strict RPC compliance wasn’t being enforced).  I’d apparently made the change at some point and forgotten about it.  I suspect I’d messed with it at some point in the distant past while working on passing AD information through ISA, getting VPN functionality up-and-running, or perhaps something else.

Implications

Bottom line: TMG enforces strict compliance for RPC traffic that originates on the TMG server (Local Host) and is destined for the Internal network.  Since System Policy Rules are applied before administrator-defined Firewall Policy Rules, RPC traffic from the TMG server to the Internal network will always be governed by the system policy unless that policy is disabled.

In this particular scenario, the DPM 2007 Protection Agent’s operation was impacted.  Even though I’d created a rule that I thought would govern interactions between DPM and TMG, the reality is that it only governed RPC traffic coming into TMG – not traffic going out.

In reality, any service or application that sends DCOM traffic originating on the TMG server to the Internal network is going to be affected by the Allow RPC from Forefront TMG to trusted servers rule unless the associated system policy is adjusted.

Conclusion

The core findings of this post have been documented by others (in a variety of forms/scenarios) for ISA, but this go-round with TMG and the DPM association caught me off-guard such that I thought it would be worth sharing my experience with other firewall administrators.  If anyone else moving to TMG takes the “build it from the ground up” approach that I did, then the system policy I’ve been discussing may get missed.  Hopefully this post will serve as a good lesson (or reminder for veteran firewall administrators).

UPDATE (10/26/2010)

Thomas K. H. Bittner (former MVP from Germany who runs the Windows Server System Reference Architecture blog) contacted me and shared a blog post he wrote on configuring DPM 2010 and TMG communication; the post can be found here: http://msmvps.com/blogs/wssra/archive/2010/10/20/configure-the-forefront-tmg-2010-to-allow-dpm-2010-communication.aspx.  Thomas’ post is fantastic in that it is extremely granular in its configuration of communication channels between TMG and DPM.  If you would prefer to lock things down more securely than I demonstrate, then I highly recommend checking out the post that Thomas wrote.

Additional Reading and References

  1. Recent Post: Portrait of a Basement Datacenter
  2. Microsoft: Forefront Threat Management Gateway 2010
  3. Microsoft: Internet Security and Acceleration Server 2006
  4. Microsoft: System Center Data Protection Manager
  5. Forefront TMG Product Team Blog: RPC Filter and “Enable strict RPC compliance”