Years ago, before I began working with SharePoint, I spent some time working as an application architect for a Fortune 500 financial services company based here in Cincinnati, Ohio. While at the company, I was given the opportunity to serve as a disaster recovery (DR) architect on the team that would build (from the ground up) the company’s first DR site implementation. It was a high-profile role with little definition – the kind that can either boost a career or burn it down. Luckily for me, the outcome leaned towards the former.
Admittedly, though, I knew very little about DR before starting in that position. I only knew what management had tasked me with doing: ensuring that mission-critical applications would be available and functional at the future DR site in the event of a disaster. If you aren’t overly familiar with DR, then that target probably sounds relatively straightforward. As I began working and researching avenues of attack for my problem set, though, I quickly realized how challenging and unusual disaster recovery planning was as a discipline – particularly for a “technically minded” person like myself.
Understanding the “Technical Tendency”
When it comes to DR, folks with whom I’ve worked have heard me say the following more than a few times:
It is the nature of technical people to find and implement technical solutions to technical problems. At its core, disaster recovery is not a technical problem; it is a business problem. Approaching disaster recovery with a purely technical mindset will result in a failure to deliver an appropriate solution more often than not.
What do I mean by that? Well, technical personnel tend to lump DR plans and activities into categories like “buying servers,” “taking backups,” and “acquiring off-site space.” These activities can certainly be (and generally are) part of a DR plan, but if they are the starting point for a DR strategy, then problems are likely to arise.
Let me explain by way of a simplistic and fictitious example.
Planning for SharePoint DR in a Vacuum
Consider the plight of Larry. Larry is an IT professional who possesses administrative responsibility for his company’s SharePoint-based intranet. One day, Larry is approached by his manager and instructed to come up with a DR strategy for the SharePoint farm that houses the intranet. Like most SharePoint administrators, Larry’s never really “done” DR before. He’s certain that he will need to review his backup strategy and make sure that he’s getting good backups. He’ll probably need to talk with the database administrators, too, because it’s generally a good idea to make sure that SQL backups are being taken in addition to SharePoint farm (catastrophic) backups.
Larry’s been told that off-site space is already being arranged by the facilities group, so that’s something he’ll be able to take off of his plate. He figures he’ll need to order new servers, though. Since the company’s intranet farm consists of four servers (including database servers), he plans to play it safe and order four new servers for the DR site. In his estimation, he’ll probably need to talk with the server team about the hardware they’ll be placing out at the DR site, he’ll need to speak with the networking team about DNS and switching capabilities they plan to include, etc.
Larry prepares his to-do list, dives in, and emerges three months later with an intranet farm DR approach possessing the following characteristics:
- The off-site DR location will include four servers that are set up and configured as a new, “warm standby” SharePoint farm.
- Every Sunday night, a full catastrophic backup of the SharePoint farm will be taken; every other night of the week, a differential backup will be taken. After each nightly backup is complete, it will be remotely copied to the off-site DR location.
- In the event of a disaster, Larry will restore the latest full backup and appropriate differential backups to the standby farm that is running at the DR site.
- Once the backups have been restored, all content will be available for users to access – hypothetically speaking, of course.
There are a multitude of technical questions that aren’t answered in the plan described above. For example, how is patching of the standby farm handled? Is the DR site network a clone of the existing network? Will server name and DNS hostname differences be an issue? What about custom solution packages (WSPs)? Ignoring all the technical questions for a moment, take a step back and ask yourself the question of greatest importance: will Larry’s overall strategy and plan meet his DR requirements?
If you’re new to DR, you might say “yes” or “no” based on how you view your own SharePoint farm and your experiences with it. If you’ve previously been involved in DR planning and are being honest, though, you know that you can’t actually answer the question. Neither can Larry or his manager. In fact, no one (on the technical side, anyway) has any idea if the DR strategy is a good one or not – and that’s exactly the point I’m trying to drive home.
The Cart Before the Horse
Assuming Larry’s company is like many others, the SharePoint intranet has a set of business owners and stakeholders (collectively referred to as “The Business” hereafter) who represent those who use the intranet for some or all of their business activities. Ultimately, The Business would issue one of three verdicts upon learning of Larry’s DR strategy:
Verdict 1: Exactly What’s Needed
Let’s be honest: Larry’s DR plan for intranet recovery could be right on the money. Given all of the variables in DR planning and the assumptions that Larry made, though, the chance of such an outcome is slim.
Verdict 2: DR Strategy Doesn’t Offer Sufficient Protection
There’s a solid chance that The Business could judge Larry’s DR plan as falling short. Perhaps the intranet houses areas that are highly volatile, with critical data that changes frequently throughout the day. If an outage were to occur at 4pm, roughly a full business day’s worth of data would be lost because the most recent backup would likely be 12 or so hours old (remember: the DR plan calls for nightly backups). Loss of that data could be exceptionally costly to the organization.
At the same time, Larry’s recovery strategy assumes that he has enough time to restore farm-level backups at the off-site location in the event of a disaster. Restoring a full SharePoint farm-level backup (with the potential addition of differential backups) could take hours. If having the intranet down costs the company $100,000 per hour in lost productivity or revenue, you can bet that The Business will not be happy with Larry’s DR plan in its current form.
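The arithmetic behind this verdict can be sketched in a few lines of Python. The $100,000 per hour figure and the 4pm outage come from the example above; the 4am backup completion time and the six-hour restore are assumptions made purely for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

def data_at_risk(last_backup: datetime, outage: datetime) -> timedelta:
    """Window of changes that would be lost: everything since the last good backup."""
    return outage - last_backup

def downtime_cost(hours_down: float, cost_per_hour: float) -> float:
    """Lost productivity or revenue for an outage of the given length."""
    return hours_down * cost_per_hour

# Assumption: the nightly backup finished at 4am; the outage hits at 4pm.
last_backup = datetime(2024, 1, 10, 4, 0)
outage = datetime(2024, 1, 10, 16, 0)
print(data_at_risk(last_backup, outage))   # 12:00:00

# Assumption: restoring the full and differential backups takes six hours.
print(downtime_cost(6, 100_000))           # 600000
```

Numbers like these mean little until The Business says what amount of loss and downtime it can live with, which is the subject of the rest of this post.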
Verdict 3: DR Strategy is Overkill
On the flip side, there’s always the chance that Larry’s plan is overkill. If the company’s intranet possesses primarily static content that changes very infrequently and is of relatively low importance, nightly backups and a warm off-site standby SharePoint farm may be more than is needed. Sure, it’ll certainly allow The Business to get their intranet back in a timely fashion … but at what cost?
If a monthly tape backup rotation and a plan to buy hardware in the event of a disaster is all that is required, then Larry’s plan is unnecessarily costly. Money is almost always constrained in DR planning and execution, and most organizations prioritize their DR target systems carefully. Extra money that is spent on server hardware, nightly backups, and maintenance for a warm off-site SharePoint farm could instead be allocated to the DR strategies of other, more important systems.
Taking Care of Business First
No one wants to be left guessing whether or not their SharePoint DR strategy will adequately address DR needs without going overboard. In approaching the challenge his manager handed him without obtaining any additional input, Larry fell into the same trap that many IT professionals do when confronted with DR: he failed to obtain the quantitative targets that would allow him to determine if his DR plan would meet the needs and expectations established by The Business. At their most basic, these requirements are expressed as recovery point objectives (RPOs) and recovery time objectives (RTOs).
The Disaster Recovery Timeline
I have found that the concepts of RPO and RTO are easiest to explain with the help of illustrations, so let’s begin with a picture of a disaster recovery timeline itself:
The diagram above simply shows an arbitrary timeline with an event (a “declared disaster”) occurring in the middle of the timeline. Any DR planning and preparation occurs to the left of the event on the timeline (in the past when SharePoint was still operational), and the actual recovery of SharePoint will happen following the event (that is, to the right of the event on the timeline in the “non-operational” period).
This DR timeline will become the canvas for further discussion of the first quantitative DR target you need to obtain before you can begin planning a SharePoint DR strategy: RPO.
RPO: Looking Back
As stated a little earlier, RPO is an acronym for Recovery Point Objective. Though some find the description distasteful, the easiest way to describe RPO is this: it’s the maximum amount of data loss that’s tolerated in the event of a disaster. RPO targets vary wildly depending on volatility and criticality of the data stored within the SharePoint farm. Let’s add a couple of RPO targets to the DR timeline and discuss them a bit further.
Two RPO targets have been added to the timeline: RPO1 and RPO2. As discussed, each of these targets marks a point in the past from which data must be recoverable in the event of a disaster. In the case of our first example, RPO1, the point in question is 48 hours before a declared disaster (that is, “we have a 48-hour RPO”). RPO2, on the other hand, is a point in time that is a mere 30 minutes prior to the disaster event (or a “30-minute target RPO”).
At a minimum, any DR plan that is implemented must ensure that all of the data prior to the point in time denoted by the selected RPO can be recovered in the event of a disaster. For RPO1, there may be some loss of data that was manipulated in the 48 hours prior to the disaster, but all data older than 48 hours will be recovered in a consistent state. RPO2 is more stringent and leaves less wiggle room; all data older than 30 minutes is guaranteed to be available and consistent following recovery.
If you think about it for a couple of minutes, you can easily begin to see how RPO targets will quickly validate or rule out varying backup and/or data protection strategies. In the case of RPO1, we’re “allowed” to lose up to two days (48 hours) of data. In this situation, a nightly backup strategy would be more than adequate to meet the RPO target, since a nightly backup rotation guarantees that available backup data is never more than 24 hours old. Whether disk or tape based, this type of backup approach is very common in the world of server management. It’s also relatively inexpensive.
The same nightly backup strategy would fail to meet the RPO requirement expressed by RPO2, though. RPO2 states that we cannot lose more than 30 minutes of data. With this type of RPO, most standard disk and tape-based backup strategies will fall short of meeting the target. To meet RPO2’s 30-minute target, we’d probably need to look at something like SQL Server log shipping or mirroring. Such a strategy will generally require a greater investment in database hardware, storage, and licensing. Technical complexity also goes up relative to the aforementioned nightly backup routine.
It’s not too hard to see that as the RPO window becomes increasingly narrow and approaches zero (that is, an RPO target of real-time failover with no data loss permitted), the cost and complexity of an acceptable DR data protection strategy climbs dramatically.
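For readers who like to see the comparison spelled out, here is a minimal Python sketch of the RPO check described above. It reduces each data protection strategy to one number: the oldest that the most recent recoverable copy of the data can ever be. The 24-hour figure for nightly backups comes from the discussion above; the 15-minute log shipping interval is an assumed value for illustration, not a recommendation, and the names are hypothetical.

```python
from datetime import timedelta

def meets_rpo(max_copy_age: timedelta, rpo: timedelta) -> bool:
    """A strategy meets an RPO if its newest recoverable copy is never older than the RPO."""
    return max_copy_age <= rpo

RPO1 = timedelta(hours=48)
RPO2 = timedelta(minutes=30)

# Worst-case age of the most recent recoverable copy, per strategy.
strategies = {
    "nightly backup": timedelta(hours=24),
    "log shipping (assumed 15-minute interval)": timedelta(minutes=15),
}

for name, max_age in strategies.items():
    print(f"{name}: RPO1 met={meets_rpo(max_age, RPO1)}, RPO2 met={meets_rpo(max_age, RPO2)}")
```

A real assessment would also account for how long it takes to copy each backup or log to the off-site location, since data that hasn’t left the primary site yet isn’t protected from a site-level disaster.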
RTO: Thinking Ahead
If RPO drives how SharePoint data protection should be approached prior to a disaster, RTO (or Recovery Time Objective) denotes the timeline within which post-disaster farm and data recovery must be completed. To illustrate, let’s turn once again to the DR timeline.
As with the previous RPO example, we now have two RTO targets on the timeline: RTO1 and RTO2. Analogous to the RPO targets, the RTO targets are given in units of time relative to the disaster event. In the case of RTO1, the point in time in question is two hours after a disaster has been declared. RTO2 is designated as t+36 hours, or a day and a half after the disaster has been declared.
In plain English, an RTO target is the maximum amount of time that the recovery of data and functionality can take following a disaster event. If the overall DR plan for your SharePoint farm were to have an RTO that matches RTO2, for instance, you would need to have functionality restored (at an agreed-upon level) within a day and a half. If you were operating with a target that matches RTO1, you would have significantly less time to get everything “up and running” – only two hours.
RTO targets vary for the same reasons that RPO targets vary. If the data that is stored within SharePoint is highly critical to business operations, then RTOs are generally going to trend towards hours, minutes, or maybe even real-time (that is, an RTO that mandates transferring to a hot standby farm or “mirrored” data center for zero recovery time and no interruption in service). For SharePoint data and farms that are less business critical (maybe a publishing site that contains “nice to have” information), RTOs could be days or even weeks.
Just like an aggressive RPO target, an aggressive RTO target is going to limit the number of viable recovery options that can possibly address it – and those options are generally going to lean towards being more expensive and technically more complex. For example, attempting to meet a two-hour RTO (RTO1) by restoring a farm from backup tapes is going to be a gamble. With very little data, it may be possible … but you wouldn’t know until you actually tried with a representative backup. At the other extreme, an RTO that is measured in weeks could actually make a ground-up farm rebuild (complete with new hardware acquisition following the disaster) a viable – and rather inexpensive (in up-front capital) – recovery strategy.
It is oftentimes difficult to determine in advance of a disaster whether a specific recovery strategy will meet RTO targets without actually testing it. That’s where the value of simulated disasters and recovery exercises comes into play – but that’s another topic for another time.
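A back-of-the-envelope estimate is no substitute for a recovery exercise, but it can rule out obviously unworkable strategies early. The Python sketch below compares a rough restore estimate against the two RTO targets from the timeline. The farm size, restore throughput, and fixed overhead are all assumed numbers chosen for illustration, and the function names are hypothetical; real values would have to come from a test with a representative backup.

```python
def estimated_restore_hours(data_gb: float, throughput_gb_per_hour: float,
                            fixed_overhead_hours: float) -> float:
    """Rough recovery time: data transfer time plus fixed setup and verification time."""
    return data_gb / throughput_gb_per_hour + fixed_overhead_hours

def meets_rto(recovery_hours: float, rto_hours: float) -> bool:
    """A strategy meets an RTO if recovery completes within the allowed window."""
    return recovery_hours <= rto_hours

RTO1_HOURS = 2    # two hours after the disaster is declared
RTO2_HOURS = 36   # a day and a half after the disaster is declared

# Assumptions: 500 GB of farm data, 100 GB/hour restore rate, 1 hour of overhead.
estimate = estimated_restore_hours(500, 100, 1)
print(estimate)                          # 6.0
print(meets_rto(estimate, RTO1_HOURS))   # False
print(meets_rto(estimate, RTO2_HOURS))   # True
```

Under these assumed numbers, a restore-from-backup approach comfortably fits RTO2 but has no hope of meeting RTO1, which is the same conclusion the prose reaches without the math.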
This post was intended to highlight a common pitfall affecting not only SharePoint DR planning, but DR planning in general. It should be clear by now that I deliberately avoided technical questions and issues to focus on making my point about planning. Don’t interpret my “non-discussion” of technical topics to mean that I think that their place with regard to SharePoint DR is secondary. That’s not the case at all; the fact that John Ferringer and I wrote a book on the topic (the “SharePoint 2007 Disaster Recovery Guide”) should be proof of this. It should probably come as no surprise that I recommend our book for a much more holistic treatment of SharePoint DR – complete with technical detail.
There are also a large number of technical resources for SharePoint disaster recovery online, and the bulk of them have their strong and weak points. My only criticism of them in general is that they equate “disaster recovery” with “backup/restore.” While the two are interrelated, the latter is but one aspect of the former. As I hope this post points out, true disaster recovery planning begins with dialog and objective targets – not server orders and backup schedules.
If you conclude your reading holding onto only one point from this post, let it be this: don’t attempt DR until you have RPOs and RTOs in hand!