Archive

Archive for the ‘SharePoint 2007’ Category

Finding a GUID in a SharePoint Haystack

August 1, 2012 8 comments

HaystackHere’s another blog post to file in the “I’ll probably never need it, but you never know” bucket of things you’ve seen from me.

Admittedly, I don’t get to spend as much time as I’d like playing with PowerShell and assembling scripts. So when the opportunity to whip-up a “quick hit” script comes along, I usually jump at it.

The Situation

I feel very fortunate to have made so many friends in the SharePoint community over the last several years. One friend who has been with me since the beginning (i.e., since my first presentation at the original SharePoint Saturday Ozarks) is Kirk Talbot. Kirk has become something of a “regular” on the SharePoint Saturday circuit, and many of you may have seen him at a SharePoint Saturday event someplace in the continental United States. To tell you the truth, I’ve seen Kirk as far north as Michigan and as far south as New Orleans. Yes, he really gets around.

Kirk and I keep up fairly regular correspondence, and he recently found himself in a situation where he needed to determine which objects (in a SharePoint site collection) were associated with a handful of GUIDs. Put a different way: Kirk had a GUID (for example, 89b66b71-afc8-463f-b5ed-9770168996a6) and wanted to know – was it a web? A list? A list item? And what was the identity of the item?

PowerShell to the Rescue

I pointed Kirk to a script I had previously written (in my Finding Duplicate GUIDs in Your SharePoint Site Collection post) and indicated that it could probably be adapted for his purpose. Kirk was up to the challenge, but like so many other SharePoint administrators was short on time.

I happened to find myself with a bit of free time in the last week and was due to run into Kirk at SharePoint Saturday Louisville last weekend, so I figured “what the heck?” I took a crack at modifying the script I had written earlier so that it might address Kirk’s need. By the time I was done, I had basically thrown out my original script and started over. So much for following my own advice.

The Script

The PowerShell script that follows is relatively straightforward in its operation. You supply it with a site collection URL and a target object GUID. The script then searches through the webs, lists/libraries, and list items of the site collection for an object with an ID that matches the GUID specified. If it finds a match, it reports some information about the matching object.

A sample run of the script appears below. In the case of this example, a list item match was found in the target site collection for the supplied GUID.

Sample Script Execution

 

This script leverages the SharePoint object model directly, so it can be used with either SharePoint 2007 or SharePoint 2010. Its search algorithm is relatively efficient, as well, so match results should be obtained in seconds to maybe minutes – not hours.

<#
.SYNOPSIS
   FindObjectByGuid.ps1
.DESCRIPTION
   This script attempts to locate a SharePoint object by its unique ID (GUID) within
   a site collection. The script first attempts to locate a match by examining webs;
   following webs, lists/libraries are examined. Finally, individual items within
   lists and libraries are examined. If an object with the ID is found, information 
   about the object is reported back.
.NOTES
   Author: Sean McDonough
   Last Revision: 27-July-2012
.PARAMETER SiteUrl
   The URL of the site collection that will be searched
.PARAMETER ObjectGuid
   The GUID that identifies the object to be located
.EXAMPLE
   FindObjectByGuid -SiteUrl http://mysitecollection.com -ObjectGuid 91ce5bbf-eebb-4988-9964-79905576969c
#>
param 
(
	[string]$SiteUrl = "$(Read-Host 'The URL of the site collection to search [e.g. http://mysitecollection.com]')",
	[Guid]$ObjectGuid = "$(Read-Host 'The GUID of the object you are trying to find [e.g. 91ce5bbf-eebb-4988-9964-79905576969c]')"
)

function FindObject($startingUrl, $targetGuid)
{
	# To work with SP2007, we need to go directly against the object model
	Add-Type -AssemblyName "Microsoft.SharePoint, Version=12.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c"

	# Grab the site collection and all webs associated with it to start
	$targetSite = New-Object Microsoft.SharePoint.SPSite($startingUrl)
	$matchObject = $false
	$itemsTotal = 0
	$listsTotal = 0
	$searchStart = Get-Date

	Clear-Host
	Write-Host ("INITIATING SEARCH FOR GUID: {0}" -f $targetGuid)

	# Step 1: see if we can find a matching web.
	$allWebs = $targetSite.AllWebs
	Write-Host ("`nPhase 1: Examining all webs ({0} total)" -f $allWebs.Count)
	foreach ($spWeb in $allWebs)
	{
		$listsTotal += $spWeb.Lists.Count
		if ($spWeb.ID -eq $targetGuid)
		{
			Write-Host "`nMATCH FOUND: Web"
			Write-Host ("- Web Title: {0}" -f $spWeb.Title)
			Write-Host ("-   Web URL: {0}" -f $spWeb.Url)
			$matchObject = $true
			break
		}
		$spWeb.Dispose()
	}
	
	# If we don't yet have match, we'll continue with list iteration
	if ($matchObject -eq $false)
	{
		Write-Host ("Phase 2: Examining all lists and libraries ({0} total)" -f $listsTotal)
		$allWebs = $targetSite.AllWebs
		foreach ($spWeb in $allWebs)
		{
			$allLists = $spWeb.Lists
			foreach ($spList in $allLists)
			{
				$itemsTotal += $spList.Items.Count
				if ($spList.ID -eq $targetGuid)
				{
					Write-Host "`nMATCH FOUND: List/Library"
					Write-Host ("-            List Title: {0}" -f $spList.Title)
					Write-Host ("- List Default View URL: {0}" -f $spList.DefaultViewUrl)
					Write-Host ("-      Parent Web Title: {0}" -f $spWeb.Title)
					Write-Host ("-        Parent Web URL: {0}" -f $spWeb.Url)
					$matchObject = $true
					break
				}
			}
			if ($matchObject -eq $true)
			{
				break
			}

		}
		$spWeb.Dispose()
	}
	
	# No match yet? Look at list items (which includes folders)
	if ($matchObject -eq $false)
	{
		Write-Host ("Phase 3: Examining all list and library items ({0} total)" -f $itemsTotal)
		$allWebs = $targetSite.AllWebs
		foreach ($spWeb in $allWebs)
		{
			$allLists = $spWeb.Lists
			foreach ($spList in $allLists)
			{
				try
				{ 
					$listItem = $spList.GetItemByUniqueId($targetGuid)
				}
				catch
				{
					$listItem = $null
				}
				if ($listItem -ne $null)
				{
					Write-Host "`nMATCH FOUND: List/Library Item"
					Write-Host ("-                    Item Name: {0}" -f $listItem.Name)
					Write-Host ("-                    Item Type: {0}" -f $listItem.FileSystemObjectType)
					Write-Host ("-       Site-Relative Item URL: {0}" -f $listItem.Url)
					Write-Host ("-            Parent List Title: {0}" -f $spList.Title)
					Write-Host ("- Parent List Default View URL: {0}" -f $spList.DefaultViewUrl)
					Write-Host ("-             Parent Web Title: {0}" -f $spWeb.Title)
					Write-Host ("-               Parent Web URL: {0}" -f $spWeb.Url)
					$matchObject = $true
					break
				}
			}
			if ($matchObject -eq $true)
			{
				break
			}

		}
		$spWeb.Dispose()
	}
	
	# No match yet? Too bad; we're done.
	if ($matchObject -eq $false)
	{
		Write-Host ("`nNO MATCH FOUND FOR GUID: {0}" -f $targetGuid)
	}
	
	# Dispose of the site collection
	$targetSite.Dispose()
	Write-Host ("`nTotal seconds to execute search: {0}`n" -f ((Get-Date) - $searchStart).TotalSeconds)
	
	# Abort script processing in the event an exception occurs.
	trap
	{
		Write-Warning "`n*** Script execution aborting. See below for problem encountered during execution. ***"
		$_.Message
		break
	}
}

# Launch script
FindObject $SiteUrl $ObjectGuid

Conclusion

Again, I don’t envision this being something that everyone needs. I want to share it anyway. One thing I learned with the “duplicate GUID” script I referenced earlier is that I generally underestimate the number of people who might find something like this useful.

Have fun with it, and please feel free to share your feedback!

Additional Reading and Resources

  1. Event: SharePoint Saturday Ozarks
  2. Twitter: Kirk Talbot (@kctElgin)
  3. Post: Finding Duplicate GUIDs in Your SharePoint Site Collection
  4. Event: SharePoint Saturday Louisville

Finding Duplicate GUIDs in Your SharePoint Site Collection

April 3, 2011 14 comments

This is a bit of an oldie, but I figured it might help one or two random readers.

Let me start by saying something right off the bat: you should never need what I’m about to share.  Of course, how many times have you heard “you shouldn’t ever really need this” when it comes to SharePoint?  I’ve been at it a while, and I can tell you that things that never should happen seem to find a way into reality – and into my crosshairs for troubleshooting.

Disclaimer

The story and situation I’m about to share is true.  I’m going to speak in generalities when it comes to the identities of the parties and software involved, though, to “protect the innocent” and avoid upsetting anyone.

The Predicament

I was part of a team that was working with a client to troubleshoot problems that the client was encountering when they attempted to run some software that targeted SharePoint site collections.  The errors that were returned by the software were somewhat cryptic, but they pointed to a problem handling certain objects in a SharePoint site collection.  The software ran fine when targeting all other site collections, so we naturally suspected that something was wrong with only one specific site collection.

After further examination of logs that were tied to the software, it became clear that we had a real predicament.  Apparently, the site collection in question contained two or more objects with the same identity; that is, the objects had ID properties possessing the same GUID.  This isn’t anything that should ever happen, but it had.  SharePoint continued to run without issue (interestingly enough), but the duplication of object GUIDs made it downright difficult for any software that depended on unique object identities being … well, unique.

Although the software logs told us which GUID was being duplicated, we didn’t know which SharePoint object or objects the GUID was tied to.  We needed a relatively quick and easy way to figure out the name(s) of the object or objects which were being impacted by the duplicate GUIDs.

Tackling the Problem

It is precisely in times like those described that PowerShell comes to mind.

My solution was to whip-up a PowerShell script (FindDuplicateGuids.ps1) that processed each of the lists (SPList) and webs (SPWeb) in a target site collection.  The script simply collected the identities of each list and web and reported back any GUIDs that appeared more than once.

The script created works with both SharePoint 2007 and SharePoint 2010, and it has no specific dependencies beyond SharePoint being installed and available on the server where the script is run.

########################
# FindDuplicateGuids.ps1
# Author: Sean P. McDonough (sean@sharepointinterface.com)
# Blog: http://SharePointInterface.com
# Last Update: August 29, 2013
#
# Usage from prompt: ".\FindDuplicateGuids.ps1 <siteUrl>"
#   where <siteUrl> is site collection root.
########################


#########
# IMPORTS
# Import/load common SharePoint assemblies that house the types we'll need for operations.
#########
Add-Type -AssemblyName "Microsoft.SharePoint, Version=12.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c"


###########
# FUNCTIONS
# Leveraged throughout the script for one or more calls.
###########
function SpmBuild-WebAndListIdMappings {param ($siteUrl)
	$targetSite = New-Object Microsoft.SharePoint.SPSite($siteUrl)
	$allWebs = $targetSite.AllWebs
	$mappings = New-Object System.Collections.Specialized.NameValueCollection
	foreach ($spWeb in $allWebs)
	{
		$webTitle = "WEB '{0}'" -f $spWeb.Title
		$mappings.Add($spWeb.ID, $webTitle)
		$allListsForWeb = $spWeb.Lists
		foreach ($currentList in $allListsForWeb)
		{
			$listEntry = "LIST '{0}' in Web '{1}'" -f $currentList.Title, $spWeb.Title
			$mappings.Add($currentList.ID, $listEntry)
		}
		$spWeb.Dispose()
	}
	$targetSite.Dispose()
	return ,$mappings
}

function SpmFind-DuplicateMembers {param ([System.Collections.Specialized.NameValueCollection]$nvMappings)
	$duplicateMembers = New-Object System.Collections.ArrayList
	$allkeys = $nvMappings.AllKeys
	foreach ($keyName in $allKeys)
	{
		$valuesForKey = $nvMappings.GetValues($keyName)
		if ($valuesForKey.Length -gt 1)
		{
			[void]$duplicateMembers.Add($keyName)
		}
	}
	return ,$duplicateMembers
}


########
# SCRIPT
# Execution of actual script logic begins here
########
$siteUrl = $Args[0]
if ($siteUrl -eq $null)
{
	$siteUrl = Read-Host "`nYou must supply a site collection URL to execute the script"
}
if ($siteUrl.EndsWith("/") -eq $false)
{
	$siteUrl += "/"
}
Clear-Host
Write-Output ("Examining " + $siteUrl + " ...`n")
$combinedMappings = SpmBuild-WebAndListIdMappings $siteUrl
Write-Output ($combinedMappings.Count.ToString() + " GUIDs processed.")
Write-Output ("Looking for duplicate GUIDs ...`n")
$duplicateGuids = SpmFind-DuplicateMembers $combinedMappings
if ($duplicateGuids.Count -eq 0)
{
	Write-Output ("No duplicate GUIDs found.")
}
else
{
	Write-Output ($duplicateGuids.Count.ToString() + " duplicate GUID(s) found.")
	Write-Output ("Non-unique GUIDs and associated objects appear below.`n")
	foreach ($keyName in $duplicateGuids)
	{
		$siteNames = $combinedMappings[$keyName]
		Write-Output($keyName + ": " + $siteNames)
	}
}
$dumpData = Read-Host "`nDo you want to send the collected data to a file? (Y/N)"
if ($dumpData -match "y")
{
	$fileName = Read-Host "  Output file path and name"
	Write-Output ("Results for " + $siteUrl) | Out-File -FilePath $fileName
	$allKeys = $combinedMappings.AllKeys
	foreach ($currentKey in $allKeys)
	{
		Write-Output ($currentKey + ": " + $combinedMappings[$currentKey]) | Out-File -FilePath $fileName -Append
	}
}
Write-Output ("`n")

Running this script in the client’s environment quickly identified the two lists that contained the same ID GUIDs.  How did they get that way?  I honestly don’t know, nor am I going to hazard a guess …

What Next?

If you’re in the unfortunate position of owning a site collection that contains objects possessing duplicate ID GUIDs, let me start by saying “I feel for you.”

Having said that: the quickest fix seemed to be deleting the objects that possessed the same GUIDs.  Those objects were then rebuilt.  I believe we handled the delete and rebuild manually, but there’s nothing to say that an export and subsequent import (via the Content Deployment API) couldn’t be used to get content out and then back in with new object IDs. 

A word of caution: if you do leverage the Content Deployment API and do so programmatically, simply make sure that object identities aren’t retained on import; that is, make sure that SPImportSettings.RetainObjectIdentity = false – not true.

Additional Reading and References

  1. TechNet: Import and export: STSADM operations
  2. MSDN: SPImportSettings.RetainObjectIdentity

Client-Server Interactions and the max-age Attribute with SharePoint BLOB Caching

February 21, 2011 23 comments

I first presented (in some organized capacity) on SharePoint’s platform caching capabilities at SharePoint Saturday Ozarks in June of 2010, and since that time I’ve consistently received a growing number of questions on the topic of SharePoint BLOB caching.  When I start talking about BLOB caching itself, the area that seems to draw the greatest number of questions and “really?!?!” responses is the use of the max-age attribute and how it can profoundly impact client-server interactions.

I’d been promising a number of people (including Todd Klindt and Becky Bertram) that I would write a post about the topic sometime soon, and recently decided that I had lollygagged around long enough.

Before I go too far, though, I should probably explain why the max-age attribute is so special … and even before I do that, we need to agree on what “caching” is and does.

Caching 101

Why does SharePoint supply caching mechanisms?  Better yet, why does any application or hardware device employ caching?  Generally speaking, caching is utilized to improve performance by taking frequently accessed data and placing it in a state or location that facilitates faster access.  Faster access is commonly achieved through one or both of the following mechanisms:

  • By placing the data that is to be accessed on a faster storage medium; for example, taking frequently accessed data from a hard drive and placing it into memory.
  • By placing the data that is to be accessed closer to the point of usage; for example, offloading files from a server that is halfway around the world to one that is local to the point of consumption to reduce round-trip latency and bandwidth concerns.  For Internet traffic, this scenario can be addressed with edge caching through a content delivery network such as that which is offered by Akamai’s EdgePlatform.

Oftentimes, data that is cached is expensive to fetch or computationally calculate.  Take the digits in pi (3.1415926535 …) for example.  Computing pi to 100 decimals requires a series of mathematical operations, and those operations take time.  If the digits of pi are regularly requested or used by an application, it is probably better to compute those digits once and cache the sequence in memory than to calculate it on-demand each time the value is needed.

Caching usually improves performance and scalability, and these ultimately tend to translate into a better user experience.

SharePoint and caching

Through its publishing infrastructure, SharePoint provides a number of different platform caching capabilities that can work wonders to improve performance and scalability.  Note that yes, I did say “publishing infrastructure” – sorry, I’m not talking about Windows SharePoint Services 3 or SharePoint Foundation 2010 here.

With any paid version of SharePoint, you get object caching, page output caching, and BLOB caching.  With SharePoint 2010 and the Office Web Applications, you also get the Office Web Applications Cache (for which I highly recommend this blog post written by Bill Baer).

Each of these caching mechanisms and options work to improve performance within a SharePoint farm by using a combination of the two mechanisms I described earlier.  Object caching stores frequently accessed property, query, and navigational data in memory on WFEs.  Basic BLOB caching copies images, CSS, and similar resource data from content databases to the file system of WFEs.  Page output caching piggybacks on ASP.NET page caching and holds SharePoint pages (which are expensive to render) in memory and serves them back to users.  The Office Web Applications Cache stores the output of Word documents and PowerPoint presentations (which is expensive to render in web-accessible form) in a special site collection for subsequent re-use.

Public-facing SharePoint

Each of the aforementioned caching mechanisms yields some form of performance improvement within the SharePoint farm by reducing load or processing burden, and that’s all well and good … but do any of them improve performance outside of the SharePoint farm?

What do I even mean by “outside of the SharePoint farm?”  Well, consider a SharePoint farm that serves up content to external consumers – a standard/typical Internet presence web site.  Most of us in the SharePoint universe have seen (or held up) the Hawaiian Airlines and Ferrari websites as examples of what SharePoint can do in a public-facing capacity.  These are exactly the type of sites I am focused on when I ask about what caching can do outside of the SharePoint farm.

For companies that host public-facing SharePoint sites, there is almost always a desire to reduce load and traffic into the web front-ends (WFEs) that serve up those sites.  These companies are concerned with many of the same performance issues that concern SharePoint intranet sites, but public-facing sites have one additional concern that intranet sites typically don’t: Internet bandwidth.

Even though Internet bandwidth is much easier to come by these days than it used to be, it’s not unlimited.  In the age of gigabit Ethernet to the desktop, most intranet users don’t think about (nor do they have to concern themselves with) the actual bandwidth to their SharePoint sites.  I can tell you from experience that such is not the case when serving up SharePoint sites to the general public

So … for all the platform caching options that SharePoint has, is there anything it can actually do to assist with the Internet bandwidth issue?

Enter BLOB caching and the max-age attribute

As it turns out, the answer to that question is “yes” … and of course, it centers around BLOB caching and the max-age attribute specifically.  Let’s start by looking at the <BlobCache /> element that is present in every SharePoint Server 2010 web.config file.

BLOB caching disabled

<BlobCache location="C:\BlobCache\14" path="\.(gif|jpg|jpeg|jpe|jfif|bmp|dib|tif|tiff|ico|png|wdp|hdp|css|js|asf|avi|flv|m4v|mov|mp3|mp4|mpeg|mpg|rm|rmvb|wma|wmv)$" maxSize="10" enabled="false" />

This is the default <BlobCache /> element that is present in all starting SharePoint Server 2010 web.config files, and astute readers will notice that the enabled attribute has a value of false.  In this configuration, BLOB caching is turned off and every request for BLOB resources follows a particular sequence of steps.  The first request in a browser session looks like this:

image

In this series of steps

  1. A request for a BLOB resource is made to a WFE
  2. The WFE fetches the BLOB resource from the appropriate content database
  3. The BLOB is returned to the WFE
  4. The WFE returns an HTTP 200 status code and the BLOB to the requester
    Here’s a section of the actual HTTP response from server (step #4 above):

HTTP/1.1 200 OK 
Cache-Control: private,max-age=0 
Content-Length: 1241304 
Content-Type: image/jpeg 
Expires: Tue, 09 Nov 2010 14:59:39 GMT 
Last-Modified: Wed, 24 Nov 2010 14:59:40 GMT 
ETag: "{9EE83B76-50AC-4280-9270-9FC7B540A2E3},7" 
Server: Microsoft-IIS/7.5 
SPRequestGuid: 45874590-475f-41fc-adf6-d67713cbdc85

You’ll notice that I highlighted the Cache-Control header line.  This line gives the requesting browser guidance on what it should and shouldn’t do with regard to caching the BLOB resource (typically an image, CSS file, etc.) it has requested.  This particular combination basically tells the browser that it’s okay to cache the resource for the current user, but the resource shouldn’t be shared with other users or outside the current session.

    Since the browser knows that it’s okay to privately cache the requested resource, subsequent requests for the resource by the same user (and within the same browser session) follow a different pattern:

image

When the browser makes subsequent requests like this for the resource, the HTTP response (in step #2) looks different than it did on the first request:

HTTP/1.1 304 NOT MODIFIED 
Cache-Control: private,max-age=0 
Content-Length: 0 
Expires: Tue, 09 Nov 2010 14:59:59 GMT

    A request is made and a response is returned, but the HTTP 304 status code indicates that the requested resource wasn’t updated on the server; as a result, the browser can re-use its cached copy.  Being able to re-use the cached copy is certainly an improvement over re-fetching it, but again: the cached copy is only used for the duration of the browser session – and only for the user who originally fetched it.  The requester also has to contact the WFE to determine that the cached copy is still valid, so there’s the overhead of an additional round-trip to the WFE for each requested resource anytime a page is refreshed or re-rendered.

BLOB caching enabled

Even if you’re not a SharePoint administrator and generally don’t poke around web.config files, you can probably guess at how BLOB caching is enabled after reading the previous section.  That’s right: it’s enabled by setting the enabled attribute to true as follows:

<BlobCache location="C:\BlobCache\14" path="\.(gif|jpg|jpeg|jpe|jfif|bmp|dib|tif|tiff|ico|png|wdp|hdp|css|js|asf|avi|flv|m4v|mov|mp3|mp4|mpeg|mpg|rm|rmvb|wma|wmv)$" maxSize="10" enabled="true" />

When BLOB caching is enabled in this fashion, the request pattern for BLOB resources changes quite a bit.  The first request during a browser session looks like this:

image

In this series of steps

  1. A request for a BLOB resource is made to a WFE
  2. The WFE returns the BLOB resource from a file system cache

The gray arrow that is shown indicates that at some point, an initial fetch of the BLOB resource is needed to populate the BLOB cache in the file system of the WFE.  After that point, the resource is served directly from the WFE so that subsequent requests are handled locally for the duration of the browser session.

As you might imagine based on the interaction patterns described thus far, simply enabling the BLOB cache can work wonders to reduce the load on your SQL Servers (where content databases are housed) and reduce back-end network traffic.  Where things get really interesting, though, is on the client side of the equation (that is, the Requester’s machine) once a resource has been fetched.

What about the max-age attribute?

You probably noticed that a max-age attribute wasn’t specified in the default (previous) <BlobCache /> element.  That’s because the max-age is actually an optional attribute.  It can be added to the <BlobCache /> element in the following fashion:

<BlobCache location="C:\BlobCache\14" path="\.(gif|jpg|jpeg|jpe|jfif|bmp|dib|tif|tiff|ico|png|wdp|hdp|css|js|asf|avi|flv|m4v|mov|mp3|mp4|mpeg|mpg|rm|rmvb|wma|wmv)$" maxSize="10" enabled="true" max-age=”43200” />

Before explaining exactly what the max-age attribute does, I think it’s important to first address what it doesn’t do and dispel a misconception that I’ve seen a number of times.  The max-age attribute has nothing to do with how long items stay within the BLOB cache on the WFE’s file system.  max-age is not an expiration period or window of viability for content on the WFE.  The server-side BLOB cache isn’t like other caches in that items expire out of it.  New assets will replace old ones via a maintenance thread that regularly checks associated site collections for changes, but there’s no regular removal of BLOB items from the WFE’s file system BLOB cache simply because of age.  max-age has nothing to do with server side operations.

So, what does the max-age attribute actually do then?  Answer: it controls information that is sent to requesters for purposes of specifying how BLOB items should be cached by the requester.  In short: max-age controls client-side cacheability.

The effect of the max-age attribute

max-age values are specified in seconds; in the case above, 43200 seconds translates into 12 hours.  When a max-age value is specified for BLOB caching, something magical happens with BLOB requests that are made from client browsers.  After a BLOB cache resource is initially fetched by a requester according to the previous “BLOB caching enabled” series of steps, subsequent requests for the fetched resource look like this for a period of time equal to the max-age:

image

You might be saying, “hey, wait a minute … there’s only one step there.  The request doesn’t even go to the WFE?”  That’s right: the request doesn’t go to the WFE.  It gets served directly from local browser cache – assuming such a cache is in use, of course, which it typically is.

Why does this happen?  Let’s take a look at the HTTP response that is sent back with the payload on the initial resource request when BLOB caching is enabled:

HTTP/1.1 200 OK 
Cache-Control: public, max-age=43200 
Content-Length: 1241304 
Content-Type: image/jpeg 
Last-Modified: Thu, 22 May 2008 21:26:03 GMT 
Accept-Ranges: bytes 
ETag: "{F60C28AA-1868-4FF5-A950-8AA2B4F3E161},8pub" 
Server: Microsoft-IIS/7.5 
SPRequestGuid: 45874590-475f-41fc-adf6-d67713cbdc85

The Cache-Control header line in this case differs quite a bit from the one that was specified when BLOB caching was disabled.  First, the use of public instead of private tells the receiving browser or application that the response payload can be cached and made available across users and sessions.  The response header max-age attribute maps directly to the value specified in the web.config, and in this case it basically indicates that the payload is valid for 12 hours (43,200 seconds) in the cache.  During that 12 hour window, any request for the payload/resource will be served directly from the cache without a trip to the SharePoint WFE.

Implications that come with max-age

On the plus side, serving resources directly out of the client-side cache for a period of time can dramatically reduce requests and overall traffic to WFEs.  This can be a tremendous bandwidth saver, especially when you consider that assets which are BLOB cached tend to be larger in nature – images, media files, etc.  At the same time, serving resources directly out of the cache is much quicker than round-tripping to a WFE – even if the round trip involves nothing more than an HTTP 304 response to say that a cached resource may be used instead of being retrieved.

While serving items directly out of the cache can yield significant benefits, I’ve seen a few organizations get bitten by BLOB caching and excessive max-age periods.  This is particularly true when BLOB caching and long max-age periods are employed in environments where images and other BLOB cached resources are regularly replaced and changed-out.  Let me illustrate with an example.

Suppose a site collection that hosts collaboration activities for a graphic design group is being served through a Web application zone where BLOB caching is enabled and a max-age period of 43,200 seconds (12 hours) is specified.  One of the designers who uses the site collection arrives in the morning, launches her browser, and starts doing some work in the site collection.  Most of the scripts, CSS, images, and other BLOB assets that are retrieved will be cached by the user’s browser for the rest of the work day.  No additional fetches for such assets will take place.

In this particular scenario, caching is probably a bad thing.  Users trying to collaborate on images and other similar (BLOB) content are probably going to be disrupted by the effect of BLOB caching.  The max-age value (duration) in-use would either need to be dialed-back significantly or BLOB caching would have to be turned-off entirely.

What you don’t see can hurt you

There’s one more very important point I want to make when it comes to BLOB caching and the use of the max-age attribute: the default <BlobCache /> element doesn’t come with a max-age attribute value, but that doesn’t mean that there isn’t one in-use.  If you fail to specify a max-age attribute value, you end up with the default of 86,400 seconds – 24 hours.

This wasn’t always the case!  In some recent exploratory work I was doing with Fiddler, I was quite surprised to discover client-side caching taking place where previously it hadn’t.  When I first started playing around with BLOB caching shortly after MOSS 2007 was released, omitting the max-age attribute in the <BlobCache /> element meant that a max-age value of zero (0) was used.  This had the effect of caching BLOB resources in the file system cache on WFEs without those resources getting cached in public, cross-session form on the client-side.  To achieve extended client-side caching, a max-age value had to be explicitly assigned.

Somewhere along the line, this behavior was changed.  I’m not sure where it happened, and attempts to dig back through older VM images (for HTTP response comparisons) didn’t give me a read on when Microsoft made the change.  If I had to guess, though, it probably happened somewhere around service pack 1 (SP1).  That’s strictly a guess, though.  I had always gotten into the habit of explicitly including a max-age value – even if it was zero – so it wasn’t until I was playing with the BLOB caching defaults in a SharePoint 2010 environment that I noticed the 24 hour client-side caching behavior by default.  I then backtracked to verify that the behavior was present in both SharePoint 2007 and SharePoint 2010, and it affected both authenticated and anonymous users.  It wasn’t a fluke.

So watch-out: if you don’t specify a max-age value, you’ll get 24 hour client-side caching by default!  If users complain of images that “won’t update” and stale BLOB-based content, look closely at max-age effects.

An alternate viewpoint on the topic

As I was finishing up this post, I decided that it would probably be a good idea to see if anyone else had written on this topic.  My search quickly turned up Chris O’Brien’s “Optimization, BLOB caching and HTTP 304s” post which was written last year.  It’s an excellent read, highly informative, and covers a number of items I didn’t go into.

Throughout this post, I took the viewpoint of a SharePoint administrator who is seeking to control WFE load and Internet bandwidth consumption.  Chris’ post, on the other hand, was written primarily with developer and end-user concerns in mind.  I wasn’t aware of some of the concerns that Chris points out, and I learned quite a few things while reading his write-up.  I highly recommend checking out his post if you have a moment.

Additional Reading and References

  1. Event: SharePoint Saturday Ozarks (June 2010)
  2. Blob Post: We Drift Deeper Into the Sound … as the (BLOB Cache) Flush Comes
  3. Blog: Todd Klindt’s SharePoint Admin Blog
  4. Blog: Becky Bertram’s Blog
  5. Definition: lollygag
  6. Technology: Akamai’s EdgePlatform
  7. Wikipedia: Pi
  8. TechNet: Cache settings operations (SharePoint Server 2010)
  9. Bill Baer: The Office Web Applications Cache
  10. SharePoint Site: Hawaiian Airlines
  11. SharePoint Site: Ferrari
  12. W3C Site: Cache-Control explanations
  13. Tool: Fiddler
  14. Blog Post: Chris O’Brien: Optimization, BLOB caching and HTTP 304s

SharePoint, WebDAV, and a Case of the 405 Status Codes

December 28, 2009 43 comments

Several months ago, I decided that a rebuild of my primary MOSS environment here at home was in order.  My farm consisted of a couple of Windows Server 2003 R2 VMs (one WFE, one app server) that were backed by a non-virtualized SQL Server.  I wanted to free up some cycles on my Hyper-V boxes, and I had an “open physical box” … so, I elected to rebuild my farm on a single, non-virtualized box running (the then newly released) Windows Server 2008 R2.

The rebuild went relatively smoothly, and bringing my content databases over from the old farm posed no particular problems.  Everything was good.

The Situation

Fast forward to just a few weeks ago.

One of the site collections in my farm is used to store and share pictures that we take of our kids.  The site collection is, in effect, a huge multimedia repository …

… and allow me a moment to address the concerns of the savvy architects and administrators out there.  I do understand SharePoint BLOB (binary large object) storage and the implications (and potential effects) that large multimedia libraries can have on scalability.  I wouldn’t recommend what I’m doing to most clients – at least not until remote BLOB storage (RBS) gets here with SharePoint 2010.  Remember, though, that my wife and I are just two people – not a company of hundreds or thousands.  The benefits of centralized, tagged, searchable, nicely presented content outweigh scalability and performance concerns for us.

"Upload Multiple= Back to the pictures site.  I was getting set to upload a batch of pictures, so I did what I always do: I went into the Upload menu of the target pictures library in the site collection and selected Upload Multiple Pictures as shown on the right.  For those who happen to have Microsoft Office 2007 installed (as I do), this action normally results in the Microsoft Office Picture Manager getting launched as shown below.

The Microsoft Office Picture Manager being launchedFrom within the Microsoft Office Picture Manager, uploading pictures is simply a matter of navigating to the folder containing the pictures, selecting the ones that are to be pushed into SharePoint, and pressing the Upload and Close button.  From there, the application itself takes care of rounding up the pictures that have been selected and getting them into the picture library within SharePoint.  SharePoint pops up a page that provides a handy “Go back to …” link that can then be used to navigate back to the library for viewing and working with the newly uploaded pictures.

Upon selecting the Upload Multiple Pictures menu item, SharePoint navigated to the infopage.aspx page shown above.  I waited, and waited … but the Microsoft Office Picture Manager never launched.  I hit my browser’s back button, and tried the operation again.  Same result: no Picture Manager.

Trouble In River City

Opening a Picture Library in Explorer ViewPicture Manager’s failure to launch was obviously a concern, and I wanted to know why I was encountering problems … but more than anything, I simply wanted to get my pictures uploaded and tagged.  My wife had been snapping pictures of our kids left and right, and I had 131 JPEG files waiting for me to do something.

I figured that there was more than one way to skin a cat, so I initiated my backup plan: Explorer View.  If you aren’t familiar with SharePoint’s Explorer View, then you need not look any further than the name to understand what it is and how it operates.  By opening the Actions menu of a library (such as a Document Library or Picture Library) and selecting the Open with Windows Explorer menu item as shown on the right, a Windows Explorer window is opened to the library.  The contents of the library can then be examined and manipulated using a file system paradigm – even though SharePoint libraries are not based in (or housed in) any physical file system.

A Picture Library Open in Explorer View

The mechanisms through which the Explorer View are prepared, delivered, and rendered are really quite impressive from a technical perspective.  I’m not going to go into the details, but if you want to learn more about them, I highly recommend a whitepaper that was authored by Steve SheppardSteve is an escalation engineer with Microsoft who I’ve worked with in the past, and his knowledge and attention to detail are second to none – and those qualities really come through in the whitepaper.

Unfortunately for me, though, my attempts to open the picture library in Explorer View also led nowhere.  Simply put, nothing happened.  I tried the Open with Windows Explorer option several times, and I was greeted with no action, error, or visible sign that anything was going on.

SharePoint and WebDAV

I was 0 for 2 on my attempts to get at the picture library for uploading.  I wasn’t sure what was going on, but I was pretty sure that WebDAV (Web Distributed Authoring and Versioning) was mixed-up in the behavior I was seeing.  WebDAV is implemented by SharePoint and typically employed to provide the Explorer View operations it supports.  I was under the impression that the Microsoft Office Picture Manager leveraged WebDAV to provide some or all of its upload capabilities, too.

After a few moments of consideration, the notion that WebDAV might be involved wasn’t a tremendous mental leap.  In rebuilding my farm on Windows Server 2008 R2, I had moved from Internet Information Services (IIS) version 6 (in Windows Server 2003 R2) to IIS7.  WebDAV is different in IIS7 versus previous versions … I just hadn’t heard about SharePoint WebDAV-based functions operating any differently.

Playing a Client-Side Tune

My gut instincts regarding WebDAV hardly qualified as “objective troubleshooting information,” so I fired-up Fiddler2 to get a look at what was happening between my web browser and the rebuilt SharePoint farm.  When I attempted to execute an Open with Windows Explorer against the picture library, I was greeted with a bunch of HTTP 405 errors.

405 ?!?!

To be completely honest, I’d never actually seen an HTTP 405 status code before.  It was obviously an error (since it was in the 400-series), but beyond that, I wasn’t sure.  A couple of minutes of digging through the W3C’s status code definitions, though, revealed that a 405 status code is returned whenever a requested method or verb isn’t supported.

I dug a little deeper and compared the request headers my browser had sent with the response headers I’d received from SharePoint.  Doing that spelled-out the problem pretty clearly.

Here’s an example of one of the HTTP headers that was sent:

PROPFIND /pictures/twins3/2009-12-07_no.3 HTTP/1.1

… and here’s the relevant portion of the response that the SharePoint server sent back:

HTTP/1.1 405 Method Not Allowed
Allow: GET, HEAD, OPTIONS, TRACE

PROPFIND was the method that my browser was passing to SharePoint, and the request was failing because the server didn’t include the PROPFIND verb in its list of supported methods as stated in the Allow: portion of the response.  PROPFIND was further evidence that WebDAV was in the mix, too, given its limited usage scenarios (and since the bulk of browser web requests employ either the GET or POST verb).

So what was going on?  The operations I was attempting had worked without issue under II6 and Windows Server 2003 R2, and I was pretty confident that I hadn’t seen any issues with other Windows Server 2008 (R2 and non-R2) farms running IIS7.  I’d either botched something on my farm rebuild or run into an esoteric problem of some sort; experience (and common sense) pointed to the former.

Doing Some Legwork

I turned to the Internet to see if anyone else had encountered HTTP 405 errors with SharePoint and WebDAV.  Though I quickly found a number of posts, questions, and other information that seemed related to my situation, none of it really described my particular scenario or what I was seeing.

After some additional searching, I eventually came across a discussion on the MSDN forums that revolved around whether or not WebDAV should be enabled within IIS for servers that serve-up SharePoint content.  The back and forth was a bit disjointed, but my relevant take-away was that enabling WebDAV within IIS7 seemed to cause problems for SharePoint.

WebDAV Enabled in IIS7 I decided to have a look at the server housing my rebuilt farm to see if I had enabled the WebDAV Publishing role service.  I didn’t think I had, but I needed to check.  I opened up the Server Manager applet and had a look at Role Services that were enabled for the Web Server (IIS).  The results are shown in the image on right; apparently, I had enabled WebDAV Publishing.  My guess is that I did it because I thought it would be a good idea, but it was starting to look like a pretty bad idea all around.

The Test

I was tempted to simply remove the WebDAV Publishing role service and cross my fingers, but instead of messing with my live “production” farm, I decided to play it safe and study the effects of enabling and disabling WebDAV Publishing in a controlled environment.  I fired up a VM that more-or-less matched my production box (Windows Server 2008 R2, 64-bit, same Windows and SharePoint patch levels) to play around.

When I fired-up the VM, a quick check of the enabled role services for IIS showed that WebDAV Publishing was not enabled – further proof that I got a bit overzealous in enabling role services on my rebuilt farm.  I quickly went into the VM’s SharePoint Central Administration site and created a new web application (http://spsdev:18480).  Within the web application, I created a team site called Sample Team Site.  Within that team site, I then created a picture library called Sample Picture Library for testing.

When It Works (without the WebDAV Publishing Role Service)

I fired up Fiddler2 in the VM, opened Internet Explorer 8, navigated to the Sample Picture Library, and attempted to execute an Open with Windows Explorer operation.  Windows Explorer opened right up, so I knew that things were working as they should within the VM.  The pertinent capture for the exchange between Internet Explorer and SharePoint (from Fiddler2) appears below.

Explorer View Exchange without IIS7 WebDAV

Reviewing the dialog between client and server, there appeared to be two distinct “stages” in this sequence.  The first stage was an HTTP request that was made to the root of the site collection using the OPTIONS method, and the entire HTTP request looked like this:

OPTIONS / HTTP/1.1
Cookie: MSOWebPartPage_AnonymousAccessCookie=18480
User-Agent: Microsoft-WebDAV-MiniRedir/6.1.7600
translate: f
Connection: Keep-Alive
Host: spdev:18480

In response to the request, the SharePoint server passed back an HTTP 200 status that looked similar to the block that appears below.  Note the permitted methods/verbs (as Allow:) that the server said it would accept, and that the PROPFIND verb appeared within the list:

HTTP/1.1 200 OK
Cache-Control: private,max-age=0
Allow: GET, POST, OPTIONS, HEAD, MKCOL, PUT, PROPFIND, PROPPATCH, DELETE, MOVE, COPY, GETLIB, LOCK, UNLOCK
Content-Length: 0
Accept-Ranges: none
Server: Microsoft-IIS/7.5
MS-Author-Via: MS-FP/4.0,DAV
X-MSDAVEXT: 1
DocumentManagementServer: Properties Schema;Source Control;Version History;
DAV: 1,2
Exires: Sun, 06 Dec 2009 21:13:27 GMT
Public-Extension: http://schemas.microsoft.com/repl-2
Set-Cookie: WSS_KeepSessionAuthenticated=18480; path=/
Persistent-Auth: true
X-Powered-By: ASP.NET
MicrosoftSharePointTeamServices: 12.0.0.6510
Date: Mon, 21 Dec 2009 21:13:27 GMT

After this initial request and associated response, all subsequent requests (“stage 2”) were made using the PROPFIND verb and have a structure that was similar to the following:

PROPFIND /Sample%20Picture%20Library HTTP/1.1
User-Agent: Microsoft-WebDAV-MiniRedir/6.1.7600
Depth: 0
translate: f
Connection: Keep-Alive
Content-Length: 0
Host: spdev:18480
Cookie: MSOWebPartPage_AnonymousAccessCookie=18480; WSS_KeepSessionAuthenticated=18480

Each of the requests returned a 207 HTTP status (WebDAV multi-status response) and some WebDAV data within an XML document (slightly modified for readability).

HTTP/1.1 207 MULTI-STATUS 
Cache-Control: no-cache 
Content-Length: 1132 
Content-Type: text/xml 
Server: Microsoft-IIS/7.5 
Public-Extension: http://schemas.microsoft.com/repl-2 
Set-Cookie: WSS_KeepSessionAuthenticated=18480; path=/ 
Persistent-Auth: true 
X-Powered-By: ASP.NET 
MicrosoftSharePointTeamServices: 12.0.0.6510 
Date: Mon, 21 Dec 2009 21:13:27 GMT 

<?xml version="1.0" encoding="utf-8" ?><D:multistatus xmlns:D="DAV:" xmlns:Office="urn:schemas-microsoft-com:office:office" xmlns:Repl="http://schemas.microsoft.com/repl/" xmlns:Z="urn:schemas-microsoft-com:"> 
<D:response><D:href>http://spdev:18480/Sample Picture Library</D:href><D:propstat>
…
</D:propstat> 
</D:response> 
</D:multistatus>

It was these PROPFIND requests (or rather, the 207 responses to the PROPFIND requests) that gave the client-side WebClient (directed by Internet Explorer) the information it needed to determine what was in the picture library, operations that were supported by the library, etc.

When It Doesn’t Work (i.e., WebDAV Publishing Enabled)

When the WebDAV Publishing role service was enabled within IIS7, the very same request (to open the picture library in Explorer View) yielded a very different series of exchanges (again, captured within Fiddler2):

Explorer View with IIS7 WebDAV Enabled

The initial OPTIONS request returned an HTTP 200 status that was identical to the one previously shown, and it even included the PROPFIND verb amongst its list of accepted methods:

HTTP/1.1 200 OK
Cache-Control: private,max-age=0
Allow: GET, POST, OPTIONS, HEAD, MKCOL, PUT, PROPFIND, PROPPATCH, DELETE, MOVE, COPY, GETLIB, LOCK, UNLOCK
Content-Length: 0
Accept-Ranges: none
Server: Microsoft-IIS/7.5
MS-Author-Via: MS-FP/4.0,DAV
X-MSDAVEXT: 1
DocumentManagementServer: Properties Schema;Source Control;Version History;
DAV: 1,2
Exires: Sun, 06 Dec 2009 22:04:31 GMT
Public-Extension: http://schemas.microsoft.com/repl-2
Set-Cookie: WSS_KeepSessionAuthenticated=18480; path=/
Persistent-Auth: true
X-Powered-By: ASP.NET
MicrosoftSharePointTeamServices: 12.0.0.6510
Date: Mon, 21 Dec 2009 22:04:31 GMT

Even though the PROPFIND verb was supposedly permitted, though, subsequent requests resulted in an HTTP 405 status and failure:

HTTP/1.1 405 Method Not Allowed
Allow: GET, HEAD, OPTIONS, TRACE
Server: Microsoft-IIS/7.5
Persistent-Auth: true
X-Powered-By: ASP.NET
MicrosoftSharePointTeamServices: 12.0.0.6510
Date: Mon, 21 Dec 2009 22:04:31 GMT
Content-Length: 0

Unfortunately, these behind-the-scenes failures didn’t seem to generate any noticeable error or message in client browsers.  While testing (locally) in the VM environment, I was at least prompted to authenticate and eventually shown a form of “unsupported” error message.  While connecting (remotely) to my production environment, though, the failure was silent.  Only Fiddler2 told me what was really occurring.

The Solution

IIS7 WebDAV Publishing Role Not Installed The solution to this issue, it seems, is to ensure that the WebDAV Publishing role service is not installed on WFEs serving up SharePoint content in Windows Server 2008 / IIS7 environments.  The mechanism by which SharePoint 2007 handles WebDAV requests is still something of a mystery to me, but it doesn’t appear to involve the IIS7-based WebDAV Publishing role service at all.

Steve Sheppard’s troubleshooting whitepaper (introduced earlier) mentions that enabling or disabling the WebDAV functionality supplied by IIS6 (under Windows Server 2003) has no appreciable effect on SharePoint operation.  Steve even mentions that SharePoint’s internal WebDAV implementation is provided by an ISAPI filter that is housed in Stsfilt.dll.  Though this was true in WSSv2 and SharePoint Portal Server 2003 (the platforms addressed by Steve’s whitepaper), it’s no longer the case with SharePoint 2007 (WSSv3 and MOSS 2007).  The OPTIONS and PROPFIND verbs are mapped to the Microsoft.SharePoint.ApplicationRuntime.SPHttpHandler type in SharePoint web.config files (see below) – Stsfilt.dll library doesn’t even appear anywhere within the file system of MOSS servers (or at least in mine).

Web.config HttpHandler Section

Regardless of how it is implemented, the fact that the two verbs of interest (OPTIONS and PROPFIND) are mapped to a SharePoint type indicates that WebDAV functionality is still handled privately within SharePoint for its own purposes.  When the WebDAV Publishing role is enabled in IIS7, IIS7 takes over (or at least takes precedence for) PROPFIND requests … and that’s where things appear to break.

To Sum Up

After toggling the WebDAV Publishing role service on and off a couple of times in my VM, I became convinced that my production environment would start behaving the way I wanted it to if I simply disabled IIS7’s WebDAV Publishing functionality.  I uninstalled the WebDAV Publishing role service, and both Microsoft Office Picture Manager and Explorer View started behaving again.

I also made a note to myself to avoid installing role services I thought I might need before I actually needed them  :-)

Additional Reading and References

  1. Blog Post: Todd Klindt, “Installing Remote Blob Store (RBS) on SharePoint 2010
  2. Whitepaper: Understanding and Troubleshooting the SharePoint Explorer View
  3. Blog: Steve Sheppard
  4. Microsoft TechNet: About WebDAV
  5. IIS.NET: WebDAV Extension
  6. Tool: Fiddler2
  7. W3C: HTTP/1.1 Status Code Definitions
  8. MSDN: Client Request Using PROPFIND
  9. TechNet Forums: IIS WebDAV service required for SharePoint webdav?
  10.   Site: HTTP Extensions for Distributed Authoring

Manually Clearing the MOSS 2007 BLOB Cache

October 30, 2009 13 comments

It’s a fact of life when dealing with many caching systems: for all the benefits they provide, they occasionally become corrupt or require some form of intervention to ensure healthy ongoing operation.  The MOSS Binary Large Object (BLOB) cache, or disk-based cache, is no different.

Is BLOB Cache Corruption a Common Problem?

In my experience, the answer is “no.”  The MOSS BLOB cache generally requires little maintenance and attention beyond ensuring that it has enough disk space to properly store the objects it fetches from the lists within the content databases housing your publishing site collections.

How Should a Flush Be Carried Out?

ObjectCacheSettings.aspx Application PageWhen corruption does occur or a cache flush is desired for any reason, the built-in “Disk Based Cache Reset” option is typically adequate for flushing the BLOB cache on a single server and single web application zone.  This option (circled in red on the page shown to the right) is exposed through the Site collection object cache menu item on a publishing site’s Site Collection Administration menu.  Executing a flush is as simple as checking the supplied checkbox and clicking the OK button at the bottom of the page.  When a flush is executed in this fashion, it affects only the server to which the postback occurs and only the web application through which the request is directed.  If a site collection is extended to multiple web applications, only one web application’s BLOB cache is affected by this operation.

BLOB Cache Farm Flush SolutionAlternatively, my MOSS 2007 Farm-Wide BLOB Cache Flushing Solution (screenshot shown on the right) can be used to clear the BLOB cache folders associated with a target site collection across all servers in a farm and across all web applications (zones) serving up the site collection.  This solution utilizes a different mechanism for flushing, but the net effect produced is the same as for the out-of-the-box (OOTB) mechanism: all BLOB-cached files for the associated site collection are deleted from the file system, and the three BLOB cache tracking files for each affected web application (IIS site) are reset.

For more information on the internals of the BLOB Cache, the flush process, and the files I just mentioned, see my previous post entitled We Drift Deeper Into the Sound … as the (BLOB Cache) Flush Comes.

Okay, I Tried a Flush and it Failed.  Now What?

If the aforementioned flush mechanisms simply aren’t working for you, you’re probably staring down the barrel of a manual BLOB cache flush.  Just delete all of the files in the target BLOB cache folder (as specified in the web.config) and you should be good to go, right?

Wrong.

Jumping in and simply deleting files without stopping requests to the affected site collection (or rather, the web application/applications servicing the site collection) risks sending you down the road to (further) cache corruption.  This risk may be small for sites that see little traffic or are relatively small, but the risk grows with increasing request volume and site collection size.  Allow me to illustrate with an example.

The Context

Let’s say that you decided to manually clear the BLOB cache for a sizable publishing site collection that is heavily trafficked.  You go into the file system, find your BLOB cache folder (by default, C:\blobCache), open it up, select all files and sub-folders contained within, and press the <Delete> key on your keyboard.  Deletion of the BLOB cache files and sub-folders commences.

Deleting the sub-folders and files isn’t an instantaneous operation, though.  It takes some time.  While the deletion is taking place, let’s say that your MOSS publishing site collections are still up and servicing requests.  The web applications for which BLOB caching is enabled are still attempting to use the very folders and files currently being deleted.

The Race Condition

For the duration of the deletion, a race condition is in effect that can yield some fairly unpredictable results.  Consider the following possible execution sequence.  Note: this example is hypothetical, but I’ve seen results on multiple occasions that infer this execution sequence (or something similar to it).

  1. The deletion operation deletes one or more of the .bin files at the root of a web application’s BLOB cache folder.  These files are used by MOSS to track the contents of the BLOB cache, the number of times it was flushed, etc.
  2. A request for a resource that would normally be present in the BLOB cache arrives at the web server.  An attempted lookup for the resource in the BLOB cache folder fails because the .bin files are gone as a result of the actions taken in the last step.
  3. The absence of the .bin files kicks off some housekeeping.  Ultimately, a “fresh” set of .bin files written out.
  4. The requested resource is fetched into the BLOB cache (sub-)folder structure and the .bin files are updated so that subsequent requests for the resource are served from the file system instead of the content database.
  5. The deletion operation, which has been running the whole time, deletes the file and/or folder containing the resource that was just fetched.

Once the deletion operation has concluded, a resource that was fetched in step #4 is tracked in the BLOB cache’s dump.bin file, but as a result of step #5, the resource no longer actually exist in the BLOB cache file system.  Net effect: requests for these resources return HTTP 404 errors.

Since image files are the most common BLOB-cached resources, broken link images (for example, that nasty red “X” in place of an image in Internet Explorer) are shown for these tracked-but-missing resources.  No amount of browser refreshing brings the image back from the server; only an update to the image in the content database (which triggers a re-fetch of the affected resource into the BLOB cache) or another flush operation fixes the issue as long as BLOB caching remains enabled.

Proper Manual Clearing

The key to avoiding the type of corruption scenario I just described is to ensure that requests aren’t serviced by the web application or applications that are tied to the BLOB cache.  Luckily, this is accomplished in a relatively straightforward fashion.

Before attempting either of the approaches I’m about to share, though, you need to know where (in the server file system) your BLOB cache root folder is located.  By default, the BLOB cache root folder is located at C:\blobCache; however, most conscientious administrators change this path to point to a data drive or non-system partition.

Location of BLOB cache root folder in web.config If you are unsure of the location of the BLOB cache root folder containing resources for your site collection, it’s easy enough to determine it by inspecting the web.config file for the web application housing the site collection.  As shown in the sample web.config file on the right, the location attribute of the <BlobCache> element identifies the BLOB cache root folder in which each web application’s specific subfolder will be created.

Be aware that saving any changes to the web.config file will result in an application pool recycle, so it’s generally a good idea to review a copy of the web.config file when inspecting it rather than directly opening the web.config file itself.

The Quick and Dirty Approach

When you just want to “get it done” as quickly as possible using the least number of steps, this is the process:

  1. World Wide Web Publishing Service Stop the World Wide Web Publishing Service on the target server.  This can be accomplished from the command line (via net stop w3svc) or the Services MMC snap-in (via Start –> Administrative Tools –> Services) as shown on the right.
  2. Once the World Wide Web Publishing Service stops, simply delete the BLOB cache root folder.  Ensure that the deletion operation completes before moving on to the next step.
  3. Restart the World Wide Web Publishing service (via Services or net start w3svc).

Though this approach is quick with regard to time and effort invested, it’s certainly “dirty,” coarse, and not without disadvantages.  Using this approach prevents the web server from servicing *any* web requests for the duration of the operation.  This includes not only SharePoint requests, but requests for any other web site that may be served from the server.

Second, the “quick and dirty” approach wipes out the entire BLOB cache – not just the cached content associated with the web application housing your site collection (unless, of course, you have a single web application that hasn’t been extended).  This is the functional equivalent of trying to drive a nail with a sledgehammer, and it’s typically overkill in most production scenarios.

The Controlled (Granular) Approach

There is a less invasive alternative to the “Quick and Dirty” technique I just described, and it is the procedure I recommend for production environments and other scenarios where actions must be targeted and impact minimized.  The screenshots that follow are specific to IIS7 (Windows Server 2008), but the fundamental activities covered in each step are the same for IIS6 even if execution is somewhat different.

  1. Locating the IIS site ID of the target web application Determine the IIS ID of the web application servicing the site collection for which the flush is being performed.  This is easily accomplished using the Internet Information Services (IIS) Manager (accessible through the Administrative Tools menu) as shown to the right.  If I’m interested in clearing the BLOB cache of a site collection that is hosted within the InternalHomeWeb (Default) web application, for example, the IIS site ID of interest is 1043653284.
  2. Locating the name of the application pool associated with the web application Determine the name of application pool that is servicing the web application.  In IIS7, this is accomplished by selecting the web application (InternalHomeWeb (Default)) in the list of sites and clicking the Basic Settings… link under Edit Site in the Site Actions menu on the right-hand side of the window.  The dialog box that pops up clearly indicates the name of the associated application pool (as shown on the right, circled in red).  Note the name of the application pool for the next step.
  3. Stopping the target application pool Stop the application pool that was located in the previous step.  This will shutdown the web application and prevent MOSS from serving up requests for the site collections housed within the web application, thus avoiding the sort of race condition described earlier.  If multiple application pools are used to partition web applications within different worker processes, then shutting down the application pool is “less invasive” than stopping the entire World Wide Web Publishing Service as described in “The Quick and Dirty Approach.”  If all (or most) web applications are serviced by a single application pool, though, then there may be little functional benefit to stopping the application pool.  In such a case, it may simply be easier to stop the World Wide Web Publishing Service as described in “The Quick and Dirty Approach.”
  4. BLOB cache folder for selected web applicationOpen Windows Explorer and navigate to the BLOB cache root folder.  For the purposes of this example, we’ll assume that the BLOB cache root folder is located at E:\MOSS\BLOB Cache.  Within the root folder should be a sub-folder with a name that matches the IIS site ID determined in step #1 (1043653284).  Either delete the entire sub-folder (E:\MOSS\BLOB Cache\1043653284), or select the files within the sub-folder and delete them (as shown above).
  5. Once the deletion has completed, restart the application pool that was shutdown in step #3.  If the World Wide Web Publishing Service was shutdown instead, restart it.

Taking the approach just described affects the fewest number of cached resources necessary to ensure that the site collection in question (or rather, its associated web application/applications) starts with a “clean slate.”  If web applications are partitioned across multiple application pools, then this approach also restricts the resultant service outage to only those site collections ultimately being served by the application being shutdown and restarted.

Some Common Questions and Concerns

Q: I have multiple servers or web front-ends.  Do I need to take them all down and manually flush them as a group?

The BLOB cache on each MOSS server operates independently of other servers in the farm, so the answer is “no.”  Servers can be addressed one at a time and in any order desired.

Q: I’ve successfully performed a manual flush and brought everything back up, but I’m *still* seeing an old image/script/etc.  What am I doing wrong?

Interestingly enough, this type of scenario oftentimes has little to do with the actual server-side BLOB cache itself.

One of the attributes that can (and should) be configured when enabling the BLOB cache is the max-age attribute.  The max-age attribute specifies the duration of time, in seconds, that client-side browsers should cache resources that are retrieved from the MOSS BLOB cache.  Subsequent requests for these resources are then served directly out of the client-side cache and not made to the MOSS server until a duration of time (specified by the max-age attribute) is exceeded.

If a BLOB cache is flushed and it appears that old or incorrect resources (commonly images) are being returned when requested, it might be that the resources are simply cached on the local system and being returned from the cache instead of being fetched from the server.  Flushing locally-cached items (or deleting “Temporary Internet files” in Internet Explorer’s terminology) is a quick way to ensure that requests are being passed to the SharePoint server.

Q: I’m running into problems with a manual deletion.  Sometimes all files within the cache folder can’t be deleted, or sometimes I run into strange files that have a size of zero bytes.  What’s going on?

I haven’t seen this happen too often, but when I have seen it, it’s been due to problems with (or corruption in) the underlying file system.  If regular CHKDSK operations aren’t scheduled for the drive housing the BLOB cache, it’s probably time to set them up.

Additional Reading and References

  1. MSDN: Caching In Office SharePoint 2007
  2. CodePlex: MOSS 2007 Farm-Wide BLOB Cache Flushing Solution
  3. Blog Post: We Drift Deeper Into the Sound … as the (BLOB Cache) Flush Comes

MOSS Object Cache Memory Tuning is not an Intuitive Process

August 30, 2009 11 comments

I’ve been meaning to do a small write-up on a couple of key Object Cache points, but other things kept trumping my desire to put this post together.  I finally found the nudge I needed (or rather, gave myself a kick in the butt) after discussing the topic a bit with Andrew Connell following a presentation he gave at a SharePoint Users of Indiana user group meeting.  Thanks, Andrew!

A Brief Bit of Background

As I may have mentioned in a previous post, I’ve spent the bulk of the last two years buried in a set of Internet-facing MOSS publishing sites that are the public presence for my current client.  Given that my current client is a Fortune 50 company, it probably comes as no surprise when I say that the sites see quite a bit of daily traffic.  Issues due to poor performance tuning and inefficient code have a way of making themselves known in dramatic fashion.

Some time ago, we were experiencing a whole host of critical performance issues that ultimately stemmed from a variety of sources: custom code, infrastructure configuration, cache tuning parameters, and more.  It took a team of Microsoft experts, along with professionals working for the client, to systematically address each item and bring operations back to a “normal” state.  Though we ultimately worked through a number of different problem areas, one area in particular stood out: the MOSS Object Cache and how it was “tuned.”

What is the MOSS Object Cache?

The MOSS Object Cache is memory that’s allocated on a per-site collection basis to store commonly-accessed objects, such as navigational data, query results (cross-list and cross-site), site properties, page layouts, and more.  Object Caching should not be confused with Page Output Caching (which is an extension of ASP.NET’s built-in page caching capability) or BLOB Caching/Disk-Based Caching (which uses the local server’s file system to store images, CSS, JavaScript, and other resource-type objects).

Publishing sites make use of the Object Cache without any intervention on the part of administrators.  By default, a publishing site’s Object Cache receives up to 100MB of memory for use when the site collection is created.  This allocation can be seen on the Object Cache Settings site collection administration page within a publishing site:

Object Cache Settings Page

Note that I said that up to 100MB can be used by the Object Cache by default.  The size of the allocation simply determines how large the cache can grow in memory before item ejection, flushing, and possible compactions result.  The maximum cache size isn’t a static allocation, so allocating 500MB of memory, for example, won’t deprive the server of 500MB of memory unless the amount of data going into the cache grows to that level.  I’m taking a moment to point this out because I wasn’t (personally) aware of this when I first started working with the Object Cache.  This point also becomes a relevant point in a story I’ll be telling in a bit.

Microsoft’s TechNet site has an article that provides pretty good coverage of caching within MOSS (including the Object Cache), so I’m not going to go into all of the details it covers in this post.  I will make the assumption that the information presented in the TechNet article has been read and understood, though, because it serves as the starting point for my discussion.

Object Cache Memory Tuning Basics

The TechNet article indicates that two specific indicators should be watched for tuning purposes.  Those two indicators, along with their associated performance counters, are

  • Cache hit ratio (SharePoint Publishing Cache/Publishing cache hit ratio)
  • Object discard rate (SharePoint Publishing Cache/Total object discards)

The image below shows these counters highlighted on a MOSS WFE where all SharePoint Publishing Cache counters have been added to a Performance Monitor session:

Basic Publishing Tuning Performance Counters

According to the article, the Publishing cache hit ratio should remain above 90% and a low object discard rate should be observed.  This is good advice, and I’m not saying that it shouldn’t be followed.  In fact, my experience has shown Publishing cache hit ratio values of 98%+ are relatively common for well-tuned publishing sites possessing largely static content.

The “Dirty Little Secret” about the Publishing Cache Hit Ratio Counter

As it turns out, though, the Publishing cache hit ratio counter should come with a very large warning that reads as follows:

WARNING: This counter only resets with a server reboot. Data it displays has been aggregating for as long as the server has been up.

This may not seem like such a big deal, particularly if you’re looking at a new site collection.  Let me share a painful personal experience, though, that should drive home how important a point this really is.

I was attempting to do a little Object Cache tuning for a client to help free up some memory to make application pool recycles cleaner, and I was attempting to see if I could adjust the Object Cache allocations for multiple (about 18) site collections downward.  We were getting into a memory-constrained position, and a review of the Publishing cache hit ratio values for the existing site collections showed that all sites were turning in 99%+ cache hit ratios.  Operating under the (previously described) mistaken assumption that Object Cache memory was statically allocated, I figured that I might be able to save a lot of memory simply by adjusting the memory allocations downward.

Mistaken understanding in mind, I went about modifying the Object Cache allocation for one of the site collections.  I knew that we had some data going into the cache (navigational data and a few cross-list query result sets), so I figured that we couldn’t have been using a whole lot of memory.  I adjusted the allocation down dramatically (to 10MB) on the site collection and I periodically checked back over the course of several hours to see how the Publishing cache hit ratio fared.

After a chunk of the day had passed, I saw that the Publishing cache hit ratio remained at 99%+.  I considered my assumption and understanding about data going into the Object Cache to be validated, and I went on my way.  What I didn’t realize at the time was that the actual Publishing cache hit ratio counter value was driven by the following formula:

Publishing cache hit ratio = total cache hits / (total cache hits + total cache misses) * 100%

Note the pervasive use of the word “total” in the formula.  In my defense, it wasn’t until we engaged Microsoft and made requests (which resulted in many more internal requests) that we learned the formulas that generate the numbers seen in many of the performance counters.  To put it mildly, the experience was “eye opening.”

In reality, the site collection was far from okay following the tuning I performed.  It truly needed significantly more than the 10MB allocation I had given it.  If it were possible to reset the Publishing cache hit ratio counter or at least provide a short-term snapshot/view of what was going on, I would have observed a significant drop following the change I made.  Since our server had been up for a month or more, and had been doing a good job of servicing requests from the cache during that time, the sudden drop in objects being served out of the Object Cache was all but undetectable in the short-term using the Publishing cache hit ratio.

To spell this out even further for those who don’t want to do the math: a highly-trafficked publishing site like one of my client’s sites may service 50 million requests from the Object Cache over the course of a month.  Assuming that the site collection had been up for a month with a 99% Object Cache hit ratio, plugging the numbers into the aforementioned formula might look something like this:

Publishing cache hit ratio = 49500000 / (49500000 + 500000) * 100% = 99.0%

50 million Object Cache requests per month breaks down to about 1.7 million requests per day.  Let’s say that my Object Cache adjustment resulted in an extremely pathetic 10% cache hit ratio.  That means that of 1.7 million object requests, only 170000 of them would have been served from the Object Cache itself.  Even if I had watched the Publishing cache hit ratio counter for the entire day and seen the results of all 1.7 million requests, here’s what the ratio would have looked like at the end of the day (assuming one month of uptime):

Publishing cache hit ratio = 51200000 / (51200000 + 2030000) * 100% = 96.2%

Net drop: only about 2.8% over the course of the entire day!

Seeing this should serve as a healthy warning for anyone considering the use the Publishing cache hit ratio counter alone for tuning purposes.  In publishing environments where server uptime is maximized, the Publishing cache hit ratio may not provide any meaningful feedback unless the sampling time for changes is extended to days or even weeks.  Such long tuning timelines aren’t overly practical in many heavily-trafficked sites.

So, What Happens When the Memory Allocation isn’t Enough?

In plainly non-technical terms: it gets ugly.  Actual results will vary based on how memory starved the Object Cache is, as well as how hard the web front-ends (WFEs) in the farm are working on average.  As you might expect, systems under greater stress or load tend to manifest symptoms more visibly than systems encountering lighter loads.

In my case, one of the client’s main sites was experiencing frequent Object Cache thrashing, and that led to spells of extremely erratic performance during times when flushes and cache compactions were taking place.  The operations I describe are extremely resource intensive and can introduce blocking behavior in the request pipeline.  Given the volume of requests that come through the client’s sites, the entire farm would sometimes drop to its knees as the Object Cache struggled to fill, flush, and serve as needed.  Until the problem was located and the allocation was adjusted, a lot of folks remained on-call.

Tuning Recommendations

First and foremost: don’t adjust the size of the Object Cache memory allocation downwards unless you’ve got a really good reason for doing so, such as extreme memory constraints or some good internal knowledge indicating that the Object Cache simply isn’t being used in any substantial way for the site collection in question.  As I’ve witnessed firsthand, the performance cost of under-allocating memory to the Object Cache can be far worse than the potential memory savings gained by tweaking.

Second, don’t make the same mistake I made and think that the Object Cache memory allocation is a static chunk of memory that’s claimed by MOSS for the site collection.  The Object Cache uses only the memory it needs, and it will only start ejecting/flushing/compacting the Object Cache after the cache has become filled to the specified allocation limit.

And now, for the $64,000-contrary-to-common-sense tip …

For tuning established site collections and the detection of thrashing behavior, Microsoft actually recommends using the Object Cache compactions performance counter (SharePoint Publishing Cache/Total number of cache compactions) to guide Object Cache memory allocation.  Since cache compactions represent the greatest threat to ongoing optimal performance, Microsoft concluded (while working to help us) that monitoring the Total number of cache compactions counter was the best indicator of whether or not the Object Cache was memory starved and in trouble:

Total number of cache compactions highlighted in Performance Monitor

Steve Sheppard (a very knowledgeable Microsoft Escalation Engineer with whom I worked and highly recommend) wrote an excellent blog post that details the specific process he and the folks at Microsoft assembled to use the Total number of cache compactions counter in tuning the Object Cache’s memory allocation.  I recommend reading his post, as it covers a number of details I don’t include here.  The distilled guidelines he presents for using the Total number of cache compactions counter basically break counter values into three ranges:

  • 0 or 1 compactions per hour: optimal
  • 2 to 6 compactions per hour: adequate
  • 7+ compactions per hour: memory allocation insufficient

In short: more than six cache compactions per hour is a solid sign that you need to adjust the site collection’s Object Cache memory allocation upwards.  At this level of memory starvation within the Object Cache, there are bound to be secondary signs of performance problems popping up (for example, erratic response times and increasing ASP.NET request queue depth).

Conclusion

We were able to restore Object Cache performance to acceptable levels (and adjust our allocation down a bit), but we lacked good guidance and a quantifiable measure until the Total number of cache compactions performance counter came to light.  Keep this in your back pocket for the next time you find yourself doing some tuning!

Addendum

I owe Steve Sheppard an additional debt of gratitude for keeping me honest and cross-checking some of my earlier statements and numbers regarding the Publishing cache hit ratio.  Though the counter values persist beyond an IISReset, I had incorrectly stated that they persist beyond a reboot and effectively never reset.  The values do reset, but only after a server reboot.  I’ve updated this post to reflect the feedback Steve supplied.  Thank you, Steve!

Additional Reading and References

  1. User Group: SharePoint Users of Indiana
  2. Blog: Andrew Connell
  3. TechNet: Caching In Office SharePoint Server 2007
  4. Blog: Steve Sheppard

RPO and RTO: Prerequisites for Informed SharePoint Disaster Recovery Planning

July 8, 2009 5 comments

Years ago, before I began working with SharePoint, I spent some time working as an application architect for a Fortune 500 financial services company based here in Cincinnati, Ohio.  While at the company, I was awarded the opportunity to serve as a disaster recovery (DR) architect on the team that would build (from ground up) the company’s first DR site implementation.  It was a high-profile role with little definition – the kind that can either boost a career or burn it down.  Luckily for me, the outcome leaned towards the former.

Admittedly, though, I knew very little about DR before starting in that position.  I only knew what management had tasked me with doing: ensuring that mission-critical applications would be available and functional at the future DR site in the event of a disaster.  If you aren’t overly familiar with DR, then that target probably sounds relatively straightforward.  As I began working and researching avenues of attack for my problem set, though, I quickly realized how challenging and unusual disaster recovery planning was as a discipline – particularly for a “technically minded” person like myself.

Understanding the “Technical Tendency”

When it comes to DR, folks with whom I’ve worked have heard me say the following more than a few times:

It is the nature of technical people to find and implement technical solutions to technical problems.  At its core, disaster recovery is not a technical problem; it is a business problem.  Approaching disaster recovery with a purely technical mindset will result in a failure to deliver an appropriate solution more often than not.

What do I mean by that?  Well, technical personnel tend to lump DR plans and activities into categories like “buying servers,” “taking backups,” and “acquiring off-site space.”  These activities can certainly be (and generally are) part of a DR plan, but if they are the starting point for a DR strategy, then problems are likely to arise.

Let me explain by way of a simplistic and fictitious example.

Planning for SharePoint DR in a Vacuum

Consider the plight of Larry.  Larry is an IT professional who possesses administrative responsibility for his company’s SharePoint-based intranet.  One day, Larry is approached by his manager and instructed to come up with a DR strategy for the SharePoint farm that houses the intranet.  Like most SharePoint administrators, Larry’s never really “done” DR before.  He’s certain that he will need to review his backup strategy and make sure that he’s getting good backups.  He’ll probably need to talk with the database administrators, too, because it’s generally a good idea to make sure that SQL backups are being taken in addition to SharePoint farm (catastrophic) backups.

Larry’s been told that off-site space is already being arranged by the facilities group, so that’s something he’ll be able to take off of his plate.  He figures he’ll need to order new servers, though.  Since the company’s intranet farm consists of four servers (including database servers), he plans to play it safe and order four new servers for the DR site.  In his estimation, he’ll probably need to talk with the server team about the hardware they’ll be placing out at the DR site, he’ll need to speak with the networking team about DNS and switching capabilities they plan to include, etc.

Larry prepares his to-do list, dives in, and emerges three months later with an intranet farm DR approach possessing the following characteristics:

  • The off-site DR location will include four servers that are setup and configured as a new, “warm standby” SharePoint farm.
  • Every Sunday night, a full catastrophic backup of the SharePoint farm will be taken; every other night of the week, a differential backup will be taken.  After each nightly backup is complete, it will be remotely copied to the off-site DR location.
  • In the event of a disaster, Larry will restore the latest full backup and appropriate differential backups to the standby farm that is running at the DR site.
  • Once the backups have been restored, all content will be available for users to access – hypothetically speaking, of course.

There are a multitude of technical questions that aren’t answered in the plan described above.  For example, how is patching of the standby farm handled?  Is the DR site network a clone of the existing network?  Will server name and DNS hostname differences be an issue?  What about custom solution packages (WSPs)?  Ignoring all the technical questions for a moment, take a step back and ask yourself the question of greatest importance: will Larry’s overall strategy and plan meet his DR requirements?

If you’re new to DR, you might say “yes” or “no” based on how you view your own SharePoint farm and your experiences with it.  If you’ve previously been involved in DR planning and are being honest, though, you know that you can’t actually answer the question.  Neither can Larry or his manager.  In fact, no one (on the technical side, anyway) has any idea if the DR strategy is a good one or not – and that’s exactly the point I’m trying to drive home.

The Cart Before the Horse

Assuming Larry’s company is like many others, the SharePoint intranet has a set of business owners and stakeholders (collectively referred to as “The Business” hereafter) who represent those who use the intranet for some or all of their business activities.  Ultimately, The Business would issue one of three verdicts upon learning of Larry’s DR strategy:

Verdict 1: Exactly What’s Needed

Let’s be honest: Larry’s DR plan for intranet recovery could be on-the-money.  Given all of the variables in DR planning and the assumptions that Larry made, though, the chance of such an outcome is slim.

Verdict 2: DR Strategy Doesn’t Offer Sufficient Protection

There’s a solid chance that The Business could judge Larry’s DR plan as falling short.  Perhaps the intranet houses areas that are highly volatile with critical data that changes frequently throughout the day.  If an outage were to occur at 4pm in the afternoon, an entire day’s worth of data would basically be lost because the most recent backup would likely be 12 or so hours old (remember: the DR plan calls for nightly backups).  Loss of that data could be exceptionally costly to the organization.

At the same time, Larry’s recovery strategy assumes that he has enough time to restore farm-level backups at the off-site location in the event of a disaster.  Restoring a full SharePoint farm-level backup (with the potential addition of differential backups) could take hours.  If having the intranet down costs the company $100,000 per hour in lost productivity or revenue, you can bet that The Business will not be happy with Larry’s DR plan in its current form.

Verdict 3: DR Strategy is Overkill

On the flipside, there’s always the chance that Larry’s plan is overkill.  If the company’s intranet possesses primarily static content that changes very infrequently and is of relatively low importance, nightly backups and a warm off-site standby SharePoint farm may be overkill.  Sure, it’ll certainly allow The Business to get their intranet back in a timely fashion … but at what cost?

If a monthly tape backup rotation and a plan to buy hardware in the event of a disaster is all that is required, then Larry’s plan is unnecessarily costly.  Money is almost always constrained in DR planning and execution, and most organizations prioritize their DR target systems carefully.  Extra money that is spent on server hardware, nightly backups, and maintenance for a warm off-site SharePoint farm could instead be allocated to the DR strategies of other, more important systems.

Taking Care of Business First

No one wants to be left guessing whether or not their SharePoint DR strategy will adequately address DR needs without going overboard.  In approaching the challenge his manager handed him without obtaining any additional input, Larry fell into the same trap that many IT professionals do when confronted with DR: he failed to obtain the quantitative targets that would allow him to determine if his DR plan would meet the needs and expectations established by The Business.  In their most basic form, these requirements come in the form of recovery point objectives (RPOs) and recovery time objectives (RTOs).

The Disaster Recovery Timeline

I have found that the concepts of RPO and RTO are easiest to explain with the help of illustrations, so let’s begin with a picture of a disaster recovery timeline itself:

Disaster Recovery Timeline

The diagram above simply shows an arbitrary timeline with an event (a “declared disaster”) occurring in the middle of the timeline.  Any DR planning and preparation occurs to the left of the event on the timeline (in the past when SharePoint was still operational), and the actual recovery of SharePoint will happen following the event (that is, to the right of the event on the timeline in the “non-operational” period).

This DR timeline will become the canvas for further discussion of the first quantitative DR target you need to obtain before you can begin planning a SharePoint DR strategy: RPO.

RPO: Looking Back

As stated a little earlier, RPO is an acronym for Recovery Point Objective.  Though some find the description distasteful, the easiest way to describe RPO is this: it’s the maximum amount of data loss that’s tolerated in the event of a disaster.  RPO targets vary wildly depending on volatility and criticality of the data stored within the SharePoint farm.  Let’s add a couple of RPO targets to the DR timeline and discuss them a bit further.

Disaster Recovery Timeline with RPO

Two RPO targets have been added to the timeline: RPO1 and RPO2.  As discussed, each of these targets marks a point in the past from which data must be recoverable in the event of a disaster.  In the case of our first example, RPO1, the point in question is 48 hours before a declared disaster (that is, “we have a 48 hour RPO”).  RPO2, on the other hand, is a point in time that is a mere 30 minutes prior to the disaster event (or a “30 minute target RPO”).

At a minimum, any DR plan that is implemented must ensure that all of the data prior to the point in time denoted by the selected RPO can be recovered in the event of a disaster.  For RPO1, there may be some loss of data that was manipulated in the 48 hours prior to the disaster, but all data older than 48 hours will be recovered in a consistent state.  RPO2 is more stringent and leaves less wiggle room; all data older than 30 minutes is guaranteed to be available and consistent following recovery.

If you think about it for a couple of minutes, you can easily begin to see how RPO targets will quickly validate or rule-out varying backup and/or data protection strategies.  In the case of RPO1, we’re “allowed” to lose up to two days (48 hours) worth of data.  In this situation, a nightly backup strategy would be more than adequate to meet the RPO target, since a nightly backup rotation guarantees that available backup data is never more than 24 hours old.  Whether disk or tape based, this type of backup approach is very common in the world of server management.  It’s also relatively inexpensive.

The same nightly backup strategy would fail to meet the RPO requirement expressed by RPO2, though.  RPO2 states that we cannot lose more than 30 minutes of data.  With this type of RPO, most standard disk and tape-based backup strategies will fall short of meeting the target.  To meet RPO2’s 30 minute target, we’d probably need to look at something like SQL Server log shipping or mirroring.  Such a strategy is going to generally require a greater investment in database hardware, storage, and licensing.  Technical complexity also goes up relative to the aforementioned nightly backup routine.

It’s not too hard to see that as the RPO window becomes increasingly more narrow and approaches zero (that is, an RPO target of real-time failover with no data loss permitted), the cost and complexity of an acceptable DR data protection strategy climbs dramatically.

RTO: Thinking Ahead

If RPO drives how SharePoint data protection should be approached prior to a disaster, RTO (or Recovery Time Objective) denotes the timeline within which post-disaster farm and data recovery must be completed.  To illustrate, let’s turn once again to the DR timeline.

Disaster Recovery Timeline with RTO

As with the previous RPO example, we now have two RTO targets on the timeline: RTO1 and RTO2.  Analogous to the RPO targets, the RTO targets are given in units of time relative to the disaster event.  In the case of RTO1, the point in time in question is two hours after a disaster has been declared.  RTO2 is designated as t+36 hours, or a day and a half after the disaster has been declared.

In plain English, an RTO target is the maximum amount of time that the recovery of data and functionality can take following a disaster event.  If the overall DR plan for your SharePoint farm were to have an RTO that matches RTO2, for instance, you would need to have functionality restored (at an agreed-upon level) within a day and half.  If you were operating with a target that matches RTO1, you would have significantly less time to get everything “up and running” – only two hours.

RTO targets vary for the same reasons that RPO targets vary.  If the data that is stored within SharePoint is highly critical to business operations, then RTOs are generally going to trend towards hours, minutes, or maybe even real-time (that is, an RTO that mandates transferring to a hot standby farm or “mirrored” data center for zero recovery time and no interruption in service).  For SharePoint data and farms that are less business critical (maybe a publishing site that contains “nice to have” information), RTOs could be days or even weeks.

Just like an aggressive RPO target, an aggressive RTO target is going to limit the number of viable recovery options that can possibly address it – and those options are generally going to lean towards being more expensive and technically more complex.  For example, attempting to meet a two hour RTO (RTO1) by restoring a farm from backup tapes is going to be a gamble.  With very little data, it may be possible … but you wouldn’t know until you actually tried with a representative backup.  At the other extreme, an RTO that is measured in weeks could actually make a ground-up farm rebuild (complete with new hardware acquisition following the disaster) a viable – and rather inexpensive (in up-front capital) – recovery strategy.

Whether or not a specific recovery strategy will meet RTO targets in advance of a disaster is oftentimes difficult to determine without actually testing it.  That’s where the value of simulated disasters and recovery exercises come into play – but that’s another topic for another time.

Closing Words

This post was intended to highlight a common pitfall affecting not only SharePoint DR planning, but DR planning in general.  It should be clear by now that I deliberately avoided technical questions and issues to focus on making my point about planning.  Don’t interpret my “non-discussion” of technical topics to mean that I think that their place with regard to SharePoint DR is secondary.  That’s not the case at all; the fact that John Ferringer and I wrote a book on the topic (the “SharePoint 2007 Disaster Recovery Guide”) should be proof of this.  It should probably come as no surprise that I recommend our book for a much more holistic treatment of SharePoint DR – complete with technical detail.

There are also a large number of technical resources for SharePoint disaster recovery online, and the bulk of them have their strong and weak points.  My only criticism of them in general is that they equate “disaster recovery” to “backup/restore.”  While the two are interrelated, the latter is but only one aspect of the former.  As I hope this post points out, true disaster recovery planning begins with dialog and objective targets – not server orders and backup schedules.

If you conclude your reading holding onto only one point from this post, let it be this: don’t attempt DR until you have RPOs and RTOs in hand!

Additional Reading and References

  1. Book: SharePoint 2007 Disaster Recovery Guide
  2. Online: Disaster Recovery Journal site
Follow

Get every new post delivered to your Inbox.

Join 2,207 other followers

%d bloggers like this: