In this post, I discuss events that I’ll be participating in during the month of November. Events include a SharePoint Conference recap presentation for Microsoft, SharePoint Saturday Cleveland, and a webcast for Idera on SharePoint disaster recovery.
November was looking like a pretty busy month for me before this year’s SharePoint Conference (SPC) in Las Vegas, but the excitement about SharePoint 2010 both in and around the conference seems to have ratcheted things up a notch. Here’s where I’ll be and what I’ll be doing (in “order of appearance”) in the month of November:
Microsoft “Best of SPC 2009” Event
Many of the folks who wanted to attend the Microsoft SharePoint Conference in Las Vegas this year weren’t able to do so for a variety of reasons. To “share the love” a bit, Microsoft is holding a series of one-day events that bring select sessions from the SPC to cities around the country … or at least around the state of Ohio. Yes, I’m extrapolating a bit with “around the country,” but it’s an educated guess :-)
In any case, I’ll be delivering a session titled What’s New for SharePoint 2010 Administration and Governance to the crowd that will be attending the event at the Microsoft office in Columbus, Ohio, on November 10th. The abstract for the session reads as follows:
SharePoint 2010 includes many new and improved tools for providing a flexible and controlled environment and this session will provide an overview of those innovations.
I caught this session while I was at the SPC, and I found it to be good, solid information for IT professionals. I’m very much looking forward to delivering the content myself!
SharePoint Saturday Cleveland
SharePoint Saturday finally makes its way to Ohio! SharePoint Saturday Cleveland will be held on Saturday, November 14th, at the Embassy Suites on Rockside Woods Blvd. in Independence, Ohio.
John Ferringer and I will be delivering our SharePoint disaster recovery (DR) talk titled “Saving SharePoint.” It will differ a bit from previous presentations on the topic in that we can now include SharePoint 2010 content. After the talk, I’ll be sure to post our slide deck here on my blog.
SPS Cleveland is less than two weeks away, but there are still seats open. As with all SPS events, there’s no charge for those in attendance – all you need to do is show up and take it all in!
The “week of whirlwind activity” (roughly speaking) will conclude with a webcast for Idera. John and I will be presenting SharePoint Disaster Recovery Essential Guidelines on Wednesday, November 18th, and it will be similar to the SharePoint Saturday presentation we’ve given in the past (and will have given a few days earlier at SPS Cleveland).
Todd Klindt recently presented a DR webcast with Idera; if you saw it, you might be asking “do I really need another DR webcast?” Probably the biggest differences between Todd’s webcast and ours are scope and target audience. I caught Todd’s presentation, and his webcast was aimed more at the solidly SharePoint admin/IT pro crowd. John and I include some of the same content and focus, but our webcast is packaged with more of a lean towards classic DR concepts (RTO, RPO, BCPs, etc.). I would also say that our webcast targets IT decision makers and DR planners as much as it does IT pros, though I feel that both groups will find something of interest in what we have to say.
If our webcast sounds like it would be of interest to you, hop over to Idera’s site and sign up!
This post investigates manual flushing of the MOSS BLOB cache via file system deletion, why such flushes might be needed, and how they should be carried out. Some common troubleshooting questions (and answers to them) are also covered.
It’s a fact of life when dealing with many caching systems: for all the benefits they provide, they occasionally become corrupt or require some form of intervention to ensure healthy ongoing operation. The MOSS Binary Large Object (BLOB) cache, or disk-based cache, is no different.
Is BLOB Cache Corruption a Common Problem?
In my experience, the answer is “no.” The MOSS BLOB cache generally requires little maintenance and attention beyond ensuring that it has enough disk space to properly store the objects it fetches from the lists within the content databases housing your publishing site collections.
How Should a Flush Be Carried Out?
When corruption does occur or a cache flush is desired for any reason, the built-in “Disk Based Cache Reset” option is typically adequate for flushing the BLOB cache on a single server and single web application zone. This option (circled in red on the page shown to the right) is exposed through the Site collection object cache menu item on a publishing site’s Site Collection Administration menu. Executing a flush is as simple as checking the supplied checkbox and clicking the OK button at the bottom of the page. When a flush is executed in this fashion, it affects only the server to which the postback occurs and only the web application through which the request is directed. If a site collection is extended to multiple web applications, only one web application’s BLOB cache is affected by this operation.
Alternatively, my MOSS 2007 Farm-Wide BLOB Cache Flushing Solution (screenshot shown on the right) can be used to clear the BLOB cache folders associated with a target site collection across all servers in a farm and across all web applications (zones) serving up the site collection. This solution utilizes a different mechanism for flushing, but the net effect produced is the same as for the out-of-the-box (OOTB) mechanism: all BLOB-cached files for the associated site collection are deleted from the file system, and the three BLOB cache tracking files for each affected web application (IIS site) are reset.
If the aforementioned flush mechanisms simply aren’t working for you, you’re probably staring down the barrel of a manual BLOB cache flush. Just delete all of the files in the target BLOB cache folder (as specified in the web.config) and you should be good to go, right?
Jumping in and simply deleting files without stopping requests to the affected site collection (or rather, the web application/applications servicing the site collection) risks sending you down the road to (further) cache corruption. This risk may be small for sites that see little traffic or are relatively small, but the risk grows with increasing request volume and site collection size. Allow me to illustrate with an example.
Let’s say that you decided to manually clear the BLOB cache for a sizable publishing site collection that is heavily trafficked. You go into the file system, find your BLOB cache folder (by default, C:\blobCache), open it up, select all files and sub-folders contained within, and press the <Delete> key on your keyboard. Deletion of the BLOB cache files and sub-folders commences.
Deleting the sub-folders and files isn’t an instantaneous operation, though. It takes some time. While the deletion is taking place, let’s say that your MOSS publishing site collections are still up and servicing requests. The web applications for which BLOB caching is enabled are still attempting to use the very folders and files currently being deleted.
The Race Condition
For the duration of the deletion, a race condition is in effect that can yield some fairly unpredictable results. Consider the following possible execution sequence. Note: this example is hypothetical, but I’ve seen results on multiple occasions that suggest this execution sequence (or something similar to it).
1. The deletion operation deletes one or more of the .bin files at the root of a web application’s BLOB cache folder. These files are used by MOSS to track the contents of the BLOB cache, the number of times it was flushed, etc.
2. A request for a resource that would normally be present in the BLOB cache arrives at the web server. An attempted lookup for the resource in the BLOB cache folder fails because the .bin files are gone as a result of the actions taken in the last step.
3. The absence of the .bin files kicks off some housekeeping. Ultimately, a “fresh” set of .bin files is written out.
4. The requested resource is fetched into the BLOB cache (sub-)folder structure and the .bin files are updated so that subsequent requests for the resource are served from the file system instead of the content database.
5. The deletion operation, which has been running the whole time, deletes the file and/or folder containing the resource that was just fetched.
6. Once the deletion operation has concluded, a resource that was fetched in step #4 is tracked in the BLOB cache’s dump.bin file, but as a result of step #5, the resource no longer actually exists in the BLOB cache file system. Net effect: requests for such resources return HTTP 404 errors.
Since image files are the most common BLOB-cached resources, broken link images (for example, that nasty red “X” in place of an image in Internet Explorer) are shown for these tracked-but-missing resources. No amount of browser refreshing brings the image back from the server; only an update to the image in the content database (which triggers a re-fetch of the affected resource into the BLOB cache) or another flush operation fixes the issue as long as BLOB caching remains enabled.
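To make the end state of that sequence concrete, here is a minimal sketch that replays it with plain Python sets. It is only a model: the file names are made up, and the sets merely stand in for the cache folder and the .bin tracking data. Nothing here touches a real BLOB cache or reflects how MOSS implements its tracking internally.

```python
def simulate_unsafe_flush():
    """Replay the hypothetical execution sequence; return tracked-but-missing files."""
    # Hypothetical cached resources and the matching tracking data.
    files_on_disk = {"logo.png", "banner.jpg"}
    tracked = {"logo.png", "banner.jpg"}

    # The manual delete has already selected everything in the folder.
    pending_delete = set(files_on_disk)

    # The delete pass removes the .bin tracking files first.
    tracked.clear()

    # A request arrives while the delete is still running: the lookup misses,
    # fresh tracking files are written, and the resource is re-fetched and tracked.
    requested = "logo.png"
    files_on_disk.add(requested)
    tracked.add(requested)

    # The delete pass continues and removes everything it had selected,
    # including the file that was just re-fetched.
    files_on_disk -= pending_delete

    # Anything tracked but no longer on disk is what produces the HTTP 404s.
    return tracked - files_on_disk


if __name__ == "__main__":
    print(sorted(simulate_unsafe_flush()))
```

If no request is serviced while the delete runs, the tracking set stays empty and the function would return an empty set, which is exactly what the procedures below try to guarantee.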
Proper Manual Clearing
The key to avoiding the type of corruption scenario I just described is to ensure that requests aren’t serviced by the web application or applications that are tied to the BLOB cache. Luckily, this is accomplished in a relatively straightforward fashion.
Before attempting either of the approaches I’m about to share, though, you need to know where (in the server file system) your BLOB cache root folder is located. By default, the BLOB cache root folder is located at C:\blobCache; however, most conscientious administrators change this path to point to a data drive or non-system partition.
If you are unsure of the location of the BLOB cache root folder containing resources for your site collection, it’s easy enough to determine it by inspecting the web.config file for the web application housing the site collection. As shown in the sample web.config file on the right, the location attribute of the <BlobCache> element identifies the BLOB cache root folder in which each web application’s specific subfolder will be created.
Be aware that saving any changes to the web.config file will result in an application pool recycle, so it’s generally a good idea to review a copy of the web.config file when inspecting it rather than directly opening the web.config file itself.
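If you would rather not open the file by hand, the lookup can be scripted against a copy. The following is a rough sketch using Python's standard XML parser. The sample markup is an illustrative fragment I made up for the example (the path, maxSize, and max-age values are assumptions), so compare it against your own web.config rather than treating it as canonical.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; a real web.config contains far more than this.
SAMPLE_WEB_CONFIG = r"""<configuration>
  <SharePoint>
    <BlobCache location="E:\MOSS\BLOB Cache" path="\.(gif|jpg|png|css|js)$" maxSize="10" max-age="86400" enabled="true" />
  </SharePoint>
</configuration>"""


def blob_cache_settings(web_config_text):
    """Return the attributes of the first <BlobCache> element as a dict."""
    root = ET.fromstring(web_config_text)
    element = root.find(".//BlobCache")
    if element is None:
        raise ValueError("no <BlobCache> element found")
    return dict(element.attrib)


if __name__ == "__main__":
    settings = blob_cache_settings(SAMPLE_WEB_CONFIG)
    print(settings["location"], settings["enabled"])
```

For a real check, copy web.config somewhere else first and pass the text of the copy to blob_cache_settings; the script only reads, but working from a copy removes any chance of an accidental save triggering an application pool recycle.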
The Quick and Dirty Approach
When you just want to “get it done” as quickly as possible using the least number of steps, this is the process:
1. Stop the World Wide Web Publishing Service on the target server. This can be accomplished from the command line (via net stop w3svc) or the Services MMC snap-in (via Start –> Administrative Tools –> Services) as shown on the right.
2. Once the World Wide Web Publishing Service stops, simply delete the BLOB cache root folder. Ensure that the deletion operation completes before moving on to the next step.
3. Restart the World Wide Web Publishing service (via Services or net start w3svc).
Though this approach is quick with regard to time and effort invested, it’s certainly “dirty,” coarse, and not without disadvantages. First, using this approach prevents the web server from servicing *any* web requests for the duration of the operation. This includes not only SharePoint requests, but requests for any other web site that may be served from the server.
Second, the “quick and dirty” approach wipes out the entire BLOB cache – not just the cached content associated with the web application housing your site collection (unless, of course, you have a single web application that hasn’t been extended). This is the functional equivalent of trying to drive a nail with a sledgehammer, and it’s typically overkill in most production scenarios.
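For those who prefer a script to clicking through Services, the quick and dirty sequence could be sketched roughly as follows. The function name and its run and remove_tree parameters are my own inventions for the example; only net stop w3svc and net start w3svc come from the steps above. The sketch assumes it is run from an elevated prompt on the web server itself.

```python
import shutil
import subprocess


def quick_and_dirty_flush(blob_cache_root, run=subprocess.check_call,
                          remove_tree=shutil.rmtree):
    """Stop W3SVC, delete the whole BLOB cache root folder, restart W3SVC."""
    # Stop all web request processing on this server.
    run(["net", "stop", "w3svc"])

    # Delete the entire cache root. rmtree returns only once the deletion
    # has finished. If it raises, the service is deliberately left stopped
    # (a choice made for this sketch) so that a half-deleted cache is never
    # served; investigate and restart by hand in that case.
    remove_tree(blob_cache_root)

    # Bring the web service back up.
    run(["net", "start", "w3svc"])
```

A call would look like quick_and_dirty_flush(r"C:\blobCache"). The run parameter exists so the sequence can be dry-run by passing something harmless such as print.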
The Controlled (Granular) Approach
There is a less invasive alternative to the “Quick and Dirty” technique I just described, and it is the procedure I recommend for production environments and other scenarios where actions must be targeted and impact minimized. The screenshots that follow are specific to IIS7 (Windows Server 2008), but the fundamental activities covered in each step are the same for IIS6 even if execution is somewhat different.
1. Determine the IIS ID of the web application servicing the site collection for which the flush is being performed. This is easily accomplished using the Internet Information Services (IIS) Manager (accessible through the Administrative Tools menu) as shown to the right. If I’m interested in clearing the BLOB cache of a site collection that is hosted within the InternalHomeWeb (Default) web application, for example, the IIS site ID of interest is 1043653284.
2. Determine the name of the application pool that is servicing the web application. In IIS7, this is accomplished by selecting the web application (InternalHomeWeb (Default)) in the list of sites and clicking the Basic Settings… link under Edit Site in the Site Actions menu on the right-hand side of the window. The dialog box that pops up clearly indicates the name of the associated application pool (as shown on the right, circled in red). Note the name of the application pool for the next step.
3. Stop the application pool that was located in the previous step. This will shut down the web application and prevent MOSS from serving up requests for the site collections housed within the web application, thus avoiding the sort of race condition described earlier. If multiple application pools are used to partition web applications within different worker processes, then shutting down the application pool is “less invasive” than stopping the entire World Wide Web Publishing Service as described in “The Quick and Dirty Approach.” If all (or most) web applications are serviced by a single application pool, though, then there may be little functional benefit to stopping the application pool. In such a case, it may simply be easier to stop the World Wide Web Publishing Service as described in “The Quick and Dirty Approach.”
4. Open Windows Explorer and navigate to the BLOB cache root folder. For the purposes of this example, we’ll assume that the BLOB cache root folder is located at E:\MOSS\BLOB Cache. Within the root folder should be a sub-folder with a name that matches the IIS site ID determined in step #1 (1043653284). Either delete the entire sub-folder (E:\MOSS\BLOB Cache\1043653284), or select the files within the sub-folder and delete them (as shown above).
5. Once the deletion has completed, restart the application pool that was shut down in step #3. If the World Wide Web Publishing Service was shut down instead, restart it.
Taking the approach just described affects the fewest number of cached resources necessary to ensure that the site collection in question (or rather, its associated web application/applications) starts with a “clean slate.” If web applications are partitioned across multiple application pools, then this approach also restricts the resultant service outage to only those site collections ultimately being served by the application pool being shut down and restarted.
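The same controlled procedure could be scripted on IIS7 along these lines. This is a sketch under several assumptions: it uses the appcmd.exe tool that ships with IIS7 (assumed here to live in the default C:\Windows\System32\inetsrv folder), the function name and its run parameter are hypothetical, and the application pool name in the usage note is made up. It does not apply to IIS6, which has no appcmd.

```python
import os
import shutil
import subprocess

# Assumed default location of the IIS7 command-line tool.
APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"


def granular_flush(blob_cache_root, iis_site_id, app_pool,
                   run=subprocess.check_call):
    """Stop one application pool, delete one site's cache sub-folder, restart the pool."""
    # The per-web-application sub-folder is named after the IIS site ID.
    target = os.path.join(blob_cache_root, str(iis_site_id))

    # Stop only the pool serving the web application in question.
    run([APPCMD, "stop", "apppool", "/apppool.name:" + app_pool])

    # Delete just that web application's cached files and tracking files.
    if os.path.isdir(target):
        shutil.rmtree(target)

    # Restart the pool once the deletion has completed.
    run([APPCMD, "start", "apppool", "/apppool.name:" + app_pool])
    return target
```

Using the example values from the steps above, a call might be granular_flush(r"E:\MOSS\BLOB Cache", 1043653284, "InternalHomeWebPool"), where the pool name is purely illustrative. Other sub-folders under the cache root are left alone, which is the whole point of the granular approach.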
Some Common Questions and Concerns
Q: I have multiple servers or web front-ends. Do I need to take them all down and manually flush them as a group?
The BLOB cache on each MOSS server operates independently of other servers in the farm, so the answer is “no.” Servers can be addressed one at a time and in any order desired.
Q: I’ve successfully performed a manual flush and brought everything back up, but I’m *still* seeing an old image/script/etc. What am I doing wrong?
Interestingly enough, this type of scenario oftentimes has little to do with the actual server-side BLOB cache itself.
One of the attributes that can (and should) be configured when enabling the BLOB cache is the max-age attribute. The max-age attribute specifies the duration of time, in seconds, that client-side browsers should cache resources that are retrieved from the MOSS BLOB cache. Subsequent requests for these resources are then served directly out of the client-side cache and not made to the MOSS server until a duration of time (specified by the max-age attribute) is exceeded.
If a BLOB cache is flushed and it appears that old or incorrect resources (commonly images) are being returned when requested, it might be that the resources are simply cached on the local system and being returned from the cache instead of being fetched from the server. Flushing locally-cached items (or deleting “Temporary Internet files” in Internet Explorer’s terminology) is a quick way to ensure that requests are being passed to the SharePoint server.
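The effect of max-age on what a browser does after a server-side flush can be reasoned about with a few lines of code. This sketch is a simplification of browser behavior (it ignores forced refreshes and any other caching headers), and the function name and the numbers are illustrative assumptions.

```python
def served_from_browser_cache(fetched_at, now, max_age_seconds):
    """Return True if the browser's copy is still within its max-age window."""
    age = now - fetched_at
    return age < max_age_seconds


if __name__ == "__main__":
    max_age = 86400          # hypothetical max-age of 24 hours, in seconds
    fetched_at = 0           # the browser cached the image at t = 0
    flush_time = 2 * 3600    # the server-side flush happens two hours later

    # Still inside the window: the request never reaches the MOSS server,
    # so the server-side flush has no visible effect for this user.
    print(served_from_browser_cache(fetched_at, flush_time, max_age))
```

In other words, with a long max-age a freshly flushed server can look "stale" to a returning visitor until the window lapses or the local cache is cleared.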
Q: I’m running into problems with a manual deletion. Sometimes all files within the cache folder can’t be deleted, or sometimes I run into strange files that have a size of zero bytes. What’s going on?
I haven’t seen this happen too often, but when I have seen it, it’s been due to problems with (or corruption in) the underlying file system. If regular CHKDSK operations aren’t scheduled for the drive housing the BLOB cache, it’s probably time to set them up.
The Microsoft SharePoint Conference 2009 is less than three weeks away. This post covers some news regarding the book signing and “Ask the Experts” panel in which I’ll be participating.
As most folks who work with SharePoint know, Microsoft’s SharePoint Conference 2009 (SPC09) is coming up in just a couple of weeks. This conference is the premier gathering for SharePoint professionals from all over the world, and this year’s conference promises to be chock-full of exciting sessions and announcements. Many of the announcements will undoubtedly revolve around SharePoint Server 2010, its capabilities, and (hopefully) some better information regarding its release timeline.
For me personally, the conference includes a couple of very special events. First, the great folks at Idera are sponsoring a book signing session for John Ferringer and me on Wednesday, October 21st, at 6pm. If you’re one of the first 50 people to come by the Idera booths (#811 and #813), you’ll get a free copy of the book … and it’ll be signed by John and me. What more could you ask for?!?! (and yes, I say that tongue-in-cheek)
Ask the Experts Session
On Monday, October 19th, from 6pm until 7:30pm, Idera will also be sponsoring an “Ask the Experts” session for SPC09 conference participants. Eric Shupps, Errin O’Connor, John Ferringer, Shane Young, Todd Klindt, and I will be taking (and hopefully answering) questions pertaining to the SharePoint platform, including SharePoint Server 2010.
Each of us on the panel has an “area of expertise.” For John and me, it probably comes as no surprise to learn that we’ll be fielding questions pertaining to SharePoint DR and backup/recovery. If you’re going to be at the conference and are interested in attending the session, swing by Idera’s booths (again, #811 and #813) for more information!
In this post, I discuss a couple of the SharePoint-related activities I was involved in over the summer — specifically, the SharePoint Saturday Ozarks event and the creation of a disaster recovery (DR) whitepaper for Idera.
In addition to the more formalized blog posts I’ve been assembling, I wanted to start detailing and informing readers about some of the upcoming SharePoint activities I’ll be involved in. With the SharePoint Conference 2009 (SPC09) taking place in just a few weeks, there will actually be quite a bit for me to announce.
Unfortunately, I’m not yet able to announce a few specific items on the horizon due to certain “restrictions” … so, in the absence of news on upcoming events, I figured I’d recap some of this summer’s activities. Hey, even “old news” is still news!
SharePoint Saturday Ozarks
SPS Ozarks was held in Harrison, Arkansas on July 18, 2009. The event was put together and coordinated by Mark Rackley, an all-around great guy who invested a tremendous amount of time and energy to ensure that everything was successful.
For those who aren’t familiar with SharePoint Saturdays: these events have been popping up all over the country and abroad. The SharePoint Saturday concept is the work of Michael Lotter of B&R Business Solutions, and the SPS events serve to educate and inform anyone willing to spend a Saturday learning about SharePoint. SharePoint Saturdays are free to attendees, and in addition to being highly informative, the events are a great way to meet and interact with members of the SharePoint community.
Here’s the description that was published for our “Saving SharePoint” session:
A look at the options available for preserving your SharePoint environment and why disaster recovery is so much more than using a tool or running regular back ups. John Ferringer and Sean McDonough, co-authors of the “SharePoint 2007 Disaster Recovery Guide,” will be discussing the benefits, limitations, and potential scenarios for the many tools Microsoft makes available to backup and restore SharePoint, with a focus on finding the right fit for a variety of situations and environments. They will also cover disaster recovery concepts and strategies, explaining terms such as recovery time objectives and recovery point objectives and why DR is so much more than just backing up your SharePoint sites.
“Saving SharePoint” was well-received, and both John and I had fun presenting together (a first for us). It was also great to meet and spend some time with so many of the folks (other presenters) with whom we interact in the SharePoint space!
The slides we used during the delivery of our presentation can be found here.
SharePoint Disaster Recovery Whitepaper
John and I were also approached by Idera over the summer to write a whitepaper on SharePoint disaster recovery. Idera produces a number of very useful tools for SQL Server, SharePoint, and PowerShell. Given that they’ve made backup/recovery one of their focuses in the SharePoint space, the whitepaper seemed like a good fit.
Titled “Protect Your SharePoint Content: An Overview of SharePoint 2007 Disaster Recovery,” the whitepaper can be freely downloaded from Idera’s site. In the eight page paper, John and I walk through a number of the high-level considerations one should bear in mind when beginning the process of developing a SharePoint disaster recovery (DR) strategy. The whitepaper target audience is IT decision makers and those relatively unfamiliar to DR – not administrators and other technical personnel looking for DR “how to’s” or tools recommendations. Quite simply, DR is far too big a topic to cover in eight pages; that is, after all, why we wrote a 400 page book on the topic.
If you’re interested in SharePoint DR or tasked with assembling a strategy, have a look at the whitepaper. After all, it’s free!
For those of you who’ve been here before, you’ve probably noticed that the look and feel of my blog has changed a bit. While WordPress has always offered a variety of themes, I felt like I was “compromising” (and I don’t mean that in the positive sense of the word) with the one I had been using before. That theme, “Rubric,” had most of what I needed … but it didn’t look very professional.
When the INove theme was added, I sat up and took notice. To me, it looks more professional and contains more features. I’ve never been a fan of fixed column sizes, but I’ll learn to work around it.
In any case, I hope you find the new theme to your liking. As always, I welcome any sort of feedback you’re willing to share!
This post discusses the process of tuning the memory allocation for the Object Cache that is used by MOSS publishing sites. It includes some warnings regarding the “Publishing cache hit ratio” performance counter, and it describes the counter-intuitive use of the
I’ve been meaning to do a small write-up on a couple of key Object Cache points, but other things kept trumping my desire to put this post together. I finally found the nudge I needed (or rather, gave myself a kick in the butt) after discussing the topic a bit with Andrew Connell following a presentation he gave at a SharePoint Users of Indiana user group meeting. Thanks, Andrew!
A Brief Bit of Background
As I may have mentioned in a previous post, I’ve spent the bulk of the last two years buried in a set of Internet-facing MOSS publishing sites that are the public presence for my current client. Given that my current client is a Fortune 50 company, it probably comes as no surprise when I say that the sites see quite a bit of daily traffic. Issues due to poor performance tuning and inefficient code have a way of making themselves known in dramatic fashion.
Some time ago, we were experiencing a whole host of critical performance issues that ultimately stemmed from a variety of sources: custom code, infrastructure configuration, cache tuning parameters, and more. It took a team of Microsoft experts, along with professionals working for the client, to systematically address each item and bring operations back to a “normal” state. Though we ultimately worked through a number of different problem areas, one area in particular stood out: the MOSS Object Cache and how it was “tuned.”
What is the MOSS Object Cache?
Publishing sites make use of the Object Cache without any intervention on the part of administrators. By default, a publishing site’s Object Cache receives up to 100MB of memory for use when the site collection is created. This allocation can be seen on the Object Cache Settings site collection administration page within a publishing site:
Note that I said that up to 100MB can be used by the Object Cache by default. The size of the allocation simply determines how large the cache can grow in memory before item ejection, flushing, and possible compactions result. The maximum cache size isn’t a static allocation, so allocating 500MB of memory, for example, won’t deprive the server of 500MB of memory unless the amount of data going into the cache grows to that level. I’m taking a moment to point this out because I wasn’t (personally) aware of this when I first started working with the Object Cache. It also becomes relevant in a story I’ll be telling in a bit.
Microsoft’s TechNet site has an article that provides pretty good coverage of caching within MOSS (including the Object Cache), so I’m not going to go into all of the details it covers in this post. I will make the assumption that the information presented in the TechNet article has been read and understood, though, because it serves as the starting point for my discussion.
Object Cache Memory Tuning Basics
The TechNet article indicates that two specific indicators should be watched for tuning purposes. Those two indicators, along with their associated performance counters, are
Cache hit ratio (SharePoint Publishing Cache/Publishing cache hit ratio)
The image below shows these counters highlighted on a MOSS WFE where all SharePoint Publishing Cache counters have been added to a Performance Monitor session:
According to the article, the Publishing cache hit ratio should remain above 90% and a low object discard rate should be observed. This is good advice, and I’m not saying that it shouldn’t be followed. In fact, my experience has shown Publishing cache hit ratio values of 98%+ are relatively common for well-tuned publishing sites possessing largely static content.
The “Dirty Little Secret” about the Publishing Cache Hit Ratio Counter
As it turns out, though, the Publishing cache hit ratio counter should come with a very large warning that reads as follows:
WARNING: This counter only resets with a server reboot. Data it displays has been aggregating for as long as the server has been up.
This may not seem like such a big deal, particularly if you’re looking at a new site collection. Let me share a painful personal experience, though, that should drive home how important a point this really is.
I was attempting to do a little Object Cache tuning for a client to help free up some memory to make application pool recycles cleaner, and I was attempting to see if I could adjust the Object Cache allocations for multiple (about 18) site collections downward. We were getting into a memory-constrained position, and a review of the Publishing cache hit ratio values for the existing site collections showed that all sites were turning in 99%+ cache hit ratios. Operating under the (previously described) mistaken assumption that Object Cache memory was statically allocated, I figured that I might be able to save a lot of memory simply by adjusting the memory allocations downward.
Mistaken understanding in mind, I went about modifying the Object Cache allocation for one of the site collections. I knew that we had some data going into the cache (navigational data and a few cross-list query result sets), so I figured that we couldn’t have been using a whole lot of memory. I adjusted the allocation down dramatically (to 10MB) on the site collection and I periodically checked back over the course of several hours to see how the Publishing cache hit ratio fared.
After a chunk of the day had passed, I saw that the Publishing cache hit ratio remained at 99%+. I considered my assumption and understanding about data going into the Object Cache to be validated, and I went on my way. What I didn’t realize at the time was that the actual Publishing cache hit ratio counter value was driven by the following formula:
Publishing cache hit ratio = total cache hits / (total cache hits + total cache misses) * 100%
Note the pervasive use of the word “total” in the formula. In my defense, it wasn’t until we engaged Microsoft and made requests (which resulted in many more internal requests) that we learned the formulas that generate the numbers seen in many of the performance counters. To put it mildly, the experience was “eye opening.”
In reality, the site collection was far from okay following the tuning I performed. It truly needed significantly more than the 10MB allocation I had given it. If it were possible to reset the Publishing cache hit ratio counter or at least provide a short-term snapshot/view of what was going on, I would have observed a significant drop following the change I made. Since our server had been up for a month or more, and had been doing a good job of servicing requests from the cache during that time, the sudden drop in objects being served out of the Object Cache was all but undetectable in the short-term using the Publishing cache hit ratio.
To spell this out even further for those who don’t want to do the math: a highly-trafficked publishing site like one of my client’s sites may service 50 million requests from the Object Cache over the course of a month. Assuming that the site collection had been up for a month with a 99% Object Cache hit ratio, plugging the numbers into the aforementioned formula might look something like this:
Publishing cache hit ratio = 49500000 / (49500000 + 500000) * 100% = 99.0%
50 million Object Cache requests per month breaks down to about 1.7 million requests per day. Let’s say that my Object Cache adjustment resulted in an extremely pathetic 10% cache hit ratio. That means that of 1.7 million object requests, only 170000 of them would have been served from the Object Cache itself. Even if I had watched the Publishing cache hit ratio counter for the entire day and seen the results of all 1.7 million requests, here’s what the ratio would have looked like at the end of the day (assuming one month of uptime):
Publishing cache hit ratio = 49670000 / (49670000 + 2030000) * 100% = 96.1%
Net drop: only about 2.9% over the course of the entire day!
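For anyone who wants to rerun the arithmetic with their own traffic numbers, here is a small sketch of the cumulative formula. The function name is mine, and the inputs are the hypothetical figures from the example above: one month at a 99% hit ratio, followed by one day of 1.7 million requests at a 10% hit ratio.

```python
def hit_ratio(total_hits, total_misses):
    """Cumulative hit ratio, in percent, per the formula quoted earlier."""
    return total_hits / (total_hits + total_misses) * 100.0


# One month of uptime at a 99% hit ratio (50 million requests).
month_hits, month_misses = 49_500_000, 500_000

# One bad day: 1.7 million requests, only 10% of them served from the cache.
day_requests = 1_700_000
day_hits = day_requests * 10 // 100
day_misses = day_requests - day_hits

before = hit_ratio(month_hits, month_misses)
after = hit_ratio(month_hits + day_hits, month_misses + day_misses)

if __name__ == "__main__":
    print(f"before: {before:.1f}%  after: {after:.1f}%  drop: {before - after:.1f} points")
```

The day's hits are added to the running hit total and the day's misses to the running miss total, and the ratio still lands at roughly 96% because a month of history dominates both sums.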
Seeing this should serve as a healthy warning for anyone considering the use of the Publishing cache hit ratio counter alone for tuning purposes. In publishing environments where server uptime is maximized, the Publishing cache hit ratio may not provide any meaningful feedback unless the sampling time for changes is extended to days or even weeks. Such long tuning timelines aren’t overly practical in many heavily-trafficked sites.
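By way of contrast, here is what a short-term view would look like if cumulative hit and miss totals could be sampled at two points in time. This is purely a hypothetical sketch: I am not claiming that MOSS exposes raw totals in this form, and the function and sample values are assumptions carried over from the example above.

```python
def windowed_hit_ratio(earlier, later):
    """Hit ratio, in percent, for the interval between two (hits, misses) samples."""
    hits = later[0] - earlier[0]
    misses = later[1] - earlier[1]
    total = hits + misses
    if total == 0:
        return None  # no cache requests were observed in the window
    return hits / total * 100.0


if __name__ == "__main__":
    start_of_day = (49_500_000, 500_000)    # hypothetical cumulative totals
    end_of_day = (49_670_000, 2_030_000)    # after one badly tuned day
    print(windowed_hit_ratio(start_of_day, end_of_day))
```

Subtracting the earlier sample removes the month of history, so the day's poor behavior shows up directly instead of being averaged away.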
So, What Happens When the Memory Allocation isn’t Enough?
In plainly non-technical terms: it gets ugly. Actual results will vary based on how memory starved the Object Cache is, as well as how hard the web front-ends (WFEs) in the farm are working on average. As you might expect, systems under greater stress or load tend to manifest symptoms more visibly than systems encountering lighter loads.
In my case, one of the client’s main sites was experiencing frequent Object Cache thrashing, and that led to spells of extremely erratic performance during times when flushes and cache compactions were taking place. The operations I describe are extremely resource intensive and can introduce blocking behavior in the request pipeline. Given the volume of requests that come through the client’s sites, the entire farm would sometimes drop to its knees as the Object Cache struggled to fill, flush, and serve as needed. Until the problem was located and the allocation was adjusted, a lot of folks remained on-call.
First and foremost: don’t adjust the size of the Object Cache memory allocation downwards unless you’ve got a really good reason for doing so, such as extreme memory constraints or some good internal knowledge indicating that the Object Cache simply isn’t being used in any substantial way for the site collection in question. As I’ve witnessed firsthand, the performance cost of under-allocating memory to the Object Cache can be far worse than the potential memory savings gained by tweaking.
Second, don’t make the same mistake I made and think that the Object Cache memory allocation is a static chunk of memory that’s claimed by MOSS for the site collection. The Object Cache uses only the memory it needs, and it will only start ejecting/flushing/compacting the Object Cache after the cache has become filled to the specified allocation limit.
And now, for the $64,000-contrary-to-common-sense tip …
For tuning established site collections and the detection of thrashing behavior, Microsoft actually recommends using the Object Cache compactions performance counter (SharePoint Publishing Cache/Total number of cache compactions) to guide Object Cache memory allocation. Since cache compactions represent the greatest threat to ongoing optimal performance, Microsoft concluded (while working to help us) that monitoring the Total number of cache compactions counter was the best indicator of whether or not the Object Cache was memory starved and in trouble:
Steve Sheppard (a very knowledgeable Microsoft Escalation Engineer with whom I worked and whom I highly recommend) wrote an excellent blog post that details the specific process he and the folks at Microsoft assembled to use the Total number of cache compactions counter in tuning the Object Cache’s memory allocation. I recommend reading his post, as it covers a number of details I don’t include here. The distilled guidelines he presents for using the Total number of cache compactions counter basically break counter values into three ranges:
0 or 1 compactions per hour: optimal
2 to 6 compactions per hour: adequate
7+ compactions per hour: memory allocation insufficient
In short: more than six cache compactions per hour is a solid sign that you need to adjust the site collection’s Object Cache memory allocation upwards. At this level of memory starvation within the Object Cache, there are bound to be secondary signs of performance problems popping up (for example, erratic response times and increasing ASP.NET request queue depth).
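As a rough illustration of how those ranges might be applied, the Python sketch below turns two readings of the cumulative counter into an hourly rate and classifies it. The function names are hypothetical, the treatment of fractional rates (for example, 1.5 per hour counting as "adequate") is my own assumption, and the sketch assumes the counter did not reset between the two samples.

```python
def compactions_per_hour(first_total, second_total, hours_between):
    """Hourly rate derived from two samples of a cumulative compaction counter."""
    if hours_between <= 0:
        raise ValueError("hours_between must be positive")
    return (second_total - first_total) / hours_between


def classify(rate):
    """Map an hourly compaction rate onto the three ranges listed above."""
    if rate <= 1:
        return "optimal"
    if rate <= 6:
        return "adequate"
    return "memory allocation insufficient"


# Example: counter read 40 at the first sample and 61 three hours later.
rate = compactions_per_hour(40, 61, 3)
print(rate, classify(rate))
```

Sampling over a few hours of representative traffic, rather than a single quiet hour, should give a steadier rate to classify.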
We were able to restore Object Cache performance to acceptable levels (and adjust our allocation down a bit), but we lacked good guidance and a quantifiable measure until the Total number of cache compactions performance counter came to light. Keep this in your back pocket for the next time you find yourself doing some tuning!
I owe Steve Sheppard an additional debt of gratitude for keeping me honest and cross-checking some of my earlier statements and numbers regarding the Publishing cache hit ratio. Though the counter values persist beyond an IISReset, I had incorrectly stated that they persist beyond a reboot and effectively never reset. The values do reset, but only after a server reboot. I’ve updated this post to reflect the feedback Steve supplied. Thank you, Steve!
RPO (recovery point objective) targets and RTO (recovery time objective) targets are critical to have in hand prior to the start of disaster recovery (DR) planning for SharePoint. This post discusses RPO and RTO to build an understanding of what they are and how they impact DR decision making.
Years ago, before I began working with SharePoint, I spent some time working as an application architect for a Fortune 500 financial services company based here in Cincinnati, Ohio. While at the company, I was awarded the opportunity to serve as a disaster recovery (DR) architect on the team that would build (from ground up) the company’s first DR site implementation. It was a high-profile role with little definition – the kind that can either boost a career or burn it down. Luckily for me, the outcome leaned towards the former.
Admittedly, though, I knew very little about DR before starting in that position. I only knew what management had tasked me with doing: ensuring that mission-critical applications would be available and functional at the future DR site in the event of a disaster. If you aren’t overly familiar with DR, then that target probably sounds relatively straightforward. As I began working and researching avenues of attack for my problem set, though, I quickly realized how challenging and unusual disaster recovery planning was as a discipline – particularly for a “technically minded” person like myself.
Understanding the “Technical Tendency”
When it comes to DR, folks with whom I’ve worked have heard me say the following more than a few times:
It is the nature of technical people to find and implement technical solutions to technical problems. At its core, disaster recovery is not a technical problem; it is a business problem. Approaching disaster recovery with a purely technical mindset will result in a failure to deliver an appropriate solution more often than not.
What do I mean by that? Well, technical personnel tend to lump DR plans and activities into categories like “buying servers,” “taking backups,” and “acquiring off-site space.” These activities can certainly be (and generally are) part of a DR plan, but if they are the starting point for a DR strategy, then problems are likely to arise.
Let me explain by way of a simplistic and fictitious example.
Planning for SharePoint DR in a Vacuum
Consider the plight of Larry. Larry is an IT professional who possesses administrative responsibility for his company’s SharePoint-based intranet. One day, Larry is approached by his manager and instructed to come up with a DR strategy for the SharePoint farm that houses the intranet. Like most SharePoint administrators, Larry’s never really “done” DR before. He’s certain that he will need to review his backup strategy and make sure that he’s getting good backups. He’ll probably need to talk with the database administrators, too, because it’s generally a good idea to make sure that SQL backups are being taken in addition to SharePoint farm (catastrophic) backups.
Larry’s been told that off-site space is already being arranged by the facilities group, so that’s something he’ll be able to take off of his plate. He figures he’ll need to order new servers, though. Since the company’s intranet farm consists of four servers (including database servers), he plans to play it safe and order four new servers for the DR site. In his estimation, he’ll probably need to talk with the server team about the hardware they’ll be placing out at the DR site, he’ll need to speak with the networking team about DNS and switching capabilities they plan to include, etc.
Larry prepares his to-do list, dives in, and emerges three months later with an intranet farm DR approach possessing the following characteristics:
The off-site DR location will include four servers that are set up and configured as a new, “warm standby” SharePoint farm.
Every Sunday night, a full catastrophic backup of the SharePoint farm will be taken; every other night of the week, a differential backup will be taken. After each nightly backup is complete, it will be remotely copied to the off-site DR location.
In the event of a disaster, Larry will restore the latest full backup and appropriate differential backups to the standby farm that is running at the DR site.
Once the backups have been restored, all content will be available for users to access – hypothetically speaking, of course.
There are a multitude of technical questions that aren’t answered in the plan described above. For example, how is patching of the standby farm handled? Is the DR site network a clone of the existing network? Will server name and DNS hostname differences be an issue? What about custom solution packages (WSPs)? Ignoring all the technical questions for a moment, take a step back and ask yourself the question of greatest importance: will Larry’s overall strategy and plan meet his DR requirements?
If you’re new to DR, you might say “yes” or “no” based on how you view your own SharePoint farm and your experiences with it. If you’ve previously been involved in DR planning and are being honest, though, you know that you can’t actually answer the question. Neither can Larry or his manager. In fact, no one (on the technical side, anyway) has any idea if the DR strategy is a good one or not – and that’s exactly the point I’m trying to drive home.
The Cart Before the Horse
Assuming Larry’s company is like many others, the SharePoint intranet has a set of business owners and stakeholders (collectively referred to as “The Business” hereafter) who represent those who use the intranet for some or all of their business activities. Ultimately, The Business would issue one of three verdicts upon learning of Larry’s DR strategy:
Verdict 1: Exactly What’s Needed
Let’s be honest: Larry’s DR plan for intranet recovery could be on-the-money. Given all of the variables in DR planning and the assumptions that Larry made, though, the chance of such an outcome is slim.
Verdict 2: DR Strategy Doesn’t Offer Sufficient Protection
There’s a solid chance that The Business could judge Larry’s DR plan as falling short. Perhaps the intranet houses areas that are highly volatile with critical data that changes frequently throughout the day. If an outage were to occur at 4pm, an entire day’s worth of data would basically be lost because the most recent backup would likely be 12 or so hours old (remember: the DR plan calls for nightly backups). Loss of that data could be exceptionally costly to the organization.
At the same time, Larry’s recovery strategy assumes that he has enough time to restore farm-level backups at the off-site location in the event of a disaster. Restoring a full SharePoint farm-level backup (with the potential addition of differential backups) could take hours. If having the intranet down costs the company $100,000 per hour in lost productivity or revenue, you can bet that The Business will not be happy with Larry’s DR plan in its current form.
Verdict 3: DR Strategy is Overkill
On the flipside, there’s always the chance that Larry’s plan is overkill. If the company’s intranet possesses primarily static content that changes very infrequently and is of relatively low importance, nightly backups and a warm off-site standby SharePoint farm may be overkill. Sure, it’ll certainly allow The Business to get their intranet back in a timely fashion … but at what cost?
If a monthly tape backup rotation and a plan to buy hardware in the event of a disaster is all that is required, then Larry’s plan is unnecessarily costly. Money is almost always constrained in DR planning and execution, and most organizations prioritize their DR target systems carefully. Extra money that is spent on server hardware, nightly backups, and maintenance for a warm off-site SharePoint farm could instead be allocated to the DR strategies of other, more important systems.
Taking Care of Business First
No one wants to be left guessing whether or not their SharePoint DR strategy will adequately address DR needs without going overboard. In approaching the challenge his manager handed him without obtaining any additional input, Larry fell into the same trap that many IT professionals do when confronted with DR: he failed to obtain the quantitative targets that would allow him to determine if his DR plan would meet the needs and expectations established by The Business. In their most basic form, these requirements come in the form of recovery point objectives (RPOs) and recovery time objectives (RTOs).
The Disaster Recovery Timeline
I have found that the concepts of RPO and RTO are easiest to explain with the help of illustrations, so let’s begin with a picture of a disaster recovery timeline itself:
The diagram above simply shows an arbitrary timeline with an event (a “declared disaster”) occurring in the middle of the timeline. Any DR planning and preparation occurs to the left of the event on the timeline (in the past when SharePoint was still operational), and the actual recovery of SharePoint will happen following the event (that is, to the right of the event on the timeline in the “non-operational” period).
This DR timeline will become the canvas for further discussion of the first quantitative DR target you need to obtain before you can begin planning a SharePoint DR strategy: RPO.
RPO: Looking Back
As stated a little earlier, RPO is an acronym for Recovery Point Objective. Though some find the description distasteful, the easiest way to describe RPO is this: it’s the maximum amount of data loss that’s tolerated in the event of a disaster. RPO targets vary wildly depending on volatility and criticality of the data stored within the SharePoint farm. Let’s add a couple of RPO targets to the DR timeline and discuss them a bit further.
Two RPO targets have been added to the timeline: RPO1 and RPO2. As discussed, each of these targets marks a point in the past from which data must be recoverable in the event of a disaster. In the case of our first example, RPO1, the point in question is 48 hours before a declared disaster (that is, “we have a 48 hour RPO”). RPO2, on the other hand, is a point in time that is a mere 30 minutes prior to the disaster event (or a “30 minute target RPO”).
At a minimum, any DR plan that is implemented must ensure that all of the data prior to the point in time denoted by the selected RPO can be recovered in the event of a disaster. For RPO1, there may be some loss of data that was manipulated in the 48 hours prior to the disaster, but all data older than 48 hours will be recovered in a consistent state. RPO2 is more stringent and leaves less wiggle room; all data older than 30 minutes is guaranteed to be available and consistent following recovery.
If you think about it for a couple of minutes, you can easily begin to see how RPO targets will quickly validate or rule out varying backup and/or data protection strategies. In the case of RPO1, we’re “allowed” to lose up to two days (48 hours) worth of data. In this situation, a nightly backup strategy would be more than adequate to meet the RPO target, since a nightly backup rotation guarantees that available backup data is never more than 24 hours old. Whether disk or tape based, this type of backup approach is very common in the world of server management. It’s also relatively inexpensive.
The same nightly backup strategy would fail to meet the RPO requirement expressed by RPO2, though. RPO2 states that we cannot lose more than 30 minutes of data. With this type of RPO, most standard disk and tape-based backup strategies will fall short of meeting the target. To meet RPO2’s 30 minute target, we’d probably need to look at something like SQL Server log shipping or mirroring. Such a strategy is going to generally require a greater investment in database hardware, storage, and licensing. Technical complexity also goes up relative to the aforementioned nightly backup routine.
It’s not too hard to see that as the RPO window becomes increasingly narrow and approaches zero (that is, an RPO target of real-time failover with no data loss permitted), the cost and complexity of an acceptable DR data protection strategy climbs dramatically.
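The RPO comparison described above boils down to a simple inequality, sketched here in Python. The helper names, the optional off-site copy time, and the 15-minute interval are illustrative assumptions of mine; only the 48-hour and 30-minute targets and the nightly schedule come from the discussion.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, copy_time=timedelta(0)):
    """Longest stretch of changes that could be lost: one full interval between
    protection points, plus any delay before a copy is usable off-site."""
    return backup_interval + copy_time


def meets_rpo(backup_interval, rpo, copy_time=timedelta(0)):
    """True when the worst-case loss fits inside the RPO window."""
    return worst_case_data_loss(backup_interval, copy_time) <= rpo


RPO1 = timedelta(hours=48)
RPO2 = timedelta(minutes=30)

nightly = timedelta(hours=24)
frequent = timedelta(minutes=15)   # hypothetical log shipping interval

print("nightly vs RPO1: ", meets_rpo(nightly, RPO1))
print("nightly vs RPO2: ", meets_rpo(nightly, RPO2))
print("frequent vs RPO2:", meets_rpo(frequent, RPO2))
```

Including the copy time matters for a plan like Larry's: a backup that has not yet reached the DR site cannot count toward the RPO.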
RTO: Thinking Ahead
If RPO drives how SharePoint data protection should be approached prior to a disaster, RTO (or Recovery Time Objective) denotes the timeline within which post-disaster farm and data recovery must be completed. To illustrate, let’s turn once again to the DR timeline.
As with the previous RPO example, we now have two RTO targets on the timeline: RTO1 and RTO2. Analogous to the RPO targets, the RTO targets are given in units of time relative to the disaster event. In the case of RTO1, the point in time in question is two hours after a disaster has been declared. RTO2 is designated as t+36 hours, or a day and a half after the disaster has been declared.
In plain English, an RTO target is the maximum amount of time that the recovery of data and functionality can take following a disaster event. If the overall DR plan for your SharePoint farm were to have an RTO that matches RTO2, for instance, you would need to have functionality restored (at an agreed-upon level) within a day and a half. If you were operating with a target that matches RTO1, you would have significantly less time to get everything “up and running” – only two hours.
RTO targets vary for the same reasons that RPO targets vary. If the data that is stored within SharePoint is highly critical to business operations, then RTOs are generally going to trend towards hours, minutes, or maybe even real-time (that is, an RTO that mandates transferring to a hot standby farm or “mirrored” data center for zero recovery time and no interruption in service). For SharePoint data and farms that are less business critical (maybe a publishing site that contains “nice to have” information), RTOs could be days or even weeks.
Just like an aggressive RPO target, an aggressive RTO target is going to limit the number of viable recovery options that can possibly address it – and those options are generally going to lean towards being more expensive and technically more complex. For example, attempting to meet a two hour RTO (RTO1) by restoring a farm from backup tapes is going to be a gamble. With very little data, it may be possible … but you wouldn’t know until you actually tried with a representative backup. At the other extreme, an RTO that is measured in weeks could actually make a ground-up farm rebuild (complete with new hardware acquisition following the disaster) a viable – and rather inexpensive (in up-front capital) – recovery strategy.
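A back-of-the-envelope check along these lines can be sketched in Python. Every number below (200 GB of data, 50 GB per hour of restore throughput, one hour of fixed overhead) is a made-up placeholder rather than a measurement, and the function names are hypothetical; the only values taken from the discussion are the two-hour and 36-hour targets.

```python
from datetime import timedelta


def estimated_recovery_time(data_gb, restore_gb_per_hour, fixed_overhead):
    """Crude estimate: time to restore the data plus a fixed allowance for
    tasks such as mounting media, configuration, and validation."""
    if restore_gb_per_hour <= 0:
        raise ValueError("restore_gb_per_hour must be positive")
    return timedelta(hours=data_gb / restore_gb_per_hour) + fixed_overhead


def meets_rto(estimate, rto):
    """True when the estimated recovery fits inside the RTO window."""
    return estimate <= rto


RTO1 = timedelta(hours=2)
RTO2 = timedelta(hours=36)

# Placeholder inputs; substitute figures from an actual restore test.
estimate = estimated_recovery_time(200, 50, timedelta(hours=1))

print("estimated recovery:", estimate)
print("meets RTO1:", meets_rto(estimate, RTO1))
print("meets RTO2:", meets_rto(estimate, RTO2))
```

An estimate like this is only a screening tool; the throughput and overhead figures are exactly the values a recovery exercise would need to supply.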
Whether or not a specific recovery strategy will meet RTO targets in advance of a disaster is oftentimes difficult to determine without actually testing it. That’s where the value of simulated disasters and recovery exercises comes into play – but that’s another topic for another time.
This post was intended to highlight a common pitfall affecting not only SharePoint DR planning, but DR planning in general. It should be clear by now that I deliberately avoided technical questions and issues to focus on making my point about planning. Don’t interpret my “non-discussion” of technical topics to mean that I think that their place with regard to SharePoint DR is secondary. That’s not the case at all; the fact that John Ferringer and I wrote a book on the topic (the “SharePoint 2007 Disaster Recovery Guide”) should be proof of this. It should probably come as no surprise that I recommend our book for a much more holistic treatment of SharePoint DR – complete with technical detail.
There are also a large number of technical resources for SharePoint disaster recovery online, and the bulk of them have their strong and weak points. My only criticism of them in general is that they equate “disaster recovery” to “backup/restore.” While the two are interrelated, the latter is but one aspect of the former. As I hope this post points out, true disaster recovery planning begins with dialog and objective targets – not server orders and backup schedules.
If you conclude your reading holding onto only one point from this post, let it be this: don’t attempt DR until you have RPOs and RTOs in hand!
This post investigates BLOB caching within MOSS and includes a discussion of how the BLOB cache is internally implemented, how flushing operations are carried out, and the differences between single-server (UI) and farm-wide flushes.
Most publishing site administrators have at least some degree of familiarity with the binary large object (BLOB) cache that is supplied by the MOSS platform, but trying to find information describing how it actually works its magic can be tough. This post is an attempt to shed a bit of light on the structure, implementation, and operations of the BLOB cache.
Before going too far, though, I should apologize to the group Motorcycle for twisting the title and lyrics of one of their more popular trance songs (“As The Rush Comes”) for the purpose of this post. I guess I simply couldn’t resist the opportunity to have a little (slightly juvenile) fun.
What is the MOSS BLOB Cache?
Also known as disk-based caching, BLOB caching is one of the three forms of caching supplied/supported by MOSS (not WSS) out-of-the-box (OOTB). Simply put, the BLOB cache is a mechanism that allows MOSS to locally store “larger” list items (images, CSS, and more) within the file system of web front-ends (WFEs) so that these resources can be served to callers more efficiently than round-tripping to the content database each time a request for such a resource is received.
The rest of this post assumes that you’re familiar with the basics of the MOSS BLOB cache. If you aren’t, I’d recommend checking out MSDN (“Caching In Office SharePoint 2007”) for a primer.
Some BLOB Cache Internals
Before discussing how flushes are carried out, it’s worth spending a few minutes talking about the internals of the BLOB cache. Having an understanding of what’s going on “under the hood” helps when explaining some of the peculiarities I’ll be describing a little later in this post.
The MOSS BLOB caching mechanism is implemented primarily with the help of two types (classes) that live within the Microsoft.SharePoint.Publishing namespace: the BlobCache type and its associated BlobCacheEntry type. Each BlobCache object possesses a dictionary that houses BlobCacheEntry instances, and each BlobCacheEntry object represents an SPListItem (SharePoint list item) object that is being stored (cached) in the local file system of the server.
The scope of any BlobCache instance is a single IIS web site, and this is no surprise given that the BlobCache is enabled and disabled through the following (default) entry in the SharePoint web site’s web.config file:
As shown, BLOB caching is disabled by default. Since BLOB caching is enabled and disabled via the web.config file, configuration and “awareness of operation” is largely a manual affair. From within the SharePoint browser UI, it cannot be easily determined if BLOB caching is enabled or disabled in the same way that this information can be determined for page output caching and object caching.
This leads to another point that is also worth mentioning: though an Internet Information Services (IIS) web site and a SharePoint web application are fairly synonymous in the case of a single zone web application, the one-to-one equivalence breaks down when a web application is extended to multiple zones from within Central Administration. In such an extended scenario, each zone (Default, Internet, Intranet, Extranet, and Custom) has its own IIS web site with its own web.config, so it is possible for BLOB caching to be enabled in one zone and disabled in another for the same site collection. The URL used to access a site collection becomes important in this scenario, since it determines which IIS web site (and therefore which web.config) handles the request.
Setting the Wheels in Motion
The <BlobCache /> section that resides within the web.config for an IIS web site is recognized and processed by the MOSS PublishingHttpModule type. As its name implies, this type (which also resides in the Microsoft.SharePoint.Publishing namespace) is an HttpModule. Being an HttpModule, the PublishingHttpModule must be present as a child of the <httpModules /> element within the web.config for an IIS web site in order to carry out its duties. Under normal circumstances, MOSS takes care of this:
The PublishingHttpModule itself is responsible for coordinating a number of caching-related operations for MOSS (more than just BLOB caching), and these operations all begin when an instance of the PublishingHttpModule is initialized at the same time that IIS is setting up the SharePoint/ASP.NET application pipeline. When IIS sets up this pipeline and the PublishingHttpModule.Init method is called, the following actions take place with regard to the BLOB cache:
The site’s web.config configuration settings for the BLOB cache get read and processed.
Assuming settings are found, the PublishingHttpModule creates a new BlobCache object instance to service the (IIS) web site. This happens whether or not BLOB caching is actually enabled. Put another way: all sites for which the PublishingHttpModule is active have a BlobCache object “assigned” to them whether that object is in use (enabled) or not.
The BlobCache instance takes care of a number of startup housekeeping items like computing file paths, setting up internal dictionaries, and ensuring that a consistent and ready state is established to facilitate requests.
Assuming all settings are consistent and valid, the BlobCache object instance registers itself with the hosting environment; it then spins up a separate (independent) thread to rehydrate saved settings (for cached objects), create indexes, and perform some additional startup activities. This “maintenance thread” then stays alive to regularly perform background checks for things like flush requests, site changes, etc. – but only if BLOB caching is enabled within the web.config. If BLOB caching isn’t enabled, no additional work is performed on the thread.
Finally, the BlobCache instance’s RewriteUrl method is registered as a handler for the AuthorizeRequest event of the SharePoint application (HttpApplication) for which the pipeline was established. Since the AuthorizeRequest event fires for each SharePoint web request prior to actual page processing, it gives the BlobCache instance a chance to inspect a requested URL and possibly do something with it – such as serve an object back from the disk-based BLOB cache instead of allowing the request to proceed through “normal channels” (which may involve database object lookup).
At the end of this process, a BlobCache object exists for all publishing sites (that is, sites where the PublishingHttpModule is active). Again, this happens whether or not BLOB caching is actually enabled for the IIS site … though the BlobCache instance will only process requests (that is, perform useful actions in the RewriteUrl method) if it has been enabled to do so via the appropriate web.config setting.
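The "always present, only active when enabled" behavior can be pictured with a small Python sketch. This is an illustration of the idea rather than the actual MOSS implementation: the class name, the URL-keyed dictionary, and the file-extension pattern are all assumptions of mine (the real cache dictionary is keyed by local file path, as described in the next section).

```python
import re


class BlobCacheSketch:
    """Toy stand-in for a per-IIS-site BLOB cache object."""

    def __init__(self, enabled, path_pattern):
        self.enabled = enabled
        # Hypothetical pattern deciding which URLs are cache candidates.
        self.path_regex = re.compile(path_pattern, re.IGNORECASE)
        # Simplification: uppercased URL path -> local cache file path.
        self.entries = {}

    def rewrite_url(self, url_path):
        """Return a local cache file to serve, or None to let the request
        continue through the normal pipeline."""
        if not self.enabled:
            return None                      # object exists, but does nothing
        if not self.path_regex.search(url_path):
            return None                      # not a cacheable asset type
        return self.entries.get(url_path.upper())


cache = BlobCacheSketch(enabled=False, path_pattern=r"\.(gif|jpg|png|css|js)$")
cache.entries["/IMAGES/LOGO.PNG"] = r"E:\MOSS\BLOBCache\1\IMAGES\LOGO.PNG.cache"

print(cache.rewrite_url("/images/logo.png"))   # disabled: falls through
cache.enabled = True
print(cache.rewrite_url("/images/logo.png"))   # enabled: served from disk
```

The sketch leaves out what happens on a cache miss; it only shows why a disabled cache object costs next to nothing per request.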
BLOB Cache File System Structure
The following image illustrates the file system of a typical server that is implementing BLOB caching. In the case of this server, the BLOB cache location has been set to E:\MOSS\BLOBCache within the web.config file of each IIS web site utilizing the cache:
Within the E:\MOSS\BLOBCache folder are two subfolders named 748546212 and 1553899298. Each of these folders houses BLOB cache content for a different IIS site; each web site for which BLOB caching is enabled ends up with its own folder. The folder names (for example, 748546212) are nothing more than each web site’s ID value as assigned by IIS. These ID values are readily visible within the Internet Information Services (IIS) Manager snap-in, making it easy to correlate folders with their associated IIS web sites.
Within each BLOB cache subfolder (web site folder) are three files that are maintained by MOSS; more specifically, they’re maintained by the BlobCache object instance servicing the web site. These files are critical to the operation of the BLOB cache, and they (primarily) serve to persist critical BlobCache variables and state during application pool shutdowns (when the BlobCache object is destroyed):
change.bin: This file contains serialized change tokens (SPChangeToken) for objects being cached in the local file system. These tokens allow the BlobCache maintenance thread to query the content source(s) and subsequently update the contents of the BLOB cache with any items that are identified as having changed since the last maintenance sweep.
dump.bin: This file contains a serialized copy of the BlobCache’s cache dictionary. The dictionary maintains information for all objects being tracked and maintained by the BlobCache object; each key/value pair in the dictionary consists of a local file path (key) and its associated BlobCacheEntry (value).
flushcount.bin: This file contains nothing more than the serialized value of the cacheFlushCount for the BlobCache object. Practically speaking, this value allows a BlobCache to determine if a flush has been requested while it was shut down.
In a properly functioning BLOB cache, these three .bin files will always be present. If any of these files should become corrupt or be deleted, the BlobCache will execute a flush to remedy its inconsistent state.
In a site where web requests have been processed and files have been cached, additional folders and files will be present in addition to the change.bin, dump.bin, and flushcount.bin files. Additional folders (and subfolders) reflect the URL path hierarchy of the site being serviced by the BlobCache object. The files within these (path) folders correspond one-to-one with list items (that is, BLOB assets) that have been requested, and the cached files themselves have the same name as their corresponding list items with the addition of a .cache extension.
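For example, once an image named test.jpeg under the PublishingImages path of the http://www.myurl.com site has been requested and cached:
The BLOB cache folder servicing the http://www.myurl.com site within the server’s file system will have a subfolder within it named PUBLISHINGIMAGES.
The PUBLISHINGIMAGES subfolder will have a file named TEST.JPEG.cache.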
Small side note which may be evident: the BlobCache object creates all cache-resident paths and filenames (save for the .cache extension) in uppercase.
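The naming rules just described (site ID folder, URL path hierarchy, uppercase names, .cache extension) can be sketched as a small Python helper. The function name is hypothetical, and pairing the 748546212 site ID with a /PublishingImages/test.jpeg request is purely for illustration.

```python
from pathlib import PureWindowsPath


def blob_cache_file(cache_root, iis_site_id, url_path):
    """Build the expected cache file path for a server-relative URL path."""
    # Path segments are uppercased; the .cache extension is appended as-is.
    segments = [segment.upper() for segment in url_path.split("/") if segment]
    if not segments:
        raise ValueError("url_path must name a file")
    folder = PureWindowsPath(cache_root, str(iis_site_id), *segments[:-1])
    return folder / (segments[-1] + ".cache")


path = blob_cache_file(r"E:\MOSS\BLOBCache", 748546212, "/PublishingImages/test.jpeg")
print(path)
```

A helper like this can be handy when checking whether a particular asset has actually landed in the cache on a given WFE.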
What Are the Mechanics of a Flush?
The BlobCache can flush itself if it detects any internal problems (for example, one or more of its .bin files is missing or corrupt), but the process can also be requested by an external source or event. The actual BLOB cache flush process is relatively straightforward and follows this progression (assuming the BLOB cache has a working folder; that is, it hasn’t somehow been deleted):
The BlobCache acquires a writer lock for its working folder to prevent other operations during the flush that’s about to be conducted.
The BlobCache attempts to move its working folder to a temporary location – a new folder identified by a freshly generated globally unique identifier (GUID) string – in preparation for the flush.
If the previous folder move (to the temporary “GUID folder”) succeeded, the BlobCache attempts to delete the temporary folder. If the previous move attempt failed, the BlobCache attempts an in-place deletion of the working folder.
If the folder deletion attempt fails, the BlobCache waits two seconds before attempting the folder deletion operation once again. If the deletion fails a second time, the BlobCache leaves the temporary folder (or the original folder if the folder move failed in step #2) alone and proceeds.
The BlobCache performs internal housekeeping to clean up dictionaries, reset tracking variables, create a new BLOB cache subfolder (again, folder name is derived from the IIS site ID), and write out a new set of state files (change.bin, dump.bin, and flushcount.bin) to the folder.
With everything cleaned-up and ready to go, the BlobCache releases its Mutex writer lock and normal operations resume.
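The progression above can be sketched in a few lines of Python. Treat this as a rough model under stated assumptions rather than the BlobCache implementation: the flush and _try_delete names are hypothetical, a plain threading.Lock stands in for the writer lock, the state files are written empty, and the internal housekeeping of dictionaries and tracking variables is omitted.

```python
import shutil
import tempfile
import threading
import time
import uuid
from pathlib import Path

STATE_FILES = ("change.bin", "dump.bin", "flushcount.bin")


def _try_delete(folder: Path, retry_delay: float) -> bool:
    """Try to delete the folder, retrying once after a pause."""
    for attempt in range(2):
        try:
            shutil.rmtree(folder)
            return True
        except OSError:
            if attempt == 0:
                time.sleep(retry_delay)
    return False  # the leftover folder is simply left alone


def flush(working_folder: Path, lock: threading.Lock, retry_delay: float = 2.0) -> None:
    """Model of the flush steps; the lock is a stand-in for the writer lock."""
    with lock:  # acquired up front, released once the flush is complete
        # Move the working folder to a freshly named "GUID folder".
        temp_folder = working_folder.with_name(str(uuid.uuid4()))
        try:
            working_folder.rename(temp_folder)
            target = temp_folder
        except OSError:
            target = working_folder  # move failed, so delete in place instead
        _try_delete(target, retry_delay)
        # Recreate the working folder and write a new set of (empty) state files.
        working_folder.mkdir(exist_ok=True)
        for name in STATE_FILES:
            (working_folder / name).write_bytes(b"")


# Demo against a throwaway folder named after an IIS site ID.
root = Path(tempfile.mkdtemp())
cache = root / "1553899298"
(cache / "PUBLISHINGIMAGES").mkdir(parents=True)
(cache / "PUBLISHINGIMAGES" / "TEST.JPEG.cache").write_bytes(b"cached bytes")
flush(cache, threading.Lock())
print(sorted(p.name for p in cache.iterdir()))  # ['change.bin', 'dump.bin', 'flushcount.bin']
```

Moving the folder before deleting it means the working folder path is free for the new cache right away, even if the delete itself is slow or fails.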
Single-Server Flush Versus Farm-Wide Flush
I mentioned that an external source or event can request a flush. A flush is typically requested in one of two ways:
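A single-server flush can be requested from within the SharePoint browser UI via the Site Collection Administration column’s “Site collection object cache” link.
A farm-wide flush can be requested via STSADM or through a tool like the BlobCacheFarmFlush solution.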
A single-server flush request is executed through the SharePoint browser UI on the ObjectCacheSettings.aspx application page. The relevant portion of that page appears below:
A request that is made through the ObjectCacheSettings.aspx page results in a direct call to the BlobCache object servicing the associated IIS site (and working folder) on the server receiving the postback (flush) request. Once the FlushCache call is made, the BlobCache carries out the flush as previously described.
A farm-wide flush request, on the other hand, is carried out in a very different fashion. The following is a section of the BlobCacheFarmFlush.aspx page from the BlobCacheFarmFlush solution:
A farm-wide flush is executed by incrementing a custom property value (named blobcacheflushcount) on the target site collection’s parent SPWebApplication. A change in this property value propagates to all servers since the affected SPWebApplication.Properties collection is updated and maintained in the SharePoint farm configuration database. Each BlobCache object servicing a site collection under the affected SPWebApplication picks up the property change and carries out a flush on the working folder it is responsible for managing.
Request Mechanism Impact on Flush Process
As you might expect, the choice of flush request mechanism (single-server versus farm-wide) has a profound effect on what actually happens during the flush process.
Consider a MOSS farm that has two WFEs (MOSSWFE1 and MOSSWFE2) serving up page requests for a single site collection. The site collection is exposed through an IIS web site on each server with a URL of http://internal.samplesite.com, and this URL is associated with the default web application. The site collection is also exposed through a web application that has been extended to the Internet zone, and its IIS site has a URL of http://www.samplesite.com. BLOB caching is enabled on both servers for each of the two IIS web sites, so a total of four working folders (2 servers * 2 sites) are in play for BLOB caching purposes. A (simplified) visual representation looks something like this:
Each of the aforementioned IIS web sites is represented by circled numbers 1 through 4 in the diagram above, while the configuration database is represented by a circled number 5; I’ll be referring to these (numbers) in the descriptions that follow. Pay attention, too, to the IDs for each of the two IIS sites on each server (748546212 for the Internet zone and 1553899298 for the default zone).
Requesting a single-server flush via the SharePoint browser UI results in a request to (or rather, through) one site on one server. Prior to such a request, let’s look at how the BLOB cache might appear on MOSSWFE1:
As you can see, the BLOB cache folders for both IIS sites on MOSSWFE1 (that is, #1 and #2 in the previous farm diagram) have cached items in them. The http://www.samplesite.com (#1) site has a “MISCELLANEOUS SHOTS” subfolder (which will have one or more cached resources in it), and the internal.samplesite.com site (#2) has a “BRIAN HEATHERS WEDDING” subfolder (also with cached resources).
For the sake of discussion, let’s say that a single-server BLOB cache flush request is made against MOSSWFE1 through the site collection via #2 (the internal.samplesite.com site). Once the flush has been executed, the BLOB cache folder structure would appear as follows:
Notice that the “BRIAN HEATHERS WEDDING” subfolder is gone from the site with ID 1553899298 (internal.samplesite.com, or #2). Further examination of the folder would also confirm that all .bin files had been reset – a clear sign that a flush had taken place. The cache folder for the other site at 748546212 (http://www.samplesite.com, or #1), on the other hand, remains unchanged. Each of the BLOB cache folders (#3 and #4) on MOSSWFE2 also remains unaffected.
A single-server flush, therefore, is not only restricted to a single server (MOSSWFE1 in this example), but it also impacts only the specific IIS site (or SharePoint zone) through which the flush request is made. In the case of the example above, a site administrator requesting a BLOB cache flush through http://internal.samplesite.com has no impact whatsoever on any of the cached files for http://www.samplesite.com.
This can have significant implications in many Internet publishing scenarios where publicly facing sites (zones) only permit anonymous access for security reasons. In such situations, no OOTB mechanism exists to actually permit a flush request for the public zone/site given that such a flush is a privileged operation available only to site collection administrators.
Thankfully, there is a way to address this problem …
In a farm-wide flush, the point of origin for the change that initiates a flush is #5 – the farm configuration database. As described earlier in this post, the blobcacheflushcount property on the SPWebApplication (web application) that houses the target site collection (in the case of the BlobCacheFarmFlush solution) is incremented. When the property is incremented, the BlobCache instances servicing the IIS sites under the SPWebApplication detect the property value change and carry out a flush.
Examining the file system for sites #3 and #4 on MOSSWFE2 prior to a farm-wide flush, we might see the following folder structure:
Once a farm-wide flush has been executed via STSADM or through a tool like the BlobCacheFarmFlush solution, the BLOB cache area of the file system (for sites #3 and #4) on MOSSWFE2 would appear like this:
A review of MOSSWFE1 would reveal the same file system changes; BLOB cache folders for #1 and #2 would also be reset.
Unlike the single-server BLOB cache flush via the SharePoint browser UI, a farm-wide flush impacts all WFEs in the farm serving up the site collection. Arguably the more important (and non-obvious) difference, though, is that the farm-wide flush impacts all zones/IIS sites for the web application serving the site collection. In the case of the example above, a farm-wide flush request through any of the available URLs on either server results in BLOB caches for #1, #2, #3, and #4 being flushed. This tends to make a farm-wide flush the preferred flush mechanism for the publishing site example I cited earlier (where public access occurs through an anonymous-only zone/site).
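The difference in scope between the two mechanisms can be summarized with a small Python model of the example farm. The WorkingFolder type and both functions are hypothetical names of my own. Which of #3 and #4 belongs to which URL on MOSSWFE2 isn't spelled out above, so the ordering here (mirroring MOSSWFE1) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkingFolder:
    number: int        # circled number in the farm diagram
    server: str
    url: str
    iis_site_id: int


# All four working folders belong to the same web application.
FARM = [
    WorkingFolder(1, "MOSSWFE1", "http://www.samplesite.com", 748546212),
    WorkingFolder(2, "MOSSWFE1", "http://internal.samplesite.com", 1553899298),
    # Assumption: #3 and #4 follow the same order as #1 and #2.
    WorkingFolder(3, "MOSSWFE2", "http://www.samplesite.com", 748546212),
    WorkingFolder(4, "MOSSWFE2", "http://internal.samplesite.com", 1553899298),
]


def single_server_flush(farm: List[WorkingFolder], server: str, url: str) -> List[int]:
    """Only the IIS site that received the request, on that one server, is flushed."""
    return [f.number for f in farm if f.server == server and f.url == url]


def farm_wide_flush(farm: List[WorkingFolder]) -> List[int]:
    """Every zone on every server under the web application is flushed."""
    return [f.number for f in farm]


print(single_server_flush(FARM, "MOSSWFE1", "http://internal.samplesite.com"))  # [2]
print(farm_wide_flush(FARM))  # [1, 2, 3, 4]
```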
A Watch-Out with Farm-Wide Flush Requests
There is one additional point that should be made with regard to farm-wide flushes. In order for a flush to take place on a WFE, the IIS application pool servicing the targeted web application must be running. If the application pool isn’t running (hasn’t yet been started or perhaps has shut down due to a lack of requests), it will appear that the flush had “no effect” on the server.
The reason for this is relatively straightforward. As described towards the beginning of this post, BlobCache object instances and their associated maintenance threads are created when IIS establishes a SharePoint pipeline (and SPHttpApplication) for request processing. If this pipeline isn’t yet ready to service requests for a targeted web application (perhaps because the IIS worker process hasn’t started up or the application pool was recycled but not “primed”), then the SPWebApplication’s blobcacheflushcount property change won’t be detected at the time it is altered. No maintenance thread = no property change detection = no flush.
Since the cacheFlushCount for each BLOB cache is serialized and tracked via the flushcount.bin file, though, detection of the web application’s flush property value change occurs as soon as the BlobCache object is instantiated at the time of pipeline setup. The result is that a BLOB cache flush occurs as soon as the worker process or new application domain (and by extension, the BlobCache instance and its maintenance thread) spins up to begin servicing requests.
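The detection logic just described amounts to comparing a persisted counter against the web application’s property. Here is a minimal Python model of that idea. The ConfigDatabase and BlobCacheModel classes are hypothetical stand-ins of my own, not SharePoint types, and the real maintenance thread’s polling is reduced to a single method call.

```python
class ConfigDatabase:
    """Stand-in for the farm configuration database (#5 in the diagram)."""

    def __init__(self) -> None:
        self.blobcacheflushcount = 0


class BlobCacheModel:
    """Hypothetical model of one BlobCache instance and its persisted flush count."""

    def __init__(self, name: str, persisted_flush_count: int = 0) -> None:
        self.name = name
        # Models the value serialized to flushcount.bin, which survives restarts.
        self.persisted_flush_count = persisted_flush_count
        self.flushes = 0

    def check_for_flush(self, config: ConfigDatabase) -> bool:
        """Models the maintenance thread noticing a changed property value."""
        if config.blobcacheflushcount == self.persisted_flush_count:
            return False
        self.flushes += 1
        self.persisted_flush_count = config.blobcacheflushcount
        return True


config = ConfigDatabase()
running = BlobCacheModel("MOSSWFE1")   # application pool is up
config.blobcacheflushcount += 1        # a farm-wide flush is requested
print(running.check_for_flush(config))  # True

# MOSSWFE2's application pool starts later; its flushcount.bin still holds 0.
late = BlobCacheModel("MOSSWFE2", persisted_flush_count=0)
print(late.check_for_flush(config))  # True
print(late.check_for_flush(config))  # False
```

Because the comparison uses the persisted value, a server that was offline when the property changed still flushes once, the first time it checks.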
It is my hope that this overview provides you with some insight into the internals of the MOSS BLOB cache, as well as a basis for understanding how flush mechanisms differ. As always, I welcome any feedback or questions you might have.
This post explores the SPWebService’s ApplyApplicationContentToLocalServer method, the constraints one faces when using it, and an alternative to its use when updating application page sitemap files.
Caching capabilities that are available (or exposed) through MOSS are something I spend a fair number of working hours focusing on. MOSS publishing farms can make use of quite a few caching options, and wise administrators find ways to leverage them all for maximum scalability and performance. While helping a client work through some performance and scalability issues recently, I ran into some annoying problems with disk-based caching – also known as BLOB (Binary Large OBject) caching. These problems inspired me to create the BlobCacheFarmFlush solution that I’ve shared on CodePlex, and it was during the creation of this solution that I wrangled with the ApplyApplicationContentToLocalServer method.
The BlobCacheFarmFlush solution itself has a handful of moving parts, and the element I’m going to focus on in this post is the administration page (BlobCacheFarmFlush.aspx) that gets added to the farm upon Feature activation. In particular, I want to share some of the lessons I learned while figuring out how to get the page’s navigational (breadcrumb) support operating properly.
Unlike “standard” content pages that one might deploy through a SharePoint Feature or solution package, application pages (also called “layouts pages” because they go into the LAYOUTS folder within SharePoint’s 12 hive) don’t come with wired-up breadcrumb support. An example of the type of breadcrumb to which I’m referring appears below (circled in red):
Unless additional steps are taken during the installation of your application pages (beyond simply placing them in the LAYOUTS folder), breadcrumbs like the one shown above will not appear. It’s not that application pages (which derive from LayoutsBasePage or UnsecuredLayoutsBasePage) don’t include support for breadcrumbs – they do. Breadcrumbs fail to show because the newly added application pages themselves are not integrated into the sitemap files that describe the navigational hierarchy of the layouts pages.
Wiring Up Breadcrumb Support
Getting breadcrumbs to appear in your own application pages requires that you update the layouts sitemap files for each of the (IIS) sites serving up content on each of the SharePoint web front-end (WFE) servers in your farm. The files to which I’m referring are named layouts.sitemap and appear in the _app_bin folder of each IIS site folder on the WFE. An example of one such file (in its _app_bin folder) appears below.
I’m a “best practices” kind of guy, so when I was doing research for my BlobCacheFarmFlush solution, I was naturally interested in trying to make the required sitemap modifications in a way that was both easy and supported. It didn’t take much searching on the topic before I came across Jan Tielens’ blog post titled “Adding Breadcrumb Navigation To SharePoint Application Pages, The Easy Way.” In his blog post, Jan basically runs through the scenario I described above (though in much greater detail than I presented), and he mentions that another reader (Brian Staton) turned him on to a very simple and straightforward way of making the required sitemap modifications. I’ll refer you to Jan’s blog post for the specifics, but the two-step quick summary goes like this:
Create a layouts.sitemap.*.xml file that contains your sitemap navigation additions and deploy it to the LAYOUTS folder within SharePoint’s 12 hive on a server.
Execute code that implements one of the two approaches shown below (typically on Feature activation):
// Approach #1: Top-down starting at the SPFarm level
// Approach #2: Applying to the sites within an SPWebApplication
This isn’t much code, and it’s pretty clear that the magic rests with the ApplyApplicationContentToLocalServer method. This method carries out a few operations, but the one in which we’re interested involves taking the new navigation nodes in the layouts.sitemap.*.xml file and integrating them into the layouts.sitemap file for each IIS site residing under a target SPWebService instance. With the new nodes (which tie the new application pages into the navigational hierarchy) present within each layouts.sitemap file, breadcrumbs appear at the top of the new application pages when they are rendered.
I took this approach for a spin, and everything looked great! My sitemap additions were integrated as expected, and my breadcrumb appeared on the BlobCacheFarmFlush.aspx page. All was well ... until I actually deployed my solution to its first multi-server SharePoint environment. That’s when I encountered my first problem.
Problem #1: The “Local” Part of the ApplyApplicationContentToLocalServer Method
When I installed and activated the BlobCacheFarmFlush solution in a multi-server environment, the breadcrumbs failed to appear on my application page. It took a little legwork, but I discovered that the ApplyApplicationContentToLocalServer method has “Local” in its name for a reason: the changes made through the method’s actions only impact the server on which the method is invoked.
This contrasts with the behavior that SharePoint objects commonly exhibit. The changes that are made through (and to) many SharePoint types impact data that is actually stored in SQL Server, and changes made through any farm member get persisted back to the appropriate database and become available through all servers within the farm. The ApplyApplicationContentToLocalServer method, on the other hand, carries out its operations directly against the files and folders of the server on which the method is called, and the changes that are made do not “automagically” appear on or through other farm members.
The Central Administration host server for the farm in which I was activating my Feature wasn’t one of the WFEs serving up my application page. When I activated my Feature from within Central Admin, my navigation additions were incorporated into the affected sites on the local (Central Admin) host … but the WFEs serving up actual site pages (and my application page) were not updated. Result: no breadcrumb on my application page.
This issue is one of those problems that wouldn’t normally be discovered in a typical development environment. Most of the SharePoint developers I know do their work within a virtual machine (VM) of some sort, so it’s not until one moves out of such an environment and into a multi-server environment that this type of deployment problem even makes itself known. This issue only serves to underscore how important it is to test Features and solutions in a typical target deployment environment before releasing them for general use.
Putting my thinking cap back on, I worked to come up with another way to integrate the sitemap changes I needed in a way that was multi-server friendly. The ApplyApplicationContentToLocalServer method still seemed like a winner given all that it did for a single line of code; perhaps all I needed to do was create and run a one-time custom timer job (that is, schedule a custom SPJobDefinition subclass) on each server within the farm and have that timer job execute the ApplyApplicationContentToLocalServer method locally.
I whipped-up a custom timer job to carry out this action and took it for a spin. That’s when I ran into my second problem.
Problem #2: Rights Required for ApplyApplicationContentToLocalServer Method Invocation
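The ApplyApplicationContentToLocalServer method can only be called successfully by a member of the Local Administrators group on the server where it runs. Prior to the creation of the custom timer job that I was going to use to update the sitemap files on each of the WFEs, I had basically ignored this point. The local administrator requirement quickly became a barricade for my custom timer job, though.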
Timer jobs, both SharePoint-supplied and custom, are executed within the context of the SharePoint Timer Service (OWSTIMER.EXE). The Timer Service runs in an elevated security context with regard to the SharePoint farm, but its privileges shouldn’t extend beyond the workings of SharePoint. Though some SharePoint administrators mistakenly believe that the Timer Service account (also known as the “database access account” or “farm service account”) requires local administrator rights on each server within the SharePoint farm, Microsoft spells out that this is neither required nor recommended.
The ApplyApplicationContentToLocalServer method works during Feature activation when the activating user is a member of the Local Administrators group on the server where activation is taking place – a common scenario. The process breaks down, however, if the method call occurs within the context of the SharePoint Timer Service account because it isn’t (or shouldn’t be) a member of the Local Administrators group. Attempts to call the ApplyApplicationContentToLocalServer method from within a timer job fail and result in an “Access Denied” message being written to the Application Event Log. A quick look at the first section of code inside the method itself (using Reflector) makes this point pretty clearly:
throw new SecurityException(SPResource.GetString("AccessDenied", new object[0]));
This revelation told me that the ApplyApplicationContentToLocalServer method simply wasn’t going to cut the mustard for my purposes unless I wanted to either (a) require that the Timer Service account be added to the Local Administrators group on each server in the farm, or (b) require that an administrator manually execute an STSADM command or custom command line application to carry out the method call. Neither of these was acceptable to me.
Since I couldn’t use the ApplyApplicationContentToLocalServer method directly, I wanted to dissect it to the extent that I could in order to build my own process in a manner that replicated the method’s actions as closely as possible. Performing the dissection (again via Reflector), I discovered that the method was basically iterating through each SPIisWebSite in each SPWebApplication within the SPWebService object being targeted. As implied by its type name, each SPIisWebSite represents a web site within IIS – so each SPIisWebSite maps to a physical web site folder within the file system at C:\Inetpub\wwwroot\wss\VirtualDirectories (by default if IIS folders haven’t been redirected).
Once each of the web site folder paths is known, it isn’t hard to drill down a bit further to each layouts.sitemap file within the _app_bin folder for a given IIS web site. With the fully qualified path to each layouts.sitemap file computed, it’s possible to carry out a programmatic XML merge with the new sitemap data from a layouts.sitemap.*.xml file that is deployed with a custom Feature or solution. The ApplyApplicationContentToLocalServer method carries out such a merge through the private (and obfuscated) MergeAspSiteMapFiles method of the SPAspSiteMapFile internal type, but only after it has created a backup copy of the current layouts.sitemap file using the SPAspSiteMapFile.Copy method.
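The backup-then-merge sequence can be illustrated with a short Python sketch using only the standard library. It is not a port of MergeAspSiteMapFiles, whose internals I’m not reproducing here. The merge_sitemap function is hypothetical, the .bak backup name is my own choice, and the sketch assumes that each added siteMapNode names its parent through a parentUrl attribute and that the files use no XML namespace. The parent page used in the demo is likewise just an illustrative guess.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def merge_sitemap(site_folder: Path, additions_file: Path) -> int:
    """Merge new nodes into <site_folder>/_app_bin/layouts.sitemap.

    Returns the number of nodes added. Hypothetical sketch, not SharePoint code.
    """
    target = site_folder / "_app_bin" / "layouts.sitemap"
    # Back up the current file before touching it (backup name is an assumption).
    shutil.copy2(target, target.with_name(target.name + ".bak"))

    tree = ET.parse(str(target))
    nodes_by_url = {
        node.get("url").lower(): node
        for node in tree.iter("siteMapNode")
        if node.get("url")
    }
    merged = 0
    for addition in ET.parse(str(additions_file)).getroot().iter("siteMapNode"):
        url, parent_url = addition.get("url"), addition.get("parentUrl")
        if not url or not parent_url:
            continue
        parent = nodes_by_url.get(parent_url.lower())
        if parent is None or url.lower() in nodes_by_url:
            continue  # unknown parent, or the node was merged previously
        attributes = {k: v for k, v in addition.attrib.items() if k != "parentUrl"}
        nodes_by_url[url.lower()] = ET.SubElement(parent, "siteMapNode", attributes)
        merged += 1
    tree.write(str(target), encoding="utf-8", xml_declaration=True)
    return merged


# Demo with throwaway files standing in for an IIS site folder.
site = Path(tempfile.mkdtemp())
(site / "_app_bin").mkdir()
(site / "_app_bin" / "layouts.sitemap").write_text(
    '<siteMap><siteMapNode title="Site Settings" url="/_layouts/settings.aspx"/></siteMap>',
    encoding="utf-8",
)
additions = site / "layouts.sitemap.blobcachefarmflush.xml"
additions.write_text(
    '<siteMap><siteMapNode title="BLOB Cache Farm Flush" '
    'url="/_layouts/BlobCacheFarmFlush.aspx" parentUrl="/_layouts/settings.aspx"/></siteMap>',
    encoding="utf-8",
)
print(merge_sitemap(site, additions))  # 1
print(merge_sitemap(site, additions))  # 0
```

Skipping nodes that are already present keeps the merge safe to run more than once, which matters if a Feature is deactivated and reactivated.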
With an understanding of the process that is carried out within the ApplyApplicationContentToLocalServer method, I proceeded to create my own class that effectively executed the same set of steps. The result was the UpdateLayoutsSitemapTimerJob custom timer job definition that is part of my BlobCacheFarmFlush solution. This class mimics the enumeration of SPWebApplication and SPIisWebSite objects, the backup of affected layouts.sitemap files, and the subsequent XML sitemap merge of the ApplyApplicationContentToLocalServer method. The class is without external dependencies (beyond the SharePoint object model), and it is reusable in its current form. Simply drop the class into a SharePoint project and call its DeployUpdateTimerJobs static method with the proper parameters – typically from the FeatureActivated method of a custom SPFeatureReceiver. The class then takes care of provisioning a timer job instance that will update the layouts.sitemap navigational hierarchy for affected sites on each of the servers within the farm.
As an aside: while putting together the UpdateLayoutsSitemapTimerJob, there were times when I thought I had to be missing something. On a handful of occasions, I found myself thinking, “Certainly there had to be a multi-server friendly version of the ApplyApplicationContentToLocalServer method.” When I didn’t find one (after much searching), I had the good fortune of stumbling upon Vincent Rothwell’s “Configuring the breadcrumb for pages in _layouts” blog post. Vincent’s post predates my own by a hefty two and a half years, but in it he describes a process that is very similar to the one I eventually ended up implementing in my custom timer job. Seeing his post helped me realize I wasn’t losing my mind and that I was on the right track. Thank you, Vincent.
I can sum up the contents of this post pretty simply: when developing application pages that entail sitemap updates, avoid using the ApplyApplicationContentToLocalServer method unless you’re (a) certain that your Feature will be installed into single server environments only, or (b) willing to direct those doing the installation and activation to carry out some follow-up administration on each WFE in the SharePoint farm.
Why does the ApplyApplicationContentToLocalServer method exist? I did some thinking, and my guess is that it is leveraged primarily when service packs, hotfixes, and other additions are configured via the SharePoint Products and Technologies Configuration Wizard. Anytime a SharePoint farm is updated with a patch or hotfix, the wizard is run on each server by a local administrator.
An examination of the LAYOUTS folder on one of my farm members provided some indirect support for this notion. In my LAYOUTS folder, I found the layouts.sitemap.search.xml file, and it was dated 3/25/2008. I believe (I’m not positive) that this file was deployed with the SharePoint Infrastructure Updates in the middle of 2008, and those updates introduced a number of new search admin pages for MOSS. Since the contents of the layouts.sitemap.search.xml file include quite a few new search-related navigation nodes, my guess is that the ApplyApplicationContentToLocalServer method was leveraged to merge the navigation nodes for the new search pages when the configuration wizard was run.
In the meantime, if you happen to find a way to use this method in a multi-server deployment scenario that doesn’t involve the configuration wizard, I’d love to hear about it! The caveat, of course, is that it has to be a best-practices approach – no security changes, no extra manual work/steps for farm administrators, etc.
SharePointInterface.com gets started with a statement of intention and some ground rules.
When software architects and developers are relatively certain of the behaviors they desire from their code, but uncertain of how those behaviors should be implemented concretely, it is common to begin with the creation of one or more interfaces. An interface serves as the contract between the consumer and the behind-the-scenes implementation. It’s a time-tested and proven way of moving forward when many details are still unknown.
In a weird sort of way, this first post adheres to the pattern just described.
As I describe in my About section, this blog is an attempt to give something back to the SharePoint community and those within it who have contributed so much of their time, expertise, and insight to “the cause.” The details of how this blog will evolve are the subject of speculation (at least by me if no one else), but I do know that I have plenty to share. In addition to the standard SharePoint fare that most of us SharePoint professionals wrangle with on a daily basis, I have done some diving in areas of SharePoint that I haven’t (at the time of writing this) seen covered anywhere. The trick, then, becomes discussing those topics in sufficient depth and in a timely fashion.
Those who know me well will tell you that I can be rather critical of certain portions of online SharePoint content. Don’t get me wrong: there are some fantastic bloggers and professionals operating in the SharePoint space, and I (and others) have learned a great deal from these generous folks. At the same time, the SharePoint blogosphere is filled with more than its fair share of posters who blatantly copy the well-written posts of others, admins who report “solutions” with limited (or no) understanding of the problems they’re supposedly solving, junior developers who post sample code or advice that may work but generates even greater (potentially unseen) issues, and people who do nothing more than link to other posts (with no value added) in the hopes of boosting rankings.
In order to avoid inadvertently falling into the latter categories of blogger I just described, I’m laying down a few guiding principles, goals, and ground rules. I intend to stick to these. I tend to be my own harshest critic when it comes to abiding by rules, but readers have free license to call me on the carpet in the event that I start doing something questionable:
First and foremost, I believe in giving credit where credit is due. I’m not in the habit of repackaging information others have made available through their own blogs, but in the event that I leverage or incorporate materials I picked up elsewhere, I’ll cite sources and link to them. I’m also inclined to drop the author(s) a note to let them know that I’ve cited their work.
In the event that I propose a solution to a problem, I’ll also do my best to explain my understanding of the problem and the factors contributing to it. Where there are gaps in my knowledge (that I’m aware of), I’ll clearly state them. In short: I’ll do what I can to provide a thorough analysis and perform due diligence.
If I supply code, it will be documented and written to best practice standards as I understand them. If there are watch-outs or factors that should be considered before implementing the code, I’ll state them.
In the event that an outside post or topic becomes the focus of one of my own, I’ll make every attempt to add value beyond simply linking to it. Some posts may not warrant much additional verbiage (because they’re highly important in and of themselves), but I hope to be able to provide additional insight and personal tie-in points in most cases.
An aside: years ago, my friends and I used to have a lot of fun with role-playing games (RPGs). Shadowrun was our game of choice (not the PC game translation, but the original RPG), but we did play some Dungeons & Dragons, as well. For those of you who used to (or maybe still do) play D&D, you should know that all of my close friends describe me as having a (painfully) Lawful Neutral alignment. Personally, I’m inclined to agree. That might tell you a little something about my intention to adhere to the guidelines I’ve established :-)
Things to Come
While this post was needed (by my reckoning) to establish some ground rules, it obviously came up short on SharePoint content. My next post should be coming within a week or so and will remedy this. Stay tuned!