I’ve been meaning to do a small write-up on a couple of key Object Cache points, but other things kept trumping my desire to put this post together. I finally found the nudge I needed (or rather, gave myself a kick in the butt) after discussing the topic a bit with Andrew Connell following a presentation he gave at a SharePoint Users of Indiana user group meeting. Thanks, Andrew!
A Brief Bit of Background
As I may have mentioned in a previous post, I’ve spent the bulk of the last two years buried in a set of Internet-facing MOSS publishing sites that are the public presence for my current client. Given that my current client is a Fortune 50 company, it probably comes as no surprise when I say that the sites see quite a bit of daily traffic. Issues due to poor performance tuning and inefficient code have a way of making themselves known in dramatic fashion.
Some time ago, we were experiencing a whole host of critical performance issues that ultimately stemmed from a variety of sources: custom code, infrastructure configuration, cache tuning parameters, and more. It took a team of Microsoft experts, along with professionals working for the client, to systematically address each item and bring operations back to a “normal” state. Though we ultimately worked through a number of different problem areas, one area in particular stood out: the MOSS Object Cache and how it was “tuned.”
What is the MOSS Object Cache?
Publishing sites make use of the Object Cache without any intervention on the part of administrators. By default, a publishing site’s Object Cache receives up to 100MB of memory for use when the site collection is created. This allocation can be seen on the Object Cache Settings site collection administration page within a publishing site:
Note that I said that up to 100MB can be used by the Object Cache by default. The size of the allocation simply determines how large the cache can grow in memory before item ejection, flushing, and possibly compaction take place. The maximum cache size isn’t a static allocation, so allocating 500MB of memory, for example, won’t deprive the server of 500MB of memory unless the amount of data going into the cache grows to that level. I’m taking a moment to point this out because I wasn’t (personally) aware of it when I first started working with the Object Cache. It also becomes relevant in a story I’ll be telling in a bit.
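To make the "ceiling, not reservation" idea concrete, here is a toy model in Python. It is only an illustration: the `BoundedCache` class, its byte-based accounting, and its oldest-first ejection policy are my own assumptions, not a description of how MOSS implements the Object Cache.

```python
from collections import OrderedDict


class BoundedCache:
    """Toy cache with a size ceiling. NOT the MOSS Object Cache implementation."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes   # the configured ceiling
        self.used_bytes = 0          # memory actually consumed so far
        self.evictions = 0           # entries ejected to stay under the ceiling
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, key: str, value: bytes) -> None:
        # Replacing an existing key first releases the space it was using.
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.used_bytes += len(value)
        # Ejection only starts once the ceiling is exceeded
        # (oldest entry first is an assumed policy for this sketch).
        while self.used_bytes > self.max_bytes and self._items:
            _, old_value = self._items.popitem(last=False)
            self.used_bytes -= len(old_value)
            self.evictions += 1


cache = BoundedCache(max_bytes=500)
cache.put("nav", b"x" * 100)
print(cache.used_bytes, cache.evictions)   # usage is well below the ceiling

for i in range(5):
    cache.put(f"query-{i}", b"y" * 100)
print(cache.used_bytes, cache.evictions)   # ceiling reached, ejection has begun
```

Nothing is consumed up front: usage tracks what has been stored, and the ceiling only matters once the stored data reaches it.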
Microsoft’s TechNet site has an article that provides pretty good coverage of caching within MOSS (including the Object Cache), so I’m not going to go into all of the details it covers in this post. I will make the assumption that the information presented in the TechNet article has been read and understood, though, because it serves as the starting point for my discussion.
Object Cache Memory Tuning Basics
The TechNet article indicates that two specific indicators should be watched for tuning purposes. Those two indicators, along with their associated performance counters, are:
- Cache hit ratio (SharePoint Publishing Cache/Publishing cache hit ratio)
- Object discard rate (SharePoint Publishing Cache/Total object discards)
The image below shows these counters highlighted on a MOSS WFE where all SharePoint Publishing Cache counters have been added to a Performance Monitor session:
According to the article, the Publishing cache hit ratio should remain above 90% and a low object discard rate should be observed. This is good advice, and I’m not saying that it shouldn’t be followed. In fact, my experience has shown Publishing cache hit ratio values of 98%+ are relatively common for well-tuned publishing sites possessing largely static content.
The “Dirty Little Secret” about the Publishing Cache Hit Ratio Counter
As it turns out, though, the Publishing cache hit ratio counter should come with a very large warning that reads as follows:
WARNING: This counter only resets with a server reboot. Data it displays has been aggregating for as long as the server has been up.
This may not seem like such a big deal, particularly if you’re looking at a new site collection. Let me share a painful personal experience, though, that should drive home how important a point this really is.
I was attempting to do a little Object Cache tuning for a client to help free up some memory and make application pool recycles cleaner, and I wanted to see whether I could adjust the Object Cache allocations for multiple (about 18) site collections downward. We were getting into a memory-constrained position, and a review of the Publishing cache hit ratio values for the existing site collections showed that all sites were turning in 99%+ cache hit ratios. Operating under the (previously described) mistaken assumption that Object Cache memory was statically allocated, I figured that I might be able to save a lot of memory simply by adjusting the memory allocations downward.
Mistaken understanding in mind, I went about modifying the Object Cache allocation for one of the site collections. I knew that we had some data going into the cache (navigational data and a few cross-list query result sets), so I figured that we couldn’t have been using a whole lot of memory. I adjusted the allocation down dramatically (to 10MB) on the site collection and I periodically checked back over the course of several hours to see how the Publishing cache hit ratio fared.
After a chunk of the day had passed, I saw that the Publishing cache hit ratio remained at 99%+. I considered my assumption and understanding about data going into the Object Cache to be validated, and I went on my way. What I didn’t realize at the time was that the actual Publishing cache hit ratio counter value was driven by the following formula:
Publishing cache hit ratio = total cache hits / (total cache hits + total cache misses) * 100%
Note the pervasive use of the word “total” in the formula. In my defense, it wasn’t until we engaged Microsoft and made requests (which resulted in many more internal requests) that we learned the formulas that generate the numbers seen in many of the performance counters. To put it mildly, the experience was “eye opening.”
In reality, the site collection was far from okay following the tuning I performed. It truly needed significantly more than the 10MB allocation I had given it. If it were possible to reset the Publishing cache hit ratio counter or at least provide a short-term snapshot/view of what was going on, I would have observed a significant drop following the change I made. Since our server had been up for a month or more, and had been doing a good job of servicing requests from the cache during that time, the sudden drop in objects being served out of the Object Cache was all but undetectable in the short-term using the Publishing cache hit ratio.
To spell this out even further for those who don’t want to do the math: a highly-trafficked publishing site like one of my client’s sites may service 50 million requests from the Object Cache over the course of a month. Assuming that the site collection had been up for a month with a 99% Object Cache hit ratio, plugging the numbers into the aforementioned formula might look something like this:
Publishing cache hit ratio = 49500000 / (49500000 + 500000) * 100% = 99.0%
50 million Object Cache requests per month breaks down to about 1.7 million requests per day. Let’s say that my Object Cache adjustment resulted in an extremely pathetic 10% cache hit ratio. That means that of 1.7 million object requests, only 170000 of them would have been served from the Object Cache itself. Even if I had watched the Publishing cache hit ratio counter for the entire day and seen the results of all 1.7 million requests, here’s what the ratio would have looked like at the end of the day (assuming one month of uptime):
Publishing cache hit ratio = 49670000 / (49670000 + 2030000) * 100% = 96.1%
Net drop: only about 2.9 percentage points over the course of the entire day!
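For anyone who would rather check the scenario than take it on faith, the arithmetic can be reproduced with a short script. This is a minimal sketch in Python; the function and variable names are mine, and the request volumes are the illustrative figures from the text, not measurements.

```python
def hit_ratio(total_hits: int, total_misses: int) -> float:
    """Cumulative hit ratio as a percentage: hits / (hits + misses) * 100."""
    return total_hits / (total_hits + total_misses) * 100.0


# One month of uptime: 50 million requests at a 99% hit ratio.
month_hits, month_misses = 49_500_000, 500_000

# One further day: 1.7 million requests at a 10% hit ratio.
day_requests = 1_700_000
day_hits = day_requests // 10            # 170,000 served from the cache
day_misses = day_requests - day_hits     # 1,530,000 misses

before = hit_ratio(month_hits, month_misses)
after = hit_ratio(month_hits + day_hits, month_misses + day_misses)
day_only = hit_ratio(day_hits, day_misses)

print(f"before: {before:.1f}%  after: {after:.1f}%  that day alone: {day_only:.1f}%")
```

The gap between the cumulative value and the value for that day alone is the whole problem: the longer the server has been up, the more history there is to bury a bad day.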
Seeing this should serve as a healthy warning for anyone considering the use of the Publishing cache hit ratio counter alone for tuning purposes. In publishing environments where server uptime is maximized, the Publishing cache hit ratio may not provide any meaningful feedback unless the sampling time for changes is extended to days or even weeks. Such long tuning timelines aren’t overly practical in many heavily-trafficked sites.
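If the underlying cumulative hit and miss totals can be captured at two points in time, a short-term view can be derived by differencing them. The sketch below assumes exactly that; the `CacheSample` type and the way the totals are obtained are hypothetical, and I am not claiming that MOSS exposes these totals directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheSample:
    """Hypothetical snapshot of cumulative totals since the last reset."""
    total_hits: int
    total_misses: int


def interval_hit_ratio(earlier: CacheSample, later: CacheSample) -> Optional[float]:
    """Hit ratio (%) for only the requests seen between two snapshots."""
    hits = later.total_hits - earlier.total_hits
    misses = later.total_misses - earlier.total_misses
    if hits < 0 or misses < 0:
        return None  # totals went backwards, so they were reset in between
    requests = hits + misses
    if requests == 0:
        return None  # no cache traffic during the window
    return hits / requests * 100.0


# The one-day scenario from the text, expressed as two snapshots.
start_of_day = CacheSample(total_hits=49_500_000, total_misses=500_000)
end_of_day = CacheSample(total_hits=49_670_000, total_misses=2_030_000)
print(interval_hit_ratio(start_of_day, end_of_day))
```

Because only the difference between snapshots is used, a month of good history no longer masks what happened during the window being examined.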
So, What Happens When the Memory Allocation isn’t Enough?
In plainly non-technical terms: it gets ugly. Actual results will vary based on how memory starved the Object Cache is, as well as how hard the web front-ends (WFEs) in the farm are working on average. As you might expect, systems under greater stress or load tend to manifest symptoms more visibly than systems encountering lighter loads.
In my case, one of the client’s main sites was experiencing frequent Object Cache thrashing, and that led to spells of extremely erratic performance during times when flushes and cache compactions were taking place. The operations I describe are extremely resource intensive and can introduce blocking behavior in the request pipeline. Given the volume of requests that come through the client’s sites, the entire farm would sometimes drop to its knees as the Object Cache struggled to fill, flush, and serve as needed. Until the problem was located and the allocation was adjusted, a lot of folks remained on-call.
First and foremost: don’t adjust the size of the Object Cache memory allocation downwards unless you’ve got a really good reason for doing so, such as extreme memory constraints or some good internal knowledge indicating that the Object Cache simply isn’t being used in any substantial way for the site collection in question. As I’ve witnessed firsthand, the performance cost of under-allocating memory to the Object Cache can be far worse than the potential memory savings gained by tweaking.
Second, don’t make the same mistake I made and think that the Object Cache memory allocation is a static chunk of memory that’s claimed by MOSS for the site collection. The Object Cache uses only the memory it needs, and it will only start ejecting/flushing/compacting the Object Cache after the cache has become filled to the specified allocation limit.
And now, for the $64,000-contrary-to-common-sense tip …
For tuning established site collections and the detection of thrashing behavior, Microsoft actually recommends using the Object Cache compactions performance counter (SharePoint Publishing Cache/Total number of cache compactions) to guide Object Cache memory allocation. Since cache compactions represent the greatest threat to ongoing optimal performance, Microsoft concluded (while working to help us) that monitoring the Total number of cache compactions counter was the best indicator of whether or not the Object Cache was memory starved and in trouble:
Steve Sheppard (a very knowledgeable Microsoft Escalation Engineer with whom I worked and whom I highly recommend) wrote an excellent blog post that details the specific process he and the folks at Microsoft assembled to use the Total number of cache compactions counter in tuning the Object Cache’s memory allocation. I recommend reading his post, as it covers a number of details I don’t include here. The distilled guidelines he presents for using the Total number of cache compactions counter basically break counter values into three ranges:
- 0 or 1 compactions per hour: optimal
- 2 to 6 compactions per hour: adequate
- 7+ compactions per hour: memory allocation insufficient
In short: more than six cache compactions per hour is a solid sign that you need to adjust the site collection’s Object Cache memory allocation upwards. At this level of memory starvation within the Object Cache, there are bound to be secondary signs of performance problems popping up (for example, erratic response times and increasing ASP.NET request queue depth).
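Since the counter is a running total, turning it into a per-hour rate means sampling it twice and dividing by the elapsed time. Here is a minimal Python sketch of that calculation plus the three ranges above. The function names are mine, how the counter is sampled is left out, and the handling of fractional rates (anything above 1 counts as adequate, anything above 6 as insufficient) is my own assumption.

```python
def compactions_per_hour(first_total: int, last_total: int, elapsed_seconds: float) -> float:
    """Convert two readings of a running compaction total into an hourly rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if last_total < first_total:
        raise ValueError("counter went backwards; it was probably reset")
    return (last_total - first_total) * 3600.0 / elapsed_seconds


def classify(rate_per_hour: float) -> str:
    """Map an hourly compaction rate onto the three ranges listed above."""
    if rate_per_hour <= 1:
        return "optimal"                          # 0 or 1 per hour
    if rate_per_hour <= 6:
        return "adequate"                         # 2 to 6 per hour
    return "memory allocation insufficient"       # more than six per hour


# Example: the total rose from 120 to 124 over a 30-minute window.
rate = compactions_per_hour(first_total=120, last_total=124, elapsed_seconds=1800)
print(rate, classify(rate))
```

A longer sampling window gives a steadier rate; a short one can over- or under-state things if compactions happen in bursts.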
We were able to restore Object Cache performance to acceptable levels (and adjust our allocation down a bit), but we lacked good guidance and a quantifiable measure until the Total number of cache compactions performance counter came to light. Keep this in your back pocket for the next time you find yourself doing some tuning!
I owe Steve Sheppard an additional debt of gratitude for keeping me honest and cross-checking some of my earlier statements and numbers regarding the Publishing cache hit ratio. Though the counter values persist beyond an IISReset, I had incorrectly stated that they persist beyond a reboot and effectively never reset. The values do reset, but only after a server reboot. I’ve updated this post to reflect the feedback Steve supplied. Thank you, Steve!