By Woody Hutsell, appICU
I have a point of view about third party caching (particularly as it applies to external systems as opposed to caching at the server with PCI-E) that is different from that of many in the industry. Some will see this as bashing of some particular product, but it is not intended to be that. As far as I know, I am not competing with a third party caching solution at any customer site. My goal here is to start a discussion on third party caching. I will lead with my opinions and hope that others weigh in. I am open to changing my mind on this topic, as I have numerous friends in the industry who stand behind this category.
First, some background. Many years ago, 2003 to be exact, I helped bring a product to market to provide third party caching with RAM SSD. I believed in the product and was able to get many others to believe in the product. What I was not able to do was to get many people to buy the product. As I look at solutions on the market, I can see that companies trying to sell third party caching solutions are encountering the same obstacles and are fixing or working around the problems. Here are some problems I have experienced with third party caching solutions:
1. Writes. The really delicious problem to solve several years ago with a RAM caching appliance was related to write performance. Many storage systems had relatively small write caching capabilities that caused major pain for write intensive applications. A large RAM SSD (at the time I think we were using 128GB RAM) as a write cache was a major problem solver for these environments. Several things have happened to make selling write caching as a solution more difficult:
• RAID systems increasingly offered reasonable cache levels, narrowing down the field of customers that need write caching. At the time we offered this RAM write cache, we thought that Xiotech customers were the perfect target as they did not believe in write caching at the time. Fact is, the combined solution worked out pretty well but was only useful until Xiotech realized that offering their own write cache could solve most customer problems.
• Third party write caching introduces a point of failure into the solution. If you write-cache, you have to be at least as reliable as the storage you are caching; otherwise you have reduced the customer's overall reliability.
• Write caching is nearly impossible if the backend storage array has replication or snapshot capabilities. Arrays with snapshot have to be cache-aware when they snapshot, or else they risk taking a snapshot that is missing writes still held in the cache. I have seen companies try to get around this but most of the solutions look messy to me.
• Putting a third party device from a small company in front of a big expensive product from a big company is a good way for a customer to lose support. We realized early on that the only way for this product to really succeed was to get storage OEMs to certify it and approve it for their environments (we did not do very well at this).
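The snapshot problem described in the list above can be illustrated with a minimal sketch in Python. This is a toy model, not any vendor's implementation: the names BackendArray, WriteBackCache, flush and demo are hypothetical, and blocks are modeled as a simple dictionary keyed by block address.

```python
class BackendArray:
    """Toy storage array that can snapshot whatever it currently holds."""

    def __init__(self):
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data

    def snapshot(self):
        # The array only sees data that has actually reached it.
        return dict(self.blocks)


class WriteBackCache:
    """Toy third party write-back cache sitting in front of the array."""

    def __init__(self, backend):
        self.backend = backend
        self.dirty = {}  # writes acknowledged to the host but not yet on the array

    def write(self, lba, data):
        # Acknowledge immediately; the array does not know about this write yet.
        self.dirty[lba] = data

    def flush(self):
        # Push every dirty block down to the array, then clear the dirty set.
        for lba, data in self.dirty.items():
            self.backend.write(lba, data)
        self.dirty.clear()


def demo():
    array = BackendArray()
    cache = WriteBackCache(array)
    cache.write(0, "header-v2")
    cache.write(1, "payload-v2")

    # Snapshot taken by an array that is not cache-aware: dirty data is missing.
    stale = array.snapshot()

    # Snapshot taken after the cache has been flushed: full data set.
    cache.flush()
    consistent = array.snapshot()
    return stale, consistent
```

In this sketch the first snapshot misses both writes because they are still sitting in the cache. Any real workaround has to coordinate the flush with the array's snapshot schedule, which is where the messiness comes in.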
2. Reads. Given the challenges with write caching it seems to me that most companies today are focused on read caching. Read caching solutions have a long history. Gear 6 was one of the first to take the space seriously and had some limited success with environments such as oil & gas HPC and rendering. Some of the companies that have followed Gear 6 seem to be following in their footsteps with markedly different types of hardware and cost. Here are some issues I see with read caching:
• A third party read-only cache adds a write bottleneck, because writes to the cache have to be subsequently written to the storage; in other words, latency injection. I assume there are architectures that get around this today.
• A third party read-only cache really only makes sense if your controller 1) is poorly cached, 2) does not have fast backend storage, 3) is processor limited, or 4) has inherently poor latency. This may be the real long term problem for this market. Whether you talk about SAN solutions or NAS solutions, all storage vendors today are offering Flash SSD as disk storage. In SAN environments, many vendors can dynamically tier between disk levels (thus implementing their own internal kind of caching). NetApp has Flash PAM cards. Both BlueArc and NetApp can implement read caching. The only hope is that the customer has legacy equipment or poorly scoped their solution such that they need a third party caching product.
• Third party caching creates a support problem. Imagine you are NetApp and the customer calls in and says, “I am having problems with my NetApp storage, can you fix it?” Support says, “Describe the environment.” Customer says “blah…blah…third party cache…NetApp”. NetApp says “that is not a supported environment”. I always saw this as a major limiting factor for third party caching solutions. How do you get the blessing of the array/NAS vendor so that your customer maintains support after placing your box between the servers and the storage?
• Third party read caching solutions cannot become a single point of failure for the architecture.
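The latency trade-off in the read caching list above can be put into rough numbers. The sketch below is a back-of-the-envelope model in Python, assuming a write-through appliance that adds a fixed per-request hop. The function names and every latency figure are hypothetical, chosen only for illustration and not measured from any product.

```python
def effective_read_latency_us(hit_ratio, cache_us, backend_us, hop_us):
    """Average read latency seen by the host through a read cache.

    Hits are served from the cache; misses pay the extra appliance hop
    plus the backend storage latency.
    """
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * cache_us + miss_ratio * (hop_us + backend_us)


def write_through_latency_us(backend_write_us, hop_us):
    """Write latency through a read-only (write-through) cache.

    Every write still has to land on the backend storage, so the
    appliance can only add latency on the write path.
    """
    return hop_us + backend_write_us


# Hypothetical figures: 80% hit ratio, 100 us cache, 5 ms disk reads,
# 500 us cached array writes, 50 us added by the appliance hop.
reads = effective_read_latency_us(0.8, 100, 5000, 50)
writes = write_through_latency_us(500, 50)
```

Under these assumed numbers, reads improve substantially while every write gets slower by the hop. If the array behind the appliance is already well cached or has fast backend storage, the read gain shrinks and only the write penalty remains.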
So, there it is. I am looking forward to some insightful comments and feedback from the industry. As you can see, many of my opinions are based on scars from prior efforts in this segment and are not meant to be a reflection on existing products and approaches.
Not the full view:
This is true if the topology is indeed one server connected to a single storage system. In this case it is obvious that the best solution in terms of complexity and performance is in-server caching (e.g., Fusion-IO).
If the back end storage is a self-managed, dedicated high-end storage system (e.g., EMC), then the best solution is tiering (e.g., FAST).
However, in a topology of multiple servers (e.g., ESX servers or Xen servers) connected to multiple low end storage arrays, the best solution is an in-network appliance (or appliances) that provides both cache and virtualization. Important features such as scalability (of servers and storage) and motion can take place only in this topology.
I think the key piece here is that you cannot separate caching and storage virtualization. The scope of taking on all of the replication, backup, snapshot, etc. features is a bigger pill than just caching, and it implies that the storage behind the virtualization needs to be cheap, feature-poor disk and not expensive arrays.
The main justification for network acceleration is to have cheap (commodity) scaling storage in the back and multiple-scaled virtualized servers in the front. This requires that the storage management take place in conjunction with the acceleration. Virtualization-acceleration synergy also increases the attractiveness of this topology.
Note that the above environment is the description of (private or public) cloud!
Woody (et al),
What is your opinion on a cache acceleration product like XcelaSAN from Dataram? I suspect they may be running into some of the marketing/support issues you mention.
They will be presenting some “real world” results at the upcoming Flash Memory Summit 2011.
Thanks for the post. Dataram has been in the RAM and SSD business for a long time. Without having any intimate details, I am sure they have run into many of these architectural issues as they work with potential customers. As they have been in the third party block caching business since 2008, they are a company I would trust to have come up with some good solutions. I am a fan of Jason Caulkins, Dataram’s CTO, who has been in the SSD business at least as long as I have been. If there are any Dataram supporters in the mix, I would love to hear how Dataram solves some of these challenges.
I asked the company (Dataram) about some of the issues you brought up and they responded much the same as their XcelaSAN FAQ information states:
1. Re: Write caching is nearly impossible if the backend storage array has replication or snapshot capabilities or as they ask on FAQ – “How does the XcelaSAN function with storage subsystem features like snapshot, replication and de-duplication?”
Response: “XcelaSAN is transparent to the operation of these features.”
2. Re: Third party caching creates a support problem or as they ask in FAQ: “Is XcelaSAN qualified to work with my SAN?”
Response: “Dataram is a member of the Storage Network Industry Association (SNIA) and TSANet to facilitate standards and co-operation among vendors. In addition, the most common mid-range SANs are installed in our lab for development, testing, certification and quick problem resolution. All of the hardware components in the XcelaSAN are Enterprise Quality Off-the-Shelf-Components (COTS). These components, along with our operating system are qualified with the most common Fibre Switch and SAN manufacturers.”
Woody, did you catch Jason Caulkins’ talk at the Flash Memory Summit?
I wish I had seen Jason’s presentation. I think I missed seeing it on the agenda, or maybe it was at the same time as my presentation. Thanks for posting Dataram’s position on these topics.
Take a look at Avere Systems’ solution. It does a pretty good job of addressing many of the challenges you mention about write caching. For example, it can be configured to ensure the cache is flushed before a snapshot is scheduled to take place.
On the other hand, there is a lot of truth in all the points you mention about the concerns and risk (perceived or real) of introducing a third party solution in front of a NetApp, EMC, BlueArc, Isilon, etc. It makes for longer sales cycles where risk-to-benefit evaluation takes place on many levels.
It is hard to tell how things will turn out –
Are all these solutions just a temporary band-aid approach that will no longer be needed once everything ends up on SSDs (disk drives are at their performance limits and SSDs are getting cheaper and more reliable and have greater endurance)?
Or, because the need for performance will keep increasing, and with it the need to figure out which hot data to put on even higher performance, more costly new technology, will the need for intelligent caching solutions always be there?
Time will tell…