Start waiting on 3DXP arrays

June 1, 2017

Start Waiting on 3DXP arrays, by Woody Hutsell, AppICU

Let’s get one thing out of the way.  Most storage systems will eventually offer 3DXP.  Why?  Because adding 3DXP SSDs to a storage array will be easy.

A second thing, I think the early usage for 3DXP will flow largely to server vendors (and their suppliers).  This is a major point and central to my thoughts on storage and 3DXP.  In the server, 3DXP reduces cost and increases density versus RAM.

3DXP in external storage will lag expectations until there are major advances in density and price.

For the last 17 years, I have worked in the part of the market that 3DXP external storage solutions will target. For most of those 17 years, I think we could comfortably call this space Tier 0. These are customers whose end-customer satisfaction, missions or revenue are directly tied to the performance of their storage arrays. When I say performance, I really mean latency. They are so latency sensitive that they will not tolerate storage services getting in the way of application performance. I could probably predict, with a high degree of accuracy, which customers in the financial, telecom, defense, government, retail, e-commerce and logistics businesses will be interested in this solution.

These customers are willing to pay for low latency.  Customers in this category bought all RAM solid state storage.  They were early adopters of all flash arrays. They still buy based on latency curves (who delivers predictable low latency at the IOPS level they require).

These are not the customers buying Tier 1 arrays with a full suite of storage services.  They will not tolerate data reduction or storage services if it impacts latency.  These are not the customers buying primarily on cost/capacity though they still have budgets and need a solution that fits that budget.

I love this Tier 0 market, because these customers are solving world class problems and must stay on the bleeding edge of technology to grow their business. These customers will buy 3DXP arrays that deliver on the low latency potential of 3DXP. The phrasing of this sentence is no accident: if the array offers 3DXP but only delivers modest latency improvements, it will be largely ignored.

The first enterprise market to hit it big with flash was inside the server, particularly PCI flash (think Fusion-io). The second enterprise market to hit it big with flash, a few years later, was the Tier 0 external storage market (think Texas Memory Systems (subsequently IBM) and Violin Memory). These splashes were nothing compared to the tsunami of business when all flash arrays entered the Tier 1 market with compelling economics driven by adoption of flash in consumer devices and supported by inline data reduction technologies to further reduce the cost per capacity. These were majority buyers who were confident that the technology wrinkles were ironed out and who by and large wanted better performance than they could get from their disk-based solutions but were very focused on storage services, cost and cost/capacity. They are not Tier 0 buyers, though they won’t go back to disk having tasted the sweet nectar of low latency storage.

Tier 1 customers are unlikely to buy into all 3DXP storage arrays until the cost approaches the cost of flash because for these customers the difference between 120 microseconds of latency and 20 microseconds of latency is not as motivating as the difference between 5-20 milliseconds of latency and ½ a millisecond of latency.  And can you really get 20 microsecond latency on a Tier 1 device loaded with storage services?
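To put rough numbers on that comparison, here is a small sketch that uses only the latency figures quoted above. The function name is just for illustration.

```python
def speedup(before_us, after_us):
    """Return how many times lower the latency is after the change."""
    return before_us / after_us

# Flash array -> 3DXP array: 120 microseconds down to 20 microseconds.
flash_to_3dxp = speedup(120, 20)

# Disk array -> flash array: 5-20 milliseconds down to 0.5 milliseconds.
disk_to_flash_low = speedup(5_000, 500)
disk_to_flash_high = speedup(20_000, 500)

print(f"flash -> 3DXP: {flash_to_3dxp:.0f}x lower latency")
print(f"disk -> flash: {disk_to_flash_low:.0f}x to {disk_to_flash_high:.0f}x lower latency")
print(f"time saved per I/O: {120 - 20} us vs {5_000 - 500} to {20_000 - 500} us")
```

The ratio for 3DXP is respectable, but the absolute time saved per I/O is tiny next to what the move from disk to flash delivered, which is why only the most latency sensitive buyers are likely to pay a premium for it.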

What does this mean for the industry?  The market for 3DXP in external storage arrays will appear vibrant due to product introductions but the revenue that can be directly attributed to 3DXP in external storage will be low until the cost and density make meaningful improvements.  Storage architects are already designing ways to use 3DXP as a RAM replacement/supplement in the storage array.  There is some interesting potential here given the memory requirements for flash metadata and caching and the use of 3DXP as a tier of storage.  These steps are reminiscent of the way flash was gradually introduced into Tier 1 before it became Tier 1, for example in RAID cache backups.  As with the all flash arrays, the all 3DXP arrays custom built for the best latency curve at the right price will start out in the Tier 0 space waiting for the cost and density improvements that bring it to the big time.  This time around, that transition could take much longer than it did with flash based arrays.  Flash arrays benefited massively from the density and cost reductions needed in the consumer space.  3DXP does not appear to have the same tailwinds yet.


Cloud Grid Architecture

June 30, 2016

by Woody Hutsell, AppICU

Prevent cloud failures with grid architecture

Public and private cloud architectures fail with alarming frequency. David Linthicum, with Cloud Technology Partners, wrote in an article – Bracing for the Failure of Your Private Cloud Architecture – for TechTarget’s SearchCloudComputing that a major problem with private cloud deployments results from organizations reusing the same hardware they used for their traditional IT. Specifically, he comments that “hardware requirements for most private cloud operating systems are demanding” and later that “If the hardware doesn’t have enough horsepower, the system will begin thrashing, which causes poor performance and likely a system crash.”

Andrew Froehlich, writing 9 Spectacular Cloud Computing Fails for InformationWeek, extends this thought to the public cloud when he says that one of the three key reasons cloud service providers fail is due to “beginner mistakes on the part of service providers…when the provider starts out or grows at a faster rate than can be properly managed by its data center staff.”

Serving up applications in the cloud is different from traditional IT. Cloud deployments thrive when ease of application deployment is matched by ease of management combined with consistent performance under all workloads. Successful cloud deployments support many demanding applications and customers. With the increasing diversity of hosted applications comes some infrastructure headaches. We often custom tailor our traditional IT environments to meet the needs of a specific application or class of applications.  We know it has certain peaks for online transaction processing or batch processes. We know when we can perform maintenance. With the cloud, success means we have many applications with overlapping (or not) peak performance periods. With the cloud, we may be more likely to see constant use resulting in fewer opportunities to perform maintenance and restructure our storage to balance for intense workloads.

Successful cloud deployments can challenge and break traditional storage from a performance point of view. Traditional storage scales poorly. Whether the traditional storage array uses HDD or hybrid architectures, it will experience the same problem: as the number of I/Os to the system increases, the system performance will degrade rapidly. With an all-HDD system the latency will begin high and rapidly get worse; with a hybrid configuration (SSD + HDD), the system latency will start lower and stay low longer, but then rapidly get worse. When latency degrades, applications and users suffer.
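One simple way to see why latency climbs so sharply as load rises is the textbook single-server queueing model (M/M/1), where mean response time is the service time divided by one minus utilization. This is only an illustrative assumption, not a model of any particular array, and the 5 ms service time is a hypothetical figure.

```python
def mm1_latency_ms(service_ms, iops):
    """Mean response time of an M/M/1 queue, in milliseconds.

    service_ms: mean service time per I/O (a hypothetical figure here).
    iops: offered load in I/Os per second.
    Returns float('inf') once the load reaches or exceeds capacity.
    """
    utilization = iops * service_ms / 1000.0
    if utilization >= 1.0:
        return float("inf")
    return service_ms / (1.0 - utilization)

SERVICE_MS = 5.0  # hypothetical; implies a capacity of 200 IOPS
for iops in (20, 100, 160, 190, 198):
    print(f"{iops:>4} IOPS -> {mm1_latency_ms(SERVICE_MS, iops):7.1f} ms")
```

In this toy model latency merely doubles by 50% utilization, but it is 20 times the service time at 95%. A lower service time moves the knee of the curve further out without removing it.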

Successful cloud deployments can also challenge and break traditional storage from a management point of view. Traditional storage arrays are difficult to configure and deploy. It is not unheard of for initial deployments of scalable traditional storage to take days or sometimes weeks for the system to be tuned so that applications are properly mapped to the right RAID groups. Do you need a RAID group with SSDs; do you need a tiered deployment with SSDs, SAS, and SATA? How many drives are needed in each RAID group?  Should you implement RAID 0, 1, 5 or 6?  Once sized, configured, and deployed, further tweaking of these systems can be administrator intensive. When workloads change, as is the expectation in a cloud deployment, how quickly can you create new volumes and what happens when the performance needed for an application exceeds what the system is capable of delivering? The hard answer is that traditional storage was not designed for the cloud.

Fortunately, IBM has a solution – the IBM FlashSystem A9000, a modular configuration that is also available as the IBM FlashSystem A9000R, a multi-unit rack model. The new IBM FlashSystem family members tackle the performance and management issues caused by successful cloud deployments. Where the cloud needs consistent low latency even as I/O increases, FlashSystem A9000 applies low latency all-flash storage. Where the cloud needs simplified management, the systems apply grid storage architecture.

It all starts with the configuration. FlashSystem A9000 customers do not have to configure RAID groups; the system automatically implements a Variable Stripe RAID within each MicroLatency flash module and a RAID-5 stripe across all of the modules in an enclosure. An administrator configuring the system creates volumes and assigns those volumes to hosts for application use. Every volume’s data is distributed evenly across the grid controllers (this is where the storage services software runs) and the flash enclosures (this is where the data is stored). This grid distribution prevents hot spots and never requires tuning in order to maintain performance. No tuning means substantially less on-going system management. When the rack-based FlashSystem A9000R is expanded it automatically redistributes the workloads across the new grid controllers and flash enclosures.
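The idea of spreading a volume evenly so that no hot spot can form can be sketched with a stable hash. This is a generic illustration, not a description of how FlashSystem A9000 places data; the controller names, the partition count and the `place` function are all hypothetical.

```python
import hashlib
from collections import Counter

def place(volume, partition, targets):
    """Pick a target for one partition of a volume using a stable hash."""
    key = f"{volume}:{partition}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer and wrap it
    # onto the list of available targets.
    return targets[int.from_bytes(digest[:8], "big") % len(targets)]

controllers = ["ctrl-1", "ctrl-2", "ctrl-3"]  # hypothetical names

# Count how many of a volume's partitions land on each controller.
counts = Counter(place("vol-A", p, controllers) for p in range(3000))
for name in controllers:
    print(name, counts[name])
```

Each controller ends up with roughly a third of the partitions without any administrator deciding where data goes. A naive modulo scheme like this one would move most partitions when a target is added, so treat it purely as a picture of even distribution, not of rebalancing.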

When an I/O comes into these new FlashSystem arrays, it is written to three separate grid controllers simultaneously. These I/Os are cached in controller RAM and the write is considered committed from the application’s point of view. In this way, the application is not slowed down by data reduction. Next, the three controllers distribute the pattern reduction, inline data deduplication, and data compression tasks across all the grid controllers, thus providing the best possible data reduction performance before writing the data to the flash enclosure(s). Data can be written across any of the flash enclosures in the system, preserving the grid architecture and distribution of workload. When data is written to flash inside the flash enclosure, it is distributed evenly across the flash in a way that ensures consistent low latency performance. All of this is aided by IBM FlashCore™ technology which provides a hardware only data path inside the flash enclosure during the time data is written persistently to flash. The flash storage is housed in IBM MicroLatency® modules whose massively parallel array of flash chips provides high storage density, extremely fast I/O, and consistent low latency.

Together these technologies are a real blessing for the cloud service provider (CSP). When new customers arrive, CSPs know they can easily allocate new storage to new customers and not worry about special tuning to ensure the best performance possible. When existing customers’ performance demands skyrocket, CSPs know that their FlashSystem A9000-based systems offer enough performance to match the growing requirements of their customers without negatively impacting other customers. And when launching or expanding their businesses, CSPs know that FlashSystem A9000 can eliminate one of the leading causes of cloud offering failures, the inability of storage architectures to scale.

For more information, read the article on Grid Storage Technology and Benefits by Ray Lucchesi of Silverton Consulting.


Our Growing FlashSystem family

April 27, 2016

IBM storage is proud to introduce our new twins, FlashSystem A9000 and FlashSystem A9000R, affectionately known as pod and rack respectively. The twins come from the loving family created by the marriage of FlashSystem (hometown Houston, Texas, USA) and XIV (hometown Tel Aviv, Israel). The twins share the same DNA but have taken on completely different appearances and capabilities.

I have to say, as a member of one of the proud parent teams, the last two years have been a real eye opening experience. Any major system release is an exercise in coordination and collaboration, and this one crossed many time zones and cultures, to say nothing of merging technologies. The marriage of these two groups involved integration of offering management, product marketing, marketing, technical sales, sales, support, development, testing, and sales enablement.

As a quick refresher, IBM acquired XIV in 2008 and Texas Memory Systems (now referred to as FlashSystem) in 2012. XIV’s claim to fame was taking the world’s least reliable disk technology (SATA HDDs) and packaging them into a highly reliable, scalable, and high performance enterprise storage solution. The Texas Memory Systems and FlashSystem claim to fame involved extracting the lowest possible latency from solid state storage media in a shared storage solution. It is obvious why these two solutions would be merged together, isn’t it?

OK, so maybe it isn’t that obvious, so I will explain. Over two years ago, we were looking to the future and envisioning a world where solutions for the cloud service provider market took on increasing importance. The vision really crystallized in 2013 when IBM acquired SoftLayer. As with any initiative of this sort, IBM went through a buy vs build analysis. With the acquisition of Texas Memory Systems only having been recently closed, buying another all-flash array provider was not likely. A quick look across the storage stacks available from within IBM revealed some great options: the software behind IBM SAN Volume Controller, the software behind XIV, and what we now refer to as Spectrum Scale. We were looking for some key features: scalability, because we knew cloud service providers need to be able to grow with their customer base; quality-of-service so that our customers can prevent noisy neighbor problems in multi-tenant environments; multi-tenant management so that those tenants could manage their own logical component of the system; and critically, a team (the people) with the experience and resources to implement full-time data reduction so that we could help cloud service providers lower the cost of their all-flash deployments. When you put it all together, it was obvious that our best match was with the XIV software (and team). XIV, for years, had led IBM’s focus on cloud integration points, including many of the key features mentioned above plus strong links to cloud orchestration solutions from Microsoft, VMware, and OpenStack.

Leaving out the details, suffice it to say there have been many IBMers crossing the Atlantic and Mediterranean in order to bring these new members of our product family to market.

But the strength of the family is not just the individuals…it’s in the family itself and here is where our marriage makes even more sense. As much as we can see the future through our all-flash lenses, it is abundantly clear that customers will take a variety of paths and differing amounts of time to get there. Our combined family includes a true software defined storage capability in Spectrum Accelerate, a capacity-optimized solution with XIV, and a performance solution with the FlashSystem A9000 twins. In addition to sharing a software lineage, these products actually can share licensing. A customer testing the waters with this family could start with a trial deployment of Spectrum Accelerate, then actually buy software licenses on a per capacity basis for Spectrum Accelerate. Those software licenses are then transferable to XIV for low cost capacity and to FlashSystem A9000 for dynamic performance with full time data reduction. In the near future, customers will be able to asynchronously replicate from a FlashSystem A9000 to an XIV, enabling additional cost cutting for disaster recovery deployments.

It’s been quite a journey, literally, getting these two new products to market. But now that they’ve arrived, please join us in welcoming the twins to our growing FlashSystem family!


The Heart of FlashCore Technology

February 19, 2015

by Woody Hutsell, http://www.appICU.com

So, it’s the 1980s or 1990s and you are building a solid state storage device with RAM. You can build it from scratch or you can take a computer with RAM, HDD and processors and a motherboard and turn it into a solid state storage device with a little bit of software. Which do you choose? If you’re in a hurry and have a short term view on the market, you take the fast way. How many companies built RAM solid state storage devices out of computers? Many, but zero that are still in business. Why are they out of business? Inferior product performance, inferior product cost structure, difficult to iterate without real engineering talent? All are reasonable answers. Many had their time in the spotlight but all disappeared into history.

When the engineering team at Texas Memory Systems had to make this decision they did what any self-respecting engineering team would do: they designed a solid state storage device from scratch. No kidding. They never made the RAM chips, but they made just about everything else. Why would you do this? I wish you could have worked with some of the Texas Memory Systems customers in those days. Their interactions with the engineering team are nearly legendary around here. Could there really be customers who would complain about a few microseconds when the rest of the world was dealing in 10s of milliseconds? There were and they did. And these engineers reacted and tuned and shaved off latency.

And that is where the story is kind of awesome, those engineers who dealt with those customers, those engineers are still at the heart, at the core, of the IBM engineering team for FlashSystem today. Now when someone comes in with a great idea, the first questions from these core engineers are what happens to the latency, what happens to the response time curve, how can we reduce the impact? This is not just an engineering goal, this is our engineering culture, the DNA that makes our products what they are.

So in marketing, we call it FlashCore Technology and we even subject it to another descriptor called Hardware Accelerated I/O, but you know now that is really a lie. The core doesn’t start with Flash; it starts with people and a culture built on meeting the demands of the most demanding customers in the world.

If you are looking for the technical depth in FlashCore look here.


Server-Side Caching

January 20, 2012

Woody Hutsell, http://www.appICU.com

Fusion-io recently posted this blog that I wrote:   http://www.fusionio.com/blog/why-server-side-caching-rocks/

I feel strongly that 2011 will be remembered, at least in the SSD industry, for establishing the role of server-side caching using Flash.  I recall soaking in all of the activity at last year’s Flash Memory Summit and being excited about the new ways Flash was being applied to solve customer problems.  It is a great time to be in the market.  I look forward to sharing more of the market’s evolution with you.


Flash Memory Summit Presentation

September 6, 2011

Woody Hutsell, www.appICU.com

For those of you who are interested, here is a link to a presentation that I delivered at the 2011 Flash Memory Summit on “Mission Critical Computing with SSD”.

http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2011/20110810_T1B_Hutsell.pdf


Third Party Caching

August 1, 2011

By Woody Hutsell, appICU

I have a point of view about third party caching (particularly as it applies to external systems as opposed to caching at the server with PCI-E) that is different than many in the industry. Some will see this as bashing of some particular product, but it is not intended to be that. As far as I know, I am not competing with a third party caching solution at any customer site. My goal here is to start a discussion on third party caching. I will lead with my opinions and hope that others weigh in. I am open to changing my mind on this topic as I have numerous friends in the industry who stand behind this category.

First, some background.  Many years ago, 2003 to be exact, I helped bring a product to market to provide third party caching with RAM SSD.  I believed in the product and was able to get many others to believe in the product.  What I was not able to do was to get many people to buy the product.  As I look at solutions on the market, I can see that companies trying to sell third party caching solutions are encountering the same obstacles and are fixing or working around the problems.  Here are some problems I have experienced with third party caching solutions:

1.  Writes.  The really delicious problem to solve several years ago with a RAM caching appliance was related to write performance.  Many storage systems had relatively small write caching capabilities that caused major pain for write intensive applications.  A large RAM SSD (at the time I think we were using 128GB RAM) as a write cache was a major problem solver for these environments.  Several things have happened to make selling write caching as a solution more difficult:

•  RAID systems increasingly offered reasonable cache levels narrowing down the field of customers that need write caching.  At the time we offered this RAM write cache, we thought that Xiotech customers were the perfect target as they did not believe in write caching at the time. Fact is, the combined solution worked out pretty well but was only useful until Xiotech realized that offering their own write cache could solve most customer problems.

•  Third party write caching introduces a point of failure into the solution.  If you write-cache, you have to be at least as reliable as the solution you are caching; otherwise you have reduced the net reliability the customer sees.

•  Write caching is nearly impossible if the backend storage array has replication or snapshot capabilities.   Arrays with snapshot have to be cache aware when they snapshot or else they risk snapshotting without the full data set.  I have seen companies try to get around this but most of the solutions look messy to me.

•  Putting a third party device from a small company in front of a big expensive product from a big company is a good way for a customer to lose support.  We realized early on that the only way for this product to really succeed was to get storage OEMs to certify it and approve it for their environments (we did not do very well at this).

2.  Reads.  Given the challenges with write caching it seems to me that most companies today are focused on read caching.  Read caching solutions have a long history.  Gear 6 was one of the first to take the space seriously and had some limited success with environments such as oil & gas HPC and rendering.  Some of the companies that have followed Gear 6, seem to be following in their footsteps with markedly different types of hardware and cost.  Here are some issues I see with read caching:

•  A third party read-only cache adds a write bottleneck (as writes to the cache have to be subsequently written to the storage), i.e. latency injection.  I assume there are architectures that get around this today.

•  A third party read-only cache really only makes sense if your controller 1) is poorly cached or 2) does not have fast backend storage or 3) is processor limited or 4) has inherently poor latency.  This may be the real long term problem for this market.  Whether you talk about SAN solutions or NAS solutions, all storage vendors today are offering Flash SSD as disk storage.  In SAN environments, many vendors can dynamically tier between disk levels (thus implementing their own internal kind of caching).  NetApp has Flash PAM cards. Both BlueArc and NetApp can implement read caching.  The only hope is that the customer has legacy equipment or poorly scoped their solution such that they need a third party caching product.

•  Third party caching creates a support problem.  Imagine you are NetApp and the customer calls in and says, “I am having problems with my NetApp storage, can you fix it?”  Support says, “Describe the environment.”  Customer says “blah…blah…third party cache…NetApp”.  NetApp says “that is not a supported environment”.  I always saw this as a major limiting factor for third party caching solutions.  How do you get the blessing of the array/NAS vendor so that your customer maintains support after placing your box between the servers and the storage?

•  Third party read caching solutions cannot become a single point of failure for the architecture.
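The reliability point made under write caching can be sketched with the standard series-availability calculation: if every request must pass through both the caching appliance and the array, and the two fail independently, their availabilities multiply. The availability figures below are hypothetical and chosen only to show the effect.

```python
def series_availability(*components):
    """Availability of a chain in which every component must be working."""
    total = 1.0
    for availability in components:
        total *= availability
    return total

ARRAY = 0.9999  # hypothetical availability of the storage array
CACHE = 0.999   # hypothetical availability of an in-line caching appliance

combined = series_availability(CACHE, ARRAY)
print(f"array alone:    {ARRAY:.4%}")
print(f"cache in front: {combined:.4%}")
```

Under these assumptions the combined figure is always below that of the weaker component, which is why an in-line cache from a third party has to clear such a high reliability bar.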

So, there it is. I am looking forward to some insightful comments and feedback from the industry.  As you can see, many of my opinions are based on scars from prior efforts in this segment and are not meant to be a reflection on existing products and approaches.


Tales from the Field

June 22, 2011

by Woody Hutsell, www.appICU.com

Instead of marketing from afar, I have been selling from the trenches, and let me tell you, the world looks very different from this viewpoint.

I have a variety of observations from my first 9 months of working closely with IT end-users:

  1. At least 50% of the IT people I talk to are generally unfamiliar with solid state storage.  These 50% are so busy worrying about backups, replication, storage capacity and virtualization that it would take a whole screaming train full of end users before they would care about performance.  What they are likely to think they know about SSDs is that they are unreliable and don’t have great write performance.  I always ask these end users about performance or interest in SSD and usually get fairly blank looks back.  Don’t get me wrong, their lack of interest in performance or SSD is no reflection on them, just a reflection on their situation.  Maybe they don’t need any more performance than they already get from their storage.  Maybe performance is so far down their list of concerns as to not matter.  Maybe they just can’t budget a big investment in SSD.
  2. Some high percentage of IT buying is done without any real research.  So much for technical marketing.  You could write any number of case studies, brochures and white papers and these guys wouldn’t learn about it unless the sales person sitting across from them drops in at just the right time immediately after the aforementioned train full of end-users has started complaining about performance (and the IT guy happens to have budget to spend on something other than backup, storage capacity, replication or virtualization).
  3. These groups are deploying server virtualization en masse.
  4. These groups are standardizing on low cost storage solutions.  The rush to standardize is driven by the number one reality affecting many IT shops:  they are under staffed and their budgets are constrained.  The lack of staffing means that it is hard to get staff trained on multiple products and life is easier if they can manage multiple components from a single interface.  The lack of budget means that IT buyers have to make compromises when it comes to storage solutions.  Because of item #2 (above), they are reasonably likely to buy storage from their server vendor and often find their way to the bottom of the storage line-up to save money.

You might think these observations would be disheartening, but really I think the story is that SSD is just starting to make its way through to the more mature buyers in the market.  Eventually, I believe that all IT storage buyers will be as familiar with and concerned with protecting application performance as they are with capacity and reliability.

A case in point: I have run into at least two customers where the drive to standardize with VMware and low cost storage is crushing application performance for mission critical applications.  The good news for these IT shops is they have low storage costs and an easy to manage environment (because they have one storage vendor and one server virtualization solution).  The bad news is that their core business is suffering.

From my limited point of view, standardization is something that the IT guys like and the application owners don’t like. You might assume that I think the IT guys are short-sighted, but no, increasingly I am seeing that they just don’t have a choice; they have to standardize or die under a staggering workload and shrinking budget. Something though has to give.

The core business of the first of these operations was risk analysis. This company deployed low-cost storage and had virtualized the entire IT environment with VMware (including the SQL Server database). The entire IT infrastructure ran great for this customer but a mission critical sub-terabyte database was a victim of standardization. The risk managers, whose decisions drove business profitability, were punished by slow application response time every time they ran complex analyses.

The second business is really a conglomerate of some 50+ departments. These departments were not created equal, however; there were some really profitable big departments and some paper-pushing small departments. To the benefit of some end users and the tremendous detriment of others, this business standardized on a middle tier storage solution with generous capacity scalability but not so generous performance scalability. Their premier revenue generating department was suffering with, you won’t believe this, 60 millisecond latencies from storage for their transaction processing system. Yikes.

For the non-storage geeks reading this blog, a really fast solid state storage system will return data to the host in well under 1 millisecond. A well-tuned hard disk based RAID array will return data in 5 to 7 milliseconds. A 60 millisecond response time is indicative of a major storage bottleneck. Experiencing a 60 millisecond response time on a single request is no big deal, but when it happens throughout a batch process or is spread across many concurrent users, applications get to be very slow: end-users wait for seconds, or batch processes take too long to complete, resulting in blown batch processing windows.
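A little arithmetic shows how per-request latency adds up over a batch job. This sketch assumes the simplest possible case, where the storage requests are issued strictly one after another; the request count is a hypothetical figure, and the latencies are the ones quoted above (with 1 millisecond standing in as an upper bound for solid state).

```python
def batch_seconds(io_count, latency_ms):
    """Elapsed time for io_count storage requests issued one after another."""
    return io_count * latency_ms / 1000.0

IO_COUNT = 1_000_000  # hypothetical number of serial I/Os in a batch job

cases = (
    ("solid state, 1 ms (upper bound)", 1),
    ("well-tuned HDD RAID, 5 ms", 5),
    ("observed bottleneck, 60 ms", 60),
)
for label, latency_ms in cases:
    hours = batch_seconds(IO_COUNT, latency_ms) / 3600
    print(f"{label:<32} {hours:6.2f} hours")
```

With these assumptions the same job goes from well under an hour on solid state to more than 16 hours at 60 milliseconds, which is how a batch window gets blown.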

For now, the story for these two environments is not finished.  Once companies head down the standardization trail they are pretty confident and committed.  Eventually, the wheels fall off and people begin to realize that it is as bad to standardize on all low cost storage as it is to standardize on all high end storage.  Eventually, people realize that IT needs to align to business and not the other way around.

As companies amass larger data stores and the price and options for deploying SSD evolve, SSD solutions will become more common in the data center and a part of each IT manager’s bag of tricks.  Zsolt Kerekes, at StorageSearch.com, put it best in his 2010 article “This Way to Petabyte SSD” (http://www.storagesearch.com/ssd-petabyte.html) when he said “The ability to leverage the data harvest will create new added value opportunities in the biggest data use markets – which means that backup will no longer be seen as an overhead cost. Instead archived data will be seen as a potential money making resource or profit center. Following the Google experience – that analyzing more data makes the product derived from that data even better. So more data is good rather than bad. (Even if it’s expensive.)”


Waves of Opportunity

May 28, 2011

by Woody Hutsell at www.appicu.com

The next big opportunity/threat for SSD manufacturers is playing itself out right now. SSD vendors are scrambling to be a part of this next big wave. The winners are your next acquisition targets or companies poised to go public. The losers will hope that this new wave expands the overall market just like the first wave.

The first big wave in the enterprise SSD market was the rapid adoption of hard disk form factor SSDs for use in enterprise storage arrays. The SSD companies most seriously contending to ride this wave were BitMicro and STEC. STEC, by virtue of their GnuTek acquisition, had the right product at the right time and were able to win early business with EMC. Suddenly, venture money was pouring into the market and any company that had ever put a Flash chip on a board was selling Flash disk drives. The clear winners in this category have been STEC, who continues to have great revenue growth, and Pliant’s investors who have successfully sold their company to SanDisk after getting some traction with the OEM community. The story in this market is not finished as companies like Western Digital, Seagate, LSI and Intel look to chip away at this part of the business. At the same time though, a few companies were swept out to sea and others saw their golden opportunity for enterprise riches turn into dreams of big volumes (but low margins) in consumer markets. As I have argued before, the use of Flash hard drives in enterprise arrays is really about accelerating infrastructures more than about accelerating a specific application. This first big wave actually increased opportunities for all SSD companies by increasing the market size and validating the technology for mainstream use.

The newest wave to entice and yet concern SSD manufacturers is hitting closer to home for those manufacturers focused on the application acceleration market. For many years, the data warehousing sector has led to some great success stories for companies like Netezza, who tightly bundled database functionality with hardware. Netezza’s success led Oracle and HP to try Exadata, which was anything but a rousing success in the market. But somewhere along the way, Oracle was watching what Sun was doing with solid state storage and noticed a way to take the relatively less exciting Exadata and turn it into something much more captivating and yet similarly named: Exadata 2. Some day we will learn whether the prospects of Exadata 2 were a big motivator for the Sun acquisition or just a quick way to demonstrate that Oracle was serious about the hardware market. Either way, Oracle’s claims of big margins and big potential revenue streams for Exadata 2 have ignited a flurry of activity in the market. Already vendors are clamoring to get into this space, and there is a series of speed dating exercises going on as database vendors, server vendors and SSD vendors try to find some magical combination that helps them beat Oracle in this new market. Will the rich SSD vendors get richer still in this category, or will the remaining SSD manufacturers find new partners, buyers and OEMs? Can any combination beat Oracle?

Whoever the winners are, this second wave will show more clearly the ability of a tightly integrated solid state storage solution to increase application performance.


Consistency Groups: The Trouble with Stand-alone SSDs

February 28, 2011

SSDs (Solid State Disks) are fast; everyone knows this.  So, if they are all so very fast, why are we still using spinning disks at all?  The thing about SSDs (OK, well, one of the things) is that while they are unarguably fast, they need to be implemented with reliability and availability in mind just like any other storage media.  Deploying them in an Enterprise environment can be sort of like “putting all of your eggs in one basket”.  In order for them to meet the RAS needs of enterprise customers, they must be “backed up” in some meaningful way.  It is not good enough to make back-up copies occasionally; we must protect their data in real time, all of the time.  Enterprise storage systems do this in many different ways, and over time, we will touch upon all of these ways.  Today, we want to talk about one of the ways – replication.

One of the key concepts in data center replication is the concept of consistency groups.  A consistency group is a set of files that must be backed up/replicated/restored together with the primary data in order for the application to be properly restored.  Consistency groups are the cause of the most difficult discussions between end-users and SSD manufacturers.  At the end of this article, I will suggest some solutions to this problem.

The largest storage manufacturers have a corner on the enterprise data center marketplace because they have array-based replication tools that have been proven, in many locations over many years.  For replicated data to be restored, an entire consistency group must be replicated using the same tool set.  This is where external SSDs encounter a problem.  External SSDs are not typically (though this is changing) used to store all application data; furthermore, they do not usually offer replication.  In a typical environment, the most frequently accessed components of an application are stored on SSD and the remaining, less frequently accessed data, are stored on slower, less expensive disk.  If a site has array-based replication, that array no longer has the entire consistency group to replicate.
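As a minimal sketch of that problem, the check below models a consistency group as a list of file placements and asks which members the replicating array cannot see.  The file names and device labels are hypothetical, invented for illustration; real replication tools track this at the volume or LUN level rather than by file name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    file_name: str  # hypothetical file belonging to the application
    device: str     # hypothetical label for where the file lives


def missing_from_array(group: list[Placement], replicating_device: str) -> list[str]:
    """Return the members of the consistency group the replicating array does not hold."""
    return [p.file_name for p in group if p.device != replicating_device]


# Hot files moved to the external SSD, colder data left on the SAN array.
group = [
    Placement("hot_index.dbf", "external_ssd"),
    Placement("redo_log.dbf", "external_ssd"),
    Placement("history.dbf", "san_array"),
]

missing = missing_from_array(group, "san_array")
if missing:
    print("Array-based replication would miss:", ", ".join(missing))
else:
    print("The array holds the entire consistency group.")
```

Any non-empty result means a restore from the array's replica alone would not bring the application back in a consistent state.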

External SSD write caching solutions encounter a more significant version of this same problem.  Instead of storing specific files that are accessible to the array-based replication tool, they have cached some writes that may or may not have been flushed through to the replicating array.  The replicating array has no way of knowing this and will snapshot or replicate without a full set of consistent data, because some of that data is still held in the external caching solution.  I am aware that some of these third party write caching solutions do have a mechanism to flush cache and allow the external array to snapshot or replicate, but generally speaking, these caching SSDs have historically been used to cache only reads, since write-caching creates too many headaches.  Unless the external caching solution is explicitly certified and blessed by the manufacturer of the storage being cached, using these products for anything more than read caching can be a pretty risky decision.
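The toy model below illustrates why the flush matters.  It is a sketch under simplifying assumptions, not a description of any vendor's product: the `Array` and `WriteBackCache` classes and the block names are all hypothetical.

```python
class Array:
    """Toy replicating array: a dict of block -> data, plus a snapshot call."""

    def __init__(self):
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data

    def snapshot(self):
        # Copy whatever the array holds right now; it cannot see the cache.
        return dict(self.blocks)


class WriteBackCache:
    """Toy external cache that acknowledges writes before the array sees them."""

    def __init__(self, array):
        self.array = array
        self.dirty = {}

    def write(self, block, data):
        self.dirty[block] = data  # acknowledged to the host, not yet on the array

    def flush(self):
        for block, data in self.dirty.items():
            self.array.write(block, data)
        self.dirty.clear()


array = Array()
cache = WriteBackCache(array)
cache.write("account_balance", 100)
cache.write("ledger_entry", "+100")

stale = array.snapshot()       # taken while the writes are still in the cache
cache.flush()
consistent = array.snapshot()  # taken after the cache has been flushed

print("without flush:", stale)
print("after flush:  ", consistent)
```

The first snapshot is missing both writes even though the host was told they had completed, which is exactly the inconsistency described above.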

Automatic integration with array-based replication tools is a main reason that some customers will select disk form factor SSD rather than third party SSDs, in spite of huge performance benefits from the third party SSD.  If you are committed to attaining the absolute highest performance, and are willing to invest just a little bit of effort to maximize performance, the following discussion details some options for getting around this problem.

Solution 1:  Implement a preferred-read mirror.  For sites committed to array-based replication, a preferred-read mirror is often the best way to get benefit from an external SSD and yet keep using array-based replication.  A preferred-read mirror writes to both the external SSD and to the replicating SAN array.  In this way, the replicating array has all of the data needed to maintain the consistency group and yet all reads come from the faster external SSD.  One side benefit of this model is that it allows a site to avoid mirroring two expensive external SSDs for reliability, saving money.  This is because the existing array fills this role.  If your host operating system or individual software application does not offer preferred-read mirroring, then a common solution is to use a third-party storage application such as Symantec’s Veritas Storage Foundation to provide this feature.  You must bear in mind that a preferred-read mirror does not accelerate writes.
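A preferred-read mirror can be sketched in a few lines.  This is an illustrative model only; the `Device` and `PreferredReadMirror` classes are hypothetical and do not represent the interface of any volume manager.

```python
class Device:
    """Toy block device that counts how many reads it serves."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.reads = 0

    def write(self, block, data):
        self.blocks[block] = data

    def read(self, block):
        self.reads += 1
        return self.blocks[block]


class PreferredReadMirror:
    """Write to both sides of the mirror, read only from the preferred side."""

    def __init__(self, preferred, mirror):
        self.preferred = preferred
        self.mirror = mirror

    def write(self, block, data):
        # The write is complete only once both sides have it, so it is
        # no faster than the slower device.
        self.preferred.write(block, data)
        self.mirror.write(block, data)

    def read(self, block):
        return self.preferred.read(block)


ssd = Device("external_ssd")
san = Device("san_array")
volume = PreferredReadMirror(preferred=ssd, mirror=san)

volume.write("row_1", "alpha")
volume.write("row_2", "beta")
values = [volume.read("row_1"), volume.read("row_2")]

print(values, "| SSD reads:", ssd.reads, "| SAN reads:", san.reads)
```

The SAN array ends up with a full copy of the data for replication while serving no reads, and the comment in `write` shows why this arrangement does nothing for write latency.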

Solution 2:  Implement server-based replication.  There are an increasing number of good server-based replication solutions.  These tools allow you to maintain consistency groups from the server rather than from the controller inside the storage array, allowing one tool to replicate multiple heterogeneous storage solutions.

Solution 3:  For enterprise database environments, it is common for a site to replicate using transaction log shipping.  Transaction log shipping makes sure all writes to a database are replicated to a remote site where a database can be rebuilt if needed.  This approach takes database replication away from the array – moving things closer to the database application. 
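The idea behind log shipping can be sketched as follows.  This is a simplified illustration with hypothetical class and function names; real databases ship log files or log records with their own formats, checkpoints and failure handling.

```python
class Primary:
    """Toy primary database: every write is appended to an ordered log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # list of (sequence, key, value) records

    def write(self, key, value):
        sequence = len(self.log) + 1
        self.log.append((sequence, key, value))
        self.rows[key] = value


class Standby:
    """Toy remote database rebuilt by replaying shipped log records in order."""

    def __init__(self):
        self.rows = {}
        self.applied = 0  # sequence number of the last record replayed

    def apply(self, records):
        for sequence, key, value in records:
            if sequence != self.applied + 1:
                raise ValueError(f"gap in log at sequence {sequence}")
            self.rows[key] = value
            self.applied = sequence


def ship(primary, standby):
    """Send the standby every log record it has not yet applied."""
    standby.apply(primary.log[standby.applied:])


primary = Primary()
standby = Standby()

primary.write("acct_1", 100)
primary.write("acct_2", 250)
ship(primary, standby)

primary.write("acct_1", 75)
ship(primary, standby)

print("standby matches primary:", standby.rows == primary.rows)
```

Because consistency comes from the ordered log rather than from the storage layer, it does not matter which array or SSD each database file lives on.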

Solution 4:  Implement a virtualizing controller with replication capabilities.  A few external SSD manufacturers have partnered with vendors that offer controller-based replication and that support heterogeneous external storage behind that controller.  This moves the SSD behind a controller capable of performing replication.  The performance characteristics of the virtualizing controller are now a gating factor in determining the effectiveness of, and indeed the value added by, the external SSD.  In other words, if the virtualizing controller adds latency (it must) or has bandwidth limitations (generally they do), those will now apply to the external SSD.  This can slow SSDs down by a factor of three to ten.  It is also the case that this approach will solve the consistency group problem only if the entire consistency group is stored behind the virtualizing controller.

Most companies implementing external SSD have had to make decisions, trying to grapple with the impact of consistency groups on application performance, replication and recovery speed.  Even so, the great speed associated with external SSDs often leads them to implement external SSD using one of the solutions we have discussed. 

What has been your experience?