Elements of a Next Generation Data Center Series: Guaranteed Performance
Use the words “storage” and “guaranteed” in the same sentence, and eyebrows will rise.
I get that. And yet, I’m going to make this statement clearly and without equivocation: if your storage doesn’t allow you to provide both an I/O limit and a minimum guarantee at the smallest storage construct it supports, you are putting your IT organization, your IT operators, and ultimately your customers at a disadvantage for no good reason.
Before we dig into the storage side of the equation, let’s take a look back in time. In the olden days, we had a very easy boundary we could use to guarantee performance: the physical server. Before 2001, every server contained its own processors, memory, and often storage as well. The resources were all given to an OS, which provided a place for an application (or sometimes a couple applications) to reside. If an application needed more RAM, you added RAM to the server. If the application needed more CPUs, you moved it to a bigger server. You never had to worry about another application impacting your performance.
But of course, there were drawbacks.
Doing maintenance and upgrades involved change management outages. Growth was hard to handle, both from a resource perspective as well as from a power and cooling standpoint. Overall, utilization was very low. Resources were captive to the hardware, and moving either resources or apps was a non-trivial task involving outages, screwdrivers, sleepless nights, and working weekends. Asset management was also hard, because every server might be a custom build specific to the workload it was supporting.
In the early 2000s, VMware changed all of that. They showed IT organizations everywhere that pooling compute and memory resources and provisioning out exactly what a given workload needed was not only possible, but that it gave them more control over their environments than ever before. The advent of server virtualization gave previously impossible amounts of flexibility to IT organizations, and completely changed how x86 servers were purchased by putting the focus on the VM and the workload it was supporting.
Gone were the days of applications going down because of hardware maintenance.
Gone were the days of getting woken up by a pager at 3 a.m. because a server went down.
Gone were the days of using 20% of the capacity of a server, or having to manage dozens of server configurations.
Very literally, virtualization changed fundamental concepts of IT operations, and for the most part the industry was better off for it.
And yet, storage remained untouched.
Have no doubt, nothing has driven the adoption of shared storage like virtualization. All those cool HA and resource scheduling features? The magic that is vMotion? All of it depended on every host having access to the storage where the VM resided. But from an operational standpoint, nothing changed. We still provisioned LUNs or NFS mount points. We still turned them into VMFS datastores. We still shared datastores with a more or less random number of VMs that used various amounts of capacity and performance.
For years, nothing in that model changed. IT operations teams got better integration, maybe, and VMware released storage APIs that allowed things to be offloaded to the storage array for the purposes of increased speed and lower overhead. But the basic blocking and tackling of storage in a virtualized environment stayed the same.
It turns out storage architecture is hard. It’s hard to design, for sure, but it’s even harder to change. While virtualization was transforming the compute side of the house with software-based resource reservations, scheduling, load balancing, and failover, none of those capabilities existed on the storage side. They were hard. And they cost money. And traditional storage folks were doing just fine masking those complexities, thank you very much, so why make the investment in something that was only going to increase efficiency and decrease the amount of storage customers had to buy?
With that realization, shared storage remained stuck in the stone age. Or at least the pre-virtualized age, which is pretty much the same thing.
Leave it to new entrants into the storage market to shake things up. If you look at the new storage vendors that have set up shop in the last two to three years, they fall decidedly into two camps:
- Those who are just reusing the architectures of their dinosaur brethren that came before them, and
- Those who have reimagined what a storage array is from the ground up and are trying to solve some of the gaps that exist.
SolidFire is clearly one of the leaders in the latter group. From Gartner to ESG to investors to customer wins, SolidFire has, from its inception, dedicated itself to the idea that storage doesn’t have to be the boat anchor in your data center. You can have world-class shared storage and still get flexibility and scale, with seamless integration into any operational process.
The fundamental flaw to recognize in legacy shared storage (and there are many others, to be sure) is that capacity and performance are almost always directly related. With spinning disk, every spindle has IOPS and capacity associated with it, and when you need one, you get the other. This results in very inefficient usage of resources, and it makes customers buy more of what they don’t need in order to get what they do. This is, obviously, great for the people selling the disk, but not so great for the people using it.
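To make that coupling concrete, here is a back-of-the-envelope sketch using hypothetical numbers (a 15K-rpm spindle delivering roughly 180 IOPS and 600 GB; the workload requirements are invented for illustration):

```python
import math

# Assumed (illustrative) per-spindle specs for a 15K-rpm drive.
disk_iops, disk_gb = 180, 600

# Hypothetical workload: 20,000 IOPS against only 10 TB of data.
need_iops, need_gb = 20_000, 10_000

disks_for_iops = math.ceil(need_iops / disk_iops)   # spindles needed for performance
disks_for_gb   = math.ceil(need_gb / disk_gb)       # spindles needed for capacity
disks_to_buy   = max(disks_for_iops, disks_for_gb)  # you must buy the larger number

print(disks_for_iops)             # 112 spindles for IOPS
print(disks_for_gb)               # 17 spindles for capacity
print(disks_to_buy * disk_gb)     # 67,200 GB purchased to serve a 10 TB need
```

In this sketch, performance drives the purchase: the customer buys roughly 67 TB of raw capacity to satisfy a 10 TB workload, which is exactly the “buy more of what you don’t need” dynamic described above.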
Two core features separate this new architecture from its legacy predecessors: scale and quality of service (QoS). These two tenets work hand in hand, letting customers do two things they have always needed from a shared storage platform:
- Provision capacity and performance incrementally, granularly, and most importantly, completely independent of one another;
- Guarantee, at a granular level, that every volume will get the performance it’s been allocated. Always. Regardless of what else is happening on the array.
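The two properties above can be sketched in a few lines. This is an illustrative model only, not SolidFire’s actual API; the names and numbers are invented to show capacity and performance as independent knobs on the smallest construct, the volume:

```python
from dataclasses import dataclass

@dataclass
class QoS:
    min_iops: int    # guaranteed floor, honored regardless of other volumes
    max_iops: int    # sustained ceiling
    burst_iops: int  # short-term ceiling

@dataclass
class Volume:
    name: str
    size_gb: int     # capacity, chosen independently of performance
    qos: QoS

# A small volume can carry a large guarantee, and a large volume a small one:
db_log  = Volume("db-log",  size_gb=50,   qos=QoS(5000, 10000, 15000))
archive = Volume("archive", size_gb=8000, qos=QoS(100, 500, 1000))
```

The point of the sketch is the decoupling: `db_log` is 160× smaller than `archive` but carries a 50× larger performance floor, a combination spindle-bound storage simply cannot express.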
When talking about scale, the first hurdle is to make sure that the act of scaling doesn’t reintroduce all of the compromises customers faced in a legacy environment. Being able to scale out and in using granular, configurable steps, being able to distribute load and traffic over any number of nodes, and being able to increase the scope of the data services layer along with it were all key parts of solving that puzzle. This kind of next-gen design allows for large-scale adoption of cluster sizes that would be extraordinarily expensive, inefficient, and unmanageable with a legacy architecture. Using a node-based, shared-nothing design gives the customer a level of design freedom they’ve never had before. In the case of SolidFire, putting the FlashForward guarantee in place makes it easy to trust that the new model isn’t going to demand that customers continue with forklift storage upgrades.
The second goal is to make sure that once customers have this gloriously large pool of low latency, high-performance flash, they can consolidate as many workloads onto it as possible without increasing the risk profile of those applications. This is where an integrated QoS feature comes in.
QoS is one of those terms that has been abused pretty badly by the storage industry in the last couple years. Some vendors use a simple prioritization process and call it QoS despite not being able to enforce it at a granular level. Some vendors put rate limiting in place for noisy neighbor protection and call it QoS. Some apply it only to limited object types, limiting the use case to a specific hypervisor.
In the best implementations, it’s exceedingly simple: every volume ever provisioned must be assigned minimum, maximum, and burst IOPS values, and those values must be respected no matter what is happening on any other volume, at any capacity level, with any I/O pattern, forever.
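One common way to realize the max/burst half of that contract is a credit scheme: a volume banks unused headroom while it runs below its sustained ceiling, and spends those credits to briefly exceed it. The sketch below is a minimal, assumed model of that idea; the class and method names are illustrative, and the `min_iops` floor is noted but not enforced here (in practice it is honored by throttling other volumes during contention, which needs a cluster-wide scheduler):

```python
class VolumeLimiter:
    """Toy per-volume IOPS limiter with burst credits (illustrative only)."""

    def __init__(self, min_iops, max_iops, burst_iops):
        assert min_iops <= max_iops <= burst_iops
        self.min_iops = min_iops        # floor; enforced elsewhere by the scheduler
        self.max_iops = max_iops        # sustained ceiling
        self.burst_iops = burst_iops    # short-term ceiling
        self.credits = 0                # unused IOPS banked for bursting
        self.credit_cap = max_iops * 10 # cap the bank at ~10s of headroom

    def allow(self, requested, interval=1.0):
        """Return the IOPS granted to this volume for one interval."""
        sustained = self.max_iops * interval
        if requested <= sustained:
            # Running below max: bank the unused headroom, up to the cap.
            self.credits = min(self.credits + (sustained - requested),
                               self.credit_cap)
            return requested
        # Above max: spend credits, but never exceed the burst ceiling.
        burst_room = min(requested - sustained,
                         self.credits,
                         (self.burst_iops - self.max_iops) * interval)
        self.credits -= burst_room
        return sustained + burst_room

# A quiet interval banks credits; a spike then bursts past max_iops.
lim = VolumeLimiter(min_iops=500, max_iops=1000, burst_iops=2000)
lim.allow(200)            # quiet: banks 800 credits
granted = lim.allow(2000) # spike: 1000 sustained + 800 burst = 1800 granted
```

The design choice worth noting is that bursting is funded entirely by the volume’s own past restraint, so one tenant’s spike never comes out of another tenant’s guarantee.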
The benefit to the customer becomes clear at this point. Everything we loved about VMware on the compute side, pooling and isolation of resources, reservation of resources for critical workloads, the flexibility to move resources between workloads dynamically, and increased efficiency, has finally arrived on the storage side of the equation.
When we have this discussion with virtualization architects or with IT operations teams, they get it. When we tell them to demand that their storage vendor have the ability to do this under multiple virtualization and cloud management platforms, and for bare metal servers all at the same time, they start to see the possibilities for how storage can, finally, make their lives easier. When we tell them to demand support for every application regardless of I/O profile on a single array with no compromises, they start to wonder why they’ve been purchasing a separate array for each workload all these years.
And they should.
It’s time for the storage industry to embrace the technology shift that has remade the computing world and drive that kind of simplicity and value into the last technology still lurking in siloed darkness.
I talk more about all of this in my whiteboard video on Guaranteed Performance in the next generation data center. I also encourage you to download SolidFire’s Definitive Guide to Guaranteed Storage Performance.
Comment below with your thoughts!