Why Guaranteed Quality of Service (QoS) Matters
The SolidFire implementation of guaranteed Quality of Service (QoS) is, very literally, the only technical reason I came to work at SolidFire. After spending 10+ years as a service provider customer of HPE and EMC and then another 3+ years selling EMC arrays, I was done with storage. Done putting lipstick on the pig. Done having to deal with customers who were frustrated when the product couldn’t live up to the marketing hype. SolidFire QoS was one of the first truly unique innovations I’d seen in storage in a long time, and it forms the core of how we are changing the relationship between storage and storage consumers and bringing shared storage out of the dark ages.
Pure Storage has taken many, many pages out of the EMC playbook, and with their latest set of press releases has continued to embrace the “marketing-first, reality-later, whether-you-need-it-or-not” mentality of their big brother. We could talk about many different parts of that “announcement,” but the one I want to focus on, the one that matters most to me, is this lunacy about the role that Quality of Service plays in an all-flash environment.
Simply put, Pure is focusing their attention on the least interesting part of QoS because it’s the only part their architecture will let them implement. Pure states: “As customers consolidate dozens or even hundreds of applications, concerns about ‘noisy neighbors’ – applications that starve the performance of other applications on an array – can emerge.” Preventing noisy neighbors by means of a rate-limiting mechanism is something that almost every storage vendor can do at this point, and it’s primarily a way to protect the array from being overrun by the applications. It doesn’t allow for more consolidation, it doesn’t allow for better utilization, and it certainly doesn’t allow for customers to provide guaranteed service levels without overbuilding for peak utilization.
To make a limited feature even less useful, Pure seems to have put a nonconfigurable I/O credit system on top of it, allowing the system to rate limit workloads without any context. Is that volume using more I/O because it’s running a backup? Do you want to oversubscribe I/O because you know that batch jobs or table updates will be happening at specific times? Do you want to exclude some volumes from the process so you don’t have to explain to application owners why their performance suddenly dropped? When Pure says “always-on QoS is simple and autonomous, making it ideal for most use cases, and builds the foundation for future policy-driven QoS extensions,” what customers should hear is that it’s rigid and inflexible, isn’t ideal for every use case, and doesn’t include any policy-based management.
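For context on why this is the easy part: credit-based rate limiting is a textbook technique, the classic token bucket. The sketch below is purely illustrative (the class, names, and numbers are mine, not any vendor’s actual implementation), and it shows the core limitation: the bucket throttles a burst with no idea whether the burst was a backup, a batch job, or a runaway process.

```python
# Illustrative token-bucket rate limiter: the classic mechanism behind
# credit-based I/O capping. This models the concept only; it is NOT
# any vendor's actual implementation.

class TokenBucket:
    def __init__(self, rate_iops, burst_capacity):
        self.rate = rate_iops           # credits earned per second
        self.capacity = burst_capacity  # maximum banked credits
        self.tokens = burst_capacity    # start with a full bucket
        self.last = 0.0                 # timestamp of last check

    def allow(self, io_count, now):
        # Accrue credits for elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= io_count:
            self.tokens -= io_count
            return True
        return False  # throttled -- with zero context for why the workload burst

bucket = TokenBucket(rate_iops=1000, burst_capacity=2000)
assert bucket.allow(1500, now=0.0)      # burst allowed while credits remain
assert not bucket.allow(1000, now=0.0)  # only 500 credits left: throttled
assert bucket.allow(1000, now=1.0)      # one second later, 1000 more accrued
```

Notice what is missing: there is no way to reserve performance for a volume, only a ceiling to bounce off of. That asymmetry is the whole argument that follows.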
And to top it off, Pure ignores all of the other benefits a real, guaranteed QoS system can provide, simply because they have no way to deliver them. They can’t let customers provision performance the same way they provision capacity. They can’t guarantee critical workloads will get the performance the enterprise demands of them. They can’t scale performance independently. They can’t allow customers to move I/O from one volume to another deliberately.
Think of it this way:
Imagine if you couldn’t allocate RAM to a VM. You paid for a server that had the right number of CPUs/cores, and with that, you got some amount of RAM that you couldn’t really specify and you couldn’t allocate out to specific workloads. All of your VMs pulled from the same pool, and even if that pool was really, really big, enough consolidation or a workload that misbehaves enough is going to impact everything else that’s sharing from the pool.
Well, this sucks, and changing the entire architecture and software stack to address the issue is really hard, but maybe the vendor can provide a minimum-effort, minimally effective half measure: any VM that uses more than its fair share of RAM gets rate limited back into line. What’s a fair share? Who knows. Maybe it’s something you manually set. Maybe it’s “auto-magic.” But there’s no way to promise RAM will be available for critical apps, and if a VM bursts its RAM usage for a good reason, it’s going to get limited just like a VM that bursts for a bad reason.
VMware admins would never stand for such a system, and VMware would never expect them to. They provision CPU and RAM independently, can guarantee those resources per VM, have systems in place to limit utilization when a VM is abusive, and have the freedom to overallocate when things are running well. We’ve always expected that VM admins will know, monitor, trend, manage, and adjust CPU and RAM allocations for every VM. It’s just what the job entails.
On the storage side, SolidFire is the only platform that lets you provision both resources of the array (performance and capacity) independently, guarantee both per workload, and allow admins to know, monitor, trend, manage, and adjust those allocations in real time. The analogy with VMware is almost perfect.
One benefit is being able to make sure misbehaving workloads don’t overwhelm the array (rate limiting), but that’s (IMO) the least interesting. Even guaranteeing a minimum amount of performance isn’t the real magic, although for people who have to deliver an actual service level and don’t want to overprovision for peak usage, it’s a godsend.
No, the real magic of SolidFire guaranteed QoS is the ability to actually provision performance. Think about it: Customers have never been able to say, “I want to buy this much performance, and I want to allocate it out to these workloads in these amounts. And if I get it wrong on day one (or if things change), I want to move that performance around to where it’s needed.”
Historically, every purchase of capacity came bundled with a theoretical maximum amount of performance that varied by media type and architecture. SolidFire fundamentally changes that behavior. Buy the amount of performance and capacity you need in the ratio that makes sense today. Scale performance and capacity independently in a ratio that makes sense going forward. Allocate out performance and capacity independently, per volume, providing both guaranteed minimums and effective maximums. Track, analyze, and reallocate as needed. Move large chunks of performance from one place to another because of scheduled operational tasks, like backups or batch jobs. Automate the dynamic provisioning of more performance as the workload grows. Use the storage resources you paid for in the most direct, incremental, logical way possible, just like you do with compute and memory.
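To make the provisioning model above concrete, here is a minimal Python sketch of performance as a reservable, movable resource. Everything in it is hypothetical (the class, volume names, and IOPS figures are mine for illustration, not the SolidFire Element API); the point is the shape of the model: minimum IOPS are a hard reservation against what you bought, maximums are per-volume caps, and reservations can be deliberately shifted, like moving a RAM reservation between VMs.

```python
# Hypothetical model of guaranteed-QoS allocation. Minimum IOPS are a hard
# reservation against total purchased performance; maximum IOPS are a
# per-volume cap. Illustrative only -- not the SolidFire Element API.

class QoSCluster:
    def __init__(self, total_min_iops):
        self.total_min_iops = total_min_iops  # performance you bought, like RAM in a host
        self.volumes = {}                     # name -> {"min": ..., "max": ...}

    def reserved(self):
        # Sum of all guaranteed minimums currently handed out.
        return sum(v["min"] for v in self.volumes.values())

    def provision(self, name, min_iops, max_iops):
        # A guarantee is only a guarantee if reservations can't exceed capacity.
        if self.reserved() + min_iops > self.total_min_iops:
            raise ValueError("not enough unreserved performance")
        self.volumes[name] = {"min": min_iops, "max": max_iops}

    def move_min_iops(self, src, dst, amount):
        # Deliberately shift guaranteed performance, e.g. toward a backup
        # volume during its window, then back afterwards.
        if self.volumes[src]["min"] < amount:
            raise ValueError("source volume has too little reserved")
        self.volumes[src]["min"] -= amount
        self.volumes[dst]["min"] += amount

cluster = QoSCluster(total_min_iops=100_000)
cluster.provision("oltp-db", min_iops=50_000, max_iops=80_000)
cluster.provision("backup", min_iops=10_000, max_iops=40_000)

# Backup window starts: move 20,000 guaranteed IOPS from the database.
cluster.move_min_iops("oltp-db", "backup", 20_000)
```

The difference from the token-bucket world is the `provision` check: an allocation that over-commits the guaranteed pool fails up front, which is exactly the difference between a guarantee and best-effort fair sharing.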
Lots of vendors have maximums/rate limiting; Pure is just the most recent. This isn’t an interesting or difficult feature to implement, and it’s primarily meant to protect the array, not the workloads. Some vendors have a prioritization or fair-queuing system, which simply punishes all workloads in a similar queue equally. While slightly more interesting, it still doesn’t solve the problem.
Only SolidFire can solve the problem, and has done so since day one. And it integrates into both VMware and OpenStack. And is bringing it to VVols. No other storage vendor can. No HCI vendor can. This is not just a HUGE competitive differentiator; it’s critical to pushing storage into the next phase of the SDDC, and vendors who can’t implement it well and comprehensively will always find themselves behind the curve.
And if the biggest negative a competitor can throw at SolidFire QoS is that it’s too powerful and too granular and too policy-driven? Well, I think we can all live with that. Want to know more about true, native, guaranteed QoS? Check out our YouTube channel or any number of blogs like this one, or this one, or this one.