Musings on VMware ESX Quality of Service


Hypervisors, and VMware’s ESX in particular, offer a variety of Quality of Service (QoS) features and adjustments for storage. They were designed and added in a world where the storage itself might not offer any QoS capabilities, and the tasks of ensuring fair access, prioritizing access for critical workloads, and making best use of the available storage were the responsibility of the hypervisor.

In ESX 4.1, VMware introduced Storage I/O Control (SIOC) to help ensure that low-priority, I/O-intensive processes or VMs didn’t interfere with higher-priority tasks. SIOC keeps activities like backup or data mining, which access large amounts of data continuously, from interfering with mission-critical operations that need timely responses, such as databases or user interaction.

VMware released Storage Distributed Resource Scheduler (SDRS, or Storage DRS) in ESX 5.0. SDRS takes advantage of Storage vMotion (sometimes called SvMotion) to automatically migrate VM storage from heavily used datastores to lightly used datastores. The idea is borrowed from VMware’s compute-resource balancing technique: migrate a VM from a heavily loaded compute server to a lightly loaded one. Dealing with compute resources is truly a hypervisor’s domain. SDRS tries to apply the same technique to storage.

In vSphere 5.1, Storage Policy-Based Management (SPBM) allows different tiers of storage to offer individual capabilities to VMware datastores. A VM or group of VMs could have certain QoS requirements. VMware uses policies to match these QoS requirements with compliant storage capabilities.

SIOC: Who are these noisy neighbors?

VMware SIOC is meant to deal with the “noisy neighbor” problem, where one VM uses so much storage throughput that other VMs start to see increased latency to storage. SIOC deals with the noisy neighbor by setting a latency threshold on each datastore and, once that threshold is exceeded, throttling the workloads that use the datastore.

SIOC allows each VM to be given a number of “storage shares” that sets its priority relative to the other VMs using that datastore. Then, when latency hits the predetermined threshold (30ms by default), I/O is throttled so each VM gets its share. This works across all the ESX servers in a cluster, because share information for all VMs in that cluster using that datastore is taken into account.
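To make the share mechanism concrete, here is a minimal Python sketch of the proportional outcome. It is a model, not ESX's implementation: the function name, the idea of expressing datastore capacity as a single IOPS figure, and the VM names are all assumptions made for the example.

```python
CONGESTION_THRESHOLD_MS = 30  # default threshold mentioned in the text


def allocate_iops(shares, demand, capacity_iops, latency_ms,
                  threshold_ms=CONGESTION_THRESHOLD_MS):
    """Return a dict of VM name -> IOPS granted on one datastore.

    shares: VM name -> share count (hypothetical values)
    demand: VM name -> IOPS the VM is trying to issue
    """
    # Below the threshold nothing is throttled: each VM gets what it asks for.
    if latency_ms < threshold_ms:
        return dict(demand)

    allocation = {vm: 0.0 for vm in shares}
    active = set(shares)
    remaining = float(capacity_iops)

    # Split capacity by share. VMs that need less than their slice hand
    # the surplus back, and it is re-split among the VMs still waiting.
    while active and remaining > 1e-9:
        total_shares = sum(shares[vm] for vm in active)
        satisfied = {
            vm for vm in active
            if demand[vm] - allocation[vm] <= remaining * shares[vm] / total_shares
        }
        if not satisfied:
            # Everyone wants more than their slice: pure proportional split.
            for vm in active:
                allocation[vm] += remaining * shares[vm] / total_shares
            break
        for vm in satisfied:
            remaining -= demand[vm] - allocation[vm]
            allocation[vm] = float(demand[vm])
        active -= satisfied
    return allocation


if __name__ == "__main__":
    shares = {"db": 2000, "backup": 1000}
    demand = {"db": 10000, "backup": 10000}
    print(allocate_iops(shares, demand, capacity_iops=6000, latency_ms=35))
```

With both VMs asking for more than the datastore can deliver, the 2:1 share ratio gives the database 4000 IOPS and the backup job 2000. Note that the only inputs are the VMs ESX knows about, which is exactly the limitation discussed next.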

What isn’t taken into account is knowledge of what other “noisy neighbors” might be lurking in a multi-tenant storage environment. If a non-ESX workload is responsible for the high latency, ESX ends up throttling its VMs to the benefit of that non-ESX workload.

SIOC also examines latency and throttling at a datastore level, without considering that several datastores might reside on the same storage system, or use the same storage network. In this case, another ESX-based datastore could be the cause of throttling imposed on a set of VMs.
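A toy model shows why this blindness matters. The sketch below assumes a storage array with a fixed IOPS capacity shared by ESX VMs and one tenant that SIOC cannot see (a non-ESX workload, or a datastore outside SIOC's view). The function name and the numbers are hypothetical, and real arrays do not behave this linearly.

```python
def steady_state_after_throttling(esx_demand_iops, external_demand_iops,
                                  array_capacity_iops):
    """Model who ends up with the IOPS when only ESX reacts to latency."""
    # The external tenant is not subject to SIOC, so it keeps issuing
    # its full demand (bounded only by what the array can serve).
    external_served = min(external_demand_iops, array_capacity_iops)
    # ESX keeps backing off until congestion clears, so its VMs are
    # left with whatever capacity the external tenant did not take.
    esx_served = min(esx_demand_iops, array_capacity_iops - external_served)
    return {"esx": esx_served, "external": external_served}


if __name__ == "__main__":
    print(steady_state_after_throttling(4000, 8000, 10000))
```

In this example the ESX VMs want 4000 IOPS but settle at 2000, while the workload that caused the congestion keeps all 8000 of its own.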

Being able to handle the noisy neighbors on the storage side, with knowledge of all the tenants and requirements, makes the results more predictable and accommodates all consumers of the storage.


SDRS: Is it the right time to move?

VMware Storage DRS is another method of trying to level the playing field for VMs in an ESX cluster. Instead of choking a VM’s access to a datastore, it looks for a more favorable datastore for that VM to use. Again, the criterion is the I/O latency of the datastore, although it is evaluated over a much coarser time scale than with SIOC. (SDRS also takes datastore capacity into account, which is less interesting because initial placement of the VM into a particular datastore is a manual decision. If one datastore is nearly full, the VM can initially be placed in another datastore.)
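The selection step can be sketched in a few lines of Python. This is only an illustration of the idea described above (move when latency is too high, pick a datastore with lower latency and enough free space). The `Datastore` type, the `recommend_target` function, and the threshold values are assumptions for the example, not VMware's algorithm or API.

```python
from typing import List, NamedTuple, Optional


class Datastore(NamedTuple):
    name: str
    latency_ms: float  # observed I/O latency, as seen by the hypervisor
    free_gb: float


def recommend_target(source: Datastore, candidates: List[Datastore],
                     vm_size_gb: float,
                     latency_threshold_ms: float) -> Optional[Datastore]:
    """Return a better datastore for the VM, or None to leave it in place."""
    # No recommendation while the source is still under the threshold.
    if source.latency_ms <= latency_threshold_ms:
        return None
    viable = [
        d for d in candidates
        if d.name != source.name
        and d.free_gb >= vm_size_gb          # capacity check
        and d.latency_ms < source.latency_ms  # must actually be an improvement
    ]
    # Prefer the candidate with the lowest observed latency.
    return min(viable, key=lambda d: d.latency_ms) if viable else None


if __name__ == "__main__":
    src = Datastore("ds1", latency_ms=40, free_gb=500)
    others = [Datastore("ds2", 12, 50), Datastore("ds3", 18, 900)]
    print(recommend_target(src, others, vm_size_gb=200, latency_threshold_ms=15))
```

Here `ds2` has the lowest latency but not enough free space, so `ds3` is recommended. Everything the function knows comes from per-datastore latency and capacity figures.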

Because I/O latency is the deciding factor, VMware makes decisions to move a VM from one datastore to another with the same blindness from which SIOC suffers. Other systems that aren’t using ESX (or even other ESX clusters) can be using the same storage or storage network, and moving the VMs might not help the latency at all.

SDRS has the additional downside of adding load to an already potentially overloaded storage network. If network capacity is part of the cause of the I/O latency, storage migration can further load the network, which makes the latency problem worse.
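A back-of-the-envelope calculation shows how much extra traffic a migration puts on the network. The sketch below assumes the copy can only use whatever bandwidth is left on a shared link. The function name and the example figures are hypothetical, and it ignores protocol overhead and any array-side copy offload.

```python
def migration_seconds(vmdk_gib, link_gbps, current_utilization):
    """Estimate how long copying a VM's disks takes over a shared link.

    vmdk_gib: total size of the VM's virtual disks, in GiB
    link_gbps: nominal link speed, in gigabits per second
    current_utilization: fraction of the link already in use (0.0 to 1.0)
    """
    headroom_gbps = link_gbps * (1.0 - current_utilization)
    if headroom_gbps <= 0:
        return float("inf")  # no spare bandwidth: the copy never finishes
    bits_to_copy = vmdk_gib * 2**30 * 8
    return bits_to_copy / (headroom_gbps * 1e9)


if __name__ == "__main__":
    seconds = migration_seconds(vmdk_gib=500, link_gbps=10, current_utilization=0.8)
    print(f"{seconds / 60:.0f} minutes")
```

Moving a 500 GiB VM over a 10 Gbps link that is already 80% busy takes roughly 36 minutes, and for that whole time the link runs with no headroom at all.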

A storage system providing guaranteed IOPS and latency smoothing across all datastores disposes of the need to move from one datastore to another. Datastores with the same guarantees receive the same performance.

SPBM: Are we compatible?

VMware Storage Policy-Based Management brings a lot to the table. It allows you to quickly identify the protection policy, encryption, compression, and archival capabilities of storage. It also ensures that VMs with QoS requirements are assigned to compliant storage.
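The matching step is conceptually simple, as the following Python sketch suggests. The capability names (`encryption`, `min_iops`), the dictionary layout, and both functions are made up for the example. They are not SPBM's actual data model or API.

```python
def is_compliant(policy, capabilities):
    """Check one datastore's advertised capabilities against a VM policy.

    Boolean requirements must match exactly; numeric requirements are
    treated as minimums the datastore has to meet or exceed.
    """
    for name, required in policy.items():
        if name not in capabilities:
            return False
        offered = capabilities[name]
        if isinstance(required, bool):
            if offered != required:
                return False
        elif offered < required:
            return False
    return True


def compliant_datastores(policy, datastores):
    """Return the names of all datastores that satisfy the policy."""
    return sorted(name for name, caps in datastores.items()
                  if is_compliant(policy, caps))


if __name__ == "__main__":
    policy = {"encryption": True, "min_iops": 5000}
    datastores = {
        "ds-a": {"encryption": True, "min_iops": 10000},
        "ds-b": {"encryption": False, "min_iops": 20000},
        "ds-c": {"min_iops": 8000},
    }
    print(compliant_datastores(policy, datastores))
```

Only `ds-a` qualifies: `ds-b` lacks encryption and `ds-c` does not advertise it at all. The policy states concrete requirements rather than a tier name, which is the distinction the next paragraphs turn on.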

The unfortunate side of SPBM is that it is also often used to create storage tiers. Most examples from VMware discuss Gold, Silver, and Bronze classes of storage instead of concrete capabilities. Using SPBM to determine storage tiering means it is likely being used as the front line for QoS, even though the policies themselves don’t provide any QoS.

Tiering is a fairly poor way of providing QoS, because VMs and applications with similar priorities land in the same tier even when their storage requirements are completely different, so they still compete with each other. This can result in tier promotion not because a workload needs better storage, but simply because it can’t coexist with its own noisy neighbors.

SPBM is actually very valuable in coordinating the needs of a VM with the capabilities of a certain type of storage. In later releases, it can also be used to coordinate vSphere and storage to ensure compliant storage is available.

However, it loses a lot of its usefulness when it gets used to drive a tiered storage approach.

What’s the answer?

The alternative to using vSphere-based QoS methods is to make sure your storage architecture offers the QoS guarantees you need. SolidFire provides true scale-out capabilities, allowing the storage to grow with your requirements.

SolidFire provides load-balancing distribution, meaning the entire cluster ensures all storage is delivered in the best possible way. This distribution accomplishes what SDRS attempts to do, but it does so with full knowledge of the storage load, and without needing to migrate complete VMs, which would further tie up the storage network.

SolidFire also provides fine-grained QoS control of each datastore. This means the datastore has exactly the IOPS and latency it requires, without needing to be throttled at the hypervisor, moved to a different datastore, or adjusted to a different tier.
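To illustrate what a storage-side guarantee looks like in principle, here is a small Python sketch. It assumes each datastore has a guaranteed minimum and a ceiling on IOPS. Those settings, the function name, and the way spare capacity is divided are assumptions for the example and do not describe SolidFire's actual implementation or API.

```python
def allocate_with_guarantees(volumes, cluster_iops):
    """volumes: name -> (min_iops, max_iops, demand_iops).

    Returns name -> IOPS granted. Every volume first receives its
    guaranteed minimum (or its demand, if lower); spare capacity is then
    split in proportion to what each volume still wants, up to its maximum.
    """
    total_min = sum(lo for lo, _hi, _demand in volumes.values())
    if total_min > cluster_iops:
        # Refuse configurations whose guarantees cannot all be honored.
        raise ValueError("guaranteed minimums exceed cluster capacity")

    alloc = {name: float(min(demand, lo))
             for name, (lo, _hi, demand) in volumes.items()}
    spare = cluster_iops - sum(alloc.values())

    wants = {name: max(0.0, min(demand, hi) - alloc[name])
             for name, (_lo, hi, demand) in volumes.items()}
    total_want = sum(wants.values())
    if total_want > 0 and spare > 0:
        scale = min(1.0, spare / total_want)
        for name in alloc:
            alloc[name] += wants[name] * scale
    return alloc


if __name__ == "__main__":
    volumes = {
        "db":     (5000, 15000, 12000),   # (min, max, demand)
        "backup": (1000, 50000, 40000),
    }
    print(allocate_with_guarantees(volumes, cluster_iops=20000))
```

However hungry the backup volume gets, the database volume never drops below its 5000 IOPS floor, and nothing had to be throttled at the hypervisor or moved to reach that result.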

What’s in the future?

VMware recognizes that the various approaches to QoS at the hypervisor level take storage out of the equation.

With the advent of Virtual Volumes (Vvols), there is no longer any need for SIOC or SDRS, which observe latency at the datastore level. Vvols use datastores as logical groupings of VMs, unrelated to how they are accessed or stored.

SPBM takes on a new, more important role of ensuring each VM (and even each virtual disk on each VM) gets the policies it needs to best perform for the application. SolidFire has a full-featured Vvols implementation underway.

For more information on hypervisor-based and all other kinds of QoS, check out our Definitive Guide to Guaranteeing Performance in a next generation data center.

Posted in Quality of Service, VMware.