Demonstration of SolidFire’s Automated Self-Healing HA

See a demo of SolidFire’s automated HA and watch what it looks like to rebuild from a shelf-based failure in less than 10 minutes.


SolidFire delivers the industry's best automated HA story, with the ability to self-heal from any drive, controller, or shelf-based hardware failure, and the software that enables that. This demonstration will go through each of the conceptual items of SolidFire's automated HA, and then show what it looks like to rebuild from a shelf-based failure. A traditional storage system runs a shared disk shelf HA scheme with dual controllers, typically with shelves behind them, or possibly quad controllers, depending upon the architecture. The biggest challenge associated with any of these high availability schemes is that if we were to lose one of these controllers, like this, then we have to rush into the data center to bring that controller back up, because we run the risk that the second controller could go down and then we would have a data unavailability event. Worse, were this shelf to go down, we automatically have a data unavailability event, and then we need to rely upon four hour support or premium support to bring that up as quickly as possible.

Whereas on SolidFire, we have what we call the Shared Nothing High Availability scheme. The idea is that we have resiliency across, in our case, nodes, and that data redundancy is distributed across the entire system. Then from there, we take it one step further and we have a self-healing HA concept. The reason why self-healing becomes so interesting is that, over time, computers and the back end software end up rebuilding things at a much, much quicker rate than we ever possibly could with a four hour or a next business day support contract. Consider this: let's talk about the hours to start a rebuild process. In a self-healing environment, the healing process starts automatically, whereas on four hour support or next business day support, it takes either four hours or twenty-four plus hours for that rebuild process to start. The reason why this becomes so [inaudible 00:02:16] is that, if you look at most modern SSD arrays, they rebuild in a very short time. For example, let's take an hour rebuild process. As you can see here, in the self-healing environment on a SolidFire system the rebuild will complete in roughly an hour, whereas even on an all-SSD system it would take a minimum of five hours, or even possibly twenty-five hours, largely because of that time to start the rebuild process on, be it, a controller or a shelf-based failure.
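The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above: the function name and the scenario labels are made up for the example, and the self-healing wait is treated as zero hours, as it is in the comparison above.

```python
def hours_until_redundancy_restored(hours_to_start: float, rebuild_hours: float) -> float:
    """Total time spent without full redundancy after a failure."""
    return hours_to_start + rebuild_hours


# The one-hour rebuild used as the example in the talk.
REBUILD_HOURS = 1.0

# Hours before the rebuild can even begin, per support model (illustrative labels).
hours_to_start = {
    "self-healing": 0.0,                # rebuild starts automatically
    "four-hour support": 4.0,           # wait for the support visit
    "next-business-day support": 24.0,  # "twenty-four plus" hours, lower bound
}

totals = {
    name: hours_until_redundancy_restored(wait, REBUILD_HOURS)
    for name, wait in hours_to_start.items()
}

for name, total in totals.items():
    print(f"{name}: about {total:.0f} hour(s) until the rebuild completes")
```

The rebuild itself is the same hour in every row; the whole difference comes from the time it takes to start.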

What we're going to do is pull the equivalent of a shelf on SolidFire. What I have right here is a ping running against the shelf on an active basis, so we can figure out exactly when it's powered off. Then, I'm going to run a load against this system. Now we're starting that workload to the system. You can see right here, currently there are no alerts on this cluster, and currently there are no running tasks. As an informational piece, we're currently running a four node cluster, and with SolidFire, as we grow that cluster out, this rebuild process gets better and better. Think of this as being the low water mark for the time it takes for a rebuild process. Now that we've started off the traffic, let's pull a shelf. I'm going to pull the shelf in the background, and then we'll see it show up right here when we can no longer ping it. Now we can see, pretty much immediately, that we can no longer ping that particular shelf, by this request timeout right here. We know that this shelf is down, and we pulled the power plug on it. We can see down here with the [inaudible 00:04:07] rate that, of course, there will be a performance impact. However, as we can see right here, there isn't any downtime associated with pulling out that full shelf on the system.
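The continuous ping used to spot the power-off can be approximated with a small loop like the one below. This is a sketch, not the tool used in the demo: `ping_once` and `wait_for_outage` are hypothetical names, the `ping` flags assume a Linux-style `ping` command, and the probe is injectable so the loop can be exercised without a real network.

```python
import subprocess
import time
from typing import Callable, Optional


def ping_once(host: str) -> bool:
    """Send one ICMP echo with the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_outage(
    host: str,
    probe: Callable[[str], bool] = ping_once,
    interval: float = 1.0,
    max_probes: Optional[int] = None,
) -> Optional[int]:
    """Probe repeatedly; return the index of the first failed probe.

    Returns None if max_probes probes all succeed.
    """
    count = 0
    while max_probes is None or count < max_probes:
        if not probe(host):
            return count  # first request timeout: the node is unreachable
        count += 1
        time.sleep(interval)
    return None
```

For example, `wait_for_outage("node-4.example")` (a made-up host name) would block until the first probe fails, which corresponds to the request timeout seen on screen.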

On the SolidFire system, it will wait for five minutes prior to starting the rebuild process. The reason is that if, during an upgrade or something of that nature, there were to be a reboot on one of the individual nodes, then we want to make sure that that reboot completes prior to rebuilding all of the data on there. Now that that brief period has passed, you can see underneath the alerts that our node's offline, along with various other warning messages associated with us not having that node in the respective copies of the data. Now that the five minutes have passed, the rebuild process will start immediately. We can see under the "Running Tasks" section right here the active progress on how quickly that rebuild process will occur. In this particular instance, we had lost the full node. Under the summary report, we are right around forty to fifty percent full on this four node cluster, and we can see that that full rebuild process under load is going to take us, according to this chart right here, right around ten minutes to complete.
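The five-minute wait described above amounts to a simple grace-period check. The sketch below only illustrates that decision; the function and constant names are invented for the example and do not reflect how SolidFire implements it.

```python
# Five-minute grace period before a rebuild starts, as described in the demo.
REBUILD_DELAY_SECONDS = 5 * 60


def should_start_rebuild(
    seconds_offline: float,
    node_is_back: bool,
    delay: float = REBUILD_DELAY_SECONDS,
) -> bool:
    """Decide whether to start rebuilding a node's data elsewhere.

    A node that returns within the grace period (for example after a
    reboot during an upgrade) does not trigger a rebuild.
    """
    if node_is_back:
        return False
    return seconds_offline >= delay


# A node rebooting for two minutes is left alone; one gone for five is rebuilt.
print(should_start_rebuild(120, node_is_back=False))
print(should_start_rebuild(300, node_is_back=False))
```

The trade-off is a short delay before healing begins in exchange for not copying a full node's worth of data every time a node reboots.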

In the interest of time, I'm going to fast-forward. Now the rebuild process is nearing completion at ninety-nine percent, with only a few seconds remaining. Look at the elapsed time. It's taken a little bit over ten minutes for this full node rebuild to complete. This would be the equivalent of losing, again, a full shelf on a traditional system. Now the entire rebuild process has completed and performance has been restored, as we can see here on the lower left, back to the original performance levels. You can also see here that the performance impact during that rebuild process was actually quite minimal, considering we went from a four node cluster and then dropped that down to a three node cluster with the rebuild process. We were only really at around a forty percent decline, despite the fact we had at least a twenty percent decline in the overall amount of performance in the system.
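To put rough numbers on dropping from four nodes to three, the sketch below estimates how full the surviving nodes become once the data has been redistributed. It rests on assumptions that are not from the demo: data is spread evenly, the amount of stored data does not change, and the only limit is raw capacity. The function names and the 0.45 fullness value (the midpoint of the "forty to fifty percent" mentioned earlier) are illustrative.

```python
def fullness_after_loss(nodes: int, fullness: float, lost: int) -> float:
    """Fraction of capacity used once `lost` nodes are gone and data is re-spread.

    Assumes equal-sized nodes and an even data distribution.
    """
    remaining = nodes - lost
    if remaining <= 0:
        raise ValueError("at least one node must remain")
    return fullness * nodes / remaining


def max_sequential_losses(nodes: int, fullness: float, limit: float = 1.0) -> int:
    """How many nodes can be lost one after another before data no longer fits."""
    lost = 0
    while nodes - (lost + 1) > 0 and fullness_after_loss(nodes, fullness, lost + 1) <= limit:
        lost += 1
    return lost


# Four-node cluster at an assumed 45% full.
for lost in (1, 2):
    used = fullness_after_loss(4, 0.45, lost)
    print(f"after losing {lost} node(s): about {used:.0%} full")
print("sequential losses that still fit:", max_sequential_losses(4, 0.45))
```

Under these assumptions, the fuller the cluster is, the fewer back-to-back node losses it can absorb, which is why the amount of free capacity matters when deciding how much redundancy an environment needs.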

Here's why, in a typical environment, be it service provider or enterprise, this ability to automate HA becomes so critical. What I just showed in this example was that we started with a cluster, then we pulled a node on that cluster, and the data automatically redistributed itself, load balanced, and self-healed without any manual intervention. From there, we could lose another node, continuing on and on down the process, based upon the amount of capacity within the system. The biggest question that needs to be asked is, in your particular environment, do we need one node of redundancy, two nodes of redundancy, or possibly even three or four nodes of redundancy for environments that are very, very critical in nature? The biggest reason for any type of self-healing HA is this: if we're going to end up automating the back end system through things like APIs, for example, or if we're going to try to simplify management as much as possible, so that a single administrator can manage as much data as possible and free up as much time as possible to go do the projects that, in the end, benefit the business, why shouldn't we be doing that on the high availability side? Only SolidFire can deliver that automated, self-healing high availability. For more information, visit or I would highly recommend reading the white paper on the cost of storage high availability. Thanks for watching.
