The Advantages of a Shared-Nothing Architecture for Truly Non-Disruptive Upgrades
In our Tech Field Day presentation on Modern All-Flash Architectures earlier this year, I talked about the trade-offs between shared-disk and shared-nothing architectures for data protection. Shared-nothing architectures, which don’t use a shared-disk shelf but instead protect data across multiple, independent nodes in a networked cluster, offer several advantages over controller-based architectures, including no single point of failure and self-healing capabilities. One thing I didn’t touch on was the significant advantage a shared-nothing architecture provides for non-disruptive upgrades.
Storage upgrades are never fun. If something goes wrong, the problems can start with extended downtime and end at complete data loss. This explains why both storage customers and storage vendors are very conservative when it comes to major upgrades. In particular, storage vendors don’t like to make major changes to on-disk data or metadata structures “in place,” and especially not “online.” This stark reality was highlighted recently when it was revealed that the next major XtremIO software upgrade would be not just disruptive, but destructive, requiring customers to completely back up and restore their data.
Chad Sakac of EMC was quick to respond that a destructive upgrade was necessary because the low-level disk structures are changing. He points out that this isn’t entirely unheard of in the industry, citing VNX, NetApp FAS, and VMFS as examples. Future XtremIO upgrades like cluster expansion and replication that don’t touch on-disk structures should (hopefully) be non-disruptive … but others might still be disruptive. At the scale of a single array, a disruptive (or worse, destructive) upgrade is a royal pain: downtime windows, data migration, loaner systems, etc.
What about at enterprise scale? At cloud scale? With hundreds or thousands of systems? You just created months of work.
Chad asks in the EMC blog post: Why do disruptive storage events ever happen anymore? It’s a question we asked ourselves when we started SolidFire. It turns out one major reason for disruptive storage upgrades is shared-disk architectures.
In a shared-disk architecture, there is only a single “source of truth” for a piece of data or metadata. Changes that dramatically affect the format or layout of the data or metadata are inherently risky, since there is no “backup.” By comparison, shared-nothing architectures have redundant copies of data distributed across multiple nodes in a cluster. This allows you to modify the format of data in one location (or completely migrate it off) while preserving known good copies elsewhere. Chad even notes that non-disruptive upgrades are common in object storage systems, but completely misses the reason for it. It’s not because of the storage protocol (object vs. block), but because object storage systems are almost universally shared-nothing architectures.
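To make the idea concrete, here is a minimal toy sketch in Python of a rolling format change across a replicated cluster. It is purely illustrative and is not how SolidFire (or any other product) actually implements upgrades: the `Node` and `Cluster` classes, the round-robin replica placement, and the `probe` callback are all hypothetical names invented for this example, and the "format change" is simulated by bumping a version number.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical storage node holding its own copy of some blocks."""
    name: str
    format_version: int = 1
    online: bool = True
    blocks: dict = field(default_factory=dict)  # block_id (int) -> bytes


class Cluster:
    """Toy shared-nothing cluster: each block lives on `replicas` nodes."""

    def __init__(self, node_names, replicas=2):
        self.nodes = [Node(n) for n in node_names]
        self.replicas = replicas

    def _owners(self, block_id):
        # Assumed placement rule: consecutive nodes, starting at block_id mod N.
        n = len(self.nodes)
        return [self.nodes[(block_id + i) % n] for i in range(self.replicas)]

    def write(self, block_id, data):
        for node in self._owners(block_id):
            node.blocks[block_id] = data

    def read(self, block_id):
        # Any online replica can serve the read.
        for node in self._owners(block_id):
            if node.online and block_id in node.blocks:
                return node.blocks[block_id]
        raise RuntimeError(f"no online replica for block {block_id}")

    def rolling_upgrade(self, new_version, probe=None):
        """Convert one node at a time while its peers keep serving I/O."""
        for node in self.nodes:
            node.online = False            # take a single node out of service
            if probe is not None:
                probe(self)                # client I/O during the upgrade
            node.format_version = new_version  # simulated on-disk rewrite
            # Check the converted copy against the untouched peer copies.
            for block_id, data in node.blocks.items():
                for peer in self._owners(block_id):
                    if peer is not node and peer.blocks[block_id] != data:
                        raise RuntimeError(f"mismatch on block {block_id}")
            node.online = True             # rejoin before touching the next node


def read_everything(cluster, block_ids=range(10)):
    """Stand-in for client traffic: read every block once."""
    return [cluster.read(b) for b in block_ids]


def demo(replicas):
    """Return True if all reads succeed throughout a rolling upgrade."""
    cluster = Cluster(["n1", "n2", "n3", "n4"], replicas=replicas)
    for b in range(10):
        cluster.write(b, f"payload-{b}".encode())
    try:
        cluster.rolling_upgrade(new_version=2, probe=read_everything)
    except RuntimeError:
        return False
    return all(n.format_version == 2 for n in cluster.nodes)
```

With two copies of every block, `demo(2)` completes the upgrade with every read served by a peer. With a single copy, `demo(1)` fails on the first node taken offline, which is the toy equivalent of having only one "source of truth."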
In the all-flash array world today, only SolidFire offers a true shared-nothing architecture. All other systems are based on either a dual-controller/shared-disk model, or worse, a single-controller model without redundancy.
SolidFire has proven the advantages of a shared-nothing architecture for enabling non-disruptive upgrades by offering 100% non-disruptive software AND hardware upgrades from Day 1 of GA over two years ago. No downtime, data migration, or planned outages have been required to upgrade software, firmware, or hardware across hundreds of upgrades in the field.
As EMC notes, on-disk data format changes absolutely are required for major upgrades and for adding significant new features. Our last major release, Carbon, included a 100% new on-disk format for volume metadata to enable more advanced snapshot features, real-time replication, and incremental block-level backup and restore. However, our shared-nothing architecture allowed us to perform this upgrade completely safely and non-disruptively.
Shared-nothing architectures provide many advantages in next generation data centers, including no single point of failure, self-healing capabilities, and simple dynamic cluster sizing. Fully non-disruptive hardware and software upgrades, even when on-disk formats need to be changed? That’s just the icing on the cake.
Posted in Next Generation Data Center.