Disaggregated Commodity Storage: Direct-Attach SAS and SATA Are Dead
The Problems with Commodity Servers and Direct-Attached Storage
Commodity servers for use in large clusters have two huge benefits: low cost and high bandwidth between the storage and the compute resources. However, there are several problems:
• Over-Provisioning: Since the storage must be ordered with the server at purchase time and, at scale, cannot realistically be changed for the lifetime of the server, we had to make sure we would not run out of storage space or I/O bandwidth during that lifetime, rather than tuning space and I/O bandwidth per application. So we loaded each of our Hadoop and Vertica nodes with 24 x 2.5” disk drives, not because we definitely needed them but so we would not run short if we did. We did the same with flash drives in the Aerospike nodes.
• Lockstep Life-cycle Management: Due to the relentless march of Moore’s Law, we decommission old racks and install new racks on a set 3-year schedule. When we upgrade to the newest processors and memory, we do not replace individual servers; we replace a full rack of servers, because extracting drives from old servers to install them in new ones just does not work logistically at scale. As a result, we replaced disk drives every time we replaced servers: drives that can easily have a lifetime of 6-8 years were retired after only 3 because the servers were replaced every 3 years.
• Silos and Lots of SKUs – We have several types of clusters (beyond just Hadoop and Aerospike), each with its own number of drives: 1, 2, 6, 12, or 24. Because different numbers of drives must be ordered up front, each configuration needs its own SKU. Further, the servers in one cluster cannot be used for another cluster due to the differing configurations, so each cluster is its own silo. As a result, we may have a surplus of the 6-disk configuration while being out of the 24-disk configuration.
• Failures are Failures – When a disk drive fails, the data is automatically reconstructed, but the server it is in is now down one drive, until and unless someone goes out into the data center, finds that server in its rack, finds the bad drive, and swaps it out. This is cumbersome at scale, which is why there has been much talk of the “hyperscalers” adopting “fail in place” strategies, where failed drives are not replaced at all but simply abandoned. The problem with fail in place and direct-attached storage is that the compute-to-storage ratio and the storage bandwidth change, often with negative impacts on application performance. Even worse, when a server fails in a clustered system, the cluster needs to “rebalance,” which usually involves shuffling a lot of data around the cluster. The data shuffling frequently degrades the performance of the entire cluster, with negative impact on the applications that rely on it, regardless of whether they are upstream or downstream of the cluster. Further, the rebalancing often takes quite a long time, making it extremely challenging to do rolling migrations from old gear to new gear on clusters that cannot be taken down and cut over.
Disaggregated Drives with Software Composable Infrastructure (SCI)
We have adopted DriveScale’s SCI architecture, in which we buy two racks of disk-light servers (each with only a local boot device) and one rack of JBODs. DriveScale’s technology allows us to attach the drives in the JBODs to our disk-light servers through our data center switch fabric. The DriveScale architecture is managed through either a GUI or a RESTful API from a central console, with numerous benefits:
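The core idea can be illustrated with a minimal in-memory sketch (the class and method names below are illustrative, not DriveScale's actual API): JBOD drives form a shared pool, and any disk-light server can have drives attached or detached under software control, at any time.

```python
# Minimal model of software-composable storage: a shared pool of JBOD
# drives attached to / detached from disk-light servers at run time.
# Names are hypothetical, for illustration only.

class DrivePool:
    def __init__(self, drive_ids):
        self.free = set(drive_ids)   # unassigned JBOD drive slots
        self.assigned = {}           # server name -> set of drive ids

    def attach(self, server, count):
        """Attach `count` free drives to `server`; returns the drive ids."""
        if count > len(self.free):
            raise RuntimeError("pool exhausted; add another JBOD rack")
        drives = {self.free.pop() for _ in range(count)}
        self.assigned.setdefault(server, set()).update(drives)
        return drives

    def detach_all(self, server):
        """Server leaves its cluster: its drives return to the shared pool."""
        drives = self.assigned.pop(server, set())
        self.free.update(drives)
        return drives

# A node joins a cluster with 5 drives, then later gives them back.
pool = DrivePool([f"jbod0-slot{i}" for i in range(60)])
pool.attach("node-17", 5)
assert len(pool.free) == 55
pool.detach_all("node-17")
assert len(pool.free) == 60
```

In the real system the attachment happens over the switch fabric and is driven through the central console's GUI or REST calls; the point of the sketch is simply that drive-to-server assignment becomes a software operation rather than a purchasing decision.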
• No More Over-Provisioning - Since we can add disks to a node any time we need to, we decided to see how few we could get by with. We configured a cluster with our standard 24 drives per node and ran the production job; then we configured it with 23 drives, then 22, and so on. We got all the way down to 4 drives before we saw a degradation in performance. Now we provision with 5 x 3.5” drives per node instead of 24 x 2.5” drives per node, saving a lot on both $/GB and total cost of ownership. If, in the future, we find that two racks of servers to one rack of disks is not enough storage or I/O bandwidth, we can always add another rack of storage. Similarly, if storage racks were not ordered with the maximum number of JBODs and disks the rack can support, we could add more drives by adding more JBODs to the storage racks (our storage racks are ordered “maxed out” based on power and rack/floor load limits).
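The drive-count experiment above can be reasoned about with a simple bottleneck model (the bandwidth figures below are illustrative assumptions, not our measured numbers): per-node throughput is capped by whichever is smaller, the aggregate drive bandwidth or the node's network/compute ceiling, so removing drives costs nothing until the drives themselves become the bottleneck.

```python
def node_throughput(n_drives, drive_mb_s=200, node_cap_mb_s=800):
    """Throughput is the lesser of aggregate drive bandwidth and the
    node's network/compute ceiling (illustrative numbers only)."""
    return min(n_drives * drive_mb_s, node_cap_mb_s)

# Sweep downward from 24 drives, as in the experiment: find the fewest
# drives that still sustain full throughput.
full = node_throughput(24)
min_drives = min(n for n in range(1, 25) if node_throughput(n) == full)
print(min_drives)  # with these illustrative numbers, 4 drives suffice
```

Under this model, nodes over-provisioned with 24 drives were paying for 20 drives of bandwidth they could never use; the disaggregated pool lets that capacity be assigned where it actually helps.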
• Separate Life-cycle Management – Our plan now is to swap out the processor racks every 3 years but hold on to the racks of JBODs for 6 years, producing significant savings.
• Consolidation of SKUs, Reduction in # of Silos – We still have several server types, generally differing only in the amount of memory but sometimes in CPU. But we no longer have to stock several different storage configurations, since storage is assigned under software control in the data center and can be changed at will. This also means that a server can “give up its drives,” leave a cluster, and then join a new cluster, getting drives assigned to it as required by the application it will run.
• Failure is not as Dramatic – When a drive fails, or even before it fails as it shows signs of degraded performance, it can be removed from its server and cluster and a fresh drive assigned to replace it, all through software control. The failed drive is marked “failed” and can be left in place or physically replaced on a convenient schedule, without disrupting any applications. When a server fails, all of its drives can be reassigned to other servers, avoiding the reconstruction of their data and the accompanying “network storm.”
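Why server failure no longer triggers a rebuild can be shown in a few lines (again a hypothetical sketch, with illustrative names): the data lives on the drives, not the server, so the failed server's drives are simply re-attached elsewhere and no blocks cross the network.

```python
# Sketch of drive reassignment on server failure. Because data resides on
# the pooled drives rather than the server, re-attaching the drives to a
# healthy server preserves the data with no reconstruction traffic.

drives = {"d1": ["blockA", "blockB"], "d2": ["blockC"]}  # drive -> data
attachments = {"server-1": ["d1", "d2"], "server-2": []}  # server -> drives

def reassign_on_failure(failed, target):
    """Move all of the failed server's drives to `target`; no data is copied."""
    attachments[target].extend(attachments.pop(failed))

reassign_on_failure("server-1", "server-2")
assert attachments["server-2"] == ["d1", "d2"]
assert drives["d1"] == ["blockA", "blockB"]  # data untouched on the drive
```

Contrast this with direct-attached storage, where a dead server takes its drives with it and the cluster must re-replicate every lost block from surviving copies.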
All of this results in significant cost savings and improved manageability and flexibility, with the same performance as direct-attached disks. We are moving rapidly to this new model of disaggregated storage with Software Composable Infrastructure and are very pleased with the new model.