I can’t believe we are at part 5 in this series already. Hopefully any newcomers to the subject have been able to pick up some concepts and terminology, and those with prior experience have had a refresher in some areas. In this post we sign off from the basics by looking at a number of typical ways that data can be addressed from an infrastructure perspective.
Block Storage
Disks and tapes are written in blocks of data to make them easily readable by machines. It’s the most primitive way of addressing storage, which makes it simple and predictable, so it remains popular. In many cases it is the right choice for applications that need an uninterrupted path to the raw data, usually because the intelligence for managing data integrity is applied at a layer above, for example a relational database or a platform-specific file system. Using block storage can seem easy, but it’s a question of administrative scalability: managing a small number of volumes can be straightforward, but managing 10,000 volumes is not such a walk in the park. Almost all storage systems ultimately use block storage as the means to store and retrieve data from a hard disk.
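To make the idea concrete, here is a minimal sketch of block addressing in Python. An ordinary file stands in for a raw block device (the file name `disk.img` and the 4 KiB block size are illustrative assumptions, not anything from a real array); the point is that a block is addressed purely by its number, and the application owns all the intelligence above that.

```python
import os

BLOCK_SIZE = 4096  # illustrative; real devices commonly use 512 B or 4 KiB sectors

def write_block(dev, block_no, data):
    """Write one fixed-size block at the byte offset implied by its number."""
    assert len(data) == BLOCK_SIZE
    dev.seek(block_no * BLOCK_SIZE)
    dev.write(data)

def read_block(dev, block_no):
    """Read one fixed-size block back by its block number."""
    dev.seek(block_no * BLOCK_SIZE)
    return dev.read(BLOCK_SIZE)

# An ordinary file stands in for a raw block device here; on Linux the same
# seek/read/write calls would work against a device node such as /dev/sdb.
with open("disk.img", "w+b") as dev:
    payload = b"hello".ljust(BLOCK_SIZE, b"\x00")  # pad to a full block
    write_block(dev, 7, payload)                   # the address is just a number
    assert read_block(dev, 7) == payload
os.remove("disk.img")
```

Notice there is no notion of a file name, directory, or owner at this level; anything like that has to be layered on top, which is exactly what a database or file system does.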
File Storage
When ‘File’ Storage is talked about, it’s usually synonymous with NAS (see below), but to understand it a little better it’s best to compare it to block storage. File storage is where access to the data is through a predefined file system, which basically means the method by which the storage system uses block storage has already been decided for you. This has its pros and cons, and these days, mostly pros. These storage systems remove and obscure some of the complexity of managing the block storage, many availability features are built in at the file system level, and when it comes down to it, files are simpler to explain to end users: what they see on their G:\ drive is a common point of reference for discussion. The main con is that data has to pass through a more complicated software stack, meaning more opportunities for bad code.
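The contrast with block storage is easiest to see in code. In this sketch the application addresses data by a path, and the file system decides how that maps onto blocks underneath; the file name is purely illustrative.

```python
import os

# With file storage the file system decides how data maps onto blocks;
# the application simply addresses data by path and byte offset.
with open("report.txt", "w") as f:
    f.write("quarterly figures\n")

with open("report.txt") as f:
    contents = f.read()   # the OS located the underlying blocks for us

assert contents == "quarterly figures\n"
os.remove("report.txt")
```

Compare this with the block sketch above: no block numbers, no padding, no bookkeeping, because all of that has been decided for you by the file system layer.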
DAS – Direct Attached Storage
DAS is back: it’s retro, it’s hip, groovy, cool and wicked. Well, maybe. DAS simply means that the storage array is attached directly to the host that will use it: no storage network between them and, for the most part, no sharing with others. This makes it potentially the cheapest form of mass storage, as the lower complexity leads to less expensive componentry. DAS usually provides only a block storage interface, so it pushes any incremental intelligence into the software stack inside the operating system of the connected host.
DAS – The Renaissance
The reason there is a renaissance for DAS is that for many years storage array vendors packed more and more functionality into the array, increasing its complexity and positioning it as a central hub where all data should be stored. There were several problems with this:
- the demand for low-cost capacity was at odds with the ‘real estate’ cost of putting a low-cost drive inside an expensive storage array
- the complexity of connectivity meant that the cost of attachment to the storage network for a host was prohibitive unless the data was deemed business critical
- the performance demanded of a centralised hub was too great in comparison to a federated model
- and the nail in the coffin: it created a point of failure that could take down a business, which could only be mitigated by more expensive components
Something had to change. The answer came from solving a bigger problem, a much bigger problem: the 500-million-person problem faced by Facebook, Google and Yahoo. It simply was not practical, sensible or physically possible for every host that needed to share a common store with any other host to be connected to the same storage. Storage networks were never designed to address that scale, nor were they intelligent enough to handle it. The problem needed to be solved at a more fundamental level, pushed up the stack so that the storage system could operate simply and reliably in a distributed and federated manner. We will go deeper into distributed file systems in a later post; suffice to say that a by-product of simplicity is reliability, along with a lack of dependency on a single supplier.

Pushing the storage intelligence into the operating system allows generic white-box (ODM) servers to perform any task that would usually have been destined for a dedicated storage controller. The aggregated performance of these federated white-box servers scales almost linearly with the number of servers involved, and has led to a classification of compute architecture known as hyper-converged. Put simply, hyper-converged uses the law of diminishing returns to produce a server that hits the commercial sweet spot for commodity processing power, memory, storage and network capability in a small rack-mountable form factor, with DAS bolted on for capacity-centric deployments and no other bells and whistles. That is why DAS is back in favour: there are a lot of capacity-centric deployments out there. Think iCloud, Google Photos, Facebook… it all needs to be stored somewhere.
NAS – Network Attached Storage
NAS takes block storage, mixes it with the file system layer we mentioned earlier, and presents the result over a network as ‘shares’. Home drives and virtual machine storage are great uses for NAS. It’s also one of the unsung heroes, the C-3PO of storage protocols: if you can speak a storage language, the odds are in your favour that a NAS can talk back to you, and the chances are it can speak all of those languages at the same time about the same piece of data. Quite cool.
CAS - Content Addressable Storage or ‘Object’
Object storage, or CAS, is about storing data, together with information about the data (called metadata), in an effectively bottomless bucket from which it can be retrieved by its unique address. This removes the complexity of the file system hierarchy, but introduces the requirement to manage the unique identifiers. Object stores hold much of the consumer data for web companies, such as documents, video and images. An object store will be one of the functions of the white-box servers mentioned earlier in this post, and is a good solution for data whose content remains fairly static.
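A toy sketch makes the ‘content addressable’ part clear: the address of an object is derived from the content itself (here a SHA-256 digest, a common choice, though real object stores vary), and a plain dictionary stands in for the bottomless bucket. The `ObjectStore` class and its methods are illustrative, not any real product’s API.

```python
import hashlib

class ObjectStore:
    """A toy content-addressable ('object') store: data is retrieved by a
    digest of its content rather than by a path in a hierarchy."""

    def __init__(self):
        self._bucket = {}  # a dict stands in for the 'bottomless bucket'

    def put(self, data, metadata=None):
        oid = hashlib.sha256(data).hexdigest()  # the unique address
        self._bucket[oid] = (data, metadata or {})
        return oid  # the caller must keep this identifier to retrieve the data

    def get(self, oid):
        return self._bucket[oid]

store = ObjectStore()
oid = store.put(b"holiday photo bytes", {"type": "image/jpeg"})
data, meta = store.get(oid)
assert data == b"holiday photo bytes"
```

A nice side effect of deriving the address from the content is that identical objects hash to the same identifier, so deduplication falls out almost for free; the flip side, as noted above, is that someone has to keep track of those identifiers.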
SAN – Storage Area Network
The storage area network – traditionally this refers solely to the connectivity between storage devices, but its use has been extended to cover any storage device connected to that network too. It is a network that allows storage devices to communicate, historically isolated from other networking connections due to both the separation of administration skills and the alternative protocols that tend to be in use. Moving forward, with the increase in reliability and Quality of Service (QoS) isolation on more ubiquitous IP networks, convergence is inevitable, enabling developments such as VMware VSAN.
Previously
- Storage Basics Part 1 – or From Disc to Discovery
- Storage Basics Part 2 – or A Chip Off The Old Block
- Storage Basics Part 3 – or Time Waits For No WAN
- Storage Basics Part 4 – or The Dead Parity Sketch
Next Time…
It’s time to let the basics settle in for a short while before coming back to the table with some more advanced areas. There are some good subjects already under development, so get your suggestions in for topics you would like to see. Finally, I had feedback on this basics series that you would like a link to the Dead Parrot sketch mentioned in Part 4, so here it is. Don’t forget to comment, get involved and get in touch using the boxes below, or reach out direct @glennaugustus