Qualifying the need for speed
Firstly, all data becomes old and cold. Data kept warm in object storage because it is still being accessed - e.g. streaming media content for a contemporary TV show - will eventually see user requests slow to a trickle, at which point the value of maintaining ‘always on’ availability is questionable. An economic trade-off can be made, weighing the benefit of fast access against the cost of maintaining the service. If a user is willing to wait for obscure or less frequently accessed data, there is no benefit in keeping that data on more expensive media.
Secondly, big data queries are not typically run on native data at rest; rather, the datasets are most likely pulled into a storage environment such as the Hadoop Distributed File System (HDFS), which permits interactive, high-bandwidth, complex analyses and sits alongside the compute power to handle the job. So the challenge is not running applications on the object storage archive directly, but moving data to the analysis platform easily and efficiently - in which case bandwidth, rather than storage latency, becomes the bigger issue.
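As a minimal sketch of that staging step - with an assumed bucket name, paths, endpoint and column name, and connector settings that will differ by environment - the following PySpark snippet reads an archived dataset from an S3-compatible object store, copies it onto HDFS, and then queries the staged copy:

    # Minimal sketch (assumed bucket, paths and schema): stage data from an
    # S3-compatible object store into HDFS, then analyse the staged copy with Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("archive-staging-sketch")
        # Endpoint and credentials for the object store are deployment-specific.
        .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.com")
        .getOrCreate()
    )

    # Pull the archived dataset over the network once; this is where bandwidth,
    # not storage latency, dominates the cost of the job.
    archived = spark.read.parquet("s3a://archive-bucket/social-media/2019/")

    # Stage it on HDFS so repeated, interactive queries run against local storage.
    archived.write.mode("overwrite").parquet("hdfs:///staging/social-media/2019/")

    # Subsequent analysis reads the staged copy, not the archive.
    staged = spark.read.parquet("hdfs:///staging/social-media/2019/")
    staged.groupBy("city").count().show()

The same pattern applies whether the archive behind the object interface is disk or tape: the one-off transfer is the expensive step, and everything after it runs at the speed of the analysis platform.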
Most datasets will need to be prepared for analysis - e.g. migrated from lowest-cost tape - rather than being kept online or nearline ‘just in case’. Thus, tape can be just another archive data source, since the application is probably not streaming directly from the archive store and time is less likely to be of the essence.
The use of AI in big data analysis doesn’t necessarily change this because data scientists are still responsible for determining the parameters of the analysis and creating the algorithmic boundaries within which their research operates.
I was reminded of this during a discussion with a data scientist at the Digital Future Society summit at Mobile World Congress 2019 in Barcelona. We were discussing the work of goodcitylife.org, an organisation that:
uses social media data to map the sensorial and emotional layers of cities. Those layers will make a variety of applications possible - from urban planning to health informatics.
The kind of big data analytics Good City Life promotes requires access to vast amounts of data - billions of data points gleaned and cleaned from a variety of pervasive social media platforms or commercial databases - but the planning of the interrogation is not time dependent. The data does not need to be sitting there waiting for real-time queries, because the analysis being undertaken requires careful definition. An example is ‘Happy Maps’, a project that uses crowdsourcing, geo-tagged pictures and the associated metadata to build an alternative cartography of a city weighted for human emotions, as sketched below.
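As a purely illustrative sketch - not Happy Maps’ actual method, and with made-up coordinates and scores - the idea can be reduced to bucketing geo-tagged, emotion-scored photos into a coarse grid and averaging the score per cell:

    # Toy sketch of emotion-weighted cartography (not the Happy Maps implementation):
    # bucket geo-tagged, emotion-scored photos into a coarse grid, average per cell.
    from collections import defaultdict

    # Hypothetical records: (latitude, longitude, happiness score in [0, 1]).
    photos = [
        (41.3851, 2.1734, 0.9),
        (41.3870, 2.1690, 0.7),
        (41.4036, 2.1744, 0.4),
    ]

    CELL = 0.01  # grid resolution in degrees (roughly 1 km)

    cells = defaultdict(list)
    for lat, lon, score in photos:
        key = (round(lat / CELL), round(lon / CELL))
        cells[key].append(score)

    # Average happiness per grid cell; a route planner could then prefer
    # ‘happier’ cells over merely shorter ones.
    happiness_map = {key: sum(s) / len(s) for key, s in cells.items()}
    for key, value in sorted(happiness_map.items()):
        print(key, round(value, 2))

The point is that the heavy lifting lies in defining and cleaning the inputs, not in how quickly the underlying archive can serve them.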
The deeper the analysis, the less time-critical it is
A recent article in TechRepublic summarised some of the limitations of automated non-human intelligence in the field of big data.
There is some argument that AI, ML, and deep learning are each individual technologies…. On the first tier of this platform sits AI, which analyses data and quickly delivers analytical outcomes to users. Machine learning sits on the tier two application of AI that not only analyses raw data, but it also looks for patterns in the data that can yield further insights. Deep learning is a third-tier application that analyses data and data patterns, but it goes even further. The computer also uses advanced algorithms developed by data scientists that ask more questions about the data with the ability to yield even more insights.
The deeper the analysis, the less time-critical it becomes. Nothing I have heard from data scientists themselves suggests that object storage should be regarded as a prerequisite for providing cheap, scalable storage to support HDFS, Spark or any other kind of advanced data research in an HPC environment - or, conversely, that the use of tape storage for archiving data would be a disadvantage. It would therefore seem sensible to continue to use the right technology, for the right purpose, at the right time.
LTO tape has the lowest cost of ownership of any storage technology for long-term retention of infrequently accessed but essential data. Object storage solutions, such as HPE Apollo and Scality RING, can complement inexpensive tape archives by offering greater immediacy and faster access to portions of the archive or to backup data, especially in comparison with primary and secondary platforms. In short, companies that use tape to underpin their hybrid IT or private cloud strategy, as part of a holistic approach, will have the best of both worlds.