Qualifying the need for speed
Firstly, all data becomes old and cold. Data kept warm in object storage because it is still being accessed - e.g. streaming media content for a contemporary TV show - will eventually see user requests slow to a trickle, at which point the value of maintaining ‘always on’ availability is questionable. An economic trade-off can be made, weighing the benefit of fast access against the cost of maintaining the service. If a user is willing to wait for obscure or less frequently accessed data, there is no benefit in keeping that data on more expensive media.
Secondly, big data queries are not typically run on native data at rest; rather, the datasets are most likely pulled into a storage environment such as the Hadoop Distributed File System (HDFS), which permits interactive, high-bandwidth, complex analyses and sits alongside the compute power to handle the job. So the challenge is not running applications on the object storage archive directly, but moving data to the analysis platform easily and efficiently - in which case bandwidth, rather than storage latency, becomes the bigger issue.
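As a minimal sketch of that staging step - with an assumed bucket name, paths, endpoint and column name, and connector settings that will differ by environment - the following PySpark snippet reads an archived dataset from an S3-compatible object store, copies it onto HDFS, and then queries the staged copy:

    # Minimal sketch (assumed bucket, paths and schema): stage data from an
    # S3-compatible object store into HDFS, then analyse the staged copy with Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("archive-staging-sketch")
        # Endpoint and credentials for the object store are deployment-specific.
        .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.example.com")
        .getOrCreate()
    )

    # Pull the archived dataset over the network once; this is where bandwidth,
    # not storage latency, dominates the cost of the job.
    archived = spark.read.parquet("s3a://archive-bucket/social-media/2019/")

    # Stage it on HDFS so repeated, interactive queries run against local storage.
    archived.write.mode("overwrite").parquet("hdfs:///staging/social-media/2019/")

    # Subsequent analysis reads the staged copy, not the archive.
    staged = spark.read.parquet("hdfs:///staging/social-media/2019/")
    staged.groupBy("city").count().show()

The same pattern applies whether the archive behind the object interface is disk or tape: the one-off transfer is the expensive step, and everything after it runs at the speed of the analysis platform.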
Most datasets will need to be prepared for analysis - e.g. migrated from lowest-cost tape - rather than being kept online or nearline ‘just in case’. Thus, tape can be just another archive data source, since the application is probably not streaming directly from the archive store and time is less likely to be of the essence.
The use of AI in big data analysis doesn’t necessarily change this because data scientists are still responsible for determining the parameters of the analysis and creating the algorithmic boundaries within which their research operates.
I was reminded of this during a discussion with a data scientist at the Digital Future Society summit at Mobile World Congress 2019 in Barcelona. We were discussing the work of goodcitylife.org, an organisation that:
uses social media data to map the sensorial and emotional layers of cities. Those layers will make a variety of applications possible - from urban planning to health informatics.
The kind of big data analytics Good City Life promotes requires access to vast amounts of data - billions of data points gleaned and cleaned from a variety of pervasive social media platforms or commercial databases - but the planning of the interrogation is not time dependent. The data does not need to be sitting there waiting for real-time queries, because the analysis being undertaken requires careful definition. An example is ‘Happy Maps’, a project that uses crowdsourcing, geo-tagged pictures and the associated metadata to build an alternative cartography of a city weighted for human emotions, as sketched below.
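As a purely illustrative sketch - not Happy Maps’ actual method, and with made-up coordinates and scores - the idea can be reduced to bucketing geo-tagged, emotion-scored photos into a coarse grid and averaging the score per cell:

    # Toy sketch of emotion-weighted cartography (not the Happy Maps implementation):
    # bucket geo-tagged, emotion-scored photos into a coarse grid, average per cell.
    from collections import defaultdict

    # Hypothetical records: (latitude, longitude, happiness score in [0, 1]).
    photos = [
        (41.3851, 2.1734, 0.9),
        (41.3870, 2.1690, 0.7),
        (41.4036, 2.1744, 0.4),
    ]

    CELL = 0.01  # grid resolution in degrees (roughly 1 km)

    cells = defaultdict(list)
    for lat, lon, score in photos:
        key = (round(lat / CELL), round(lon / CELL))
        cells[key].append(score)

    # Average happiness per grid cell; a route planner could then prefer
    # ‘happier’ cells over merely shorter ones.
    happiness_map = {key: sum(s) / len(s) for key, s in cells.items()}
    for key, value in sorted(happiness_map.items()):
        print(key, round(value, 2))

The point is that the heavy lifting lies in defining and cleaning the inputs, not in how quickly the underlying archive can serve them.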
The deeper the analysis, the less time-critical it is
A recent article in TechRepublic summarised some of the limitations of automated non-human intelligence in the field of big data.
There is some argument that AI, ML, and deep learning are each individual technologies…. On the first tier of this platform sits AI, which analyses data and quickly delivers analytical outcomes to users. Machine learning sits on the tier two application of AI that not only analyses raw data, but it also looks for patterns in the data that can yield further insights. Deep learning is a third-tier application that analyses data and data patterns, but it goes even further. The computer also uses advanced algorithms developed by data scientists that ask more questions about the data with the ability to yield even more insights.
The deeper the analysis, the less time-critical it becomes. Nothing I have heard from data scientists themselves suggests that object storage should be regarded as a prerequisite for providing cheap, scalable storage to support HDFS, Spark or any other kind of advanced data research in an HPC environment - or, conversely, that the use of tape storage for archiving data would be a disadvantage. It would therefore seem sensible to continue to use the right technology, for the right purpose, at the right time.
LTO tape has the lowest cost of ownership of any storage technology for long-term retention of infrequently accessed but essential data. Object storage solutions, such as HPE Apollo and Scality RING, can complement inexpensive tape archives by offering greater immediacy and faster access to portions of the archive or to backup data, especially in comparison with primary and secondary platforms. In short, companies that use tape to underpin their hybrid IT or private cloud strategy, as part of a holistic approach, will have the best of both worlds.