Storage for unstructured data (I/O) processing
In the last few posts we discussed a variety of analytics problems, edge-side processing and use cases. The data involved was primarily unstructured: data without a schema, such as video, audio, images, social media text and documents. As we all know, this data keeps growing, which leads to the following problems:
1) Ever-increasing TCO for scaling and maintaining on-premises storage systems
a. Most enterprises still need to store files on-premises, such as media files, backups and application-specific files, as well as industry-specific datasets. Examples of such industries are manufacturing, media and entertainment, oil and gas, telecom, and design and architecture.
b. Such data often needs online storage, that is, storage that allows sharing and distribution across users and across geographies.
2) Large siloed datastores on-premises
a. By using traditional, specialized storage servers on-premises, customers end up creating silos of datastores.
b. This siloed data is hard to use with state-of-the-art big data processing, or for that matter with media processing.
3) Inflexible protection for business continuity
a. There is no freedom to protect such data outside the on-premises data center, for example in any cloud of choice, in any number of locations, or in a location with high durability.
b. With a single cloud vendor there is lock-in: the relationship has to be maintained for a reasonable amount of time, and the customer pays a tax to maintain it.
So, if we think about it, we need “Freedom to Perform” with respect to both the business and the technology of storage for unstructured data. This matters more than ever because so many storage choices are available, or becoming available, for on-premises and public clouds.
On this note, object storage is a better fit for the important unstructured data use cases. Object storage is not new. It differs from traditional file-system-based storage and is designed to handle unstructured data workloads very well. Workloads and apps that are mostly read-intensive and write data as files, generally once, are good fits. Such storage is generally eventually consistent, so apps must also tolerate some delay when reading immediately after a write. For example, a typical video transcoding application produces a number of transcoded output files, each a few seconds long, for each video resolution required by the devices it supports. Such files are read-only unless one has to be corrected for some error, in which case that chunk of the file needs to be transcoded again. Each such video file can be streamed and played by a software-based player. If an update to a written file is needed, the entire file has to be rewritten along with the incremental change, unlike traditional file-system-based storage, which supports incremental writes to a file. This is OK for most unstructured data workloads and apps.
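To make the write-once, rewrite-on-update behaviour concrete, here is a minimal in-memory sketch in Python. The class `ToyObjectStore`, its `put` and `get` methods, the helper `update_object` and the object key are all hypothetical names made up for illustration, not any vendor's API. The only assumption is that an object store offers whole-object put and get, with no in-place partial write.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store (illustration only)."""

    def __init__(self):
        self._objects = {}       # key -> immutable bytes
        self.bytes_written = 0   # total bytes sent to the store

    def put(self, key: str, data: bytes) -> None:
        # A put always replaces the whole object; there is no partial write.
        self._objects[key] = bytes(data)
        self.bytes_written += len(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]


def update_object(store: ToyObjectStore, key: str, offset: int, patch: bytes) -> None:
    """Apply a small change by reading, patching and re-writing the entire object."""
    current = bytearray(store.get(key))
    current[offset:offset + len(patch)] = patch
    store.put(key, bytes(current))


store = ToyObjectStore()
key = "video/720p/segment-0001.ts"       # made-up key for one transcoded chunk
segment = b"\x00" * 1_000_000            # pretend this is a few seconds of video
store.put(key, segment)

# Correct two bytes somewhere in the chunk.
update_object(store, key, offset=10, patch=b"\xff\xff")

# The 2-byte fix cost a second full 1 MB write on top of the original upload.
print(store.bytes_written)
```

In this toy model a two-byte correction doubles the bytes written, which is why the pattern suits data that is written once and read many times, such as the transcoded chunks described above.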
Interesting things are happening in the cloud storage market, and specifically in the cloud object storage market. The global cloud storage market accounted for $34.6 billion in 2017 and is expected to reach $207.05 billion by 2026, growing at a CAGR of 21.9% (source: ResearchAndMarkets.com). The cloud object storage market, meanwhile, is projected to reach $6 billion by 2023, at a CAGR of 14% during the forecast period 2017-2023 (source: Marketwatch.com). This is interesting growth.
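As a quick sanity check on the first set of figures, the growth rate implied by the start and end values can be recomputed with the standard CAGR formula. The sketch below assumes that 2017 to 2026 counts as nine compounding years; the function name `cagr` is just an illustrative helper.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.22 means 22%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above, in billions of USD; 2017 to 2026 is taken as 9 years.
implied = cagr(34.6, 207.05, years=2026 - 2017)
print(f"Implied CAGR: {implied:.1%}")
```

Under that assumption the result comes out at roughly 22%, which is in line with the quoted 21.9%.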
The question is how to get our unstructured data working optimally with such storage. Is it just about storing it in on-premises or cloud-based object storage? How can we get the true native capability of such storage to serve analytics workloads, whether on the edge or cloud native?