Connecting Data and Analytics in a Cloud-First World
The technology industry is a good training ground for testing your astrological skills: there is so much suspense that even the best technologists get surprised. Today it is difficult to say which workloads, and what percentage of them, will stay on-premises and which will end up in the cloud. Knowing the future would be of great value to companies working feverishly on their strategies. What is becoming increasingly clear, though, is that customers want the flexibility and freedom to innovate, control over cost, and access to the best technology available, whether it is on-premises or in the cloud.
Over the last 10 to 15 years, people have learned how to put together many computers and spinning disks to solve seemingly impossible problems of scale, such as analyzing petabyte-scale datasets. This was made possible by the ideas Google shared about its file system and the MapReduce framework, which were adopted by the Apache Hadoop ecosystem. Yet in spite of all the innovation, many challenges remain. Not many businesses can put together hundreds of nodes to form a Hadoop cluster that does useful analytics. Building such an infrastructure is a daunting prospect, let alone tuning it, ingesting and cleaning data, writing bug-free programs, and debugging them. Thus we have a new generation of customers who are screaming, "Look, I want everything managed so I can focus on just getting the business outcomes!" Cloud vendors are listening, and the trend is increasingly to wrap all the complexity into managed services.
Against this background, we focus on one particular challenge: data fragmentation in large enterprises. Data has gravity. If data is generated at place A, the odds are that it will remain at place A for a long time, because moving it has many costs, the most obvious being duplication and network bandwidth. Enterprises have typically solved the problem of data fragmentation with well-defined data pipelines and ETL/ELT processes. Data pipelines are undoubtedly an elegant solution and an area of focus for us, but we also recognize the following challenges with them.
1. Tight coupling: Any change made to the data formats in the app can potentially break the data pipeline.
2. Inflexibility: The pipeline typically deposits the data into a data warehouse for analysis, which makes us dependent on the capabilities of that warehouse. For example, if the data is stored in on-prem Hadoop and a user wants to run AWS ML capabilities on it, that is not directly possible.
3. Silos of unaddressed data: Data pipelines usually address only a subset of the data generated. Stores of useful data continue to exist outside them, e.g. legacy storage, file servers, and application servers. Whenever a Line of Business (LoB) asks for storage, the typical response is to allocate it from whichever data store currently has capacity available. This increases the fragmentation of data.
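The tight-coupling problem in the first item can be illustrated with a minimal sketch. Everything here is hypothetical: the field names (`customer_id`, `customerId`, `amount`), the records, and the `transform` and `run_pipeline` helpers are invented for illustration and do not describe any particular pipeline product.

```python
# Hypothetical sketch: field names and records are invented for illustration.

def transform(record: dict) -> dict:
    """Pipeline step that assumes the app emits 'customer_id' and 'amount'."""
    return {
        "customer_id": record["customer_id"],
        "amount_cents": round(float(record["amount"]) * 100),
    }


def run_pipeline(records):
    """Apply transform to each record, separating loaded rows from rejects."""
    loaded, rejected = [], []
    for record in records:
        try:
            loaded.append(transform(record))
        except KeyError as missing:
            # The pipeline is coupled to the app's field names, so a rename
            # upstream surfaces here as a missing key.
            rejected.append((record, f"missing field {missing}"))
    return loaded, rejected


old_format = [{"customer_id": "c1", "amount": "12.50"}]
new_format = [{"customerId": "c1", "amount": "12.50"}]  # app renamed a field

loaded_old, rejected_old = run_pipeline(old_format)
loaded_new, rejected_new = run_pipeline(new_format)

print(len(loaded_old), len(rejected_old))
print(len(loaded_new), len(rejected_new))
```

With the old format the record loads; after the assumed rename, the same record is rejected even though the underlying data is unchanged, which is the kind of breakage the first item describes.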
Given these challenges, there is a growing need to expand the addressable data available for analytics, and then to make that data analyzable with state-of-the-art tools, which means connecting it to the best tools in a location-independent manner. This is one of the problems we are trying to solve here at IOPhysics Systems. If you are interested in hearing about our ideas and solutions, please feel free to drop me an email at email@example.com.