In the previous article, we discussed connecting data and analytics in a cloud-first world. In this article, we walk through a simple data integration workflow to the cloud. The workflow shows how IOPlane™ for Cloud makes it simple to use the power of the cloud for analytics.
Suppose you have a set of CSV (comma-separated values) files that are too large to open properly in MS Excel, containing, say, 10 million records. Such a file can be considered a small Big Data file (pardon the oxymoron :-)). We want an efficient way to query these files using cloud-based analytics. The query should be serverless, so we use a technology such as AWS Athena or S3 Select. In this example, we use S3 Select.
[root@iopxvm1 ~]# ls -lh 10million.csv
-rw-r--r-- 1 root root 1.2G Sep 3 07:34 10million.csv
We configure IOPlane™ for Cloud with the following:
1. Source path for CSV files: the directory where CSV files are ingested or generated.
2. Transformation script: converts CSV to Parquet.
3. Target path for the Parquet file.
4. AWS Target.
The transformation script reads the CSV file from /iopxsource, converts it to Parquet, and writes the result to /iopxsource1. The script uses the pandas and pyarrow libraries. The transformation reduces the 1.2 GB file to 345 MB, because the Parquet format combines compression with a columnar layout.
[root@iopxvm1 ~]# ls -lh /iopxsource1/10million.csv.parquet
-rw-r--r-- 1 root root 345M Sep 3 07:35 /iopxsource1/10million.csv.parquet
The file is then ingested into the cloud automatically using IOPlane™ resilient transfer. The IOPlane™ Catalog is updated, and the schema is detected automatically.
After the file is ingested, we run an S3 Select script to compute the average of a column.
[root@iopxvm1 ~]# time python s3select.py 10million.csv.parquet
Query:select AVG(s."Unit Price") from s3object s
Avg:266.03
Stats details bytesScanned:
5164966
Stats details bytesProcessed:
200000000
real 0m9.652s
user 0m0.651s
sys 0m0.045s
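The s3select.py script itself is not listed here. A minimal sketch of how such a script might issue the query and collect the statistics is shown below, assuming the boto3 S3 client and its select_object_content call. The function name run_s3_select and the shape of the returned values are assumptions for illustration. The client is passed in as a parameter so that the parsing logic can be exercised without AWS access.

```python
from typing import Any, Dict, Tuple


def run_s3_select(client: Any, bucket: str, key: str, expression: str) -> Tuple[str, Dict[str, int]]:
    """Run an S3 Select SQL expression against a Parquet object.

    Hypothetical helper: returns the result as CSV text plus the byte
    counters that S3 reports for the query.
    """
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"Parquet": {}},  # the object is a Parquet file
        OutputSerialization={"CSV": {}},     # results come back as CSV text
    )

    chunks = []
    stats: Dict[str, int] = {}
    # The payload is an event stream: Records events carry result bytes,
    # and a Stats event carries the scan counters.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            details = event["Stats"]["Details"]
            stats = {
                "bytesScanned": details["BytesScanned"],
                "bytesProcessed": details["BytesProcessed"],
                "bytesReturned": details["BytesReturned"],
            }
    # Join the raw bytes before decoding so multi-byte characters split
    # across events are handled correctly
    return b"".join(chunks).decode("utf-8"), stats
```

In real use, the client would be created with boto3.client("s3") and the call would look like run_s3_select(client, "my-bucket", "10million.csv.parquet", 'select AVG(s."Unit Price") from s3object s'), where "my-bucket" is a placeholder for the actual target bucket.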
Cost of Such a Solution
Setting minor costs aside, the principal cost is the data scanned by the S3 Select query.
Cost of S3 Select ~ $0.002 per GB Scanned ~ $2 per TB Scanned
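The query processes 200 MB, i.e. ~0.2 GB, of data (the bytesProcessed figure above; the reported bytesScanned is smaller, about 5 MB, so this is a conservative estimate). At $0.002 per GB, one dollar pays for 500 GB (0.5 TB) of scanning, so the query can be run ~2500 times. That is 2500 invocations of the query over 10 million records for a dollar!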
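The arithmetic can be checked with a few lines of Python. This is only a sketch of the estimate made here: it uses the $0.002 per GB price quoted above, treats 1 GB as 10^9 bytes, and ignores any other charges. The function name queries_per_dollar is made up for this example.

```python
PRICE_PER_GB_SCANNED = 0.002  # USD per GB, the figure quoted in this article


def queries_per_dollar(bytes_billed: int, price_per_gb: float = PRICE_PER_GB_SCANNED) -> float:
    """Return how many runs of a query one dollar pays for.

    bytes_billed is the data volume charged per run; 1 GB is taken as 10**9 bytes.
    """
    cost_per_query = (bytes_billed / 1e9) * price_per_gb
    return 1.0 / cost_per_query


# Using the bytesProcessed figure from the run above (200,000,000 bytes)
print(round(queries_per_dollar(200_000_000)))  # 2500

# Using the much smaller bytesScanned figure (5,164,966 bytes) instead
print(round(queries_per_dollar(5_164_966)))
```

The second call gives a far larger number of invocations per dollar, which is why the 200 MB figure is the more conservative basis for the estimate.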
Intelligent IOs,
IOPhysics Systems