Simplified & Cost-Effective Data Integration to the Cloud

In the previous article, we talked about connecting data and analytics in a cloud-first world. In this article, we walk through a simple data-integration workflow to the cloud, showing how IOPlane™ for Cloud makes it easy to leverage the power of the cloud for analytics.

Suppose you have a collection of CSV (comma-separated values) files that are too large to open properly in MS Excel, say 10 million records each. Such a file can be considered small Big Data (pardon the oxymoron :-)). We want an efficient way to query these files using cloud-based analytics, and the query should be serverless, so we use a technology like AWS Athena or S3 Select. In this particular example, we use S3 Select.
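If you want to reproduce the setup, a file of roughly this shape can be generated synthetically. Here is a minimal sketch; the column names are our assumption, chosen to match the query used later in this article:

# Generate a ~10-million-row sample CSV (column names assumed for illustration).
import csv
import random

with open("10million.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["Order ID", "Unit Price", "Units Sold"])
    for i in range(10_000_000):
        w.writerow([i, round(random.uniform(1.0, 500.0), 2),
                    random.randint(1, 100)])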

[root@iopxvm1 ~]# ls -lh 10million.csv

-rw-r--r-- 1 root root 1.2G Sep 3 07:34 10million.csv

We configure IOPlane™ for Cloud with the following:

1. Source path – where the CSV files are ingested/generated.

2. Transformation script – converts CSV to Parquet.

3. Target path – for the Parquet files.

4. AWS target.

[Figure: Configuration for the CSV-to-Parquet transformation]

The transformation script takes the file from /iopxsource and writes it, converted from CSV to Parquet, into /iopxsource1. The script uses the pandas and pyarrow libraries. This transformation shrinks the 1.2 GB file to 345 MB: the Parquet format brings the power of compression and a columnar layout to the data.
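The script itself is not listed in this article; a minimal sketch of such a conversion with pandas and pyarrow might look as follows (the actual IOPlane™ transformation script may differ):

# Minimal CSV-to-Parquet conversion sketch using pandas + pyarrow.
# Paths follow the source/target configuration above.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv("/iopxsource/10million.csv")        # read the source CSV
table = pa.Table.from_pandas(df)                     # convert to an Arrow table
pq.write_table(table,
               "/iopxsource1/10million.csv.parquet",
               compression="snappy")                 # compressed, columnar output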

[root@iopxvm1 ~]# ls -lh /iopxsource1/10million.csv.parquet

-rw-r--r-- 1 root root 345M Sep 3 07:35 /iopxsource1/10million.csv.parquet

The file is then ingested automatically to the cloud using IOPlane™ resilient transfer, the IOPlane™ Catalog is updated, and the schema is detected automatically.
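Schema detection is handled by IOPlane™ internally; purely for illustration, here is how the schema embedded in a Parquet file can be inspected with pyarrow:

# Inspect the schema stored in the Parquet file (illustration only,
# not the product's internal mechanism).
import pyarrow.parquet as pq

schema = pq.read_schema("/iopxsource1/10million.csv.parquet")
print(schema)   # column names and types, e.g. "Unit Price: double"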

[Figure: IOPlane™ catalog showing the source CSV and the target Parquet file, on-prem and in the cloud]

After the file is ingested, we run an S3 Select script to compute the average of a column.
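The original s3select.py is not listed here; a minimal sketch of what such a script could look like, using boto3's select_object_content API, follows. The bucket name is a placeholder, and the output formatting mirrors the run below:

# Hypothetical sketch of s3select.py using boto3; the bucket name is
# an assumption, not taken from the original script.
import sys
import boto3

QUERY = 'select AVG(s."Unit Price") from s3object s'

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="iopx-target-bucket",         # assumed bucket name
    Key=sys.argv[1],                     # e.g. 10million.csv.parquet
    ExpressionType="SQL",
    Expression=QUERY,
    InputSerialization={"Parquet": {}},  # query the Parquet file directly
    OutputSerialization={"CSV": {}},
)

print("Query:" + QUERY)
for event in resp["Payload"]:            # event stream: Records, Stats, End
    if "Records" in event:               # a single value here, so one event
        print("Avg:" + event["Records"]["Payload"].decode().strip())
    elif "Stats" in event:
        d = event["Stats"]["Details"]
        print("Stats details bytesScanned:", d["BytesScanned"])
        print("Stats details bytesProcessed:", d["BytesProcessed"])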

[root@iopxvm1 ~]# time python s3select.py 10million.csv.parquet

Query:select AVG(s."Unit Price") from s3object s
Avg:266.03
Stats details bytesScanned: 5164966
Stats details bytesProcessed: 200000000

real   0m9.652s
user   0m0.651s
sys    0m0.045s

Cost of Such a Solution

Setting aside minor costs, the principal cost driver is the amount of data scanned by the S3 Select query.

Cost of S3 Select ~ $0.002 per GB scanned ~ $2 per TB scanned

The query processes 200 MB, i.e. ~0.2 GB, of data (the bytesProcessed figure above). Hence, to reach one dollar of cost, the query needs to scan 0.5 TB, i.e. it needs to run ~2,500 times. We get 2,500 invocations of the query over 10 million records for a dollar!
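As a quick sanity check on this arithmetic:

# Back-of-the-envelope check of the cost math above.
PRICE_PER_GB = 0.002                   # USD per GB scanned (rate quoted above)
bytes_processed = 200_000_000          # from the query stats: ~0.2 GB
cost_per_query = (bytes_processed / 1e9) * PRICE_PER_GB
print(f"cost per query : ${cost_per_query:.4f}")      # ~$0.0004
print(f"queries per $1 : {1 / cost_per_query:,.0f}")  # ~2,500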

Intelligent IOs,
IOPhysics Systems