BlazingSQL review: Fast ETL for GPU-based data science

BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem. BlazingSQL allows standard SQL queries to be distributed across GPU clusters, and the results to be fed directly into GPU-accelerated visualization and machine learning libraries. Basically, BlazingSQL provides the ETL portion of an all-GPU data science workflow.

RAPIDS is a suite of open source software libraries and APIs, incubated by Nvidia, that uses CUDA and is based on the Apache Arrow columnar memory format. cuDF, part of RAPIDS, is a Pandas-like DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data on GPUs.

For distributed SQL query execution, BlazingSQL draws on Dask, which is an open source tool that can scale Python packages to multiple machines. Dask can distribute data and computation over multiple GPUs, either in the same system or in a multi-node cluster. Dask integrates with RAPIDS cuDF, XGBoost, and RAPIDS cuML for GPU-accelerated data analytics and machine learning.

BlazingSQL is a SQL interface for cuDF, with various features to support large-scale data science workflows and enterprise datasets, including support for the dask-cudf library maintained by the RAPIDS project. BlazingSQL allows you to query data stored externally (such as in Amazon S3, Google Storage, or HDFS) using simple SQL; the results of your SQL queries are GPU DataFrames (GDFs), which are immediately accessible to any RAPIDS library for data science workloads.
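
To make that workflow concrete, here is a minimal sketch of querying a file through BlazingSQL and getting a GPU DataFrame back. It assumes a machine with an Nvidia GPU and the blazingsql, dask, and dask_cuda packages installed. The table name, column names, and the `build_query` and `run_query` helpers are hypothetical, chosen only for illustration; the `network_interface="lo"` setting is an assumption that suits a single-machine cluster.

```python
"""Sketch: run a SQL query with BlazingSQL and receive a GPU DataFrame."""


def build_query(table: str, min_distance: float) -> str:
    """Return a plain SQL string. Table and column names are hypothetical."""
    return (
        "SELECT passenger_count, AVG(fare_amount) AS avg_fare "
        f"FROM {table} "
        f"WHERE trip_distance > {float(min_distance)} "
        "GROUP BY passenger_count "
        "ORDER BY passenger_count"
    )


def run_query(data_path: str, distributed: bool = False):
    """Register the file at data_path as a table and query it on the GPU."""
    # GPU-only imports are deferred so this module can be imported anywhere.
    from blazingsql import BlazingContext

    if distributed:
        # Assumption: one Dask worker per local GPU, talking over loopback.
        from dask.distributed import Client
        from dask_cuda import LocalCUDACluster

        client = Client(LocalCUDACluster())
        bc = BlazingContext(dask_client=client, network_interface="lo")
    else:
        bc = BlazingContext()

    # The file is exposed to SQL under the table name "taxi".
    bc.create_table("taxi", data_path)

    # Single GPU: the result is a cuDF DataFrame.
    # With a Dask client: the result is a dask_cudf DataFrame.
    return bc.sql(build_query("taxi", 2.0))
```

On a suitable machine, calling `run_query` with the path to a CSV or Parquet file returns a DataFrame that can be passed on to other RAPIDS libraries such as cuML. The SQL text itself is ordinary SQL, which is why it is built separately from the GPU-specific setup.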

The BlazingSQL code is an open source project released under the Apache 2.0 License. The BlazingSQL Notebooks site is a service using BlazingSQL, RAPIDS, and JupyterLab, built on AWS. It currently uses g4dn.xlarge instances and Nvidia T4 GPUs. There are plans to upgrade some of the larger BlazingSQL Notebooks cluster sizes to A100 GPUs in the future.