Running Spark on a GPU-enabled cluster brings together two major trends in big data. In recent decades, the GPU market has grown explosively as companies integrated AI and other HPC workloads into their businesses. TensorFlow, a framework that uses GPUs for numerical computation and neural networks, is a testament to the rise of AI and the resulting demand for GPUs. Meanwhile, the need for big data and powerful data processing engines has never been greater, now that hundreds of companies gather data in the petabyte range.

By building infrastructure that combines big data engines such as Spark with GPUs and other high-performance hardware, data scientists and data engineers can now tackle scenarios that would not otherwise have been possible.

With the recent release of the latest GPUs, it is now possible to run Spark on a GPU-enabled cluster using the Azure Distributed Data Engineering Toolkit (AZTK). With a single command, AZTK provisions on-demand GPU-enabled Spark clusters on top of Azure Batch’s infrastructure, helping users take high-performance implementations that are usually single-node only and distribute them across the Spark cluster.
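To make the idea of distributing a single-node implementation concrete, here is a minimal sketch in PySpark. It assumes an existing SparkContext (passed in as sc) and uses a plain Python function as a stand-in for a GPU-accelerated routine. The function names are hypothetical and are not part of AZTK.

```python
from typing import Iterable, Iterator, List


def sum_of_squares(rows: Iterable[float]) -> Iterator[float]:
    """Single-node kernel: reduce one partition to one partial result.

    Plain Python stands in here for a GPU-accelerated routine.
    """
    total = 0.0
    for x in rows:
        total += x * x
    yield total


def distributed_sum_of_squares(sc, data: List[float], num_partitions: int) -> float:
    """Run the kernel once per partition and combine the partial results.

    `sc` is assumed to be an existing pyspark.SparkContext.
    """
    rdd = sc.parallelize(data, num_partitions)    # split the data across the cluster
    partials = rdd.mapPartitions(sum_of_squares)  # one kernel call per partition
    return partials.sum()                         # combine the partial sums
```

Calling the kernel once per partition through mapPartitions, rather than once per element, is one way to keep the per-call overhead of a GPU routine small relative to the work it does.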

GPU-enabled Docker images for AZTK are already available. These include:

  • Python image that comes packaged with Anaconda, Jupyter, and PySpark.
  • R image that comes packaged with Tidyverse, RStudio-Server, and SparklyR.

These images use the NVIDIA Docker engine to give the Docker containers access to the host’s GPUs. Because AZTK runs Spark entirely in containers, users can customize their GPU Docker images to suit their own needs. Users who simply want to run Spark on a GPU-enabled cluster need not worry about Docker at all: AZTK pulls the appropriate image automatically and provides access to the GPUs if any are detected on the host machine.

The example below illustrates a four-node GPU-enabled Spark cluster created with AZTK:

$ aztk spark cluster create --id my_gpu_cluster --size 4 --vm-size standard_nc6

AZTK knows that Standard NC6 VMs come with NVIDIA’s Tesla K80 GPUs, so it automatically chooses a GPU-enabled Docker image when provisioning the cluster. It is also possible to specify an image of your choice manually.