Azure SQL Data Warehouse showed to be a fast and flexible analytics platform. Enhancements in the distributed query execution layer as per the recent experience result in the significant query performance improvements for Azure SQL Data Warehouse (SQL DW).
I/O bandwidth to storage and repartitioning speed (shuffle speed) determine the analytics workload performance. In this article, we are going to see how the shuffling speed has improved.
Data movement is a process of moving parts of the distributed tables to different nodes during the query execution. This is usually required when the tables do not share the distribution key with shuffle being the most common operation. In this case, SQL DW computes a hash value using the join columns for each input row and sends that row to the node that owns that hash value. Any of the join sides can participate in the shuffle. Here is an example implementing join between tables T1 and T2 with neither of the tables being distributed on the join column col2.
A broadcast is another common data movement operation where table parts are copied from the source node to all the other SQL DW nodes. An example of a broadcast is when joining a dimension and a fact table, where the dimension table is commonly broadcast. In order to minimize the number of rows transferred the appropriate data movement type is selected by SQL DW query optimizer.
The below diagram shows the SQL DW Data Movement Service (DMS) component that used to perform the data movement operations.
When implementing shuffle, the Data Movement Service copies rows out of SQL Server Engine, hashes them and sends them to the appropriate instance of DMS on other nodes. Next, the DMS copies rows to the local temporary tables using SQL Server BulkCopy API. A bottleneck may occur while reading rows out of SQL Server as this is a single threaded operation.
Now the data movement is directly integrated into SQL Server engine via SQL DW which allows benefiting from full multi-core parallelism as well as the use of batch mode execution for data movement operations.
When combined with the use of Azure Accelerated Networking, SQL DW is able to move data up to one gigabyte a sec per node which significantly improves the queries that join data from tables not aligned on their distribution keys.
Below is the shuffle using SQL DW instant data movement mode.
When the data is being moved in the instant mode, the intermediate results are produced and sent to all the required nodes using all available CPU cores, taking advantage of the multi-core trend. Furthermore, minimal per row overhead is achieved as the data is written in a compact batch-oriented row format directly to the remote node’s temporary database and it does not cross SQL Server Engine front-door which performs data validations thus adding a significant cost.
All existing and new customers of Compute Optimized Gen2 tier of SQL DW may now enjoy this capability. SQL DW instant data movement is used for queries that do not use external data; PolyBase queries remain using Data Movement Service.