Pushdown
Important: We're excited to offer a new Pushdown experience! The version of Pushdown described on this page was released as a beta feature for some data sources and had limited processing capabilities for both the Full Profile and Count options. Check out the new documentation page to learn more about the enhanced Pushdown experience.
Why use pushdown
To analyze a dataset for DQ findings as part of a DQ Job, CDQ uses Spark as the compute engine to power the analysis. This requires the dataset to be loaded into Spark as a Spark DataFrame, and DQ Job performance (job completion speed) is limited by the Spark resources available and the complexity of the job. The transfer of data from the data source to Spark depends on the following factors:
- Bandwidth limitation at the data source

  If the DQ Job requires pulling 100 million rows from a SQL database, then the input/output limit on transferring data out of the SQL database into the Spark cluster greatly increases the overall runtime of the DQ Job.
- Compute limitation at the data source for complex queries

  If the DQ Job requires complex queries to create the dataset (via `-q`), then this computation is done at the SQL database level. Some examples of "complex" queries are:

  ```
  -q "SELECT * FROM public.very_long_table_of_transactions WHERE date = ${rd} and department = 'finance'"
  ```

  This query can take a long time at the database level because the `WHERE` clause filters a very long table on the columns `date` and `department`, which may not be indexed properly, leading to a full table scan with no `LIMIT` clause specified. You can check how the database plans such a query with `EXPLAIN`, as sketched after this list.

  ```
  -q "SELECT * FROM public.very_long_table_of_transactions transactions INNER JOIN public.departments departments ON transactions.department_id = departments.id WHERE transactions.date = ${rd} and departments.name = 'finance'"
  ```

  This query can take a long time at the database level due to the join between the two tables.
- Resource limitation at the data source

  The data source may not have enough hardware resources available to efficiently fulfill the query and/or handle multiple DQ Jobs and other non-CDQ applications requesting data.
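For intuition, here is a minimal diagnostic sketch, assuming a PostgreSQL-style database; the literal date stands in for the `${rd}` runtime variable:

```sql
-- Hypothetical diagnostic, assuming a PostgreSQL-style database.
-- EXPLAIN reports the query plan without executing the query.
EXPLAIN
SELECT *
FROM public.very_long_table_of_transactions
WHERE date = '2023-01-01'        -- stands in for the ${rd} runtime variable
  AND department = 'finance';
-- A "Seq Scan" node in the resulting plan indicates a full table scan,
-- i.e. the date/department columns are not backed by a usable index.
```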
The bottleneck due to data transfer can be viewed on the Jobs page > Job Logs > LOAD stage. This data loading step is the first step of every DQ Job. However, the pushdown feature can be used to reduce (not eliminate) the data transfer and compute between the data source and Spark, provided the DQ Job does not require loading all the data into Spark.
In summary, the time to load data from the data source into Spark is the `-q` query compute time at the data source plus the network transfer time of the `-q` result between the data source and Spark. Pushdown can reduce both elements, but the efficiency gained depends on the complexity of the `-q` query and on how large the `-q` result would have been without pushdown.
How pushdown improves efficiency
Some of the Spark compute performed by the DQ Job can be translated into SQL queries that most relational databases support natively. In that case, the DQ Job does not need to load all the rows of the dataset. Instead, the DQ Job can query the data source for the results of those SQL queries and reduce the amount of data transferred out of the data source. The results of these SQL queries almost always amount to less data than the full dataset defined by `-q`. Only some DQ Job features require the full dataset to be loaded into Spark. Therefore, pushdown can be a useful tool to speed up the overall DQ Job, provided that executing these SQL queries is faster than transferring the data out of the data source into Spark. In most use cases, pushdown leads to faster DQ Job execution for large datasets. If the `-q` query is sufficiently complex, however, the time saved by transferring less data into Spark can be cancelled out by the multiple SQL queries the pushdown process makes to the data source (because each query may repeat redundant compute due to the complexity of `-q`).
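For intuition, here is a minimal sketch of the translation for one metric, a row count; the SQL that CDQ actually generates is internal and may differ:

```sql
-- Without pushdown, Spark must pull every matching row in order to count them:
SELECT * FROM public.very_long_table_of_transactions WHERE department = 'finance';

-- With pushdown, the same metric is computed at the source, and only a
-- single row of results crosses the network:
SELECT COUNT(*) FROM public.very_long_table_of_transactions WHERE department = 'finance';
```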
How pushdown works in CDQ
Using pushdown only reduces the amount of data transferred out of the data source. It does NOT skip the LOAD stage of the DQ Job. Every DQ Job requires a small sample of rows (10-20) of the dataset defined by `-q` in order to generate the Data Preview and analyze schema information for the dataset run. This means the `-q` query may be fully computed at the data source before sampling can occur (depending on the complexity of `-q`). In that case, sampling 10-20 rows of data is not a quick and immediate LOAD stage, and the only efficiency gain comes from not transferring the full dataset between the data source and Spark.
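Conceptually, the sampling behaves like wrapping the `-q` query in a subquery with a row limit. This is a sketch only; the exact mechanism CDQ uses is internal:

```sql
-- Sketch of the LOAD-stage sampling. Note that the inner -q query may
-- still be fully computed by the database before the limit applies.
SELECT *
FROM (
  SELECT * FROM public.very_long_table_of_transactions WHERE date = ${rd}
) sampled
LIMIT 20;
```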
Therefore, the pushdown feature is most efficient when `-q` is a simple SELECT query with simple WHERE filtering. The benefit comes from the fact that if the dataset defined by `-q` contains 100 million rows, only 10-20 of those rows are loaded into Spark.
Profile with pushdown then generates a series of SQL queries and queries the data source again for aggregate metric data. Depending on the dataset, these multiple SQL queries can be more efficient than loading all the data into Spark and computing the aggregate metrics there. The results of Profile with pushdown and Profile without pushdown are (practically) identical.
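As a rough illustration (the queries CDQ generates are internal, and the column name `amount` is hypothetical), several of the pushdown-supported metrics in the table below map to a single aggregate query:

```sql
-- Sketch: pushdown-style aggregates for one hypothetical numeric column.
SELECT
  COUNT(*)                                        AS row_count,
  COUNT(DISTINCT amount)                          AS distinct_count,
  AVG(amount)                                     AS mean,
  MIN(amount)                                     AS min_value,
  MAX(amount)                                     AS max_value,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_count
FROM public.very_long_table_of_transactions;
```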
Profile: pushdown vs no pushdown
Here is a summary of Profile activities and their pushdown support:
Feature | Supports pushdown | Description |
---|---|---|
Row Count | Yes | Computes row count of the dataset. |
Distinct Count | Yes | The number of distinct values in a column. |
Mean | Yes | The average of all the values in the column. Supports numeric columns only. |
Min / Max | Yes | The minimum and maximum values of the column. Supports numeric and boolean columns only. |
NULL Count | Yes | The number of null values in the column. |
EMPTY Count | Yes | The number of empty values in the column. Supports string columns only. |
TYPE Count | Yes | The number of different types inferred in a column (if any). |
TopN / BottomN | No | Computes the top 5 most frequent (TopN) and least frequent (BottomN) values. Supports all types. This result is displayed as a frequency bar chart on the Profile page. If pushdown is enabled, TopN and BottomN values are not displayed, and related features like Stat Rules (Distribution) are also disabled. |
Data Shape Detection | No | Detects shapes of the values based on Shape parameters provided (automatic or manual). |
Histogram | No | Creates histogram of the values in the column. |
Correlation Matrix | No | Creates correlation matrix. Supports only numeric columns. |