Pushdown processing
Pushdown is an alternative processing method for running a DQ job, in which all of the job's processing is submitted to a SQL data warehouse. DQ jobs with Pushdown processing generate SQL queries that offload the compute directly to the data source, reducing data transfer and the need for Spark compute.
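To make the idea concrete, the following is an illustrative sketch of the kind of profiling aggregate a Pushdown job might submit to the warehouse instead of pulling rows into Spark. The table and column names (nyse.trades, fname, price) are hypothetical, and the SQL that Collibra DQ actually generates may differ.

```sql
-- Illustrative sketch only: a profiling aggregate computed in-warehouse, so that
-- only a single summary row, not the raw data, leaves the data source.
-- Table and column names are hypothetical.
SELECT
    COUNT(*)              AS row_count,
    COUNT(fname)          AS fname_non_null,
    COUNT(DISTINCT fname) AS fname_distinct,
    MIN(price)            AS price_min,
    MAX(price)            AS price_max
FROM nyse.trades;
```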
With Pushdown, you can also scale your compute to the specific requirements of your DQ job. The architecture of Pushdown data sources features auto-scaling, which allows the warehouse to automatically scale up, or burst, to 64 or 128 nodes when a job requires more processing power, and to scale back down when it does not. With auto-scaling, runtime performance improves and you avoid the egress costs of reading large amounts of data.
Throughout the app, you can identify Pushdown jobs by the Pushdown icon.
Benefits of Pushdown
By running a Pushdown job, you can:
- Reduce latency.
- Eliminate dependencies on Spark compute to run Collibra DQ and increase processing speeds.
- Eliminate egress latency when running DQ jobs against large datasets.
- Take advantage of the auto-scaling of data warehouses, such as Snowflake and Databricks.
Prerequisites for using Pushdown
Before running Pushdown jobs, a Collibra DQ user with Admin permissions must enable Pushdown for your data source. Pushdown can be enabled from the data source's connection template in Admin Console > Connections.
Tip For first-time Snowflake Pushdown configurations, we recommend that you also run the Pushdown setup script successfully. If you do not run the setup script, ensure that you meet all of the criteria that the script would otherwise accomplish.
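For reference, the following is a hypothetical illustration of the kind of access such a setup typically establishes in Snowflake. The role, warehouse, database, and schema names are placeholders, and the actual setup script's contents may differ; consult the script itself for the authoritative statements.

```sql
-- Hypothetical illustration only: typical Snowflake grants for a Pushdown
-- service role. DQ_ROLE, DQ_WH, and ANALYTICS_DB are placeholder names; the
-- actual Pushdown setup script may grant different or additional privileges.
GRANT USAGE ON WAREHOUSE DQ_WH TO ROLE DQ_ROLE;
GRANT USAGE ON DATABASE ANALYTICS_DB TO ROLE DQ_ROLE;
GRANT USAGE ON SCHEMA ANALYTICS_DB.PUBLIC TO ROLE DQ_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS_DB.PUBLIC TO ROLE DQ_ROLE;
```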
Available data sources and layers
Pushdown is currently available for the following data sources and layers:
In the following table, the basic layers are Rules, Behavior (Adaptive Rules), Schema, and Archive Break Records; the advanced layers are Outliers, Dupes, Shapes, Patterns, Records, and Mapping (Source to Target).

| Data Source | Lifecycle | Rules | Behavior (Adaptive Rules) | Schema | Archive Break Records | Outliers | Dupes | Shapes | Patterns | Records | Mapping (Source to Target) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Snowflake | Generally available | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Databricks | Generally available | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No |
| BigQuery | Generally available | Yes | Yes | Yes | Yes | No | Yes | Yes | No | No | No |
| Athena | Generally available | Yes | Yes | Yes | Yes | No | No | No | No | No | No |
| Redshift | Generally available | Yes | Yes | Yes | Yes | No | No | No | No | No | No |
| Trino | Generally available | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No |
| SQL Server | Generally available | Yes | Yes | Yes | Yes | No | No | No | No | No | No |
| SAP HANA | Generally available | Yes | Yes | Yes | Yes¹ | No | Yes² | Yes | No | No | No |

¹ Note Currently, only Dupes and Rules break records are eligible for archival to SAP HANA.
² Note Currently, SAP HANA Pushdown only scans for exact match dupes. Support for fuzzy match dupes is planned for an upcoming release.
Running rules on Pushdown jobs
All rules that execute on Pushdown jobs use the SQL dialect native to the data warehouse to which they are submitted for processing. Because Pushdown submits queries directly to the data source, the processing time of SQL rules is reduced considerably.
The following table provides examples of how rules run on Pushdown jobs.
Note The Translated Query Example column in the following table shows how the rule in the Collibra DQ Query Example column is translated when it is submitted to the data warehouse for processing.
| Feature | Query Type | Definition | From | Collibra DQ Query Example | Translated Query Example |
|---|---|---|---|---|---|
| Native SQL | Physical Query (PQ) | Runs the entire query in the native database dialect. | nyse.dataset | `select * from nyse.dataset where fname like 'John'` | `select * from nyse.dataset where fname like 'John'` |
| Freeform SQL | Logical Query (LQ) | Runs the query, including the original scope. | @nyse-dq-dataset | `select * from @nyse-dq-dataset where fname = 'John'` | `select * from (select * from nyse-dq-dataset where d_date='2023-01-01') where fname = 'John'` |
| Freeform SQL | Template Logical Query (TLQ) | Runs the query, including the original scope, without specifying the dataset name. | @dataset | `select * from @dataset where fname = 'John'` | `select * from (select * where d_date='2023-01-01') where fname = 'John'` |
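As a further illustration of what executes in-warehouse, the following is a hedged sketch of the kind of pass/fail aggregation a pushed-down rule evaluation might run against the data source. It reuses the hypothetical nyse.dataset example from the table above; the SQL that Collibra DQ actually generates may differ.

```sql
-- Illustrative sketch only: count breaking vs. total rows for the example rule
-- condition entirely in-warehouse, so only the two counts are returned.
-- Table name and condition are the hypothetical examples above.
SELECT
    SUM(CASE WHEN fname LIKE 'John' THEN 1 ELSE 0 END) AS breaking_records,
    COUNT(*)                                           AS total_records
FROM nyse.dataset;
```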
Known limitations
- We strongly recommend avoiding special characters in table and column names for Pushdown datasets.
- Currently, you cannot run a Pushdown job from the command line.
- Mixed data type rules are not supported in Pushdown.
- Binary data types are not supported in Pushdown, because there is no way to quantify the quality of unstructured binary data.
- Cross-connection rules are not supported in Pushdown. However, support for this capability when connections belong to the same Snowflake account is planned for an upcoming version.
- We do not currently support an Okta integration. Support for this integration is planned for a future version.
- Snowflake Pushdown jobs with queries containing REGEXP_REPLACE fail because Snowflake does not currently support arguments with collation specifications. There are two known workarounds:
  - Make a clone of the table that is synced in Snowflake with no collation set, as in the sketch after this list.
  - Use Pullup mode instead of Pushdown to avoid this Snowflake limitation.
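For the first workaround, a minimal Snowflake sketch might look like the following, assuming a hypothetical table MY_SCHEMA.CUSTOMERS whose text columns have a collation set. Note that this sketch produces a copy via CREATE TABLE ... AS SELECT with the collation stripped, rather than a zero-copy CLONE, which would preserve the collation.

```sql
-- Minimal sketch, assuming a hypothetical table MY_SCHEMA.CUSTOMERS with
-- collated text columns. COLLATE '' removes the collation specification so
-- that REGEXP_REPLACE in the Pushdown-generated SQL no longer fails.
CREATE OR REPLACE TABLE MY_SCHEMA.CUSTOMERS_NO_COLLATION AS
SELECT
    fname COLLATE '' AS fname,  -- strip collation from the text column
    lname COLLATE '' AS lname,
    signup_date
FROM MY_SCHEMA.CUSTOMERS;
```

You would then point the Pushdown dataset at the copy instead of the original table.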