Pushdown processing

Pushdown is an alternative computing method for running a DQ job, where all of the job's processing is submitted to a SQL data warehouse. DQ jobs with Pushdown processing generate SQL queries that offload the compute directly to the data source, reducing the amount of data transfer and Spark compute.

With Pushdown, you can also scale your computing needs based on the specific requirements of your DQ job. This is because the architecture of Pushdown data sources feature auto-scaling, which allows you to automatically scale up, or burst, to 64 or 128 nodes when you require more significant processing needs. Pushdown also automatically scales down when your DQ job does not require robust processing. With auto-scaling, the processing of your data is enhanced, improving runtime performance and removing the egress costs of reading large amounts of data.

Throughout the app, you can identify Pushdown jobs by the pushdown icon icon.

Benefits of Pushdown

By running a Pushdown job, you can:

  • Reduce latency.
  • Eliminate dependencies on Spark compute to run Collibra DQ, and increase processing speeds.
  • Eliminate egress latency when running DQ jobs against large datasets.
  • Auto-scale to data warehouses, such as Snowflake and Databricks.

Prerequisites for using Pushdown

Before running Pushdown jobs, a Collibra DQ user with Admin permissions must enable Pushdown for your data source. Pushdown can be enabled from the data source connection template in Admin Console Connections.

Tip For first-time Snowflake Pushdown configurations, we recommend that you also successfully run the Pushdown setup script. If you do not run the setup script, ensure that you meet all the criteria that the script attempts to accomplish.

Available data sources and layers

Pushdown is currently available for the following data sources and layers:

    Basic Layers Advanced Layers
Data Source Lifecycle Rules Behavior (Adaptive Rules) Schema Archive Break Records Outliers Dupes Shapes Patterns Records Mapping (Source to Target)
Snowflake Generally available

Yes

Yes

Yes

Yes

Yes

Yes

Yes

No

Yes

Yes

Databricks Generally available

Yes

Yes

Yes

Yes

Yes

Yes

Yes

No

No

No

BigQuery Generally available

Yes

Yes

Yes

Yes

No

Yes

Yes

No

No

No

Athena Generally available

Yes

Yes

Yes

Yes

No

No

No

No

No

No

Redshift Generally available

Yes

Yes

Yes

Yes

No

No

No

No

No

No

Trino Generally available

Yes

Yes

Yes

Yes

Note Currently, only Dupes, Outliers, and Rules are eligible for break record archival to Trino.

Yes

Yes

Yes

No

No

No

SQL Server Generally available

Yes

Yes

Yes

Yes

No

No

No

No

No

No

SAP HANA Generally available

Yes

Yes

Yes

Yes

Note Currently, only Dupes, Outliers, and Rules are eligible for break record archival to SAP HANA.

Yes

Yes

Note Currently, SAP HANA Pushdown only scans for exact match dupes.

Yes

No

No

No

Running rules on Pushdown jobs

All rules that execute on Pushdown jobs use Database SQL dialects native to the specific data warehouse to which they submit your data for processing. Because Pushdown submits queries directly to the data source, the processing time of SQL rules reduces considerably.

The following table provides examples of how rules run on Pushdown jobs.

Note The Translated Query Example column in the following table demonstrates the translation of the rule shown in the Collibra DQ Query Example column when you submit it to the data warehouse for processing.

Feature Query Type Definition From Collibra DQ Query Example Translated Query Example
Native SQL

Physical Query (PQ)

Runs the entire query in native database dialect. nyse.dataset select * from nyse.dataset where fname like 'John' select * from nyse.dataset where fname like 'John'
Freeform SQL

Logical Query (LQ)

Runs the query including the original scope. @nyse-dq-dataset select * from @nyse-dq-dataset where fname = 'John' select * from (select * from nyse-dq-dataset where d_date='2023-01-01') where fname = 'John'
Freeform SQL

Template Logical Query (TLQ)

Runs the query including the original scope without specifying the dataset name. @dataset select * from @dataset where fname = 'John' select * from (select * where d_date='2023-01-01') where fname = 'John'

Known limitations

  • We strongly encourage avoiding the use of special characters in table and column names for Pushdown datasets.
  • Currently, you cannot run a job from the command line.
  • Mixed data type rules are not supported in Pushdown.
  • Binary data types are not supported in Pushdown, because there is no way to quantify the quality of unstructured binary data.
  • Cross-connection rules are not supported in Pushdown.
  • We do not currently support an Okta integration. Support for this integration is planned for a future version.
  • There is a limitation where Snowflake Pushdown jobs with queries containing REGEXP_REPLACE fail because Snowflake does not currently support arguments with collation specifications. There are two known workarounds:
    • Make a clone of the table that is synced in Snowflake and has no collation set.
    • Use Pullup mode instead of Pushdown to avoid this Snowflake limitation.