Pushdown processing

Pushdown is an alternative computing method for running a DQ job, where all of the job's processing is submitted to a SQL data warehouse. DQ jobs with Pushdown processing generate SQL queries that offload the compute directly to the data source, reducing the amount of data transfer and Spark compute.

With Pushdown, you can also scale your computing needs based on the specific requirements of your DQ job. This is because the architecture of Pushdown data sources feature auto-scaling, which allows you to automatically scale up, or burst, to 64 or 128 nodes when you require more significant processing needs. Pushdown also automatically scales down when your DQ job does not require robust processing. With auto-scaling, the processing of your data is enhanced, improving runtime performance and removing the egress costs of reading large amounts of data.

Throughout the app, you can identify Pushdown jobs by the pushdown icon icon.

Benefits of Pushdown

By running a Pushdown job, you can:

  • Reduce latency.
  • Eliminate dependencies on Spark compute to run Collibra DQ, and increase processing speeds.
  • Eliminate egress latency when running DQ jobs against large datasets.
  • Auto-scale to data warehouses, such as Snowflake and Databricks.

Prerequisites for using Pushdown

Before running Pushdown jobs, a Collibra DQ user with Admin permissions must do one of the following depending on the connection:

  • Enable Pushdown from the Snowflake connection template in Admin Console Connections.
  • Tip For first-time configurations, we recommend that you also successfully run the Pushdown setup script. If you do not run the setup script, ensure that you meet all the criteria that the script attempts to accomplish.

  • Enable Pushdown from the Databricks connection template in Admin Console Connections.
  • Enable Pushdown from the BigQuery connection template in Admin Console Connections.
  • Enable Pushdown from the Athena connection template in Admin Console Connections.
  • Enable Pushdown from the Redshift connection template in Admin Console Connections.
  • Enable Pushdown from the Trino connection template in Admin Console Connections.

Available data sources and layers

Pushdown is currently available for the following data sources and layers:

    Basic Layers Advanced Layers
Data Source Lifecycle Rules Behavior (Adaptive Rules) Schema Archive Break Records Outliers Dupes Shapes Patterns Records Mapping (Source to Target)
Snowflake Generally available

Yes

Yes

Yes

Yes

Yes

Yes

Yes

No

Yes

No

Databricks Generally available

Yes

Yes

Yes

Yes

Yes

Yes

Yes

No

No

No

BigQuery Generally available

Yes

Yes

Yes

Yes

No

No

Yes

No

No

No

Athena Beta

Yes

Yes

Yes

Yes

No

No

No

No

No

No

Redshift Beta

Yes

Yes

Yes

Yes

No

No

No

No

No

No

Trino Beta

Yes

Yes

Yes

Yes

No

No

No

No

No

No

SQL Server Coming soon (Beta 2024.05)

Yes

Yes

Yes

Yes

No

No

No

No

No

No

SAP HANA Coming soon (Beta 2024.05)

Yes

Yes

Yes

Yes

No

No

No

No

No

No

Running rules on Pushdown jobs

All rules that execute on Pushdown jobs use Database SQL dialects native to the specific data warehouse to which they submit your data for processing. Because Pushdown submits queries directly to the data source, the processing time of SQL rules reduces considerably.

The following table provides examples of how rules run on Pushdown jobs.

Note The Translated Query Example column in the following table demonstrates the translation of the rule shown in the Collibra DQ Query Example column when you submit it to the data warehouse for processing.

Feature Query Type Definition From Collibra DQ Query Example Translated Query Example
Native SQL

Physical Query (PQ)

Runs the entire query in native database dialect. nyse.dataset select * from nyse.dataset where fname like 'John' select * from nyse.dataset where fname like 'John'
Freeform SQL

Logical Query (LQ)

Runs the query including the original scope. @nyse-dq-dataset select * from @nyse-dq-dataset where fname = 'John' select * from (select * from nyse-dq-dataset where d_date='2023-01-01') where fname = 'John'
Freeform SQL

Template Logical Query (TLQ)

Runs the query including the original scope without specifying the dataset name. @dataset select * from @dataset where fname = 'John' select * from (select * where d_date='2023-01-01') where fname = 'John'

Known limitations

  • Currently, you cannot run a job from the command line.
  • Mixed data type rules are not supported in Pushdown.
  • Binary data types are not supported in Pushdown, because there is no way to quantify the quality of unstructured binary data.
  • Cross-connection rules are not supported in Pushdown. However, support for this capability when connections belong to the same Snowflake account is planned for an upcoming version.
  • We do not currently support an Okta integration. Support for this integration is planned for a future version.
  • There is a limitation where Snowflake Pushdown jobs with queries containing REGEXP_REPLACE fail because Snowflake does not currently support arguments with collation specifications. There are two known workarounds:
    • Make a clone of the table that is synced in Snowflake and has no collation set.
    • Use Pullup mode instead of Pushdown to avoid this Snowflake limitation.