Push down sampling

Push down sampling means that the task of creating the data sample is delegated to the data source itself. This is done with a dynamic SQL query, provided that the data source supports data sampling.

The data source creates the sample from randomly selected data and transfers it to the Jobserver or the Edge site in a single fetching process. If the cache storage limit is reached nonetheless, the fetching process can be stopped. Because the data source already selected the sample rows randomly, the omitted data can be ignored without making the sample less representative.
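The idea can be sketched in a few lines of Python. This is a minimal illustration, not the query or code that the product actually runs: an in-memory SQLite database stands in for the data source, and the function name `push_down_sample`, the `cache_row_limit` parameter, and the `customers` table are hypothetical. The database selects the random rows, the client fetches them in one pass, and fetching stops early if the (assumed) cache limit is reached.

```python
import sqlite3


def push_down_sample(conn, table, sample_size, cache_row_limit=None):
    """Let the database pick a random sample, then fetch it in one pass."""
    # Table names cannot be bound as parameters, so accept plain identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    # The query is built at run time (dynamic SQL); the database does the sampling.
    query = f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?"
    rows = []
    for row in conn.execute(query, (sample_size,)):
        if cache_row_limit is not None and len(rows) >= cache_row_limit:
            # Cache is full: stop fetching. The rows arrive in random order,
            # so the rows fetched so far are still a random sample.
            break
        rows.append(row)
    return rows


# Demo: an in-memory SQLite database stands in for the data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (name) VALUES (?)",
    [(f"customer-{i}",) for i in range(1000)],
)

sample = push_down_sample(conn, "customers", 100)
truncated = push_down_sample(conn, "customers", 100, cache_row_limit=40)
print(len(sample), len(truncated))  # prints "100 40"
```

Only the sampled rows cross the connection. The full table is never transferred to the client.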

Because only the sample, and not the full data set, is transferred from the data source, push down sampling drastically improves sampling performance.

Enabling push down sampling

You can enable push down sampling via Jobserver or via Edge.

Via Jobserver

Push down sampling is not used by default. To use push down sampling, do the following:

Step	Action	Description
1	Manage the driver	Add the pushDownSampling connection property.
2	Register your data source	Follow the usual steps to register a data source, but include the options listed below.

Options for step 2:

Enter a value for the pushDownSampling connection property.
Note
- The value must be between 100 and 1 000 000. Your data source creates a sample of that number of rows.
- If the sample exceeds the cache storage limit (Collibra recommends 10 to 20 GB), the number of rows is reduced.
- If you enter a value that is larger than the number of rows in the data source, the entire data source is used as the sample.
Select the following Profiling options:
- To profile via Jobserver: Store Data Profile and, optionally, Store Sample Data.
- To profile and classify via Edge: Profile and classify data.
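The rules for the pushDownSampling value can be summarized in a small sketch. The function name `effective_sample_rows` and its parameters are hypothetical. The source does not say how the row count is reduced when the cache limit is exceeded, so this sketch assumes the sample shrinks to the number of rows that fit, based on an assumed average row size.

```python
MIN_SAMPLE_ROWS = 100
MAX_SAMPLE_ROWS = 1_000_000


def effective_sample_rows(requested, table_rows, avg_row_bytes, cache_limit_bytes):
    """Return the number of rows that would end up in the sample."""
    # The configured value must lie within the documented range.
    if not MIN_SAMPLE_ROWS <= requested <= MAX_SAMPLE_ROWS:
        raise ValueError(
            f"value must be between {MIN_SAMPLE_ROWS} and {MAX_SAMPLE_ROWS}"
        )
    # A value larger than the table means the entire table is the sample.
    rows = min(requested, table_rows)
    # Assumption: the sample is cut down to the rows that fit in the cache.
    rows_that_fit = cache_limit_bytes // avg_row_bytes
    return min(rows, rows_that_fit)


TEN_GB = 10 * 10**9  # lower end of the recommended cache size

print(effective_sample_rows(10_000, 5_000_000, 200, TEN_GB))         # 10000
print(effective_sample_rows(500_000, 2_000, 200, TEN_GB))            # 2000
print(effective_sample_rows(1_000_000, 50_000_000, 50_000, TEN_GB))  # 200000
```

In the third call, rows of 50 000 bytes each would exceed a 10 GB cache at one million rows, so the sketch reduces the sample to the 200 000 rows that fit.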

Via Edge

Push down sampling is an option you can select when profiling and classifying a registered data source. To enable push down sampling, do the following:

Step	Action	Description
1	Register a data source via Edge.	Follow the usual steps to register a data source via Edge.
2	Profile and classify synchronized metadata.	On the Profiling and Classification tab of a Database asset's Configuration tab page, complete the steps listed below.

Steps for profiling and classification, in order:

In the Profiling options section, click Edit.
Select Partial scan.
Enter the sample size that you want to use for profiling.
Click Save.
Click Run profiling and classification.

Supported data sources

Not all data sources support push down sampling. Currently, you can use push down sampling for the following data sources:

Amazon Redshift
Databricks
Exasol
Google BigQuery
Oracle
PostgreSQL
Snowflake
SQL Server
Teradata
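
Sampling syntax differs between databases, which is one reason the sampling query has to be built dynamically for each data source. The sketch below illustrates this for four of the listed sources. The templates are written from each vendor's general SQL syntax and are assumptions for illustration only. They are not the queries that the product generates, and the names `SAMPLING_TEMPLATES` and `build_sampling_query` are hypothetical.

```python
import re

# Illustrative templates only. Verify each one against the vendor
# documentation before relying on it.
SAMPLING_TEMPLATES = {
    "PostgreSQL": "SELECT * FROM {table} ORDER BY random() LIMIT {n}",
    "Snowflake": "SELECT * FROM {table} SAMPLE ({n} ROWS)",
    "SQL Server": "SELECT TOP ({n}) * FROM {table} ORDER BY NEWID()",
    "Teradata": "SELECT * FROM {table} SAMPLE {n}",
}

# Plain identifiers, optionally qualified with a schema (for example sales.orders).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def build_sampling_query(source, table, n):
    """Build a row-sampling query for the given data source type."""
    if source not in SAMPLING_TEMPLATES:
        raise ValueError(f"no sampling template for {source!r}")
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unsafe table name: {table!r}")
    if not isinstance(n, int) or n <= 0:
        raise ValueError("sample size must be a positive integer")
    return SAMPLING_TEMPLATES[source].format(table=table, n=n)


print(build_sampling_query("Snowflake", "sales.orders", 500))
# SELECT * FROM sales.orders SAMPLE (500 ROWS)
```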

Push down sampling is currently a beta feature for the following data source:

Apache Hive