Using push down sampling or partial scan

Push down sampling means that the task of creating the data sample is delegated to the data source itself. In Edge, push down sampling is called partial scan.

  • The data source creates the sample from randomly selected data and transfers it to the Jobserver or the Edge site in one fetching process.

    If the cache storage limit is reached nonetheless, the fetching process can be stopped. Because the data source has already created the sample randomly, the omitted data can be ignored without making the sample less representative.
  • Push down sampling can be done using a dynamic SQL query, if the data source supports data sampling. For an overview, see Overview of Collibra-provided JDBC drivers.
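
The behavior described above can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module as a stand-in data source. The table name customers, the helper fetch_pushed_down_sample and the max_cached_rows limit are hypothetical names chosen for this example. The ORDER BY RANDOM() LIMIT query is only one way a database can sample rows; it is not the query that Collibra generates, and the actual sampling syntax depends on the data source.

```python
import sqlite3


def fetch_pushed_down_sample(conn, table, sample_size, max_cached_rows):
    """Let the database pick a random sample, then fetch it in one pass.

    Fetching stops early once max_cached_rows rows are held (a stand-in
    for the cache storage limit). The rows arrive in random order, so the
    rows left unfetched can be dropped without biasing the sample.
    """
    # Hypothetical dynamic query: the table name is interpolated, so it must
    # come from trusted metadata; the sample size is bound as a parameter.
    query = f'SELECT * FROM "{table}" ORDER BY RANDOM() LIMIT ?'
    cursor = conn.execute(query, (sample_size,))
    sample = []
    for row in cursor:
        if len(sample) >= max_cached_rows:
            break  # cache limit reached: stop fetching
        sample.append(row)
    return sample


# Stand-in data source with 1,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (name) VALUES (?)",
    [(f"customer-{i}",) for i in range(1000)],
)

full_sample = fetch_pushed_down_sample(conn, "customers", 200, max_cached_rows=500)
cut_sample = fetch_pushed_down_sample(conn, "customers", 200, max_cached_rows=50)
print(len(full_sample), len(cut_sample))  # 200 50
```

In the second call the fetch is cut off at 50 rows. Because the database already shuffled the rows, those 50 rows are still a random subset of the table.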

Push down sampling drastically improves sampling performance, because the sample is created at the data source and transferred in a single fetching process.

Enable push down sampling

Push down sampling is not used by default. To use push down sampling, do the following:

Step | When | Description
1 | Manage the driver | Add the pushDownSampling connection property.
2 | Register your data source | Follow the usual steps to register a data source, but include the following options:
  1. Enter a value for the pushDownSampling connection property.

    Note 
    • The value must be between 100 and 1 000 000. Your data source creates a sample with that number of rows.
    • If the size of the sampled rows exceeds the cache storage limit (Collibra recommends 10 to 20 GB), the number of rows is reduced.
    • If you enter a value that is greater than the number of rows in the data source, the entire data source is used as the sample.
  2. Select Store Data Profile and, optionally, Store Sample Data to profile via Jobserver.
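
The rules in the note above can be sketched as a small calculation. This is an illustration only: the function effective_sample_rows and its parameters are hypothetical, and the way the row count is reduced to fit the cache (here, by dividing the cache limit by an assumed average row size) is an assumption, since the actual reduction logic is not described.

```python
MIN_SAMPLE_ROWS = 100
MAX_SAMPLE_ROWS = 1_000_000


def effective_sample_rows(requested, table_rows, avg_row_bytes, cache_limit_bytes):
    """Return the number of rows the sample would contain under the note's rules."""
    if not MIN_SAMPLE_ROWS <= requested <= MAX_SAMPLE_ROWS:
        raise ValueError(
            f"pushDownSampling must be between {MIN_SAMPLE_ROWS} and {MAX_SAMPLE_ROWS}"
        )
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    # A value larger than the table means the whole table is the sample.
    rows = min(requested, table_rows)
    # Assumption: the cache limit is applied by estimating size from an
    # average row width; the real reduction logic may differ.
    rows_that_fit = cache_limit_bytes // avg_row_bytes
    return min(rows, rows_that_fit)


cache_limit = 10 * 1024**3  # 10 GB, the lower end of the recommended range

print(effective_sample_rows(10_000, 5_000_000, 200, cache_limit))
print(effective_sample_rows(10_000, 2_500, 200, cache_limit))
print(effective_sample_rows(1_000_000, 5_000_000, 50_000, cache_limit))
```

The three calls cover a request that fits as is, a request larger than the table, and a request whose estimated size exceeds the cache limit.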

In Edge, push down sampling is performed via the Partial scan option. To use partial scan, do the following:

Step | When | Description
1 | Register and synchronize a data source via Edge. | Follow the usual steps to register and synchronize a data source via Edge.
2 | Profile and classify synchronized metadata. | Click the Profiling and classification tab in a Database asset's Configuration tab page, and do one of the following:
  • To use partial scan for all schemas:
    1. In the Default profiling and classification rule section, click Edit.
    2. Select Partial scan.
    3. Enter the maximum number of rows that you want to use for profiling.
    4. Click Save.
    5. Profile and classify.
  • To use partial scan for a specific schema only:
    1. In the Schema profiling and classification rules section, select the schema.
    2. Do one of the following:
      • To create a new table rule, click Add table rule.
      • To edit an existing table rule, click Edit.
    3. Select Partial scan.
    4. Enter the maximum number of rows that you want to use for profiling.
    5. Click Save.
    6. Profile and classify.
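
The two options above, a default rule for all schemas and a rule for a specific schema, can be modeled with a small sketch. Everything here is hypothetical: the ProfilingRule class, the rule_for_schema helper, and the schema names are invented for illustration, and the assumption that a schema-specific rule takes precedence over the default rule is not confirmed by the steps above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProfilingRule:
    partial_scan: bool
    max_rows: Optional[int] = None  # only meaningful when partial_scan is True


def rule_for_schema(schema, default_rule, schema_rules):
    """Pick the schema-specific rule if one exists, else the default rule.

    Assumption: a schema rule overrides the default rule.
    """
    return schema_rules.get(schema, default_rule)


default_rule = ProfilingRule(partial_scan=True, max_rows=50_000)
schema_rules = {"finance": ProfilingRule(partial_scan=True, max_rows=5_000)}

print(rule_for_schema("finance", default_rule, schema_rules).max_rows)  # 5000
print(rule_for_schema("hr", default_rule, schema_rules).max_rows)  # 50000
```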