Only using part of the data to create profiling results

With source-driven random sampling, you can use only part of the data in a data source to generate profiling results. The data source itself selects a random set of rows to profile.

Important: Source-driven random sampling can be done using a dynamic SQL query, if the data source supports it. To verify whether source-driven random sampling is available for your data source, go to Collibra-provided JDBC drivers.
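
To illustrate what a dynamic SQL query for random sampling can look like, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in data source. This page does not show the query that is actually generated, and the syntax differs per database, so the query text, the sample_rows helper, and the customers table are all assumptions made for illustration.

```python
import sqlite3


def sample_rows(conn, table, sample_size):
    """Ask the database itself to return a random subset of rows.

    Hypothetical helper: not part of any Collibra API.
    """
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    # Dynamic SQL: the statement text is assembled at run time.
    # ORDER BY RANDOM() is SQLite syntax; other databases differ.
    query = f"SELECT * FROM {quoted} ORDER BY RANDOM() LIMIT ?"
    return conn.execute(query, (sample_size,)).fetchall()


# Stand-in data source with 300 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO customers (country) VALUES (?)",
    [("BE",), ("US",), ("PL",)] * 100,
)

sample = sample_rows(conn, "customers", 50)
print(len(sample))  # 50
```

The point of the sketch is that the random selection happens inside the database, so only the sampled rows need to be transferred.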

The data source randomly selects data to profile and transfers it to Jobserver or the Edge site in one fetching process.
If the Jobserver cache storage limit is reached, the fetching process can be stopped. Because the data source already selected the data randomly, the omitted data can be ignored without making the sample less representative.
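
The early stop described above can be sketched as follows. This is only an illustration under stated assumptions: the row_size estimate, the fetch_until_full helper, and the tiny cache limit are hypothetical and do not reflect how Jobserver measures or manages its cache.

```python
import random


def row_size(row):
    # Crude size estimate (assumption): characters needed to print each value.
    return sum(len(str(value)) for value in row)


def fetch_until_full(rows, cache_limit):
    """Keep rows until the next one would exceed the cache limit."""
    kept, used = [], 0
    for row in rows:
        size = row_size(row)
        if used + size > cache_limit:
            break  # stop fetching; the remaining rows are simply skipped
        kept.append(row)
        used += size
    return kept


# Rows arrive already in random order, as if shuffled by the data source.
incoming = [(i, "x" * 8) for i in range(10, 100)]  # every row has size 10
random.Random(0).shuffle(incoming)

kept = fetch_until_full(iter(incoming), cache_limit=255)
print(len(kept))  # 25
```

Because the incoming rows are already in random order, the rows that fit are themselves a random subset, which is why cutting the fetch short does not bias the sample.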

Source-driven random sampling is not used by default on Jobserver. To use source-driven random sampling, do the following:

Step   When                        Description
1      Manage the driver           Add the pushDownSampling connection property.
2      Register your data source   Follow the usual steps to register a data source, but include the following options:
  1. Enter a value for the pushDownSampling connection property.

    Note
    • The value must be between 100 and 1 000 000. Your data source creates the set of data to profile from that number of rows.
    • If the size of those rows exceeds the limit of the cache storage (Collibra recommends 10 to 20 GB), the number of rows is reduced.
    • If you enter a value that is greater than the number of rows in the data source, the entire data source is used to create the profiling results.
  2. Select Store Data Profile and, optionally, Store Sample Data to profile via Jobserver.

To apply source-driven random sampling, select the Random Rows option when you configure the profiling options. For details about the options, go to Profiling options.
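
The rules in the note about the pushDownSampling value can be summarized in a short sketch. The effective_sample_size function is hypothetical, and treating the range 100 to 1 000 000 as inclusive is an assumption. The sketch also leaves out the reduction that applies when the rows exceed the cache storage limit.

```python
MIN_SAMPLE = 100          # lower bound from the note
MAX_SAMPLE = 1_000_000    # upper bound from the note


def effective_sample_size(push_down_sampling, rows_in_source):
    """Return how many rows would be sampled for a given property value."""
    # Assumption: both bounds are inclusive.
    if not MIN_SAMPLE <= push_down_sampling <= MAX_SAMPLE:
        raise ValueError(
            f"pushDownSampling must be between {MIN_SAMPLE} and {MAX_SAMPLE}"
        )
    # A value larger than the source means the entire source is used.
    return min(push_down_sampling, rows_in_source)


print(effective_sample_size(5_000, 2_000))       # 2000
print(effective_sample_size(5_000, 10_000_000))  # 5000
```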