Only using part of the data to create profiling results

With push down sampling (Jobserver) or partial scan (Random Rows) (Edge), the task of creating the set of data to profile is delegated to the data source itself, so that only part of the data is used to create profiling results.

  • The data source randomly selects data to profile and transfers it to the Jobserver or the Edge site in one fetching process.
    If the Jobserver cache storage limit is reached, the fetching process can be stopped. Because the data source already selected the data randomly, the omitted data can be ignored without making the data less representative.
  • Push down sampling or partial scan can be done using a dynamic SQL query, if the data source supports it. For an overview, see Overview of Collibra-certified JDBC drivers.
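
The idea behind the two points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the query or fetching logic that Collibra actually generates: it uses an in-memory SQLite database, the SQLite-specific `ORDER BY RANDOM() LIMIT n` construct (other databases use different syntax), a crude character-count estimate of row size, and the hypothetical names `fetch_random_sample`, `customers` and `max_bytes`.

```python
import sqlite3


def fetch_random_sample(conn, table, sample_rows, max_bytes, batch_size=100):
    """Let the database pick random rows, then fetch until a size budget is hit."""
    # The random selection runs inside the data source (SQLite syntax, assumed
    # here for illustration); only the selected rows are transferred.
    query = f'SELECT * FROM "{table}" ORDER BY RANDOM() LIMIT ?'
    cursor = conn.execute(query, (sample_rows,))

    rows, used = [], 0
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            size = sum(len(str(value)) for value in row)  # crude size estimate
            if used + size > max_bytes:
                # Budget reached: the rows not fetched were also chosen at
                # random, so dropping them does not bias the sample.
                return rows
            rows.append(row)
            used += size
    return rows


# Hypothetical demo data: 1000 rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"name{i}") for i in range(1000)],
)

full_sample = fetch_random_sample(conn, "customers", sample_rows=200, max_bytes=10_000)
cut_sample = fetch_random_sample(conn, "customers", sample_rows=200, max_bytes=50)
```

With the generous budget, all 200 sampled rows are fetched. With the small budget, fetching stops early and a shorter, still random, set of rows is returned.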

Push down sampling or partial scan drastically speeds up the collection of data to profile.

Push down sampling is not used by default on Jobserver. To use push down sampling, do the following:

Step | When | Description
1 | Manage the driver | Add the pushDownSampling connection property.
2 | Register your data source | Follow the usual steps to register a data source, but include the following options:
  1. Enter a value for the pushDownSampling connection property.

    Note 
    • The value must be between 100 and 1 000 000. Your data source creates the set of data to profile from that number of rows.
    • If the total size of those rows exceeds the limit of the cache storage (Collibra recommends 10 to 20 GB), the number of rows is reduced.
    • If you enter a value that is greater than the number of rows in the data source, the entire data source is used to create the profiling results.
  2. Select Store Data Profile and, optionally, Store Sample Data to profile via Jobserver.
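
The rules in the note can be summarized as a small Python sketch. It is an illustration only: the function name `effective_sample_rows`, the average-row-size estimate, and the way the row count is reduced to fit the cache are assumptions, since the documentation does not state how the reduction is calculated.

```python
MIN_SAMPLE = 100
MAX_SAMPLE = 1_000_000


def effective_sample_rows(requested, rows_in_source, avg_row_bytes, cache_limit_bytes):
    """Estimate how many rows end up in the set of data to profile."""
    if not MIN_SAMPLE <= requested <= MAX_SAMPLE:
        raise ValueError(
            f"pushDownSampling must be between {MIN_SAMPLE} and {MAX_SAMPLE}"
        )
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")

    # A value larger than the data source means the entire source is used.
    rows = min(requested, rows_in_source)

    # Assumed reduction rule: keep only as many rows as fit in the cache.
    rows_fitting_in_cache = cache_limit_bytes // avg_row_bytes
    return min(rows, rows_fitting_in_cache)


TEN_GB = 10 * 1024**3

# Requested more rows than the source contains: the whole source is used.
small_source = effective_sample_rows(500_000, 2_000, 100, TEN_GB)

# Very wide rows: the requested number of rows is reduced to fit the cache.
wide_rows = effective_sample_rows(1_000_000, 10_000_000, 50_000, TEN_GB)
```

In the first call the result equals the size of the data source. In the second, the result is lower than the requested value because the estimated size would exceed the 10 GB cache limit.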

Random Rows is an option that enables partial scan when you profile and classify a data source. For details about the options, go to Profiling and classification options.