Data Catalog methods to select data from a data source

In Collibra, the following Data Catalog features require data from a data source:

  • Data sampling: Collibra needs data to show examples on an asset page.
  • Data classification: Collibra needs data to verify it against available data classes and suggest the best classification for the asset.
  • Data profiling: Collibra needs data to calculate descriptive statistics, which are shown on an asset page.

Data Catalog can use the following methods to select data from a data source:

  • Source-driven random sampling: The data source itself returns a set of random rows. Because only the sampled rows are transferred, this improves performance for features that need data.
    Note: This selection method is also referred to as Push down sampling (Jobserver), partial scan (Edge), or Random Rows (Edge).

  • Platform-driven random sampling: All rows are selected from the data source, and the Edge capability uses only a random subset of them.
  • All rows: All rows are selected from the data source, and the Edge capability uses all of them.
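
The difference between the three selection methods can be illustrated with a minimal sketch. This is not Collibra's implementation: the in-memory SQLite database, the `customers` table, and all function names are hypothetical stand-ins, and SQLite's `ORDER BY RANDOM() LIMIT n` merely represents whatever sampling mechanism a real data source provides.

```python
import random
import sqlite3


def build_demo_source(row_count=1000):
    """Create an in-memory SQLite table that stands in for a data source (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(i, f"customer_{i}") for i in range(row_count)],
    )
    conn.commit()
    return conn


def source_driven_random_sample(conn, sample_size):
    """The source picks the random rows; only sample_size rows leave the database."""
    cursor = conn.execute(
        "SELECT id, name FROM customers ORDER BY RANDOM() LIMIT ?",
        (sample_size,),
    )
    return cursor.fetchall()


def platform_driven_random_sample(conn, sample_size):
    """All rows are selected from the source; the caller keeps a random subset."""
    fetched = conn.execute("SELECT id, name FROM customers").fetchall()
    return random.sample(fetched, min(sample_size, len(fetched)))


def select_all_rows(conn):
    """All rows are selected from the source and all of them are used."""
    return conn.execute("SELECT id, name FROM customers").fetchall()
```

In the first function the database returns only the sampled rows, whereas the second transfers every row before sampling, which is why source-driven sampling tends to be cheaper when the source supports it.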

The following overview shows which selection methods each Data Catalog feature uses.

Data sampling
  • Source-driven random sampling: Automatically used if available for the data source.
  • Platform-driven random sampling: Used if source-driven random sampling isn't available for the data source.
  • All rows: Not used.

Data classification
  • Source-driven random sampling: Automatically used if available for the data source.
  • Platform-driven random sampling: Used if source-driven random sampling isn't available for the data source.
  • All rows: Not used.

Data profiling
  • Source-driven random sampling: Used if available for the data source and if the Random rows option is selected when configuring the profiling options.
  • Platform-driven random sampling: Not used.
  • All rows: Used if source-driven random sampling isn't available for the data source.
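
The rules above can be summarized as a small decision function. This is only a sketch for reasoning about the rules, not a Collibra API: the `Feature` and `Method` enums, the function name, and its parameters are hypothetical. The rules do not state what happens for data profiling when source-driven random sampling is available but the Random rows option is not selected; the sketch assumes that all rows are used in that case.

```python
from enum import Enum


class Feature(Enum):
    SAMPLING = "Data sampling"
    CLASSIFICATION = "Data classification"
    PROFILING = "Data profiling"


class Method(Enum):
    SOURCE_DRIVEN = "Source-driven random sampling"
    PLATFORM_DRIVEN = "Platform-driven random sampling"
    ALL_ROWS = "All rows"


def choose_selection_method(feature, source_sampling_available, random_rows_selected=False):
    """Return the selection method a feature would use (hypothetical helper)."""
    if feature in (Feature.SAMPLING, Feature.CLASSIFICATION):
        # Sampling and classification never use all rows.
        if source_sampling_available:
            return Method.SOURCE_DRIVEN
        return Method.PLATFORM_DRIVEN

    # Data profiling never uses platform-driven random sampling.
    if source_sampling_available and random_rows_selected:
        return Method.SOURCE_DRIVEN
    # Assumption: all rows are also used when source-driven sampling is
    # available but the Random rows option is not selected.
    return Method.ALL_ROWS
```

For example, `choose_selection_method(Feature.PROFILING, source_sampling_available=False)` yields `Method.ALL_ROWS`, while the same call for `Feature.CLASSIFICATION` yields `Method.PLATFORM_DRIVEN`.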