Warning Jobserver and all related Jobserver integrations are end of life as of October 2024, with the exception of Public Sector customers using GovCloud or on-prem environments.
For information on profiling via Edge, go to Profiling via Edge.

About profiling via Jobserver

Profiling process via Jobserver

When you register a data source via Jobserver, Data Catalog triggers the ingestion process.
By default, the complete data set is transferred to Jobserver. Then Jobserver creates a representative subset of the data to profile, based on your data source. Jobserver then profiles that data and sends the profiling results to Data Catalog. You can enable the Anonymize data option to hash or remove profiling results that can be considered sensitive.

Data used to create profiling results via Jobserver

To create the profiling results, Data Catalog uses a representative set of the data from the data source.

Note This data is not the same as the sample data that can be available for an asset.

If you register a data source via Jobserver, the set of data used for profiling is created at the moment you register the data source.

If you use Jobserver without source-driven random sampling:
First, the complete data set is transferred to Jobserver. Then Jobserver creates the set of the data to be profiled. This is sometimes called sampling.
The size of this set is determined by the Table profiling data size setting in Collibra Console or the Services Configuration section of the Collibra settings. By default, the size is 10 GB.
If you use Jobserver with source-driven random sampling (previously also called partial scan or push down sampling):
The data source itself creates the set of data to profile, from randomly selected rows, and sends it to Jobserver. If the Jobserver cache storage limit is reached, the process stops. Because the data source already selected the rows randomly, the omitted data can be ignored without making the sample less representative.
Important Source-driven random sampling can be done using a dynamic SQL query, if the data source supports it. To check whether source-driven random sampling is available for your data source, go to Collibra-provided JDBC drivers.
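
Example The difference between the two modes can be illustrated with a small Python sketch. This is not Jobserver code. It assumes an in-memory SQLite table as a stand-in for the data source, a crude per-row size estimate, and hypothetical function names (sample_on_server, sample_at_source). How Jobserver actually selects rows is not described here, so the random shuffle in the first function is an assumption.

```python
import random
import sqlite3


def row_size(row):
    # Crude size estimate (assumption): bytes of the row's text representation.
    return len(repr(row).encode("utf-8"))


def sample_on_server(all_rows, max_bytes, seed=0):
    """Without source-driven sampling: every row has already been
    transferred; rows are then picked until the size budget is used."""
    shuffled = list(all_rows)
    random.Random(seed).shuffle(shuffled)  # assumed selection strategy
    picked, used = [], 0
    for row in shuffled:
        if used + row_size(row) > max_bytes:
            break
        picked.append(row)
        used += row_size(row)
    return picked


def sample_at_source(conn, table, cache_bytes):
    """With source-driven sampling: the source returns rows in random
    order and the receiver stops once its cache budget is reached."""
    # Illustrative query only; real data sources use their own sampling SQL.
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY RANDOM()')
    received, used = [], 0
    for row in cursor:
        if used + row_size(row) > cache_bytes:
            break  # rows not yet read are simply never transferred
        received.append(row)
        used += row_size(row)
    return received


# Hypothetical demo data: 100 rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"name-{i}") for i in range(100)],
)
all_rows = conn.execute("SELECT * FROM customers").fetchall()

server_sample = sample_on_server(all_rows, max_bytes=200)
source_sample = sample_at_source(conn, "customers", cache_bytes=200)
```

In the first function the full table is read before any row is selected. In the second, only the rows that fit in the budget leave the data source, and because they arrive in random order, stopping early does not bias the sample.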