Warning Jobserver and all related Jobserver integrations are end of life as of October 2024, with the exception of Public Sector customers using GovCloud or on-prem environments.
For information on profiling via Edge, go to Profiling via Edge.
About profiling via Jobserver
Profiling process via Jobserver
When you register a data source via Jobserver, Data Catalog triggers the ingestion process.
By default, the complete data set is transferred to Jobserver. Then Jobserver creates a representative subset of the data to profile, based on your data source. Jobserver then profiles that data and sends the profiling results to Data Catalog. You can enable the Anonymize data option to hash or remove profiling results that can be considered sensitive.
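The steps above (create a subset, profile it, optionally anonymize the results) can be sketched in a few lines of Python. This is an illustration only, not Jobserver's implementation: the function names, the statistics computed, and the choice to treat the minimum and maximum values as the sensitive results are all assumptions made for the example.

```python
import hashlib
import random


def sample_rows(rows, max_rows, seed=0):
    """Pick a random subset of rows; stands in for the representative subset."""
    if len(rows) <= max_rows:
        return list(rows)
    return random.Random(seed).sample(rows, max_rows)


def profile_column(values):
    """Compute a few simple statistics for one column (hypothetical metric set)."""
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def anonymize(profile, sensitive_keys=("min", "max")):
    """Hash results that expose actual data values (assumed here: min and max)."""
    result = dict(profile)
    for key in sensitive_keys:
        if result.get(key) is not None:
            digest = hashlib.sha256(str(result[key]).encode("utf-8"))
            result[key] = digest.hexdigest()
    return result


def run_profiling(rows, column, max_rows, anonymize_data=False):
    """Sample the rows, profile one column, and optionally anonymize the results."""
    subset = sample_rows(rows, max_rows)
    profile = profile_column([row.get(column) for row in subset])
    return anonymize(profile) if anonymize_data else profile
```

With `anonymize_data=True`, counts stay readable while the values taken from the data are replaced by SHA-256 digests.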
Data used to create profiling results via Jobserver
To create the profiling results, Data Catalog uses a representative set of the data from the data source.
Note This data is not the same as the sample data that can be available for an asset.
If you register a data source via Jobserver, the set of data used for profiling is created when you register the data source.
- If you use Jobserver without push down sampling:
First, the complete data set is transferred to Jobserver. Then Jobserver creates the set of data to be profiled. This is sometimes called sampling.
The size of this set is determined by the Table profiling data size setting in Collibra Console or the Services Configuration section of the Collibra settings. By default, the size is 10 GB.
- If you use Jobserver with push down sampling (also called partial scan):
The data source itself creates the set of data to profile and sends it to Jobserver. The data source creates the set of data from randomly selected rows. If the Jobserver cache storage limit is reached, the process stops. Because the data source already selected the rows randomly, the omitted data can be ignored without making the sample less representative.
Warning Push down sampling can be done using a dynamic SQL query, if the data source supports it. To verify if your data source allows push down sampling, see Collibra-provided JDBC drivers.
Tip Push down sampling drastically increases the performance of collecting data to profile.
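The difference between the two modes can be illustrated with a small Python sketch against an in-memory SQLite database. It is not the query Jobserver generates: the function names are hypothetical, `ORDER BY RANDOM()` is SQLite syntax chosen for the example, and limits are expressed in rows rather than gigabytes to keep the example short.

```python
import random
import sqlite3


def _check_table_name(table):
    """Allow only simple identifiers, since the name is placed into the SQL text."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")


def sample_without_push_down(conn, table, max_rows, seed=0):
    """Transfer every row first, then reduce the set on the receiving side."""
    _check_table_name(table)
    all_rows = conn.execute(f'SELECT * FROM "{table}"').fetchall()
    subset = random.Random(seed).sample(all_rows, min(max_rows, len(all_rows)))
    return subset, len(all_rows)  # second value: number of rows transferred


def sample_with_push_down(conn, table, max_rows, cache_limit_rows):
    """Let the database select random rows; stop reading when the cache is full."""
    _check_table_name(table)
    # The query text is built at run time (dynamic SQL); the row limit is bound.
    query = f'SELECT * FROM "{table}" ORDER BY RANDOM() LIMIT ?'
    received = []
    for row in conn.execute(query, (max_rows,)):
        if len(received) >= cache_limit_rows:
            break  # rows were already chosen at random, so dropping the rest is safe
        received.append(row)
    return received


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, f"name-{i}") for i in range(1000)],
    )
    subset, transferred = sample_without_push_down(conn, "customers", 50)
    print(len(subset), "rows kept after transferring", transferred)
    pushed = sample_with_push_down(conn, "customers", 50, cache_limit_rows=20)
    print(len(pushed), "rows received with push down sampling")
```

In the first function every row crosses the connection before any are discarded. In the second, only the rows selected by the database are sent, which is where the performance gain comes from.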