About data profiling
You can create profiling data by:
- Registering a data source and choosing to create profiling data.
- Importing profiling data via the profiling API.
You can find the profiling information on the asset page of a table or a column, by clicking Data Profiling in the tab pane.
Profiling process
Collibra offers two profiling processes: via Edge and via Jobserver.
Profiling via Jobserver
When you register a data source, Data Catalog triggers the ingestion process via Jobserver. By default, the complete data set is transferred to the Jobserver, which then creates a sample based on your data source. Jobserver profiles the sample and sends the result to Data Catalog.
You can enable the Anonymize data option to hash or remove profiling information that can be considered sensitive.
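To illustrate what hashing sensitive profiling information can look like, here is a minimal Python sketch. It is not Collibra's implementation: the helper names, the salt, and the choice of which profile fields count as sensitive (`min`, `max`, `mode`) are all assumptions made for this example.

```python
import hashlib


def anonymize_value(value, salt="example-salt"):
    """Return a hex SHA-256 digest of a value (hypothetical helper)."""
    if value is None:
        return None
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def anonymize_profile(profile, sensitive_keys=("min", "max", "mode")):
    """Hash the entries assumed to be sensitive; leave the others untouched."""
    return {
        key: anonymize_value(val) if key in sensitive_keys else val
        for key, val in profile.items()
    }


# Example profile with made-up values.
profile = {
    "row_count": 3,
    "min": "alice@example.com",
    "max": "carol@example.com",
}
anonymized = anonymize_profile(profile)
```

In this sketch, aggregate values such as `row_count` stay readable, while values that could expose actual data are replaced by a fixed-length digest.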
Profiling via Edge
When you have registered a data source via Edge and created a profiling capability, you can profile and classify the metadata via the Database asset page of the registered database.
You install an Edge site close to your data source and create the necessary JDBC connections and capabilities to ingest the data source and to profile and classify its metadata. After synchronizing the schemas of the registered database, you can profile and classify the metadata. The profiling results are automatically anonymized.
Differences between profiling via Jobserver and via Edge
The following table shows the differences between profiling via Jobserver and via Edge.
| Part of process | Profiling via Jobserver | Profiling via Edge |
|---|---|---|
| Data size | There is a limit on the size of the data that is used to calculate profiling statistics. By default, this is 10 GB. | There is no data size limit. The Edge site calculates the profiling statistics while reading the data. |
| Connectivity | Jobserver requires an HTTP proxy to support reverse connectivity. | Collibra connects to an Edge site. The Edge site is installed in the customer's environment, close to the data source. The Edge site communicates with Collibra Data Intelligence Cloud and other third-party systems using an HTTPS connection. |
| Register a data source | When registering a data source via Jobserver, you have profiling options to create profile and sample data. | You can only profile the metadata after you have registered a data source and synchronized one or more schemas. You can start the profiling process via the Configuration tab page on a Database asset page. |
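Calculating statistics while reading the data means that no full copy of the data has to be held at once. The following Python sketch shows one common way to do this, using Welford's online algorithm for the mean and variance. It is an illustration only: the `RunningStats` class is hypothetical and does not describe how an Edge site is actually implemented.

```python
class RunningStats:
    """Single-pass statistics over a stream of numeric values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = None
        self.maximum = None

    def update(self, x):
        """Fold one value into the statistics without storing it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = x if self.minimum is None else min(self.minimum, x)
        self.maximum = x if self.maximum is None else max(self.maximum, x)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Because each value is processed once and then discarded, memory use stays constant regardless of how many rows are read.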
Profiling sample
To create a data profile, Data Catalog uses a representative sample of the data. This profiling sample is created when you register your data source.
Standard profiling sample creation process via Jobserver
If you use the Jobserver to register a data source without push down sampling, the complete data set is transferred to the Jobserver, which then creates a sample based on your data source. Jobserver uses the entire data set to ensure that the sample is representative.
The sample size is determined by the Table profiling data size setting in Collibra Console. By default, the size is 10 GB.
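The source does not state which sampling method Jobserver uses. As an illustration of how a representative sample can be drawn from a complete data set in a single pass, here is a minimal Python sketch of reservoir sampling. The function name and parameters are hypothetical, and the sample size is expressed as a number of rows rather than in gigabytes to keep the example short.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of rows."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < k:
                sample[j] = row
    return sample


# Draw 100 rows from a stream of 10,000 rows.
sample = reservoir_sample(range(10_000), k=100, seed=42)
```

Every row has the same chance of ending up in the sample, which is why the whole data set has to be read before the sample is final.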
Tip If you use Edge to register a data source, sample data is not automatically created. You can only create sample data by using push down sampling.
Push down sample creation process
Push down sampling means that the task of creating the data sample is delegated to the data source itself. This can be done using a dynamic SQL query, if the data source supports data sampling.
The data source creates the sample from randomly selected data and transfers it to the Jobserver.
Warning Push down sampling is only available for specific data sources.
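The idea of delegating the sampling to the data source can be sketched in Python with an in-memory SQLite database. This is an assumption-laden illustration: the `build_sample_query` helper, the `customers` table, and the `ORDER BY RANDOM() LIMIT` query are chosen for the example. The query that Collibra generates is not shown in this documentation, and the sampling syntax differs per data source.

```python
import sqlite3


def build_sample_query(table, sample_rows):
    """Build a query that lets the database pick random rows (hypothetical)."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # SQLite syntax; other databases offer their own sampling constructs.
    return f'SELECT * FROM "{table}" ORDER BY RANDOM() LIMIT {int(sample_rows)}'


# Set up a small example table with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"name-{i}") for i in range(1000)],
)

# The database selects the rows; only the sample is returned to the caller.
query = build_sample_query("customers", 50)
sample = conn.execute(query).fetchall()
```

Only the 50 sampled rows leave the database, instead of the full 1,000-row table.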