About data profiling
Data profiling creates a summary of a data source that is registered with Data Catalog and determines the data type of columns in the data source. The summary mainly contains statistics and graphics that give the user an idea of what the registered data is about.
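To make the idea of a column summary concrete, the following is a minimal Python sketch of what profiling a single column can involve: guessing a data type and computing a few statistics. It is a generic illustration only. The function names `infer_type` and `profile_column`, the type labels, and the chosen statistics are assumptions made for this example and do not describe how Data Catalog, Edge, or Jobserver actually profile data.

```python
from collections import Counter
from statistics import mean


def infer_type(values):
    """Guess a column type from raw string values (hypothetical labels)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "text"

    def all_parse(cast):
        # True if every non-empty value can be converted with `cast`.
        for v in non_empty:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    return "text"


def profile_column(values):
    """Build a small summary (type plus basic statistics) for one column."""
    non_empty = [v for v in values if v != ""]
    profile = {
        "type": infer_type(values),
        "count": len(values),
        "empty": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "most_common": Counter(non_empty).most_common(1),
    }
    # Numeric statistics only make sense for numeric columns.
    if profile["type"] in ("integer", "decimal"):
        numbers = [float(v) for v in non_empty]
        profile.update(min=min(numbers), max=max(numbers), mean=mean(numbers))
    return profile
```

For example, `profile_column(["3", "5", "", "5"])` reports an integer column with four rows, one empty value, and two distinct values.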
You can create profiling results by:
- Registering a data source via Jobserver or via Edge, and choosing to profile the data.
- Importing profiling results via the Catalog API.
The profiling results are available in Table and Column asset pages.
You can profile data via Edge or via Jobserver. The following table shows the differences.
Part of process | Profiling via Edge | Profiling via Jobserver |
---|---|---|
Data size | There is no data size limit. The Edge site calculates the profiling statistics while reading the data. | There is a limit on the size of the data that is used to calculate the profiling statistics. By default, this is 10 GB. |
Connectivity | Collibra connects to an Edge site. The Edge site is installed in the customer's environment, close to the data source. The Edge site communicates with Collibra Data Intelligence Platform and other third-party systems using an HTTPS connection. | Jobserver requires an HTTP proxy to support reverse connectivity. |
Register a data source | You can profile the data only after you have registered a data source and synchronized one or more schemas via Edge. You can start the profiling process via the Configuration tab page on the Database asset page. | When registering a data source via Jobserver, options are available to profile the data and create sample data. |
Anonymizing data | Profiling happens on the Edge site. The profiling results are automatically anonymized for columns with the Text or Geo data type before they are sent to Data Catalog. It is not possible to disable the anonymization of these data types. | Settings are available to enable the anonymization of the profiling results. |
Deleting data profiling results | Once data profiling results are available, you can only delete them by deleting the assets. | To delete data profiling results for a schema, refresh the schema without storing the data profile. Go to Refresh the schema of a registered data source. |
Classification | The Unified Data Classification process does not start together with the profiling. | N/A |
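The table notes that an Edge site calculates profiling statistics while reading the data, which is why no data size limit applies. One general way to compute statistics in a single pass, without holding the whole data set in memory, is a running update such as Welford's algorithm. The sketch below is a generic Python illustration of that technique; the class name `RunningStats` is hypothetical, and nothing here is taken from, or claims to describe, the actual Edge implementation.

```python
class RunningStats:
    """Single-pass count, min, max, mean, and variance (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = None
        self.maximum = None

    def add(self, x):
        """Fold one value into the statistics; the value is not stored."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = x if self.minimum is None else min(self.minimum, x)
        self.maximum = x if self.maximum is None else max(self.maximum, x)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0
```

Because each value is processed once and then discarded, memory use stays constant regardless of how many rows are read. An approach that loads data before computing statistics is instead bounded by how much data it can hold.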