About data profiling

Data profiling creates a summary of a data source that is registered with Data Catalog and determines the data type of each column in the data source. The summary mainly contains statistics and graphics that give the user an idea of what the registered data contains.
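To make the idea of a column summary concrete, the following is a minimal sketch of how a profiler might infer a coarse data type and a few counts from raw string values. It is not the Data Catalog implementation: the function names `infer_type` and `profile_column`, the type categories, and the returned keys are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import date


def infer_type(value: str) -> str:
    """Guess a coarse type for one raw string value (hypothetical categories)."""
    text = value.strip()
    if text == "":
        return "empty"
    if text.lower() in ("true", "false"):
        return "boolean"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "decimal"
    except ValueError:
        pass
    try:
        date.fromisoformat(text)
        return "date"
    except ValueError:
        return "text"


def profile_column(values):
    """Summarize one column: row count, empty count, distinct count, dominant type."""
    types = Counter(infer_type(v) for v in values)
    non_empty = [v.strip() for v in values if v.strip() != ""]
    # Ignore empty cells when choosing the dominant type.
    typed = {t: n for t, n in types.items() if t != "empty"}
    dominant = max(typed, key=typed.get) if typed else "unknown"
    return {
        "rows": len(values),
        "empty": types.get("empty", 0),
        "distinct": len(set(non_empty)),
        "inferred_type": dominant,
    }
```

Calling `profile_column` on a list of cell values returns a small dictionary that plays the role of the profiling summary. A real profiler would add histograms and handle many more formats.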

You can create profiling results by:

  • Registering a data source via Jobserver or via Edge, and choosing to profile the data.
  • Importing profiling results via the Catalog API.

You can find the profiling results on the Table and Column asset pages.

Profiling process

You can profile data via Edge or via Jobserver. The following table shows the differences.

| Part of process | Profiling via Edge | Profiling via Jobserver |
|---|---|---|
| Data size | There is no data size limit. The Edge site calculates the profiling statistics while reading the data. | There is a limit on the size of the data that is used to calculate the profiling statistics. By default, this is 10 GB. |
| Connectivity | Collibra connects to an Edge site. The Edge site is installed in the customer's environment, close to the data source. The Edge site communicates with Collibra Data Intelligence Platform and other third-party systems using an HTTPS connection. | Jobserver requires an HTTP proxy to support reverse connectivity. |
| Register a data source | You can profile the data only after you have registered a data source and synchronized one or more schemas via Edge. You can start the profiling process via the Configuration tab page on the Database asset page. | When registering a data source via Jobserver, options are available to profile the data and create sample data. |
| Anonymizing data | Profiling happens on the Edge site. The profiling results are automatically anonymized for columns with the Text or Geo data type before they are sent to Data Catalog. It is not possible to disable the anonymization of these data types. An administrator can also decide to anonymize the profiling results for all columns. | Settings are available to enable the anonymization of the profiling results. |
| Classification | Depending on the classification method that you use, the classification process starts at the same time as the profiling of the data. Important: If you are using the Unified Data Classification method, the classification process does not automatically run at the same time as profiling. When you start the profiling and classification activity, only profiling results are collected. You need to activate the classification process separately. | The classification process does not start together with the profiling. |
| Deleting data profiling results | Once data profiling results are available, you can only delete them by deleting the assets. | To delete data profiling results for a schema, refresh the schema without storing the data profile. See Refresh the schema of a registered data source. |
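The Data size entry above says that Edge calculates the profiling statistics while reading the data, which is why no size limit applies. The sketch below illustrates that general technique with Welford's single-pass algorithm, which keeps only a few running values per column regardless of how many rows pass through. It is an illustration under stated assumptions, not the Edge implementation: the class `RunningStats` and the function `profile_stream` are hypothetical names.

```python
import math


class RunningStats:
    """Single-pass statistics for one numeric column (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, x: float) -> None:
        """Fold one value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 when fewer than two values were seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def profile_stream(values):
    """Consume any iterable of numbers once and return its statistics."""
    stats = RunningStats()
    for v in values:
        stats.add(float(v))
    return stats
```

Passing a generator or a database cursor to `profile_stream` means rows are never held in memory together. An approach that loads the data first needs a cap on how much it reads, which is the kind of limit the Jobserver entry describes.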