About data profiling

Data profiling creates a summary of a data source that is registered with Data Catalog and determines the data type of each column in the data source. The summary consists mainly of statistics and graphics that give you an idea of what the registered data contains.
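
To make the idea of a profiling summary concrete, the following is a minimal sketch of how a data type and a few statistics could be derived for a single column. It is an illustration only and does not reflect how Data Catalog computes its results. The function names (`infer_type`, `profile_column`), the type labels, and the summary fields are all assumptions made for this example.

```python
from collections import Counter
from datetime import date


def infer_type(values):
    """Return a coarse data type label for a column of raw string values."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "empty"
    # Try the strictest parser first; the first one that accepts every value wins.
    parsers = (("integer", int), ("decimal", float), ("date", date.fromisoformat))
    for name, parser in parsers:
        try:
            for v in non_empty:
                parser(v)
            return name
        except ValueError:
            continue
    return "text"


def profile_column(values):
    """Build a small, hypothetical profiling summary for one column."""
    non_empty = [v for v in values if v not in ("", None)]
    counts = Counter(non_empty)
    summary = {
        "data_type": infer_type(values),
        "row_count": len(values),
        "empty_count": len(values) - len(non_empty),
        "distinct_count": len(counts),
        "most_frequent": counts.most_common(3),
    }
    # Numeric statistics only make sense for numeric columns.
    if summary["data_type"] in ("integer", "decimal"):
        numbers = [float(v) for v in non_empty]
        summary["min"] = min(numbers)
        summary["max"] = max(numbers)
        summary["mean"] = sum(numbers) / len(numbers)
    return summary


if __name__ == "__main__":
    print(profile_column(["34", "27", "", "41", "27"]))
```

Calling `profile_column` on a column of raw string values returns a dictionary with the inferred type, the empty and distinct counts, the most frequent values and, for numeric columns, the minimum, maximum and mean.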

You can create profiling results by:

  • Registering a data source via Jobserver or via Edge, and choosing to profile the data.
  • Importing profiling results via the Catalog API.

You can find the profiling results on Table and Column asset pages.

Profiling process

You can profile data via Edge or via Jobserver. The following table shows the differences.

| Part of process | Profiling via Edge | Profiling via Jobserver |
| --- | --- | --- |
| Data size | There is no data size limit. The Edge site calculates the profiling statistics while reading the data. | There is a limit on the size of the data that is used to calculate the profiling statistics. By default, this is 10 GB. |
| Connectivity | Collibra connects to an Edge site. The Edge site is installed in the customer's environment, close to the data source. The Edge site communicates with Collibra Data Intelligence Cloud and other third-party systems using an HTTPS connection. | Jobserver requires an HTTP proxy to support reverse connectivity. |
| Register a data source | You can only profile the data after you have registered a data source and synchronized one or more schemas via Edge. You can start the profiling process via the Configuration tab page on the Database asset page. | When registering a data source via Jobserver, options are available to profile the data and create sample data. |
| Anonymizing data | Profiling happens on the Edge site. The profiling results are automatically anonymized for columns of data type Text and Geo before they are sent to Data Catalog. It is not possible to disable the anonymization of these data types. | Settings are available to enable the anonymization of the profiling results. |
| Classification | The classification process starts at the same time as the profiling of the data. | The classification process does not start together with the profiling. |
| Deleting data profiling results | Once data profiling results are available, you can only delete them by deleting the assets. | To delete data profiling results for a schema, refresh the schema without storing the data profile. See Refresh the schema of a registered data source. |
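
The Data size comparison above notes that the Edge site calculates the profiling statistics while reading the data. One general way to compute statistics without holding the data in memory is a single-pass (streaming) algorithm. The sketch below uses Welford's method for the running mean and standard deviation. It is an assumption-based illustration of the streaming idea, not a description of the Edge implementation, and the names `RunningStats` and `profile_stream` are hypothetical.

```python
import math


class RunningStats:
    """Single-pass count, minimum, maximum, mean and standard deviation."""

    def __init__(self):
        self.count = 0
        self.minimum = math.inf
        self.maximum = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared distances from the mean

    def add(self, x):
        """Fold one value into the statistics (Welford's update)."""
        self.count += 1
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        # Population standard deviation; 0.0 when no values have been seen.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


def profile_stream(values):
    """Consume any iterable of numbers once and return its statistics."""
    stats = RunningStats()
    for value in values:
        stats.add(value)
    return stats


if __name__ == "__main__":
    # A generator stands in for rows read one at a time from a data source.
    result = profile_stream(x for x in [2, 4, 4, 4, 5, 5, 7, 9])
    print(result.count, result.minimum, result.maximum, result.mean, result.stddev)
```

Because each value is processed once and then discarded, memory use stays constant regardless of how many rows are read.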