About Data Profiling

Data Profiling creates a summary of a data source in Data Catalog and determines the data type of the columns in the data source. The summary mainly contains statistics and graphics that give the user an idea of what the data is about.
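To make the idea of a column summary concrete, the following is a minimal Python sketch of what profiling a single column can involve: guessing a data type and computing a few basic statistics. It is an illustration only, not Collibra's implementation. The function names, the type labels ("integer", "decimal", "date", "text") and the chosen statistics are assumptions made for this example.

```python
from collections import Counter
from datetime import date


def infer_type(values):
    """Guess a data type label from a column of string values (hypothetical labels)."""
    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return "unknown"

    def all_parse(parser):
        # True only if every non-empty value is accepted by the parser.
        try:
            for v in non_null:
                parser(v)
            return True
        except (ValueError, TypeError):
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    if all_parse(date.fromisoformat):
        return "date"
    return "text"


def profile_column(values):
    """Summarize one column: inferred type plus simple counts."""
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "data_type": infer_type(values),
        "row_count": len(values),
        "empty_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


if __name__ == "__main__":
    print(profile_column(["3", "7", "7", None, ""]))
```

A real profiler typically reports more than this, for example value distributions that feed the graphics in the summary.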

Data Profiling is available for registered JDBC data sources and for Databricks Unity Catalog, Dataplex Catalog, and SageMaker Unified Studio (in preview) data sources integrated via Edge.

You can also import profiling data via the Catalog API.

The profiling results are shown on Table and Column asset pages.

Note: If you're using a Collibra Cloud site, go to the Collibra Cloud site documentation to check whether your data source is supported.

You can also profile via Jobserver. However, Jobserver and all related Jobserver integrations reached their End of Life in commercial environments in October 2024. In Collibra Platform for Government and Collibra Platform Self-Hosted environments, they will reach their End of Life on May 30, 2027. The following table shows the differences.

Data size

Profiling via Edge: There is no data size limit. The Edge site calculates the profiling statistics while reading the data.

Profiling via Jobserver: There is a limit on the size of the data that is used to calculate the profiling statistics. By default, this is 10 GB.

Connectivity

Profiling via Edge: Collibra connects to an Edge site. The Edge site is installed in the customer's environment, close to the data source. The Edge site communicates with Collibra Platform and other third-party systems over an HTTPS connection.

Profiling via Jobserver: Jobserver requires an HTTP proxy to support reverse connectivity.

Register or integrate a data source

Profiling via Edge: You can profile the data only after you have registered a data source and synchronized one or more schemas via Edge. In the latest UI, you can also set up profiling for integrated Databricks Unity Catalog, integrated Dataplex Catalog, or SageMaker Unified Studio (in preview) assets. You can start the profiling process from the Configuration tab on the Database asset page.

Profiling via Jobserver: When registering a data source via Jobserver, options are available to profile the data and create sample data.

For Databricks Unity Catalog or Dataplex Catalog, you can add a JDBC connection to the synchronization capability.

For SageMaker Unified Studio, you can add one or more JDBC connections to the synchronization configuration page.

After synchronization, options are available to profile the data and create sample data.

Anonymizing data

Profiling via Edge: Profiling happens on the Edge site. The profiling results are automatically anonymized for columns with the Text or Geo data type before they are sent to Data Catalog. It is not possible to disable the anonymization of these data types. An administrator can also decide to anonymize the profiling results for all columns.

Profiling via Jobserver: Settings are available to enable the anonymization of the profiling results.

Deleting data profiling results

Profiling via Edge: Once data profiling results are available, you can only delete them by deleting the assets.

Profiling via Jobserver: To delete data profiling results for a schema, refresh the schema without storing the data profile. Go to Refresh the schema of a registered data source.
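
The data size comparison above notes that the Edge site calculates the profiling statistics while reading the data. As a general illustration of how statistics can be computed in a single pass without holding the data in memory, here is a minimal Python sketch that uses Welford's running mean and variance update. This is an assumption about the general technique, not a description of how the Edge site is implemented. The class and function names are hypothetical.

```python
import math


class RunningStats:
    """Single-pass count, min, max, mean and sample variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        # Fold one value into the running aggregates; nothing is stored per row.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 when fewer than two values were seen.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def profile_stream(rows):
    """Consume any iterable of numbers once and return its running statistics."""
    stats = RunningStats()
    for value in rows:
        stats.update(value)
    return stats


if __name__ == "__main__":
    s = profile_stream(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
    print(s.count, s.minimum, s.maximum, s.mean, s.variance)
```

Because each value is folded into a fixed number of aggregates, memory use stays constant regardless of how many rows are read.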
