About profiling via Edge


Data Profiling creates a summary of a data source in Data Catalog and determines the data type of each column in the data source. The summary consists mainly of statistics and graphics that give the user an idea of what the data contains.
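To make the idea of a column summary concrete, the following minimal Python sketch infers a data type for one column and computes a few statistics. It is illustrative only: the function names `infer_type` and `profile_column`, the type labels, and the chosen statistics are assumptions made for this example, not the algorithm or the output format that Collibra uses.

```python
from collections import Counter
from statistics import mean


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "unknown"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    return "text"


def profile_column(raw_values):
    """Summarize one column; illustrative only, not Collibra's algorithm."""
    # Treat None and empty strings as missing values.
    present = [v for v in raw_values if v not in (None, "")]
    detected = infer_type(present)
    summary = {
        "detected_type": detected,
        "row_count": len(raw_values),
        "empty_count": len(raw_values) - len(present),
        "distinct_count": len(set(present)),
        "most_frequent": Counter(present).most_common(1),
    }
    if detected in ("integer", "decimal"):
        numbers = [float(v) for v in present]
        summary.update(min=min(numbers), max=max(numbers), mean=mean(numbers))
    return summary


ages = ["34", "41", "", "29", "41", None]
age_profile = profile_column(ages)
print(age_profile)
```

For the sample `ages` column, the sketch detects an integer column with two empty values, three distinct values, and a mean of 36.25.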

Data Profiling is available for registered data sources and for Databricks Unity Catalog assets integrated via Edge.

Important: Advanced data types are not taken into account when profiling via Edge.

After profiling has been set up, you can profile the data on the Configuration tab of the Database asset page of the data source. Edge profiles the data on the Edge or Collibra Cloud site itself and only sends the results to Collibra Platform. The profiling results are automatically anonymized based on your anonymization configuration before they are sent to Collibra Platform.

As a result, if you profile a data source via Edge:

Data Catalog has access to synchronized metadata and profiling results.
Data Catalog doesn't have access to the actual data from your data source.
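The separation between raw data and profiling results can be sketched in Python as follows. This is a hypothetical illustration, not Collibra's implementation: the function `build_results_payload`, the payload fields, and the `anonymize` flag (which stands in for an anonymization configuration by withholding value-revealing fields) are all assumptions made for this example.

```python
import json


def build_results_payload(rows, column, anonymize=True):
    """Profile `column` locally and return only aggregate results.

    Hypothetical sketch: `anonymize` stands in for an anonymization
    configuration and simply withholds value-revealing fields.
    """
    values = [row[column] for row in rows if row.get(column) is not None]
    results = {
        "column": column,
        "row_count": len(rows),
        "empty_count": len(rows) - len(values),
        "distinct_count": len(set(values)),
    }
    if not anonymize and values:
        # Minimum and maximum expose actual values from the source,
        # so this sketch only includes them when anonymization is off.
        results["min"] = min(values)
        results["max"] = max(values)
    return results


# Raw rows stay on the (simulated) Edge site.
source_rows = [
    {"customer": "Ada", "balance": 120},
    {"customer": "Grace", "balance": 75},
    {"customer": "Linus", "balance": None},
]

# Only this aggregate payload would be sent onward.
payload = build_results_payload(source_rows, "balance")
print(json.dumps(payload))
```

The payload contains counts only. None of the raw rows, such as the customer names, appear in it.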

Profiling steps in Edge

Steps for JDBC registrations

Step	Description
Before you start	Enable profiling via Edge.
	Create an Edge or Collibra Cloud site with a JDBC connection, a JDBC ingestion capability, and a JDBC profiling capability.
	Register a data source via an Edge or Collibra Cloud site.
	Synchronize one or more schemas.
	Configure the profiling options for the synchronized schemas.
	Profile the data. The Edge or Collibra Cloud site will initiate the profiling process and send the results to Collibra Platform. Tip: You can trigger the profiling job manually, set up a schedule, or trigger it after synchronizing a schema. Note: The Data Classification process does not automatically run at the same time as profiling. You need to activate the classification process separately.

Steps for Databricks integrations

Step	Description
Before you start	Enable profiling via Edge.
	Make sure you have set up the Databricks Unity Catalog integration.
	Add a JDBC profiling capability for the JDBC Databricks connection.
	Synchronize Databricks Unity Catalog via the integration.
	Configure the profiling options for the synchronized schemas.
	Profile the data. The Edge or Collibra Cloud site will initiate the profiling process and send the results to Collibra Platform. Tip: You can trigger the profiling job manually, set up a schedule, or trigger it after synchronizing a schema. Note: The Data Classification process does not automatically run at the same time as profiling. You need to activate the classification process separately.


Data used to create profiling results via Edge

To create the profiling results, Data Catalog uses a representative set of the data from the data source.

Note: This data is not the same as the sample data that can be available for an asset.

Edge profiles the data on the Edge or Collibra Cloud site itself and only sends the profiling results to Collibra Platform.

If you use all rows, Edge profiles every row in a data source table, without limit.
If you use a random set of rows, the data source randomly selects rows and sends them to Edge for profiling.
Important: Only some data sources support the use of random rows, also called source-driven random sampling. To verify whether it is available for your data source, go to Collibra-provided JDBC drivers.
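The difference between the two options can be sketched in Python, with an in-memory SQLite database standing in for the data source. This is an assumption-laden illustration: the `orders` table, the `fetch_amounts` helper, and the `ORDER BY RANDOM() LIMIT` query are chosen for this example only. The queries that Edge actually issues, and the sampling mechanism of any particular data source, are not described here.

```python
import sqlite3

# In-memory SQLite database standing in for the data source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)",
    [(float(n),) for n in range(1, 1001)],
)


def fetch_amounts(connection, sample_size=None):
    """Return all amounts, or a random subset chosen by the database."""
    if sample_size is None:
        # All rows: every row in the table is read, without limit.
        query, params = "SELECT amount FROM orders", ()
    else:
        # The database picks the rows, so only the sample leaves the source.
        query = "SELECT amount FROM orders ORDER BY RANDOM() LIMIT ?"
        params = (sample_size,)
    return [row[0] for row in connection.execute(query, params)]


all_rows = fetch_amounts(conn)
sampled_rows = fetch_amounts(conn, sample_size=100)

print(len(all_rows), sum(all_rows) / len(all_rows))
print(len(sampled_rows), sum(sampled_rows) / len(sampled_rows))
```

With all rows, the mean is exactly 500.5. With the random sample, the mean varies from run to run, which shows the trade-off: sampling moves less data but yields approximate statistics.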

For more information, go to Configure the profiling options via Edge.