About profiling and classification via Edge

Important In Collibra 2024.02, we've launched a new user interface (UI) in beta for Collibra Data Intelligence Platform! You can learn more about this latest UI in the UI overview.


Profiling and classification via Edge is a feature that Collibra offers to Collibra Data Intelligence Platform users. Depending on the classification method that you use, the feature combines data profiling and data classification in one process.

  • Data profiling creates a summary of a data source that is registered with Data Catalog and determines the data type of the columns in the data source. The summary mainly contains statistics and graphics that give the user an idea of what the registered data is about.

    Important Advanced data types are not taken into account when profiling via Edge.

  • Automatic Data Classification tries to define the data class of a column. You can accept or reject the suggested data class of each column or add your own new classes.
    Automatic Data Classification can suggest multiple data classes for a column. If the suggestion is accurate, you can accept multiple data classes for the column.

    Important If you are using the Unified Data Classification method, the classification process does not automatically run at the same time as profiling. When you start the profiling and classification activity, only profiling results will be collected. You need to activate the classification process separately.
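To make the idea of a profiling summary concrete, the following is a minimal sketch of column profiling in Python. It is not Collibra's implementation: the function names (`infer_type`, `profile_column`), the type labels, and the chosen statistics are all assumptions made for illustration.

```python
from collections import Counter
from typing import Any


def infer_type(values: list[Any]) -> str:
    """Guess a simple data type from the non-null values of a column."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    if all(isinstance(v, bool) for v in non_null):
        return "boolean"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "numeric"
    return "text"


def profile_column(values: list[Any]) -> dict[str, Any]:
    """Build a small statistical summary of one column (hypothetical format)."""
    non_null = [v for v in values if v is not None]
    data_type = infer_type(values)
    summary: dict[str, Any] = {
        "data_type": data_type,
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "most_common": Counter(non_null).most_common(1)[0][0] if non_null else None,
    }
    if data_type == "numeric":
        summary["min"] = min(non_null)
        summary["max"] = max(non_null)
        summary["mean"] = sum(non_null) / len(non_null)
    return summary


ages = [34, 51, None, 34, 28]
print(profile_column(ages))
```

A summary like this describes the column (its type, how many values are empty, the range of values) without being a copy of the column itself.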

Profiling and classification process via Edge

After you have registered a data source via Edge and created a profiling capability, you can profile and classify the data from the Database asset page of the registered data source.
Edge profiles and classifies the data on the Edge site itself and only sends the profiling results and classification suggestions to Collibra Data Intelligence Platform.
The profiling results are automatically anonymized based on your anonymization configuration before they are sent to Collibra Data Intelligence Platform.

As a result, if you register, profile, and classify a data source via Edge:

  • Data Catalog has access to synchronized metadata, profiling results, and classification suggestions.
  • Data Catalog doesn't have access to the actual data from your data source.
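The separation described above can be sketched as follows. This is a hypothetical illustration only: `build_payload`, `anonymize`, the payload fields, and the use of a truncated SHA-256 hash are assumptions, not the actual Edge behavior or anonymization configuration.

```python
import hashlib
from typing import Any


def anonymize(value: Any) -> str:
    """Replace a raw value with a short one-way hash (illustrative only)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def build_payload(column: str, rows: list[Any], anonymize_values: bool) -> dict[str, Any]:
    """Compute results locally and return only what would leave the site."""
    non_null = [v for v in rows if v is not None]
    # Most frequent values first; ties are broken alphabetically.
    top_values = sorted(set(non_null), key=lambda v: (-non_null.count(v), str(v)))[:3]
    if anonymize_values:
        # Value-level results are the part that could expose source data.
        top_values = [anonymize(v) for v in top_values]
    # The raw rows are never added to the payload, only aggregated results.
    return {
        "column": column,
        "row_count": len(rows),
        "null_count": len(rows) - len(non_null),
        "top_values": top_values,
    }


emails = ["a@example.com", "b@example.com", "a@example.com", None]
print(build_payload("email", emails, anonymize_values=True))
```

In this sketch the payload carries counts and anonymized values, while the list of rows stays inside the function that runs next to the data source.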

Profiling and classification steps in Edge

Before you start:

  • Create an Edge site with a JDBC connection, a JDBC ingestion capability, and a JDBC profiling capability.
  • Register a data source via Edge.

Steps:

  1. Synchronize one or more schemas.
  2. Configure the profiling and classification options for the synchronized schemas.
  3. Profile and classify. The Edge site initiates the profiling and classification process and sends the results to Collibra Data Intelligence Platform.

Tip You can trigger the profiling and classification job manually, run it on a schedule, or trigger it after synchronizing a schema.


Data used to create profiling results via Edge

To create the profiling results, Data Catalog uses a representative set of the data from the data source.

Note This data is not the same as the sample data that can be available for an asset.

Edge profiles and classifies the data on the Edge site itself and only sends the profiling results to Collibra Data Intelligence Platform.

  • If you use all rows, Edge profiles every row in a data source table, without limit.
  • If you use a random set of rows, the data source randomly selects data and sends it to Edge for profiling.

    Warning Only some data sources support the use of random rows. To verify if your data source allows it, go to Collibra-provided JDBC drivers.

For more information, go to Configure the profiling and classification options via Edge.
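The difference between the two row options can be sketched as follows. This is an assumption-laden illustration: the function `select_rows` and its parameters are hypothetical, and an in-memory list stands in for the data source, which in practice performs the random selection itself before sending rows to Edge.

```python
import random
from typing import Optional, Sequence


def select_rows(
    rows: Sequence[dict],
    sample_size: Optional[int] = None,
    seed: Optional[int] = None,
) -> list[dict]:
    """Return every row, or a random subset when sample_size is given."""
    if sample_size is None or sample_size >= len(rows):
        # "All rows": nothing is left out, regardless of table size.
        return list(rows)
    # "Random set of rows": draw sample_size rows without replacement.
    return random.Random(seed).sample(list(rows), sample_size)


table = [{"id": i} for i in range(100)]
print(len(select_rows(table)))                  # all rows
print(len(select_rows(table, sample_size=10)))  # random subset
```

Using all rows gives exact statistics at the cost of reading the whole table. A random set is cheaper, but the resulting statistics are estimates.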

Limitations

Profiling via Edge has the following limitations:

Automatic Data Classification via Edge has the following limitations:

  • Automatic Data Classification via Edge is only available for customers using Collibra Data Intelligence Platform.
  • Data classification on Edge does not retrain the classification model to improve future classification predictions.
  • Out of the box, Automatic Data Classification can predict several data classes. You can also create user-defined data classes, but the automatic classification process does not take them into account. You need to assign user-defined data classes manually.
  • English is the only supported language, but Automatic Data Classification via Edge can run on data in other Latin alphabet-based languages as well.
  • Automatic Data Classification via Edge needs profiling data to predict the data classes. Data classification is performed automatically after the profiling process on an Edge site. That means that you can only classify columns of data sources registered in Data Catalog via an Edge site that has the JDBC profiling capability.

Tip Instead of Automatic Data Classification via Edge, you can also use the Unified Data Classification method.