About profiling and classification via Edge
Profiling and classification via Edge is a feature that Collibra offers to Collibra Data Intelligence Cloud users. It combines data profiling and data classification in one process:
- Data profiling creates a summary of a data source that is registered with Data Catalog and determines the data type of the columns in the data source. The summary mainly contains statistics and graphics that give the user an idea of what the registered data is about.
Important: Advanced data types are not taken into account when profiling via Edge.
- Automatic Data Classification tries to define the data class of a column. You can accept or reject the suggested data class of each column or add your own new classes.
Automatic Data Classification can suggest multiple data classes for a column. If the suggestion is accurate, you can accept multiple data classes for the column.
Profiling and classification process via Edge
After you have registered a data source via Edge and created a profiling capability, you can profile and classify the data from the Database asset page of the registered data source.
Edge profiles and classifies the data on the Edge site itself and only sends the profiling results and classification suggestions to Collibra Data Intelligence Cloud. The profiling results are automatically anonymized for columns of data type Text and Geo before they are sent to Collibra Data Intelligence Cloud.
As a result, if you register a data source via Edge:
- Data Catalog has access to synchronized metadata, profiling results (that are automatically anonymized for columns of type Text and Geo), and classification suggestions.
- Data Catalog does not have access to the actual data from your data source.
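The split between data that stays on the Edge site and results that are sent to the cloud can be illustrated with a small conceptual sketch. This is not Collibra's implementation: the function name `profile_column`, the summary fields, and the way anonymization is modeled (omitting value-level statistics for Text and Geo columns) are all assumptions made for illustration. The document does not describe the actual anonymization method.

```python
# Conceptual sketch only; names and fields are hypothetical, not a Collibra API.

# Data types whose value-level details are withheld, mirroring "Text and Geo" above.
ANONYMIZED_TYPES = {"Text", "Geo"}


def profile_column(name, data_type, values):
    """Build a summary for one column; raw rows never appear in the result."""
    non_null = [v for v in values if v is not None]
    summary = {
        "column": name,
        "data_type": data_type,
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
    }
    if data_type not in ANONYMIZED_TYPES and non_null:
        # Assumption: value-level statistics are kept only for other data types.
        summary["min"] = min(non_null)
        summary["max"] = max(non_null)
    return summary


# Only these summaries, not the underlying values, would leave the Edge site.
text_summary = profile_column("city", "Text", ["Ghent", "Ghent", None, "Oslo"])
num_summary = profile_column("age", "Numeric", [31, 45, None, 27])
```

In this sketch the Text column summary carries counts only, while the numeric column also carries a minimum and maximum.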
Profiling and classification steps in Edge
| Step | Description |
|---|---|
| 1 | Create an Edge site with a JDBC connection, a JDBC ingestion capability, and a JDBC profiling capability. Note: Ensure you have defined the profiling and classification settings. |
| 2 | Register a data source via Edge. |
| 3 | Synchronize one or more schemas. |
| 4 | Configure the profiling and classification options for the synchronized schemas. |
| 5 | Profile and classify. The Edge site initiates the profiling and classification process and sends the results to Collibra Data Intelligence Cloud. Tip: You can trigger the profiling and classification job manually, set up a schedule, or trigger it after synchronizing a schema. |
Data used to create profiling results via Edge
To create the profiling results, Data Catalog uses a representative set of the data from the data source.
Note: This data is not the same as the sample data that can be available for an asset.
Edge profiles and classifies the data on the Edge site itself and only sends the profiling results to Collibra Data Intelligence Cloud.
- If you use full scan via Edge, all the rows in a data source table are used by Edge for profiling, without limit.
- If you use partial scan, the data source randomly selects data and sends it to Edge for profiling.
Warning: Partial scan is only available for some data sources. To verify whether your data source allows partial scan, see Collibra-provided JDBC drivers.
For more information, see Configure the profiling and classification options via Edge.
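The difference between the two scan modes can be sketched as follows. This is an illustration under stated assumptions, not how Edge or any JDBC driver implements scanning: the function `rows_for_profiling`, the `sample_size` parameter, and its default are hypothetical, and in a real partial scan the data source itself selects the rows before sending them to Edge.

```python
import random


def rows_for_profiling(rows, scan="full", sample_size=1000, seed=None):
    """Return the rows a profiler would see for the given scan mode.

    Hypothetical helper: `sample_size` and `seed` are illustrative parameters.
    """
    rows = list(rows)
    if scan == "full":
        # Full scan: every row is used, without limit.
        return rows
    if scan == "partial":
        # Partial scan: a random subset of rows is used.
        k = min(sample_size, len(rows))
        return random.Random(seed).sample(rows, k)
    raise ValueError(f"unknown scan mode: {scan!r}")


all_rows = rows_for_profiling(range(50), scan="full")
some_rows = rows_for_profiling(range(50), scan="partial", sample_size=10, seed=1)
```

A full scan gives exact statistics at the cost of reading every row; a partial scan trades some accuracy for a lighter load on the data source.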
Limitations
Profiling via Edge has the following limitations:
- Advanced data types are not supported.
- Not all data sources are certified for Edge.
Automatic Data Classification via Edge has the following limitations:
- Automatic Data Classification via Edge is only available for customers using Collibra Data Intelligence Cloud.
- Data classification on Edge does not retrain the classification model to improve future classification predictions.
- Out-of-the-box, automatic data classification can predict several data classes. You can also create user-defined data classes. Currently, these user-defined data classes are not taken into account by the automatic classification process. You need to assign user-defined data classes manually.
- English is the only supported language, but Automatic Data Classification via Edge can run on data in other Latin alphabet-based languages as well.
- Automatic Data Classification via Edge needs profiling data to predict the data classes. Data classification is performed automatically after the profiling process on an Edge site. That means that you can only classify columns of data sources registered in Data Catalog via an Edge site that has the JDBC profiling capability.
Profiling and classification settings
- To classify data via Edge, ensure you run the command to enable classification on an Edge site.
- The following settings in the Services Configuration section of the Collibra settings or in Collibra Console are relevant when you want to profile and classify via Edge.
| Setting | Section | Description |
|---|---|---|
| Database registration via Edge | Register data source | An option to enable database registration via Edge. True: Register a data source via Edge. False: Register a data source via Jobserver only. Note: Enabling data source registration via Edge does not prevent you from registering a data source via Jobserver as well. |
| Anonymize data | Data profiling | This setting is not relevant for Edge. Edge only sends the profiling results and classification suggestions to Collibra Data Intelligence Cloud. The profiling results are automatically anonymized for columns of data type Text and Geo before they are sent to Data Catalog. |
| Database profiling via Edge | Data profiling | An option to enable profiling and classifying synchronized metadata via Edge instead of Jobserver. True: Profile and classify via Edge. False: Profile via Jobserver and classify via the Data Classification Platform. Note: You can only enable Database profiling via Edge if you have also enabled Database registration via Edge. |
| Parallel database profiling via Edge | Data profiling | The maximum number of databases that Edge can profile and classify at the same time. Note: Schemas in a database are always processed sequentially. By default, the value of the setting is one, which means that Edge processes one profiling job at a time. The maximum value is four. If you change this setting, you must restart Collibra. |
| Enable Data Classification | Cloud Data Classification configuration | Ensure the Enable data classification option in Cloud Data Classification configuration is set to false. If the option is set to true, the Classify button is available on Column and Table asset pages. This button allows you to classify data via the Data Classification Platform. However, when you use profiling and classification via Edge, you no longer need the Data Classification Platform. |
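The constraints between these settings can be summarized in a short sketch. It is a conceptual model only, not Collibra configuration code: the class `EdgeProfilingSettings`, its field names, and the functions `profile_database` and `profile_all` are hypothetical. The sketch encodes three rules from the settings above: profiling via Edge requires registration via Edge, the parallel value ranges from one (the default) to four, and schemas within a database are processed sequentially while databases can run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class EdgeProfilingSettings:
    # Hypothetical field names standing in for the settings described above.
    registration_via_edge: bool = False
    profiling_via_edge: bool = False
    parallel_profiling: int = 1  # default is one; the maximum is four

    def validate(self):
        if self.profiling_via_edge and not self.registration_via_edge:
            raise ValueError("profiling via Edge requires registration via Edge")
        if not 1 <= self.parallel_profiling <= 4:
            raise ValueError("parallel profiling must be between 1 and 4")


def profile_database(database, schemas, profile_schema):
    # Schemas in one database are always processed one after another.
    return database, [profile_schema(database, schema) for schema in schemas]


def profile_all(databases, settings, profile_schema):
    """Profile several databases, at most `parallel_profiling` at a time."""
    settings.validate()
    with ThreadPoolExecutor(max_workers=settings.parallel_profiling) as pool:
        futures = [
            pool.submit(profile_database, db, schemas, profile_schema)
            for db, schemas in databases.items()
        ]
        return dict(future.result() for future in futures)
```

For example, with `parallel_profiling=2`, two databases can be profiled at the same time, but the schemas of each database are still handled in order.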