Anonymize data via Jobserver

You can enable or disable the option to anonymize the content of columns with data type TEXT and GEO after the profiling process via Jobserver.

Tip For information about Edge, go to Anonymization via Edge.

Warning Currently, if you enable the data anonymization process, you can no longer use automatic data classification via the Data Classification platform. However, you can still classify and anonymize profiling results if you use Edge.

Depending on your environment, follow this procedure either in Collibra Console or on the Services Configuration tab of the Collibra settings:

Collibra Console Collibra Settings

Important You can't edit the service configuration from the Settings page in the latest UI. If you use the latest UI, you can edit the service configuration only in Collibra Console. For more information, go to DGC service configuration settings.

Prerequisites

You have the ADMIN or SUPER role in Collibra Console.

You have a global role that has the Product Rights > System administration global permission.

The Services Configuration tab is available in the Collibra settings.

Steps

In the Collibra settings, open the Services Configuration tab:
1. On the main toolbar, click → Settings.
  The Settings page opens.
2. Click Services Configuration.
3. Click Edit configuration.
In Collibra Console, open the DGC service settings for editing:
1. Open Collibra Console.
  Collibra Console opens with the Infrastructure page.
2. In the tab pane, expand an environment to show its services.
3. In the tab pane, click the Data Governance Center service of that environment.
4. Click Configuration.
5. Click Edit configuration.

In the Data Profiling section, enter the required information:

Setting	Description
Maximum number of samples	The maximum number of samples you want to collect for a data source. The default value is 100. The maximum value is 1,000. This setting is specific to sample data.
Maximum value length	The maximum length of a value extracted during profiling or sampling. Additional characters are trimmed.
Default date pattern	The default format used to decode dates. It is the default pattern used for detecting dates when the Date Pattern and/or Time Pattern attribute is not specified in Column assets.
Default time pattern	The default format used to decode times. It is the default pattern used for detecting times when the Date Pattern and/or Time Pattern attribute is not specified in Column assets.
Default combined date and time pattern	The default format used to decode combined dates and times. It is the default pattern used for detecting combined dates and times when the Date Pattern and/or Time Pattern attribute is not specified in Column assets.
Empty values	A comma-separated list of strings enclosed in double quotes, for example "", "na" and "none". A value that matches one of those strings is considered an empty value. Note that a database null value is always considered an empty value.
Data type detection threshold	The proportion of matching Column values that must be reached for an Advanced Data Type to be considered a possible Data Type for that Column. This is expressed as a value between 0.0 and 1.0.
Anonymize data (Jobserver)	An option to anonymize sensitive data. True: Content in columns with data type Text or Geo is removed or replaced by a random hash value before the profiling results are sent to the cloud. False (default): No content is removed or replaced by a random hash value. Tip For anonymization via Edge, see setting "Anonymize Edge profiling results for all data types".
Database profiling via Edge	An option to enable profiling of synchronized metadata via Edge instead of Jobserver. True: Profiling via Edge is active. False: The profiling option via Edge is not active. Note You can enable Database profiling via Edge only if you also enabled Database registration via Edge.
Maximum duration of a profiling Edge job	The maximum time duration, in minutes, that a profiling Edge job can run before Data Profiling stops the job. The default value is 2,880 minutes (2 days). You can increase this limit to a maximum of 5,760 minutes (4 days).
Parallel database profiling via Edge	The maximum number of schemas that Edge can profile at the same time. The default value is 4, which means that Edge processes four profiling jobs at a time. You can increase this number to a maximum of 16. Profiling several schemas in parallel can considerably shorten the overall profiling time. Note If you increase this number to more than 4, make sure that your Edge site has enough resources to handle the extra requests. If you decrease this number and the number of running jobs exceeds the new limit, no job is canceled. Instead, no new job can be scheduled until the number of running jobs drops below the limit. Example The Parallel database profiling via Edge setting is set to 4. For 1 database that contains 3 schemas, all 3 schemas are profiled at the same time. For 2 databases that contain 4 schemas in total, all 4 schemas are profiled at the same time. For 1 database that contains 8 schemas, profiling starts with 4 schemas and proceeds to the next ones as soon as a job is completed.
Anonymize Edge profiling results for all data types	Enable this option to anonymize all Edge profiling results stored in Collibra. True: Profiling results via Edge are anonymized for all columns. False (default): Profiling results via Edge are anonymized only for columns with the Text or Geo data type.
Calculate Data Similarity	Important Data similarity is a cloud-only feature and is not certified for Collibra Platform for Government. Enables the data similarity feature in your environment. True (default for cloud environments, except for Public sector): Extra algorithms can run during profiling via Edge allowing the calculation of data similarity scores. The data similarity scores are currently used in Data Marketplace to show similar Table assets. False: The feature is not enabled.
Data Similarity Threshold	This setting relates to the data similarity feature and defines the similarity score above which Table assets are displayed as similar data. Enter a value between 0.1 and 0.9. The default value is 0.5, which means that Table assets with a similarity score higher than 50% show up as similar data.
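
To make the Empty values and Data type detection threshold settings more concrete, here is a minimal Python sketch of how the two could interact. It is an illustration only, not Collibra's implementation: the function names are hypothetical, the comparison is assumed to be exact and case-sensitive, and empty values are assumed to be left out of the ratio.

```python
import csv
import io
from typing import Callable, Iterable, Optional, Set


def parse_empty_values(setting: str) -> Set[str]:
    """Parse a comma-separated list of double-quoted strings, such as '"", "na", "none"'."""
    reader = csv.reader(io.StringIO(setting), skipinitialspace=True)
    return set(next(reader, []))


def is_empty(value: Optional[str], empty_values: Set[str]) -> bool:
    """A null value is always empty; other values are compared to the configured strings."""
    # Assumption: exact, case-sensitive comparison.
    return value is None or value in empty_values


def reaches_threshold(
    values: Iterable[Optional[str]],
    matches: Callable[[str], bool],
    threshold: float,
    empty_values: Set[str],
) -> bool:
    """Return True if the share of matching values is at least the threshold (0.0 to 1.0)."""
    # Assumption: empty values are excluded before the ratio is computed.
    candidates = [v for v in values if not is_empty(v, empty_values)]
    if not candidates:
        return False
    ratio = sum(1 for v in candidates if matches(v)) / len(candidates)
    return ratio >= threshold


def looks_like_date(value: str) -> bool:
    """Hypothetical stand-in for a data type detector."""
    return len(value) == 10 and value[4] == "-" and value[7] == "-"


empty = parse_empty_values('"", "na", "none"')
column = ["2024-01-05", "na", None, "2024-02-11", "n/a", "2024-03-09"]
print(reaches_threshold(column, looks_like_date, 0.7, empty))
```

In this sketch, "na" and the null value are ignored, so three of the four remaining values match and the ratio is 0.75.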

Click Save all.
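
The example given for the Parallel database profiling via Edge setting can be sketched as a small simulation. This is only an illustration of the scheduling behavior described in the table, under stated assumptions: the function name and the job durations are hypothetical, all schemas are assumed to be queued at time 0, and a queued job is assumed to start as soon as a running job completes.

```python
import heapq
from typing import List


def profiling_start_times(durations: List[int], limit: int) -> List[int]:
    """Return the start time of each schema profiling job under a concurrency limit."""
    finish_times: List[int] = []  # min-heap of finish times of running jobs
    starts: List[int] = []
    for duration in durations:
        if len(finish_times) < limit:
            start = 0  # a slot is still free, so the job starts immediately
        else:
            start = heapq.heappop(finish_times)  # wait for the earliest job to finish
        heapq.heappush(finish_times, start + duration)
        starts.append(start)
    return starts


# 1 database with 3 schemas, limit 4: all schemas start at the same time.
print(profiling_start_times([10, 10, 10], 4))
# 1 database with 8 schemas, limit 4: 4 schemas start, the rest wait for a free slot.
print(profiling_start_times([10] * 8, 4))
```

With equal durations of 10 time units, the first four jobs start at 0 and the remaining four start at 10, which matches the behavior described for a database with 8 schemas.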