Add the Google Dataplex synchronization capability

After you have created a connection to the Google Cloud Platform (GCP) in your Edge or Collibra Cloud site, you have to add the "Google Dataplex Catalog synchronization" capability to the connection.

Before you start

You either created and installed an Edge site or were granted a Collibra Cloud site.
You have created a connection to the Google Cloud Platform (GCP) in your Edge or Collibra Cloud site.

Required permissions

You have a global role that has the Manage connections and capabilities global permission, for example, Edge integration engineer.

Important

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Platform! You can learn more about this latest UI in the UI overview.

Use the following options to see the documentation in the latest UI or in the previous, classic UI:

Latest UI Classic UI

Steps

Open a site.
1. On the main toolbar, click → Settings.
  The Settings page opens.
2. In the tab pane, click Edge.
  The Sites tab opens and shows a table with an overview of your sites.
3. In the table, click the name of the site whose status is Healthy.
  The site page opens.
In the Capabilities section, click Add capability.
Depending on the UI you use, the Create capability or Add capability page appears.
Select Google Dataplex Catalog synchronization.
This capability is used for both the Google Dataplex ingestion and the Google Dataplex Catalog ingestion.

Enter the required information. The following table applies to the latest UI.

Field	Description	Required
Capability	This section contains general information about the capability.
Name	The name of the capability.	Yes
Description	The description of the capability.	No
GCP service account	This section contains information on how to connect to Google Cloud Storage.
GCP Connection	The GCP connection to be used.	Yes
Configuration	This section contains information on the configuration of the crawlers.
Project IDs (Deprecated)	A comma-separated list of the IDs of the projects where Dataplex is enabled. This field is deprecated in the latest UI and replaced by the Project IDs field on the Synchronize Metadata page, where you can add the Project IDs when you synchronize Google Dataplex. The following rules apply: If you enter a value in this field and do not add Project IDs on the Synchronize Metadata page, the capability searches the projects listed in this field during synchronization. If you leave this field empty and do not add Project IDs on the Synchronize Metadata page, the capability searches the projects that you entered in the Service Account / Workload Identity Federation (WIF) field in the GCP connection. This applies only when the connection type is set to `Service Account`. If you enter a value in this field and also add Project IDs on the Synchronize Metadata page, the synchronization ends with an error.	No
Save input metadata	Select the checkbox if you want to save the input metadata extracted from the data source in ZIP files. The files can be useful for troubleshooting. Select this option only on request of Collibra Support. The Collibra Support team can provide the location of the saved ZIP files after the synchronization. This checkbox is not selected by default.	No
(deprecated) Filters and Domain Mapping	Important: This field is deprecated in the latest UI. You can now define the mappings in the integration configuration. Existing mappings in this field continue to work, but we advise you to move them to the integration configuration.
Text in JSON format to include or exclude lakes and zones, and to configure domain mappings. The text can contain an include block and an exclude block.
In the include block, you can specify the domain in which specific lakes or zones must be ingested. The format is `"project ID > lake ID > zone ID": "domain ID"`. For example, `"integrations-automated-uer > testlake > testzone": "c8fe882a-a12e-4284-b655-7ac2a4fb08cb"`. You can also specify the domain in which specific tables and columns must be ingested. The format is `"project ID > lake ID > zone ID > table ID": "domain ID"`.
In the exclude block, you can specify the lakes or zones that you don't want to ingest. For example, `"* > test"`.
The following rules apply:
The exclude block has priority over the include block.
If the include block is not present, all assets are ingested into the same domain as the System asset.
If there is no explicit domain mapping for a zone, the domain specified for the lake is used.
You can use the keyword `default` as a domain ID. In that case, the lake or zone is ingested in the same domain as the System asset.
A match with a lake has priority over a match with a zone.
The integration fails before the synchronization starts if one or more domain IDs specified in the include block don't exist.
The integration fails before the synchronization starts if a domain ID is left empty in the include block.
You can use the ? and * wildcards in the zone and lake names.
If a lake or zone matches multiple lines, the most detailed match is taken into account.
If you registered the BigQuery data source via the BigQuery JDBC connector, and then integrate Google Dataplex, assets are ingested in the same domains that were registered during JDBC ingestion. Specifically, Project assets are registered in the Database domains, and Zone assets are registered in the Schema domains. The mapping created by JDBC ingestion takes priority over the configurations in this field. In this way, no duplicated tables or columns are created. For more information, go to Ways to work with Google Cloud Platform (GCP).
Examples
Example 1: Ingest assets from Projects and Lakes to separate domains
{
  "include": {
    "integrations-automated-uer > testlake": "20000000-0000-0000-0000-000000000000",
    "integrations-automated-uer > testlake > us-east-1": "8c09099c-2e89-4d06-a880-464172b7767e",
    "integrations-automated-uer > ay-west-1": "e274aed6-fdd7-40bb-80d0-2ae27848855d",
    "integrations-automated-uer > vb-mapping-test-1": "default"
  },
  "exclude": [
    "integrations-automated-uer > testlake > testzone"
  ]
}
In this example:
Assets from Project ID "integrations-automated-uer" and Lake ID "testlake" will be ingested into the Collibra domain with ID "20000000-0000-0000-0000-000000000000". However, all assets from the "us-east-1" zone will be ingested into the domain with ID "8c09099c-2e89-4d06-a880-464172b7767e".
All assets from Project ID "integrations-automated-uer" and Lake ID "ay-west-1" will be ingested into the domain with ID "e274aed6-fdd7-40bb-80d0-2ae27848855d".
All assets from Project ID "integrations-automated-uer" and Lake ID "vb-mapping-test-1" will be ingested in the same domain as the System asset.
All assets from Zone ID "testzone" in Project ID "integrations-automated-uer" and Lake ID "testlake" will be excluded.
Example 2: Ingest assets from Projects, Lakes, and Zones to separate domains
{
  "include": {
    "project": "project-domain-id",
    "project > datalake": "lake-domain-id",
    "project > datalake > zone1": "domain-id-1",
    "project > datalake > zone2": "domain-id-2",
    "project > datalake > zone1 > *": "table-column-domain-id"
  },
  "exclude": [
    "project > datalake > zone3"
  ]
}
In this example: The GCP Project, Dataplex Lake, Zone 1, and Zone 2 will be ingested in separate domains from each other. Zone 1 will be ingested in the `domain-id-1` domain, and the tables and columns from Zone 1 will be ingested in the `table-column-domain-id` domain. If you use this strategy, make sure to exclude any lakes and zones that you don't want to ingest.
Examples 3 and 4: Ingest assets from Tables to a separate domain
If your Dataplex has the following structure, JSON examples 3 and 4 show different ways of filtering.
Dataplex structure:
project-1
  -lake-1
    zone-1
      table-1
      table-2
    zone-2
      table-3
  -lake-2
    zone-3
      table-4
    zone-4
      table-5
project-2
  -lake-3
    table-7
Example 3
This JSON example ensures that only the hierarchy of `table-1` will be ingested, with the table in the `table-column-domain-id` domain. No exclude block is needed:
{
  "include": {
    "project-1 > lake-1 > zone-1 > table-1": "table-column-domain-id"
  }
}
Assets in Data Catalog resulting from the configuration in Example 3:
project-1 : system-domain(default)
  -lake-1 : system-domain(default)
    zone-1 : system-domain(default)
      table-1 : table-column-domain
Example 4
This JSON example ensures that the `table-1` Table is ingested in a different domain from the other assets in the Project:
{
  "include": {
    "project-1": "project-domain-id",
    "project-1 > lake-1 > zone-1 > table-1": "table-column-domain-id"
  }
}
Assets in Data Catalog resulting from the configuration in Example 4:
project-1 : project-domain
  -lake-1 : project-domain
    zone-1 : project-domain
      table-1 : table-column-domain
      table-2 : project-domain
    zone-2 : project-domain
      table-3 : project-domain
  -lake-2 : project-domain
    zone-3 : project-domain
      table-4 : project-domain
    zone-4 : project-domain
      table-5 : project-domain
Example 5: Integrate Google Dataplex with Google BigQuery
{
  "include": {
    "integrations-automated-uer > testBigQuerylake > us-east-1": "10098d38-3e53-463d-8e4b-0b93d308a8ea"
  }
}
In this example, all assets in the us-east-1 Dataplex Zone, including the Schema, Table, and Column assets, will be ingested in the `10098d38-3e53-463d-8e4b-0b93d308a8ea` domain.	No
Extensible Properties Mapping	Via the Extensible Properties Mapping field, you can integrate additional properties from Dataplex: Table creation date, Table modified date, System (showing where the table comes from), and type (the Zone type).
Important: If you use this feature, make sure to set up all required characteristic assignments for the asset types.
To integrate the properties, add a JSON string that maps the fields of the objects in Dataplex to the IDs of the Collibra attributes in which to ingest the data. The text can contain a zones block and a tables block. In each block, you specify the property name and the ID of the attribute to which you want to map the property value. The format is `"[property name]": "[attribute resource ID]"`. For example, `"system": "19a27fda-8c50-48a8-87b3-f275ad450fe5"`.
Example
{
  "tables": {
    "system": "19a27fda-8c50-48a8-87b3-f275ad450fe5",
    "create_time": "00c57a11-37ca-4259-9c38-0ac5e522e9e8",
    "update_time": "a415c2a6-8289-4a4d-8d49-3685712d7622"
  },
  "zones": {
    "zone_type": "c217db55-b5d6-4430-ad80-8534e691e54a"
  }
}	No
Default Asset Status	Define which status assets need to receive during the integration synchronization. No Status (default): With the first synchronization, assets receive the first status listed in the Operating Model statuses. During a resynchronization, the status is not updated. For example, if you change an asset status from "Candidate" to "Review" before resynchronization, the status remains "Review." Implemented: all assets get the "Implemented" status.	No
Advanced Configuration	These configuration options help when investigating issues with the capability. Important Only complete the fields Save Input Metadata, Logging configuration, Memory (MiB), and JVM arguments on request of or together with Collibra Support. Only use Log level if your data source is a commercial JDBC offering. For more information, go to the Collibra Marketplace.	No
Debug	This field is ignored when you integrate metadata from Google Dataplex. An option to automatically send Edge infrastructure log files to Collibra Platform. By default, this option is set to false. Note: We highly recommend that you send Edge infrastructure log files to Collibra Platform only when you have issues with Edge. If you set the option to true, it automatically reverts to false after 24 hours.	No
Log level	This field is ignored when you integrate metadata from Google Dataplex. An option to determine the verbosity level of Catalog connector log files. By default, this option is set to No logging.	No

Enter the required information. The following table applies to the classic UI.

Field	Description	Required
Capability	This section contains general information about the capability.
Name	The name of the capability.	Yes
Description	The description of the capability.	No
Capability template	The capability template. The value that you select in this field determines which sections appear on the page. Select the following capability: `Google Dataplex Catalog synchronization`	Yes
GCP Connection	This section contains information on how to connect to Google Cloud Platform.
GCP Connection	The GCP connection to be used.	Yes
Configuration	This section contains information on the configuration of the capability.
Project IDs	Add a comma-separated list of the Project IDs where Dataplex is enabled. The capability will search in these projects. If the Project IDs field is empty, the integration will search in the project included in the provided GCP Service Account Credentials JSON.	No
Save input metadata	Select the checkbox if you want to save the input metadata extracted from the data source in ZIP files. The files can be useful for troubleshooting. Select this option only on request of Collibra Support. The Collibra Support team can provide the location of the saved ZIP files after the synchronization. This checkbox is not selected by default.	No
Filters and Domain Mapping (in preview)	Text in JSON format to include or exclude lakes and zones, and to configure domain mappings. The text can contain an include block and an exclude block.
In the include block, you can specify the domain in which specific lakes or zones must be ingested. The format is `"project ID > lake ID > zone ID": "domain ID"`. For example, `"integrations-automated-uer > testlake > testzone": "c8fe882a-a12e-4284-b655-7ac2a4fb08cb"`. You can also specify the domain in which specific tables and columns must be ingested. The format is `"project ID > lake ID > zone ID > table ID": "domain ID"`.
In the exclude block, you can specify the lakes or zones that you don't want to ingest. For example, `"* > test"`.
The following rules apply:
The exclude block has priority over the include block.
If the include block is not present, all assets are ingested into the same domain as the System asset.
If there is no explicit domain mapping for a zone, the domain specified for the lake is used.
You can use the keyword `default` as a domain ID. In that case, the lake or zone is ingested in the same domain as the System asset.
A match with a lake has priority over a match with a zone.
The integration fails before the synchronization starts if one or more domain IDs specified in the include block don't exist.
The integration fails before the synchronization starts if a domain ID is left empty in the include block.
You can use the ? and * wildcards in the zone and lake names.
If a lake or zone matches multiple lines, the most detailed match is taken into account.
If you registered the BigQuery data source via the BigQuery JDBC connector, and then integrate Google Dataplex, assets are ingested in the same domains that were registered during JDBC ingestion. Specifically, Project assets are registered in the Database domains, and Zone assets are registered in the Schema domains. The mapping created by JDBC ingestion takes priority over the configurations in this field.
In this way, no duplicated tables or columns are created. For more information, go to Ways to work with Google Cloud Platform (GCP).
Examples
Example 1: Ingest assets from Projects and Lakes to separate domains
{
  "include": {
    "integrations-automated-uer > testlake": "20000000-0000-0000-0000-000000000000",
    "integrations-automated-uer > testlake > us-east-1": "8c09099c-2e89-4d06-a880-464172b7767e",
    "integrations-automated-uer > ay-west-1": "e274aed6-fdd7-40bb-80d0-2ae27848855d",
    "integrations-automated-uer > vb-mapping-test-1": "default"
  },
  "exclude": [
    "integrations-automated-uer > testlake > testzone"
  ]
}
In this example:
Assets from Project ID "integrations-automated-uer" and Lake ID "testlake" will be ingested into the Collibra domain with ID "20000000-0000-0000-0000-000000000000". However, all assets from the "us-east-1" zone will be ingested into the domain with ID "8c09099c-2e89-4d06-a880-464172b7767e".
All assets from Project ID "integrations-automated-uer" and Lake ID "ay-west-1" will be ingested into the domain with ID "e274aed6-fdd7-40bb-80d0-2ae27848855d".
All assets from Project ID "integrations-automated-uer" and Lake ID "vb-mapping-test-1" will be ingested in the same domain as the System asset.
All assets from Zone ID "testzone" in Project ID "integrations-automated-uer" and Lake ID "testlake" will be excluded.
Example 2: Ingest assets from Projects, Lakes, and Zones to separate domains
{
  "include": {
    "project": "project-domain-id",
    "project > datalake": "lake-domain-id",
    "project > datalake > zone1": "domain-id-1",
    "project > datalake > zone2": "domain-id-2",
    "project > datalake > zone1 > *": "table-column-domain-id"
  },
  "exclude": [
    "project > datalake > zone3"
  ]
}
In this example: The GCP Project, Dataplex Lake, Zone 1, and Zone 2 will be ingested in separate domains from each other. Zone 1 will be ingested in the `domain-id-1` domain, and the tables and columns from Zone 1 will be ingested in the `table-column-domain-id` domain.
If you use this strategy, make sure to exclude any lakes and zones that you don't want to ingest.
Examples 3 and 4: Ingest assets from Tables to a separate domain
If your Dataplex has the following structure, JSON examples 3 and 4 show different ways of filtering.
Dataplex structure:
project-1
  -lake-1
    zone-1
      table-1
      table-2
    zone-2
      table-3
  -lake-2
    zone-3
      table-4
    zone-4
      table-5
project-2
  -lake-3
    table-7
Example 3
This JSON example ensures that only the hierarchy of `table-1` will be ingested, with the table in the `table-column-domain-id` domain. No exclude block is needed:
{
  "include": {
    "project-1 > lake-1 > zone-1 > table-1": "table-column-domain-id"
  }
}
Assets in Data Catalog resulting from the configuration in Example 3:
project-1 : system-domain(default)
  -lake-1 : system-domain(default)
    zone-1 : system-domain(default)
      table-1 : table-column-domain
Example 4
This JSON example ensures that the `table-1` Table is ingested in a different domain from the other assets in the Project:
{
  "include": {
    "project-1": "project-domain-id",
    "project-1 > lake-1 > zone-1 > table-1": "table-column-domain-id"
  }
}
Assets in Data Catalog resulting from the configuration in Example 4:
project-1 : project-domain
  -lake-1 : project-domain
    zone-1 : project-domain
      table-1 : table-column-domain
      table-2 : project-domain
    zone-2 : project-domain
      table-3 : project-domain
  -lake-2 : project-domain
    zone-3 : project-domain
      table-4 : project-domain
    zone-4 : project-domain
      table-5 : project-domain
Example 5: Integrate Google Dataplex Catalog with Google BigQuery
{
  "include": {
    "integrations-automated-uer > testBigQuerylake > us-east-1": "10098d38-3e53-463d-8e4b-0b93d308a8ea"
  }
}
In this example, all assets in the us-east-1 Dataplex Zone, including the Schema, Table, and Column assets, will be ingested in the `10098d38-3e53-463d-8e4b-0b93d308a8ea` domain.	No
Extensible Properties Mapping (in preview)	Via the Extensible Properties Mapping field, you can integrate additional properties from Dataplex: Table creation date, Table modified date, System (showing where the table comes from), and type (the Zone type).
Important: If you use this feature, make sure to set up all required characteristic assignments for the asset types. Because this feature is in preview, it is currently intended for non-production use.
To integrate the properties, add a JSON string that maps the fields of the objects in Dataplex to the IDs of the Collibra attributes in which to ingest the data. The text can contain a zones block and a tables block. In each block, you specify the property name and the ID of the attribute to which you want to map the property value. The format is `"[property name]": "[attribute resource ID]"`. For example, `"system": "19a27fda-8c50-48a8-87b3-f275ad450fe5"`.
Example
{
  "tables": {
    "system": "19a27fda-8c50-48a8-87b3-f275ad450fe5",
    "create_time": "00c57a11-37ca-4259-9c38-0ac5e522e9e8",
    "update_time": "a415c2a6-8289-4a4d-8d49-3685712d7622"
  },
  "zones": {
    "zone_type": "c217db55-b5d6-4430-ad80-8534e691e54a"
  }
}	No
Advanced Configuration	These configuration options help when investigating issues with the capability. Important Only complete the fields Save Input Metadata, Logging configuration, Memory (MiB), and JVM arguments on request of or together with Collibra Support. Only use Log level if your data source is a commercial JDBC offering. For more information, go to the Collibra Marketplace.	No
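
The integration fails before the synchronization starts if the Filters and Domain Mapping value contains an empty domain ID, so it can help to check the JSON locally before you paste it into the field. The following is a minimal Python sketch of such a check, not part of the product. The function name `check_filters_and_domain_mapping` is hypothetical, and the sketch assumes that domain IDs are either the keyword `default` or UUIDs, and that include paths have one to four levels (project, lake, zone, table) separated by `>`. It cannot verify that a domain actually exists in Collibra.

```python
import json
import uuid


def check_filters_and_domain_mapping(text):
    """Return a list of problems found in a Filters and Domain Mapping value."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    problems = []

    include = config.get("include", {})
    if not isinstance(include, dict):
        problems.append("include must be an object")
        include = {}
    for path, domain_id in include.items():
        # An empty domain ID makes the integration fail before it starts.
        if not isinstance(domain_id, str) or not domain_id.strip():
            problems.append(f"empty domain ID for {path!r}")
        elif domain_id != "default":
            # Assumption: real domain IDs are UUIDs.
            try:
                uuid.UUID(domain_id)
            except ValueError:
                problems.append(f"domain ID for {path!r} is not a UUID")
        # Assumption: project > lake > zone > table, so 1 to 4 levels.
        levels = [part.strip() for part in path.split(">")]
        if not 1 <= len(levels) <= 4 or any(not level for level in levels):
            problems.append(f"malformed path {path!r}")

    exclude = config.get("exclude", [])
    if not isinstance(exclude, list) or not all(isinstance(e, str) for e in exclude):
        problems.append("exclude must be a list of strings")

    return problems
```

Calling the function with the text that you plan to enter returns an empty list when no problem is found. A missing quotation mark or a trailing comma is reported as invalid JSON.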

Click Create.
The capability is added to the Edge or Collibra Cloud site.
The fields become read-only.
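
Because the fields become read-only after the capability is created, it can be worth checking the Extensible Properties Mapping value before you click Create. The following Python sketch is one possible check, under stated assumptions: the function name `check_extensible_properties` is hypothetical, the property names in `KNOWN_PROPERTIES` are taken only from the example in this topic and may not be the complete set that the capability accepts, and attribute IDs are assumed to be UUIDs.

```python
import json
import uuid

# Assumption: property names copied from the example in this topic.
KNOWN_PROPERTIES = {
    "tables": {"system", "create_time", "update_time"},
    "zones": {"zone_type"},
}


def check_extensible_properties(text):
    """Return a list of problems found in an Extensible Properties Mapping value."""
    try:
        mapping = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(mapping, dict):
        return ["top level must be a JSON object"]

    problems = []
    for block, properties in mapping.items():
        if block not in KNOWN_PROPERTIES:
            problems.append(f"unexpected block {block!r}")
            continue
        if not isinstance(properties, dict):
            problems.append(f"{block!r} must be an object")
            continue
        for name, attribute_id in properties.items():
            if name not in KNOWN_PROPERTIES[block]:
                problems.append(f"unexpected property {name!r} in {block!r}")
            # Assumption: attribute resource IDs are UUIDs.
            try:
                uuid.UUID(str(attribute_id))
            except ValueError:
                problems.append(f"attribute ID for {name!r} is not a UUID")
    return problems
```

An empty list means that the sketch found no problem. It does not confirm that the attributes exist in Collibra or that the characteristic assignments are set up.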

What's next?

You can now synchronize Google Dataplex. Go to Synchronize via Google Dataplex ingestion.