Synchronize via Google Dataplex ingestion
Synchronizing via Google Dataplex ingestion integrates metadata from Google Dataplex projects and makes it available in Collibra Platform.
You can either synchronize manually or automate the process by adding a synchronization schedule.
Prerequisites
In your Collibra environment
- You have created a GCP connection.
- You have added the Google Dataplex Catalog synchronization capability to the GCP connection.
- You know in which System asset you want to add the Google Dataplex assets.
- If you have registered Google databases before via the JDBC driver, use the same System asset.
- If you never registered Google databases before, create a new System asset manually and use that one.
- You have a resource role with the Configure external system resource permission, for example, Owner.
- You have a global role with the Catalog global permission, for example, Catalog Author.
- You have a global role with the View Edge connections and capabilities global permission, for example, Edge integration engineer.
In your GCP environment
- You have enabled the Cloud Resource Manager API in GCP.
Steps
- On the main toolbar, click → Catalog.
The Catalog homepage opens.
- In the tab bar, click Integrations.
The Integrations page opens.
- Click the Integration Configuration tab.
- In the Connection Name column, locate the GCP connection that you used when you added the Dataplex capability and click the capability link in the Capabilities column.
The Dataplex capability configuration page opens.
- In the Synchronization Configuration section, click the Edit icon.
- In Ingestion Type, select Dataplex ingestion.
This integrates the metadata from the projects, lakes, zones, tables, and columns.
If you want to integrate the Dataplex Universal Catalog Entries and Aspects, go to Dataplex Universal Catalog integration.
- Complete the fields as follows:
Field: System
Required: Yes
Action: In System, select the System asset in which you want to add the Google Dataplex assets.

Field: Updated: <timestamp>
Required: No
Action: Click Updated: <timestamp> next to Synchronization Configuration, where timestamp indicates the last time the data was loaded from Google Dataplex.
The Project IDs are loaded into the dropdown list of the Project Id fields that you can use in the following step. This can take some time.

Field: Project ID
Required: No
Action: To add a Project ID where Dataplex is enabled, click Add Project Id. You can add multiple Project IDs. The capability will search in these projects.
The following rules apply when you add Project IDs:
- If you do not add Project IDs here but entered a value in the Project IDs (Deprecated) field in the Dataplex capability, the capability will search in the projects that you entered in the capability.
- If you do not add Project IDs here and left the Project IDs (Deprecated) field empty in the Dataplex capability, the capability will search in the projects that you entered in the Service Account / Workload Identity Federation (WIF) field in the GCP connection. This applies only when the connection type is set to Service Account.
- Do not add Project IDs here if you also entered a value in the Project IDs (Deprecated) field in the Dataplex capability. Doing both causes the synchronization to end with an error.

Field: Dataplex location
Required: No
Action: Select the Dataplex locations you want to integrate. The Dataplex ingestion only allows single-region locations. Type the name of the location and press Enter.
- If you select locations, the integration ingests Dataplex assets only from the specified locations.
- If the location is added in Dataplex but is not visible in the list, you can use this field to add the location for integration.
For more information, go to Dataplex locations in Google Cloud documentation.

Field: Domain Include Mappings
Required: No
Action: In Domain Include Mappings, specify the entries in Google Dataplex that you want to integrate and the Collibra domains where they need to be added. Here's how it works:
- If no include mappings are defined, we ingest all assets into the same domain as the System asset.
- If there is no explicit domain mapping for a schema, we use the domain specified for the database.
- A match with a database has priority over a match with a schema.
To limit the scope of metadata ingestion to specific domains in Collibra, add a domain include mapping:
- Click Add Domain Include Mappings.
- In Path, add the path to the entries in Google Dataplex for which you want to integrate the metadata.
Tip: Use the following pattern: project name > lake name > zone name > table name.
You can use the question mark (?) and asterisk (*) wildcards. To include all entries within a defined scope, use the asterisk (*) wildcard to account for a string of characters. To define a more granular scope, use the question mark (?) wildcard to account for single-character variations.
If an entry matches multiple lines, the most detailed match is taken into account.
Example:
integrations-automated-user > testlake > testzone > table
projectA > datalakeX > zone1
projectC > datalakeY > zone2 > *
- In Domain, select the Collibra domain in which you want to integrate the metadata.

Field: Domain Exclude Mappings
Required: No
Action: In Domain Exclude Mappings, specify the path to entries in Google Dataplex that you don't want to integrate.
Note: The exclude mapping has priority over the include mapping.
To exclude specific metadata from being ingested into Collibra, add a domain exclude mapping:
- Click Add Domain Exclude Mappings.
- In the field, add the path to entries that you want to exclude.
Tip: You can use the question mark (?) and asterisk (*) wildcards. To exclude all entries within a defined scope, use the asterisk (*) wildcard to account for a string of characters. To limit the scope to a more granular filter, use the question mark (?) wildcard to account for single-character variations.
For example: projectA > * > Test.

Field: Custom Label Mappings
Required: No
Action: Use this field to ingest and map labels from Google Dataplex to asset attributes in Data Catalog.
You can map labels to any out-of-the-box (OOTB) attributes or custom attributes on the Dataplex Lake, Dataplex Zone, and Schema asset types.
Before you add mappings, check the existing value and the max cardinality of the attribute you're mapping to, and follow these rules:
- Make sure the number of mappings does not exceed the Max Cardinality of the attribute; otherwise, an error occurs and synchronization stops.
- If the attribute already has one or more values, ensure that the total number of existing values and new mappings does not exceed the max cardinality; otherwise, an error occurs and synchronization stops.
- If the max cardinality is 1 and you add one mapping, the existing value is replaced.
- If the max cardinality is 2 or more, the new values are added until the limit is reached.
To add a custom label mapping:
- Click Custom Label Mappings.
- In the Label field, enter the label key from Google Dataplex.
- In the Attribute field, select an OOTB attribute or a custom attribute for the Dataplex Lake, Dataplex Zone, and Schema asset types.

- Click Save.
- Click Synchronize.
A notification indicates the synchronization has started.
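The include and exclude mapping rules in the field descriptions above can be sketched as a small resolver. This is only an illustration of the documented behavior, not Collibra's implementation. The function names are hypothetical, and several details are assumptions: a path shorter than four segments is treated as covering everything below it, "most detailed" is read as "more path segments, then fewer wildcards", and an entry that matches no include mapping (when at least one is defined) is skipped. Python's fnmatchcase also interprets [...] character sets, which the wildcard description above does not mention.

```python
from fnmatch import fnmatchcase


def split_path(path):
    """Split 'project > lake > zone > table' into trimmed segments."""
    return [segment.strip() for segment in path.split(">")]


def matches(pattern, entry):
    """Assumption: segments are compared position by position, and a
    pattern with fewer segments also covers everything below it."""
    pattern_segments = split_path(pattern)
    entry_segments = split_path(entry)
    if len(pattern_segments) > len(entry_segments):
        return False
    return all(fnmatchcase(e, p) for p, e in zip(pattern_segments, entry_segments))


def specificity(pattern):
    """Assumption: 'most detailed' means more segments, then fewer wildcards."""
    segments = split_path(pattern)
    wildcards = sum(s.count("*") + s.count("?") for s in segments)
    return (len(segments), -wildcards)


def resolve_domain(entry, include_mappings, exclude_paths, system_domain):
    """Return the target domain for an entry, or None if it is not ingested.

    include_mappings is a list of (path, domain) pairs; exclude_paths is a
    list of paths. Both names are hypothetical.
    """
    # The exclude mapping has priority over the include mapping.
    if any(matches(path, entry) for path in exclude_paths):
        return None
    # Without include mappings, everything lands in the System asset's domain.
    if not include_mappings:
        return system_domain
    candidates = [
        (specificity(path), domain)
        for path, domain in include_mappings
        if matches(path, entry)
    ]
    if not candidates:
        return None  # assumption: outside the include scope means skipped
    return max(candidates, key=lambda candidate: candidate[0])[1]


includes = [
    ("projectA > datalakeX > zone1", "Zone 1 Domain"),
    ("projectA > datalakeX > zone1 > orders*", "Orders Domain"),
    ("projectC > datalakeY > zone2 > *", "Zone 2 Domain"),
]
excludes = ["projectA > * > Test"]
```

With these sample mappings, a table such as projectA > datalakeX > zone1 > orders_2024 matches two include lines, and the longer path decides the domain. Anything under a zone named Test in projectA is dropped by the exclude mapping before the include mappings are consulted.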
- On the main toolbar, click → Catalog.
The Catalog homepage opens.
- On the main toolbar, click .
The Create dialog box appears.
- In the Register with Edge section of the Create dialog box, click Integration Configuration.
The Integration Configuration tab page opens.
- In the Connection Name column, locate the GCP connection that you used when you added the Dataplex capability and click the capability link in the Capabilities column.
The Dataplex capability configuration page opens.
- In the Synchronization Configuration section, click the Edit icon.
- In Ingestion Type, select Dataplex ingestion.
This integrates the metadata from the projects, lakes, zones, tables, and columns.
If you want to integrate the Dataplex Catalog Entries and Aspects, go to Dataplex Catalog ingestion.
- Complete the fields as follows:
Field: System
Required: Yes
Action: In System, select the System asset in which you want to add the Google Dataplex assets.

Field: Updated: <timestamp>
Required: No
Action: Click Updated: <timestamp> next to Synchronization Configuration, where timestamp indicates the last time the data was loaded from Google Dataplex.
The Project IDs are loaded into the dropdown list of the Project Id fields that you can use in the following step. This can take some time.

Field: Project ID
Required: No
Action: To add a Project ID where Dataplex is enabled, click Add Project Id. You can add multiple Project IDs. The capability will search in these projects.
The following rules apply when you add Project IDs:
- If you do not add Project IDs here but entered a value in the Project IDs (Deprecated) field in the Dataplex capability, the capability will search in the projects that you entered in the capability.
- If you do not add Project IDs here and left the Project IDs (Deprecated) field empty in the Dataplex capability, the capability will search in the projects that you entered in the Service Account / Workload Identity Federation (WIF) field in the GCP connection. This applies only when the connection type is set to Service Account.
- Do not add Project IDs here if you also entered a value in the Project IDs (Deprecated) field in the Dataplex capability. Doing both causes the synchronization to end with an error.

Field: Dataplex location
Required: No
Action: Select the Dataplex locations you want to integrate. The Dataplex ingestion only allows single-region locations. Type the name of the location and press Enter.
- If you select locations, the integration ingests Dataplex assets only from the specified locations.
- If the location is added in Dataplex but is not visible in the list, you can use this field to add the location for integration.
For more information, go to Dataplex locations in Google Cloud documentation.

Field: Domain Include Mappings
Required: No
Action: In Domain Include Mappings, specify the entries in Google Dataplex that you want to integrate and the Collibra domains where they need to be added. Here's how it works:
- If no include mappings are defined, we ingest all assets into the same domain as the System asset.
- If there is no explicit domain mapping for a schema, we use the domain specified for the database.
- A match with a database has priority over a match with a schema.
To limit the scope of metadata ingestion to specific domains in Collibra, add a domain include mapping:
- Click Add Domain Include Mappings.
- In Path, add the path to the entries in Google Dataplex for which you want to integrate the metadata.
Tip: Use the following pattern: project name > lake name > zone name > table name.
You can use the question mark (?) and asterisk (*) wildcards. To include all entries within a defined scope, use the asterisk (*) wildcard to account for a string of characters. To define a more granular scope, use the question mark (?) wildcard to account for single-character variations.
If an entry matches multiple lines, the most detailed match is taken into account.
Example:
integrations-automated-user > testlake > testzone > table
projectA > datalakeX > zone1
projectC > datalakeY > zone2 > *
- In Domain, select the Collibra domain in which you want to integrate the metadata.

Field: Domain Exclude Mappings
Required: No
Action: In Domain Exclude Mappings, specify the path to entries in Google Dataplex that you don't want to integrate.
Note: The exclude mapping has priority over the include mapping.
To exclude specific metadata from being ingested into Collibra, add a domain exclude mapping:
- Click Add Domain Exclude Mappings.
- In the field, add the path to entries that you want to exclude.
Tip: You can use the question mark (?) and asterisk (*) wildcards. To exclude all entries within a defined scope, use the asterisk (*) wildcard to account for a string of characters. To limit the scope to a more granular filter, use the question mark (?) wildcard to account for single-character variations.
For example: projectA > * > Test.

Field: Custom Label Mappings
Required: No
Action: Use this field to ingest and map labels from Google Dataplex to asset attributes in Data Catalog.
You can map labels to any out-of-the-box (OOTB) attributes or custom attributes on the Dataplex Lake, Dataplex Zone, and Schema asset types.
Before you add mappings, check the existing value and the max cardinality of the attribute you're mapping to, and follow these rules:
- Make sure the number of mappings does not exceed the Max Cardinality of the attribute; otherwise, an error occurs and synchronization stops.
- If the attribute already has one or more values, ensure that the total number of existing values and new mappings does not exceed the max cardinality; otherwise, an error occurs and synchronization stops.
- If the max cardinality is 1 and you add one mapping, the existing value is replaced.
- If the max cardinality is 2 or more, the new values are added until the limit is reached.
To add a custom label mapping:
- Click Custom Label Mappings.
- In the Label field, enter the label key from Google Dataplex.
- In the Attribute field, select an OOTB attribute or a custom attribute for the Dataplex Lake, Dataplex Zone, and Schema asset types.

- Click Save.
- Click the Add synchronization schedule icon.
- Enter the required information and click Save:
Field: Repeat
Description: The interval when you want to synchronize automatically. The possible values are: Daily, Weekly, Monthly, and Cron expression.

Field: Cron
Description: The Quartz Cron expression that determines when the synchronization takes place.
This field is only visible if you select Cron expression in the Repeat field.

Field: Every
Description: The day on which you want to synchronize, for example, Sunday.
This field is only visible if you select Weekly in the Repeat field.

Field: Every first
Description: The day of the month on which you want to synchronize, for example, Tuesday.
This field is only visible if you select Monthly in the Repeat field.

Field: At
Description: The time at which you want to synchronize automatically, for example, 14:00.
- You can only schedule on the hour. For example, you can add a synchronization schedule at 8:00, but not at 8:45.
- This field is only visible if you select Daily, Weekly, or Monthly in the Repeat field.

Field: Time zone
Description: The time zone for the schedule.
The synchronization job synchronizes the Google Dataplex data.
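For the Cron option in the schedule, it can help to recall that a Quartz Cron expression has six mandatory fields (seconds, minutes, hours, day of month, month, day of week) plus an optional year, unlike the five-field Unix crontab format. The sketch below only labels the fields of an expression so you can check that each value sits in the intended position. It does not validate the values, and the function name is hypothetical.

```python
QUARTZ_FIELDS = (
    "seconds",
    "minutes",
    "hours",
    "day_of_month",
    "month",
    "day_of_week",
    "year",  # optional seventh field
)


def describe_quartz(expression):
    """Label each whitespace-separated field of a Quartz Cron expression."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(
            f"expected 6 or 7 fields in a Quartz expression, got {len(parts)}"
        )
    return dict(zip(QUARTZ_FIELDS, parts))


# 08:00:00 on every weekday; '?' means 'no specific value' for day of month.
weekday_morning = describe_quartz("0 0 8 ? * MON-FRI")
```

A five-field expression such as 0 8 * * * is rejected by this check, which is a common slip when copying schedules from Unix cron.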
After the synchronization:
- You can view a summary of the results from the Activities list.
- The resulting assets get a relation to the System asset that you selected.
For information on the integrated data via the Dataplex ingestion, go to Synchronized data via Google Dataplex ingestion.
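The Project ID rules listed in the steps above amount to a short precedence check, sketched here for illustration. The function and parameter names are hypothetical and this is not how the capability is implemented internally. The source does not say what happens when both fields are empty and the connection type is not Service Account, so the sketch returns an empty list in that case as a placeholder.

```python
def resolve_project_ids(
    sync_config_ids,
    deprecated_capability_ids,
    connection_project_ids,
    connection_type="Service Account",
):
    """Return the projects the capability would search, per the documented rules."""
    # Filling in both fields makes the synchronization end with an error.
    if sync_config_ids and deprecated_capability_ids:
        raise ValueError(
            "Project IDs are set in both the synchronization configuration "
            "and the Project IDs (Deprecated) field"
        )
    if sync_config_ids:
        return list(sync_config_ids)
    # Fall back to the deprecated capability field when it has a value.
    if deprecated_capability_ids:
        return list(deprecated_capability_ids)
    # Last fallback: projects from the GCP connection, Service Account only.
    if connection_type == "Service Account":
        return list(connection_project_ids)
    return []  # assumption: behavior for other connection types is not documented
```

For example, resolve_project_ids([], ["legacy-project"], ["conn-project"]) yields the deprecated capability value, while passing values for both of the first two arguments raises an error, mirroring the failed synchronization.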