Google Dataplex ingestion via Edge

Note This ingestion is no longer in active development and will only be updated for defect fixes. Consider using the Google Dataplex Catalog ingestion via Edge instead.

Google Dataplex is a technical catalog in Google Cloud that provides information about the data in the various Dataplex projects. If you integrate Google Dataplex, you integrate the metadata of all data in your Dataplex projects into Collibra Platform. Collibra offers two Dataplex integration types: Google Dataplex ingestion and Google Dataplex Catalog ingestion. For information, go to Integrating Google Dataplex.

The Google Dataplex ingestion is based on Dataplex and results in assets that represent the projects, lakes, zones, tables, and columns. For information on other ingestion types, go to Integrating Google Dataplex.

Important

We only integrate the metadata, so you cannot get sample data for the columns and tables, or profile and classify them. If you want to get samples and to profile and classify the data, you can combine the integration of Google Dataplex with the registration of a BigQuery data source. For more information, go to Ways to work with Google Cloud Platform (GCP).
The current Dataplex GCS discovery system has a limit of 1,000 tables per bucket.

The following images show the asset types in Collibra after you integrate Google Dataplex via the Google Dataplex ingestion. The asset types for Google Cloud Storage (GCS) assets may or may not include the GCS Bucket asset type. The asset types for the integration of Google Dataplex with Google BigQuery include the Schema asset type.

GCS assets without GCS Bucket
GCS assets with GCS Bucket
Google BigQuery

Important

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Platform! You can learn more about this latest UI in the UI overview.

Use the following options to see the documentation in the latest UI or in the previous, classic UI:

Latest UI Classic UI

Important

You can integrate Google Dataplex only via Edge, not via Jobserver.

For information on Google Dataplex, go to the Google documentation.
For information on the supported data types, go to the Google documentation on data types.

Note When you add a bucket to Dataplex and Dataplex identifies schemas (tables and columns) for files in the bucket, these tables and columns are also added automatically to BigQuery by Dataplex.

Combining the ways of working

You can combine the Google Dataplex integration via Dataplex ingestion with the registration of a Google BigQuery database because both result in the same technology assets.

You can use the Dataplex ingestion to quickly get an overview of all your databases in Collibra Platform. Once you have a better view of the important databases, you can register them individually via the JDBC driver.

Important

You can't combine the Google Dataplex integration via Dataplex Catalog ingestion with the registration of a Google BigQuery database.
Use the same System asset for the integration and the registration.

Combining the two ways of working with GCP

You first register a BigQuery data source via the BigQuery JDBC connector.
If you then integrate Google Dataplex via the Dataplex ingestion, the integration:
- Skips the assets that have been registered via JDBC.
- Adds the new information from Google Dataplex.

You first integrate Google Dataplex via the Dataplex ingestion.
If you then register a BigQuery data source via the JDBC connector, the registration adds all data source assets.
This results in duplicate assets.
If you then integrate Google Dataplex again via the Dataplex ingestion, the integration:
- Adds the new information from Google Dataplex.
- Skips the assets that have been registered via JDBC.
- Marks any assets that were removed or excluded from the integration as Missing from source.

Example
Your Google Dataplex consists of three databases: A, B, C.
1. You integrate the metadata from Google Dataplex.
  This results in the Database assets: A, B, C.
2. You want to register the metadata of database C to access the profiling and classification results.
  The JDBC registration results in a Database asset C', with the same metadata as C.
3. You integrate the metadata again (because there have been updates).
  - If you don't exclude database C, all databases will be updated, except for C'.
  - If you exclude database C, the assets of database C receive the Missing from source status, and you can remove them manually.

From that moment, you should:
- For A and B, use the Google Dataplex integration via Dataplex ingestion with exclude rules for C, to update the metadata.
- For C', use the synchronization via JDBC to update the metadata.
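
The behavior described above can be sketched as a small simulation. This is only an illustrative model of the rules on this page, not Collibra code or a Collibra API: the names Asset, register_jdbc, integrate_dataplex, and the updates counter are all hypothetical, and the JDBC-registered duplicate C' is modeled as an asset with the same name but the source "jdbc".

```python
from dataclasses import dataclass

MISSING = "Missing from source"


@dataclass
class Asset:
    name: str
    source: str           # "dataplex" or "jdbc" (hypothetical labels)
    status: str = "Active"
    updates: int = 0       # number of later runs that refreshed this asset


def register_jdbc(catalog, name):
    """Model a JDBC registration: it always adds its own Database asset."""
    catalog[(name, "jdbc")] = Asset(name, "jdbc")


def integrate_dataplex(catalog, databases, exclude=()):
    """Model one Dataplex ingestion run over the given database names."""
    included = [d for d in databases if d not in exclude]
    for name in included:
        existing = catalog.get((name, "dataplex"))
        if existing is not None:
            # Asset created by an earlier ingestion: refresh it.
            existing.status = "Active"
            existing.updates += 1
        elif (name, "jdbc") in catalog:
            # Asset registered via JDBC first: the ingestion skips it.
            continue
        else:
            # New information from Google Dataplex: add it.
            catalog[(name, "dataplex")] = Asset(name, "dataplex")
    # Ingested assets that are removed or excluded are flagged, not deleted.
    for (name, source), asset in catalog.items():
        if source == "dataplex" and name not in included:
            asset.status = MISSING


# Walk through the example: ingest A, B, C; register C via JDBC (C');
# then ingest again with an exclude rule for C.
catalog = {}
integrate_dataplex(catalog, ["A", "B", "C"])
register_jdbc(catalog, "C")
integrate_dataplex(catalog, ["A", "B", "C"], exclude=["C"])

for (name, source), asset in sorted(catalog.items()):
    print(name, source, asset.status, asset.updates)
```

In this model, the ingested asset C ends up flagged as Missing from source while the JDBC-registered copy stays untouched, which mirrors the recommendation to maintain A and B through the ingestion and C' through the JDBC synchronization.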