Ways to work with Google Cloud Platform (GCP)

In Collibra Data Intelligence Platform, you can:

  • Register individual Google BigQuery databases via the BigQuery JDBC driver.
  • Integrate a Google Cloud Storage (GCS) file system.
  • Integrate all metadata of the projects from Google Dataplex Catalog.

It's important to understand the difference between these methods because each one produces a different result in Collibra.

Possible ways to work with GCP and the result in Collibra:

Integrating Google Dataplex Catalog

Google Dataplex Catalog is a technical catalog on the Google side that provides information about all the data in the various Dataplex projects. If you use the Google Dataplex Catalog integration, Collibra registers and synchronizes the GCP Projects, Dataplex Lakes, Dataplex Zones, Tables, and Columns.

The Google Dataplex Catalog synchronization creates the whole asset structure that represents the Dataplex objects (Project, Lake, Zone, Table, and Column) and lets you filter based on Lakes and Zones.

Integrating a Google Cloud Storage file system

The Google Cloud Storage (GCS) file system integration lets you register GCS as a data source in Collibra and synchronize its metadata. The GCS integration supports Google Dataplex, a service used for schema discovery. With Dataplex, you can integrate the schemas, tables, and columns from the files and create a File Group asset in Collibra rather than multiple File assets.

The GCS integration integrates data from GCS based on the configured crawler. In addition, it adds the Tables and Columns that Dataplex recognizes and that are related to the files and file groups.

Registering a Google BigQuery database

If you register a specific Google BigQuery data source via the BigQuery JDBC connector, the resulting assets represent the columns and the tables in the database. You can retrieve sample data, and you can profile and classify the data.

Combining the ways of working

You can combine the Google Dataplex Catalog integration and the registration of a Google BigQuery database because they result in the same technology assets. You can use the Google Dataplex Catalog integration to quickly get an overview of all your databases in Collibra Data Intelligence Platform. Once you have a better view of the important databases, you can register them individually via the JDBC driver.

Important: Use the same System asset for the integration and the registration.

Combining the two ways of working with GCP

JDBC registration first, then Google Dataplex Catalog integration
  1. You first register a BigQuery data source via the BigQuery JDBC connector.
  2. If you then integrate Google Dataplex Catalog, the integration:
    • Skips the assets that have been registered via JDBC.
    • Adds the new information from Google Dataplex Catalog.

Google Dataplex Catalog integration first, then JDBC registration
  1. You first integrate Google Dataplex Catalog.
  2. If you then register a BigQuery data source via the JDBC connector, the registration adds all data source assets.
    This results in duplicate assets.
  3. If you then integrate Google Dataplex Catalog again, the integration:
    • Adds the new information from Google Dataplex Catalog.
    • Skips the assets that have been registered via JDBC.
    • Marks any assets that were removed or excluded from the integration as Missing from source.
    Example 

    Your Google Dataplex Catalog consists of three databases: A, B, C.

    1. You integrate the metadata from Google Dataplex Catalog.
      This results in the Database assets: A, B, C.
    2. You want to register the metadata of database C to access the profiling and classification results.
      The JDBC registration results in a Database asset C', with the same metadata as C.
    3. You integrate the metadata again (because there have been updates).
      • If you don't exclude database C, all databases will be updated, except for C'.
      • If you exclude database C, the Database asset C and its assets receive the "Missing from source" status, and you can manually remove them.

    From that moment, you should:

    • To update the metadata of A and B, use the Google Dataplex Catalog integration with exclude rules for C.
    • To update the metadata of C', use the synchronization via JDBC.
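
The behavior described in the example above can be pictured with a small simulation. This is only an illustrative sketch and not Collibra's implementation or API: the `Asset` class, the `register_jdbc` and `integrate_dataplex` functions, the "Active" status, and the `revision` counter are all hypothetical names introduced here. It assumes that an asset is identified by its name plus the way it was created, which is how the duplicate C' arises.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    origin: str              # "dataplex" or "jdbc" (hypothetical labels)
    status: str = "Active"   # hypothetical default status
    revision: int = 1        # hypothetical counter, bumped on every update


# The catalog is keyed by (name, origin), so C and C' can coexist.
Catalog = dict[tuple[str, str], Asset]


def register_jdbc(catalog: Catalog, name: str) -> None:
    """Register or re-synchronize one database via JDBC."""
    key = (name, "jdbc")
    if key in catalog:
        catalog[key].revision += 1
    else:
        # Added even if a Dataplex asset with the same name exists,
        # which is what produces the duplicate.
        catalog[key] = Asset(name, "jdbc")


def integrate_dataplex(catalog: Catalog, source: set[str],
                       exclude: frozenset[str] = frozenset()) -> None:
    """Integrate the databases found in the source, minus the exclusions."""
    included = set(source) - set(exclude)
    for name in sorted(included):
        dataplex_key = (name, "dataplex")
        if dataplex_key in catalog:
            # Existing Dataplex asset: add the new information.
            catalog[dataplex_key].revision += 1
            catalog[dataplex_key].status = "Active"
        elif (name, "jdbc") in catalog:
            # Already registered via JDBC only: skip it.
            continue
        else:
            catalog[dataplex_key] = Asset(name, "dataplex")
    # Dataplex assets that were removed or excluded are flagged.
    for (name, origin), asset in catalog.items():
        if origin == "dataplex" and name not in included:
            asset.status = "Missing from source"


# Walk through the example: integrate A, B, C, register C via JDBC,
# then integrate again with C excluded.
catalog: Catalog = {}
integrate_dataplex(catalog, {"A", "B", "C"})
register_jdbc(catalog, "C")                      # creates C'
integrate_dataplex(catalog, {"A", "B", "C"}, exclude=frozenset({"C"}))

for asset in catalog.values():
    print(asset.name, asset.origin, asset.status, asset.revision)
```

In this sketch, A and B are updated by the second integration, the Dataplex asset C ends up as "Missing from source", and the JDBC asset C' is left untouched until `register_jdbc` is called for it again.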