Ways to work with Databricks

Important 

In Collibra 2024.05, we've launched a new user interface (UI) for Collibra Data Intelligence Platform! You can learn more about this latest UI in the UI overview.


In Collibra Data Intelligence Platform, you can work with Databricks in two ways: you can register individual Databricks databases via the Databricks JDBC driver, or you can integrate the metadata of all databases from Databricks Unity Catalog.
It is important to understand the difference between the two because each produces a different result in Collibra.

The following sections describe each possible way to work with Databricks and its result in Collibra.

Integrating metadata from Databricks Unity Catalog

If you integrate Databricks Unity Catalog, you integrate the metadata of all databases in the Databricks Unity Catalog metastore into Collibra Data Intelligence Platform. The resulting assets represent the Databricks databases, schemas, tables and columns.

Note 
  • Because the integration only brings in metadata, you cannot get sample data for the columns and tables, nor can you profile or classify them. To do that, you need to register the Databricks database via the Databricks JDBC driver. For more information, go to the section on combining the integration and the JDBC driver.
  • The Databricks Unity Catalog integration supports the following table types: EXTERNAL, MANAGED, STREAMING_TABLE, and VIEW.
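
As a rough illustration of the table-type restriction in the note above, the following Python sketch separates tables whose type is supported from those whose type is not. The helper name `split_by_support`, the table names, and the placeholder type `OTHER_TYPE` are hypothetical. This is not how Collibra implements the check; it is only a way to preview which tables you can expect to see as assets.

```python
# Table types the Databricks Unity Catalog integration supports (from the note above).
SUPPORTED_TABLE_TYPES = frozenset({"EXTERNAL", "MANAGED", "STREAMING_TABLE", "VIEW"})


def split_by_support(tables):
    """Split (name, table_type) pairs into supported and unsupported table names.

    Hypothetical helper for illustration only; not part of any Collibra or Databricks API.
    """
    supported, unsupported = [], []
    for name, table_type in tables:
        if table_type.strip().upper() in SUPPORTED_TABLE_TYPES:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


# Made-up inventory; "OTHER_TYPE" stands in for any type not listed in the note.
inventory = [
    ("sales.orders", "MANAGED"),
    ("sales.orders_raw", "EXTERNAL"),
    ("sales.orders_live", "STREAMING_TABLE"),
    ("sales.orders_summary", "VIEW"),
    ("sales.orders_misc", "OTHER_TYPE"),
]
supported, unsupported = split_by_support(inventory)
print("Expected as assets:", supported)
print("Not integrated:", unsupported)
```
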
Important 

You can integrate Databricks Unity Catalog only via Edge. You cannot integrate Databricks Unity Catalog via Jobserver.

Use the Databricks Unity Catalog connector if you want to integrate many databases at once and quickly, or if Databricks Unity Catalog is activated in your organization.

With JDBC, you need to register each database individually.

Registering a Databricks data source via the Databricks JDBC connector

If you register a specific Databricks data source via the Databricks JDBC connector, the resulting assets represent the columns and the tables in the Databricks database.
You can retrieve sample data, and you can profile and classify the data.

Use the JDBC driver for Databricks if you want to profile, classify, and request sample data for the data source.
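
The guidance for choosing between the two ways of working can be summarized as a small decision helper. The Python sketch below is only a restatement of the rules of thumb in this topic; the function name `recommend_approaches` and its parameters are hypothetical and do not correspond to any Collibra setting.

```python
def recommend_approaches(*, many_databases=False, unity_catalog_enabled=False,
                         needs_sampling_profiling_or_classification=False):
    """Return the ways of working that match the stated needs.

    Hypothetical helper that encodes the guidance in this topic; it returns
    both approaches when both conditions apply.
    """
    approaches = []
    # Unity Catalog integration: many databases at once, or Unity Catalog is in use.
    if many_databases or unity_catalog_enabled:
        approaches.append("Databricks Unity Catalog integration (via Edge)")
    # JDBC registration: needed for sample data, profiling, and classification.
    if needs_sampling_profiling_or_classification:
        approaches.append("Databricks JDBC registration")
    return approaches


print(recommend_approaches(many_databases=True))
print(recommend_approaches(unity_catalog_enabled=True,
                           needs_sampling_profiling_or_classification=True))
```
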

Combining the two ways of working with Databricks

The two approaches are not mutually exclusive. You can use both to show the information you want in Collibra Data Intelligence Platform. You can use the Databricks Unity Catalog integration to quickly get an overview of all your Databricks databases in Collibra Data Intelligence Platform. Once you have identified the important databases, you can register them individually via the JDBC driver.

The result depends on the order in which you use the two ways of working.

Scenario 1: JDBC registration first, then Databricks Unity Catalog integration
  1. You first register a Databricks data source via the Databricks JDBC connector.
  2. If you then integrate Databricks Unity Catalog, the integration:
    • Skips the assets that have been registered via JDBC.
    • Adds the new information from Databricks Unity Catalog.

Scenario 2: Databricks Unity Catalog integration first, then JDBC registration
  1. You first integrate Databricks Unity Catalog.
  2. If you then register a Databricks data source via the JDBC connector, the registration adds all data source assets.
    This results in duplicate assets.
  3. If you then integrate Databricks Unity Catalog again, we advise you to exclude the databases that you registered via JDBC. You can do this via the Filters and Domain Mapping property in the Databricks Unity Catalog capability (classic UI) or the integration configuration (latest UI). The integration:
    • Adds the new information from Databricks Unity Catalog.
    • Skips the assets that have been registered via JDBC.
    • Marks any assets that were removed or excluded from the integration as Missing from source. You can manually remove them.
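
The behavior of a repeated Databricks Unity Catalog integration described above can be modeled as a small in-memory simulation. The following Python sketch is a simplified mental model, not Collibra's implementation: the `Asset` class, the `reintegrate` function, the `origin` labels, and the status strings are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    origin: str            # "unity_catalog" or "jdbc" (hypothetical labels)
    status: str = "active"


def reintegrate(assets, source_databases, excluded=frozenset()):
    """Simulate one Unity Catalog integration run over existing assets.

    Assumed rules, taken from the steps above: JDBC-registered assets are
    skipped, in-scope Unity Catalog assets are updated, new databases are
    added, and removed or excluded ones are marked as missing from source.
    """
    in_scope = set(source_databases) - set(excluded)
    handled = set()
    result = []
    for asset in assets:
        if asset.origin == "jdbc":
            result.append(asset)          # skipped: left untouched
            handled.add(asset.name)
        elif asset.name in in_scope:
            result.append(Asset(asset.name, asset.origin, "updated"))
            handled.add(asset.name)
        else:
            result.append(Asset(asset.name, asset.origin, "missing_from_source"))
    for name in sorted(in_scope - handled):
        result.append(Asset(name, "unity_catalog", "added"))
    return result


existing = [
    Asset("A", "unity_catalog"),
    Asset("B", "unity_catalog"),
    Asset("C", "unity_catalog"),
    Asset("C", "jdbc"),                   # duplicate created by JDBC registration
]
for asset in reintegrate(existing, ["A", "B", "C", "D"], excluded={"C"}):
    print(asset)
```

In this model, excluding a database that is also registered via JDBC leaves the JDBC asset alone and flags only the earlier Unity Catalog copy for manual cleanup.
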
Example 

Your Databricks Unity Catalog contains three databases: A, B, and C.

  1. You integrate the metadata from Databricks Unity Catalog.
    This results in the Database assets: A, B, C.
  2. You register database C via JDBC to access the profiling and classification results.
    The JDBC registration results in a Database asset C', with the same metadata as C.
  3. You integrate the metadata again (because there have been updates).
    • If you don't exclude database C, the integration updates all databases, except for C'.
    • If you exclude database C, the assets of database C receive the "Missing from source" status, and you can manually remove them.

From that moment on, you should:
  • For A and B, use the Databricks Unity Catalog integration with exclude rules for C, to update the metadata.
  • For C', use the synchronization via JDBC to update the metadata.
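
The resulting division of work in this example can be written down as a simple update plan. The Python sketch below is illustrative only; `build_update_plan` is a hypothetical helper, and the returned exclude list merely names the databases to exclude rather than reproducing the actual format of the Filters and Domain Mapping property.

```python
def build_update_plan(all_databases, jdbc_registered):
    """Map each database to the mechanism that should keep its metadata current.

    Hypothetical helper: databases registered via JDBC are updated by JDBC
    synchronization and excluded from the Unity Catalog integration.
    """
    jdbc_registered = set(jdbc_registered)
    plan = {}
    for database in all_databases:
        if database in jdbc_registered:
            plan[database] = "JDBC synchronization"
        else:
            plan[database] = "Databricks Unity Catalog integration"
    # Databases to exclude from the integration (names only, not the real filter syntax).
    to_exclude = sorted(jdbc_registered & set(all_databases))
    return plan, to_exclude


plan, to_exclude = build_update_plan(["A", "B", "C"], jdbc_registered={"C"})
print(plan)
print("Exclude from integration:", to_exclude)
```
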

Important 

Use the same System asset for the integration and the registration. Otherwise, assets will be duplicated.

For more information about Databricks, go to the Databricks documentation.