Ways to work with Databricks

Important 

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Platform! You can learn more about this latest UI in the UI overview.

In Collibra Platform, you can work with Databricks in two ways.
You can integrate all database metadata from Databricks Unity Catalog, or you can register individual Databricks databases via the Databricks JDBC driver. It is important to understand the difference between these approaches, because they produce different results in Collibra.

Two ways to work with Databricks

Integrating metadata from Databricks Unity Catalog

If you integrate Databricks Unity Catalog, you can integrate the metadata of all or multiple databases in the Databricks Unity Catalog metastore into Collibra Platform. The resulting assets represent the Databricks databases, schemas, tables and columns.

Important This is the preferred way of working with Databricks Unity Catalog because it shows the hierarchy of the assets and allows you to set up sampling, profiling, and classification (beta).
If you used a combination of integrating Databricks Unity Catalog and registering an individual Databricks database via the Databricks JDBC driver before, and you want to switch to using the integration only, go to Switching to working exclusively with the Databricks Unity Catalog integration (beta).

Important Because we only integrate the metadata, you cannot get sample data for the columns and tables, nor profile or classify them. If you want to do that, you need to register the Databricks database via the Databricks JDBC driver. For more information, go to Combining the ways of working with Databricks.

Note 
  • The Databricks Unity Catalog integration supports the following table types: EXTERNAL, MANAGED, STREAMING_TABLE, and VIEW.
  • You can integrate Databricks Unity Catalog only via Edge.
Tip You can also integrate Databricks AI models. To do so, make sure to set up your connection permissions and define the required configuration. For the best AI integration experience and functionality, make sure to enable AI Governance. If you don't enable AI Governance, the AI integration functionality is limited.

Use the Databricks Unity Catalog connector if Databricks Unity Catalog is activated in your organization or if you want to integrate many databases quickly. With JDBC, you must register data sources one database at a time.
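The difference in scope can be sketched with a short example. The snippet below groups flat metadata rows, shaped loosely like those in Unity Catalog's `information_schema.tables` view, into the catalog > schema > table hierarchy that the integration mirrors as assets. The field names and the `build_hierarchy` function are illustrative assumptions, not Collibra's or Databricks' API:

```python
# Sketch: grouping flat Unity Catalog metadata rows into the
# catalog > schema > table hierarchy that the integration mirrors
# as Database, Schema, Table, and Column assets. The row shape
# loosely follows Unity Catalog's information_schema.tables view;
# the names here are illustrative.
from collections import defaultdict

def build_hierarchy(rows):
    """rows: iterable of (catalog, schema, table, table_type) tuples."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    # Only the table types the integration supports (see the note above).
    supported = {"EXTERNAL", "MANAGED", "STREAMING_TABLE", "VIEW"}
    for catalog, schema, table, table_type in rows:
        if table_type in supported:
            hierarchy[catalog][schema].append(table)
    return {catalog: dict(schemas) for catalog, schemas in hierarchy.items()}

rows = [
    ("sales", "raw", "orders", "MANAGED"),
    ("sales", "raw", "orders_ext", "EXTERNAL"),
    ("sales", "reporting", "daily_view", "VIEW"),
    ("hr", "core", "employees", "MANAGED"),
    ("hr", "core", "scratch", "FOREIGN"),  # unsupported type, skipped
]
print(build_hierarchy(rows))
```

One Unity Catalog integration run covers every database in the metastore at once; achieving the same coverage via JDBC would mean one registration per database.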

For more information, go to Steps overview: Integrating metadata from Databricks Unity Catalog.

Registering a Databricks data source via the Databricks JDBC connector

If you register a specific Databricks data source via the Databricks JDBC connector, the resulting assets represent the columns and the tables in the Databricks database.
You can also extend the setup to allow for the retrieval of sample data, profiling, and classification.

Use the JDBC driver for Databricks if you want to profile, classify, and request sample data for the data source.

For more information, go to Registering a Databricks data source via the Databricks JDBC connector.
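For reference, a Databricks JDBC connection URL typically takes the following form. The host name, HTTP path, and personal access token are placeholders, and the exact properties depend on your driver version and authentication method, so check your Databricks workspace's connection details:

```
jdbc:databricks://<server-hostname>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>
```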

Combining the ways of working with Databricks

The two approaches don't cancel each other out; you can use both to show the information you want in Collibra Platform. Use the Databricks Unity Catalog integration to quickly get an overview of all your Databricks databases in Collibra Platform. Once you have a better view of the important databases, you can register them individually via the JDBC driver.

Combining the two ways of working with Databricks
If you first register a Databricks data source via the Databricks JDBC connector and then integrate Databricks Unity Catalog, the integration, using the same System asset:
  • Skips the assets that have been registered via JDBC.
  • Adds the new information from Databricks Unity Catalog.

If you first integrate Databricks Unity Catalog and then register a Databricks data source via the JDBC connector, the registration adds all data source assets, which results in duplicate assets. If you then integrate Databricks Unity Catalog again, we advise you to exclude the databases that you registered via JDBC. You can do this via the Filters and Domain Mapping property in the Databricks Unity Catalog capability (in the classic UI) or in the integration configuration (in the latest UI). The integration:
  • Adds the new information from Databricks Unity Catalog.
  • Skips the assets that have been registered via JDBC.
  • Marks any assets that were removed or excluded from the integration as Missing from source. You can manually remove them.
Example 

Your Databricks Unity Catalog consists of three databases: A, B, and C.

  1. You integrate the metadata from Databricks Unity Catalog.
    This results in the Database assets: A, B, C.
  2. You want to register the metadata of database C to access the profiling and classification results.
    The JDBC registration results in a Database asset C', with the same metadata as C.
  3. You integrate the metadata again (because there have been updates).
    • If you don't exclude database C, all databases are updated, except for C', which was registered via JDBC.
    • If you exclude database C, the Database asset C receives the Missing from source status, and you can manually remove these assets.
From that moment, you should:
  • For A and B, use the Databricks Unity Catalog integration with exclude rules for C, to update the metadata.
  • For C', synchronize via JDBC to update the metadata.
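The example above can be sketched as a small simulation. The function below mimics a re-run of the Unity Catalog integration: it skips JDBC-registered assets and marks excluded or removed databases as Missing from source. The function and asset dictionaries are illustrative assumptions for this walkthrough, not Collibra's API:

```python
# Sketch of the example above: how re-running the Unity Catalog
# integration interacts with a database registered via the JDBC
# driver. Statuses mirror the documentation; the data model is
# illustrative, not Collibra's API.

def integrate_unity_catalog(assets, source_dbs, excluded=()):
    """Re-sync: update UC-managed assets, skip JDBC-registered ones,
    and mark excluded or removed databases as Missing from source."""
    for name, asset in assets.items():
        if asset["origin"] == "jdbc":
            continue  # skipped: this asset is owned by the JDBC registration
        if name in excluded or name not in source_dbs:
            asset["status"] = "Missing from source"
        else:
            asset["status"] = "Synchronized"
    return assets

# Unity Catalog holds databases A, B, and C; the first integration
# created one asset per database.
assets = {db: {"origin": "uc", "status": "Synchronized"} for db in "ABC"}
# Registering C via the JDBC driver creates a duplicate asset C'.
assets["C'"] = {"origin": "jdbc", "status": "Synchronized"}

# Re-integrate, excluding C as the documentation advises.
integrate_unity_catalog(assets, source_dbs={"A", "B", "C"}, excluded={"C"})
print(assets["C"]["status"])   # Missing from source
print(assets["C'"]["status"])  # Synchronized (untouched by the integration)
```

From this point on, A and B are kept up to date by the integration, while C' is kept up to date by the JDBC synchronization, matching the split described above.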

Important Use the same System asset for the integration and the registration. Otherwise, assets will be duplicated.

Switching to working exclusively with the Databricks Unity Catalog integration (beta)

If you previously used both the Databricks Unity Catalog integration and the Databricks JDBC synchronization for some databases, and you want to switch to using only the Databricks Unity Catalog integration, complete the following steps:

  1. Update the existing Databricks Unity Catalog synchronization capability.

    1. Go to the existing Databricks Unity Catalog synchronization capability.
    2. Click Edit.
    3. Add the existing Databricks JDBC connection to the JDBC Databricks Connection field.
    4. Click Save.
  2. Re-synchronize the Databricks Unity Catalog integration.

If you previously set up sampling, profiling, and classification, and the JDBC connection is specified in the Databricks Unity Catalog capability, you can now profile and classify the data and get sample data for the assets integrated by the Databricks Unity Catalog integration. For the steps to set up sampling, profiling, and classification, go to Steps: Integrate Databricks Unity Catalog via Edge.

Helpful resources

For more information about Databricks, go to the Databricks documentation.