Ways to work with Databricks

You can work with Databricks in Collibra Platform in the following two ways:

  • Integrate the metadata of all or selected databases from Databricks Unity Catalog. You can also choose to allow for sampling, profiling, and classification (in preview). We recommend using the Databricks Unity Catalog integration.
  • Register individual Databricks databases via the Databricks JDBC driver.

It is important to understand the difference between these two ways, because the resulting assets in Collibra differ.

Way 1: Integrating metadata from Databricks Unity Catalog

You can integrate the metadata of all or multiple databases from Databricks Unity Catalog into Collibra. The integrated assets are Databricks databases, schemas, tables, and columns. You can integrate Databricks Unity Catalog only via Edge.
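The integrated hierarchy follows Unity Catalog's three-level namespace (catalog, schema, table) plus columns. As a minimal sketch of how the levels nest, assuming purely illustrative names ("main", "sales", and so on, none of which come from your environment):

```python
# Illustrative sketch of the asset hierarchy the Unity Catalog integration
# creates in Collibra. All names below are placeholders.
hierarchy = {
    "main": {                                     # Database asset (Unity Catalog catalog)
        "sales": {                                # Schema asset
            "orders": ["order_id", "amount"],     # Table asset with Column assets
            "customers": ["customer_id", "name"],
        }
    }
}

def full_names(h):
    """Yield three-level names as Databricks would address the tables."""
    for catalog, schemas in h.items():
        for schema, tables in schemas.items():
            for table in tables:
                yield f"{catalog}.{schema}.{table}"

print(list(full_names(hierarchy)))
```

Each nesting level maps to one integrated asset type: Database, Schema, Table, and Column.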

Integrating Databricks Unity Catalog is the recommended way to work with Databricks Unity Catalog because it shows the hierarchy of the assets and allows you to set up sampling, profiling, and classification (in preview).

Use the Databricks Unity Catalog integration if:

  • You want to integrate multiple databases simultaneously and efficiently.
  • Databricks Unity Catalog is activated in your organization.

With the JDBC connector, you need to register each database individually.

If you previously used a combination of integrating Databricks Unity Catalog and registering an individual Databricks database via the Databricks JDBC driver, and you want to switch to using only the integration, go to Switching to working only with Databricks Unity Catalog integration (in preview).

Tip  You can also integrate Databricks AI models via Edge. To do so, set up your connection permissions and then define the required configuration. AI Governance is not a prerequisite for this integration. However, you must enable AI Governance to access the AI Governance product pages and leverage features that help you visualize and govern the ingested AI-related data.

Supported table types

The Databricks Unity Catalog integration supports the following table types:

  • EXTERNAL
  • MANAGED
  • STREAMING_TABLE
  • VIEW

Key considerations

Because only the metadata is integrated, you can neither get sample data for tables and columns nor profile or classify them. If you want to do so, you need to register the Databricks database via the Databricks JDBC driver. For more information, go to Combining the two ways.

Way 2: Registering a Databricks data source via the Databricks JDBC connector

If you register a specific Databricks data source via the Databricks JDBC connector, the resulting assets represent the columns and the tables in the Databricks database.
You can also extend the setup to allow for the retrieval of sample data, profiling, and classification.

Use the JDBC driver for Databricks if you want to profile, classify, and request sample data for the data source.

For more information, go to Registering a Databricks data source via the Databricks JDBC connector.
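A JDBC-based registration is defined by a Databricks JDBC connection URL. The following is a minimal sketch of assembling such a URL with personal-access-token authentication; the hostname, HTTP path, and the helper function itself are placeholders, and the exact properties your driver version accepts may differ, so check the Databricks JDBC driver documentation:

```python
# Hedged sketch: build a Databricks JDBC connection URL.
# Hostname and HTTP path below are placeholders, not real endpoints.
def databricks_jdbc_url(hostname: str, http_path: str) -> str:
    # AuthMech=3 selects personal-access-token authentication in the
    # Databricks JDBC driver (UID=token, PWD=<your token>).
    props = {
        "transportMode": "http",
        "ssl": "1",
        "httpPath": http_path,
        "AuthMech": "3",
    }
    prop_str = ";".join(f"{k}={v}" for k, v in props.items())
    return f"jdbc:databricks://{hostname}:443;{prop_str}"

url = databricks_jdbc_url(
    "adb-1234567890123456.7.azuredatabricks.net",  # placeholder hostname
    "/sql/1.0/warehouses/abc123",                  # placeholder HTTP path
)
print(url)
```

The token itself is supplied separately as the password, not embedded in the URL, so it is not exposed in connection metadata.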

Combining the two ways

The two ways of working with Databricks are not mutually exclusive. You can use both to show the information you want in Collibra. For example, you can use the Databricks Unity Catalog integration to quickly get an overview of all your Databricks databases in Collibra. Once you identify the important databases, you can register them individually via the JDBC driver.

If you first register via the JDBC connector:
  1. Register a Databricks data source via the Databricks JDBC connector.
  2. If you then integrate Databricks Unity Catalog using the same System asset, the integration:
    • Skips the assets that have been registered via JDBC.
    • Adds the new information from Databricks Unity Catalog.

If you first integrate Databricks Unity Catalog:
  1. Integrate Databricks Unity Catalog.
  2. If you then register a Databricks data source via the JDBC connector, the registration adds all data source assets, resulting in duplicate assets.
  3. If you then integrate Databricks Unity Catalog again, we recommend that you exclude the databases that you registered via JDBC. You can do this via the Filters and Domain Mapping property in the Databricks Unity Catalog capability (classic UI) or the integration configuration (latest UI). The integration:
    • Adds the new information from Databricks Unity Catalog.
    • Skips the assets that have been registered via JDBC.
    • Marks any assets that were removed or excluded from the integration as Missing from source. You can manually remove them.

Important Use the same System asset for both integration and registration. Otherwise, assets will be duplicated.

Switching to working only with Databricks Unity Catalog integration (in preview)

If you previously used both the Databricks Unity Catalog integration and Databricks JDBC synchronization for some databases, and you now want to switch to using only the Databricks Unity Catalog integration, complete the following steps:

  1. Update the existing Databricks Unity Catalog synchronization capability:

    1. Go to the existing Databricks Unity Catalog synchronization capability.
    2. Click Edit.
    3. Add the existing Databricks JDBC connection to the JDBC Databricks Connection field.
    4. Click Save.
  2. Resynchronize the Databricks Unity Catalog integration.

If you previously set up sampling, profiling, and classification, and the JDBC connection is specified in the Databricks Unity Catalog capability, you can now profile and classify the data and get sample data for the assets integrated by the Databricks Unity Catalog integration. For the steps to set up sampling, profiling, and classification, go to Steps: Integrate Databricks Unity Catalog via Edge.

Helpful resources

For more information about Databricks, go to the Databricks documentation.