Ways to work with Databricks

You can work with Databricks in Collibra Platform in the following two ways:

  • Integrate the metadata of all databases from Databricks Unity Catalog. You can also enable sampling, profiling, and classification (in preview). We recommend using the Databricks Unity Catalog integration.
  • Register individual Databricks databases via the Databricks JDBC driver.

It is important to understand the difference between these two ways because the resulting data in Collibra differs.

Way 1: Integrating metadata from Databricks Unity Catalog

You can integrate the metadata of all or multiple databases from Databricks Unity Catalog into Collibra. The integrated assets are Databricks databases, schemas, tables, and columns. You can integrate Databricks Unity Catalog only via Edge.

Integrating Databricks Unity Catalog is the recommended way to work with Databricks Unity Catalog because it shows the hierarchy of the assets and allows you to set up sampling, profiling, and classification (in preview).

If you previously used a combination of integrating Databricks Unity Catalog and registering an individual Databricks database via the Databricks JDBC driver, and you want to switch to using only the integration, go to Switching to working only with Databricks Unity Catalog integration (in preview).

Tip  You can also integrate Databricks AI models via Edge. To do so, set up your connection permissions and then define the required configuration.

You can configure Edge connections and capabilities without an active AI Governance license. However, AI Governance must be enabled to harvest AI model metadata, ingest corresponding AI assets in Data Catalog, and access the dashboards and features necessary to visualize and govern your AI landscape.

Supported table types

The Databricks Unity Catalog integration supports the following table types:

  • EXTERNAL
  • MANAGED
  • STREAMING_TABLE
  • VIEW
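
If you want to estimate which of your tables the integration will ingest, you can filter a table listing by these types. The following sketch is illustrative only: the row shape and field names are assumptions, not the actual Databricks or Collibra metadata schema.

```python
# Table types the Databricks Unity Catalog integration supports.
SUPPORTED_TABLE_TYPES = {"EXTERNAL", "MANAGED", "STREAMING_TABLE", "VIEW"}

# Hypothetical metadata rows, e.g. as you might assemble them from a
# catalog listing (illustrative shape, not a real API response).
tables = [
    {"name": "sales", "table_type": "MANAGED"},
    {"name": "raw_events", "table_type": "STREAMING_TABLE"},
    {"name": "scratch_clone", "table_type": "MANAGED_SHALLOW_CLONE"},  # not supported
]

# Keep only the table types the integration ingests.
ingestible = [t for t in tables if t["table_type"] in SUPPORTED_TABLE_TYPES]
```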

Way 2: Registering a Databricks data source via the Databricks JDBC connector

You can register a Databricks data source using the Databricks JDBC connector to ingest your metadata in Collibra. This process creates assets that represent your Databricks tables and columns, providing a clear view of your data landscape. You can also configure the connection to retrieve sample data, profile your data, and set up data classification.

For more information, go to Registering a Databricks data source via Databricks JDBC connector and Edge.
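
The JDBC connection you configure for this registration uses a Databricks JDBC URL. The sketch below builds one with placeholder values; the exact properties your connection needs (in particular the authentication settings) depend on your setup, so treat this as an assumption to verify against the Databricks JDBC driver documentation.

```python
def databricks_jdbc_url(hostname: str, http_path: str) -> str:
    """Build a basic Databricks JDBC URL.

    hostname and http_path are placeholders for your workspace hostname
    and SQL warehouse (or cluster) HTTP path. AuthMech=3 with UID=token
    is the common personal-access-token pattern; your environment may
    require different properties.
    """
    return (
        f"jdbc:databricks://{hostname}:443;"
        f"httpPath={http_path};AuthMech=3;UID=token"
    )

# Example with placeholder values:
url = databricks_jdbc_url("adb-123.azuredatabricks.net", "/sql/1.0/warehouses/abc")
```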

Combining the two ways

The two ways of working with Databricks are not mutually exclusive. You can use both to show the information you want in Collibra. For example, you can use the Databricks Unity Catalog integration to quickly get an overview of all your Databricks databases in Collibra. Once you identify the important databases, you can register them individually via the JDBC driver.

Combining the two ways of working with Databricks

Scenario 1: Register first, then integrate

  1. Register a Databricks data source via the Databricks JDBC connector.
  2. If you then integrate Databricks Unity Catalog using the same System asset, the integration:
    • Skips the assets that were registered via JDBC.
    • Adds the new information from Databricks Unity Catalog.

Scenario 2: Integrate first, then register

  1. Integrate Databricks Unity Catalog.
  2. If you then register a Databricks data source via the JDBC connector, the registration adds all data source assets, resulting in duplicate assets.
  3. If you then integrate Databricks Unity Catalog again, we recommend excluding the databases that you registered via JDBC. You can do this via the Filters and Domain Mapping property in the Databricks Unity Catalog capability (classic UI) or integration configuration (latest UI). The integration:
    • Adds the new information from Databricks Unity Catalog.
    • Skips the assets that were registered via JDBC.
    • Marks any assets that were removed or excluded from the integration as Missing from source. You can remove them manually.

Important Use the same System asset for both integration and registration. Otherwise, assets will be duplicated.
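
The resynchronization behavior described above can be sketched as set operations. This is purely illustrative: the `resync` helper and the fully qualified asset names are assumptions for the example, not Collibra APIs; in practice, matching relies on both ways sharing the same System asset.

```python
def resync(previous_uc, jdbc_registered, current_uc):
    """Sketch of a Unity Catalog resynchronization (illustrative only).

    previous_uc: assets ingested by the previous integration run
    jdbc_registered: assets already registered via the JDBC connector
    current_uc: assets currently visible in Databricks Unity Catalog
    """
    skipped = current_uc & jdbc_registered          # already registered via JDBC
    added = current_uc - jdbc_registered - previous_uc  # new from Unity Catalog
    missing = previous_uc - current_uc              # marked Missing from source
    return added, skipped, missing

# Example with placeholder asset names:
added, skipped, missing = resync(
    previous_uc={"cat.schema.t1", "cat.schema.t2"},
    jdbc_registered={"cat.schema.t2"},
    current_uc={"cat.schema.t1", "cat.schema.t2", "cat.schema.t3"},
)
```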

Switching to working only with Databricks Unity Catalog integration (in preview)

If you previously used both the Databricks Unity Catalog integration and Databricks JDBC synchronization for some databases, and you now want to switch to using only the Databricks Unity Catalog integration, complete the following steps:

  1. Update the existing Databricks Unity Catalog synchronization capability:

    1. Go to the existing Databricks Unity Catalog synchronization capability.
    2. Click Edit.
    3. Add the existing Databricks JDBC connection to the JDBC Databricks Connection field.
    4. Click Save.
  2. Resynchronize the Databricks Unity Catalog integration.

If you previously set up sampling, profiling, and classification, and specified the JDBC connection in the Databricks Unity Catalog capability, you can now profile and classify the data, and retrieve sample data, for the assets that the Databricks Unity Catalog integration ingested. For the steps to set up sampling, profiling, and classification, go to Steps: Integrate Databricks Unity Catalog via Edge.

Helpful resources

For more information about Databricks, go to the Databricks documentation.