Ways to work with Databricks

Important

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Platform! You can learn more about this latest UI in the UI overview.

In Collibra Platform, you can work with Databricks in 2 ways.
You can integrate all metadata of the databases from Databricks Unity Catalog or you can register individual Databricks databases via the Databricks JDBC driver. It is important to understand the difference between these ways of working because the result in Collibra is different.

Integrating metadata from Databricks Unity Catalog

If you integrate Databricks Unity Catalog, you can integrate the metadata of all or multiple databases in the Databricks Unity Catalog metastore into Collibra Platform. The resulting assets represent the Databricks databases, schemas, tables and columns.

Important

This is the preferred way of working with Databricks Unity Catalog because it shows the hierarchy of the assets and allows you to set up sampling, profiling, and classification (in preview).

Example

You want to add 10 databases and profile the data.

Using only the Databricks JDBC connector:
- Create 10 JDBC connections.
- Add the required capabilities to each connection.
- Register and synchronize each database individually.
Using the Databricks Unity Catalog integration with profiling:
- Create 2 connections: one for the integration and one for JDBC.
- Add the required capabilities to the JDBC connection.
- Integrate Databricks Unity Catalog.
  The resynchronization for the databases is managed through the Databricks Unity Catalog capability and profiling is performed via the Database asset.

Tip

If you used a combination of integrating Databricks Unity Catalog and registering an individual Databricks database via the Databricks JDBC driver before, and you want to switch to using the integration only, go to Switching to working exclusively with the Databricks Unity Catalog integration (in preview).

Important

Because the integration retrieves only the metadata, you cannot get sample data for the columns and tables, nor profile and classify them. To do that, you need to register the Databricks database via the Databricks JDBC driver. For more information, go to Combining the integration and the JDBC driver.

Note

The Databricks Unity Catalog integration supports the following table types: EXTERNAL, MANAGED, STREAMING_TABLE, and VIEW.
You can integrate Databricks Unity Catalog only via Edge.

Tip

You can also integrate Databricks AI models. To do so, make sure to set up your connection permissions and define the required configuration. For the best AI integration experience and functionality, make sure to enable AI Governance. If you don't enable AI Governance, the AI integration functionality is limited.

Use the Databricks Unity Catalog connector if you want to integrate many databases at once and quickly, or if Databricks Unity Catalog is activated in your organization. With JDBC, you need to register the data database by database.

For more information, go to Steps overview: Integrating metadata from Databricks Unity Catalog.

Registering a Databricks data source via the Databricks JDBC connector

If you register a specific Databricks data source via the Databricks JDBC connector, the resulting assets represent the columns and the tables in the Databricks database.
You can also extend the setup to allow for the retrieval of sample data, profiling, and classification.

Use the JDBC driver for Databricks if you want to profile, classify, and request sample data for the data source.

For more information, go to Registering a Databricks data source via the Databricks JDBC connector.

Combining the ways of working with Databricks

The 2 possibilities don't cancel each other out. You can use both to show the information you want in Collibra Platform. For example, use the Databricks Unity Catalog integration to quickly get an overview of all your Databricks databases in Collibra Platform. Once you have a better view of the important databases, you can register them individually via the JDBC driver.

Combining the two ways of working with Databricks

You first register a Databricks data source via the Databricks JDBC connector.
If you then integrate Databricks Unity Catalog, the integration, using the same System asset:
- Skips the assets that have been registered via JDBC.
- Adds the new information from Databricks Unity Catalog.

You first integrate Databricks Unity Catalog.
If you then register a Databricks data source via the JDBC connector, the registration adds all data source assets.
This results in duplicate assets.
If you then integrate Databricks Unity Catalog again, we advise you to exclude the databases that you registered via JDBC. You can do this via the Filters and Domain Mapping property in the Databricks Unity Catalog capability (classic UI) or in the integration configuration (latest UI). The integration:
- Adds the new information from Databricks Unity Catalog.
- Skips the assets that have been registered via JDBC.
- If any assets were removed or excluded from the integration, they are marked as Missing from source. You can manually remove them.

Example

Your Databricks Unity Catalog consists of the three databases: A, B, C.

You integrate the metadata from Databricks Unity Catalog.
This results in the Database assets: A, B, C.
You want to register the metadata of database C to access the profiling and classification results.
The JDBC registration results in a Database asset C', with the same metadata as C.
You integrate the metadata again (because there have been updates).
- If you don't exclude database C, all databases will be updated, except for C'.
- If you exclude database C, database C will receive the "Missing from source" status, and you can manually remove these assets.

From that moment, you should:
- For A and B, use the Databricks Unity Catalog integration with exclude rules for C, to update the metadata.
- For C', use the synchronization via JDBC to update the metadata.
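
The example above can be modeled as a small simulation. This is only an illustrative sketch, not Collibra's implementation: the `Asset` class, the `source` labels, the default `Implemented` status, and the `resync_unity_catalog` function are all hypothetical names introduced here to mirror the described behavior.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    source: str  # hypothetical labels: "unity_catalog" or "jdbc"
    status: str = "Implemented"  # hypothetical default status


def resync_unity_catalog(assets, source_databases, excluded=frozenset()):
    """Simulate one re-integration and return the names of the updated assets."""
    updated = []
    for asset in assets:
        if asset.source != "unity_catalog":
            # Assets registered via JDBC are skipped by the integration.
            continue
        if asset.name in excluded or asset.name not in source_databases:
            # Removed or excluded assets are flagged, not deleted.
            asset.status = "Missing from source"
        else:
            updated.append(asset.name)
    return updated


def scenario():
    """Databases A, B, C from the integration plus C' registered via JDBC."""
    return [
        Asset("A", "unity_catalog"),
        Asset("B", "unity_catalog"),
        Asset("C", "unity_catalog"),
        Asset("C'", "jdbc"),
    ]


# Without an exclude rule, A, B, and C are updated and C' is left untouched.
assets = scenario()
print(resync_unity_catalog(assets, {"A", "B", "C"}))

# With C excluded, only A and B are updated and C is flagged for manual removal.
assets = scenario()
print(resync_unity_catalog(assets, {"A", "B", "C"}, excluded={"C"}))
print([(a.name, a.status) for a in assets])
```

In this model, C' is never touched by the integration, which is why it has to be kept up to date through the JDBC synchronization.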

Important

Use the same System asset for the integration and the registration. Otherwise, assets will be duplicated.

Switching to working exclusively with the Databricks Unity Catalog integration (in preview)

If you previously used both the Databricks Unity Catalog integration and the Databricks JDBC synchronization for some databases, and you want to switch to using only the Databricks Unity Catalog integration, complete the following steps:

Update the existing Databricks Unity Catalog synchronization capability.
1. Go to the existing Databricks Unity Catalog synchronization capability.
2. Click Edit.
3. Add the existing Databricks JDBC connection to the JDBC Databricks Connection field.
4. Click Save.

Re-synchronize the Databricks Unity Catalog integration.

On the main toolbar, click the menu icon → Catalog.
The Catalog homepage opens.
In the tab bar, click Integrations.
The Integrations page opens.
Click the Integration Configuration tab.
On the main toolbar, click the Create icon.
The Create dialog box appears.
In the Register with Edge section of the Create dialog box, click Register a data source or Integration Configuration, whichever is shown in your UI.
The Register content or Integration Configuration tab page opens.
Locate the Databricks connection that you used when you added the Databricks Unity Catalog capability and click the link in the Data sources/Capabilities column.
The synchronization configuration page opens.
In the Configuration Section or Synchronization Configuration section, click Add Configuration or, to change an existing configuration, click the Edit icon.
In Ingestion Type, select what you want to integrate.
The available options are: metadata, AI models, and metadata and AI models.
Depending on your selection, extra fields appear. Your selection will also impact the integrated Databricks Unity Catalog data.

Complete the fields as needed.

Field: System
Available if you integrate: Metadata
Action: In System, select the System asset to which you want to link the Databricks assets.

Field: Default Asset Status (Deprecated)
Available if you integrate: Metadata, AI models
Action: In Default Asset Status, select how you want to set the status of the synchronized assets. The possible values are:
- Implemented: All assets receive the Implemented status.
- No Status: Newly created assets receive the first status listed in your Operating Model statuses, and existing assets keep their assigned status.
This field is deprecated and will be removed in the future. You can now define the default status in the capability configuration. Ensure that the value in this field matches the one in the capability configuration, as this field still takes precedence over the capability value.

Field: Domain Include Mappings
Available if you integrate: Metadata
Action: Optionally, in Domain Include Mappings, specify which databases and schemas you want to integrate and, optionally, the Collibra domains where they need to be added. You can use this field to limit the databases and schemas you integrate and to define where they need to be added.
Important
- If you don't define any include mappings, the integration automatically creates new domains for each Database and Schema asset in the same community as the System asset. For more information, go to Integrated Databricks Unity Catalog data.
- If you include a path but don't define a domain, the integration automatically creates new domains in the same community as the System asset.
- If you add a domain include mapping for the database but not for a related schema, the automatically created domain for the schema is added in the same community as the domain of the database.
- A match with a schema has priority over a match with a database.
Steps to add a domain include mapping:
1. Click Add Domain Include Mappings.
2. In Path, add the path to the databases and schemas in Databricks Unity Catalog for which you want to integrate the metadata.
   Tip: You can use the ? and * wildcards in the catalog and schema names. If a catalog or schema matches multiple lines, the most detailed match is taken into account.
3. Optionally, in Domain, select the Collibra domain in which you want to integrate the metadata. If you don't define a domain, the integration automatically creates new domains in the same community as the System asset.
Examples:
- Path `Orders` and domain `Domain B`: The Orders Database asset and all its related assets are integrated in Domain B.
- Path `Orders > fk*` and domain `Domain B`: The Orders Database asset is integrated in the same domain as the System asset. All schemas that start with fk and their related assets are integrated in Domain B.
- Path `Orders > *` and domain `Domain B`: The Orders Database asset is integrated in the same domain as the System asset. All schemas in the Orders catalog and their related assets are integrated in Domain B.
Full scenario: You have a database Orders that includes multiple schemas.
If you want to make sure that the Orders database and related schemas are added to Domain B, add the following include mappings:
- Path `Orders` and domain `Domain B`, to make sure the Database asset is added to Domain B.
- Path `Orders > *` and domain `Domain B`, to make sure all Schema assets in Orders are added to Domain B.
If you want to make sure that the Orders database and related schemas are added to Domain B, except for the schemas that start with test_, add the following include mappings:
- Path `Orders` and domain `Domain B`, to make sure the Database asset is added to Domain B.
- Path `Orders > test_*` and domain `Domain C`, to make sure that all schemas in Orders that start with test_ are added to Domain C.
- Path `Orders > *` and domain `Domain B`, to make sure all other Schema assets in Orders are added to Domain B.

Field: Domain Exclude Mappings
Available if you integrate: Metadata
Action: Optionally, in Domain Exclude Mappings, specify the path to databases and schemas in Databricks Unity Catalog that you don't want to integrate.
Note: The exclude mapping has priority over the include mapping.
Steps to add a domain exclude mapping:
1. Click Add Domain Exclude Mappings.
2. In the field, add the path to the databases and schemas in Databricks Unity Catalog that you want to exclude.
   Tip: You can use the ? and * wildcards in the catalog and schema names. For example: `* > test`.

Field: Extensible Properties Mappings
Available if you integrate: Metadata
Action: Databricks Unity Catalog allows you to add additional properties to Catalog, Schema, and Table objects. Optionally, in Extensible Properties Mappings, specify which additional default system properties or custom properties you want to integrate from Databricks Unity Catalog into Collibra. You can integrate most values from the Details page of Catalog, Schema, Table, and View objects into specific attributes in Collibra assets. You do this by adding the mapping between the fields for the objects in Databricks Unity Catalog and the Collibra attribute.
Important: If you use this feature, make sure to add any custom attributes/characteristics, as needed, to the asset type assignment.
The name of the property starts with the object type, for example `catalogs.systemAttributes.metastore_id`. `catalogs` refers to Database assets, `schemas` to Schema assets, `table` to Table assets, and `views` to Database View assets.
The following system properties are supported:
- Catalogs: "browse_only", "catalog_type", "connection_name", "created_at", "created_by", "isolation_mode", "metastore_id", "provider_name", "provisioning_info", "securable_kind", "securable_type", "share_name", "storage_location", "storage_root", "updated_at", and "updated_by".
- Schemas: "catalog_type", "created_at", "created_by", "metastore_id", "securable_type", "securable_kind", "storage_location", "storage_root", "updated_at", and "updated_by".
- Table: "access_point", "catalog_name", "created_at", "created_by", "data_access_configuration_id", "data_source_format", "deleted_at", "metastore_id", "schema_name", "securable_type", "securable_kind", "sql_path", "storage_credential_name", "storage_location", "table_type", "updated_at", "updated_by", and "view_definition".
- Views: "access_point", "catalog_name", "created_at", "created_by", "data_access_configuration_id", "data_source_format", "deleted_at", "metastore_id", "schema_name", "securable_type", "securable_kind", "sql_path", "storage_credential_name", "storage_location", "table_type", "updated_at", "updated_by", and "view_definition".
Steps to add a property mapping:
1. Click Add Another Mapping.
2. In Property Name, do one of the following:
   - To add a system attribute, select the Databricks Unity Catalog property name from the drop-down list.
   - To add a custom attribute, type the name of the custom property manually. Use the following naming convention: `[object type].customParameters.[name of parameter]`. For example:
     `catalogs.customParameters.Parameter1`
     `schemas.customParameters.catalogAndNamespace.part.1`
     `table.customParameters.view.catalogAndNamespace.part.2`
     `views.customParameters.Parameter2`
3. In Attribute, select the attribute in which you want to see the value.

Field: Stop Compute Resource
Available if you integrate: Metadata
Action: This field is important if the Compute Resource HTTP Path field was completed in the Databricks capability to allow for source tag integration.
- If this field is set to Yes, the compute resource in Databricks Unity Catalog is stopped right after the source tags are extracted.
- If this field is set to No, the compute resource remains active.
Tip: To prevent clusters from running for the entire synchronization duration, you can also configure the Terminate after ... minutes of inactivity setting in Databricks. The setting ensures that clusters automatically stop after a period of inactivity. For more information, go to the Databricks documentation.

Field: Domain
Available if you integrate: AI models
Action: In Domain, select the domain in which you want to add the Databricks AI Model assets.

Field: Custom AI Metrics Mappings
Available if you integrate: AI models
Action: Optionally, in Custom AI Metrics Mappings, define which custom Databricks AI Model metrics you want to integrate. You do this by adding the mapping between the custom metric and the Collibra attribute. For an overview of the out-of-the-box metrics that are integrated by default, go to Integrated Databricks Unity Catalog data.
Important: If you use this feature, make sure to add any custom attributes/characteristics, as needed, to the asset type assignment.
Steps to add a custom AI metrics mapping:
1. Click Add Custom AI Metrics Mappings.
2. In Metric, type the name of the custom metric manually. Use the exact name as in Databricks Unity Catalog.
3. In Attribute, select the attribute in which you want to see the value. Make sure to select an attribute that is included in the Databricks AI Model asset type assignment.

Field: Exclude system AI Models
Available if you integrate: AI models
Action: Optionally, in Exclude system AI models, indicate that you don't want to integrate the pretrained Databricks AI models. By default, No is selected, and all accessible AI models are integrated. If you select Yes, the AI models in the "system" Databricks catalog are excluded from the integration. For more information about these pretrained Databricks AI models, go to the Databricks documentation.
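
The property naming convention used in Extensible Properties Mappings can be checked with a few lines of code before you type a name into the Property Name field. The following is a minimal sketch, assuming only the convention described above; the `parse_property_name` helper and the `OBJECT_TYPES` and `KINDS` tables are hypothetical and not part of any Collibra or Databricks API.

```python
# Object type prefixes and the Collibra asset types they refer to,
# as listed in the Extensible Properties Mappings description.
OBJECT_TYPES = {
    "catalogs": "Database",
    "schemas": "Schema",
    "table": "Table",
    "views": "Database View",
}
KINDS = {"systemAttributes", "customParameters"}


def parse_property_name(name):
    """Split '[object type].[kind].[property]' into its three parts.

    Returns (asset type, kind, property name) and raises ValueError
    if the name does not follow the convention.
    """
    # Split on the first two dots only: custom names may contain dots.
    parts = name.split(".", 2)
    if len(parts) != 3 or not parts[2]:
        raise ValueError(f"expected '[object type].[kind].[property]': {name!r}")
    object_type, kind, prop = parts
    if object_type not in OBJECT_TYPES:
        raise ValueError(f"unknown object type: {object_type!r}")
    if kind not in KINDS:
        raise ValueError(f"unknown property kind: {kind!r}")
    return OBJECT_TYPES[object_type], kind, prop


print(parse_property_name("catalogs.systemAttributes.metastore_id"))
print(parse_property_name("schemas.customParameters.catalogAndNamespace.part.1"))
```

Splitting on the first two dots only is a deliberate choice, because custom parameter names such as `catalogAndNamespace.part.1` contain dots themselves.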

Important

When you integrate a data source without applying Include or Exclude Mappings rules, and then later exclude a registered asset using an Include or Exclude Mapping during resynchronization, the related assets will receive the Missing from Source status.
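
To reason about which domain a schema ends up in, or whether it is dropped from the integration, the include and exclude mapping rules can be sketched in code. This is a simplified model under stated assumptions, not Collibra's matching logic: "most detailed match" is approximated here by counting non-wildcard characters, only schema-level resolution is modeled, and Python's `fnmatch` also treats `[...]` as a pattern, which the integration may not. The names `resolve_schema_domain`, `AUTO`, and the helper functions are hypothetical.

```python
from fnmatch import fnmatchcase

AUTO = "<auto-created domain>"  # placeholder for an automatically created domain


def split_path(path):
    """Split 'catalog > schema' into (catalog pattern, schema pattern or None)."""
    parts = [p.strip() for p in path.split(">")]
    return parts[0], (parts[1] if len(parts) > 1 else None)


def specificity(pattern):
    """Assumed measure of detail: number of non-wildcard characters."""
    return sum(1 for ch in pattern if ch not in "*?")


def matches(path, catalog, schema):
    cat_pat, schema_pat = split_path(path)
    if not fnmatchcase(catalog, cat_pat):
        return False
    # A database-level path also covers the schemas of that database.
    return schema_pat is None or fnmatchcase(schema, schema_pat)


def resolve_schema_domain(catalog, schema, includes, excludes):
    """Return the domain for a schema, AUTO, or None if it is not integrated."""
    # The exclude mapping has priority over the include mapping.
    if any(matches(p, catalog, schema) for p in excludes):
        return None
    if not includes:
        return AUTO  # no include mappings: everything is integrated
    candidates = [(p, d) for p, d in includes if matches(p, catalog, schema)]
    if not candidates:
        return None  # include mappings limit what is integrated

    def rank(item):
        cat_pat, schema_pat = split_path(item[0])
        # A schema match beats a database match; then prefer the most detail.
        return (schema_pat is not None, specificity(schema_pat or ""), specificity(cat_pat))

    _, domain = max(candidates, key=rank)
    return domain or AUTO


includes = [
    ("Orders", "Domain B"),
    ("Orders > test_*", "Domain C"),
    ("Orders > *", "Domain B"),
]
print(resolve_schema_domain("Orders", "test_1", includes, []))  # Domain C
print(resolve_schema_domain("Orders", "sales", includes, []))   # Domain B
print(resolve_schema_domain("Orders", "test", includes, ["* > test"]))  # None
```

A schema for which the function returns `None` on a later run, after it was integrated earlier, corresponds to the situation in which the related assets receive the Missing from Source status.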

Click Save Configuration.
Click Synchronize.
A notification indicates the synchronization has started.

If you previously set up sampling, profiling, and classification, and you added the JDBC connection to the Databricks Unity Catalog capability, you can now profile and classify the data and get sample data for the assets integrated by the Databricks Unity Catalog integration. For the steps to set up sampling, profiling, and classification, go to Steps: Integrate Databricks Unity Catalog via Edge.

Helpful resources

For more information about Databricks, go to the Databricks documentation.