Example | Integrating Databricks Unity Catalog via Edge on k3s

In a modern data lakehouse environment, governance must be as agile as the data itself. This guide provides a framework for ingesting Databricks Unity Catalog assets into Collibra via an Edge site, ensuring that your business users can discover and trust technical assets in a governed, enterprise-ready catalog.

As our global logistics footprint expands, the ability for analysts to quickly discover and trust warehouse data is paramount. This guide outlines how to establish a production-grade link between Databricks Unity Catalog and Collibra. Edge provides the building blocks for establishing this link in a secure, streamlined manner.

Scenario

You want to install your Edge site on bundled k3s. You've checked with your internal teams and know that your organization runs Red Hat Enterprise Linux (RHEL) version 9.3 on AWS and wants SELinux enabled. Once your Edge site is created and installed, you want to set up an integration with Databricks Unity Catalog.

By using a service principal on a bundled k3s Edge site, we ensure that the integration is built on a resilient, high-availability infrastructure. This setup allows the Logistics team, for example, to access "Gold-tier" inventory data without compromising security or relying on individual user accounts.

We'll walk you through the steps to establish a production-ready metadata ingestion process. You can follow along to learn how to prepare your Edge site and integrate Databricks Unity Catalog with Collibra.

  1. Check and complete the prerequisites.
  2. Create an Edge site.
  3. Install your Edge site and confirm it's healthy.
  4. Create a connection to Databricks Unity Catalog.
  5. Add a Databricks Unity Catalog synchronization capability to your connection.

This guide doesn’t cover the use of a forward proxy. If you need one, you must configure it before you install your Edge site.

Prerequisites

On your local machine

You can confirm, or you know who in your organization can confirm, that your server meets all of the system requirements below.

Server

  • You want to install the Edge software on Red Hat Enterprise Linux (RHEL) version 9.3.
  • The sudo package is installed on the Linux host.
  • You have full sudo access (ALL=(ALL) ALL).
  • You have installed the following policy package to enable SELinux:
    yum install -y https://github.com/k3s-io/k3s-selinux/releases/download/v1.6.latest.1/k3s-selinux-1.6-1.el9.noarch.rpm
  • Your Virtual Machine has at least:
    • 64 GB memory.
    • 16-core CPU with x86_64 architecture. 
    • 50 GB of free storage on the partition that contains /var/lib/rancher/k3s.
      mkdir -p /var/lib/rancher/k3s
      mkfs.xfs /dev/<block-device-name>
      mount /dev/<block-device-name> /var/lib/rancher/k3s
      echo '/dev/<block-device-name> /var/lib/rancher/k3s xfs defaults 0 0' >> /etc/fstab
    • 5 GB of free storage on the partition that contains /var/log.
    • 200 GB of free storage on the partition that holds /var/lib/kubelet.
    • Because you run your Linux server on AWS, you need to disable the nm-cloud-setup.service and nm-cloud-setup.timer units and then reboot the server.
      systemctl disable nm-cloud-setup.service nm-cloud-setup.timer 
      reboot
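
The sizing requirements above can be checked with a short script before you start the installation. The following is a minimal sketch, not part of the Edge installer: the function names and the OK/FAIL output format are assumptions made for illustration, and the thresholds are copied from the list above.

```shell
#!/bin/sh
# Preflight sketch for the server requirements listed above.
# Function names and the OK/FAIL output format are illustrative assumptions.

# Succeeds when integer $1 is greater than or equal to integer $2.
at_least() {
    [ "$1" -ge "$2" ]
}

# Prints one OK/FAIL line: $1 = label, $2 = measured value, $3 = required minimum.
report() {
    if at_least "$2" "$3"; then
        echo "OK   $1: $2 (need $3)"
    else
        echo "FAIL $1: $2 (need $3)"
    fi
}

# Walks up from $1 to the closest path that already exists, so that df can be
# pointed at a directory such as /var/lib/rancher/k3s before it is created.
nearest_existing() {
    p=$1
    while [ ! -e "$p" ] && [ "$p" != "/" ]; do
        p=$(dirname "$p")
    done
    echo "$p"
}

# Free space, in whole GB, on the filesystem that holds (or will hold) $1.
free_gb() {
    df -Pk "$(nearest_existing "$1")" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Total memory in whole GB. The kernel reserves some memory, so a nominal
# 64 GB machine can report slightly less; treat a narrow FAIL as a prompt
# to check manually.
mem_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Runs every check; call this on the target host.
preflight() {
    report "memory (GB)" "$(mem_gb)" 64
    report "CPU cores" "$(nproc)" 16
    echo "architecture: $(uname -m) (need x86_64)"
    report "free GB for /var/lib/rancher/k3s" "$(free_gb /var/lib/rancher/k3s)" 50
    report "free GB for /var/log" "$(free_gb /var/log)" 5
    report "free GB for /var/lib/kubelet" "$(free_gb /var/lib/kubelet)" 200
}
```

Source the file on the target host and run `preflight`. If several of the listed paths share one partition, the same free space is reported for each of them, so add up the requirements yourself in that case.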

Network

  • An Edge site needs outbound connections to all of the following:
    • The URL of your Collibra Platform environment.
    • https://*.repository.collibra.io: This URL serves as the primary source for downloading the latest Edge Docker images from Collibra's Docker registry and Helm chart repository.
      Note: If your allowlist does not accept wildcards, add the following URLs instead:
      • https://repository.collibra.io
      • https://edge-docker-delivery.repository.collibra.io
      • https://mirror-docker.repository.collibra.io
  • Access to all data sources you need to connect to your Edge sites.
  • Your Edge site has to be able to connect to port 443.
  • Set the Linux system value for IP forwarding to 1: net.ipv4.ip_forward=1
    Note: If IP forwarding is turned off (net.ipv4.ip_forward=0), your Edge site may become unhealthy. Follow the steps in this Support article to turn IP forwarding on.
  • The resolver configuration file of your Linux host (typically /etc/resolv.conf) lists at most three search domains and two name servers.
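
Several of the network requirements above can be verified from the command line. The sketch below is illustrative only: the function names are assumptions, the file locations default to the standard Linux paths, and the limits are the ones stated in the list above.

```shell
#!/bin/sh
# Network prerequisite sketch. Function names are illustrative assumptions;
# file paths default to the standard Linux locations.

# Succeeds when IP forwarding is on (the file contains 1).
ip_forward_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}")" = "1" ]
}

# Number of nameserver entries in a resolver configuration file.
count_nameservers() {
    awk '$1 == "nameserver" { n++ } END { print n + 0 }' "${1:-/etc/resolv.conf}"
}

# Number of search domains; when several "search" lines exist, the last one wins.
count_search_domains() {
    awk '$1 == "search" { n = NF - 1 } END { print n + 0 }' "${1:-/etc/resolv.conf}"
}

# Succeeds when the file stays within three search domains and two name servers.
resolver_within_limits() {
    [ "$(count_search_domains "$1")" -le 3 ] && [ "$(count_nameservers "$1")" -le 2 ]
}

# Succeeds when a connection to URL $1 can be established within 10 seconds.
# Any HTTP status counts as reachable; only connection failures are errors.
can_reach() {
    curl --silent --output /dev/null --connect-timeout 10 "$1"
}

# Usage on the Edge host:
#   ip_forward_enabled || echo "net.ipv4.ip_forward is not 1"
#   resolver_within_limits || echo "too many search domains or name servers"
#   can_reach https://repository.collibra.io || echo "repository unreachable"
```

Each function accepts an optional file path so that you can test it against a copy of the configuration instead of the live file.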

Within Collibra

  • To create and install your Edge site:
    • You have enabled database registration via Edge in Collibra Console.

      Note: You must restart the Data Governance Center service after you enable this option.

    • You have a global role or roles that have the following global permissions:
      • Manage Edge sites
      • Install Edge sites
      • User Administration
  • To create a Databricks Unity Catalog connection to Edge:
    • You created and installed an Edge site.
    • You have a global role that has the Manage connections and capabilities global permission, for example, Edge integration engineer.
  • To add the Databricks Unity Catalog capability:
    • You have created a connection to Databricks in your Edge site.
    • You have a global role that has the Manage connections and capabilities global permission, for example, Edge integration engineer.
    • You have a resource role with the Configure external system resource permission, for example, Owner.
    • You have a global role with the Catalog global permission, for example, Catalog Author.
    • You have a global role with the View Edge connections and capabilities global permission, for example, Edge integration engineer.

Within Databricks

  • Your Databricks service principal must have the BROWSE permission on the Databricks Unity Catalog catalogs from which you want to integrate metadata.
  • If you want to integrate source tags, additional permissions are needed.
    • The metadata synchronization for Databricks Unity Catalog uses compute clusters (SQL query compute warehouse) to collect source tags. To allow this, grant the following permissions:
      • CAN ATTACH TO
      • CAN RESTART
      During the synchronization configuration, you can define that the compute clusters must stop after the source tags are extracted.
    • To integrate source tags from specific tables in system.information_schema, grant the following permissions:
      • USE CATALOG permission on system catalog
      • USE SCHEMA permission on system.information_schema
      • SELECT permission on the following:
        • system.information_schema.catalog_tags
        • system.information_schema.schema_tags
        • system.information_schema.table_tags
        • system.information_schema.column_tags
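
The catalog and schema permissions above can be expressed as Databricks SQL GRANT statements. The following sketch only prints the statements so that a Databricks administrator can review and run them. It assumes that the service principal is referenced by its application (client) ID and that you pass the name of the catalog you want to ingest; the function name is hypothetical. The compute permissions (CAN ATTACH TO, CAN RESTART) are left out.

```shell
#!/bin/sh
# Prints Databricks SQL GRANT statements for the permissions listed above.
# Assumptions: the principal is referenced by its application (client) ID,
# and the catalog name is whatever you plan to ingest.

# $1 = service principal application ID, $2 = catalog to ingest
print_grants() {
    printf 'GRANT BROWSE ON CATALOG `%s` TO `%s`;\n' "$2" "$1"
    printf 'GRANT USE CATALOG ON CATALOG system TO `%s`;\n' "$1"
    printf 'GRANT USE SCHEMA ON SCHEMA system.information_schema TO `%s`;\n' "$1"
    for t in catalog_tags schema_tags table_tags column_tags; do
        printf 'GRANT SELECT ON TABLE system.information_schema.%s TO `%s`;\n' "$t" "$1"
    done
}
```

For example, `print_grants <application-id> Inventory_Gold` prints seven statements: one BROWSE grant for the catalog and six grants for source tag extraction. Skip the last six if you do not integrate source tags.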

Create your Edge site

First, we need to create the Edge site in Collibra.

  1. On the main toolbar, click the Products icon, then click Settings (cogwheel icon).
    The Settings page opens.
  2. Click Edge.
    The Sites overview opens.
  3. Above the table, to the right, click Create Site.
    The Create Edge site wizard appears.
  4. Enter the required information.
    1. Site name: k3s Edge site.
    2. Description: Our bundled k3s Edge site.
  5. Select the Upgrade Mode for this Edge site.
  6. Click Create Site.
    The Edge sites overview appears, including the new Edge site with the status To be installed.

Install your Edge site

Once the Edge site is created in Collibra, we can install it on the server. We will use the bundled k3s method, because Edge then automatically handles the required Kubernetes version and cluster-level objects for you. For an Edge administrator, this method reduces manual configuration and ensures that your Edge site meets the cluster-level requirements.

  1. Download the installer:
    1. Open a site.
      1. On the main toolbar, click the Products icon, then click Settings (cogwheel icon).
        The Settings page opens.
      2. In the tab pane, click Edge.
        The Sites tab opens and shows a table with an overview of your sites.
      3. In the site overview, click the name of a site.
        The site page appears.
    2. Click Download Installer.
      An Edge user is created in Collibra.
  2. Extract the TGZ archive on the server on which you want to install the Edge site software and ensure the directory is not mounted as noexec.
    tar -xf <edge-site-id>-installer.tgz
  3. From inside the extracted TGZ archive directory, run the k3s installer script:
    sudo sh install-master.sh -r registries.yaml
    In the Edge sites overview, you can see the status of the deployment.
  4. Run the following commands to verify the status of the installation.
    • To ensure that Kubernetes is running and that there is an existing node:
      sudo /usr/local/bin/kubectl get nodes
    • To confirm that all pods are installed and running:
      sudo /usr/local/bin/kubectl get pods --all-namespaces
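
Reading the kubectl output by eye works for a single check, but a small filter makes problems easier to spot. The sketch below is one possible approach: the function names are assumptions, and the filters rely on the default column layout of `kubectl get` with the `--no-headers` flag. A third helper covers the noexec check from the extraction step.

```shell
#!/bin/sh
# Filters for kubectl output; they read from standard input so they can be
# tested without a cluster. Function names are illustrative assumptions.

# Prints the name of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}

# Prints namespace/name and status of every pod that is neither Running
# nor Completed.
unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Succeeds when a comma-separated mount option string contains noexec.
has_noexec() {
    case ",$1," in
        *,noexec,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the Edge host:
#   sudo /usr/local/bin/kubectl get nodes --no-headers | not_ready_nodes
#   sudo /usr/local/bin/kubectl get pods --all-namespaces --no-headers | unhealthy_pods
#   has_noexec "$(findmnt -no OPTIONS --target .)" && echo "directory is mounted noexec"
```

Empty output from both filters means that no node or pod was flagged. A pod in the Running state whose containers are not all ready is not flagged, so also look at the READY column if the site stays unhealthy.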

Integrate Databricks Unity Catalog

In this section, we’re going to create an integration between Databricks Unity Catalog and Collibra. This integration is made up of two parts:

  • An Edge site connection.
  • An Edge site capability.

The connection sets up the communication between Databricks Unity Catalog and Collibra through Edge. This connection contains the parameters needed to successfully connect to Databricks, so that the capability can request information. The capability uses the connection to communicate with Databricks Unity Catalog and requests information, which it then sends back to Collibra. You can add more than one capability to a connection.

Create a Databricks Unity Catalog connection

To support the "Global Logistics Dashboard", we need a persistent connection to the Databricks Unity Catalog production environment. We will use a service principal (a system account) to ensure that the integration remains active regardless of individual staff transitions or password rotations. We'll use example values to demonstrate how you can complete the required fields.

  1. Open a site.
    1. On the main toolbar, click the Products icon, then click Settings (cogwheel icon).
      The Settings page opens.
    2. In the tab pane, click Edge.
      The Sites tab opens and shows a table with an overview of your sites.
    3. In the table, click the name of the site whose status is Healthy.
      The site page opens.
  2. In the Connections section, click Create connection.
  3. Select Databricks Workspace to connect to your Databricks workspace.
    The Create connection page appears.
  4. Enter the required information.
    • Name: Databricks_Logistics_Production
    • Description: Production connection for supply chain metadata using service principal authentication.
    • Workspace URL: https://supplychain.cloud.databricks.com
    • Authentication Type: Microsoft Entra ID
    • Client ID: f47ac10b-xxxx-xxxx-xxxx-0e02b2c3d479
    • Client Secret: ••••••••••••••••
    • Tenant ID: 99bc1234-xxxx-xxxx-xxxx-1234567890ab
  5. Click Create.
    The connection is added to the Edge or Collibra Cloud site.

Collibra validates the credentials when synchronizing Databricks Unity Catalog.

Add a Databricks Unity Catalog capability

Now that the secure bridge is built, we will configure the capability. This is the logic that allows your Edge site to interact with the Databricks Unity Catalog metastore. Our goal is to prepare for the ingestion of the "Inventory_Gold" catalog, ensuring that any Databricks tags (like "Proprietary" or "Region: EMEA") are eventually discoverable in Collibra.

In this section of the Add capability screen, you will encounter several advanced and legacy fields. For this supply chain use case, we will skip the deprecated items to ensure that the integration remains future-proof. Once again, we'll use example values to demonstrate how you can complete the required fields.

  1. Open a site.
    1. On the main toolbar, click the Products icon, then click Settings (cogwheel icon).
      The Settings page opens.
    2. In the tab pane, click Edge.
      The Sites tab opens and shows a table with an overview of your sites.
    3. In the table, click the name of the site whose status is Healthy.
      The site page opens.
  2. In the Capabilities section, click Add capability.
    The Add capability page appears.
  3. Select the Databricks Unity Catalog synchronization capability template.
  4. Enter the required information.
    • Name: Inventory_Metadata_Sync
    • Description: Metadata synchronization engine for the Gold-tier Inventory catalog.
    • Databricks Connection: Databricks_Logistics_Production. We'll use the same connection created in the previous step.
    • JDBC Databricks Connection: Leave blank.
    • Create additional capabilities automatically (in preview): Do not create additional capabilities
    • Save input metadata: Leave unchecked.
    • Exclude Schemas (will be removed soon, use domain mapping instead): Leave blank.
    • (deprecated) Filters and Domain Mapping: Leave blank.
    • (deprecated) Extensible Properties Mapping: Leave blank.
    • Compute Resource HTTP Path: /sql/1.0/warehouses/a1b2c3d4e5f6g7h8
    • Default Asset Status: Candidate
    • Logging configuration: Leave blank.
    • Memory: Leave blank.
    • JVM arguments: Leave blank.
    • Debug: false
    • Log level: Leave blank.
  5. Click Create.
    The capability is added to the Edge or Collibra Cloud site.
    The fields become read-only.
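
Because the fields become read-only after creation, it can help to sanity-check the Compute Resource HTTP Path before you paste it. The sketch below accepts only the SQL warehouse form used in the example value above (/sql/1.0/warehouses/<id>). The function name is hypothetical, and the check says nothing about whether the warehouse exists or whether other path forms are accepted.

```shell
#!/bin/sh
# Extracts the warehouse ID from an HTTP path of the form used in the
# example above (/sql/1.0/warehouses/<id>). Prints nothing and fails for
# any other shape. The function name is an illustrative assumption.
warehouse_id() {
    case "$1" in
        /sql/1.0/warehouses/?*)
            id=${1#/sql/1.0/warehouses/}
            case "$id" in
                */*) return 1 ;;   # extra path segments after the ID
                *) echo "$id" ;;
            esac
            ;;
        *) return 1 ;;             # missing leading slash, full URL, or other form
    esac
}
```

For example, `warehouse_id /sql/1.0/warehouses/a1b2c3d4e5f6g7h8` prints the ID, while a full URL or a path without the leading slash makes the function fail.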

The infrastructure for your Databricks Unity Catalog integration is now complete. You have successfully moved from a raw data environment to a secure, governed foundation.

What's next

You can now synchronize Databricks Unity Catalog.