Add the Databricks Unity Catalog capability

After you have created a connection to Databricks in your Edge site, you have to add the Databricks Unity Catalog capability to the connection.

Before you start

Required permissions

Steps

  1. Open an Edge site.
    1. On the main toolbar, click the Products icon, and then click Settings (cogwheel icon).
      The Collibra settings page opens.
    2. In the tab pane, click Edge.
      The Sites tab opens and shows a table with an overview of the Edge sites.
    3. In the table, click the name of the Edge site whose status is Healthy.
      The Edge site page opens.
  2. In the Capabilities section, click Add capability.
    The Add capability page appears.
  3. Select the Databricks Unity Catalog synchronization capability template.
  4. Enter the required information.
    Field | Description | Required

    Capability

    This section contains general information about the capability.

    Name

    The name of the Edge capability.

    Yes

    Description

    The description of the Edge capability.

    No

    Capability template

    The capability template. The value that you select in this field determines which sections appear on the page.

    Select the following Edge capability:

    Databricks Unity Catalog synchronization

    Yes

    Databricks Connection

     
    Databricks Connection
    The Databricks connection to be used.

    Yes

    Configuration

    This section contains information on how to connect to Databricks Unity Catalog. 
    Save input metadata
    If you select this option, the metadata extracted from the data source is saved in a file that can be used for troubleshooting. Select this option only at the request of Collibra Support.

    No

    Exclude Schemas (will be removed soon)

    Comma-separated list of the schemas that you don't want to integrate.

    Note The listed schemas will be excluded for all databases. We recommend using this field to list the schemas that are automatically generated in a database, such as information_schema and default, and which you don't want to integrate.

    No

    Filters and Domain Mapping

    Text in JSON format to include or exclude databases and schemas, and to configure domain mappings.
    This feature is in beta and is currently intended for non-production use only.

    • The text must be in JSON format and can contain an include and an exclude block. You can use any JSON validator to verify the format. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such JSON validators, and has no liability for such use.
    • In the include block, you can specify the domain in which specific catalogs or schemas must be ingested. The format is: "Catalog/Database > schema": "domain ID". For example, "HR > address-schema": "30000000-0000-0000-0000-000000000000".
    • In the exclude block, you can specify the catalogs or schemas that you don't want to ingest. For example, "* > test".
    • The exclude block has priority over the include block.
    • If the include block is not present, we ingest all assets into the same domain as the System asset.
    • If there is no explicit domain mapping for a schema, we use the domain specified for the database.
    • You can use the keyword default as a domain ID. In that case, the catalog or schema will be ingested in the same domain as the System asset.
    • A match with a database has priority over a match with a schema.
    • The integration fails before the synchronization starts if one or more domain IDs specified in the include block don't exist.
    • The integration fails before the synchronization starts if a domain ID is left empty in the include block.
    • You can use the ? and * wildcards in the catalog and schema names. If a catalog or schema matches multiple lines, the most detailed match is taken into account.
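
The structural rules above can be checked locally before you paste the text into the field. The following Python sketch is illustrative only and is not part of Collibra. The function name validate_filters is hypothetical, and the assumption that a domain ID is either the keyword default or a UUID is inferred from the examples on this page. The sketch cannot verify that a domain ID actually exists in your Collibra environment.

```python
import json
import uuid


def validate_filters(text):
    """Return a list of problems found in a Filters and Domain Mapping value."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(config, dict):
        return ["the top level must be a JSON object"]

    problems = []
    include = config.get("include", {})
    exclude = config.get("exclude", [])

    if not isinstance(include, dict):
        problems.append("include must be an object of pattern: domain ID pairs")
        include = {}
    if not isinstance(exclude, list):
        problems.append("exclude must be a list of patterns")

    for pattern, domain_id in include.items():
        # An empty domain ID makes the integration fail before it starts.
        if not isinstance(domain_id, str) or not domain_id.strip():
            problems.append(f"{pattern!r}: domain ID is empty")
        elif domain_id != "default":
            # Assumption: domain IDs are UUIDs, as in the examples on this page.
            try:
                uuid.UUID(domain_id)
            except ValueError:
                problems.append(f"{pattern!r}: {domain_id!r} is not a UUID or 'default'")
    return problems


# Hypothetical sample value with one empty domain ID.
sample = '{"include": {"HR": "", "sales": "default"}, "exclude": ["testDB"]}'
for problem in validate_filters(sample):
    print(problem)
```

An empty result list means only that the text is well formed according to these assumptions. Collibra performs its own checks when the synchronization starts.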
    Example 
    {
    "include": {
    "HR": "20000000-0000-0000-0000-000000000000",
    "HR > address-schema": "30000000-0000-0000-0000-000000000000",
    "Orders > fk*": "40000000-0000-0000-0000-000000000000",
    "Orders > *": "50000000-0000-0000-0000-000000000000",
    "* > profiling": "60000000-0000-0000-0000-000000000000",
    "sales": "default"

    },
    "exclude": [
    "testDB",
    "* > information_schema"
    ]
    }

    In this example:

    • Assets from the "HR" database will be ingested into the domain with ID "20000000-0000-0000-0000-000000000000". However, all assets from the "HR > address-schema" schema will be ingested into the domain with ID "30000000-0000-0000-0000-000000000000".
    • All assets from schemas in the "Orders" database whose names start with fk (fk*) will be ingested into the domain with ID "40000000-0000-0000-0000-000000000000", and all other assets from the "Orders" database will be ingested into the domain with ID "50000000-0000-0000-0000-000000000000".
    • All assets from the "sales" database will be ingested in the same domain as the System asset.
    • Assets from the "profiling" schema will be ingested into the domain with ID "60000000-0000-0000-0000-000000000000". However, the "profiling" schema in the database "Orders" will be ingested in the domain with ID "50000000-0000-0000-0000-000000000000" because a database match has priority over a schema match.
    • All assets from the "testDB" database will be excluded.
    • All assets from the "information_schema" schema in all databases will be excluded.
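
One way to reason about which domain a schema ends up in is to model the rules in a few lines of Python. This is a sketch of one possible reading of the rules, not Collibra's implementation. The scoring used to pick the most detailed match (literal characters in the database part first, then in the schema part), the case-sensitive matching, and the choice to skip schemas that match no include line are all assumptions, chosen because they reproduce the outcomes described in this example. The names EXAMPLE and resolve_domain are hypothetical.

```python
import re

DEFAULT = "default"  # keyword from the text: same domain as the System asset

EXAMPLE = {
    "include": {
        "HR": "20000000-0000-0000-0000-000000000000",
        "HR > address-schema": "30000000-0000-0000-0000-000000000000",
        "Orders > fk*": "40000000-0000-0000-0000-000000000000",
        "Orders > *": "50000000-0000-0000-0000-000000000000",
        "* > profiling": "60000000-0000-0000-0000-000000000000",
        "sales": "default",
    },
    "exclude": ["testDB", "* > information_schema"],
}


def _to_regex(pattern):
    """Translate a name pattern with ? and * wildcards into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def _literals(pattern):
    """Count the non-wildcard characters in a pattern."""
    return sum(1 for ch in pattern if ch not in "*?")


def _match(line, database, schema):
    """Return a specificity score if the line matches, otherwise None."""
    db_pat, sep, schema_pat = line.partition(">")
    db_pat, schema_pat = db_pat.strip(), schema_pat.strip()
    if not _to_regex(db_pat).fullmatch(database):
        return None
    if not sep:
        return (_literals(db_pat), -1)  # database-only line
    if not _to_regex(schema_pat).fullmatch(schema):
        return None
    return (_literals(db_pat), _literals(schema_pat))


def resolve_domain(config, database, schema):
    """Return a domain ID, DEFAULT, or None when the schema is not ingested."""
    for line in config.get("exclude", []):
        if _match(line, database, schema) is not None:
            return None  # the exclude block has priority over the include block
    include = config.get("include")
    if include is None:
        return DEFAULT  # no include block: use the System asset's domain
    best = None
    for line, domain_id in include.items():
        score = _match(line, database, schema)
        # Assumed tie-breaker: the database part outranks the schema part.
        if score is not None and (best is None or score > best[0]):
            best = (score, domain_id)
    return best[1] if best else None  # assumption: unmatched schemas are skipped


print(resolve_domain(EXAMPLE, "Orders", "profiling"))
```

Because the database part is compared first, the "Orders > *" line wins over "* > profiling" for the profiling schema in the Orders database, which matches the behavior described above.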

    No

    Extensible Properties Mapping

    The Extensible Properties Mapping field lets you ingest additional properties of Catalog, Schema, and Table objects from Databricks Unity Catalog into Collibra.

    Important 
    • This feature is in beta and is currently intended for non-production use only.
    • If you use this feature, make sure to set up all required characteristic assignments for the asset types.

    Three possible JSON formats are available.

    • Version 0.1: This version allows you to ingest custom properties only. You can ingest the values from the Properties field from Catalog, Schema, and Table objects into specific attributes in Collibra assets. You do this by adding the mapping between the Properties fields for the objects in Databricks Unity Catalog and the Collibra attribute IDs to ingest the data in, using a JSON string.
      • The text must be in JSON format and can contain a Catalogs, Schemas, and Tables block. The Catalogs block refers to Database assets, the Schemas block to Schema assets, and the Tables block to Table assets.
      • In each block, you specify the property name and the attribute ID to which you want to map the value in the property. The format is: "[property name]": "[attribute resource ID]". For example, "Description from source system": "00000000-0000-0000-0001-000500000074".
      Example 
      {
      "catalogs": {
      "color": "00000000-0000-0000-0000-000000001234",
      "Description from source system": "00000000-0000-0000-0001-000500000074"
      },
      "schemas": {
      "File Location": "00000000-0000-0000-0001-000500000004"
      },
      "tables": {
      "delta.lastCommitTimestamp": "00000000-0000-0000-0000-000000003114"
      }
      }

      In this example:

      • In the Database assets that we create, we'll add the color value in attribute 00000000-0000-0000-0000-000000001234, and the Description from source system value in attribute 00000000-0000-0000-0001-000500000074.
      • In the Schema assets that we create, we'll add the File Location value in attribute 00000000-0000-0000-0001-000500000004.
      • In the Table assets that we create, we'll add the delta.lastCommitTimestamp value in attribute 00000000-0000-0000-0000-000000003114.
    • Version 0.2: This version allows you to ingest both default system properties and custom properties. You can ingest most values from the Details page from Catalog, Schema, and Table objects into specific attributes in Collibra assets. You do this by adding the mapping between the fields for the objects in Databricks Unity Catalog and the Collibra attribute IDs to ingest the data in, using a JSON string.
      • The text must be in JSON format.
      • A Version block referencing 0.2 must be added.
      • A Catalogs, Schemas, and Tables block can be added. The Catalogs block refers to Database assets, the Schemas block to Schema assets, and the Tables block to Table assets.
      • Inside a Catalogs, Schemas, or Tables block, you can add a systemAttributes and a customParameters block. systemAttributes refers to the default system properties. customParameters refers to the custom properties.
      • In each block, you specify the property name and the attribute ID to which you want to map the value in the property. The format is: "[property name]": "[attribute resource ID]". For example, "Description from source system": "00000000-0000-0000-0001-000500000074".
        The following system properties are supported:
        • Catalogs: "browse_only", "catalog_type", "connection_name", "created_at", "created_by", "isolation_mode", "metastore_id", "provider_name", "provisioning_info", "securable_kind", "securable_type", "share_name", "storage_location", "storage_root", "updated_at", and "updated_by".
        • Schemas: "catalog_type", "created_at", "created_by", "metastore_id", "securable_type", "securable_kind", "storage_location", "storage_root", "updated_at", and "updated_by".
        • Tables: "access_point", "created_at", "created_by", "data_access_configuration_id", "data_source_format", "deleted_at", "metastore_id", "securable_type", "securable_kind", "sql_path", "storage_credential_name", "storage_location", "table_type", "updated_at", "updated_by", and "view_definition".
          The Tables mapping applies to both tables and views.
      Example 
      {
      "version": 0.2,
      "catalogs": {
      "systemAttributes": {
      "metastore_id": "00000000-0000-0000-0000-000000004224"
      },
      "customParameters": {
      "color": "00000000-0000-0000-0000-000000001234",
      "Description from source system": "00000000-0000-0000-0001-000500000074"
      }
      },
      "schemas": {
      "customParameters": {
      "File Location": "00000000-0000-0000-0001-000500000004"
      }
      },
      "tables": {
      "systemAttributes": {
      "metastore_id": "00000000-0000-0000-0000-000000004224"
      },
      "customParameters": {
      "delta.lastCommitTimestamp": "00000000-0000-0000-0000-000000003114"
      }
      }
      }

      In this example:

      • In the Database assets that we create, we'll add the metastore_id value in attribute 00000000-0000-0000-0000-000000004224, the color value in attribute 00000000-0000-0000-0000-000000001234, and the Description from source system value in attribute 00000000-0000-0000-0001-000500000074.
      • In the Schema assets that we create, we'll add the File Location value in attribute 00000000-0000-0000-0001-000500000004.
      • In the Table and View assets that we create, we'll add the metastore_id value in attribute "00000000-0000-0000-0000-000000004224" and the delta.lastCommitTimestamp value in attribute 00000000-0000-0000-0000-000000003114.
    • Version 0.3: This version allows you to ingest both default system properties and custom properties, and to define separate mappings for tables and views. You can ingest most values from the Details page from Catalog, Schema, Table, and View objects into specific attributes in Collibra assets. You do this by adding the mapping between the fields for the objects in Databricks Unity Catalog and the Collibra attribute IDs to ingest the data in, using a JSON string.
      • The text must be in JSON format.
      • A Version block referencing 0.3 must be added.
      • A Catalogs, Schemas, Tables, and Views block can be added. The Catalogs block refers to Database assets, the Schemas block to Schema assets, the Tables block to Table assets, and the Views block to Database View assets.
      • Inside a Catalogs, Schemas, Tables, or Views block, you can add a systemAttributes and a customParameters block. systemAttributes refers to the default system properties. customParameters refers to the custom properties.
      • In each block, you specify the property name and the attribute ID to which you want to map the value in the property. The format is: "[property name]": "[attribute resource ID]". For example, "Description from source system": "00000000-0000-0000-0001-000500000074".
        The following system properties are supported:
        • Catalogs: "browse_only", "catalog_type", "connection_name", "created_at", "created_by", "isolation_mode", "metastore_id", "provider_name", "provisioning_info", "securable_kind", "securable_type", "share_name", "storage_location", "storage_root", "updated_at", and "updated_by".
        • Schemas: "catalog_type", "created_at", "created_by", "metastore_id", "securable_type", "securable_kind", "storage_location", "storage_root", "updated_at", and "updated_by".
        • Tables: "access_point", "created_at", "created_by", "data_access_configuration_id", "data_source_format", "deleted_at", "metastore_id", "securable_type", "securable_kind", "sql_path", "storage_credential_name", "storage_location", "table_type", "updated_at", "updated_by", and "view_definition".
        • Views: "access_point", "created_at", "created_by", "data_access_configuration_id", "data_source_format", "deleted_at", "metastore_id", "securable_type", "securable_kind", "sql_path", "storage_credential_name", "storage_location", "table_type", "updated_at", "updated_by", and "view_definition".
      Example 
      {
      "version": 0.3,
      "catalogs": {
      "systemAttributes": {
      "metastore_id": "00000000-0000-0000-0000-000000004224"
      },
      "customParameters": {
      "color": "00000000-0000-0000-0000-000000001234",
      "Description from source system": "00000000-0000-0000-0001-000500000074"
      }
      },
      "schemas": {
      "customParameters": {
      "File Location": "00000000-0000-0000-0001-000500000004"
      }
      },
      "tables": {
      "systemAttributes": {
      "metastore_id": "00000000-0000-0000-0000-000000004224"
      },
      "customParameters": {
      "delta.lastCommitTimestamp": "00000000-0000-0000-0000-000000003114"
      }
      },
      "views": {
      "systemAttributes": {
      "metastore_id": "00000000-0000-0000-0000-000000004224"
      },
      "customParameters": {
      "view.sqlConfig.spark.sql.session.timeZone": "018cedbf-37fc-7da3-9ea8-da2af754222e"
      }
      }
      }

      In this example:

      • In the Database assets that we create, we'll add the metastore_id value in attribute 00000000-0000-0000-0000-000000004224, the color value in attribute 00000000-0000-0000-0000-000000001234, and the Description from source system value in attribute 00000000-0000-0000-0001-000500000074.
      • In the Schema assets that we create, we'll add the File Location value in attribute 00000000-0000-0000-0001-000500000004.
      • In the Table assets that we create, we'll add the metastore_id value in attribute "00000000-0000-0000-0000-000000004224" and the delta.lastCommitTimestamp value in attribute 00000000-0000-0000-0000-000000003114.
      • In the Database View assets that we create, we'll add the metastore_id value in attribute "00000000-0000-0000-0000-000000004224" and the view.sqlConfig.spark.sql.session.timeZone value in attribute 018cedbf-37fc-7da3-9ea8-da2af754222e.
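
A version 0.2 or 0.3 mapping can be checked for typos before you paste it into the field. The following Python sketch is illustrative only and is not part of Collibra. The name check_mapping is hypothetical, the lowercase block names follow the examples on this page, and the check that every attribute ID is a UUID is an assumption inferred from those examples. The sketch does not verify that the attributes exist or that the required characteristic assignments are in place.

```python
import json
import uuid

# Supported system properties, copied from the lists on this page.
_COMMON = {"created_at", "created_by", "metastore_id", "securable_kind",
           "securable_type", "storage_location", "updated_at", "updated_by"}
_CATALOGS = _COMMON | {"browse_only", "catalog_type", "connection_name",
                       "isolation_mode", "provider_name", "provisioning_info",
                       "share_name", "storage_root"}
_SCHEMAS = _COMMON | {"catalog_type", "storage_root"}
_TABLES = _COMMON | {"access_point", "data_access_configuration_id",
                     "data_source_format", "deleted_at", "sql_path",
                     "storage_credential_name", "table_type", "view_definition"}
SYSTEM_PROPS = {"catalogs": _CATALOGS, "schemas": _SCHEMAS,
                "tables": _TABLES, "views": _TABLES}


def _is_uuid(value):
    """Return True if value is a string that parses as a UUID."""
    if not isinstance(value, str):
        return False
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def check_mapping(text):
    """Return a list of problems found in a version 0.2 or 0.3 mapping."""
    mapping = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    version = mapping.get("version")
    if version not in (0.2, 0.3):
        return ["version must be 0.2 or 0.3"]
    # The views block is only available from version 0.3.
    blocks = {"catalogs", "schemas", "tables"} | ({"views"} if version == 0.3 else set())
    problems = []
    for name, block in mapping.items():
        if name == "version":
            continue
        if name not in blocks:
            problems.append(f"block {name!r} is not available in version {version}")
            continue
        for section, entries in block.items():
            if section not in ("systemAttributes", "customParameters"):
                problems.append(f"{name}: unknown section {section!r}")
                continue
            for prop, attribute_id in entries.items():
                if section == "systemAttributes" and prop not in SYSTEM_PROPS[name]:
                    problems.append(f"{name}: {prop!r} is not a supported system property")
                if not _is_uuid(attribute_id):
                    problems.append(f"{name}: {prop!r} is not mapped to a UUID")
    return problems


# Hypothetical sample: a views block is not allowed in a 0.2 mapping.
print(check_mapping('{"version": 0.2, "views": {}}'))
```

The check is deliberately strict about block and section names, because a misspelled name is easy to miss in a long JSON string.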

    No

    Compute Resource HTTP Path

    The HTTP path of the compute resource in Databricks Unity Catalog that is used to extract the source tags.

    You can find the HTTP path in the connection details of your cluster. For details, go to Get connection details for a cluster in the Databricks documentation.

    No

    Advanced Configuration
    • Logging configuration
    • Memory
    • JVM arguments

    These configuration options help when investigating issues with the capability.

    Important Only complete the fields Logging configuration, Memory (MiB), and JVM arguments on request of or together with Collibra Support.

    No

    Debug

    This setting is not valid for this integration. It should be set to false.

    An option to automatically send Edge infrastructure log files to Collibra Data Intelligence Platform. By default, this option is set to false.

    Note We highly recommend that you send Edge infrastructure log files to Collibra Data Intelligence Platform only when you have issues with Edge. If you set this option to true, it automatically reverts to false after 24 hours.

    No

    Log level

    This setting is not valid for this integration. It should be set to No logging.

    An option to determine the verbosity level of Catalog connector log files. By default, this option is set to No logging.

    No

  5. Click Create.
    The capability is added to the Edge site.
    The fields become read-only.

What's next?

You can now synchronize Databricks Unity Catalog.