Add a Technical Lineage for Databricks Unity Catalog capability to an Edge or Collibra Cloud site

After you enable technical lineage on Edge and have the Databricks connection available, add a Technical Lineage for Databricks Unity Catalog capability to the Edge or Collibra Cloud site.

Requirements and permissions

You have a global role that has the Manage connections and capabilities global permission, for example, Edge integration engineer.

Steps

  1. Open a site.
    1. On the main toolbar, click the Products icon, and then click the cogwheel icon (Settings).
      The Settings page opens.
    2. In the tab pane, click Edge.
      The Sites tab opens and shows a table with an overview of your sites.
    3. In the table, click the name of the site whose status is Healthy.
      The site page opens.
  2. In the Capabilities section, click Add capability. For Collibra Data Lineage to stitch the data objects in your technical lineage to the assets in Data Catalog, add a Catalog JDBC ingestion capability before you add the technical lineage capability.
    The Add capability page appears.
  3. Enter the required information.
    Field | Description | Required?

    Name

    The name of the capability.

    Yes

    Description

    The description of the capability.

    No

    Databricks Connection

    The Databricks connection that you created.

    Yes

    Compute Resource HTTP Path

    The HTTP path of the compute resource in Databricks Unity Catalog that Collibra Data Lineage collects and processes to create technical lineage.

    Yes

    Source ID

    The name of the data source. The name must be unique and cannot contain special characters, for example, /.

    Yes

    TechLin Admin Connection (in preview)

    If you want to use the OAuth authentication type to connect to the Collibra Data Lineage service instances, you have to create a Technical Lineage Admin Edge or Collibra Cloud site connection and select the OAuth authentication type. Then, in this field, you specify the name of the Technical Lineage Admin Edge or Collibra Cloud site connection.

    No

    Time Frame

    Specify the duration for data collection. You can enter any of the following values:

    • A number of days.
      The default value is 365, which means that Collibra Data Lineage collects data of the past 365 days.
      If you enter a negative number or 0, the default time frame of the past 365 days is used.
    • A date range:
      • YYYY-MM-DD YYYY-MM-DD. Collibra Data Lineage collects data from the specified start date to the specified end date.
      • YYYY-MM-DD now. Collibra Data Lineage collects data from the specified start date to the current date.
      • now YYYY-MM-DD. Collibra Data Lineage collects data from the current date to the specified end date.

      The start date must be earlier than the end date and at least one day apart.

    No
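The Time Frame rules above can be sketched as a small validator. This is only an illustration of the documented input forms, not how Collibra Data Lineage parses the field: the function name parse_time_frame, the treatment of an empty value as the default, and the exact way "the past N days" is turned into a start date are assumptions.

```python
from datetime import date, timedelta
from typing import Tuple

DEFAULT_DAYS = 365  # default stated in the Time Frame description


def parse_time_frame(value: str, today: date) -> Tuple[date, date]:
    """Return a (start, end) date pair for a Time Frame value (hypothetical helper)."""
    tokens = value.split()

    # Form 1: a number of days. Empty input is treated as the default (assumption).
    if len(tokens) <= 1:
        days = int(tokens[0]) if tokens else DEFAULT_DAYS
        if days <= 0:
            # Negative numbers and 0 fall back to the default time frame.
            days = DEFAULT_DAYS
        return today - timedelta(days=days), today

    # Form 2: a date range, where either side can be the keyword "now".
    if len(tokens) == 2:
        start, end = (today if t == "now" else date.fromisoformat(t) for t in tokens)
        # The start date must be earlier than the end date, at least one day apart.
        if (end - start).days < 1:
            raise ValueError("start date must be at least one day before the end date")
        return start, end

    raise ValueError("expected a number of days or two space-separated values")


if __name__ == "__main__":
    print(parse_time_frame("2024-01-01 now", date(2024, 6, 30)))
```

Passing today as a parameter keeps the sketch deterministic, which makes the one-day-apart rule easy to check before you save the capability.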

    Property

    This section contains the custom parameters you can specify to create technical lineage. Click Add property to add a property.

    You can use this field to set the HTTP timeout duration by adding the httpTimeout property: 

    Warning If you are a Collibra Platform for Government customer, this field is required to connect to a Collibra Data Lineage service instance:

    Yes for US government customers.

    Processing Level

    Important This setting replaces the deprecated Analyze Only option, which will be removed in a future version of Collibra.

    For each of your data sources, you have to specify one of the following values: Load, Analyze, or Sync. Then, when you synchronize your technical lineage, the following process begins:

    1. Metadata for all data sources is loaded, regardless of the value of this setting for a particular data source.
    2. Metadata from data sources for which the value of this setting is either Analyze or Sync, is analyzed.
    3. Metadata from data sources for which the value of this setting is Sync, is synchronized.

    Value | Description
    Load

    Harvest metadata from the data source and upload it to your Collibra environment. This allows you to inspect and, if necessary, edit the harvested metadata before uploading it to the Collibra Data Lineage service instance for analysis.

    When the job is done, you can download and review the metadata:

    1. Open the Activities list.
    2. In the row containing the job, click Result.
      The Synchronization Results dialog box appears.
    3. Click download and save the ZIP file to your hard drive.

    Tip The download link resembles the following: https://integrations.collibra-abc.com/rest/2.0/files/01944f12-7665-7d9c-8bc5-aa426b6a63cc. Take note of the file ID, in this example: 01944f12-7665-7d9c-8bc5-aa426b6a63cc. After you inspect the metadata, you can send the ZIP file for analysis by using the "Analyze files" option. Alternatively, you can upload the ZIP file using the POST /files API. In either case, you need to specify the file ID.
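If you script the follow-up step, the file ID can be read from the download link rather than copied by hand. The sketch below uses only the Python standard library and assumes, based on the example link in the tip above, that the file ID is the last path segment; the function name file_id_from_download_link is hypothetical.

```python
from urllib.parse import urlparse


def file_id_from_download_link(link: str) -> str:
    """Return the last path segment of a download link (assumed to be the file ID)."""
    path = urlparse(link).path
    # Drop a trailing slash, if any, then keep everything after the final "/".
    return path.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    example = (
        "https://integrations.collibra-abc.com/rest/2.0/files/"
        "01944f12-7665-7d9c-8bc5-aa426b6a63cc"
    )
    print(file_id_from_download_link(example))
```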

    Analyze

    Load and analyze the metadata on the Collibra Data Lineage service instance.

    Synchronization does not start after analysis; it starts only after either:

    Important If you want to synchronize multiple data sources, we strongly recommend that you select this option in the respective Edge or Collibra Cloud site capabilities for each of your data sources. This allows you to synchronize all data sources in a single job, thereby maximizing efficiency and mitigating the risk of failed synchronization jobs.
    Sync

    Load, analyze, and synchronize metadata from all data sources. Synchronization starts – or is queued, if another synchronization job is running – immediately after analysis.

    Important If you want to synchronize multiple data sources and you select this option, each data source is processed as a separate job. This is highly inefficient and will likely lead to failed sync jobs. For complete information and important considerations, go to Tips for successful lineage synchronization.

    Yes
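To see how the three Processing Level values interact across several data sources, the three-step process described above can be modeled as follows. This is a conceptual sketch only: the ProcessingLevel enum, the plan_phases function, and the source IDs are hypothetical and are not part of any Collibra API.

```python
from enum import IntEnum
from typing import Dict, List


class ProcessingLevel(IntEnum):
    """Processing Level values, ordered by how far a data source is processed."""
    LOAD = 1
    ANALYZE = 2
    SYNC = 3


def plan_phases(levels: Dict[str, ProcessingLevel]) -> Dict[str, List[str]]:
    """Map each phase to the source IDs that take part in it."""
    return {
        # Step 1: metadata for all data sources is loaded, whatever the setting.
        "loaded": sorted(levels),
        # Step 2: only sources set to Analyze or Sync are analyzed.
        "analyzed": sorted(
            s for s, lv in levels.items() if lv >= ProcessingLevel.ANALYZE
        ),
        # Step 3: only sources set to Sync are synchronized.
        "synchronized": sorted(
            s for s, lv in levels.items() if lv == ProcessingLevel.SYNC
        ),
    }


if __name__ == "__main__":
    # Hypothetical source IDs, one per capability.
    plan = plan_phases({
        "dbx_sales": ProcessingLevel.LOAD,
        "dbx_hr": ProcessingLevel.ANALYZE,
        "dbx_finance": ProcessingLevel.SYNC,
    })
    for phase, sources in plan.items():
        print(phase, sources)
```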

    Active

    This option determines whether to include or exclude the technical lineage of the data source.

    Select this option to include the technical lineage of this data source.

    Clear the checkbox to exclude the technical lineage of this data source.

    Yes

    Save Input Metadata

    Select the checkbox if you want to save the input metadata extracted from the data source in ZIP files. The files can be useful for troubleshooting. Select this option only at the request of Collibra Support. If this option is selected, you can download the files from the Synchronization Result dialog box once the synchronization activity is completed.

    No

    Ingest lineage from external tables

    Select this option to ingest lineage from external delta tables. Selecting this option can cause longer synchronization times.

    Clear the checkbox to exclude lineage from external delta tables.

    No

    Also ingest lineage from table_lineage

    Select this option to create both table-level and column-level lineage. In addition to the lineage from the system.access.column_lineage table, Collibra Data Lineage also ingests lineage that exists only in the system.access.table_lineage table when this option is selected. Selecting this option can cause longer synchronization times.

    To create only column-level lineage, clear the checkbox.

    No

    (Deprecated) Filters
    Note This field is deprecated. Use the Include Filter and Exclude Filter fields on the Synchronization page to specify which lineage events to include or exclude in technical lineage. If you specify this field and also the Include Filter and Exclude Filter fields, the Include Filter and Exclude Filter fields take precedence.

    Use this section to include or exclude databases and schemas to be ingested. Enter the filters in JSON format. If you used filters when you integrated Databricks Unity Catalog, you can enter the content of the Filters and Domain Mapping field in the Databricks Unity Catalog capability in this field. Note that Collibra Data Lineage ignores the UUIDs that are specified in the content.

    Text in JSON format to include or exclude databases and schemas, and to configure domain mappings.

    • The text must be in JSON format and can contain an include and an exclude block. You can use any JSON validator to verify the format. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such JSON validators, and has no liability for such use.
    • In the include block, you can specify the domain in which specific catalogs or schemas must be ingested. The format is: "Catalog/Database > schema": "domain ID". For example, "HR > address-schema": "30000000-0000-0000-0000-000000000000".
    • In the exclude block, you can specify the catalogs or schemas that you don't want to ingest. For example, "* > test".
    • The exclude block has priority over the include block.
    • If the include block is not present, we ingest all assets into the same domain as the System asset.
    • If there is no explicit domain mapping for a schema, we use the domain specified for the database.
    • You can use the keyword default as a domain ID. In that case, the catalog or schema will be ingested in the same domain as the System asset.
    • A match with a database has priority over a match with a schema.
    • The integration fails before the synchronization starts, if one or more domain IDs specified in the include block don't exist.
    • The integration fails before the synchronization starts if a domain ID is left empty in the include block.
    • You can use the ? and * wildcards in the catalog and schema names. If a catalog or schema matches multiple lines, the most detailed match is taken into account.

    No
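As an illustration of the filter rules above, the following sketch assembles a filter text and rejects empty domain IDs before you paste the result into the field. The include entry and the "* > test" pattern come from the examples above. The key names include and exclude, the representation of the exclude block as a JSON array, and the "Sales > *" entry are assumptions, as is the build_filters function. The sketch does not check whether a domain ID exists in your environment.

```python
import json
from typing import Dict, List


def build_filters(include: Dict[str, str], exclude: List[str]) -> str:
    """Return filter text in JSON format (assumed shape; hypothetical helper)."""
    for pattern, domain_id in include.items():
        # An empty domain ID in the include block makes the integration fail.
        if not domain_id:
            raise ValueError(f"empty domain ID for {pattern!r}")
    return json.dumps({"include": include, "exclude": exclude}, indent=2)


if __name__ == "__main__":
    text = build_filters(
        include={
            # Example mapping taken from the field description.
            "HR > address-schema": "30000000-0000-0000-0000-000000000000",
            # The keyword default ingests into the same domain as the System asset.
            "Sales > *": "default",
        },
        exclude=["* > test"],
    )
    print(text)
```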

    Logging configuration

    Memory (MiB)

    JVM arguments

    These fields are configuration options that can help when investigating issues with the capability.

    Important Use these fields only at the request of Collibra Support.

    No

    Debug

    This setting is not valid for this integration. It should be set to false.

    No

    Log level

    Complete this field only at the request of, or together with, Collibra Support.

    No

  4. Click Save.
    The capability is added to the Edge or Collibra Cloud site.
    The fields become read-only.

What's next

Synchronize technical lineage for Databricks Unity Catalog.