Create a crawler for Azure Data Lake Storage

Important 

In Collibra 2024.05, we've launched a new user interface (UI) for Collibra Data Intelligence Platform! You can learn more about this latest UI in the UI overview.


By creating a crawler for Azure Data Lake Storage (ADLS), you can specify which directories you want to synchronize.

Before you begin

Requirements and permissions

Steps

  1. Open the ADLS File System asset.
  2. In the tab bar (the tab pane in the classic UI), click Configuration.
  3. In the Crawlers section, click Edit Configuration.
  4. Click Add Crawler.
  5. In the Crawlers section, click Create crawler.
    The Create crawler dialog appears.
  6. Enter the required information.
    Field    Description

    Domain

    The domain in which the assets of the ADLS file system are to be created.

    Name

    The name you want to give to the crawler in Collibra.

    The crawler name character limit is 255.

    Include Paths 
    Include path

    The case-sensitive path to a directory in ADLS. All objects and subdirectories of this path are taken into account during the synchronization.
    Use the following structure to refer to the path:
    https://<storage account name>.blob.core.windows.net/<container name>/<blob name>

    Note: The include path is case-sensitive.

    Example 

    https://myaccount.blob.core.windows.net/mycontainer/myblob
    https://myaccount.blob.core.windows.net/$root/myblob refers to the root container. For information on working with root containers, go to the ADLS documentation.

    Exclude patterns

    A case-sensitive pattern that represents the objects that are included via the Include path, but that you want to exclude from the synchronization.
    When you define a pattern, you can use the following rules:

    • * matches zero or more characters.
    • ** matches zero or more directories in a path.
    • ? matches one character.
    Note: The exclude patterns are case-sensitive.

    Example 
    • comm/*.jsp matches all .jsp files in the comm path.
    • comm/t?st.jsp matches comm/test.jsp but also comm/tast.jsp or comm/txst.jsp.
    • comm/**/test.jsp matches all test.jsp files in the comm path.
    • org/framework/**/*.jsp matches all .jsp files in the org/framework path.
    • org/**/servlet/test.jsp matches org/framework/servlet/test.jsp but also org/framework/testing/servlet/test.jsp and org/servlet/test.jsp.
    • Full example:
      • Exclude pattern: **/inner_dir/**
      • For the following ADLS tree:
        • /dir_1
        • /dir_2
        • /outer_dir
        • /outer_dir/inner_dir
        • /outer_dir/inner_dir/file.csv
      • The result will be:
        • /dir_1
        • /dir_2
        • /outer_dir
    Add Exclude Pattern
    Button to add an exclude pattern.
    Add Include Path

    Button to add an additional Include path.

    Add Crawler

    Button to define an additional crawler.

    Add pattern

    Button to add additional exclude patterns.

    Add path

    Button to add an additional Include path.
  7. Click Save.
  8. Click Create.
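
The include path structure and the exclude pattern rules from step 6 can be sketched in Python to preview which paths a crawler configuration would keep. This is a minimal illustration under stated assumptions, not Collibra's implementation: the function names (include_path, matches, apply_excludes) are hypothetical, * and ? are assumed not to cross a / separator, and patterns are assumed to be matched against the whole path with leading slashes ignored.

```python
import re


def include_path(storage_account, container, blob):
    """Build an include path in the structure described in step 6."""
    return f"https://{storage_account}.blob.core.windows.net/{container}/{blob}"


def _split(path):
    # Assumption: leading, trailing, and repeated slashes are ignored.
    return [part for part in path.split("/") if part]


def _segment_matches(pattern_segment, path_segment):
    # Translate one pattern segment into a regular expression:
    # * matches zero or more characters, ? matches exactly one character.
    # Assumption: neither wildcard crosses a "/" separator.
    regex = ""
    for ch in pattern_segment:
        if ch == "*":
            regex += "[^/]*"
        elif ch == "?":
            regex += "[^/]"
        else:
            regex += re.escape(ch)
    # No re.IGNORECASE flag: matching stays case-sensitive.
    return re.fullmatch(regex, path_segment) is not None


def _match_segments(pattern, path):
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # ** matches zero or more directories: try every possible split.
        return any(_match_segments(rest, path[i:]) for i in range(len(path) + 1))
    if not path:
        return False
    return _segment_matches(head, path[0]) and _match_segments(rest, path[1:])


def matches(pattern, path):
    """Return True if the path matches the exclude pattern."""
    return _match_segments(_split(pattern), _split(path))


def apply_excludes(paths, exclude_patterns):
    """Keep only the paths that match none of the exclude patterns."""
    return [
        p for p in paths
        if not any(matches(pattern, p) for pattern in exclude_patterns)
    ]


# The ADLS tree from the full example in step 6.
tree = [
    "/dir_1",
    "/dir_2",
    "/outer_dir",
    "/outer_dir/inner_dir",
    "/outer_dir/inner_dir/file.csv",
]
print(apply_excludes(tree, ["**/inner_dir/**"]))
# prints ['/dir_1', '/dir_2', '/outer_dir']
```

Splitting patterns and paths into segments keeps the ** rule separate from the single-segment wildcards, which makes it easy to check a pattern against a sample tree before entering it in the Create crawler dialog.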

What's next?

You can now synchronize the ADLS file system manually or define a synchronization schedule.