Create a crawler for Azure Data Lake Storage

You can create a crawler for Azure Data Lake Storage (ADLS) to specify the directories that you want to synchronize.

Prerequisites

In your Collibra environment:

Steps

  1. Open the ADLS File System asset.
  2. In the tab bar, click Configuration.
  3. In the Crawlers section, click Edit Configuration.
  4. Click Add Crawler.
  5. In the Crawlers section, click Create crawler.
    The Create crawler dialog appears.
  6. Enter the required information.
    Field    Description
    Name

    The name you want to give to the crawler in Collibra.

    • Crawler names can contain up to 255 characters.
    • All crawler names must be unique.

    Domain

    The domain in which the assets of the ADLS file system are to be created.

    Include Paths 
    Include Path

    The case-sensitive path to a directory in ADLS. All objects and subdirectories of this path are taken into account during synchronization.
    Use the following structure to refer to the path:
    https://<storage account name>.blob.core.windows.net/<container name>/<blob name>

    Example 

    https://myaccount.blob.core.windows.net/mycontainer/myblob
    https://myaccount.blob.core.windows.net/$root/myblob refers to the root container. For information on working with root containers, go to the ADLS documentation.

    Exclude patterns

    A case-sensitive pattern that matches objects that are covered by the include path but that you want to exclude from synchronization.
    When you define a pattern, you can use the following rules:

    • * matches zero or more characters.
    • ** matches zero or more directories in a path.
    • ? matches one character.
    Note 

    The exclude patterns are case-sensitive.

    Example 
    • comm/*.jsp matches all .jsp files in the comm path.
    • comm/t?st.jsp matches comm/test.jsp but also comm/tast.jsp or comm/txst.jsp.
    • comm/**/test.jsp matches all test.jsp files in the comm path.
    • org/framework/**/*.jsp matches all .jsp files in the org/framework path.
    • org/**/servlet/test.jsp matches org/framework/servlet/test.jsp but also org/framework/testing/servlet/test.jsp and org/servlet/test.jsp.
    • Full example:
      • Exclude pattern: **/inner_dir/**
      • For the following ADLS tree:
        • /dir_1
        • /dir_2
        • /outer_dir
        • /outer_dir/inner_dir
        • /outer_dir/inner_dir/file.csv
      • The following objects are synchronized:
        • /dir_1
        • /dir_2
        • /outer_dir
    Exclude Patterns
     
    Add Exclude Pattern
    Button to add an exclude pattern.
    Add Include Path

    Button to add an additional include path.

    • Include paths cannot be empty.
    • Currently, you cannot add more than 5 include paths.
    Add Crawler

    Button to define an additional crawler.

    If you add many crawlers, the page becomes longer and requires more scrolling, which can affect the responsiveness of the UI.

    Add pattern

    Button to add additional exclude patterns.

    Add path

    Button to add an additional include path.
  7. Click Save.
  8. Click Create.
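
The Name and include path rules from step 6 can be checked before you open the dialog, for example when you prepare crawler definitions in a spreadsheet or script. The following Python sketch is illustrative only and is not part of Collibra: the function names `include_path_problem` and `crawler_problems` are hypothetical, names are compared exactly as typed (an assumption, since the page does not say whether uniqueness is case-sensitive), and the structure check assumes that every include path contains all three parts of the documented URL structure.

```python
from urllib.parse import urlsplit

MAX_NAME_LENGTH = 255    # "Crawler names can contain up to 255 characters."
MAX_INCLUDE_PATHS = 5    # no more than 5 include paths are currently allowed
HOST_SUFFIX = ".blob.core.windows.net"


def include_path_problem(path):
    """Return a description of what is wrong with an include path, or None."""
    if not path:
        return "include path is empty"
    parts = urlsplit(path)
    host = parts.netloc
    # Everything in front of the fixed host suffix is the storage account name.
    account = host[: -len(HOST_SUFFIX)] if host.endswith(HOST_SUFFIX) else ""
    # The first path segment is the container, the remainder is the blob name.
    container, _, blob = parts.path.lstrip("/").partition("/")
    if parts.scheme != "https" or not account or not container or not blob:
        return f"include path does not follow the expected structure: {path}"
    return None


def crawler_problems(name, include_paths, existing_names):
    """Collect every rule violation instead of stopping at the first one."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"crawler name is longer than {MAX_NAME_LENGTH} characters")
    if name in existing_names:  # assumption: exact, case-sensitive comparison
        problems.append("crawler name is already in use")
    if len(include_paths) > MAX_INCLUDE_PATHS:
        problems.append(f"more than {MAX_INCLUDE_PATHS} include paths")
    for path in include_paths:
        problem = include_path_problem(path)
        if problem:
            problems.append(problem)
    return problems


paths = [
    "https://myaccount.blob.core.windows.net/mycontainer/myblob",
    "https://myaccount.blob.core.windows.net/$root/myblob",
]
print(crawler_problems("finance-crawler", paths, {"hr-crawler"}))  # []
```

Returning a list of problems rather than raising on the first one lets you fix all issues in a definition in one pass.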

What's next

You can now synchronize the ADLS file system manually or define a synchronization schedule.
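
Before you synchronize, it can help to preview which objects an exclude pattern from step 6 would remove. The following Python sketch implements only the three documented rules (`*`, `**` and `?`) and reproduces the full example with the `**/inner_dir/**` pattern. It is not Collibra's matcher: the function names are hypothetical, and it assumes that `*` and `?` do not match across a `/`, that leading slashes can be ignored, and that patterns are compared against the paths exactly as listed.

```python
import re


def _parts(text):
    """Split a pattern or path on '/', dropping empty segments."""
    return [part for part in text.split("/") if part]


def _segment_matches(pattern_segment, name):
    """Match one path segment: * is any run of characters, ? is exactly one."""
    regex = "".join(
        "[^/]*" if ch == "*" else "[^/]" if ch == "?" else re.escape(ch)
        for ch in pattern_segment
    )
    # re is case-sensitive by default, like the documented patterns.
    return re.fullmatch(regex, name) is not None


def _match(pattern_parts, path_parts):
    if not pattern_parts:
        return not path_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        # ** may absorb zero or more whole directories.
        return any(_match(rest, path_parts[i:]) for i in range(len(path_parts) + 1))
    return (
        bool(path_parts)
        and _segment_matches(head, path_parts[0])
        and _match(rest, path_parts[1:])
    )


def is_excluded(pattern, path):
    """Return True if the path matches the exclude pattern."""
    return _match(_parts(pattern), _parts(path))


tree = [
    "/dir_1",
    "/dir_2",
    "/outer_dir",
    "/outer_dir/inner_dir",
    "/outer_dir/inner_dir/file.csv",
]
kept = [path for path in tree if not is_excluded("**/inner_dir/**", path)]
print(kept)  # ['/dir_1', '/dir_2', '/outer_dir']
```

Matching segment by segment keeps `*` inside a single directory level, which is what makes `**` necessary for patterns such as `org/**/servlet/test.jsp`.
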