Create a crawler for Azure Data Lake Storage
You can create a crawler for Azure Data Lake Storage (ADLS) to specify the directories that you want to synchronize.
Prerequisites
In your Collibra environment:
- You have registered an ADLS file system.
- You have connected the ADLS File System asset to the ADLS Edge capability.
- You have a resource role with the Configure external system resource permission, for example, Owner.
- You have a global role with the Catalog global permission, for example, Catalog Author.
- You have a global role with the View Edge connections and capabilities global permission, for example, Edge integration engineer.
Steps
- Open the ADLS File System asset.
- In the tab bar, click Configuration.
- In the Crawlers section, click Edit Configuration.
- Click Add Crawler.
- Enter the required information.
Name
The name you want to give to the crawler in Collibra.
- Crawler names can contain up to 255 characters.
- All crawler names must be unique.
Domain
The domain in which the assets of the ADLS file system are to be created.
Include Paths
Include Path
The case-sensitive path to a directory in ADLS. All objects and subdirectories of this path are taken into account during synchronization.
Use the following structure to refer to the path: https://<storage account name>.blob.core.windows.net/<container name>/<blob name>.
Example
- https://myaccount.blob.core.windows.net/mycontainer/myblob
- https://myaccount.blob.core.windows.net/$root/myblob refers to the root container. For information on working with root containers, go to the ADLS documentation.
Exclude Patterns
Add Exclude Pattern
Button to add an exclude pattern.
Exclude pattern
A case-sensitive pattern that represents the objects that are included via the include path but that you want to exclude from synchronization.
When you define a pattern, you can use the following rules:
- * matches zero or more characters.
- ** matches zero or more directories in a path.
- ? matches one character.
Example
- comm/*.jsp matches all .jsp files in the comm path.
- comm/t?st.jsp matches comm/test.jsp but also comm/tast.jsp or comm/txst.jsp.
- comm/**/test.jsp matches all test.jsp files in the comm path.
- org/framework/**/*.jsp matches all .jsp files in the org/framework path.
- org/**/servlet/test.jsp matches org/framework/servlet/test.jsp but also org/framework/testing/servlet/test.jsp and org/servlet/test.jsp.
Full example
- Exclude pattern: **/inner_dir/**
- For the following ADLS tree:
  - /dir_1
  - /dir_2
  - /outer_dir
  - /outer_dir/inner_dir
  - /outer_dir/inner_dir/file.csv
- The result will be:
  - /dir_1
  - /dir_2
  - /outer_dir
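To preview which paths an exclude pattern would remove before you save the crawler, the wildcard rules can be approximated locally. The following Python sketch is not Collibra's implementation: it assumes Ant-style semantics in which * and ? do not cross a / separator and a trailing /** also matches the directory itself, and the names compile_pattern, matches, and remaining are hypothetical helpers introduced only for this illustration.

```python
import re


def compile_pattern(pattern: str) -> re.Pattern:
    """Translate an exclude pattern into a case-sensitive regular expression.

    Assumption: Ant-style wildcards, where * and ? stay within one path segment.
    """
    parts = []
    i, n = 0, len(pattern)
    while i < n:
        if pattern.startswith("/**", i) and i + 3 == n:
            # Trailing "/**": the directory itself and everything below it.
            parts.append("(?:/.*)?")
            i += 3
        elif pattern.startswith("**/", i):
            # "**/": zero or more whole directories.
            parts.append("(?:[^/]+/)*")
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            # "*": zero or more characters inside one segment.
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            # "?": exactly one character inside one segment.
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern: str, path: str) -> bool:
    """Return True if the whole path matches the pattern (leading "/" ignored)."""
    return compile_pattern(pattern).fullmatch(path.lstrip("/")) is not None


def remaining(paths, exclude_patterns):
    """Return the paths that no exclude pattern matches, in their original order."""
    return [p for p in paths if not any(matches(x, p) for x in exclude_patterns)]


tree = [
    "/dir_1",
    "/dir_2",
    "/outer_dir",
    "/outer_dir/inner_dir",
    "/outer_dir/inner_dir/file.csv",
]
print(remaining(tree, ["**/inner_dir/**"]))  # ['/dir_1', '/dir_2', '/outer_dir']
```

Under these assumptions the sketch reproduces the full example. If the behavior in your environment differs, for instance for * across directory boundaries, treat the synchronization result as authoritative.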
Add Include Path
Button to add an additional include path.
- Include paths cannot be empty.
- Currently, the capability does not allow for more than 5 include paths.
Add Crawler
Button to define an additional crawler.
If you add many crawlers, you need to scroll more, which can reduce the responsiveness of the UI.
- Click Save.
You can now synchronize the ADLS file system manually or define a synchronization schedule.
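If you prepare crawler definitions outside the UI, for example in a planning spreadsheet, the constraints from the field descriptions can be checked before you enter them. The following Python sketch is only an illustration: the dictionary layout with the keys name and include_paths, and the functions looks_like_include_path and validate_crawlers, are assumptions made for this example and are not part of any Collibra API. It also assumes that name uniqueness is an exact, case-sensitive comparison.

```python
from urllib.parse import urlparse

MAX_NAME_LENGTH = 255    # crawler names can contain up to 255 characters
MAX_INCLUDE_PATHS = 5    # no more than 5 include paths per crawler
BLOB_HOST_SUFFIX = ".blob.core.windows.net"


def looks_like_include_path(path: str) -> bool:
    """Check the shape https://<storage account name>.blob.core.windows.net/<container name>/..."""
    parsed = urlparse(path)
    segments = [s for s in parsed.path.split("/") if s]
    return (
        parsed.scheme == "https"
        and parsed.netloc.endswith(BLOB_HOST_SUFFIX)
        and len(parsed.netloc) > len(BLOB_HOST_SUFFIX)  # account name is present
        and len(segments) >= 1                          # container name is present
    )


def validate_crawlers(crawlers):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    seen_names = set()
    for crawler in crawlers:
        name = crawler["name"]
        paths = crawler["include_paths"]
        label = name[:30]  # shortened name for readable messages
        if len(name) > MAX_NAME_LENGTH:
            problems.append(f"{label}: name is longer than {MAX_NAME_LENGTH} characters")
        if name in seen_names:
            problems.append(f"{label}: name is not unique")
        seen_names.add(name)
        if len(paths) > MAX_INCLUDE_PATHS:
            problems.append(f"{label}: more than {MAX_INCLUDE_PATHS} include paths")
        for path in paths:
            if not path:
                problems.append(f"{label}: include path is empty")
                continue
            if not looks_like_include_path(path):
                problems.append(f"{label}: unexpected include path structure: {path}")
    return problems


example = [
    {
        "name": "sales-crawler",  # hypothetical crawler name
        "include_paths": ["https://myaccount.blob.core.windows.net/mycontainer/myblob"],
    }
]
print(validate_crawlers(example))  # []
```

The structure check is deliberately loose: it only confirms the host suffix and the presence of a container segment, so a path that passes can still point to a directory that does not exist.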