Edit a crawler
You can edit a crawler of an S3 File System asset in Data Catalog, for example, to change its exclude patterns.
Prerequisites
- You have a resource role with the Configure external system resource permission, for example Owner.
- You have a global role with the Catalog global permission, for example Catalog Author.
- You have registered an Amazon S3 file system.
- You have a global role with the View Edge connections and capabilities global permission.
- You have connected an S3 File System asset to Amazon S3.
Steps
- Open an S3 File System asset page.
- In the tab pane, click Configuration.
- In the Crawlers section, in the row of the crawler that you want to edit, click the edit icon.
  The Edit crawler window appears.
- Enter the required information.
Domain
The domain in which the assets of the S3 file system are created.
Read more about linking domains to crawlers:
- A specific Storage Catalog domain is created automatically when the S3 File System asset is created. That domain is selected by default. However, you can manually create a new Storage Catalog domain and select it instead.
- If multiple crawlers point to the same domain, all assets are created in that domain.
- If multiple crawlers point to different domains, the assets are created in their respective domains.
- If multiple crawlers from the same S3 File System asset overlap and point to different domains, the overlapping assets are created in each domain.
- If multiple crawlers from the same S3 File System asset overlap and point to the same domain, the overlapping assets are created once in that domain.
- If crawlers from multiple S3 File System assets overlap and point to different domains, the overlapping assets are created in each domain.
- If crawlers from multiple S3 File System assets overlap and point to the same domain, the overlapping assets are created once in that domain and the S3 Bucket asset has a relation to both S3 File System assets.

Name
The name of the crawler in Collibra.
Read more about crawler names:
- You cannot use the same name for two crawlers in the same S3 File System asset.
- The name of the corresponding crawler in AWS Glue contains this name and follows this convention: collibra_catalog_<s3fs asset id>_<name_of_the_crawler_in_Collibra>.
- The crawler name must comply with the AWS Glue limitations:
  - It has to match the single-line string pattern [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*.
  - The full name must be between 1 and 255 bytes long, including the fixed prefix that Collibra adds. That means that you can use roughly 65 characters, depending on the characters used.
    Warning: This restriction is imposed by AWS Glue, which allows up to 255 bytes, including the prefix added by Collibra. If you enter too many characters and exceed the byte limit, synchronization fails.

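Because the 255-byte limit applies to the full AWS Glue name, including the prefix Collibra adds, and because the limit is measured in bytes rather than characters, multi-byte characters shrink the budget. The following sketch shows one way to check a candidate name before saving; the asset id and the helper names are hypothetical, not part of the Collibra product.

```python
# Sketch: check that a crawler name stays within the 255-byte limit for the
# full AWS Glue crawler name, including the fixed prefix Collibra adds.
# The template below mirrors the documented naming convention; the asset id
# used later is a made-up example.

PREFIX_TEMPLATE = "collibra_catalog_{asset_id}_{name}"

def glue_name_length(asset_id: str, crawler_name: str) -> int:
    """Return the UTF-8 byte length of the full AWS Glue crawler name."""
    full_name = PREFIX_TEMPLATE.format(asset_id=asset_id, name=crawler_name)
    return len(full_name.encode("utf-8"))

def fits_glue_limit(asset_id: str, crawler_name: str, limit: int = 255) -> bool:
    """True if the full name is between 1 and `limit` bytes long."""
    return 1 <= glue_name_length(asset_id, crawler_name) <= limit

# A short ASCII name is well under the limit:
ok = fits_glue_limit("01234567-89ab-cdef-0123-456789abcdef", "sales-crawler")  # True
```

Note that a character such as "é" encodes to two UTF-8 bytes, which is why the documentation can only give a rough character count.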
Include path
The case-sensitive path to a directory of a bucket in Amazon S3. All objects and subdirectories of this path are crawled.
For more information and examples, see the AWS Glue documentation.

Exclude patterns
Glob patterns that represent the objects that are in the include path but that you want to exclude.
For more information and examples, see the AWS Glue documentation.

Add pattern
Button to add additional exclude patterns.

- Click Save.
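To build intuition for how exclude patterns filter objects under the include path, here is an illustrative sketch using Python's fnmatch. This only approximates glob matching: fnmatch lets * match across "/" separators, which differs from the glob semantics AWS Glue applies, so consult the AWS Glue documentation for the authoritative syntax. The object keys and patterns below are made-up examples.

```python
# Illustrative only: approximate exclude-pattern filtering with fnmatch.
# AWS Glue's own glob syntax is richer (e.g. "**" and brace expansion) and
# its "*" does not cross "/" boundaries, unlike fnmatch.
from fnmatch import fnmatch

# Hypothetical objects under the include path:
include_path_objects = [
    "reports/2023/summary.csv",
    "reports/2023/summary.tmp",
    "reports/archive/old.csv",
]

exclude_patterns = ["*.tmp", "*/archive/*"]

def is_excluded(key: str, patterns: list[str]) -> bool:
    """True if the object key matches any exclude pattern."""
    return any(fnmatch(key, pattern) for pattern in patterns)

# Only objects that match no exclude pattern are crawled:
crawled = [k for k in include_path_objects if not is_excluded(k, exclude_patterns)]
# crawled == ["reports/2023/summary.csv"]
```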