Edit a crawler

You can edit a crawler of an S3 File System asset in Data Catalog. For example, you can do this if you want to edit the exclude pattern.

Prerequisites

  • You have registered an Amazon S3 file system.
  • You have connected an S3 File System asset to Amazon S3.

Requirements and permissions

Steps

  1. Open an S3 File System asset page.
  2. In the tab bar, click Configuration.
  3. In the Crawlers section, click Edit Configuration.
  4. In the Crawlers section, in the row of the crawler that you want to edit, click the edit icon.
    The Edit crawler window appears.
  5. Enter the required information.
    Field and description:

    Domain

    The domain in which the assets of the S3 file system are created.

    Name

    The name of the crawler in Collibra.

    Table Level

    Specify the level from which tables have to be created during the integration. By default, tables are created from the top level, level 1.
    Only specify a number if you want to create tables starting from another level, such as 2 or 3. For more information, go to the AWS documentation.

    File Group Pattern

    Add a regular expression to group files with similar file names into a File Group asset during the S3 synchronization. Multiple regular expression grammar variants exist. We use the Java variant.
    A regular expression, also referred to as regex or regexp, is a sequence of characters that specifies a match pattern in text.

    Example: If you add the (\w*)_\d\d\d\d\.csv regex, the integration automatically detects files matching this pattern and groups them into a File Group asset.

    You can define one regex per crawler.

    Tip 
    • Multiple websites provide guidelines and examples of regular expressions, for example, Regexlib and RegexBuddy, or even ChatGPT.
    • You can also test your regular expression on various websites, for example, Regex101 (select the Java 8 option in the Flavor panel).

    The referenced websites serve only as examples. The use of ChatGPT or other generative AI products and services is at your own risk. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such products or services, and has no liability for such use.

    Include path

    The case-sensitive path to a directory of a bucket in Amazon S3. All objects and subdirectories of this path are crawled.

    For more information and examples, go to the AWS Glue documentation.

    Exclude patterns

    Glob pattern that represents the objects that are in the include path, but that you want to exclude.

    For more information and examples, go to the AWS Glue documentation.

    Add pattern

    Button to add more exclude patterns.

    Custom Classifier

    If you want the AWS crawler created by the S3 integration to use a specified custom classifier, add the name of the classifier in this field. The custom classifier should be created in the AWS Glue console. For more information, go to the AWS Glue documentation.

    You can add multiple classifiers by clicking Add Custom Classifier.

  6. Click Save.
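
To make the File Group Pattern field from step 5 more concrete, the following minimal sketch checks which file names the example regex (\w*)_\d\d\d\d\.csv would select. It is only an illustration, not how the integration is implemented. It assumes that the pattern must match the whole file name, which this page does not state, and it uses Python's re module with the ASCII flag as a stand-in for the Java regex engine; for this simple pattern the two behave the same. The function name and the sample file names are hypothetical.

```python
import re

# Example pattern from the File Group Pattern description.
# re.ASCII keeps \w and \d ASCII-only, similar to Java's default behavior.
FILE_GROUP_PATTERN = re.compile(r"(\w*)_\d\d\d\d\.csv", re.ASCII)


def split_by_pattern(file_names):
    """Separate file names that match the whole pattern from those that do not.

    Assumption: the pattern is applied to the complete file name (full match).
    """
    matched = []
    unmatched = []
    for name in file_names:
        if FILE_GROUP_PATTERN.fullmatch(name):
            matched.append(name)
        else:
            unmatched.append(name)
    return matched, unmatched


# Hypothetical file names for illustration only.
sample_files = ["sales_2023.csv", "sales_2024.csv", "readme.txt", "sales_24.csv"]
matched, unmatched = split_by_pattern(sample_files)
print("Would be grouped:", matched)
print("Left as they are:", unmatched)
```

In this sketch, sales_24.csv is not selected because the pattern requires exactly four digits after the underscore. Testing the same regex on Regex101 with the Java 8 flavor is still the more reliable check before you save the crawler.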
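
The Exclude patterns field from step 5 can be explored in a similar way. The sketch below is a simplified approximation, not the AWS Glue implementation: it supports only the *, ** and ? wildcards, and it assumes the glob semantics described in the AWS Glue documentation, where * and ? stay within one folder level, ** crosses folder boundaries, and patterns are evaluated relative to the include path. The function names and object keys are hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Translate a simplified glob (only **, * and ?) into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # any characters, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # any characters within one folder level
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # exactly one character, not "/"
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # literal character
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key, exclude_patterns):
    """Return True if the key (relative to the include path) matches any pattern."""
    return any(glob_to_regex(p).fullmatch(relative_key) for p in exclude_patterns)


# Hypothetical patterns and object keys, relative to the include path.
patterns = ["*.tmp", "archive/**"]
for key in ["data.tmp", "2024/data.tmp", "archive/2023/jan.csv", "2024/sales.csv"]:
    print(key, "excluded" if is_excluded(key, patterns) else "crawled")
```

Under these assumptions, *.tmp excludes only .tmp objects directly in the include path, while archive/** excludes everything below the archive folder. Bracket expressions and brace groups are deliberately left out of this sketch; refer to the AWS Glue documentation for the full syntax.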