Create a crawler

Important 

In Collibra 2024.05, we've launched a new user interface (UI) for Collibra Data Intelligence Platform! You can learn more about this latest UI in the UI overview.

You can create a crawler for an S3 File System asset in Data Catalog.

Important 
  • If you completed the Glue database configuration parameter in the capability, you need to create a dummy crawler. A dummy crawler is a crawler with an invalid include path, such as s3://dummy. This crawler won't be taken into account when you run the synchronization.
    In a future release, we'll remove the need for a dummy crawler.
  • By default, AWS Glue allows up to 25 crawlers per account. For more information, see the AWS Glue documentation. This has consequences for Collibra:
    • If you created crawlers in AWS Glue directly, Collibra can create fewer crawlers for synchronization.
    • Because Collibra creates the crawlers in AWS Glue during synchronization, you should avoid having 25 or more crawlers in one S3 File System asset.
    • You can synchronize several S3 File System assets simultaneously, but if the total number of crawlers exceeds the maximum number allowed in AWS Glue, synchronization fails. Because Collibra deletes the crawlers from AWS Glue after synchronization, it is safer to synchronize each S3 File System asset at a different time.
  • Crawlers in AWS Glue can crawl multiple buckets, but in Collibra, each crawler can only crawl a single bucket.
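
The crawler limit described above can be checked before you schedule synchronizations. The following is a minimal sketch, not part of Collibra or AWS: the function name `plan_sync_batches`, its parameters, and the asset names are hypothetical, and the limit of 25 is the default quoted above (your account's quota may differ). It groups S3 File System assets into batches whose combined crawler count, plus any crawlers already created directly in AWS Glue, stays within the limit.

```python
from typing import Dict, List

# Default AWS Glue crawler limit quoted in the text above (assumption:
# the quota for your account may have been changed).
DEFAULT_GLUE_CRAWLER_LIMIT = 25


def plan_sync_batches(
    crawlers_per_asset: Dict[str, int],
    existing_glue_crawlers: int = 0,
    limit: int = DEFAULT_GLUE_CRAWLER_LIMIT,
) -> List[List[str]]:
    """Group assets into batches that can be synchronized at the same time.

    crawlers_per_asset maps a (hypothetical) S3 File System asset name to
    the number of crawlers defined on it. existing_glue_crawlers is the
    number of crawlers that were created directly in AWS Glue.
    """
    capacity = limit - existing_glue_crawlers
    if capacity <= 0:
        raise ValueError("no crawler capacity left in AWS Glue")

    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for asset, count in crawlers_per_asset.items():
        if count > capacity:
            # This asset can never be synchronized within the limit.
            raise ValueError(f"{asset!r} has {count} crawlers, capacity is {capacity}")
        if used + count > capacity:
            # Start a new batch, to be synchronized at a different time.
            batches.append(current)
            current, used = [], 0
        current.append(asset)
        used += count
    if current:
        batches.append(current)
    return batches
```

For example, with 5 crawlers already present in AWS Glue and three assets of 10 crawlers each, the sketch returns two batches: the first two assets together, and the third on its own. The text above advises staying below 25 crawlers per asset, so you may prefer to pass a lower `limit` to keep a margin.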

Before you begin

  • You have registered an Amazon S3 file system.
  • You have connected an S3 File System asset to Amazon S3.

Requirements and permissions

Steps

  1. Open an S3 File System asset page.
  2. In the tab bar (tab pane in the classic UI), click Configuration.
  3. In the Crawlers section, click Edit Configuration.
  4. Click Add Crawler.
  5. In the Crawlers section, click Create crawler.
    The Create crawler dialog box appears.
  6. Enter the required information.
    Field
      Description

    Domain
      The domain in which the assets of the S3 file system are created.

    Name
      The name of the crawler in Collibra.

    Table Level
      Specify the level from which tables have to be created during the integration. By default, tables are created from the top level, level 1.
      Only specify a number if you want to create tables starting from another level, such as 2 or 3. For more information, go to the AWS documentation.

    Include path
      The case-sensitive path to a directory of a bucket in Amazon S3. All objects and subdirectories of this path are crawled.
      For more information and examples, go to the AWS Glue documentation.

    Exclude patterns
      Glob pattern that represents the objects that are in the include path, but that you want to exclude.
      For more information and examples, go to the AWS Glue documentation.

    Add pattern
      Button to add additional exclude patterns.

    Custom Classifier
      If you want the AWS crawler created by the S3 integration to use a specified custom classifier, add the name of the classifier in this field. The custom classifier should be created in the AWS Glue console. For more information, go to the AWS Glue documentation.
      You can add multiple classifiers by clicking Add Custom Classifier.

  7. Click Create.
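
Before you enter an include path and exclude patterns, it can help to preview which objects a crawler would pick up. The following is a rough sketch under stated assumptions, not the AWS Glue or Collibra implementation: the function names and sample keys are hypothetical, exclude patterns are assumed to be relative to the include path, and only a simplified glob syntax is handled (`*` within one folder level, `**` across folder levels, `?` for one character; bracket and brace groups are not supported).

```python
import re
from typing import Iterable, List, Tuple


def split_include_path(include_path: str) -> Tuple[str, str]:
    """Split an include path such as s3://bucket/dir into (bucket, prefix)."""
    scheme = "s3://"
    if not include_path.startswith(scheme):
        raise ValueError(f"not an S3 include path: {include_path!r}")
    bucket, _, prefix = include_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {include_path!r}")
    return bucket, prefix.strip("/")


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob pattern to a regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # crosses folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # stays within one folder level
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # exactly one character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def preview_crawl(
    include_path: str,
    object_keys: Iterable[str],
    exclude_patterns: Iterable[str],
) -> List[str]:
    """Return the keys under the include path that no exclude pattern matches."""
    _bucket, prefix = split_include_path(include_path)
    regexes = [glob_to_regex(p) for p in exclude_patterns]
    kept = []
    for key in object_keys:
        if prefix:
            # The include path is case-sensitive, so compare exactly.
            if not key.startswith(prefix + "/"):
                continue
            relative = key[len(prefix) + 1:]
        else:
            relative = key
        if any(r.fullmatch(relative) for r in regexes):
            continue
        kept.append(key)
    return kept
```

With the include path `s3://my-bucket/sales` and the exclude patterns `**/tmp/**` and `*.txt`, the sketch keeps `sales/2024/jan.csv`, drops `sales/2024/tmp/scratch.csv` and `sales/readme.txt`, and ignores `Sales/2024/feb.csv` because the prefix differs in case. For the exact matching rules, rely on the AWS Glue documentation.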

What's next?

You can now synchronize Amazon S3 manually or define a synchronization schedule.