Managing crawlers

A crawler is an automated script that ingests data from Amazon S3 to Data Catalog.

You can create, edit and delete crawlers in Collibra Data Intelligence Platform. When you synchronize Amazon S3, the crawlers are created in AWS Glue and executed. Each crawler crawls a location in Amazon S3 based on its include path. The results are stored in one AWS Glue database per domain assigned to one or more crawlers. Those databases are ingested in Data Catalog in the form of assets, attributes and relations. The databases are stored in AWS Glue until the next synchronization. At that moment, they are deleted and recreated. The crawlers in AWS Glue are deleted immediately after the synchronization is finished.

Important 
  • If you completed the Glue database configuration parameter in the capability, you need to create a dummy crawler. A dummy crawler is a crawler with an invalid include path, such as s3://dummy. This crawler won't be taken into account when you run the synchronization.
    In a future release, we'll remove the need for a dummy crawler.
  • By default, AWS Glue allows up to 25 crawlers per account. For more information, see the AWS Glue documentation. This has consequences for Collibra:
    • If you created crawlers in AWS Glue directly, Collibra can create less crawlers for synchronization.
    • Because Collibra creates the crawlers in AWS Glue during synchronization, you should avoid having 25 or more crawlers in one S3 File System asset.
    • You can synchronize several S3 File System assets simultaneously, but if the total number of crawlers exceeds the maximum amount in AWS Glue, synchronization will fail. Since Collibra deletes the crawlers from AWS Glue after synchronization, it is safer to synchronize each S3 File System asset at a unique time.
  • Crawlers in AWS Glue can crawl multiple buckets, but in Collibra, each crawler can only crawl a single bucket.