Managing crawlers

A crawler is an automated script that ingests data from Amazon S3 into Data Catalog.

You can create, edit, and delete crawlers in Collibra Platform Self-Hosted (CPSH). When you synchronize Amazon S3, the crawlers are created in AWS Glue and executed. Each crawler crawls a location in Amazon S3 based on its include path. The results are stored in AWS Glue databases: one database per domain that is assigned to one or more crawlers. Those databases are ingested into Data Catalog in the form of assets, attributes, and relations. The databases remain in AWS Glue until the next synchronization, at which point they are deleted and recreated. The crawlers themselves are deleted from AWS Glue immediately after the synchronization finishes.
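The lifecycle described above can be sketched as a small simulation. This is an illustrative model only, not Collibra or AWS Glue API code; the names `Crawler` and `sync` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Crawler:
    name: str
    include_path: str   # the S3 location this crawler crawls, e.g. "s3://bucket/prefix/"
    domain: str         # the domain whose Glue database receives the results

def sync(crawlers):
    """Simulate one synchronization run (hypothetical sketch)."""
    glue_databases = {}  # one AWS Glue database per domain
    for c in crawlers:
        # Each crawler writes its results into its domain's database.
        glue_databases.setdefault(c.domain, []).append(c.include_path)
    # The databases are ingested into Data Catalog as assets, attributes,
    # and relations, and remain in AWS Glue until the next synchronization.
    ingested = {domain: list(paths) for domain, paths in glue_databases.items()}
    # The crawlers are deleted from AWS Glue immediately after synchronization.
    crawlers.clear()
    return ingested

crawlers = [
    Crawler("sales-crawler", "s3://sales-bucket/2024/", "Sales"),
    Crawler("hr-crawler", "s3://hr-bucket/", "HR"),
]
result = sync(crawlers)
# result maps each domain to the include paths crawled into its Glue database;
# the crawlers list is now empty, mirroring their deletion after synchronization.
```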

Important 
  • If you filled in the Glue database configuration parameter in the capability, you don't need to create crawlers.
  • By default, AWS Glue allows up to 1,000 crawlers per account.
    You can synchronize several S3 File System assets simultaneously, but if the total number of crawlers exceeds the AWS Glue maximum, synchronization fails. Because CPSH deletes the crawlers from AWS Glue after each synchronization, it is safer to synchronize each S3 File System asset at a different time.
    For more information, see the AWS Glue documentation.
  • Crawlers in AWS Glue can crawl multiple buckets, but in CPSH, each crawler can only crawl a single bucket.
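The crawler quota mentioned above can be illustrated with a simple pre-check. The helper name and counts are hypothetical; the real limit is an AWS Glue service quota per account.

```python
GLUE_CRAWLER_QUOTA = 1000  # default AWS Glue crawler limit per account

def can_synchronize(existing_crawlers: int, crawlers_to_create: int,
                    quota: int = GLUE_CRAWLER_QUOTA) -> bool:
    """Return True if creating the new crawlers stays within the quota."""
    return existing_crawlers + crawlers_to_create <= quota

# Synchronizing two S3 File System assets at the same time adds their
# crawler counts together, which is why staggering synchronizations is safer.
print(can_synchronize(existing_crawlers=990, crawlers_to_create=8))   # True
print(can_synchronize(existing_crawlers=990, crawlers_to_create=15))  # False
```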