Managing crawlers
A crawler is an automated script that ingests data from Amazon S3 into Data Catalog.
You can create, edit and delete crawlers in Collibra Platform. When you synchronize Amazon S3, the crawlers are created in AWS Glue and executed. Each crawler crawls a location in Amazon S3 based on its include path. The results are stored in one AWS Glue database per domain assigned to one or more crawlers. Those databases are ingested in Data Catalog in the form of assets, attributes and relations. The databases are stored in AWS Glue until the next synchronization. At that moment, they are deleted and recreated. The crawlers in AWS Glue are deleted immediately after the synchronization is finished.
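The lifecycle described above can be sketched as a small model. `Crawler`, `synchronize`, and the per-domain grouping below are illustrative Python names, not Collibra's or AWS Glue's actual API:

```python
from dataclasses import dataclass

@dataclass
class Crawler:
    name: str
    domain: str        # the domain the crawler is assigned to
    include_path: str  # the S3 location it crawls, e.g. "s3://bucket/prefix"

def synchronize(crawlers: list[Crawler]) -> dict[str, list[str]]:
    """Model one synchronization run: results are grouped into one
    Glue database per domain, and the crawlers themselves are deleted
    as soon as the run finishes."""
    databases: dict[str, list[str]] = {}
    for crawler in crawlers:
        # Each crawler stores its results in its domain's database.
        databases.setdefault(crawler.domain, []).append(crawler.include_path)
    # Crawlers in AWS Glue are deleted after synchronization completes.
    crawlers.clear()
    return databases
```

For example, two crawlers assigned to the same domain produce a single database containing both include paths; on the next synchronization, the databases are deleted and recreated from scratch.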
- If you completed the Glue database configuration parameter in the capability, you don't need to create crawlers. However, you currently still need to create a dummy crawler, that is, a crawler with an invalid include path, such as s3://dummy. Dummy crawlers are not taken into account when you run the synchronization.
In a future release, we'll remove the need for a dummy crawler.
- By default, AWS Glue allows up to 1,000 crawlers per account.
You can synchronize several S3 File System assets simultaneously, but if the total number of crawlers exceeds this limit, synchronization fails. Because Collibra deletes the crawlers from AWS Glue after each synchronization, it is safer to synchronize each S3 File System asset at a different time.
For more information, see the AWS Glue documentation.
- Crawlers in AWS Glue can crawl multiple buckets, but in Collibra, each crawler can only crawl a single bucket.
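The two constraints above (the default 1,000-crawler quota and the one-bucket-per-crawler rule) can be checked before synchronizing. This is a minimal sketch; `validate_plan` and `bucket_of` are illustrative helpers, not part of Collibra or the AWS SDK:

```python
GLUE_CRAWLER_QUOTA = 1000  # AWS Glue default: crawlers per account

def bucket_of(include_path: str) -> str:
    """Return the bucket portion of an s3:// include path."""
    return include_path.removeprefix("s3://").split("/", 1)[0]

def validate_plan(crawlers: dict[str, list[str]]) -> list[str]:
    """crawlers maps a crawler name to its include paths.
    Returns human-readable problems; an empty list means the plan is OK."""
    problems = []
    # Exceeding the account-wide quota makes the synchronization fail.
    if len(crawlers) > GLUE_CRAWLER_QUOTA:
        problems.append(
            f"{len(crawlers)} crawlers exceed the default Glue quota "
            f"of {GLUE_CRAWLER_QUOTA}"
        )
    for name, paths in crawlers.items():
        # In Collibra, each crawler may only crawl a single bucket.
        if len({bucket_of(p) for p in paths}) > 1:
            problems.append(f"crawler {name!r} spans multiple buckets")
    return problems
```

A plan whose crawlers each stay within one bucket passes; a crawler whose include paths span two buckets is flagged.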