Create a crawler for Google Cloud Storage

You can create a crawler for Google Cloud Storage (GCS) to specify the folders that you want to synchronize.

Prerequisites

You have registered a GCS file system.
You have connected the GCS File System asset to the GCS Edge capability.
You have a global role with the View Edge connections and capabilities global permission, for example, Edge integration engineer.
You have a global role with the Catalog global permission, for example, Catalog Author.
You have a resource role with the Configure external system resource permission, for example, Owner.

Steps

Open the GCS File System asset.
In the tab bar, click Configuration.
In the Crawlers section, click Edit Configuration.
Click Add Crawler.

Enter the required information.

Field	Description
Domain	The domain in which the assets of the GCS file system are to be created.
Name	The name you want to give to the crawler in Collibra. The crawler name character limit is 255.
Include path	The case-sensitive path to a directory of a bucket in GCS. All objects and subdirectories of this path are taken into account during the synchronization. Use the following structure to refer to the path: `gs://{bucketname}/{path(optional)}`. Example: In GCS, one of the buckets is called "marketing" and contains the directory "mkt". To include the whole bucket, the path must be `gs://marketing`. To include only the "mkt" directory of that bucket, the path must be `gs://marketing/mkt/`.
Exclude patterns	A pattern that represents the objects that are included via the Include path but that you want to exclude from the synchronization. When you define a pattern, you can use the following rules: `*` matches zero or more characters. `**` matches zero or more directories in a path. `?` matches one character. Example: `comm/*.jsp` matches all .jsp files in the comm directory. `comm/t?st.jsp` matches comm/test.jsp but also comm/tast.jsp or comm/txst.jsp. `comm/**/test.jsp` matches all test.jsp files in the comm path. `org/framework/**/*.jsp` matches all .jsp files in the org/framework path. `org/**/servlet/test.jsp` matches org/framework/servlet/test.jsp but also org/framework/testing/servlet/test.jsp and org/servlet/test.jsp.
Add pattern	A button to add additional exclude patterns.

Click Create.
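
The Include path structure described in the table above can be checked before you enter it. The following is a minimal Python sketch, not Collibra's implementation: the function name `parse_include_path` and the `IncludePath` type are hypothetical and exist only for illustration. It splits a value of the form `gs://{bucketname}/{path(optional)}` into a bucket name and a directory prefix, and returns `None` when the value does not follow that structure.

```python
from typing import NamedTuple, Optional


class IncludePath(NamedTuple):
    """Hypothetical helper type: a bucket plus an optional directory prefix."""
    bucket: str
    prefix: str  # "" for the whole bucket, otherwise ends with "/"


def parse_include_path(include_path: str) -> Optional[IncludePath]:
    """Split gs://{bucketname}/{path(optional)} into bucket and prefix.

    Returns None when the value does not follow that structure.
    """
    scheme = "gs://"
    if not include_path.startswith(scheme):  # the path is case-sensitive
        return None
    bucket, _, path = include_path[len(scheme):].partition("/")
    if not bucket:
        return None
    path = path.strip("/")
    return IncludePath(bucket, f"{path}/" if path else "")


print(parse_include_path("gs://marketing"))       # IncludePath(bucket='marketing', prefix='')
print(parse_include_path("gs://marketing/mkt/"))  # IncludePath(bucket='marketing', prefix='mkt/')
print(parse_include_path("marketing"))            # None
```

Normalizing the prefix to end with a slash is a design choice of this sketch: it keeps a prefix such as `mkt/` from also matching a sibling directory such as `mkt-archive/`.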

Example: Filtering logic

Here's an example of how the Include path and Exclude patterns work together.

The following files exist in bucket1 of the GCS file system:

myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/departments/market-ap.txt
myfolder/employees/hr.json
myfolder/employees/john.csv
myfolder/employees/jane.csv
myfolder/employees/juan.txt
myfolder/report.xlsx
rubbish.txt

The following table shows the result for different combinations of Include path and Exclude patterns:

Include path	Exclude pattern	What does it mean?	Result
`gs://bucket1/` or `gs://bucket1`	<none>	All files in `gs://bucket1` are included.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/employees/hr.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/employees/juan.txt myfolder/report.xlsx rubbish.txt
`bucket1`	<none>	None of the files are included because the Include path is defined incorrectly.	<none>
`gs://bucket1/`	`*.txt` `*.json`	All files in `gs://bucket1/` are included, except the following files: The TXT files in the main folder `gs://bucket1`. The JSON files in the main folder `gs://bucket1`.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/employees/hr.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/employees/juan.txt myfolder/report.xlsx
`gs://bucket1/`	`**/*.txt` `myfolder/employees/*.json`	All files in `gs://bucket1/` are included, except the following files: The TXT files in `gs://bucket1` and all of its subfolders. The JSON files in the subfolder `gs://bucket1/myfolder/employees/`.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/report.xlsx
`gs://bucket1/`	`myfolder/**/*.txt`	All files in `gs://bucket1/` are included, except the TXT files in all subfolders of `gs://bucket1/myfolder/`.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/employees/hr.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/report.xlsx rubbish.txt
`gs://bucket1/myfolder`	`employees/**` `myfolder/departments/**`	All files in `gs://bucket1/myfolder/` are included, except: All files in `gs://bucket1/myfolder/employees` and its subfolders. All files in `gs://bucket1/myfolder/myfolder/departments/` (no such folder exists, so this pattern excludes nothing).	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/report.xlsx
`gs://bucket1/myfolder/departments`	`*json`	All files in `gs://bucket1/myfolder/departments` are included except all the JSON files in this folder.	myfolder/departments/market-ap.txt
`gs://bucket1/`	`**/j???.*`	All files in `gs://bucket1/` are included, except the files in any folder of bucket1 whose name starts with `j`, followed by exactly 3 characters and a file extension.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/employees/hr.json myfolder/report.xlsx rubbish.txt
`gs://bucket1`	`myfolder/**`	All files in `gs://bucket1/` are included, except the files in `myfolder/`.	rubbish.txt
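
The filtering logic in the table can be approximated with a short Python sketch. This is an illustration of the documented rules, not Collibra's implementation. The names `pattern_to_regex` and `synchronized` are hypothetical. The sketch assumes that exclude patterns are evaluated relative to the Include path, as the `gs://bucket1/myfolder` row suggests, and that an Include path without the `gs://` prefix matches nothing.

```python
import re

BUCKET = "bucket1"
FILES = [
    "myfolder/departments/finance.json",
    "myfolder/departments/market-us.json",
    "myfolder/departments/market-emea.json",
    "myfolder/departments/market-ap.txt",
    "myfolder/employees/hr.json",
    "myfolder/employees/john.csv",
    "myfolder/employees/jane.csv",
    "myfolder/employees/juan.txt",
    "myfolder/report.xlsx",
    "rubbish.txt",
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate an exclude pattern using the *, ** and ? rules into a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(r".*")           # everything below this point
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")        # zero or more characters in one segment
            i += 1
        elif pattern[i] == "?":
            parts.append(r"[^/]")         # exactly one character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def synchronized(include_path, exclude_patterns=()):
    """Return the files that remain after applying the include path and excludes."""
    if not include_path.startswith("gs://"):
        return []  # incorrectly defined Include path: nothing is included
    bucket, _, path = include_path[len("gs://"):].partition("/")
    if bucket != BUCKET:
        return []
    path = path.strip("/")
    prefix = f"{path}/" if path else ""
    regexes = [pattern_to_regex(p) for p in exclude_patterns]
    kept = []
    for key in FILES:
        if not key.startswith(prefix):
            continue
        relative = key[len(prefix):]  # assumption: patterns are relative to the prefix
        if not any(r.fullmatch(relative) for r in regexes):
            kept.append(key)
    return kept


print(synchronized("gs://bucket1", ["myfolder/**"]))  # ['rubbish.txt']
print(synchronized("bucket1"))                        # []
```

Running `synchronized` with the other Include path and Exclude pattern combinations from the table reproduces the Result column under these assumptions.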

What's next

You can now synchronize GCS manually or add a synchronization schedule.