Create a crawler for Google Cloud Storage

By creating a crawler for Google Cloud Storage (GCS), you can specify which folders you want to synchronize.

Before you begin

You have registered a GCS file system.
You have connected the GCS File System asset to the GCS Edge capability.

Prerequisites

You have a global role with the View Edge connections and capabilities global permission, for example, Edge integration engineer.
You have a global role with the Catalog global permission, for example, Catalog Author.
You have a resource role with the Configure external system resource permission, for example, Owner.

Steps

Open the GCS File System asset.
In the tab pane, click Configuration.
In the Crawlers section, click Create crawler.
The Create crawler dialog appears.

Enter the required information.

Field	Description
Domain	The domain in which the assets of the GCS file system are to be created.
Name	The name you want to give to the crawler in Collibra.
Include path	The case-sensitive path to a directory of a bucket in GCS. All objects and subdirectories of this path are taken into account during the synchronization. Use the following structure to refer to the path: `gs://{bucketname}/{path(optional)}` Example: In GCS, one of the buckets is called "marketing" and contains the directory "mkt". To include the whole bucket, the path must be `gs://marketing` To include only the "mkt" directory of that bucket, the path must be `gs://marketing/mkt/`
Exclude patterns	A pattern that represents the objects that are included via the Include path, but that you want to exclude from the synchronization. When you define a pattern, you can use the following rules: `*` matches zero or more characters. `**` matches zero or more directories in a path. `?` matches one character. Example: `comm/*.jsp` matches all .jsp files in the comm directory. `comm/t?st.jsp` matches comm/test.jsp but also comm/tast.jsp or comm/txst.jsp. `comm/**/test.jsp` matches all test.jsp files in the comm path. `org/framework/**/*.jsp` matches all .jsp files in the org/framework path. `org/**/servlet/test.jsp` matches org/framework/servlet/test.jsp but also org/framework/testing/servlet/test.jsp and org/servlet/test.jsp.
Add pattern	A button to add additional exclude patterns.

Click Create.
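
The Include path structure described above can be checked before you enter it in the dialog. The following is a minimal sketch in Python, not part of Collibra: the function name parse_include_path is hypothetical, and the sketch only assumes the gs://{bucketname}/{path(optional)} structure given in the table.

```python
def parse_include_path(include_path: str) -> tuple[str, str]:
    """Split an Include path into (bucket, directory).

    Hypothetical helper for illustration only. It assumes the structure
    gs://{bucketname}/{path(optional)} and keeps the case of the path,
    because the Include path is case-sensitive.
    """
    scheme = "gs://"
    if not include_path.startswith(scheme):
        raise ValueError(f"Include path must start with {scheme}: {include_path!r}")
    # Everything up to the first "/" is the bucket; the rest is the directory.
    bucket, _, directory = include_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Include path has no bucket name: {include_path!r}")
    # A trailing slash is optional, so it is dropped here.
    return bucket, directory.strip("/")


print(parse_include_path("gs://marketing"))       # ('marketing', '')
print(parse_include_path("gs://marketing/mkt/"))  # ('marketing', 'mkt')
```

A value such as marketing/mkt, without the gs:// prefix, raises a ValueError in this sketch.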

Example of how the Include path and the Exclude patterns work together

In bucket1 of the GCS system, the following files exist:

myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/departments/market-ap.txt
myfolder/employees/hr.json
myfolder/employees/john.csv
myfolder/employees/jane.csv
myfolder/employees/juan.txt
myfolder/report.xlsx
rubbish.txt

The following table shows the results for several combinations of Include path and Exclude patterns:

Include path	Exclude pattern	What does it mean?	Result
gs://bucket1/	<none>	All files in gs://bucket1 are taken into account.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/employees/hr.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/employees/juan.txt myfolder/report.xlsx rubbish.txt
gs://bucket1	<none>	All files in gs://bucket1 are taken into account.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/employees/hr.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/employees/juan.txt myfolder/report.xlsx rubbish.txt
bucket1	<none>	None of the files are taken into account because the Include path is not correct.	<none>
gs://bucket1/	*.txt *.json	All files in gs://bucket1/ are taken into account, except the TXT files and the JSON files in the main folder gs://bucket1.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/employees/hr.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/employees/juan.txt myfolder/report.xlsx
gs://bucket1/	**/*.txt myfolder/employees/*.json	All files in gs://bucket1/ are taken into account, except the TXT files in gs://bucket1 and all of its subfolders, and the JSON files in subfolder gs://bucket1/myfolder/employees/.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/report.xlsx
gs://bucket1/	myfolder/**/*.txt	All files in gs://bucket1/ are taken into account, except the TXT files in all subfolders of gs://bucket1/myfolder/.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/employees/hr.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/report.xlsx rubbish.txt
gs://bucket1/myfolder	employees/* myfolder/departments/*	All files in gs://bucket1/myfolder/ are taken into account, except all files in gs://bucket1/myfolder/employees/ and all files in gs://bucket1/myfolder/myfolder/departments/, a folder that does not exist in this example.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/report.xlsx
gs://bucket1/myfolder/departments	*json	All files in gs://bucket1/myfolder/departments are taken into account except all JSON files in this folder.	myfolder/departments/market-ap.txt
gs://bucket1/	**/j???.*	All files in gs://bucket1/ are taken into account, except the files whose names start with j followed by three characters, from all subfolders in bucket1.	myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/employees/hr.json myfolder/report.xlsx rubbish.txt
gs://bucket1	myfolder/**	All files in gs://bucket1/ are taken into account, except for the files in myfolder/.	rubbish.txt
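
To try out combinations like the ones in the table, the following Python sketch applies an Include path and a list of Exclude patterns to the example files of bucket1. It is an illustration only and not how Collibra implements the crawler. The names FILES, pattern_to_regex, and crawl are hypothetical, and the sketch assumes that `*` and `?` do not match across a `/` and that Exclude patterns are evaluated relative to the Include path, which is what the results in the table suggest.

```python
import re

# The example files in bucket1, taken from the list above.
FILES = [
    "myfolder/departments/finance.json",
    "myfolder/departments/market-us.json",
    "myfolder/departments/market-emea.json",
    "myfolder/departments/market-ap.txt",
    "myfolder/employees/hr.json",
    "myfolder/employees/john.csv",
    "myfolder/employees/jane.csv",
    "myfolder/employees/juan.txt",
    "myfolder/report.xlsx",
    "rubbish.txt",
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate an Exclude pattern into a regular expression (assumed semantics)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(r".*")           # everything below this point
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")        # zero or more characters in one name
            i += 1
        elif pattern[i] == "?":
            parts.append(r"[^/]")         # exactly one character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def crawl(include_path, exclude_patterns=(), files=FILES):
    """Return the files that remain after applying the Include path and Exclude patterns."""
    if not include_path.startswith("gs://"):
        return []  # an Include path without gs:// is not correct
    _bucket, _, prefix = include_path[len("gs://"):].partition("/")
    prefix = prefix.strip("/")
    prefix = prefix + "/" if prefix else ""
    regexes = [pattern_to_regex(p) for p in exclude_patterns]
    kept = []
    for name in files:
        if not name.startswith(prefix):
            continue
        relative = name[len(prefix):]  # patterns are matched relative to the Include path
        if not any(r.fullmatch(relative) for r in regexes):
            kept.append(name)
    return kept


print(crawl("gs://bucket1", ["myfolder/**"]))                  # ['rubbish.txt']
print(crawl("gs://bucket1/myfolder/departments", ["*json"]))   # ['myfolder/departments/market-ap.txt']
```

Running the sketch against the other rows of the table is a quick way to check a pattern before you save it in the crawler, keeping in mind that the actual crawler may treat edge cases differently.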

What's next?

You can now synchronize GCS manually or define a synchronization schedule.