Create a crawler for Google Cloud Storage
You can create a crawler for Google Cloud Storage (GCS) to specify the folders that you want to synchronize.
Prerequisites
- You have registered a GCS file system.
- You have connected the GCS File System asset to the GCS Edge capability.
- You have a global role with the View Edge connections and capabilities global permission, for example, Edge integration engineer.
- You have a global role with the Catalog global permission, for example, Catalog Author.
- You have a resource role with the Configure external system resource permission, for example, Owner.
Steps
- Open the GCS File System asset.
- In the tab bar, click Configuration.
- In the Crawlers section, click Edit Configuration.
- Click Add Crawler.
- Enter the required information.
  - Domain: The domain in which the assets of the GCS file system are to be created.
  - Name: The name you want to give to the crawler in Collibra. The crawler name character limit is 255.
  - Include path: The case-sensitive path to a directory of a bucket in GCS. All objects and subdirectories of this path are taken into account during the synchronization.
    Use the following structure to refer to the path: gs://{bucketname}/{path(optional)}
    Example: In GCS, one of the buckets is called "marketing" with directory "mkt".
    - To include the whole bucket, the path must be: gs://marketing
    - To only include the "mkt" directory of that bucket, the path must be: gs://marketing/mkt/
  - Exclude patterns: A pattern that represents the objects that are included via the Include path but that you want to exclude from the synchronization.
    When you define a pattern, you can use the following rules:
    - * matches zero or more characters.
    - ** matches zero or more directories in a path.
    - ? matches one character.
    Example:
    - comm/*.jsp matches all .jsp files in the comm directory.
    - comm/t?st.jsp matches comm/test.jsp but also comm/tast.jsp or comm/txst.jsp.
    - comm/**/test.jsp matches all test.jsp files in the comm path.
    - org/framework/**/*.jsp matches all .jsp files in the org/framework path.
    - org/**/servlet/test.jsp matches org/framework/servlet/test.jsp but also org/framework/testing/servlet/test.jsp and org/servlet/test.jsp.
  - Add pattern: A button to add additional exclude patterns.
- Click Create.
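The wildcard rules for exclude patterns can be tried out locally before you enter a pattern. The following Python sketch translates a pattern into a regular expression and checks it against a path. It is only an illustration of the rules as described here, not Collibra's implementation: the function names pattern_to_regex and is_excluded are hypothetical, and the sketch assumes that * and ? do not match the / separator and that a ** not followed by / matches everything below that point.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate an exclude pattern into a compiled regular expression.

    Assumptions (not confirmed by the product documentation):
    "*" and "?" never match "/", and "**" without a following "/"
    matches any remaining characters, including "/".
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(r".*")           # anything below this point
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")        # zero or more characters
            i += 1
        elif pattern[i] == "?":
            parts.append(r"[^/]")         # exactly one character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # literal character
            i += 1
    return re.compile("".join(parts))


def is_excluded(path: str, pattern: str) -> bool:
    """Return True if the whole path matches the exclude pattern."""
    return pattern_to_regex(pattern).fullmatch(path) is not None


if __name__ == "__main__":
    checks = [
        ("comm/*.jsp", "comm/index.jsp"),
        ("comm/t?st.jsp", "comm/tast.jsp"),
        ("comm/**/test.jsp", "comm/a/b/test.jsp"),
        ("org/**/servlet/test.jsp", "org/servlet/test.jsp"),
    ]
    for pattern, path in checks:
        print(pattern, path, is_excluded(path, pattern))
```

Matching is case-sensitive in this sketch, in line with the case-sensitive Include path. If the crawler excludes more or fewer objects than the sketch predicts, trust the synchronization result over the sketch.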
Example: Filtering logic
Here's an example of how the Include path and Exclude patterns work together.
The following files exist in bucket1 of the GCS file system:
- myfolder/departments/finance.json
- myfolder/departments/market-us.json
- myfolder/departments/market-emea.json
- myfolder/departments/market-ap.txt
- myfolder/employees/hr.json
- myfolder/employees/john.csv
- myfolder/employees/jane.csv
- myfolder/employees/juan.txt
- myfolder/report.xlsx
- rubbish.txt
The following table shows how different combinations of Include path and Exclude patterns determine which files are synchronized:
| Include path | Exclude pattern | What does it mean? | Result |
|---|---|---|---|
| gs://bucket1/ or gs://bucket1 | <none> | All files in gs://bucket1 are included. | All files listed above |
| bucket1 | <none> | None of the files are included because the Include path is defined incorrectly. | <none> |
| gs://bucket1/ | *.txt and **.json | All files in gs://bucket1/ are included, except the files that match the exclude patterns. | |
| gs://bucket1/ | **/*.txt and myfolder/employees/*.json | All files in gs://bucket1/ are included, except the files that match the exclude patterns. | |
| gs://bucket1/ | myfolder/**/*txt | All files in gs://bucket1/ are included, except the files that match the exclude pattern. | |
| gs://bucket1/myfolder | employees/* and myfolder/departments/* | All files in gs://bucket1/myfolder/ are included, except the files that match the exclude patterns. | |
| gs://bucket1/myfolder/departments | *json | All files in gs://bucket1/myfolder/departments are included, except the files that match the exclude pattern. | myfolder/departments/market-ap.txt |
| gs://bucket1/ | **/j???.* | All files in gs://bucket1/ are included, except the files starting with j and followed by 3 characters from all subfolders in bucket1. | |
| gs://bucket1 | myfolder/** | All files in gs://bucket1 are included, except the files that match the exclude pattern. | rubbish.txt |
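To experiment with the filtering logic, the following Python sketch applies an Include path and a list of exclude patterns to the example files in bucket1. It is a simplified model, not the crawler's actual code. The function names split_include_path, segments_match, and crawl are hypothetical. The sketch assumes that exclude patterns are matched against the object path relative to the Include path, which is consistent with the gs://bucket1/myfolder/departments row, and that * and ? stay within one path segment.

```python
from fnmatch import fnmatchcase

FILES = [
    "myfolder/departments/finance.json",
    "myfolder/departments/market-us.json",
    "myfolder/departments/market-emea.json",
    "myfolder/departments/market-ap.txt",
    "myfolder/employees/hr.json",
    "myfolder/employees/john.csv",
    "myfolder/employees/jane.csv",
    "myfolder/employees/juan.txt",
    "myfolder/report.xlsx",
    "rubbish.txt",
]


def split_include_path(include_path):
    """Return (bucket, prefix), or None if the path lacks the gs:// scheme."""
    if not include_path.startswith("gs://"):
        return None
    bucket, _, prefix = include_path[len("gs://"):].partition("/")
    if not bucket:
        return None
    return bucket, prefix.strip("/")


def segments_match(pattern_parts, path_parts):
    """Match path segments against pattern segments, one directory level at a time.

    A "**" segment stands for zero or more directories. Other segments are
    compared with fnmatchcase, which also gives "[...]" a special meaning
    that the rules in this topic do not mention.
    """
    if not pattern_parts:
        return not path_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        return any(
            segments_match(rest, path_parts[i:])
            for i in range(len(path_parts) + 1)
        )
    return (
        bool(path_parts)
        and fnmatchcase(path_parts[0], head)
        and segments_match(rest, path_parts[1:])
    )


def crawl(include_path, exclude_patterns, bucket="bucket1", keys=FILES):
    """Return the keys that would be kept under the assumptions above."""
    parsed = split_include_path(include_path)
    if parsed is None or parsed[0] != bucket:
        return []  # incorrectly defined Include path: nothing is included
    prefix = parsed[1]
    kept = []
    for key in keys:
        if prefix:
            if not key.startswith(prefix + "/"):
                continue
            relative = key[len(prefix) + 1:]
        else:
            relative = key
        parts = relative.split("/")
        if any(segments_match(p.split("/"), parts) for p in exclude_patterns):
            continue
        kept.append(key)
    return kept


if __name__ == "__main__":
    print(crawl("bucket1", []))
    print(crawl("gs://bucket1/myfolder/departments", ["*json"]))
    print(crawl("gs://bucket1", ["myfolder/**"]))
```

Under these assumptions, the three calls reproduce the documented results for the bucket1, gs://bucket1/myfolder/departments, and myfolder/** rows. Rows without a documented result are best verified with an actual synchronization.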
You can now synchronize GCS manually or add a synchronization schedule.