Create a crawler for Google Cloud Storage
By creating a crawler for Google Cloud Storage (GCS), you can specify which folders you want to synchronize.
Before you begin
- You have registered a GCS file system.
- You have connected the GCS File System asset to the GCS Edge capability.
Prerequisites
- You have a global role with the View Edge connections and capabilities global permission, for example, Edge integration engineer.
- You have a global role with the Catalog global permission, for example, Catalog Author.
- You have a resource role with the Configure external system resource permission, for example, Owner.
Steps
- Open the GCS File System asset.
- In the tab pane, click Configuration.
- In the Crawlers section, click Create crawler.
  The Create crawler dialog appears.
- Enter the required information.
  | Field | Description |
  |---|---|
  | Domain | The domain in which the assets of the GCS file system are to be created. |
  | Name | The name you want to give to the crawler in Collibra. |
  | Include path | The case-sensitive path to a directory of a bucket in GCS. All objects and subdirectories of this path are taken into account during the synchronization. Use the following structure to refer to the path: gs://{bucketname}/{path(optional)}. Example: In GCS, one of the buckets is called "marketing" with directory "mkt". To include the whole bucket, the path must be: gs://marketing. To only include the "mkt" directory of that bucket, the path must be: gs://marketing/mkt/ |
  | Exclude patterns | A pattern that represents the objects that are included via the Include path, but that you want to exclude from the synchronization. When you define a pattern, you can use the following rules: * matches zero or more characters. ** matches zero or more directories in a path. ? matches one character. Example: comm/*.jsp matches all .jsp files in the comm directory. comm/t?st.jsp matches comm/test.jsp but also comm/tast.jsp or comm/txst.jsp. comm/**/test.jsp matches all test.jsp files in the comm path. org/framework/**/*.jsp matches all .jsp files in the org/framework path. org/**/servlet/test.jsp matches org/framework/servlet/test.jsp but also org/framework/testing/servlet/test.jsp and org/servlet/test.jsp. |
  | Add pattern | A button to add additional exclude patterns. |
- Click Create.
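
The Include path structure gs://{bucketname}/{path(optional)} can be illustrated with a short Python sketch. This is not how Collibra parses the value. The function name `parse_include_path` and the `IncludePath` type are hypothetical, and treating the `gs://` prefix as mandatory and case-sensitive is an assumption made for the illustration.

```python
from typing import NamedTuple, Optional


class IncludePath(NamedTuple):
    bucket: str
    directory: str  # an empty string stands for the whole bucket


def parse_include_path(include_path: str) -> Optional[IncludePath]:
    """Split gs://{bucketname}/{path(optional)} into bucket and directory.

    Returns None when the value does not follow that structure
    (assumption: the gs:// prefix is required).
    """
    scheme = "gs://"
    if not include_path.startswith(scheme):
        return None
    bucket, _, directory = include_path[len(scheme):].partition("/")
    if not bucket:
        return None
    # Leading and trailing slashes carry no meaning for the directory part.
    return IncludePath(bucket, directory.strip("/"))


print(parse_include_path("gs://marketing"))       # the whole bucket
print(parse_include_path("gs://marketing/mkt/"))  # only the "mkt" directory
print(parse_include_path("marketing"))            # no gs:// prefix
```

In this sketch, a value without the `gs://` prefix yields `None`, which mirrors the idea that an incorrect Include path selects nothing.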
Example on how the Include path and the Exclude patterns work together
In bucket1 of the GCS system, the following files exist:
myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/departments/market-ap.txt
myfolder/employees/hr.json
myfolder/employees/john.csv
myfolder/employees/jane.csv
myfolder/employees/juan.txt
myfolder/report.xlsx
rubbish.txt
The following table shows the results for several combinations of Include path and Exclude patterns:
| Include path | Exclude pattern | What does it mean? | Result |
|---|---|---|---|
| gs://bucket1/ | <none> | All files in gs://bucket1 are taken into account. | myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/employees/hr.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/employees/juan.txt myfolder/report.xlsx rubbish.txt |
| gs://bucket1 | <none> | All files in gs://bucket1 are taken into account. | myfolder/departments/finance.json myfolder/departments/market-us.json myfolder/departments/market-emea.json myfolder/departments/market-ap.txt myfolder/employees/hr.json myfolder/employees/john.csv myfolder/employees/jane.csv myfolder/employees/juan.txt myfolder/report.xlsx rubbish.txt |
| bucket1 | <none> | None of the files are taken into account because the Include path is not correct. | <none> |
| gs://bucket1/ | *.txt **.json | All files in gs://bucket1/ are taken into account, except the files that match the exclude patterns. | myfolder/departments/finance.json |
| gs://bucket1/ | **/*.txt myfolder/employees/*.json | All files in gs://bucket1/ are taken into account, except the files that match the exclude patterns. | myfolder/departments/finance.json |
| gs://bucket1/ | myfolder/**/*txt | All files in gs://bucket1/ are taken into account, except the TXT files in all subfolders of gs://bucket1/myfolder/. | myfolder/departments/finance.json |
| gs://bucket1/myfolder | employees/* myfolder/departments/* | All files in gs://bucket1/myfolder/ are taken into account, except the files that match the exclude patterns. | myfolder/departments/finance.json |
| gs://bucket1/myfolder/departments | *json | All files in gs://bucket1/myfolder/departments are taken into account, except all JSON files in this folder. | myfolder/departments/market-ap.txt |
| gs://bucket1/ | **/j???.* | All files in gs://bucket1/ are taken into account, except the files starting with j followed by three characters, from all subfolders in bucket1. | myfolder/departments/finance.json |
| gs://bucket1 | myfolder/** | All files in gs://bucket1/ are taken into account, except for the files in myfolder/. | rubbish.txt |
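
To experiment with how an Include path and Exclude patterns interact, the pattern rules (* for zero or more characters, ** for zero or more directories, ? for one character) can be approximated in a few lines of Python. This is a rough sketch, not Collibra's implementation. The helper names `pattern_to_regex` and `select` are hypothetical, and the sketch assumes that exclude patterns are evaluated relative to the Include path and that * and ? never match a "/".

```python
import re
from typing import List

# File listing of bucket1, copied from the example above.
BUCKET = "bucket1"
FILES = [
    "myfolder/departments/finance.json",
    "myfolder/departments/market-us.json",
    "myfolder/departments/market-emea.json",
    "myfolder/departments/market-ap.txt",
    "myfolder/employees/hr.json",
    "myfolder/employees/john.csv",
    "myfolder/employees/jane.csv",
    "myfolder/employees/juan.txt",
    "myfolder/report.xlsx",
    "rubbish.txt",
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate an exclude pattern into a regular expression."""
    segments = pattern.split("/")
    parts = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == "**":
            # "**" as a whole path segment: zero or more directories.
            parts.append(".*" if is_last else "(?:[^/]+/)*")
            continue
        # Assumption: inside a segment, "*" and "?" do not cross a "/".
        translated = "".join(
            "[^/]*" if ch == "*" else "[^/]" if ch == "?" else re.escape(ch)
            for ch in segment
        )
        parts.append(translated if is_last else translated + "/")
    return re.compile("".join(parts))


def select(include_path: str, exclude_patterns: List[str]) -> List[str]:
    """Return the files of FILES that would be taken into account."""
    scheme = "gs://"
    if not include_path.startswith(scheme):
        return []  # for example "bucket1" without the gs:// prefix
    bucket, _, directory = include_path[len(scheme):].partition("/")
    if bucket != BUCKET:
        return []
    prefix = directory.strip("/")
    prefix = prefix + "/" if prefix else ""
    regexes = [pattern_to_regex(p) for p in exclude_patterns]
    selected = []
    for name in FILES:
        if not name.startswith(prefix):
            continue
        # Assumption: patterns are matched against the path relative
        # to the Include path, not against the full object name.
        relative = name[len(prefix):]
        if not any(regex.fullmatch(relative) for regex in regexes):
            selected.append(name)
    return selected


print(select("gs://bucket1/myfolder/departments", ["*json"]))
# ['myfolder/departments/market-ap.txt']
print(select("gs://bucket1", ["myfolder/**"]))
# ['rubbish.txt']
print(select("bucket1", []))
# []
```

The three calls reproduce the corresponding rows of the table. For other combinations, treat the output as an approximation, because the exact matching behavior of the crawler is not specified here.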
What's next?
You can now synchronize GCS manually or define a synchronization schedule.