Create a crawler for Google Cloud Storage

Important 

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Data Intelligence Platform! You can learn more about this latest UI in the UI overview.


By creating a crawler for Google Cloud Storage (GCS), you can specify which folders you want to synchronize.

Before you begin

Prerequisites

Steps

  1. Open the GCS File System asset.
  2. In the tab bar (the tab pane in the classic UI), click Configuration.
  3. In the Crawlers section, click Edit Configuration.
  4. Click Add Crawler.
  5. In the Crawlers section, click Create crawler.
    The Create crawler dialog appears.
  6. Enter the required information.
    Field  Description

    Domain

    The domain in which the assets of the GCS file system are to be created.

    Name

    The name you want to give to the crawler in Collibra.

    The crawler name can contain at most 255 characters.

    Include path

    The case-sensitive path to a directory of a bucket in GCS. All objects and subdirectories of this path are taken into account during the synchronization.
    Use the following structure to refer to the path: gs://{bucketname}/{path(optional)}

    Example 

    In GCS, a bucket named "marketing" contains a directory named "mkt".

    • To include the whole bucket, the path must be: gs://marketing
    • To only include the "mkt" directory of that bucket, the path must be: gs://marketing/mkt/

    Exclude patterns

    A pattern that represents the objects that are included via the Include path, but that you want to exclude from the synchronization.
    When you define a pattern, you can use the following rules:

    • * matches zero or more characters.
    • ** matches zero or more directories in a path.
    • ? matches one character.

    Example

    • comm/*.jsp matches all .jsp files in the comm directory.
    • comm/t?st.jsp matches comm/test.jsp but also comm/tast.jsp or comm/txst.jsp.
    • comm/**/test.jsp matches all test.jsp files in the comm path.
    • org/framework/**/*.jsp matches all .jsp files in the org/framework path.
    • org/**/servlet/test.jsp matches org/framework/servlet/test.jsp but also org/framework/testing/servlet/test.jsp and org/servlet/test.jsp.

    Add pattern

    A button to add additional exclude patterns.
  7. Click Create.
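
The Include path structure described in step 6 can be checked before you enter it in the dialog. The following is a minimal sketch, not Collibra's own validation: the function name parse_include_path is hypothetical, and it only assumes the gs://{bucketname}/{path(optional)} structure given above.

```python
def parse_include_path(include_path: str) -> tuple[str, str]:
    """Split an Include path into (bucket, directory).

    Hypothetical helper for illustration only. It assumes the documented
    structure gs://{bucketname}/{path(optional)}.
    """
    scheme = "gs://"
    if not include_path.startswith(scheme):
        raise ValueError(f"Include path must start with {scheme}: {include_path!r}")
    # Everything up to the first slash is the bucket, the rest is the directory.
    bucket, _, directory = include_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Include path has no bucket name: {include_path!r}")
    # Leading and trailing slashes are dropped; the case is left untouched
    # because the path is case-sensitive.
    return bucket, directory.strip("/")


print(parse_include_path("gs://marketing"))       # ('marketing', '')
print(parse_include_path("gs://marketing/mkt/"))  # ('marketing', 'mkt')
```

A path without the gs:// prefix, such as marketing, raises a ValueError in this sketch.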

Example of how the Include path and the Exclude patterns work together

In bucket1 of the GCS system, the following files exist:

myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/departments/market-ap.txt
myfolder/employees/hr.json
myfolder/employees/john.csv
myfolder/employees/jane.csv
myfolder/employees/juan.txt
myfolder/report.xlsx
rubbish.txt

Below are the results for several combinations of Include path and Exclude patterns:

Include path: gs://bucket1/
Exclude patterns: <none>
What does it mean? All files in gs://bucket1 are taken into account.
Result:
myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/departments/market-ap.txt
myfolder/employees/hr.json
myfolder/employees/john.csv
myfolder/employees/jane.csv
myfolder/employees/juan.txt
myfolder/report.xlsx
rubbish.txt

Include path: gs://bucket1
Exclude patterns: <none>
What does it mean? All files in gs://bucket1 are taken into account.
Result:
myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/departments/market-ap.txt
myfolder/employees/hr.json
myfolder/employees/john.csv
myfolder/employees/jane.csv
myfolder/employees/juan.txt
myfolder/report.xlsx
rubbish.txt

Include path: bucket1
Exclude patterns: <none>
What does it mean? None of the files are taken into account because the Include path is not correct.
Result:
<none>

Include path: gs://bucket1/
Exclude patterns: *.txt and **.json
What does it mean? All files in gs://bucket1/ are taken into account, except:
  • the TXT files in the main folder gs://bucket1
  • the JSON files in the main folder gs://bucket1
Result:
myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/departments/market-ap.txt
myfolder/employees/hr.json
myfolder/employees/john.csv
myfolder/employees/jane.csv
myfolder/employees/juan.txt
myfolder/report.xlsx

Include path: gs://bucket1/
Exclude patterns: **/*.txt and myfolder/employees/*.json
What does it mean? All files in gs://bucket1/ are taken into account, except:
  • the TXT files in all subfolders of gs://bucket1
  • the JSON files in subfolder gs://bucket1/myfolder/employees/
Result:
myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/employees/john.csv
myfolder/employees/jane.csv
myfolder/report.xlsx

Include path: gs://bucket1/
Exclude patterns: myfolder/**/*txt
What does it mean? All files in gs://bucket1/ are taken into account, except the TXT files in all subfolders of gs://bucket1/myfolder/.
Result:
myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/employees/hr.json
myfolder/employees/john.csv
myfolder/employees/jane.csv
myfolder/report.xlsx
rubbish.txt

Include path: gs://bucket1/myfolder
Exclude patterns: employees/* and myfolder/departments/*
What does it mean? All files in gs://bucket1/myfolder/ are taken into account, except:
  • all files in all subfolders of gs://bucket1/myfolder/employees
  • all files in gs://bucket1/myfolder/myfolder/departments/
Result:
myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/departments/market-ap.txt
myfolder/report.xlsx

Include path: gs://bucket1/myfolder/departments
Exclude patterns: *json
What does it mean? All files in gs://bucket1/myfolder/departments are taken into account, except all JSON files in this folder.
Result:
myfolder/departments/market-ap.txt

Include path: gs://bucket1/
Exclude patterns: **/j???.*
What does it mean? All files in gs://bucket1/ are taken into account, except the files starting with j followed by three characters, from all subfolders in bucket1.
Result:
myfolder/departments/finance.json
myfolder/departments/market-us.json
myfolder/departments/market-emea.json
myfolder/departments/market-ap.txt
myfolder/employees/hr.json
myfolder/report.xlsx
rubbish.txt

Include path: gs://bucket1
Exclude patterns: myfolder/**
What does it mean? All files in gs://bucket1/ are taken into account, except for the files in myfolder/.
Result:
rubbish.txt
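
The combinations above can be reproduced with a small script. This is only a sketch under stated assumptions: the crawler's actual matching code is not shown in this documentation, so the functions pattern_to_regex and select_files are hypothetical. The sketch assumes that exclude patterns are evaluated relative to the Include path, that * and ? never cross a directory separator, and that ** standing alone as a path segment matches zero or more directories. These assumptions agree with every result listed above.

```python
import re

FILES = [
    "myfolder/departments/finance.json",
    "myfolder/departments/market-us.json",
    "myfolder/departments/market-emea.json",
    "myfolder/departments/market-ap.txt",
    "myfolder/employees/hr.json",
    "myfolder/employees/john.csv",
    "myfolder/employees/jane.csv",
    "myfolder/employees/juan.txt",
    "myfolder/report.xlsx",
    "rubbish.txt",
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate an exclude pattern into a regular expression (assumed semantics)."""
    segments = pattern.split("/")
    parts = []
    for index, segment in enumerate(segments):
        last = index == len(segments) - 1
        if segment == "**":
            # A trailing ** matches everything below; otherwise zero or more directories.
            parts.append(".*" if last else "(?:[^/]+/)*")
            continue
        body = "".join(
            "[^/]*" if ch == "*" else "[^/]" if ch == "?" else re.escape(ch)
            for ch in segment
        )
        parts.append(body if last else body + "/")
    return re.compile("".join(parts))


def select_files(files, bucket, include_path, exclude_patterns):
    """Return the files that would be taken into account (illustrative only)."""
    prefix = f"gs://{bucket}"
    if include_path != prefix and not include_path.startswith(prefix + "/"):
        return []  # Include path is not correct, for example "bucket1"
    base = include_path[len(prefix):].strip("/")
    base = base + "/" if base else ""
    regexes = [pattern_to_regex(p) for p in exclude_patterns]
    selected = []
    for name in files:
        if not name.startswith(base):
            continue
        relative = name[len(base):]  # patterns are matched relative to the Include path
        if not any(rx.fullmatch(relative) for rx in regexes):
            selected.append(name)
    return selected


print(select_files(FILES, "bucket1", "gs://bucket1", ["myfolder/**"]))
# ['rubbish.txt']
```

Calling select_files with the other Include path and Exclude patterns combinations returns the file lists shown in the corresponding Result entries.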

What's next?

You can now synchronize GCS manually or define a synchronization schedule.