Prepare S3 file system for Edge

Before you integrate an S3 file system via Edge, you need to prepare Amazon S3 by setting up the AWS services, programmatic user, and IAM role described below.

Required Amazon Web Services (AWS)

Collibra relies on AWS Glue and AWS Identity and Access Management to ingest and synchronize data.

AWS Glue

AWS Glue is an Amazon cloud service that performs extract-transform-load (ETL) processes on data stored in data sources such as Amazon S3. AWS Glue has the following components:

  • Glue crawlers:
    Glue crawlers analyze and describe a wide range of data sources such as Amazon S3 or MySQL. However, Data Catalog only uses them for the Amazon S3 file system integration.
  • Glue database:
    Glue crawlers store their results in a database in the form of tables and columns. Both the tables and columns in the Glue database contain metadata that describes the content of Amazon S3. Data Catalog reads those databases for data ingestion. The name of the created Glue database is collibra_catalog_<S3 File System-ID>_<Domain-ID>.
  • ETL processes:
    The ETL processes can extract data from a data source, process that data (for example, categorize and clean it), and produce output. Data Catalog currently does not use this component.

Although you need an AWS account, you do not have to work in AWS Glue directly because Collibra creates and manages the Glue crawlers for you. For more information about AWS Glue, see the AWS Glue documentation.

Note Collibra only uses AWS Glue to ingest data from Amazon S3. Other features, such as crawling other data sources or running ETL processes, are not integrated.

AWS Identity and Access Management

Collibra uses the AWS Identity and Access Management (IAM) service to manage access to Amazon S3 and AWS Glue. As with AWS Glue, you need an AWS account to use the IAM service, but after you set up the required users and roles, you do not have to work directly with IAM. For more information about IAM, see the IAM documentation.

You need two things in IAM:

  • An AWS programmatic user to access Amazon S3 and AWS Glue.
  • An IAM role for the crawlers.

Programmatic user

Collibra needs programmatic access to Amazon S3 and AWS Glue through a programmatic user. This user requires the following policies and permissions:

  • Policies:

    • AWSGlueServiceRole (AWS managed policy)
    • pass_role (inline policy)

      You can use the following JSON content:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": "iam:PassRole",
                  "Resource": "*"
              }
          ]
      }
  • Permissions:
    • In Collibra Data Intelligence Cloud 2020.11 and newer and Collibra Data Governance Center 5.7.7 and newer, the programmatic user needs the following permissions:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": [
                      "glue:GetCrawler",
                      "glue:GetCrawlers",
                      "glue:DeleteDatabase",
                      "glue:GetTables",
                      "glue:DeleteCrawler",
                      "glue:StopCrawler",
                      "s3:ListBucket",
                      "glue:GetDatabases",
                      "glue:CreateCrawler",
                      "glue:GetDatabase",
                      "iam:PassRole",
                      "glue:StartCrawler",
                      "glue:BatchDeleteTable",
                      "s3:GetBucketLocation"
                  ],
                  "Resource": "*"
              }
          ]
      }

For more information about creating a user with programmatic access, see the IAM documentation.
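
The pass_role inline policy shown above allows iam:PassRole on all resources. If your security requirements call for a narrower scope, a policy along the following lines is one possible sketch. It is not taken from the Collibra documentation: the Sid value and the placeholders <AWS-account-ID> and <crawler-role-name> are hypothetical, and the iam:PassedToService condition key is a standard IAM feature whose compatibility with your Collibra version should be tested before use.

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "PassCrawlerRoleToGlue",
                  "Effect": "Allow",
                  "Action": "iam:PassRole",
                  "Resource": "arn:aws:iam::<AWS-account-ID>:role/<crawler-role-name>",
                  "Condition": {
                      "StringEquals": {
                          "iam:PassedToService": "glue.amazonaws.com"
                      }
                  }
              }
          ]
      }

Replace both placeholders with the values of the IAM role that you create for the crawlers. Scoping the resource this way limits the programmatic user to passing only that one role, and only to AWS Glue.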

IAM role

AWS Glue crawlers need an IAM role that allows them to perform operations on your behalf. The pass_role inline policy of the programmatic user is used to assign this role to the crawler.

The IAM role requires at least the following parameters:

  • Trusted entities: glue.amazonaws.com
  • Policies:
    • AmazonS3ReadOnlyAccess (AWS managed policy; required when you need to access a private S3 bucket)
    • AWSGlueServiceRole (AWS managed policy)
Note 
  • You can provide more restrictive permissions to the IAM role if your security requirements dictate it. Your AWS subject matter expert can create the appropriate permission set using the steps in the IAM documentation. Before you use the IAM role in Collibra, we recommend that you test a crawler with an IAM role that has these permissions in the AWS console to make sure that the crawler runs successfully.
  • AWS EC2 role-based Amazon S3 access is not supported in S3 on Edge because the S3 credentials are stored on the Edge site and not in the Collibra repository.
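
For reference, the trusted entity glue.amazonaws.com listed for the IAM role typically corresponds to a trust policy like the one below. This is a minimal sketch of the standard AWS trust policy format, not content taken from the Collibra documentation. Verify it against the IAM documentation before you apply it in your account.

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "glue.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }

If you create the role in the AWS console and select Glue as the trusted service, the console usually generates an equivalent trust policy for you.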