Prepare S3 file system for Edge

Before you integrate an S3 file system via Edge, you need to prepare Amazon S3 by creating the required roles and permissions.
Two types of authentication are available for Amazon S3: IAM and EC2. The preparations in S3 depend on the authentication type you want to use.

IAM is the most common authentication type for Edge.
EC2 is used to connect to an AWS EC2 instance that is configured with role-based authentication.

Each authentication type has different requirements. The steps for IAM authentication come first, followed by the steps for EC2 authentication.

Steps

Create a programmatic user.
Collibra needs programmatic access to Amazon S3 and AWS Glue. This is done by creating a user with the following permissions:
- AWS managed policy: AWSGlueServiceRole.
  Note If you don't want to use this out-of-the-box AWS managed policy, go to the Support portal to learn about the required permissions.
- inline policy: pass_role with the following specific JSON content.
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*"
        }
    ]
}
```
Note
If your Glue database is KMS encrypted, also give the permission kms:Decrypt.
JSON content for the extra permission:
```
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": [
            "kms:Decrypt"
        ],
        "Resource": "Resource ID"
    }
}
```
where Resource ID is the identifier of the encryption key used to encrypt the Glue database.
For example "Resource": "arn:aws:kms:us-east-1:123456789012:key/abc1234567890def1234567890efg123"
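As a minimal sketch, the placeholder substitution described above can be scripted. The Python below is illustrative only: the function name `kms_decrypt_policy` and the loose ARN pattern are assumptions made for this example, not part of Collibra or AWS tooling. It fills the Resource field of the extra permission with a key ARN and rejects values that do not look like one.

```python
import json
import re

# Loose pattern for a KMS key ARN. The exact shape is an assumption for
# illustration; adjust it if your key identifiers differ.
KMS_KEY_ARN = re.compile(r"^arn:aws[a-z-]*:kms:[a-z0-9-]+:\d{12}:key/[A-Za-z0-9-]+$")


def kms_decrypt_policy(key_arn: str) -> dict:
    """Return the extra kms:Decrypt policy with "Resource ID" replaced by key_arn."""
    if not KMS_KEY_ARN.match(key_arn):
        raise ValueError(f"not a KMS key ARN: {key_arn!r}")
    return {
        "Version": "2012-10-17",
        "Statement": {
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": key_arn,
        },
    }


if __name__ == "__main__":
    # Example ARN taken from the text above.
    arn = "arn:aws:kms:us-east-1:123456789012:key/abc1234567890def1234567890efg123"
    print(json.dumps(kms_decrypt_policy(arn), indent=4))
```

The printed JSON can be pasted into the policy editor. Passing the literal placeholder "Resource ID" raises a ValueError, which helps catch a policy that was saved without the real key identifier.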
Step-by-step instructions:
1. In AWS IAM, create the user that you want to use to connect to AWS.
2. During the creation, attach permission policy AWSGlueServiceRole.
  This is an AWS managed policy. If you don't want to use this out-of-the-box AWS managed policy, go to the Support portal to learn about the required permissions.
3. After the creation, open the user details and create an inline policy.
4. Use the following JSON content for the inline policy:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*"
        }
    ]
}
```
5. Use the following name for the inline policy: pass_role.
When you create the AWS connection in Collibra, you need the programmatic user credentials and the access keys. For information on access keys, go to the IAM documentation.
For more information about creating a user with programmatic access, go to the IAM documentation.
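Before saving a policy document, it can help to confirm that it grants the action you expect. The following Python is a rough sketch under stated assumptions: `allows_action` and `as_list` are hypothetical helpers written for this example, and the check only looks at Allow statements and their Action entries. It ignores Deny statements, NotAction, resources, and conditions, so it is not a substitute for the IAM policy simulator.

```python
import json
from fnmatch import fnmatchcase


def as_list(value):
    """IAM JSON accepts a single item or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def allows_action(policy: dict, action: str) -> bool:
    """Return True if any Allow statement lists a pattern matching `action`.

    Matching is case-insensitive and treats * as a wildcard, which only
    approximates how IAM evaluates action names.
    """
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in as_list(statement.get("Action", [])):
            if fnmatchcase(action.lower(), pattern.lower()):
                return True
    return False


# The pass_role inline policy from the steps above.
PASS_ROLE = json.loads("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*"
        }
    ]
}
""")

if __name__ == "__main__":
    print(allows_action(PASS_ROLE, "iam:PassRole"))
```

The same helper accepts the kms:Decrypt policy, because it handles a Statement given as a single object as well as a list.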
Create an IAM role.
When you create the capability for the AWS connection in Collibra, you need to add an IAM role. The AWS Glue crawlers need this role to run their operation. The pass_role permission policy of the programmatic user is used to assign this role to the crawler. The IAM role needs the following parameters:
- Trusted entities: glue.amazonaws.com
- Policies:
  - AWS managed policy: AWSGlueServiceRole
    Note If you don't want to use this out-of-the-box AWS managed policy, go to the Support portal to learn about the required permissions.
  - AWS managed policy: AmazonS3ReadOnlyAccess, required only when you want to access a private S3 bucket.
Note
If your Glue database is KMS encrypted, also give the permission kms:Decrypt.
JSON content for the extra permission:
```
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": [
            "kms:Decrypt"
        ],
        "Resource": "Resource ID"
    }
}
```
where Resource ID is the identifier of the encryption key used to encrypt the Glue database.
For example "Resource": "arn:aws:kms:us-east-1:123456789012:key/abc1234567890def1234567890efg123"
Step-by-step instructions:
1. In AWS IAM, create a new role.
2. During the creation, add permission policy AWSGlueServiceRole.
  This is an AWS managed policy. If you don't want to use this out-of-the-box AWS managed policy, go to the Support portal to learn about the required permissions.
3. If needed, also add permission policy AmazonS3ReadOnlyAccess.
  This permission is required only when you want to access a private S3 bucket.
4. Open the role and, in Trust relationships, check that glue.amazonaws.com is listed as a trusted entity in the trust policy. This should have been added automatically based on the permission policy AWSGlueServiceRole.
You will need to use this IAM role when you add a capability in Collibra.
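The trust relationship check in the last step can also be done on the JSON of the trust policy. The sketch below is illustrative: `trusts_service` is a hypothetical helper, and `EXAMPLE_TRUST_POLICY` shows the usual shape of a trust policy for a service principal rather than a document copied from your account. It reports whether an Allow statement lets glue.amazonaws.com call sts:AssumeRole.

```python
def trusts_service(trust_policy: dict, service: str = "glue.amazonaws.com") -> bool:
    """Return True if an Allow statement lets `service` assume the role."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            # A bare "*" principal names no specific service; skip it here.
            continue
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        if service in services:
            return True
    return False


# Typical trust policy for a role that AWS Glue can assume (example only).
EXAMPLE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

if __name__ == "__main__":
    print(trusts_service(EXAMPLE_TRUST_POLICY))
```

If the function returns False for the trust policy copied from the role, edit the trust relationship in the IAM console before using the role in Collibra.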
If you have enabled Data Lake Formation, complete additional steps.
Step-by-step instructions:
1. Register the Data Lake locations and create an IAM role for Lake Formation.
  To add or update data, Lake Formation needs read/write access to the Amazon S3 path that you want to integrate.
  1. Navigate to AWS Lake Formation → Administration → Data lake locations.
  2. Click Register location.
  3. Select the IAM role for Lake Formation "AWSServiceRoleForLakeFormationDataAccess" and add the S3 location that you want to integrate.
    This IAM role needs the following permission policies:
    - AWS managed policy: LakeFormationDataAccessServiceRolePolicy
    - Inline policy: LakeFormationDataAccessPolicyForS3
  4. Click Register Location.
2. Add the programmatic user as a Data Lake administrator.
  To perform this step, go to AWS Lake Formation → Administration → Administrative roles and tasks and add the user as Data Lake administrator.
3. Give your IAM role permission to access specific storage locations.
  1. Go to AWS Lake Formation → Permissions → Data locations.
  2. Click Grant.
  3. Select the IAM role you created and add the S3 location that you want to integrate.
  4. Click Grant.

Important

You can provide more restrictive permissions to the IAM role, if dictated by your security requirements. Your AWS subject matter expert can create the appropriate permission set using the steps in the IAM documentation. We recommend that you test a crawler with an IAM role that has these permissions in the AWS console, to ensure that it is successful before you use the IAM role in Collibra.
The S3 credentials are stored on the Edge site and not in the Collibra repository.

Note EC2 has been validated only for bundled K3S installations of Edge.

If you use an AWS EC2 instance that is configured with role-based authentication, you can connect to Amazon S3 without an access key ID and secret access key. Use the following steps to configure role-based Amazon S3 access control.

Prerequisites

You have access to the AWS IAM console.
You have access to the Amazon EC2 console.
You have an Amazon EC2 instance.

Steps

Go to AWS Identity and Access Management.
Create an IAM role.

When you create the capability for the AWS connection in Collibra, you will need to add an IAM role. The AWS Glue crawlers need this role to run their operation. The pass_role permission policy of the programmatic user is used to assign this role to the crawler. The IAM role needs the following parameters:
- Trusted entities: glue.amazonaws.com
- Policies:
  - AWS managed policy: AWSGlueServiceRole
  - AWS managed policy: AmazonS3ReadOnlyAccess, required only when you want to access a private S3 bucket
  - pass_role (inline policy)
Note
If your Glue database is KMS encrypted, also give the permission kms:Decrypt.
JSON content for the extra permission:
```
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": [
            "kms:Decrypt"
        ],
        "Resource": "Resource ID"
    }
}
```
where Resource ID is the identifier of the encryption key used to encrypt the Glue database.
For example "Resource": "arn:aws:kms:us-east-1:123456789012:key/abc1234567890def1234567890efg123"
Step-by-step instructions:
1. In AWS IAM, create a new role.
2. During the creation, add permission policy AWSGlueServiceRole.
  This is an AWS managed policy. If you don't want to use this out-of-the-box AWS managed policy, you will need to work with AWS support to define a more restrictive policy.
3. If needed, also add permission policy AmazonS3ReadOnlyAccess.
  
  This permission is required only when you want to access a private S3 bucket.
4. Open the role and, in Trust relationships, check that glue.amazonaws.com is listed as a trusted entity in the trust policy. This should have been added automatically based on the permission policy AWSGlueServiceRole.
5. After the creation, open the role details and create an inline policy via Add permissions.
6. Use the following JSON content for the inline policy:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*"
        }
    ]
}
```
7. Use the following name for the inline policy: pass_role.
You will need to use this IAM role when you add a capability in Collibra.
In the Amazon EC2 console, attach the IAM role to the Amazon EC2 instance.

Note

If the credentials in the Amazon EC2 instance can't be used to authenticate, you can create a credentials file and save it in the user_home/.aws/ folder. The credentials file should look like this:

[default]
aws_access_key_id = <access key ID>
aws_secret_access_key = <secret access key>

For more information, see the AWS developer guide.

Warning Do not use a credentials file unless absolutely necessary.
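If you do fall back to a credentials file, a quick structural check can catch a missing profile or key before Edge tries to use it. This is a minimal sketch, assuming the file follows the INI layout shown above. The helper name `missing_credential_keys` and the sample values are made up for illustration, and the script never prints the secret values themselves.

```python
import configparser
from pathlib import Path

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(text: str, profile: str = "default") -> list:
    """Return the required keys that are absent or empty in the given profile."""
    # Interpolation is disabled so special characters in a key are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [
        key
        for key in REQUIRED_KEYS
        if not parser.get(profile, key, fallback="").strip()
    ]


# Placeholder content with obviously fake values.
SAMPLE = """\
[default]
aws_access_key_id = EXAMPLEKEYID
aws_secret_access_key = exampleSecretValue
"""

if __name__ == "__main__":
    # Assumed location: the .aws folder in the home directory of the current user.
    path = Path.home() / ".aws" / "credentials"
    if path.exists():
        print("missing keys:", missing_credential_keys(path.read_text()))
    else:
        print("no credentials file found at", path)
```

An empty list means both keys are present in the [default] profile. It does not confirm that the keys are valid in AWS.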

What's Next?

You can now go to Collibra to register your AWS regions and prepare your Edge site. See the steps in Integrate an Amazon S3 file system.

More information about Amazon Web Services (AWS)

Collibra relies on AWS Glue and AWS Identity and Access Management (IAM) to ingest and synchronize data.

AWS Glue

AWS Glue is an Amazon cloud service that performs extract-transform-load (ETL) processes on data stored in data sources such as Amazon S3. AWS Glue has the following components:

Glue crawlers: Glue crawlers analyze and describe a wide range of data sources such as Amazon S3 or MySQL. However, Data Catalog only uses them for the Amazon S3 file system integration.
Glue database: Glue crawlers store their results in a database in the form of tables and columns. Both the tables and the columns in the Glue database contain metadata that describes the content of Amazon S3. Data Catalog reads those databases for data ingestion.
The name of the created Glue database is collibra_catalog_<S3 File System-ID>_<Domain-ID>.
ETL processes: The ETL processes can extract data from a data source, process that data, for example, categorize and clean it, and produce output. This component is currently not used by Data Catalog.
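The Glue database naming pattern mentioned in the list above can be used to recognize the databases that Collibra created in your AWS account. The snippet below is a small sketch: the helper names and the sample IDs are hypothetical, and the exact format of the S3 File System ID and Domain ID is not specified here, so the code treats them as opaque strings.

```python
from typing import Iterable, List

GLUE_DATABASE_PREFIX = "collibra_catalog_"


def glue_database_name(s3_file_system_id: str, domain_id: str) -> str:
    """Compose a name following collibra_catalog_<S3 File System-ID>_<Domain-ID>."""
    return f"{GLUE_DATABASE_PREFIX}{s3_file_system_id}_{domain_id}"


def collibra_databases(names: Iterable[str]) -> List[str]:
    """Keep only the database names that carry the collibra_catalog_ prefix."""
    return [name for name in names if name.startswith(GLUE_DATABASE_PREFIX)]


if __name__ == "__main__":
    # Hypothetical IDs, used only to show the shape of the name.
    example = glue_database_name("fs-id", "domain-id")
    print(example)
    print(collibra_databases([example, "sales_raw", "default"]))
```

Filtering by prefix is a convenience for browsing the AWS Glue console output or a list of database names. It does not prove that a database was created by Collibra.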

Although you need an AWS account, you don't have to work in AWS Glue directly because Collibra does everything for you. For more information about AWS Glue, go to the AWS Glue documentation.

Note Collibra only uses AWS Glue to ingest data from Amazon S3. All other features, such as crawling other data sources or ETL processes, are not integrated.

AWS Identity and Access Management

Collibra uses the AWS Identity and Access Management (IAM) service to manage access to Amazon S3 and AWS Glue. Similar to AWS Glue, you need an AWS account to use the IAM service, but after setting up the required users and roles, you don't have to work directly with IAM. For more information about IAM, go to the IAM documentation.

You need two things in IAM:

An AWS programmatic user to access Amazon S3 and AWS Glue, if you use the IAM authentication type.
An IAM role for the crawlers.