Connecting to Amazon S3

This section contains details for Amazon S3 connections.

General information

Amazon Simple Storage Service (Amazon S3) is an object storage service that Collibra Data Quality & Observability treats as a Remote File Connection for accessing data. You can also use S3 buckets to export and store breaking records from Collibra DQ rules externally. This can be helpful if you need to control your storage environment behind a firewall.

  • Data source: Amazon S3
  • Supported versions: N/A
  • Connection string: s3://
  • Packaged? Yes
  • Certified? Yes
  • Supported features:
    • Analyze data: Yes
    • Archive breaking records: Yes
    • Estimate job: Yes
    • Pushdown: No
  • Processing capabilities:
    • Spark agent: Yes
    • Yarn agent: Yes

Minimum user permissions

For Collibra DQ to access your S3 bucket, you need the following permissions. A policy sketch follows the list.

  • ROLE_ADMIN in Collibra DQ.
  • Read access on the S3 bucket where your data is stored.
  • Read and write access on the S3 bucket where Collibra DQ archives breaking records. This is only necessary if you use the archive breaking records feature.
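
The sketch below shows one way to express these bucket permissions as an IAM policy and attach it with boto3. The bucket names, policy name, and exact action list are illustrative assumptions, not values Collibra DQ requires; adapt them to your environment.

    import json

    import boto3

    # Hypothetical bucket names -- replace with your own.
    DATA_BUCKET = "dq-data-bucket"        # bucket where your data is stored
    ARCHIVE_BUCKET = "dq-archive-bucket"  # bucket where breaking records land

    policy_doc = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read access on the data bucket.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{DATA_BUCKET}",
                    f"arn:aws:s3:::{DATA_BUCKET}/*",
                ],
            },
            {
                # Read and write access on the archive bucket; only needed
                # if you archive breaking records.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
                "Resource": [
                    f"arn:aws:s3:::{ARCHIVE_BUCKET}",
                    f"arn:aws:s3:::{ARCHIVE_BUCKET}/*",
                ],
            },
        ],
    }

    # Create the policy; attach it to the user or role Collibra DQ uses.
    iam = boto3.client("iam")
    iam.create_policy(
        PolicyName="collibra-dq-s3-access",  # hypothetical name
        PolicyDocument=json.dumps(policy_doc),
    )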

Recommended and required connection properties

  • Name (Text; required): The unique name of your connection. Ensure that there are no spaces in your connection name.
  • HTTPS (Option; optional): Converts the Connection URL field from an s3:// to an https:// URI path.
  • Path Style Access (Option; optional): Allows access to object stores along HTTPS paths. This option is only available when HTTPS is enabled.
  • Bucket Name (String; required when HTTPS is selected): The exact name of the S3 bucket along the URI path you are attempting to access. Example: example_bucket
  • Connection URL (String; required): The connection string path of your S3 connection. The path must start with s3:// and point to the root bucket, not a sub-folder. Example: s3://<bucket-name>. You can optionally add a key after the bucket name, for example s3://<bucket-name>/key.
  • Region (String; required): The AWS region in which the S3 bucket resides. The default region is US_EAST_1.
  • Target Agent (Option; optional): The agent used to submit your DQ Job.
  • Auth Type (Option; required): The method used to authenticate your connection.

    Note The configuration requirements are different depending on the Auth Type you select. See Authentication for more details on the available authentication types.

  • Save Credentials (Option; required): Select this option after you enter your connection details. We recommend selecting this option to allow credentials to be shared with users across the Collibra DQ platform.
  • Driver Properties (String; optional): The configurable driver properties for your connection. Multiple properties must be comma-delimited. For example, abc=123,test=true
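
Before saving the connection, it can help to confirm that the bucket, optional key, and region you plan to enter are reachable. The following boto3 sketch mirrors those fields; the bucket name and prefix are placeholders taken from the examples above.

    import boto3

    BUCKET = "example_bucket"   # Bucket Name
    PREFIX = "key"              # optional key after the bucket name
    REGION = "us-east-1"        # Region (US_EAST_1)

    s3 = boto3.client("s3", region_name=REGION)
    s3.head_bucket(Bucket=BUCKET)  # raises an error if the bucket is unreachable

    # List a few objects under the optional key prefix.
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"])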

Authentication

Select an authentication type from the dropdown menu. The available options are the authentication types currently supported for this data source.

  • Key: AWS security credentials that use an access key ID and secret access key combination to access S3 buckets.
    • Key: The access key ID for your Amazon S3 storage account.
    • Secret: The secret access key for your Amazon S3 storage account.
  • Instance Profile: The instance profile that grants the EC2 instance access to your S3 bucket. Optionally select Assume Role and add an accessRole when you use Instance Profile.

    Important DQ Cloud does not support Instance Profile.

  • Username: The username credential required for the IdP service to authenticate a user.
  • Password: The password credential required for the IdP service to authenticate a user.
  • Assume Role: Optional for Instance Profile and Key authentication. Select this option to generate a set of temporary security credentials that you can use to access AWS resources. To assume an S3 IAM role on EKS-based deployments of Collibra DQ, select Assume Role, then optionally enter the role in the Role to Assume field.
  • Role to Assume: The IAM role associated with your EC2 instance. This is optional when assuming an S3 IAM role on EKS-based deployments of Collibra DQ.
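
For context, Assume Role builds on the standard AWS STS credential exchange. The sketch below shows that underlying flow with a hypothetical role ARN; Collibra DQ performs the equivalent exchange internally when you select Assume Role.

    import boto3

    # Hypothetical ARN -- use the role you would enter in Role to Assume.
    ROLE_ARN = "arn:aws:iam::123456789012:role/collibra-dq-s3-role"

    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=ROLE_ARN,
        RoleSessionName="collibra-dq-session",
    )["Credentials"]

    # The temporary credentials can then be used to read the S3 bucket.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    print(s3.list_objects_v2(Bucket="example_bucket", MaxKeys=1))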

Password Manager

  • Username: The username used to sign in to your password manager service.
  • Password: Not required when you use Password Manager.
  • Script: The script that contains the SAML role. The password manager program contains the information needed to automate password-related tasks when you attempt to sign in to Collibra DQ.
  • App ID: The application identifier used to help the IdP service authorize Collibra DQ when you attempt to sign in.
  • Safe: The storage vault where your password and other authentication information are stored.
  • Pwd Mgr Name: The name of your safe.
  • accessRole: The IAM role associated with your EC2 instance.

Connect to NetApp or Amazon S3 endpoint in URI format

To connect to NetApp or Amazon S3 endpoints in URI format, select the HTTPS option, specify a Bucket Name, and then add one of the following properties to the Properties tab, depending on the type of endpoint you are connecting to:

  • Amazon S3 endpoint URI: s3-endpoint=s3
  • NetApp: s3-endpoint=netapp
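
Conceptually, the s3-endpoint property points Collibra DQ at a non-default S3 endpoint, much like giving an S3 client a custom endpoint URL. The boto3 sketch below illustrates the same idea; the endpoint host is a placeholder for your NetApp or S3-compatible endpoint.

    import boto3

    # Placeholder endpoint for a NetApp or other S3-compatible service.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.netapp.example.com",
        region_name="us-east-1",
    )
    print(s3.list_objects_v2(Bucket="example_bucket", MaxKeys=1))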

Connect to MinIO with path style access

Note The ability to connect to MinIO is only available in Collibra DQ versions 2024.03 and newer.

To connect to MinIO using path style access, select the HTTPS and Path Style Access options, specify a Bucket Name, and then add the following property to the Properties tab:

  • MinIO: s3-endpoint=minio

Important When connecting to MinIO, you must select the Path Style Access option.
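
With path style access, the bucket name is carried in the URL path (https://host/bucket/key) rather than in the hostname, which is what MinIO expects. The boto3 sketch below shows the equivalent client-side setting; the endpoint and credentials are placeholders.

    import boto3
    from botocore.config import Config

    # Placeholder MinIO endpoint and credentials.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://minio.example.com:9000",
        aws_access_key_id="minio-access-key",
        aws_secret_access_key="minio-secret-key",
        config=Config(s3={"addressing_style": "path"}),  # path style access
    )
    print(s3.list_buckets())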

Known limitations

  • Instance Profile is not supported on DQ Cloud deployments.
  • Filtergram on the Data Preview tab is not available for S3 connections. There is currently no workaround for this limitation.
  • The Spark 3.2.2 standalone installation that comes with the AWS Marketplace installation has a jar mismatch that affects the AWS S3 Archive feature: the bundled hadoop-aws-3.2.1.jar file is incompatible with the feature.
    • As a workaround, replace hadoop-aws-3.2.1.jar with hadoop-aws-3.3.1.jar in the spark/jars directory. The necessary .jar file can be obtained from the following link: Apache Downloads.

      Note If you encounter any difficulties locating the necessary .jar file on the Apache Downloads page, contact your CS or SE for assistance.