Connecting to Amazon S3

This section contains details for Amazon S3 connections.

General information

Amazon Simple Storage Service (Amazon S3) is an object storage service that Data Quality & Observability Classic treats as a Remote File Connection for accessing data. You can also use S3 buckets to export and store breaking records from Collibra DQ rules externally. This can be helpful if you need to control your storage environment behind a firewall.

Field	Description
Data source	Amazon S3
Supported versions	N/A
Connection string	`s3://`
Packaged?	Yes
Certified?	Yes
Supported features
Analyze data	Yes
Archive breaking records	Yes
Estimate job	Yes
Pushdown	No
Processing capabilities
Spark agent	Yes
Yarn agent	Yes

Minimum user permissions

For Collibra DQ to access your S3 bucket, you need the following permissions:

ROLE_ADMIN in Collibra DQ.
Read access on the S3 bucket where your data is stored.
Read and write access on the S3 bucket where breaking records are archived from Collibra DQ. This is only necessary if you use the archive breaking records feature.
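
As a rough illustration of the bucket permissions listed above, the following Python sketch builds an IAM policy document. The mapping of "read" to `s3:GetObject` and `s3:ListBucket`, and of "write" to `s3:PutObject`, is an assumption made for this example and is not taken from the Collibra DQ documentation. The function name and the bucket names are placeholders.

```python
import json


def build_s3_policy(data_bucket, archive_bucket=None):
    """Return an IAM policy document (as a dict) for the given buckets.

    Assumption: read access = s3:GetObject + s3:ListBucket,
    write access = s3:PutObject. Adjust to your own requirements.
    """
    statements = [
        {
            "Sid": "ReadData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{data_bucket}",
                f"arn:aws:s3:::{data_bucket}/*",
            ],
        }
    ]
    # The archive bucket is only needed when breaking records are archived.
    if archive_bucket:
        statements.append(
            {
                "Sid": "ArchiveBreakingRecords",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
                "Resource": [
                    f"arn:aws:s3:::{archive_bucket}",
                    f"arn:aws:s3:::{archive_bucket}/*",
                ],
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Placeholder bucket names for illustration only.
    policy = build_s3_policy("example_bucket", "example_archive_bucket")
    print(json.dumps(policy, indent=2))
```

If you do not use the archive breaking records feature, omit the second argument and only the read statement is generated.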

Recommended and required connection properties

Required	Connection Property	Type	Value
Yes	Name	Text	The unique name of your connection. Ensure that there are no spaces in your connection name.
No	HTTPS	Option	Converts the Connection URL field from an `s3://` to an `https://` URI path.
No	Path Style Access	Option	Allows access to object stores along HTTPS paths. This option is only available when HTTPS is enabled.
Yes *When HTTPS is selected, Bucket Name is required.	Bucket Name	String	The exact name of the S3 bucket along the URI path you are attempting to access. Example: `example_bucket`
Yes	Connection URL	String	The connection string path of your S3 connection. The path must start with `s3://` and point to the root bucket, not a sub-folder. Example: `s3://<bucket-name>`. You can optionally add a key after the bucket name. Example: `s3://<bucket-name>/key`
Yes	Region	String	The AWS region in which the S3 bucket resides. The default region is `US_EAST_1`.
No	Target Agent	Option	The Agent used to submit your DQ Job.
Yes	Auth Type	Option	The method to authenticate your connection. Note The configuration requirements are different depending on the Auth Type you select. See Authentication for more details on available authentication types.
Yes	Save Credentials	Option	Select this option after you enter your connection details. We recommend selecting this option to allow credentials to be shared with users across the Collibra DQ platform.
No	Driver Properties	String	The configurable driver properties for your connection. Multiple properties must be semicolon delimited. For example, abc=123;test=true
No	Archive Breaking Records	Option	Select this option to automatically export CSV files containing the breaking records of a DQ Job to your S3 bucket.
No	Archive Location	String	The path to which breaking records are archived when the Archive Breaking Records and HTTPS options are enabled. Specify an output location in your data source where breaking records are sent. For example, /write/FolderName Note This option is only available when the HTTPS option is enabled.
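
The rules for the Name and Connection URL properties in the table above can be checked before you enter them. The following is a minimal Python sketch under that reading of the table; the function names are hypothetical and are not part of Collibra DQ.

```python
def validate_connection_name(name):
    """Return the name if it is non-empty and contains no whitespace."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("Connection name must be non-empty and contain no spaces")
    return name


def parse_connection_url(url):
    """Split an s3:// connection URL into (bucket, key).

    The key is an empty string when the URL points at the root bucket.
    """
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError("Connection URL must start with s3://")
    bucket, _, key = url[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("Connection URL must include a bucket name")
    return bucket, key


if __name__ == "__main__":
    print(validate_connection_name("my_s3_connection"))
    print(parse_connection_url("s3://example_bucket"))
    print(parse_connection_url("s3://example_bucket/key"))
```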

Authentication

Select an authentication type from the dropdown menu. The options available in the dropdown menu are the currently supported authentication types for this data source.

In the Driver Properties input field on the Properties tab, add the following property, adjusting the parameterized section (${ }) according to your IdP details: dq.idp.url=${IdP-URL}

Note If your IdP uses scopes, you may need to add dq.idp.scopes=${IdP-scope} after the IdP URL.
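
Because multiple driver properties must be semicolon delimited, the IdP URL and optional scopes properties end up in a single string. The Python sketch below shows one way to assemble and read back such a string. The helper names, the IdP URL, and the scope value are placeholders chosen for illustration; the assumption that `dq.idp.scopes` is joined to `dq.idp.url` with a semicolon follows from the Driver Properties description, not from an explicit example in this section.

```python
def build_driver_properties(properties):
    """Join key/value pairs into a semicolon-delimited property string."""
    parts = []
    for key, value in properties.items():
        value = str(value)
        if ";" in key or ";" in value:
            raise ValueError("Keys and values must not contain ';'")
        parts.append(f"{key}={value}")
    return ";".join(parts)


def parse_driver_properties(text):
    """Split a semicolon-delimited property string into a dict."""
    result = {}
    for item in text.split(";"):
        if not item.strip():
            continue  # ignore empty segments such as a trailing ';'
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"Missing '=' in property: {item!r}")
        result[key.strip()] = value.strip()
    return result


if __name__ == "__main__":
    # Placeholder IdP details for illustration only.
    props = build_driver_properties(
        {
            "dq.idp.url": "https://idp.example.com/token",
            "dq.idp.scopes": "example-scope",
        }
    )
    print(props)
    print(parse_driver_properties("abc=123;test=true"))
```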

Field	Description
Key	AWS security credentials that use an access key ID and secret access key combination to access S3 buckets.
Key	The access key ID for your Amazon S3 storage account.
Secret	The secret access key for your Amazon S3 storage account.
Instance Profile	The instance profile that grants your EC2 instance access to your S3 bucket. Optionally select Assume Role and add an accessRole when you use Instance Profile. Important DQ Cloud does not support Instance Profile.
Username	The username credential required for the IdP service to authenticate a user.
Password	The password credential required for the IdP service to authenticate a user.
Assume Role	This is optional for Instance Profile and Key authentication. Select this option to generate a set of temporary security credentials that you can use to access AWS resources. To assume an S3 IAM Role on EKS-based deployments of Collibra DQ, select Assume Role, then optionally enter the information in the Role to Assume field.
Role to Assume	The IAM role associated with your EC2 instance. This is optional when assuming an S3 IAM Role on EKS-based deployments of Collibra DQ.

Password Manager

Field	Description
Username	The username used to sign into your password manager service.
Password	Note Password is not required when you use Password Manager.
Script	The script that contains the SAML role. The password manager program contains the information needed to automate password-related tasks when you attempt to sign in to Collibra DQ.
App ID	The Application Identifier used to help the IdP service authorize Collibra DQ when you attempt to sign in.
Safe	The storage vault where your password and other authentication information are stored.
Pwd Mgr Name	The name of your safe.
accessRole	The IAM role associated with your EC2 instance.

Connect to NetApp or Amazon S3 endpoint in URI format

To connect to NetApp or Amazon S3 endpoints in URI format, select the HTTPS option, specify a Bucket Name, and then add one of the following properties to the Properties tab, depending on the type of endpoint you are connecting to:

Connection	Property
Amazon S3 endpoint URI	s3-endpoint=s3
NetApp	s3-endpoint=netapp

Connect to MinIO with path style access

Note The ability to connect to MinIO is only available in Collibra DQ version 2024.03 or newer.

To connect to MinIO using path style access, select the HTTPS and Path Style Access options, specify a Bucket Name, and then add the following property to the Properties tab:

Connection	Property
MinIO	s3-endpoint=minio

Important When connecting to MinIO, you must select the Path Style Access option.
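
The endpoint settings from this section and the preceding one can be summarized in a small lookup. The Python sketch below is only a convenience for comparing the three cases. The dictionary keys, field names, and function name are hypothetical labels, and a `False` value for Path Style Access means only that this page does not state it as required.

```python
# Settings per endpoint type, taken from the two tables above.
ENDPOINT_SETTINGS = {
    "amazon_s3_uri": {"property": "s3-endpoint=s3", "path_style_access": False},
    "netapp": {"property": "s3-endpoint=netapp", "path_style_access": False},
    "minio": {"property": "s3-endpoint=minio", "path_style_access": True},
}


def connection_settings(endpoint_type, bucket_name):
    """Return the options to select for the given endpoint type."""
    if endpoint_type not in ENDPOINT_SETTINGS:
        raise ValueError(f"Unknown endpoint type: {endpoint_type!r}")
    if not bucket_name:
        raise ValueError("Bucket Name is required when HTTPS is selected")
    entry = ENDPOINT_SETTINGS[endpoint_type]
    return {
        "HTTPS": True,  # all three endpoint types require the HTTPS option
        "Path Style Access": entry["path_style_access"],
        "Bucket Name": bucket_name,
        "Property": entry["property"],
    }


if __name__ == "__main__":
    for kind in ENDPOINT_SETTINGS:
        print(kind, connection_settings(kind, "example_bucket"))
```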

Limitations

Instance Profile is not supported on DQ Cloud deployments.
Filtergram on the Data Preview tab is not available for S3 connections. There is currently no workaround for this limitation.
When you use the Spark 3.2.2 standalone installation that comes with the AWS Marketplace installation, the AWS S3 Archive feature is affected by a jar mismatch: the existing hadoop-aws-3.2.1.jar file is incompatible with the feature.
- As a workaround, replace hadoop-aws-3.2.1.jar with hadoop-aws-3.3.1.jar in the spark/jars directory. You can obtain the necessary .jar file from Apache Downloads.
  Note If you encounter any difficulties locating the necessary .jar file on the Apache Downloads page, contact your CS or SE for assistance.
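
Before and after applying the workaround, it can help to confirm which hadoop-aws jar is present. The following Python sketch lists the hadoop-aws versions found in a jars directory and flags the 3.2.1 version named above. The function names are hypothetical, and the location of your spark/jars directory depends on your installation.

```python
import re
from pathlib import Path

# Matches file names such as hadoop-aws-3.2.1.jar
JAR_PATTERN = re.compile(r"^hadoop-aws-(\d+)\.(\d+)\.(\d+)\.jar$")


def find_hadoop_aws_versions(jars_dir):
    """Return a sorted list of hadoop-aws versions found in jars_dir."""
    versions = []
    for path in Path(jars_dir).glob("hadoop-aws-*.jar"):
        match = JAR_PATTERN.match(path.name)
        if match:
            versions.append(tuple(int(part) for part in match.groups()))
    return sorted(versions)


def has_incompatible_jar(jars_dir):
    """Return True if hadoop-aws-3.2.1.jar is present in jars_dir."""
    return (3, 2, 1) in find_hadoop_aws_versions(jars_dir)


if __name__ == "__main__":
    # Replace with the path to the spark/jars directory of your installation.
    jars_dir = "spark/jars"
    print(find_hadoop_aws_versions(jars_dir))
    print(has_incompatible_jar(jars_dir))
```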