Connecting to Amazon S3
This section contains details for Amazon S3 connections.
General information
Amazon Simple Storage Service (Amazon S3) is an object storage service that Collibra Data Quality & Observability treats as a Remote File Connection for accessing data. You can also use S3 buckets to export and store breaking records from Collibra DQ rules externally. This can be helpful if you need to control your storage environment behind a firewall.
Field | Description |
---|---|
Data source | Amazon S3 |
Supported versions | N/A |
Connection string | s3:// |
Packaged? | Yes |
Certified? | Yes |
Supported features | |
Analyze data | Yes |
Archive breaking records | Yes |
Estimate job | Yes |
Pushdown | No |
Processing capabilities | |
Spark agent | Yes |
Yarn agent | Yes |
Minimum user permissions
For Collibra DQ to access your S3 bucket, you need the following permissions:
- ROLE_ADMIN in Collibra DQ.
- Read access on the S3 bucket where your data is stored.
- Read and write access on the S3 bucket where breaking records are archived from Collibra DQ. This is only necessary if you use the archive breaking records feature.
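The bucket permissions listed above could be expressed as an IAM policy along the following lines. This is a minimal sketch, not a policy published by Collibra: the source only says "read" and "read and write", so the mapping to `s3:GetObject`, `s3:ListBucket`, and `s3:PutObject` is an assumption. The function name and bucket names are hypothetical.

```python
import json
from typing import Optional


def build_bucket_policy(data_bucket: str, archive_bucket: Optional[str] = None) -> dict:
    """Build an IAM policy document granting read access to the data bucket
    and, optionally, read and write access to the archive bucket.

    Assumption: "read" maps to s3:GetObject + s3:ListBucket and "write"
    maps to s3:PutObject. Adjust to your organization's requirements.
    """
    statements = [
        {
            "Sid": "ReadSourceData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{data_bucket}",      # bucket-level (ListBucket)
                f"arn:aws:s3:::{data_bucket}/*",    # object-level (GetObject)
            ],
        }
    ]
    # Only needed when the archive breaking records feature is used.
    if archive_bucket:
        statements.append(
            {
                "Sid": "ReadWriteBreakingRecords",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
                "Resource": [
                    f"arn:aws:s3:::{archive_bucket}",
                    f"arn:aws:s3:::{archive_bucket}/*",
                ],
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    print(json.dumps(build_bucket_policy("example_bucket", "example_archive_bucket"), indent=2))
```

If you do not archive breaking records, omit the second argument so that no write permission is granted.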
Recommended and required connection properties
Required | Connection Property | Type | Value |
---|---|---|---|
Yes | Name | Text | The unique name of your connection. Ensure that there are no spaces in your connection name. |
No | HTTPS | Option | Converts the Connection URL field from an s3:// to an https:// URI path. |
No | Path Style Access | Option | Allows access to object stores along HTTPS paths. This option is only available when HTTPS is enabled. |
Yes (when HTTPS is selected, Bucket Name is required) | Bucket Name | String | The exact name of the S3 bucket along the URI path you are attempting to access. For example, example_bucket |
Yes | Connection URL | String | The connection string path of your S3 connection. The path must start with s3:// and point to the root bucket, not a sub-folder. Example You can optionally add a key after the bucket name. Example |
Yes | Region | String | The AWS region in which the S3 bucket resides. The default region is US_EAST_1. |
No | Target Agent | Option | The Agent used to submit your DQ Job. |
Yes | Auth Type | Option | The method to authenticate your connection. Note The configuration requirements are different depending on the Auth Type you select. See Authentication for more details on available authentication types. |
Yes | Save Credentials | Option | Select this option after you enter your connection details. We recommend selecting this option to allow credentials to be shared with users across the Collibra DQ platform. |
No | Driver Properties | String | The configurable driver properties for your connection. Multiple properties must be comma delimited. For example, abc=123,test=true |
No | Archive Breaking Records | Option | Select this option to automatically export CSV files containing the breaking records of a DQ Job to your S3 bucket. |
No | Archive Location | String | The path to which breaking records are archived when the Archive Breaking Records and HTTPS options are enabled. Specify an output location in your data source to which breaking records are sent. For example, /write/FolderName Note This option is only available when the HTTPS option is enabled. |
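Several of the properties above depend on each other (for example, Path Style Access and Archive Location are only available when HTTPS is enabled). The following is a minimal sketch of how those rules could be checked before you enter the values in Collibra DQ. The `S3ConnectionSettings` class and `validate` function are hypothetical helpers, not part of any Collibra API. The check that the URL starts with https:// when HTTPS is selected is an assumption based on the HTTPS option's description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class S3ConnectionSettings:
    """Hypothetical container mirroring the connection property fields."""
    name: str
    connection_url: str
    region: str = "US_EAST_1"          # default region per the table
    https: bool = False
    path_style_access: bool = False
    bucket_name: str = ""
    archive_breaking_records: bool = False
    archive_location: str = ""


def validate(settings: S3ConnectionSettings) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    if not settings.name or any(ch.isspace() for ch in settings.name):
        problems.append("Name must be non-empty and contain no spaces.")
    # Assumption: with HTTPS selected the URL uses https://, otherwise s3://.
    expected_scheme = "https://" if settings.https else "s3://"
    if not settings.connection_url.startswith(expected_scheme):
        problems.append(f"Connection URL must start with {expected_scheme}.")
    if settings.https and not settings.bucket_name:
        problems.append("Bucket Name is required when HTTPS is selected.")
    if settings.path_style_access and not settings.https:
        problems.append("Path Style Access is only available when HTTPS is enabled.")
    if settings.archive_location and not settings.https:
        problems.append("Archive Location is only available when HTTPS is enabled.")
    return problems


if __name__ == "__main__":
    draft = S3ConnectionSettings(name="s3_sales", connection_url="s3://example_bucket")
    print(validate(draft))
```

Returning a list of messages rather than raising on the first problem lets you review every conflicting setting in one pass.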
Authentication
Select an authentication type from the dropdown menu. The options available in the dropdown menu are the currently supported authentication types for this data source.
In the Driver Properties input field on the Properties tab, add the following property, adjusting the parameterized section (${ }) according to your IdP details: dq.idp.url=${IdP-URL}
Note If your IdP uses scopes, you may need to add dq.idp.scopes=${IdP-scope} after the IdP URL.
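The Driver Properties field takes comma-delimited key=value pairs, so the IdP properties above can be assembled or checked with a small helper. This is only a sketch: the function names are hypothetical, the IdP URL and scope values are placeholders, and it assumes that neither keys nor values contain commas.

```python
from typing import Dict


def format_driver_properties(props: Dict[str, str]) -> str:
    """Join properties into the comma-delimited key=value form."""
    return ",".join(f"{key}={value}" for key, value in props.items())


def parse_driver_properties(raw: str) -> Dict[str, str]:
    """Split a comma-delimited key=value string back into a dictionary.

    Assumption: values do not contain commas. Values may contain '='
    because only the first '=' in each pair is treated as the separator.
    """
    props = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Expected key=value, got {item!r}")
        props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    # Placeholder values; dq.idp.scopes comes after dq.idp.url.
    idp_props = {
        "dq.idp.url": "https://idp.example.com/sso",
        "dq.idp.scopes": "openid",
    }
    print(format_driver_properties(idp_props))
```

Python dictionaries keep insertion order, so listing `dq.idp.url` first places the scopes property after the IdP URL in the resulting string.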
Password Manager
Field | Description |
---|---|
Username | The username used to sign into your password manager service. |
Password | Note Password is not required when you use Password Manager. |
Script | The script that contains the SAML role. The password manager program contains the information needed to automate password-related tasks when you attempt to sign in to Collibra DQ. |
App ID | The Application Identifier used to help the IdP service authorize Collibra DQ when you attempt to sign in. |
Safe | The storage vault where your password and other authentication information are stored. |
Pwd Mgr Name | The name of your safe. |
accessRole | The IAM role associated with your EC2 instance. |
Connect to NetApp or Amazon S3 endpoint in URI format
To connect to NetApp or Amazon S3 endpoints in URI format, select the HTTPS option, specify a Bucket Name, and then add one of the following properties to the Properties tab, depending on the type of endpoint you are connecting to:
Connection | Property |
---|---|
Amazon S3 endpoint URI | s3-endpoint=s3 |
NetApp | s3-endpoint=netapp |
Connect to MinIO with path style access
Note The ability to connect to MinIO is only available in Collibra DQ version 2024.03 or newer.
To connect to MinIO using path style access, select the HTTPS and Path Style Access options, specify a Bucket Name, and then add the following property to the Properties tab:
Connection | Property |
---|---|
MinIO | s3-endpoint=minio |
Important When connecting to MinIO, you must select the Path Style Access option.
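For context, S3-compatible stores can generally be addressed in two ways: virtual-hosted style, where the bucket is part of the host name, and path style, where the bucket is the first segment of the path. The sketch below illustrates the difference in general terms. It does not show how Collibra DQ builds its URLs internally, and the function names, host names, and object key are hypothetical.

```python
def virtual_hosted_url(endpoint_host: str, bucket: str, key: str = "") -> str:
    """Virtual-hosted style: the bucket name is a prefix of the host name."""
    return f"https://{bucket}.{endpoint_host}/{key}"


def path_style_url(endpoint_host: str, bucket: str, key: str = "") -> str:
    """Path style: the bucket name is the first segment of the URL path."""
    return f"https://{endpoint_host}/{bucket}/{key}"


if __name__ == "__main__":
    # Hypothetical endpoints and key, for illustration only.
    print(virtual_hosted_url("s3.us-east-1.amazonaws.com", "example_bucket", "data/file.csv"))
    print(path_style_url("minio.example.internal:9000", "example_bucket", "data/file.csv"))
```

Path style does not require a DNS entry per bucket, which is one common reason self-hosted object stores are addressed this way.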
Known limitations
- Instance Profile is not supported on DQ Cloud deployments.
- Filtergram on the Data Preview tab is not available for S3 connections. There is currently no workaround for this limitation.
- When you use the Spark 3.2.2 standalone installation that comes with the AWS Marketplace installation, the AWS S3 Archive feature is affected by a jar mismatch: the existing hadoop-aws-3.2.1.jar file is incompatible with the feature.
  - A workaround is to replace hadoop-aws-3.2.1.jar with hadoop-aws-3.3.1.jar in the spark/jars directory. You can obtain the necessary .jar file from Apache Downloads.
Note If you encounter any difficulties locating the necessary .jar file on the Apache Downloads page, contact your CS or SE for assistance.
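The jar replacement described in the workaround could be scripted roughly as follows. This is a sketch under assumptions, not a Collibra-provided procedure: the function name and backup directory are hypothetical, you must already have downloaded the new jar, and the script does not check whether any other jars need to change. Stop Spark before running something like this and test it in a non-production environment first.

```python
import shutil
from pathlib import Path
from typing import Optional


def swap_hadoop_aws_jar(
    jars_dir: Path,
    new_jar: Path,
    old_name: str = "hadoop-aws-3.2.1.jar",
    backup_dir: Optional[Path] = None,
) -> Path:
    """Move the old hadoop-aws jar out of spark/jars and copy the new one in.

    Returns the path of the newly installed jar.
    """
    jars_dir = Path(jars_dir)
    new_jar = Path(new_jar)
    if not new_jar.is_file():
        raise FileNotFoundError(f"New jar not found: {new_jar}")
    if not jars_dir.is_dir():
        raise NotADirectoryError(f"Jars directory not found: {jars_dir}")

    old_jar = jars_dir / old_name
    if old_jar.exists():
        # Keep the old jar outside spark/jars so it stays off the classpath
        # but can still be restored. The default location is an assumption.
        backup = Path(backup_dir) if backup_dir else jars_dir.parent / "jars_backup"
        backup.mkdir(parents=True, exist_ok=True)
        shutil.move(str(old_jar), str(backup / old_name))

    target = jars_dir / new_jar.name
    shutil.copy2(new_jar, target)
    return target
```

For example, `swap_hadoop_aws_jar(Path("spark/jars"), Path("downloads/hadoop-aws-3.3.1.jar"))` moves the 3.2.1 jar to `spark/jars_backup` and installs the 3.3.1 jar. Both paths here are placeholders for your own installation.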