Connecting to Hadoop Distributed File System (HDFS)

This section gives an overview of connecting Collibra DQ to Hadoop Distributed File System (HDFS).

General information

Field               Description
Data source         Hadoop Distributed File System (HDFS)
Supported versions  N/A
Connection string   hdfs://
Packaged?           Yes
Certified?          Yes

Supported features
Analyze data              Yes
Archive breaking records  No
Estimate job              Yes
Pushdown                  No

Processing capabilities
Spark agent               Yes
Yarn agent                Yes

Minimum user permissions

For Collibra DQ to access your HDFS bucket, you need the following permission:

  • Read access to the path in your HDFS connection.
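One way to confirm read access before creating the connection is the Hadoop file system shell. The sketch below is a minimal example, not part of Collibra DQ: it assumes the `hdfs` command-line client is installed on the machine running the check and that your Hadoop release supports the `-r` option of `hdfs dfs -test`. The function names and the host `namenode.example.com` are hypothetical.

```python
import subprocess
from typing import List


def build_read_check_command(path: str) -> List[str]:
    """Return the `hdfs dfs -test -r` command for the given HDFS path."""
    if not path.startswith("hdfs://"):
        raise ValueError("path must start with hdfs://")
    # -test -r exits with 0 if the path exists and read permission is granted.
    return ["hdfs", "dfs", "-test", "-r", path]


def has_read_access(path: str) -> bool:
    """Run the check as the current user; requires the hdfs CLI on PATH."""
    result = subprocess.run(build_read_check_command(path), check=False)
    return result.returncode == 0


# Hypothetical usage (host name is a placeholder):
# has_read_access("hdfs://namenode.example.com:8020/")
```

Run the check as the same user or service principal that Collibra DQ will use, since HDFS permissions are evaluated per user.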

Recommended and required connection properties

Required  Connection Property  Type    Value
Yes       Name                 Text    The unique name of your connection. Do not use spaces in your connection name and only use valid characters.
Yes       Connection URL       String  The connection string path of your HDFS connection. The path must start with hdfs:// and point to the root bucket, not a sub-folder. Example: hdfs://<name node>:8020/
Yes       Target Agent         Option  The Agent used to submit your DQ Jobs.
Yes       Auth Type            Option  The method to authenticate your connection. Note: The configuration requirements are different depending on the Auth Type you select. See Authentication for more details on available authentication types.
Yes       Save Credentials     Option  Select this option after you enter your connection details.
No        Properties           String  The configurable driver properties for your connection. Multiple properties must be comma delimited (for example, abc=123,test=true). To ensure that your remote procedures are secure within Collibra DQ, we recommend defining the following driver property: hadoop.rpc.protection=privacy
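The Connection URL and Properties rules above can be checked before you enter the values. The following is a minimal sketch under stated assumptions, not Collibra DQ's own validation: the helper names `is_root_hdfs_url` and `parse_properties` are hypothetical, and the host `namenode.example.com` is a placeholder.

```python
from typing import Dict
from urllib.parse import urlparse


def is_root_hdfs_url(url: str) -> bool:
    """True if url uses the hdfs scheme, names a host, and has no sub-folder."""
    parsed = urlparse(url)
    return (
        parsed.scheme == "hdfs"
        and bool(parsed.netloc)
        and parsed.path in ("", "/")
    )


def parse_properties(raw: str) -> Dict[str, str]:
    """Split a comma-delimited key=value string into a dictionary."""
    props: Dict[str, str] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        props[key.strip()] = value.strip()
    return props


print(is_root_hdfs_url("hdfs://namenode.example.com:8020/"))
print(parse_properties("abc=123,test=true"))
print(parse_properties("hadoop.rpc.protection=privacy"))
```

The parser assumes that neither keys nor values contain commas, which matches the comma-delimited format described for the Properties field.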

Authentication

Select an authentication type from the dropdown menu, which lists the authentication types currently supported for this data source.

Field      Description
Principal  The service principal used to let Collibra DQ access your connection.
Key        The keytab used to authorize your connection.
TGT        The Ticket Granting Ticket used to authorize your connection.
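Keytabs and Ticket Granting Tickets are Kerberos concepts, so it can help to confirm outside Collibra DQ that the principal and keytab work together. The sketch below assumes the MIT Kerberos `kinit` and `klist` tools are installed. It only builds the commands, and it does not show how Collibra DQ itself uses these fields. The principal and keytab path are hypothetical placeholders.

```python
from typing import List


def build_kinit_command(principal: str, keytab_path: str) -> List[str]:
    """Command that requests a TGT for `principal` using a keytab file."""
    # -k: authenticate with a keytab; -t: path to that keytab file.
    return ["kinit", "-k", "-t", keytab_path, principal]


def build_tgt_check_command() -> List[str]:
    """Command that exits with 0 when the credential cache is readable and not expired."""
    # -s: run silently and report the result through the exit status only.
    return ["klist", "-s"]


# Hypothetical values for illustration only.
print(build_kinit_command("dq-service@EXAMPLE.COM", "/etc/security/keytabs/dq.keytab"))
print(build_tgt_check_command())
```

Each command list can be passed to `subprocess.run` on a host that has the Kerberos client tools and a valid `krb5.conf`.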