Supported remote files

This section includes the connection details of remote files supported by Collibra DQ.

Supported remote file data sources

Data Source                             Packaged  Certified  Archive Break Records
Amazon S3                               Yes       Yes        Yes
Azure Blob Storage                      Yes       Yes        Yes
Azure Data Lake Storage (Gen2)          Yes       Yes        Yes
Google Cloud Storage                    Yes       Yes        No
Hadoop Distributed File System (HDFS)   Yes       Yes        No
Network File System (NFS)               Yes       Yes        No

Supported file types

Because file formats differ in structure, you may need to prepare your data before establishing a connection.

Note The file types listed below are allowed by default through the ALLOWED_UPLOAD_FILE_TYPES variable. To change the default list, edit the associated config file.

Type                File structure   Notes
CSV (.csv)          Structured       The default delimiter is a comma (,).
PARQUET (.parquet)  Structured       N/A
AVRO (.avro)        Structured       N/A
JSON (.json)        Semi-structured  N/A
ORC (.orc)          Semi-structured  N/A
XML (.xml)          Semi-structured  N/A
DELTA (.delta)      Semi-structured  N/A
DAT (.dat)          Structured       N/A
TSV (.tsv)          Structured       N/A
TXT (.txt)          Unstructured     N/A
HUDI (.hudi)        Structured       See the notes that follow this table.

You can create DQ jobs for Hudi files only at the directory level, not at the individual file level. To create a DQ job from the Remote File Connections section of Explorer, expand a directory, then click Create DQ Job.

Screenshot: Create a DQ job from the directory level, not the individual file level. In the screenshot, the arrows point to directories where you can create DQ jobs. In this example, you cannot create DQ jobs for the yellow OTHS files beneath the .hoodie directory.
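
As a rough illustration of the directory-level rule, the following Python sketch lists the directories that contain a .hoodie folder, which is where Hudi keeps table metadata. It assumes the files are reachable through a local or mounted path (for example, an NFS mount); object stores such as Amazon S3 need their own listing APIs. The function name find_hudi_table_dirs is hypothetical and is not part of Collibra DQ.

```python
from __future__ import annotations

from pathlib import Path


def find_hudi_table_dirs(root: Path) -> list[Path]:
    """Return directories under root that contain a .hoodie metadata folder.

    Hypothetical helper for illustration only. Each returned directory is a
    candidate for a directory-level DQ job; files beneath .hoodie are not.
    """
    return sorted(
        marker.parent
        for marker in root.rglob(".hoodie")
        if marker.is_dir()  # ignore any regular file that happens to share the name
    )
```

Calling find_hudi_table_dirs(Path("/mnt/data")) on a mounted share would return each directory that holds a .hoodie folder; the path shown here is a placeholder.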

Note After you select HUDI as your file type, you do not need to change any of the default file information, such as delimiter or charset.

Tip The Hudi Spark connector requires a separate package to work properly due to a security vulnerability in the Hudi bundle jar. Please reach out to your CSM for more information about accessing this package.
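
Before you upload or connect, it can help to confirm that a file has one of the supported extensions from the file type table in this section. The sketch below is a minimal check under stated assumptions: the set of extensions is copied from that table, and the names DEFAULT_ALLOWED_TYPES and is_allowed_file_type are hypothetical. It does not read ALLOWED_UPLOAD_FILE_TYPES, whose format is not described here, and it does not reflect how Collibra DQ validates files internally.

```python
from pathlib import PurePosixPath

# Extensions copied from the supported file types table (illustrative only).
DEFAULT_ALLOWED_TYPES = frozenset(
    {"csv", "parquet", "avro", "json", "orc", "xml", "delta", "dat", "tsv", "txt", "hudi"}
)


def is_allowed_file_type(file_name: str, allowed=DEFAULT_ALLOWED_TYPES) -> bool:
    """Return True if the file's final extension is in the allowed set."""
    # suffix is the last extension including the dot, or "" if there is none.
    suffix = PurePosixPath(file_name).suffix
    return suffix.lower().lstrip(".") in allowed
```

For example, is_allowed_file_type("landing/orders.CSV") passes because the comparison is case-insensitive, while a file with no extension or with an unlisted extension such as .gz does not.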

Supported delimiters

The following table lists the supported delimiters available in the Delimiter dropdown menu.

Type          Format  Description
Comma         CSV     , is used to separate values in the file. This is the default delimiter for files.
Tab           TSV     A tab is used to separate values in the file.
Semicolon     CSV     ; is used to separate values in the file.
Double Quote  CSV     " is used to separate values in the file.
Single Quote  CSV     ' is used to separate values in the file.
Pipe          TXT     | is used to separate values in the text file.
SOH           TXT     The invisible Unicode control character START OF HEADING (U+0001) is used to separate values in the text file.
Custom        N/A     Add a custom delimiter. Support for custom delimiters may vary.
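
To check which delimiter a file uses before you pick one in the Delimiter dropdown menu, you can parse a small sample locally. The following sketch uses Python's standard csv module and maps several delimiter names from the table above to their characters. The names DELIMITERS and parse_rows are hypothetical, and the quote delimiters are left out because they interact with the csv module's own quoting rules. This is a local sanity check, not a description of how Collibra DQ parses files.

```python
from __future__ import annotations

import csv
import io

# Delimiter names from the table above, mapped to the characters they represent.
DELIMITERS = {
    "Comma": ",",
    "Tab": "\t",
    "Semicolon": ";",
    "Pipe": "|",
    "SOH": "\x01",  # START OF HEADING (U+0001), an invisible control character
}


def parse_rows(text: str, delimiter_name: str = "Comma") -> list[list[str]]:
    """Split delimited text into rows of values using the named delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=DELIMITERS[delimiter_name])
    return [row for row in reader]
```

If a sample parses into a single column per row, the chosen delimiter is probably wrong. For instance, parse_rows("id|name\n1|Ada\n", "Pipe") yields two values per row, whereas the same text parsed with the default Comma yields one.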

Known limitations