About remote file connections

This section gives an overview of the supported file formats and the limitations when connecting to a remote file.

Supported file types

Because file formats differ in structure, you may need to prepare your data before establishing a connection.

Type                 File structure     Notes
CSV (.csv)           Structured         The default delimiter is a comma (,).
PARQUET (.parquet)   Structured         N/A
AVRO (.avro)         Structured         N/A
JSON (.json)         Semi-structured    N/A
ORC (.orc)           Semi-structured    N/A
XML (.xml)           Semi-structured    N/A
DELTA (.delta)       Semi-structured    N/A
HUDI (.hudi)         Structured         See the Hudi notes that follow this table.

You can create DQ jobs for Hudi files only at the directory level, not at the individual file level. To create a DQ job from the Remote File Connections section of Explorer, expand a directory, then click Create DQ Job.

In the following screenshot, the arrows point to directories where you can create DQ jobs. In this example, you cannot create DQ jobs for the yellow OTHS files beneath the .hoodie directory.
[Screenshot: create a DQ job from the directory level, not the individual file level]

Note: After you select HUDI as your file type, you do not need to change any of the default file information, such as delimiter or charset.

Tip: The Hudi Spark connector requires a separate package to work properly due to a security vulnerability in the Hudi bundle jar. Contact your CSM for more information about accessing this package.

Supported delimiters

The following table lists the delimiters available in the Delimiter dropdown menu.

Type           Format   Description
Comma          CSV      , is used to separate values in the file. This is the default delimiter for files.
Tab            TSV      A tab is used to separate values in the file.
Semicolon      CSV      ; is used to separate values in the file.
Double Quote   CSV      " is used to separate values in the file.
Single Quote   CSV      ' is used to separate values in the file.
Pipe           TXT      | is used to separate values in the text file.
SOH            TXT      START OF HEADING (U+0001), an invisible Unicode control character.
Custom         N/A      Add a custom delimiter. Support for custom delimiters may vary.
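
To check which delimiter a file actually uses before you pick one from the dropdown, a short local script can help. The following is a minimal sketch using only the Python standard library; it is not part of Collibra DQ, and the names DELIMITERS and parse_rows are hypothetical helpers introduced for illustration.

```python
import csv
import io

# Characters corresponding to several dropdown entries described above.
# This mapping is illustrative only; it is not a Collibra DQ API.
DELIMITERS = {
    "comma": ",",
    "tab": "\t",
    "semicolon": ";",
    "pipe": "|",
    "soh": "\x01",  # START OF HEADING (U+0001), invisible in most editors
}


def parse_rows(text, delimiter_name):
    """Split delimited text into rows using the named delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=DELIMITERS[delimiter_name])
    return list(reader)


pipe_sample = "id|name\n1|alpha\n"
soh_sample = "id\x01name\n1\x01alpha\n"

# Both samples should split into the same two-column rows.
print(parse_rows(pipe_sample, "pipe"))
print(parse_rows(soh_sample, "soh"))
```

If a file parses into a single column with the delimiter you expected, it probably uses a different one. This is especially easy to miss with SOH, because the character does not display in most text editors.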

Known limitations

  • DQ jobs that run on remote file connections fail with a "requirement failed" exception when the file headers contain white spaces. A possible workaround is to edit the DQ job command line on the Run CMD tab: place single quotes ('') around the column name in -q and double quotes ("") around the contents of the -header flag.
  • Filtergram on the Data Preview tab is not available for any remote file connection. There is currently no workaround for this limitation.
  • Array and nested array data types in JSON files are not supported.
  • Collibra DQ supports most UTF-8 encoded characters in column headers of file-based connections, but some Chinese characters are not currently supported. Jobs that run on files with these unsupported characters fail with a "mismatched input" exception.
  • When you use Validate Source, the Update Source Scope button is not available for remote files. Update Source Scope is only visible for JDBC connections.
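
Two of the limitations above, headers that contain white spaces and arrays in JSON files, can be spotted in a file before you connect to it. The following is a rough pre-check sketch in standard-library Python, not a Collibra DQ feature. The function names are hypothetical, and the JSON check assumes one JSON object per line, which may not match the layout of your files.

```python
import csv
import io
import json


def headers_with_whitespace(csv_text, delimiter=","):
    """Return the header names in the first row that contain whitespace."""
    first_row = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter), [])
    return [name for name in first_row if any(ch.isspace() for ch in name)]


def contains_array(value):
    """Return True if a parsed JSON value holds a list at any depth."""
    if isinstance(value, list):
        return True
    if isinstance(value, dict):
        return any(contains_array(v) for v in value.values())
    return False


def array_fields(json_lines_text):
    """Return the top-level keys whose values contain arrays.

    Assumes one JSON object per non-empty line (an assumption of this sketch).
    """
    flagged = set()
    for line in json_lines_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for key, value in record.items():
            if contains_array(value):
                flagged.add(key)
    return sorted(flagged)


print(headers_with_whitespace("order id,amount,ship date\n1,9.5,2024-01-01\n"))
# prints ['order id', 'ship date']

print(array_fields('{"id": 1, "tags": ["a"]}\n{"id": 2, "meta": {"k": [1]}}\n'))
# prints ['meta', 'tags']
```

Headers that the first function reports are the ones that would need the quoting workaround described above. Fields that the second function reports contain arrays, which are not supported.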