Supported Connections

This page lists the supported data source connection types. A supported data source is one whose driver ships with the Collibra DQ images or standalone bundles and is therefore eligible for support from the Collibra DQ team.

Note Any data source that is compatible with the Java version and server to which you are connected can be used. However, if an issue occurs with an unsupported data source, we cannot guarantee support.

Production

The following is a list of drivers certified for production use.

Connections - Currently Supported

| Connection | Certified | Tested | Packaged | Optionally Packaged | Pushdown | Estimate Job | Filtergram | Analyze Data | Schedule | Spark Agent | Yarn Agent | Parallel JDBC | Session State | Kerberos Password | Kerberos Password Manager | Kerberos Keytab | Kerberos TGT | Standalone (non-Livy) | JDK8 Driver Compatibility | JDK11 Driver Compatibility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Athena | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Athena CDATA | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | No | Yes | Yes | Yes |
| BigQuery | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| BigQuery CDATA | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Databricks JDBC | Yes | Yes | No | Yes | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | Yes |
| Databricks CDATA | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | No | Yes | Yes | Yes |
| DB2 | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Dremio | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Hive | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Hive CDATA | No | No | Yes | No | No | No | Yes | No | No | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Impala | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Impala CDATA | No | No | Yes | No | No | No | Yes | No | No | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Microsoft SQL | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | No |
| MySQL | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Oracle | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Postgres | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Presto | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Redshift | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Snowflake | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |
| Sybase | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | No | Yes | Yes | Yes |
| Teradata | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes |

Tip A connection listed as Tested is one for which the Collibra DQ team maintains an environment and which is included in regular regression testing.

Note The Dremio connection is compatible with JDK11 if you add the following JVM option to owlmanage.sh for the web and Spark instances:
-Dcdjd.io.netty.tryReflectionSetAccessible=true
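As a minimal sketch of what this could look like in owlmanage.sh, assuming the script forwards extra JVM options to the web and Spark processes through an environment variable (the variable name below is illustrative, not a documented setting):

# Illustrative only: pass the Netty reflection flag to the JVMs started by owlmanage.sh
export EXTRA_JVM_OPTIONS="-Dcdjd.io.netty.tryReflectionSetAccessible=true"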

Remote Connections - Currently Supported

| Connection | Certified | Tested | Packaged | Optionally Packaged | Pushdown | Estimate Job | Filtergram | Analyze Data | Spark Agent | Yarn Agent |
|---|---|---|---|---|---|---|---|---|---|---|
| Azure Data Lake (Gen2) | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes |
| Google Cloud Storage | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| HDFS | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes |
| S3 | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes |

Under Evaluation

The following drivers are under evaluation and not yet certified for production use. These connections are currently ineligible for escalated support services.

Connections - Tech Preview

| Connection | Certified | Tested | Packaged | Optionally Packaged | Pushdown | Estimate Job | Filtergram | Analyze Data | Schedule | Spark Agent | Yarn Agent | Parallel JDBC | Session State | Kerberos Password | Kerberos Password Manager | Kerberos Keytab | Kerberos TGT | Standalone (non-Livy) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cassandra | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
| MongoDB | No | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | No | No | No | No | No | No | Yes |
| MongoDB CDATA | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | No | Yes |
| SAP HANA | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
| Solr | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |

Streaming - Tech Preview

| Connection | Certified | Tested | Packaged | Optionally Packaged | Pushdown | Estimate Job | Filtergram | Analyze Data | Schedule | Spark Agent | Yarn Agent | Parallel JDBC | Session State | Kerberos Password | Kerberos Password Manager | Kerberos TGT | CRDB Metastore | Standalone (non-Livy) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kafka | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |

Files

| File Type | Supported |
|---|---|
| CSV (and all delimiters) | Yes |
| Parquet | Yes |
| AVRO | Yes |
| JSON | Yes |
| DELTA | Yes |

Limitations

Authentication

  • DQ Jobs that require Kerberos TGT are not supported on Spark Standalone or Local deployments.
    • It is recommended to submit such jobs via Yarn or K8s.

File Limitations

File Sizes

  • Files with more than 250 columns are not supported in File Explorer, unless you have Livy enabled.
  • Files larger than 5 GB are not supported in File Explorer, unless you have Livy enabled.
  • Smaller files allow for skip scanning and more efficient processing.
  • Advanced features such as replay, scheduling, and historical lookbacks require a date signature in the folder or file path, as in the example below.
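For example, a folder layout like the following (the bucket and dataset names are hypothetical) carries the run date in the path, which is what allows each scheduled or replayed run to be resolved to a date:

s3://my-bucket/sales/2024-05-01/transactions.csv
s3://my-bucket/sales/2024-05-02/transactions.csv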

S3

  • Ensure there are no spaces in the S3 connection name.
  • Remember to select the 'Save Credentials' checkbox when establishing the connection.
  • Point the connection to the root bucket, not to a subfolder, as in the example below.
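For example (the bucket name is hypothetical):

s3://company-data             <- correct: the connection points at the root bucket
s3://company-data/finance     <- incorrect: the connection points at a subfolder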

Local Files

  • Local files can only be run using the NO_AGENT default.
  • This option is intended for quick testing, smaller files, and demonstration purposes.
  • Local file scanning is not intended for large-scale production use.

Livy

  • Livy is only supported for K8s environments.

Spark Engine Support

  • MapR is end-of-life (EOL), and the MapR Spark engine is not supported for running Collibra DQ jobs.

Databricks

Refer to the CDQ Production Support for Databricks page for more details on Databricks support.

The only supported Databricks spark-submit option is to use a notebook to initiate the job (Scala and PySpark options). This is intended for pipeline developers and users familiar with Databricks and notebooks. This form factor is ideal for incorporating data quality within existing Spark ETL data flows; the results are still available for business users to consume, but the configuration is not intended for business users to implement.

There are three ways that Databricks users can run DQ jobs using a Databricks cluster or a JDBC connection.

1. Notebook

Users can open a notebook directly, upload the Collibra DQ jars, and run a DQ job on a Databricks cluster. The full steps are explained on the page below. Collibra DQ supports this flow in production.

https://dq-docs.collibra.com/apis-1/notebook/cdq-+-databricks

2. Spark-Submit

There are two ways to run a spark-submit job on a Databricks cluster: through the Databricks UI, or by invoking the Databricks REST API. We have tested both approaches against different Databricks cluster versions (see the table on the page linked below). The full documentation demonstrating these paths is available at: https://dq-docs.collibra.com/apis-1/notebook/cdq-+-databricks/dq-databricks-submit

Note that these are only examples that demonstrate how to spark-submit a DQ job to a Databricks cluster. These paths are not supported in production, and the Collibra DQ team does not provide bug coverage, professional services, or support for customer questions about these flows.
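For illustration only, and subject to the note above: a spark-submit run can be triggered through the Databricks Jobs REST API (POST /api/2.0/jobs/runs/submit) with a spark_submit_task. The main class, jar path, node type, and job arguments below are placeholders, not Collibra-documented values:

# Illustrative only: submit a one-time spark-submit run on a new Databricks cluster
curl -X POST https://<databricks-instance>/api/2.0/jobs/runs/submit \
  -H "Authorization: Bearer <personal-access-token>" \
  -d '{
    "run_name": "dq-spark-submit-example",
    "new_cluster": {
      "spark_version": "7.3.x-scala2.12",
      "node_type_id": "<node-type>",
      "num_workers": 1
    },
    "spark_submit_task": {
      "parameters": ["--class", "<dq-main-class>", "dbfs:/<path-to-dq-jar>", "<dq-job-arguments>"]
    }
  }'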

3. JDBC

Collibra DQ users can create JDBC connections in the Collibra DQ UI and connect to their Databricks database. This is scheduled for the 2022.05 release.

Warning Delta Lake and JDBC connectivity have been validated against the Spark 3.0.1 Collibra DQ package, Databricks 7.3 LTS, and SparkJDBC41.jar. This is available as a preview. No other combinations have been certified at this time.
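As a sketch of what such a connection string looks like, a Simba Spark JDBC URL (the driver shipped as SparkJDBC41.jar) typically takes the following shape; the workspace host, HTTP path, and token below are placeholders:

jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=<cluster-http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>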

Warning Spark submit using the Databricks Spark master URL is not supported.
