Connecting to Google BigQuery

This section contains details for Google BigQuery connections.

Select an option from the dropdown menu to display information for a particular driver class.

General information

Field	Description
Data source	Google BigQuery
Supported versions	1.5.2.1005 21.0.8160.0
Connection string	`jdbc:bigquery://jdbc:googlebigquery:`
Packaged?	No Yes
Certified?	Yes
Supported features
Estimate job	Yes
Analyze data	Yes
Schedule	Yes No
Processing capabilities
Pushdown	Yes
Spark agent	Yes
Yarn agent	Yes No
Parallel JDBC	Yes
Java Platform version compatibility
JDK 8	Yes
JDK 11	Yes

Minimum user permissions

To bring your BigQuery data into Data Quality & Observability Classic, you need the following permissions:

Read access on your BigQuery dataset tables.
BigQuery job editor access.
Write access on the materialized view datasets.
ROLE_ADMIN assigned to your user in Collibra DQ.

Networking and firewalls

Account for the following URLs from networking and firewall perspectives:

logging.googleapis.com
oauth2.googleapis.com
googleapis.com
bigquerystorage.googleapis.com
bigquery.googleapis.com

Recommended and required connection properties

Required	Connection Property	Type	Value
Yes	Name	Text	The unique name of your connection. Ensure that there are no spaces in your connection name.
Yes	Connection URL	String	The connection string path of your BigQuery connection. When referring to the example below, replace the `${value}` sections of the connection URL with your actual value. Example `jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=0;ProjectId=${projectId};OAuthServiceAcctEmail=${service-account};OAuthPvtKeyPath=${/path/to/key};TimeOut=3600jdbc:googlebigquery:AuthScheme="OAuthJWT";OAuthSettingsLocation=memory://;ProjectId=${projectId};InitiateOAuth="GETANDREFRESH";OAuthJWTCert="/tmp/keytab/bigquery.json";OAuthJWTCertType="GOOGLEJSON";OAuthJWTIssuer="iam.service.account";`
Yes	Driver Name	String	The driver class name of your BigQuery connection. `com.simba.googlebigquery.jdbc42.Drivercdata.jdbc.googlebigquery.GoogleBigQueryDriver`
Yes	Port	Integer	The port number to establish a connection to the datasource. The default port is `443`
No	Limit Schemas	Option	Allows you to manage usage and restrict visibility to only the necessary schemas in the Explorer tree. See Limiting schemas to learn how to limit schemas from the Connection Management page. Note When you include a restricted schema in the query of a DQ Job, the query scope may be overwritten when the job runs. While only the schemas you selected when you set up the connection are shown in the Explorer menu, users are not restricted from running SQL queries on any schema from the data source.
No	Source Name	String	N/A
No	Target Agent	Option	The Agent that submits your Spark job for processing.
Yes	Auth Type	Option	The method to authenticate your connection. Note The configuration requirements differ depending on the Auth Type you select. See Authentication for more details on available authentication types.
No	Properties	String	The configurable driver properties for your connection. Multiple properties must be semicolon delimited. For example, abc=123;test=true

Authentication

Select an authentication type from the dropdown menu. The options available in the dropdown menu are the currently supported authentication types for this data source.

Required	Field	Description
Yes	Username	The username of your BigQuery service account. Set Username to any value.
Yes	Password	The password of your BigQuery service account. Enter the base64 encoded string of your OAuth key. base64 encoded commands To encode your credential JSON file as a base64 encoded string, use one of the following commands: `base64 your_json.json -w 0` or `cat your_json.json \| base64 -w 0` Tip Your JSON file is typically available to download in GCP when you create a service account in your IAM profile.
Yes	Script	The file path that contains the script file that the password manager uses to interact with and authenticate a user account. Example /tmp/keytab/bigquery_pwd_mgr.sh
No	Param $1	Optional. An additional parameters to authenticate your BigQuery connection.
No	Param $2	Optional. An additional parameter to authenticate your BigQuery connection.
No	Param $3	Optional. An additional parameter to authenticate your BigQuery connection.
Yes	JSON Credential File	The JSON file containing service account access credentials from your local file storage. Click anywhere in the Service Account Credential field, then upload the JSON credential file containing your Google IAM service account key. Plain text JSON example The JSON file contains the following service account information: Copy { "type": "service_account", "project_id": "owl-hadoop-cdh", "private_key_id": "<your-private-key-id-string>", "private_key": "-----BEGIN PRIVATE KEY-----\<encoded-string>\n-----END PRIVATE KEY-----\n", "client_email": "<your-service-account>@developer.gserviceaccount.com", "client_id": "your-service-account-id-string", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/<your-service-account>@developer.gserviceaccount.com" } Tip Your JSON file is typically available to download in GCP when you create a service account in your IAM profile. Note The cdata.jdbc.googlebigquery.GoogleBigQueryDriver driver does not support JSON Credential File authentication.
Yes	Workload Identity	Refer to Configuring Workload Identity for Google BigQuery for information on how to set it up. Because the Connection URL contains the Project ID associated with the GCP cluster that was set up with the access grant, you only need to select Workload Identity from the Authentication Type dropdown menu, specify an agent, and fill out the other required fields in the connection template once you configure Workload Identity.

Views

When creating views from different datasets (collections), you can add the viewsEnabled=true parameter to the connection property when defining the connection.

Note
For read/write access to BigQuery, you can use the Spark BigQuery connector. To use this connector, ensure that you have the following configurations:

Set viewsEnabled=true.
Set materializationDataset to a dataset where the GCP user has table creation permission.
Optionally set materializationProject.

Important
The connector is implemented by running the query on BigQuery and the results are saved into a temporary table. Spark will then read the results, which could result in additional costs on your BigQuery account.

Miscellaneous

Connection property	Description
`AllowLargeResults`	When set to `true`, this option allows results larger than the maximum response size when reading large tables.
`QueryPassthrough`	When set to `true`, this option passes the query to the Google BigQuery server as is, which can help overcome some SQL syntax errors for valid SQL queries.

Command line example for basic spark-submit job

Copy

-lib "/opt/owl/drivers/bigquery/bigquery/core/" \
-h <IP_ADDRESS>:5432/postgres \
-master spark://<SPARK_MASTER>:7077 \
-ds samples.loan_customer \
-deploymode client \
-q "select * from samples.loan_customer" \
-rd "2021-08-02" \
-driver "com.simba.googlebigquery.jdbc42.Driver" \
-cxn BigQuery

Troubleshooting a connection

Password for the BigQuery Connector form in Collibra DQ must be a base64 encoded string created from the JSON file and input as password. For example: base64 your_json.json -w 0 or cat your_json.json | base64 -w 0
Check that this JAR exists and is on the path of the Collibra DQ Web UI server. For example, <INSTALL_PATH>/owl/drivers/bigquery/core). Look at your driver directory location which contains this BigQuery JAR: spark-bigquery_2.12-0.35.1.jar
Ensure these JARs present in <INSTALL_PATH>/owl/drivers/bigquery/:****animal-sniffer-annotations-1.23.jar,annotations-4.1.1.4.jar,api-common-2.20.0.jar,auto-value-1.10.4.jar,auto-value-annotations-1.10.4.jar,avro-1.11.3.jar,cdata.jdbc.googlebigquery.jar,commons-codec-1.16.0.jar,commons-compress-1.26.0.jar,commons-logging-1.2.jar,conscrypt-openjdk-uber-2.5.2.jar,core,error_prone_annotations-2.22.0.jar,failureaccess-1.0.1.jar,gax-2.37.0.jar,gax-grpc-2.37.0.jar,google-api-client-2.2.0.jar,google-api-services-bigquery-v2-rev20231008-2.0.0.jar,google-auth-library-credentials-1.20.0.jar,google-auth-library-oauth2-http-1.20.0.jar,google-cloud-bigquerystorage-2.46.0.jar,google-cloud-core-2.27.0.jar,google-http-client-1.43.3.jar,google-http-client-apache-v2-1.42.3.jar,google-http-client-gson-1.43.3.jar,google-oauth-client-1.34.1.jar,GoogleBigQueryJDBC42.jar,grpc-alts-1.59.0.jar,grpc-api-1.59.0.jar,grpc-auth-1.59.0.jar,grpc-context-1.59.0.jar,grpc-core-1.59.0.jar,grpc-google-cloud-bigquerystorage-v1-2.46.0.jar,grpc-google-cloud-bigquerystorage-v1beta1-0.170.0.jar,grpc-google-cloud-bigquerystorage-v1beta2-0.170.0.jar,grpc-googleapis-1.59.0.jar,grpc-grpclb-1.59.0.jar,grpc-inprocess-1.59.0.jar,grpc-netty-shaded-1.59.0.jar,grpc-protobuf-1.59.0.jar,grpc-protobuf-lite-1.59.0.jar,grpc-stub-1.59.0.jar,grpc-util-1.59.0.jar,gson-2.10.1.jar,guava-32.1.2-jre.jar,httpclient-4.5.14.jar,httpcore-4.4.16.jar,j2objc-annotations-2.8.jar,jackson-annotations-2.14.2.jar,jackson-core-2.14.2.jar,jackson-databind-2.14.2.jar,javax.annotation-api-1.3.2.jar,joda-time-2.10.10.jar,json-20231013.jar,jsr305-3.0.2.jar,listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar,opencensus-api-0.31.1.jar,opencensus-contrib-http-util-0.31.1.jar,perfmark-api-0.26.0.jar,proto-google-cloud-bigquerystorage-v1-2.46.0.jar,proto-google-cloud-bigquerystorage-v1beta1-0.170.0.jar,proto-google-cloud-bigquerystorage-v1beta2-0.170.0.jar,proto-google-common-protos-2.28.0.jar,proto-google-iam-v1-1.23.0.jar,protobuf-java-3.24.4.jar,protobuf-java-util-3.24.4.jar,slf4j-api-1.7.36.jar,threetenbp-1.6.8.jar
You may get a CLASSPATH conflict regarding the JAR files.
Ensure the BigQuery connector Scala version matches your Spark Scala version.

Limitations

A limitation exists where Spark is unable to process columns of DateTime data types from BigQuery, which causes it to incorrectly parse DateTime columns as string columns.
- A possible workaround is to cast the DateTime column to timestamp so it processes as a timestamp column instead of a string.
A limitation exists where Spark does not support decimal values with fixed precision greater than 38. For more information, see the official Spark documentation.
- A possible workaround is to cast the data type as a string.
The GEOGRAPHY column data type is not supported on Pushdown for BigQuery.
When referencing quoted identifiers with backticks ` in the query connection, add -corefetchmode to the DQ Job command line to prevent the removal of the backticks ` at runtime.
Using subqueires in a dataset query is currently unsupported. Instead, we recommend changing the query to a view.