Release 2022.10
New Features
Warning For the Collibra Data Quality 2022.10 release, all Docker images run on JDK11. Standalone packages contain JDK8 and JDK11 options. If you are an existing customer who requires JDK11, please upgrade your runtime before upgrading to 2022.10. Most Hadoop environment versions (EMR/HDP/CDH) still run on JDK8, so customers using these environments can upgrade with the JDK8 packages. If you prefer to upgrade to JDK11, you must follow the documentation of your respective Hadoop environment to upgrade to JDK11 before deploying the 2022.10 release.
The MSSQL driver that comes with the JDK11 standalone packages does not currently work in the JDK11 environment; MSSQL requires a separate JAR for JDK11. Please contact your Customer Success Manager for the compatible driver.
Dremio is not currently supported for JDK11 standalone packages. If you plan to run JDK11, add -Dcdjd.io.netty.tryReflectionSetAccessible=true to owlmanage.sh as a JVM option for your Web and Spark instances. Please contact your Customer Success Manager for assistance.
As of October 18, 2022, all images for the 2022.10 release contain a Critical CVE (CVE-2022-42889). If you picked up the 2022.10 release before October 18, 2022, your scans should not be affected; if issues persist, please contact your Customer Success Manager for a new build.
Rules
- You can now define a rule to detect the number of days a job runs without data by using $daysWithoutData.
- You can now define a rule to detect the number of days a job runs with 0 rows by using $runsWithoutData.
- You can now define a rule to detect the number of days since a job last ran by using $daysSinceLastRun.
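As an illustrative sketch (the variable names come from this release, but the comparison operators and thresholds below are assumptions), simple stat rule conditions using these variables might look like the following:

```sql
-- Hypothetical stat rule conditions; thresholds are examples only.
-- Flag the data set when a job has run without data for more than 3 days.
$daysWithoutData > 3

-- Flag the data set when a job has run with 0 rows for more than 2 days.
$runsWithoutData > 2

-- Flag the data set when more than 7 days have passed since the job last ran.
$daysSinceLastRun > 7
```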
Profile
- You can now use a string length feature by toggling the Profile String Length checkbox when you create a data set.
- When Profile String Length is checked, the minimum and maximum length of each string column are saved to the dataset_field table.
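As a rough sketch of where these values land (the dataset_field table name comes from this note, but the metastore schema and the filter column below are assumptions), they can be inspected with a query such as:

```sql
-- Illustrative only: inspect the persisted string-length statistics in the
-- DQ metastore. The filter column name is an assumption, not confirmed here.
SELECT *
FROM dataset_field
WHERE dataset = 'my_dataset';
```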
Validate Source
- You can now write rules against a loaded source data frame when -postclearcache is configured in the agent.
Note The DQ UI will be converted to the React MUI framework with the 2022.11 release. Prior to the 2022.11 release, you can turn the React flag on, but note that some features may be temporarily limited.
Enhancements
DQ Job
- Start Time and Update Time are now based on the server time zone of the DQ Web App.
Scheduler
- The Job Schedule page now has pagination.
Scorecards
- From Pulse View, you can now view missing runs, runs with 0 rows, and runs with failed scores.
Admin/Catalog
- Database connection details are now masked when non-admin users attempt to view or modify them from the Catalog page. Only users with role_admin or role_connection_manager can view connection details on this page. (ticket #94430)
API
- The /v2/getRunIdDetailsByDataset endpoint now provides the following:
- The RunIDs for a given data set.
- All completed DQ Jobs for a given data set.
Snowflake Pushdown (beta)
- You can now detect shapes that do not conform to a data field. Pushdown jobs scan all columns for shapes by default.
- You can now view Histogram and Data Preview details for the Profile activity.
Connections
- The Snowflake JDBC driver is now updated to 3.13.14.
Fixes
Rules
- Fixed an issue with the Rule Validator that resulted in missing table errors. The Validator now correctly detects columns. (ticket #93430)
DQ Job
- Fixed an issue that caused queries with joins to fail on the load activity when Full Profile Pushdown was enabled. Pushdown profiling now supports SQL joins. (ticket #92409)
- Fixed an issue that caused jobs to fail at the load activity when using a CTE query. Please note that CTE support is currently limited to Postgres connections. (tickets #88287, #89150)
- Fixed an issue that caused inconsistencies between the time zones represented in the Start Time and Update Time columns.
Agent
- Fixed the loadBalancerSourceRanges for web and spark_history services in EKS environments. (ticket #95398)
- The Helm property global.ingress.* has been removed to separate the configuration for web and spark_history. Please update the property as follows: global.web.ingress.* and global.spark_history.ingress.*
- Added support to specify the inbound CIDRs for the Ingress using the property global.web.service.loadBalancerSourceRanges. (ticket #95398)
- Though Ingress is supported as part of the Helm charts, we recommend attaching your own Ingress to the deployment if you need further customization.
- This requires a new Helm chart.
- Fixed an issue that caused Livy file estimates to fail for GCS on K8s deployments.
- Fixed an issue that caused jobs to fail for GCS on K8s deployments.
Validate Source
- The Add Column Names feature is scheduled for removal with the upcoming 2022.11 release. (ticket #96066)
- This functionality predates the ability to limit the query directly (srcq) and the addition of Update Scope.
- Use the query to edit/limit columns, and also use Update Scope.
- Fixed an issue that caused the incorrect message to display for [VALUE_THRESHOLD] when validate source was specified for a matched case. (ticket #94435)
Dupes
- The Advanced Filter is scheduled for removal from the Dupes page with the upcoming 2022.11 release. (ticket #96065)
Explorer
- Fixed an issue that caused BigQuery connections to incorrectly update the library (-lib) path when a subset of columns was selected. (ticket #96768)
Scheduler
- Fixed an issue that prevented the scheduler from running certain scheduled jobs in multi-tenancy setups. Email server information is now captured from the correct tenant. (ticket #92898)
Known Limitations
Rules
- When a data set returns 0 rows, stat rules applied to the data set are not executed. A full fix is planned for a future release; as of 2022.10, this limitation is only partially addressed.
DQ Job
- CTE query support is currently limited to Postgres connections. DB2 and MSSQL are currently unsupported.
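For reference, a CTE query here means a query that begins with a WITH clause; a minimal sketch (table and column names are illustrative) looks like this:

```sql
-- A minimal common table expression (CTE). Per the limitation above, queries
-- of this shape currently run only against Postgres connections.
WITH recent_orders AS (
    SELECT order_id, order_amount, order_date
    FROM orders
    WHERE order_date >= DATE '2022-01-01'
)
SELECT *
FROM recent_orders;
```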
Catalog
- When using the new bulk actions feature, updates to your job are not immediately visible in the UI. After you apply a rule, run a DQ Job against that data set; the newly applied rule then appears as a row on the Rules tab.
Snowflake Pushdown (beta)
- Freeform (SQLF) rules cannot use a data set name but instead must use @dataset, because Snowflake does not explicitly understand data set names.
- When using the SQL Query workflow, any subset of columns selected in your SQL query must be enclosed in double quotes to prevent the job from running indefinitely without failing.
- Min/Max precision and scale are only calculated for double data types. All other data types are currently out of scope.
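To illustrate the first limitation, a freeform (SQLF) rule on a Pushdown data set might be written as follows; the column name is hypothetical, and this assumes the usual convention that rows returned by the rule query are counted as breaks:

```sql
-- Hypothetical SQLF rule: reference the data set as @dataset rather than by
-- its name. The column name below is an example only.
SELECT *
FROM @dataset
WHERE order_amount < 0
```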
DQ Security Metrics