Release Notes
Disclaimer - Failure to upgrade to the most recent release of the Collibra Service may adversely impact the security, reliability, availability, integrity, performance or support (including Collibra’s ability to meet its service levels) of the Service. Collibra hereby disclaims all liability, express or implied, for any reduction in the security, reliability, availability, integrity, performance or support of the Service to the extent the foregoing would have been avoided had you allowed Collibra to implement the most current release of the Service when scheduled by Collibra. Further, to the extent your failure to upgrade the Service impacts the security, reliability, availability, integrity or performance of the Service for other customers or users of the Service, Collibra may suspend your access to the Service until you have upgraded to the most recent release.
2023.03
Highlights
- When your Collibra DQ license is about to expire, a counter in a warning banner on the Collibra DQ home page now counts down from 30 days until your license expires. You can update your license key at any time to avoid any disruptions.
- A message now appears in the Description field of the Jobs activity logs and the Job Exception field of the Jobs log table view when a DQ Job fails because of an expired or invalid license key.
- An error message now appears on the Findings page when a job fails because of an expired or invalid license key.
- When admin users update a license with a valid license key on the License page of the Admin Console, a success message appears.
- When admin users update a license with an invalid license key on the License page of the Admin Console, an error message appears.
- DQ Jobs will again run normally once a valid license key is saved.
Warning
Any DQ Jobs on accounts with expired or invalid license keys will no longer run until a valid license key is saved.
- To improve how Collibra DQ handles concurrent Jobs at lower-memory agent configurations, the default agent batch size for concurrent Jobs is now 5. For example, when you run more than 5 Jobs concurrently, only 5 Jobs run at once. All additional Jobs remain in Staged status until the initial 5 Jobs are complete, which allows new Jobs to continue to enter the queue on the Jobs page.
- The following are the default values in the Helm owl-agent-configmap.yaml file:
DEFAULT_BATCH_ID: <base64-encoded-string>
DEFAULT_BATCH_SIZE: 5
- The following are the default values in the Helm values.yaml and values_test.yaml files:
agent_batch_id: "default?batchSize=5"
agent_batch_size: 5
- For high concurrency requirements, it may be necessary to override the default values. Increase the DEFAULT_BATCH_ID and DEFAULT_BATCH_SIZE values and update the corresponding Helm values for the DQ Agent configuration accordingly, as shown in the sketch below.
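A minimal sketch of such an override at upgrade time, assuming the agent_batch_id and agent_batch_size Helm values shown above; the release name, chart path, and namespace are placeholders:
# Raise the concurrent Job batch size from the default of 5 to 10.
# "dq" (release), "./owldq" (chart path), and "collibra-dq" (namespace) are placeholders.
helm upgrade dq ./owldq \
  --namespace collibra-dq \
  --set agent_batch_id="default?batchSize=10" \
  --set agent_batch_size=10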
- You can now convert log outputs to JSON formatting for ingestion by a third-party log consumer.
- Email alerts can now optionally include additional fields and hyperlinks to the environments from which the alerts originate.
- Admin users can turn this option on or off from the Alerts page within the Admin Console.
Note
If you only install owlweb or owlagent, you need to copy the owlmanage.sh file from the installation package into the bin directory with the following command:
cp $PACKAGE_LOCATION/owlmanage.sh $INSTALL_PATH/bin/owlmanage.sh
New Features
Platform
- Collibra DQ now displays in-app messages about expired, expiring, and invalid licenses, including a banner message on the Collibra DQ home screen that counts down from 30 days until your license expires.
- You can update or edit your license from the License Management page in the Admin Console.
Enhancements
Jobs
- The icon for the Jobs page is now the clipboard icon.
Rules
- Export LinkIds now has the same export capabilities as Export and Export with Details.
- You can now use keyboard shortcuts on the Rule Builder page to display dataset-level stat rules.
Alerts
- Email alerts can now optionally include additional fields and hyperlinks to the environments from which the alerts originate.
Agent
- To improve how Collibra DQ handles Jobs concurrently at lower memory agent configurations, the default agent batch size for concurrent jobs is now 5.
Admin
- You can now set default values on the Add External Service Configuration form on the Assignment Queues page when you want to override the values for individual finding validation actions.
Security
- Collibra DQ now enforces a Spring CSRF token, as well as CORS and SameSite, as standards to mitigate CSRF attacks.
- Any cookie-based auth requests without this token are rejected with a 401 error.
- This feature is active by default, but can be deactivated if necessary. Deactivating this feature does not compromise the security of the Collibra DQ app because we have other mitigation standards in place for CSRF.
- All non-GET HTTP requests need to carry a CSRF token header when they call the API. This happens automatically when you use the app within Collibra DQ (see the sketch after this list).
- This implementation only impacts cookie-based authentication. Additionally, Swagger UI and APIs that use bearer token authentication are not impacted by this change.
- Server-side validation for username, first name, last name, and email is now included in the /v2/createlocaljdbcuser internal registration API.
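For direct API calls over cookie-based authentication, this means attaching the token header yourself. A hedged sketch: X-CSRF-TOKEN is Spring Security's default header name, and the host, endpoint, and token values are placeholders, not documented Collibra DQ specifics.
# Cookie-authenticated, non-GET requests must carry the CSRF token header.
curl -X POST "https://<dq-host>/v2/<some-endpoint>" \
  -b cookies.txt \
  -H "X-CSRF-TOKEN: <token-from-session>" \
  -H "Content-Type: application/json" \
  -d '{...}'
# Bearer-token requests are unaffected and need no CSRF header.
curl -X POST "https://<dq-host>/v2/<some-endpoint>" \
  -H "Authorization: Bearer <token>"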
Platform
- New User Request emails now include additional fields and hyperlinks to the environment from which the request originates for easier access.
- You can now convert log outputs to JSON formatting for ingestion by a third-party log consumer.
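Once logs are emitted as JSON, a downstream consumer can filter them with standard tooling. A minimal sketch, assuming one JSON object per line; the log file name and the level field are hypothetical, not documented names:
# Follow the web log and keep only ERROR-level entries.
# owl-web.log and the "level" field are assumptions for illustration.
tail -f /opt/owl/log/owl-web.log | jq -c 'select(.level == "ERROR")'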
Snowflake Pushdown (beta)
- You can now scan for exact and fuzzy matches as part of the Dupes check layer.
- You can now use the Correlation tab on the Profile page to understand correlation levels between numerical columns.
- Minimum and maximum string length profiling is now available.
Fixes
Explorer
- Fixed an issue where a custom Transform statement was corrupted after running a DQ Job and clicking the Edit button. (ticket #104992)
Rules
- Fixed an issue where custom rules with primary and secondary datasets that shared similar naming conventions failed because Collibra DQ's replace logic incorrectly replaced @Dataset or @secondary with the name of the dataset. (tickets #105812, 108490)
Source
- Fixed an issue that resulted in an Authentication error when the source data loaded because both the source and target used the same Kerberos TGT-based connection. (ticket #102639)
Platform
- Fixed an issue where incorrect syntax in the owldq/templates/rbac.yaml manifest caused errors for some Kubernetes deployments on AKS.
- Fixed an issue where an upgrade to Release 2023.01 caused errors, which referred to the agent_displayname column on the agent_q table when the web pods launched.
- Fixed an issue where the Flyway script V188 did not run successfully after an upgrade from Release Version 2022.08 to 2023.01. (ticket #107222)
Connections
- Fixed an issue that prevented Collibra DQ from establishing a connection to Azure Data Lake after an upgrade to Release Version 2022.11. (tickets #104271, 107451)
Outliers
- Fixed an issue where jobs that used union lookback and whose findings were marked as ignored, off-peak, or removed, were incorrectly included in the outlier calculation. (ticket #104746)
Known Limitations
Snowflake Pushdown (beta)
- Mixed datatype rules are not supported in Pushdown.
DQ Security Metrics
2023.02
Highlights
- To enhance user experience, the React MUI is now on by default across the Collibra DQ app. You can restore the legacy experience by following the steps on the React page.
- You can now use the new /v3/rules/{source}/copy API to bulk copy all rules from an existing dataset to another target dataset. With this new API, you can input a source dataset from which to copy rules, then choose one or more target datasets to which your rules copy.
- You can now leverage the Spring Framework to access controller-level logs and debug API calls.
- The new Oversized Job Report is now available by default in the Reports section. The Oversized Job Report lets you review usage analytics for jobs that use more hardware than their workloads require.
- You can now view and edit your Agent Display Name from the Agent Management screen. This lets you customize your Agent Display Name to make it easier to identify your agent throughout Collibra DQ.
- When running large DQ Spark jobs, you can now bring in on-demand Persistent Volume Claims to use as Spark scratch disk space on Kubernetes deployments.
Note
If you only install owlweb or owlagent, you need to copy the owlmanage.sh file from the installation package into the bin directory with the following command:
cp $PACKAGE_LOCATION/owlmanage.sh $INSTALL_PATH/bin/owlmanage.sh
New Features
Reports
- The new Oversized Job Report is now available by default.
Agent
- You can now view and edit your Agent Display Name from the Agent Management screen.
- When running large DQ Spark jobs, you can now bring in on-demand Persistent Volume Claims to use as Spark scratch disk space on Kubernetes deployments.
APIs
- You can now use the new /v3/rules/{source}/copy API to bulk copy all rules from an existing dataset to another target dataset (see the sketch after this list).
- You can now leverage the Spring Framework to access controller-level logs and debug API calls.
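A hedged sketch of calling the bulk copy endpoint with curl; the host, token, and the shape of the request body (a list of target datasets) are assumptions for illustration, not the documented contract:
# Copy all rules from dataset "orders_src" to two target datasets.
# The "targetDatasets" body field is an assumed name, not documented here.
curl -X POST "https://<dq-host>/v3/rules/orders_src/copy" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"targetDatasets": ["orders_tgt_1", "orders_tgt_2"]}'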
Enhancements
Rules
- Native rules with invalid SQL queries now display an exception message instead of incorrectly displaying a passing status.
- You can now drill into a pulse view chart on the $daysWithoutData stat rules on the Findings page.
Platform
- React MUI is now on by default.
- To ensure proper compliance with Collibra's security protocol, new installations now require you to set up a password that meets Collibra's password policy and a default email address associated with the admin user account.
- Instructions for providing these credentials are interactive as part of the setup script for Standalone installs.
- For new installs in non-interactive mode, such as automated deployment, export the following variables, which are required for the script to execute correctly (see the sketch after this list):
export DQ_ADMIN_USER_PASSWORD
export DQ_ADMIN_USER_EMAIL
- For Cloud native installs, you must use the following Helm variables to set the password and default admin email:
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
- For multi-tenant instances, admin users are created via the UI, not with this process.
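Putting the two non-interactive paths together, a minimal sketch; the script invocation, release name, chart path, and credential values are placeholders, while the variable names and Helm keys are the ones listed above:
# Standalone, non-interactive install: export credentials before running setup.
export DQ_ADMIN_USER_PASSWORD='<password-meeting-policy>'
export DQ_ADMIN_USER_EMAIL='admin@example.com'
./setup.sh
# Cloud native install: pass the same credentials through Helm.
helm install dq ./owldq \
  --set global.web.admin.email="admin@example.com" \
  --set global.web.admin.password='<password-meeting-policy>'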
Security
- To ensure proper compliance with Collibra's security protocol, all deployments of Collibra DQ on Kubernetes now use a read-only root filesystem.
- All ephemeral storage directories, such as /opt/owl/log, /tmp/scratch, and /opt/owl/cache, are still writable for Collibra DQ to function correctly.
- The read-only root filesystem is enforced on all long-running pods except for Spark jobs and Livy sessions.
- Ensure that you pick the Helm charts that correspond to this release for a successful upgrade.
- Any customizations must be ported to the corresponding Helm charts. For additional assistance, please contact Collibra Professional Services.
- You can no longer upload custom JDBC drivers under the /opt/owl/drivers folder. Refer to the documentation for more details on how to configure custom JDBC drivers.
Note Standalone deployments of Collibra DQ are not affected by these updates.
Snowflake Pushdown (beta)
- You can now scan for categorical outliers.
- You can now apply weighting and boundaries as part of the Outlier layer.
Fixes
Rules
- Fixed an issue where regex rules, which worked in previous DQ versions, appeared as exceptions after upgrading to the 2022.09 version. (ticket #102615)
Dupes
- Fixed an issue that prevented jobs from running with Dupes check switched on for tables with only one column. (ticket #104585)
Platform
- Fixed an issue with the print statement in the stage 3 logs where an excessive number of log records were written due to too many observations. (ticket #104081)
Security
- Fixed an issue with role mapping where invalid values for LDAP_GROUP_RESULT_DN_ATTRIBUTE and LDAP_GROUP_RESULT_CONTAINER_BASE, set to the base OU of the AD groups, caused the role mapping to fail.
Connections
- Fixed an issue with connections using Kerberos authentication that prevented jobs from running because of IT restrictions around writing to Openshift containers. (ticket #102964)
- Fixed an issue with Databricks cluster connections to allow PWD={} input in the connection string URL. (ticket #104845)
  - You are still required to enter PWD={your-text-here} in the URL.
  - This fix only applies to Databricks cluster connections, not to Databricks connections that use SQL endpoints.
- The logging of entire ServiceNow response messages has been added to help identify errors.
Agent
- Fixed an issue where associating or disassociating connections from a remote Agent resulted in a PreparedStatementCallback error, which prevented connections from being deleted. (tickets #102322, 102763)
Snowflake Pushdown (beta)
- Fixed an issue with Min and Max on the Profile page that caused strings with numerical values to display invalid results. (ticket #105935)
Known Limitations
Rules
- The $daysWithoutData stat rule has a limitation where rules that are not named "daysWithoutData" do not display their Pulse View charts when you drill into the rule on the Findings page. A workaround is to name your stat rule exactly daysWithoutData.
Platform
- The dollar sign ($) is a special character that must be escaped with a backslash (\) when running the setup script or Helm via the command line.
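For example (a sketch; the password value is a placeholder):
# Without the backslashes, the shell would expand the dollar signs.
export DQ_ADMIN_USER_PASSWORD=pa\$\$word
# The same escaping applies to Helm arguments:
--set global.web.admin.password=pa\$\$word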
DQ Security Metrics
The following image shows a chart of Collibra DQ security vulnerabilities arranged by release version.
The following image shows a table of Collibra DQ security metrics arranged by release version.
2023.01
New Features
Admin
- Admin users can now view and modify the DQ license key and license name from the new License page.
Platform
- You can now deploy new Helm charts on new and existing releases of Collibra DQ Cloud Native deployments without encountering character size limitations.
Reports
- Four new reports are now available by default:
- The Missing Jobs Report shows jobs that were expected to run but didn't run as scheduled.
- The Hardware Usage Report shows the datasets that required the most total cores to run and more general hardware usage statistics.
- The Observability Score Roll-Up Report shows the aggregated scores of all AdaptiveRules (all datasets + all columns) and averages passing and breaking for all columns over 30 days.
- The Rules Passing Fraction Roll-Up Report shows all the passing rows and total rows scanned for user-defined rules aggregated by dimensions over the past 30 days.
Enhancements
Rules
- The Export LinkIds button is now displayed within the Rules tab of the Findings page. This button was previously only available under the Breaks tab on the Rule Builder page.
- You can enable LinkIds from the Scope workflow on the Explorer page to export LinkIds for rule break records.
- The Copy Rules API (/v3/rules/copy) now has the following enhancements:
- Copied rules now copy to a new dataset correctly regardless of their rule type.
- Security logs for every rule copy request are now available for admin users to review in the Audit Trail section of the Admin Console.
Explorer
- When creating a DQ Job for a table with no rows, the columns are now shown in the Scope section.
Platform
- A new property, LOCAL_REGISTRATION_ENABLED, in the owl-env.sh script and K8s config map is now available to display or hide the registration link on the Sign in page for local users.
  - For owl-env.sh:
    - The command export LOCAL_REGISTRATION_ENABLED=true allows the registration link to display on the Sign in page.
    - The command export LOCAL_REGISTRATION_ENABLED=false hides the registration link from the Sign in page.
  - For K8s:
    - The configuration LOCAL_REGISTRATION_ENABLED:"true" allows the registration link to display on the Sign in page.
    - The configuration LOCAL_REGISTRATION_ENABLED:"false" hides the registration link from the Sign in page.
  Note Because the registration link is visible by default, this property is also set to true by default.
Connections
- SPARK322 and SPARK320 are now shipped with a Spark JDBC connection provider for Standalone deployments.
- When you bring Collibra DQ jars into Databricks, you are now required to set the property spark.sql.sources.disabledJdbcConnProviderList='basic,oracle,mssql' at either the Spark cluster level or the SparkSession level before using Collibra DQ's set of functions for Spark profiles 3.2.1 and onwards.
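For example, at the cluster level this is one line in the cluster's Spark config (a sketch of the standard Databricks key-value format, not Collibra-documented syntax):
# Databricks cluster Spark config (Advanced options > Spark):
spark.sql.sources.disabledJdbcConnProviderList basic,oracle,mssql
# Equivalent flag when launching a Spark session yourself:
--conf spark.sql.sources.disabledJdbcConnProviderList=basic,oracle,mssql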
Admin
- You can now sort the Date column on the Usage page. Dates now appear in descending order by default.
Fixes
Rules
- Fixed an issue that prevented the descriptions of saved rules from being edited when the rule name contained a greater than symbol after a single quote. (ticket #100114)
- Fixed an issue with rule builder validation that caused a rule syntax exception message to be thrown. (tickets #99735, 101165)
- Fixed an issue where Freeform SQL rules with complex conditions (multiple rlike strings) resulted in an exception message. (ticket #100116)
DQ Job
- Fixed an issue where behavioral observations made for a dataset did not subtract points from the data quality score. (ticket #98539)
Alerts
- Fixed an issue that limited the ability to edit or delete alerts with names containing apostrophes from the UI. (ticket #98864)
Outliers
- Fixed an issue where recalibrating an outlier would bulk apply downtrain labeling to it. (ticket #100085)
Reports
- Fixed an issue where Completeness Reports were not generated when the Custom Range filter was used. (ticket #99786)
DQ Connector
- Fixed an issue with the Collibra DQ - Collibra Data Intelligence Cloud integration that prevented Rules and Charts from importing. (ticket #104872)
Admin
- Fixed an issue with time-based data retention when using linkId that caused too many break records to be stored in the metastore. (tickets #99072, 102900)
Known Limitations
Rules
- The new Export LinkIds button generates a CSV file that can only be viewed in a spreadsheet program, such as Excel.
  - A workaround is to save/export the CSV file from the spreadsheet program in order to allow viewing in general text editors.
DQ Job
- Remote file jobs with headers containing white spaces fail with a requirement failed exception message.
  - A workaround is to edit the DQ Job command line in the Run CMD tab and place single quotes around the column name in -q and double quotes around the entire -header flag.
DQ Security Metrics
Hotfixes
Collibra Data Quality & Observability 2023.01.1
- Fixed an issue with Databricks cluster connections to allow PWD={} input in the connection string URL. (ticket #104845)
  - You are still required to enter PWD={your-text-here} in the URL.
  - This fix only applies to Databricks cluster connections, not to Databricks connections that use SQL endpoints.
2022.12
New Features
Explorer
- The types of queries that can run from the View Data page are now restricted to read-only queries.
APIs
- You can now copy SQLG- and SQLF-type rules from an existing dataset to another existing dataset with the /v3/rules/copy API call.
Connections
- You can now create a MongoDB connection with a CDATA driver.
Snowflake Pushdown (beta)
- You can now detect outliers when running a Pushdown job.
Enhancements
DQ Job
- All tables on the Jobs page now include pagination, dropdown filters, and the ability to export.
Rules
- Rules associated with datasets with zero rows now execute successfully.
- Stat rule evaluation on secondary datasets is now supported for SQLF rules.
Profile
- You can now view run execution details and stale data by toggling the box chart on the findings page.
APIs
- The getRecords notebook API function is now updated and the getGeneric query is renamed getDupesPreview.
- You can now obtain rules from a dataset and reassign them to another dataset with the following Databricks notebook API functions:
def addRules(rules: List[Rule], dataset: String): Owl
def getRulesDfByDataset(dataset: String): DataFrame
def getRulesByDataset(dataset: String): List[Rule]
def getRuleNamesByDataset(dataset: String): DataFrame
Platform
- TechPreview (TP) labels are now removed from the UI.
Connections
Warning As of September 2022, Databricks JDBC driver version 2.6.27 is packaged as part of both standalone and Kubernetes download packages. The Databricks Simba driver (version 2.6.22) is no longer packaged for Kubernetes. As a result of this change, the Databricks connection template has changed, and any existing connection using the old driver (2.6.22) must be updated. For more information on updating your drivers, refer to Standalone Upgrade.
- The Databricks SQL endpoint is now supported for JDBC connections.
- 3DES and DES encryption cipher for Kerberos authentication types are no longer supported because of recent Red Hat OS (RHEL 8.7) cipher deprecation.
Fixes
Explorer
- Fixed an issue with the -rdEnd variable in the command line that caused the variable in the query to be improperly escaped. (ticket #98702)
Profile
- Fixed an issue where the confidence score (Conf) displayed values greater than the threshold of 100. (ticket #99636)
- Fixed an issue where HTML in data fields was rendered on the Data Preview section of the findings page. (ticket #97883)
- Fixed an issue with Data Preview that resulted in an OOM error when the data_preview table contained a large number of records.
Rules
- Fixed an issue where values on the Rules tab did not correctly display in scientific notation format. (ticket #89738)
- Fixed an issue where @dataset for the primary dataset was not supported when using a secondary dataset.
Scorecards
- Profile is now removed from the DQ Scorecards submenu.
Security
- Fixed an issue with LDAP external group-to-role mappings where the lack of a fully qualified path for the LDAP group caused malformed API calls and prevented the mapping from saving properly.
- Fixed an issue with the Dataset Security feature. (ticket #100317)
  - When the following security settings are configured, the system fully restricts access to the findings page for admin users:
    - Dataset security is turned on.
    - Default owner access is unchecked.
    - The dataset belongs to no roles, or to no roles to which the user has access.
APIs
- Fixed an issue with the /v2/gethints endpoint that prevented the Hints table from displaying correctly on the findings page. (ticket #98941)
- Fixed an issue with the getRecords and getGenerics APIs that prevented any information from being returned. (ticket #98820)
Alerts
- Fixed an issue with the SQL on the Alert Notifications page that prevented data from appearing and instead displayed a DataTables error message.
Agent
- Fixed an issue with GKP deployments where job scans failed because the driver pod could not create connections to the metastore. (ticket #102175)
Validate Source
- Fixed an issue where the Source to Target scorecard incorrectly displayed a mismatch because of an unexpected column type checked during a schema order check. (ticket #98300)
Connections
- Resolved connection issues in certain cases by upgrading the Athena driver to version 2.0.33. (ticket #100340)
- Fixed an issue where HDFS connections could not rerun a job successfully because certain parameters were automatically appended to the Free Form (Appended) field of the Agent configuration. (ticket #95810)
- Fixed an issue with Dremio connection timeouts on Kubernetes deployments. (ticket #101221)
- To prevent Dremio connection issues, set the following value in the Free Form (Appended) field of the Agent configuration:
-conf spark.driver.extraJavaOptions=-Dcdjd.io.netty.tryReflectionSetAccessible=true
- To prevent Dremio connection issues, set the following value in the Free Form (Appended) field of the Agent configuration:
Known Limitations
Rules
- Freeform rules with fully qualified column names are currently unsupported when they use the following syntax:
select <column name> FROM @<dataset name> WHERE @<dataset name>.<column name> condition
- A workaround to this limitation is to use aliasing instead, for example: select t.<column name> FROM @<dataset name> t WHERE t.<column name> condition
APIs
- When using the new /v3/rules/copy API, the copied rule automatically appends "copied" to the rule name. After copying a rule, you may need to manually update the rule name.
- If a rule is copied to a target dataset whose columns are not compatible, you need to manually update the rule to ensure the columns are compatible across datasets.
- Dataset Security is not enforced when using the /v3/rules/copy API.
DQ Security Metrics
Hotfixes
Collibra Data Quality & Observability 2022.12.3
- Fixed an issue with Databricks cluster connections to allow PWD={} input in the connection string URL. (ticket #104845)
  - You are still required to enter PWD={your-text-here} in the URL.
  - This fix only applies to Databricks cluster connections, not to Databricks connections that use SQL endpoints.
2022.11
Warning
The MS SQL driver that comes with JDK11 standalone packages does not currently work in the JDK11 environment. MSSQL requires a separate JAR for JDK11. Please contact your Customer Success Manager for the compatible driver.
Dremio is not currently supported for JDK11 standalone packages. If you plan to run JDK11, add -Dcdjd.io.netty.tryReflectionSetAccessible=true to owlmanage.sh as a JVM option for your Web and Spark instances. Please contact your Customer Success Manager for assistance.
Dremio jobs currently fail on both K8s and standalone JDK11 deployments. Add the following config to the Free Form (Appended) field of the Agent Configuration template: -conf spark.driver.extraJavaOptions=-Dcdjd.io.netty.tryReflectionSetAccessible=true
As of October 18, 2022, all images for the 2022.10 release have a Critical CVE (CVE-2022-42889). If you picked up the 2022.10 release before October 18, 2022, there should be no issue with your scans. If issues persist, please contact your Customer Success Manager for a new build.
Note
After you complete an upgrade or a new installation of Collibra DQ, you are now required to enter a license name by following a one-time prompt on the login page, by entering the LICENSE_NAME environment variable in the environment variable file (owl-env.sh), or by entering the global.configMap.data.license_name Helm chart variable. Your license name is the value after YOUR NAME IS = found in the license provision email sent to you by Collibra. Customers who do not have this information because they were issued a license before March 2022 should input license information in the following format:
For a single instance: <yourcompanyname>
For multiple instances: <yourcompanyname>-dev, <yourcompanyname>-test, <yourcompanyname>-prod
No spaces or special characters are permitted except for hyphens (-).
Note Before you execute the setup.sh script, update the SPARK_PACKAGE variable to the desired Spark package, for example spark-3.2.2-bin-hadoop3.2.tgz. You can check the ../packages folder for the Spark package that was previously downloaded. You do not need to update the setup.sh script if you use the following export statement:
export SPARK_PACKAGE="spark-3.2.2-bin-hadoop3.2.tgz"
New Features
Platform
- The following pages now support the new React MUI:
- Scorecards
- List View
- Assignments
- Pulse View
- Alerts
Note React is turned off by default for the 2022.11 release. If you would like to try the new React pages, you can toggle it on from the Admin Console, or contact your Customer Success Manager for assistance.
DQ Job
- You can now terminate jobs from the Jobs page if they are in progress, incorrectly submitted, or stuck in Staged status. When you terminate a job, two alerts are generated.
- Jobs in the Spark UI display Finished statuses, even though they are terminated from the DQ UI.
Alerts
- You can now generate alerts for the following stale data stat rules:
$daysWithoutData
$runsWithoutData
$daysSinceLastRun
- You can now generate alerts for jobs stuck in Staged status for more than one hour.
Admin
- You can now configure LDAP for user access in multi-tenant environments.
Connections
- You can now use key-pair authentication for Snowflake connections.
- When you append to the Connection URL string, your entry must be comma separated.
- When you manually modify the Driver Properties field, your entry must be semicolon separated.
- CDATA connections are now supported in standalone deployments.
- CDATA drivers are now included in the release package.
Cloud Storage
- Azure Blob Storage is now a supported target storage system.
Snowflake Pushdown (beta)
- Schema Change monitoring from the AdaptiveRules tab is now enabled by default.
- Schema is now separated from basic profiling.
- The new DatasetDefDTO API now returns Pushdown information.
- Dataset security checks are now implemented for Pushdown jobs.
Enhancements
Explorer
- The Job Estimate dialogue now has improved guidance on executors and cores, and now indicates when the maximum cores, executors, or memory is reached.
DQ Job
- Job schedule time zone is now a read-only field and can no longer be configured. Existing scheduled jobs reflect their current settings, but all other scheduled jobs are now based on the time zone of the DQ server (UTC). (tickets #88797, 89736, 92611, 95231)
Dupes
- A new warning message now displays when increasing the duplicate check limit from the UI. (ticket #95604)
Security
- Kubernetes service accounts associated with AWS IAM pod roles for controlling access to AWS services for cloud native DQ deployments on AWS EKS are now supported.
- When DATASET SECURITY is enabled, DATASET ACCESS is now required to edit, map, or retrieve datasets or business units. (ticket #92934)
Fixes
Rules
- Fixed an issue that prevented freeform rules containing double backslashes from saving. (tickets #96636, 96640)
- Fixed an issue that caused rules containing open brackets ([) to display break records incorrectly. (ticket #94399)
- Fixed an issue that caused rules containing regex to throw out of range exceptions. (ticket #98435)
DQ Job
- Fixed an issue where run time was not displayed on the findings page because the run_id column type in the metastore did not include a time zone. (ticket #96050)
- Fixed an issue that caused Parquet files to fail during the LOAD activity. (ticket #96191)
- Other NFS file types, including ORC, CSV, and Avro, also run successfully.
Alerts
- Fixed an issue where saving batch names with spaces between delimiters incorrectly caused a validation error. (ticket #97028)
Validate Source
- The Add Column Names feature is now removed from the Source tab. (ticket #96066)
- Instead, use the query to edit/limit columns or use Update Scope.
- Fixed an issue where disabling source check on a cloned dataset resulted in an error. You can now disable source validation on cloned datasets. (ticket #97795)
Dupes
- The Advanced Filter is now hidden from the Dupes tab. (ticket #96065)
Shapes
- Fixed an issue when editing a dataset that reverted the Shape Detection setting (Off, Auto, or Manual) applied when it was created. (tickets #95471, 95473)
Schema
- Fixed an issue with schema detection on files where schema detection was performed on all columns when a subset of columns was selected. (ticket #92476)
  - Use the headercheckoff flag when it is necessary to see only when columns are added or dropped.
- Fixed an issue where schema changes were not correctly identified and updated. (ticket #96013)
Behavior
- Fixed an issue with behavior lookback (-bhlb) that caused Row Count changes to be misrepresented. (ticket #94840)
Connections
- Azure Blob connections in standalone environments require the following jars to be added to the $SPARK_HOME/jars folder:
  - hadoop-azure-3.2.0.jar
  - wildfly-openssl-1.1.3.Final.jar
API
- Fixed an issue with the DB import process to ensure JobSchedule records import without error. (ticket #98405)
Known Limitations
DQ Job
- Job termination is not supported for jobs in Unknown status.
Validate Source
- When you clone a dataset and then save, enable, or disable the Source tab, the action is associated with the original dataset name and appears to fail on the screen when an update is made, but the actual job run is not affected.
Connections
- When adding driver properties using the +Add Property option for Snowflake connections, semicolons are incorrectly appended to key values. Instead, use comma format to separate key values.
DQ Security Metrics
2022.10
New Features
Warning For the Collibra Data Quality 2022.10 release, all Docker images run on JDK11. Standalone packages contain JDK8 and JDK11 options. If you are an existing customer who requires JDK11, please upgrade your runtime before upgrading to 2022.10. Most Hadoop environment versions (EMR/HDP/CDH) still run on JDK8, so customers using these environments can upgrade with the JDK8 packages. If you prefer to upgrade to JDK11, you must follow the documentation of your respective Hadoop environment to upgrade to JDK11 before deploying the 2022.10 release.
The MS SQL driver that comes with JDK11 standalone packages does not currently work in the JDK11 environment. MSSQL requires a separate JAR for JDK11. Please contact your Customer Success Manager for the compatible driver.
Dremio is not currently supported for JDK11 standalone packages. If you plan to run JDK11, add -Dcdjd.io.netty.tryReflectionSetAccessible=true to owlmanage.sh as a JVM option for your Web and Spark instances. Please contact your Customer Success Manager for assistance.
As of October 18, 2022, all images for the 2022.10 release have a Critical CVE (CVE-2022-42889). If you picked up the 2022.10 release before October 18, 2022, there should be no issue with your scans. If issues persist, please contact your Customer Success Manager for a new build.
Rules
- You can now define a rule to detect the number of days a job runs without data by using $daysWithoutData.
- You can now define a rule to detect the number of days a job runs with 0 rows by using $runsWithoutData.
- You can now define a rule to detect the number of days since a job last ran by using $daysSinceLastRun.
Profile
- You can now use a string length feature by toggling the Profile String Length checkbox when you create a data set.
- When Profile String Length is checked, the min/max length of a string column is saved to the dataset_field table.
Validate Source
- You can now write rules against a loaded source data frame when -postclearcache is configured in the agent.
Note The DQ UI will be converted to the React MUI framework with the 2022.11 release. Prior to the 2022.11 release, you can turn the React flag on, but note that some features may be temporarily limited.
Enhancements
DQ Job
- Start Time and Update Time are now based on the server time zone of the DQ Web App.
Scheduler
- The Job Schedule page now has pagination.
Scorecards
- From Pulse View, you can now view missing runs, runs with 0 rows, and runs with failed scores.
Admin/Catalog
- Connection details are now masked when non-admin users attempt to view or modify database connection details from the Catalog page. Only users with role_admin or role_connection_manager have the ability to view connection details on this page. (ticket #94430)
API
- The /v2/getRunIdDetailsByDataset endpoint now provides the following:
- The RunIDs for a given data set.
- All completed DQ Jobs for a given data set.
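A hedged sketch of calling this endpoint; the host, token, and query parameter name are assumptions for illustration:
# Fetch the RunIDs and completed DQ Jobs for a data set ("dataset" parameter name assumed).
curl -H "Authorization: Bearer <token>" \
  "https://<dq-host>/v2/getRunIdDetailsByDataset?dataset=<dataset-name>"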
Snowflake Pushdown (beta)
- You can now detect shapes that do not conform to a data field. Pushdown jobs scan all columns for shapes by default.
- You can now view Histogram and Data Preview details for the Profile activity.
Connections
- The Snowflake JDBC driver is now updated to 3.13.14.
Fixes
Rules
- Fixed an issue with the Rule Validator that resulted in missing table errors. The Validator now correctly detects columns. (ticket #93430)
DQ Job
- Fixed an issue that caused queries with joins to fail on the load activity when Full Profile Pushdown was enabled. Pushdown profiling now supports SQL joins. (ticket #92409)
- Fixed an issue that caused jobs to fail at the load activity when using a CTE query. Please note that CTE support is currently limited to Postgres connections. (tickets #88287, 89150)
- Fixed an issue that caused inconsistencies between the time zones represented in the Start Time and Update Time columns.
Agent
- Fixed the loadBalancerSourceRanges for web and spark_history services in EKS environments. (ticket #95398)
  - The helm property global.ingress.* has been removed to separate the config for web and spark_history. Please update the property as follows: global.web.ingress.* and global.spark_history.ingress.*
- Added support to specify the inbound CIDRs for the Ingress using the property global.web.service.loadBalancerSourceRanges. (ticket #95398)
  - Though Ingress is supported as part of Helm charts, we recommend attaching your own Ingress to the deployment if you need further customization.
  - This requires a new Helm chart.
- Fixed an issue that caused Livy file estimates to fail for GCS on K8s deployments.
- Fixed an issue that caused jobs to fail for GCS on K8s deployments.
Validate Source
- The Add Column Names feature is scheduled for removal with the upcoming 2022.11 release. (ticket #96066)
  - This was previous functionality from before you could limit the query directly (srcq) and before Update Scope was added.
  - Use the query to edit/limit columns and also use Update Scope.
- Fixed an issue that caused the incorrect message to display for [VALUE_THRESHOLD] when validate source was specified for a matched case. (ticket #94435)
Dupes
- The Advanced Filter is scheduled for removal from the Dupes page with the upcoming 2022.11 release. (ticket #96065)
Explorer
- Fixed an issue that caused BigQuery connections to incorrectly update the library (-lib) path when a subset of columns was selected. (ticket #96768)
Scheduler
- Fixed an issue that prevented the scheduler from running certain scheduled jobs in multi-tenancy setups. Email server information is now captured from the correct tenant. (ticket #92898)
Known Limitations
Rules
- When a data set has 0 rows returned, stat rules applied to the data set are not executed. While a full fix is planned for a future release, this limitation is only partially fixed as of 2022.10.
DQ Job
- CTE query support is currently limited to Postgres connections. DB2 and MSSQL are currently unsupported.
Catalog
- When using the new bulk actions feature, updates to your job are not immediately visible in the UI. Once you apply a rule, run a DQ Job against that data set. From the Rules tab, a row with the newly applied rule is visible.
Snowflake Pushdown (beta)
- Freeform (SQLF) rules cannot use a data set name but instead must use @dataset because Snowflake does not explicitly understand data set names.
- When using the SQL Query workflow, a subset of columns selected in your SQL query must be enclosed in double quotes to prevent the job from running infinitely without failing.
- Min/Max precision and scale are only calculated for double data types. All other data types are currently out of scope.
DQ Security Metrics