Release 2023.09
Release Information
- Expected release date of Collibra Data Quality & Observability 2023.09: October 8, 2023
- Publication dates:
- Release notes: September 24, 2023
- Documentation Center: September 29, 2023
Highlights
- Pushdown
We're delighted to announce that Pushdown processing for Amazon Athena and Redshift is now available in public beta! Pushdown is an alternative compute method for running DQ jobs, where Collibra DQ submits all of the job's processing directly to a SQL data warehouse, such as Athena or Redshift. When all of your data resides in Athena or Redshift, Pushdown reduces the amount of data transfer, eliminates egress latency, and removes the Spark compute requirements of a DQ job.
- Job Estimator
Collibra DQ utilizes Spark's ability to break large datasets into smaller, more manageable segments called partitions. When you run large Pullup jobs, you can now leverage the job estimator to automatically calculate and update the number of partition columns required to optimally run and write rules against them. Previously, the only way to know when a job required the scaling of resources was when it failed.
Important
We have migrated our code to a new repository. As a result, the Collibra DQ jar files referenced in owl-env.sh are no longer prefixed with owl-*; they are now prefixed with dq-*. For more details, review the Migration Updates section below.
Migration Updates
We have migrated our code to a new repository to improve our internal procedures and security. Because the jar files referenced in owl-env.sh are now prefixed with dq-* instead of owl-*, any automation you use to upgrade Collibra DQ versions must handle both naming patterns. You can use the regular expression owl-.*-202.*-SPARK.*\.jar|dq-.*-202.*-SPARK.*\.jar to match and update the jar references.
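A minimal sketch of how this pattern could be used, assuming a standalone install under /home/owldq/owl/bin (the path is an assumption; adjust it to your environment):
# List the jars in the Collibra DQ bin folder that match either the old owl-*
# or the new dq-* naming pattern, so an upgrade script can locate the files to replace.
ls /home/owldq/owl/bin | grep -E 'owl-.*-202.*-SPARK.*\.jar|dq-.*-202.*-SPARK.*\.jar'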
Additionally, please note the following:
- Standalone Upgrade Steps: When upgrading to Collibra DQ 2023.09 on Spark Standalone, the upgrade steps have changed.
- Open a terminal session.
- Move the old jars from the owl/bin folder with the following commands.
mv owl-webapp-<oldversion>-<spark301>.jar /tmp
mv owl-agent-<oldversion>-<spark301>.jar /tmp
mv owl-core-<oldversion>-<spark301>.jar /tmp
- Copy the new jars from the extracted package into the owl/bin folder.
mv dq-webapp-<newversion>-<spark301>.jar /home/owldq/owl/bin
mv dq-agent-<newversion>-<spark301>.jar /home/owldq/owl/bin
mv dq-core-<newversion>-<spark301>.jar /home/owldq/owl/bin
- Copy the latest owlcheck and owlmanage.sh to the /opt/owl/bin directory.
Tip You may also need to run chmod +x owlcheck owlmanage.sh to add execute permission to owlcheck and owlmanage.sh.
- Start the Collibra DQ Web application.
./owlmanage.sh start=owlweb
- Start the Collibra DQ Agent.
./owlmanage.sh start=owlagent
- Validate the number of active services.
ps -ef | grep owl
Enhancements
Capabilities
- When running rules that reference secondary datasets, you now have the option to use serial rule processing to reduce operational costs.
- Set -serialrule to true to leverage the Spark cache for the secondary dataset (a hedged command-line sketch follows this list).
- When authenticating your connection to CockroachDB with a PostgreSQL driver, you can now leverage Kerberos TGT without errors.
- When creating a DQ job to run against a remote file data source, you can now select BEL as a delimiter.
- When adding a name to a rule on the Rule Workbench, a helpful message displays if you use an invalid special character.
- Rule names can only contain alphanumeric characters, underscores, and hyphens.
- When reviewing Rules findings, the default number of rows available to preview is now 6. Previously, the Rules tab only displayed 5 preview rows.
- When creating a Pullup job from Explorer, the Mapping step now automatically maps source columns to target columns.
- We've updated the connection icons on the Explorer, Pulse View, and Admin Connections pages.
- When you add a new connection from the Admin Connections page, the icon will also update accordingly.
- When monitoring the Jobs page with the React-based Beta UI enabled, you can now right-click to open a dataset in a new tab.
- When assigning or validating a finding to an external user whose first name, last name, and external user ID cannot be found or do not exist, you can now set a backup display name in the ConfigMap to ensure you can still validate or assign that finding to the external user.
- Set SAML_USE_EXTERNAL_USER_ID_FOR_DISPLAY to true (a hedged configuration sketch follows this list).
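For the serial rule processing option above, a minimal sketch of how the flag might be passed on a job's command line; the dataset name, run date, and query are placeholders, and appending -serialrule true to an owlcheck command is an assumption:
# Hypothetical example: enable serial rule processing so that rules
# referencing a secondary dataset reuse the Spark cache.
# Dataset name, run date, and query are placeholders; the way the
# -serialrule flag is passed is an assumption for illustration.
./owlcheck -ds my_dataset -rd "2023-09-01" \
  -q "select * from my_table" \
  -serialrule true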
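For the backup display name setting above, a minimal sketch assuming a standalone install where environment variables are exported in owl-env.sh; on Kubernetes deployments, add the same key to the web ConfigMap instead:
# Assumption: environment variables are set in owl-env.sh (standalone) or in
# the web ConfigMap (Kubernetes). Enables the backup display name for
# external users whose details cannot be resolved.
export SAML_USE_EXTERNAL_USER_ID_FOR_DISPLAY=true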
Platform
- When deleting a user, the user is now removed from both the user and user_profile metastore tables.
- When loading a large remote file into Explorer, a progress bar now tracks its loading status.
DQ Integration
- When using the configuration wizard in Collibra DQ to set up an integration, your Collibra Platform credentials are now encrypted in the metastore to ensure that your information is always secure.
DQ Cloud
- We've introduced a new endpoint to retrieve aggregated write-ahead log (WAL) statistics.
- When deploying a new Edge site, the TenantAlignmentService no longer stops checking for new tenants in DQ Cloud after 100 attempts.
Pushdown
- When using Archive Break Records for Databricks Pushdown, the 'seqno' column for all break records tables created in Databricks is no longer designated as an identity column. Instead, its default value is now NULL. We've made this adjustment because Databricks does not support concurrent transactions for Delta tables with identity columns.
- If you already created these tables in your Databricks environment, delete them and allow the Collibra DQ application to re-create them for you to ensure compatibility with the latest changes. To do this, run the following SQL commands on the Databricks target schema dedicated to maintaining records of source breaks:
drop table collibra_dq_outliers
drop table collibra_dq_duplicates
drop table collibra_dq_rules
drop table collibra_dq_breaks
drop table collibra_dq_shapes
- After you run a DQ job, the tables will be re-created on your Databricks schema.
- We've improved memory usage to prevent large quantities of rule break records from causing out-of-memory errors.
- When running a Pushdown job, the entire allocated connection pool is now used to achieve the maximum allowed parallelism, which lets profiling run in parallel with other layers and reduces job latency.
- Only the required number of connection threads is used for each activity.
- When creating rules to run against Pushdown datasets, you can now use cross-join queries.
- We've added a Pendo tracking event to track the number of Pushdown jobs and columns in an environment.
Fixes
Capabilities
- When editing DQ jobs for KDB (PostgreSQL) connections, you can now successfully execute a query with a large number of records. (ticket #113493, #116740)
- When creating a BigQuery job, you can now create a dataset for a destination table without throwing an error. (ticket #118534, #122761)
- When archiving break records from Pullup jobs, you can again write break records to S3 storage buckets. Previously, an invalid rule error returned which stated "Exception while inserting break records into S3: No FileSystem for scheme s3". (ticket #121509)
- When you open the Oversized Job Report, you can again see the reports without any errors. (ticket #121752)
Platform
- When reviewing the configuration after running a Validate Source job, you no longer receive a validation error due to lost database, schema, table, field, and query information. (ticket #113977)
- Oracle dataset host strings no longer parse incorrectly. Previously, Oracle dataset host strings were parsed as "jdbc" instead of displaying the correct host string. To see the updated and correct host string for Oracle datasets, rerun your jobs manually via the scheduler or API. (ticket #124846)
DQ Integration
- When completing the connection mapping for your Collibra DQ to Collibra Platform integration, database views from Collibra DQ now correctly map to the tables and columns to which they relate in Collibra Platform. (ticket #124191, #124213, #125676)
DQ Cloud
- When upgrading to Collibra DQ version 2023.06, you can now see entries in your List View scorecards. Previously, there was a discrepancy between Edge and the Cloud metastore. (ticket #121624)
Pushdown
- When running a Pushdown job with the /v3/jobs/run API, the username now correctly updates to the authenticated user (a hedged request sketch follows this list). (ticket #121192)
- When upgrading to Collibra DQ version 2023.07.2, you can now see the Data Preview for breaking record count for a freeform SQL rule against a Snowflake Pushdown dataset. (ticket #122585)
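For reference, a hedged sketch of triggering a job through the /v3/jobs/run endpoint with curl; the host, token, HTTP method, and query parameter names are illustrative assumptions and may differ in your environment:
# Hedged sketch: the host, authentication token, HTTP method, and query
# parameter names (dataset, runDate) are assumptions for illustration;
# consult the API documentation for the exact contract.
curl -X POST "https://<your-dq-host>/v3/jobs/run?dataset=<dataset_name>&runDate=2023-09-01" \
  -H "Authorization: Bearer <your_api_token>"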
Known Limitations
Capabilities
- There is a limitation with Validate Source where source columns containing white spaces do not map properly to the target columns.
- A workaround is to remove the white spaces from the command line and then copy/paste the command line into a new DQ job.
- When using the Pulse View page after adding a new connection, there is a limitation where the icon of the connection does not automatically appear on the Pulse View page. Instead, it appears as a generic JDBC icon.
DQ Security Metrics
Note The medium, high, and critical vulnerabilities of the DQ Connector are now resolved.
Warning We found 1 critical and 1 high CVE in our JFrog scan. Upon investigation, these CVEs are disputed by Red Hat and no fix is available. For more information, see the official statements from Red Hat:
https://access.redhat.com/security/cve/cve-2023-0687 (Critical)
https://access.redhat.com/security/cve/cve-2023-27534 (High)
The following image shows a chart of Collibra DQ security vulnerabilities arranged by release version.
The following image shows a table of Collibra DQ security metrics arranged by release version.
Beta UI
Beta UI Status
The following table lists the Collibra DQ pages included in the Beta redesign as of this release and where each page is located in the navigation.
Page | Location
---|---
Homepage | Homepage
Sidebar navigation | Sidebar navigation
User Profile | User Profile
List View | Views
Assignments | Views
Pulse View | Views
Catalog by Column (Column Manager) | Catalog (Column Manager)
Dataset Manager | Dataset Manager
Alert Definition | Alerts
Alert Notification | Alerts
View Alerts | Alerts
Jobs | Jobs
Jobs Schedule | Jobs Schedule
Rule Definitions | Rules
Rule Summary | Rules
Rule Templates | Rules
Rule Workbench | Rules
Data Classes | Rules
Explorer | Explorer
Reports | Reports
Dataset Profile | Profile
Dataset Findings | Findings
Sign-in Page | Sign-in Page
Note Admin pages are not yet fully available with the new Beta UI.
Beta UI Limitations
Explorer
- When using the SQL compiler on the dataset overview for remote files, the Compile button is disabled because the execution of data files at the Spark layer is unsupported.
- You cannot currently upload temp files from the new File Explorer page. This may be addressed in a future release.
- The Formatted view tab on the File Explorer page only supports CSV files.
Connections
- When adding a driver, if you enter the name of a folder that does not exist, a permission issue prevents the creation of a new folder.
- A workaround is to use an existing folder.
Admin
- When adding another external assignment queue from the Assignment Queue page, if an external assignment is already configured, the Test Connection and Submit buttons are disabled for the new connection. Only one external assignment queue can be configured at the same time.
Profile
- When adding a distribution rule from the Profile page of a dataset, the Combined and Individual options incorrectly have "OR" and "AND" after them.
- When using the Profile page, Min Length and Max Length do not display the correct string length. This will be addressed in an upcoming release.
Scorecards
- Because of a missing function, you cannot currently create a new scorecard from the Page dropdown menu.
- While a fix for this is planned for the September (2023.09) release, a workaround is to select the Create Scorecard workflow from the three dots menu instead.
Navigation
- The Dataset Overview function on the Metadata Bar is not available for remote files.
- The Dataset Overview modal throws errors for the following connection types:
- BigQuery (Pushdown and Pullup)
- Athena CDATA
- Oracle
- SAP HANA
- The Dataset Overview function throws errors when you run SQL queries on datasets from S3 and BigQuery connections.