Release Notes

2022.06 (In Progress)

Fixes / Enhancements

  • DQ Job
    • Fixed an issue with the Learning Phase in the Behavior feature. (ticket #82907)
  • Admin
    • The Agent Group (H/A) and its associated endpoints are now deprecated. (ticket #83086)

2022.05

Fixes / Enhancements

  • DQ Job
    • You can no longer update the dataset name (-ds) from the command line.
      • A helpful error message now appears if changes are made to -ds.
    • Stop Job action is no longer enabled for K8s.
    • Fixed an issue for Dremio jobs where jobs hang when editing or cloning an existing dataset.
  • Outliers
    • Added "username" to outlier boundary table to track who creates the boundary.
      • The outlier boundary again saves correctly after the addition of a username.
    • Fixed an issue that caused jobs to fail when Day was selected from the By dropdown.
  • Rules
    • Rules Preview drill-in capabilities are now improved:
      • You can now configure Preview Limits based on the individual rule.
        • Freeform and Simple rules are currently supported for the Preview Limit feature.
      • You can now set any positive number as the Rules Preview Limit.
        • When you update a Preview Limit value, you must re-run to apply the updated limit value.
      • On the DQ Job page, the details of an individual rule now display a paginated sub-table of all the break records.
      • When a rule is labeled as BREAKING for rule types other than Freeform and Simple, UI text now displays, "Data preview records are only available for Freeform and Simple rules."
    • You can now hover over stat rules to see their conditions.
    • Data Concepts is renamed Data Categories.
    • Semantics is renamed Data Classes.
    • When a Data Class is assigned to a dataset via Profile controls, a rule is now created.
  • Security
    • Vulnerabilities identified by JFrog
      • Vulns 0, criticals 0, high severity 9
      • For a visual readout, see the DQ Security Metrics section below.
    • The OS vulnerabilities from the images of Collibra DQ 2022.04 have been resolved by using the base image of RHEL8 to build the images for Collibra DQ 2022.05. The following OS utilities will not be available in the 2022.05 release images:
      • Unified, OpenSSL crypto/stack
      • Full YUM stack
      • OS tools, including tar, gzip, and vi
    • AD users can again use auth/signin REST API.
    • The Highcharts CVSS2: 9.3/CVSS3: 9.8 vulnerability is resolved.
    • The LOGJAM (CVE-2015-4000) SSL/TLS vulnerability is resolved.
    • The SpringShell (CVE-2022-22965) vulnerability is resolved.
    • TLS < 1.2 is no longer supported.
    • When Azure AD SSO sends a groups.link assertion, the application now tries to resolve the groups via the link.
      • You can now activate this setting by using the property, SAML_GROUP_LINK_PROP.
  • Profile
    • You can now edit or delete semantics by clicking anywhere in the semantics cell of the Profile column table.
    • You can now save annotations with special characters.
      • Special characters that are not currently supported include percent sign %, backslash \, and caret ^.
    • Fixed an issue where columns of broken rules were not highlighted.
  • Connections
    • You can now view a list of all packaged and optionally packaged drivers on our new Builds page.
    • The Databricks JDBC driver is now available.
    • You can now add Databricks datasets using the Databricks Simba driver.
  • Catalog
    • Fixed an issue where the deletion of a dataset caused orphaned links to datasets in other areas of Collibra DQ.
  • Admin
    • *Tech Preview* [TP] You can now use the ServiceNow integration through a proxy server from the Assignment Queues screen.
    • You can now access the new Usage page to view monthly historical usage statistics.
    • AD users with Admin privileges can now add Business Units.
    • AD users with Admin privileges can now manage local users.
    • The Agent Groups (H/A) feature is marked for deprecation and will be removed from the app in the 2022.06 release.
  • Explorer
    • You can again edit schema and table name from the Catalog page.
    • You can now navigate to a specific behavior tab directly from the Assignments page.
    • Fixed an issue when viewing Schemas in View Data wizard.
  • Scorecard
    • Single space, underscore _, and period . are now supported characters when saving a Scorecard name.
  • API
    • Improved API calls for the UserManagement Save function.
  • Reporting
    • *Tech Preview* [TP] Rule Summary page enhancements
      • You can now filter rule breaks by a specified date range and view charts for Most Used Rule Types, Dataset with Most Rules, and Top Rules Run.
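
Because TLS versions below 1.2 are no longer supported (see Security above), scripts that call the DQ REST API may want to pin their own minimum protocol version. The following is a minimal sketch, assuming a Python client; the helper name make_tls12_context is hypothetical and is not part of Collibra DQ.

```python
import ssl


def make_tls12_context() -> ssl.SSLContext:
    """Return a client-side context that refuses TLS 1.0 and TLS 1.1."""
    context = ssl.create_default_context()
    # Reject any handshake that would negotiate a protocol older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls12_context()
print(context.minimum_version.name)
```

The returned context can be passed to any standard-library client that accepts an ssl.SSLContext, for example urllib.request.urlopen(url, context=context).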

Known Limitations

Warning Delta Files

A bug was introduced as a result of removing CVEs in 2022.05. If you use Delta files (-delta), it is not advised to upgrade until an update is available.

  • Explorer
    • Except for underscore _, special characters are not currently supported in schema or table names.
  • Admin
    • *Tech Preview* [TP] ServiceNow integration
      • Only the local Docker container proxy has been tested and verified.
      • The Test Connection button's credential validation is currently limited when the ServiceNow URL is valid.
      • The Validate All Rules function currently results in a failure.
      • You cannot edit an active ServiceNow assignment.
        • Invalidate/Validate or Resolve actions result in a failure.
      • You can assign a ServiceNow ticket with an embedded URL when escaped with double quotes.
        • No assignment is sent without this process.
  • Multi-Tenant
    • Tenant names should be lowercase. Use lowercase characters when creating a tenant from the multi-tenant admin page. The current limitation is around the schema that is generated.
  • Reporting
    • *Tech Preview* Rule Summary page enhancements
      • Sorting any column returns an error.
      • Users must use the date picker because manual date entry is not honored.
      • The start and end date are out of order when navigating to the page.
      • The last page of the paginated list does not change when the date criteria are updated.

DQ Security Metrics

2022.04

Install

Tip For standalone installations, within the setup.sh script find/replace the variable for spark_package.

Change spark-3.0.1-bin-hadoop3.2.tgz to spark-3.1.2-bin-hadoop3.2.tgz

spark_package=${SPARK_PACKAGE:-"spark-3.0.1-bin-hadoop3.2.tgz"}

# replace with 

spark_package=${SPARK_PACKAGE:-"spark-3.1.2-bin-hadoop3.2.tgz"}
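
The find/replace described above can also be scripted. The following is a minimal sketch, assuming the contents of setup.sh have already been read into a string; the function name update_spark_package is hypothetical.

```python
OLD_PACKAGE = "spark-3.0.1-bin-hadoop3.2.tgz"
NEW_PACKAGE = "spark-3.1.2-bin-hadoop3.2.tgz"


def update_spark_package(script_text: str) -> str:
    """Swap the default Spark tarball name inside the setup.sh contents."""
    return script_text.replace(OLD_PACKAGE, NEW_PACKAGE)


line = 'spark_package=${SPARK_PACKAGE:-"spark-3.0.1-bin-hadoop3.2.tgz"}'
print(update_spark_package(line))
```

Text that does not contain the old package name is returned unchanged, so running the function twice is harmless.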

Fixes / Enhancements

  • DQ Job
    • Entering negative values for the downscore is no longer supported and will now produce an error message.
    • You can now invalidate schema with special characters.
    • Spark table names for the loaded historical dataset and other Spark tables are now available in the Jobs Log table.
    • Long type values larger than Integer.Max no longer break the Profile.
    • View Findings now displays user's full name, if applicable, in Validate Modal. Assignment queue page also displays the full name of user, if applicable.
  • Alerts
    • You can once again use the Cancel action button on the Alerts page.
    • You can now set up alerts to reach multiple email recipients.
    • If email_server table is not yet configured, a helpful message will now display in the Description column in the job log directing you to register an email Server under Admin - Alerts. The job will still run successfully.
  • Rules
    • You can now modify Rules definitions from the primary DQ Job dashboard without loading the Rules page.
    • Mean value check once again triggers correctly for Integer and Long columns.
      • This fix triggers the mean value check for Integer and Long columns and shows an infinity percentage change in behavior for a period, depending on -bhlb. After this period, it should disappear.
    • For Native SQL rules, jobs now behave the same whether or not a semicolon ";" is included in the SQL query.
    • You can now use a hyphen "-" in a dataset name.
      • Acceptable special characters now include a hyphen "-", period ".", and underscore "_".
    • Added a tooltip for Stat rules: when you hover your cursor over a condition in the Condition column, it displays which condition is being checked in the DQ Job.
    • Improved the exception message for when there are no values for a specific column while using a Stat rule.
    • The WebUI passing boundaries range has been updated to ().
    • For Freeform rules, IS NULL and IS NOT NULL no longer return invalid results in the Validation tab.
    • Added a pop-up success message for when the correct syntax rule passes for Freeform rules with secondary datasets after the Validate button is clicked.
  • Security
    • Vulnerabilities identified by JFrog
      • Vulns 2, criticals 2, high vulnerabilities
      • For a visual readout, see the DQ Security Metrics section below.
    • Authorization restriction is now enforced for the following endpoints:
      • /v2/deletefiledir
      • /v2/getRunIdsByDataset
      • /v2/putDatasetWeight
      • /v2/checkListofFilesPath
      • /v2/getlistagents
      • /v2/checkDriver
      • /v2/getconnectionssensitive
      • /v2/getemailgroups
      • /v2/getemailserver
      • /v2/addemailgroup
      • /v2/validateEmailAddress
      • /v2/getlistoffiles
      • /v2/getlistoffilespath
      • /v2/getDriverDir
      • /v2/getlistrolesbydataset
      • /v2/getlistrolesbydistnctdatasets
      • /v2/getlistrolesbyfunctiontypename
      • /v2/getlistusersbyauthority
      • /v2/getlocalDBRoles
      • /v2/getsecuritysettingsbytype
      • /v2/getowlcheckinventory
      • /v2/getconnectionspwdmgrsensitive
      • /v2/getsecuritysettingsbycoltype
      • /v2/getdbuserlist
      • /v2/getdbuserdetailsbyuser
      • /v2/getexternaladgroupstointernalroles
      • /v2/getlistdatasets
      • /v2/getlistdatasetsbyrole
      • /v2/getaudittrailitems
      • /v2/get-all-audit
      • /v2/get-datasets-audit-trail-items
      • /v2/get-all-dataset-audit
      • /v2/getactivityaudit
      • /v2/getallactivityaudit
      • /v2/getlocaldbrolesbyuser
      • /v2/getdatasetaclsecurity
      • /v2/getexternaladgrouplist
      • /v2/getexternaladuserlist
      • /v2//external-service-configuration
    • Local user accounts now have an account lockout feature implemented with the following restrictions:
      • A user's account will be locked if a password is entered incorrectly more than 10 times (configurable via app config).
      • The locked account can only be unlocked by an Admin user in the user management screen.
      • If an Admin is locked, another Admin can unlock their account.
      • If all the Admins are locked, enable the account via DB (update the users table "accountNonLocked" column to "1").
      • A user cannot use forgot password to reset their password while the account is locked.
    • CORS restriction is now enforced for SAML and multi-tenancy.
      • This breaks SAML unless the IDP is configured as a trusted origin in DQ, so the following property must be added to environment variables in order for DQ and SAML to work: CORS_ALLOWED_ORIGINS=
    • SAML login no longer automatically triggers on the login page during an existing session when accessing DQ base URL. For SAML login, you should instead use /saml/login.
      • API requests (v2/v3) return proper JSON response in case of failures.
      • auth/signin API is updated to provide JWT token for MT & local users.
  • Profile
    • Mean value once again displays in the Volume column.
    • When connecting to MSSQL server on Windows from a Linux DQ environment, the connection no longer fails.
      • We recommend (but do not require) connecting to MSSQL from a DQ Linux environment over TLS, with a properly signed certificate set up on the MSSQL server so that it accepts only TLS connections.
    • You can now edit annotations in the Labels tab.
  • S3
    • Added an enhancement for -addlib flag.
  • Connections
    • Added new Jconn4 driver for encrypted connections.
    • You can now save a local (NFS) file directory as a connection type.
    • See our newest connections page for a definitive guide to driver support.
    • BigQuery is now certified for production, but removed from packaged install.
  • Explorer
    • When toggling between fullfile and Union LookBack options, -fullfile and -fllb flags can no longer be generated together in the DQ Job command line.
    • Data Preview for Temp files loading in Explorer now correctly shows the order of columns of the original Temp file.
    • You can now drill in and search files within the connection.
    • You can now browse multiple local (NFS) file connections.
  • Scorecard
    • You can now create scorecards with special characters "^[A-Za-z0-9]+$" in their names.
  • Dupes
    • Added linkID column for exact match in both UI and REST API. linkID can now be either included or excluded from Dupes for exact match.
    • linkID is now shown at the aggregate level for Exact Match.
      • We recommend using this feature from a primary key perspective for its first iteration.
      • The aggregate function used is min().
        • For example: if you have 6 occurrences, you will get 1 example linkID, the min.
  • API
    • Updated the /v2/getlistdataschemapreviewdbtablebycols API call method from GET to POST to support the long query (-q) or very large columns table.
    • Added a new SAML load balancer so the system picks the appropriate schema and SAML server URL for Swagger.
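
The linkID aggregation described under Dupes above (one example linkID per exact-match group, chosen with min()) can be illustrated as follows. This is a minimal sketch of the idea, not the Collibra DQ implementation; the function name, column names, and sample rows are hypothetical.

```python
from collections import defaultdict


def min_link_id_per_group(rows, key_columns, link_id_column):
    """Group exact duplicates on key_columns and keep the smallest linkID of each group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        groups[key].append(row[link_id_column])
    # Only keys that occur more than once count as duplicates.
    return {key: min(ids) for key, ids in groups.items() if len(ids) > 1}


# Six exact-match occurrences of the same record, plus one unique record.
rows = [{"id": i, "name": "Ann", "city": "Oslo"} for i in (17, 4, 9, 12, 31, 25)]
rows.append({"id": 2, "name": "Bob", "city": "Rome"})
print(min_link_id_per_group(rows, ["name", "city"], "id"))
```

With six occurrences of the same record, the result holds a single linkID for that group, the minimum (4 in this sample).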

Known Limitations

  • Profile
    • Special characters are not currently supported in annotations in the Label tab.
  • Scorecard
    • Space " ", underscore "_", and period "." are not yet supported for scorecard edit.

DQ Security Metrics

2022.03

Fixes / Enhancements

  • DQ Job
    • The -validatevaluesshowmissingkeys option now allows the extrapolation of missing keys between target and source.
    • Newly created jobs will no longer be marked incorrectly with enclosing double quotes.
    • File names with spaces are now handled with double quotes within the application.
  • Alerts
    • Email notifications now have Collibra branding and terminology.
    • Fixed Cancel Action for Delete functionality on Alert page.
  • Outliers
    • Fixed the issue where Numerical Outlier drill in graph wasn't displaying when perChange is NaN.
  • Rules
    • Added additional HealthCare Data Classes to Rule Library.
    • Fixed input validation rule of POST - /v3/rules/ endpoints. The following validation rules have been applied to RuleDTO.ruleName field:
      • Maximum size is 100.
      • Must comply with the following regular expression: ^[a-zA-Z0-9_]+$
    • The rules on the Hoot page now show the correct exception data when expanded if there are two or more rules with exceptions attached to the dataset.
  • Security
    • Vulnerabilities identified by JFrog
      • Vulns 0, critical, 6 high vulnerabilities
    • Password length has increased to a maximum of 72 characters.
    • Forgot password screen will now always show success message in UI regardless of success or failure.
    • Fixed an issue where an error message was thrown when adding or editing user roles.
    • Added error checks if the password manager script throws any errors.
    • Added the helper text "Enforce user roles to run the job" to DQ Job Security row.
    • User password field removed while updating user in user management screen.
      • An Admin can only set a password for another user while creating the new user, not while updating or modifying them.
      • To change a password, users can now use either the profile page or the self-service (Forgot password) feature.
    • XSS security
      • Fixed the vulnerability on scorecard, jobs, rules and catalog pages.
      • Fixed the vulnerability via remote connection.
    • Mitigated a Local File Inclusion vulnerability in the "/v2/getrawpreview" endpoint.
    • DQ HTTP session cookie is now secured by default when HTTPS is enabled.
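
The RuleDTO.ruleName constraints listed under Rules above (maximum size of 100 and the regular expression ^[a-zA-Z0-9_]+$) can be checked on the client before calling the endpoint. The following is a minimal sketch; the function name is_valid_rule_name is hypothetical, and the server remains the authority on what it accepts.

```python
import re

# Pattern and size limit as stated in the release note.
RULE_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_]+$")
MAX_RULE_NAME_LENGTH = 100


def is_valid_rule_name(rule_name: str) -> bool:
    """Return True when the name fits the documented length and character rules."""
    if len(rule_name) > MAX_RULE_NAME_LENGTH:
        return False
    return RULE_NAME_PATTERN.fullmatch(rule_name) is not None


print(is_valid_rule_name("null_check_01"))
print(is_valid_rule_name("null-check"))
```

The first call prints True and the second prints False, because a hyphen is outside the allowed character set.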

{% hint style="info" %} Rule Discovery Terminology Alignment

Data Concepts => Data Categories

Semantics => Data Classes {% endhint %}

  • Profile
    • Precision and Scale metrics are correct when using multi executors.
  • Admin
    • Added an Edge download page within Admin Console (for Cloud customers).
  • Validate Source
    • *Tech Preview* [TP] Update Source Scope.
      • Added "Update Source Scope" in the Query section of the Source tab.
  • Connection
    • Added handling for errors during log cleanup process.
  • API
    • Improved API calls for the Save function.

Known Limitations

  • Validate Source
    • *Tech Preview* [TP] Update Source Scope.
      • Only works for JDBC connections. Feature is hidden for remote, temp, local files.
      • Valsrc query won't be updated automatically when modifying column mappings. Use 'Preview' button to reset the feature if column mappings need to be changed.

2022.02

{% hint style="info" %} For new Standalone Collibra DQ installations, please double check 'Number of Core(s)' field when setting up 'Edit Agent' {% endhint %}

{% hint style="info" %} Adding UUIDs for jobs may take additional time on initial startup after upgrade {% endhint %}

Enhancements

  • DQ Job
    • Added UUIDs for jobs for better tracking between web and core
    • Improved DQ Job page load performance by optimizing calls
    • Fixed issue where DQ jobs would fail when -rd is in "yyyy-mm-dd HH" format
  • Outliers
    • *Tech Preview* [TP] Outlier Calibration
      • Feature flag can be set within owl-env.sh or configMap with export outlier_calibration_enabled=true (Default is off)
      • Ability to suppress Outlier observations for a user-determined length of time that would have otherwise surfaced as anomalies
      • Once feature is enabled, accessible within Outliers tab on DQ Job page
  • Alerts
    • Ability to navigate to dataset specific Alerts from DQ Job page
    • Ability to test SMTP alert configurations when adding an email relay
    • Fixed issue where 'Reply Email' field did not properly accept user input value
      • Please note there are no (Collibra imposed) domain restrictions on Reply Email field
  • Security
    • Stricter password policy is enforced on all user/tenant management screens/APIs.
      • The restrictions are as follows:
        • Minimum length of 8 characters.
        • Maximum length of 20 characters.
        • At least one upper-case letter.
        • At least one numeric character.
        • At least one special character (supported are !,%,&,@,#,$,^,*,?,_,~)
        • User ID and password cannot be the same.
        • Password cannot contain user ID.
    • Change Password functionality on user profile requires a current password of the user.
    • Mitigated 64 critical, 15 high, and 12 medium vulnerabilities identified by JFrog (internal-only report link)
    • Upgraded Log4J to 2.17.1
    • Added per-connection security checks to prevent users from running jobs on, or querying, tables they are not authorized to access. This applies when DB Connection Security is enabled in the Admin Console under General.
    • Implemented stricter session management
    • Implemented CORS restriction to mitigate potential CSRF vulnerability
      • Enforced a strict CORS policy by not allowing any domain. To allow other domains and tweak this behavior, the following properties are exposed as environment variables in owl-env:
        • CORS_ALLOWED_ORIGINS=http://facebook.com,http://google.com
        • CORS_ALLOWED_METHODS=GET,POST,OPTIONS,DELETE,PUT,PATCH
        • CORS_ALLOWED_HEADERS=X-Requested-With,Origin,Content-Type,Accept,Authorization
        • CORS_EXPOSE_HEADERS=
        • CORS_ALLOW_CREDENTIALS=false
        • CORS_MAX_AGE=0
  • *Tech Preview* [TP] Collibra Native DQ Connector
    • Fixed issue where tenant specified on DQ Connector configuration (issuer of the jwt token field within DGC Edge Management page) was not properly accepted; only rules that existed with 'public' schema were brought over; now the DQ Connector will accept the proper values
  • Agent
    • Upon potential deletion of an agent, added server side validation to indicate number of scheduled jobs so that users can understand if jobs fail going forward
  • Rules
    • Enhanced stability on Parallel Rule execution to ensure all rules load by reverting back to fixed thread counts
    • Display exceptions upon rule execution failure to improve rule management experience
    • Improvements to user experience in Rule Library tab (within Rules page) including filters and column alignment
    • Quick Rule dropdown within the Rules page will save with default severity of 1 point and a threshold of 1 percent
    • Enhanced validation for rules generated in Profile tab
    • Fixed issue where removing semantic tag may not have removed corresponding auto-generated rule
    • Rule name character limit of 100
    • Rule Builder page now returns error messages where the dataset contained 0 records
  • Catalog
    • Renaming Dataset from Catalog page keeps associated rules
      • Clone only creates the dataset shell (with DQ Job run configs); no additional rules, etc., are copied
    • Bulk actions support for Data Concepts
    • Fixed issue where child of business unit could be assigned as parent
    • Fixed issue where clearing individual filters was not functioning
  • Validate Source
    • *Tech Preview* [TP] New collapsible section for Query in Source tab; enables users to use a custom srcq, similar to the query section on the Home tab, so that users do not need to edit -srcq in the cmd line editor on the Run tab
    • Introducing new observation types via -valscrshowmissingkey flag
      • Key not in source
      • Key not in target
    • Source Name should be fetched as part of getcatalogandconnsrcnamebydataset API call for a given dataset
    • Fixed issue which prevented Hive from working as Target
  • Export / Import
    • Fixed issue that import could not accommodate more than one table insert
    • Fixed bug where certain values were inadvertently inserted into RegEx rules upon Export
    • New endpoints added for db-export and db-import
  • Connection
    • Fixed Out Of Memory issue with Dremio
      • Explicitly added limit clause in the preview query within Update Scope
      • Dremio driver requires double quotes in Schema, Table, and Column names e.g. "SchemaName"."TableName"
    • Fixed Oracle TIMESTAMPLTZ conversion error
  • Explorer
    • Fixed issue where 'Analyze Table' option did not populate for Hive
    • Fixed the static date values showing up in Managed Template and Run Check while running the job via v2/runtemplate API call from swagger UI
  • Files
    • File names with spaces are now handled with double quotes
    • Implemented Supported File Type Check at time of uploading the Temp Files via Explorer
      • Default supported file types are "csv,json,parquet,avro,delta".
      • In order to add/update the supported file types and ensure validation, a new environment variable needs to be added in owl-env.sh as below: export ALLOWED_UPLOAD_FILE_TYPES="csv,json,parquet,avro,delta"
      • Tip: For remote files with delimiter, please use the csv dropdown options for files with .txt extension
    • *Tech Preview* [TP] Users have ability to assign an agent when using temp file and local file Explorer paths without manually appending -master to agent or job (previous known limitation)
    • LIMIT values are now properly accepted on the Scope & Range query panel
  • Dupes
    • Fixed issue where column selections were not retained from the original DQ Job with Dupes ON for future runs
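
The password policy listed under Security above can be expressed as a simple checker. The following is a minimal sketch under stated assumptions: the function name and messages are hypothetical, and the user ID comparison is assumed to be case-sensitive, which the release note does not specify.

```python
from typing import List

# Special characters listed as supported in the release note.
SPECIAL_CHARACTERS = set("!%&@#$^*?_~")
MIN_LENGTH = 8
MAX_LENGTH = 20


def password_policy_violations(user_id: str, password: str) -> List[str]:
    """Return the list of policy rules the password breaks (empty when it passes)."""
    violations = []
    if len(password) < MIN_LENGTH:
        violations.append("shorter than 8 characters")
    if len(password) > MAX_LENGTH:
        violations.append("longer than 20 characters")
    if not any(ch.isupper() for ch in password):
        violations.append("no upper-case letter")
    if not any(ch.isdigit() for ch in password):
        violations.append("no numeric character")
    if not any(ch in SPECIAL_CHARACTERS for ch in password):
        violations.append("no supported special character")
    # Covers both "same as user ID" and "contains user ID" (assumed case-sensitive).
    if user_id and user_id in password:
        violations.append("contains the user ID")
    return violations


print(password_policy_violations("jdoe", "Str0ng!Pass"))
print(password_policy_violations("jdoe", "jdoe"))
```

The first call returns an empty list. The 2022.03 notes raise the maximum password length to 72 characters, so MAX_LENGTH would change accordingly for that release.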

Known Limitations

  • Rules
    • Cannot currently create rule with API /v3/rules; will be fixed in future release
      • Please use /v2/createrule API
  • Profile
    • Stat Rules
      • Tool tips will only generate when Max Precision and Max Scale are greater than 0
  • DQ Job
    • /v2/runtemplate API still creates 'zombie' job
      • Please use /v3/jobs/run
  • LinkID
    • LinkID column selection is case sensitive; breaks may not appear if case does not match
  • Outliers
    • Outlier Calibrate
      • Outliers cannot retrain on-demand; to suppress existing Outliers, must rerun the DQ Job for those date(s)
      • In-app labels do not exist for Outliers which have been subject to past, current, or future calibration; references only exist within the outlier_boundary table in the metastore

[Informational Only] New Tables Introduced To Metastore In 2022.02

  • outlier_boundary

[Informational Only] Changes To Metastore Made In 2022.02

ALTER TABLE validate_source_metadata ADD COLUMN IF NOT EXISTS validate_values_show_missing_keys boolean DEFAULT false
ALTER TABLE opt_source ADD COLUMN IF NOT EXISTS validate_values_show_missing_keys boolean DEFAULT false

ALTER TABLE opt_source ADD COLUMN IF NOT EXISTS filter_cols character varying[]

ALTER TABLE user_profile ADD COLUMN IF NOT EXISTS external_user_id VARCHAR

ALTER TABLE owlcheck_q ADD COLUMN IF NOT EXISTS agent_job_uuid UUID
ALTER TABLE job_log ADD COLUMN IF NOT EXISTS job_uuid UUID
ALTER TABLE platform_logs ADD COLUMN IF NOT EXISTS job_uuid UUID
ALTER TABLE platform_logs DROP CONSTRAINT IF EXISTS platform_logs_job_uuid_ux
ALTER TABLE platform_logs ADD CONSTRAINT platform_logs_job_uuid_ux UNIQUE (job_uuid)
ALTER TABLE opt_owl ADD COLUMN IF NOT EXISTS job_uuid UUID

2022.01

Enhancements

  • DQ Job
    • Fixed issue where backrun "-br" flag was inadvertently added on future runs (error contained in 2021.12) if the initial DQ Job setup Explorer selected backrun
    • Improved validation to not allow for slashes in dataset name
  • Validate Source
    • Fixed potential DQ Job failure with Source turned on for some legacy installations when upgrading from older versions to 2021.11 and newer
  • Explorer
    • DB_VIEWS_ON can be added with TRUE or FALSE values by adding new App Config (Add Custom within Admin -> Configuration)
    • -Addlib flag now working across JDBC connections
    • Update Scope now supports rdEnd
  • Rules
    • When creating rules, you can set a run-time limit (in minutes) for each rule on the Rule page UI and in the V3 API (by setting the runTimeLimit property). The default is 30 minutes if not explicitly set. The longest limit among the rules in a job sets the overall timeout for the Rule stage. For example, if a job has 10 rules, 9 with a 30-minute limit and 1 with a 90-minute limit, the DQ Job waits up to 90 minutes for all 10 rules to finish. This is because all rules must finish before the Rule stage of the DQ Job can finish and move to the next stage. Async stages, where one long-running rule keeps running while the job itself moves on to the next stage, are not supported.
    • Added ability to specify score of 0 to a rule
    • Improvement to Stat Rules to fail without exception when result is not within range
  • Profile
    • Fixed ability to remove a business unit from a dataset
    • Fixed issue where data concepts were not correctly displaying on a dataset's Profile page
    • Fixed sensitive labels not being assigned from Discovery
    • Treat certain doubles, floats, decimal types as Decimal format that preserves length and prevents Java from truncating to E11 format
    • Removed commas when displaying date columns
  • Security
    • SAML Login fix for IDPs that use POST binding as default
  • S3
    • Enhanced support where "." in column headers were causing large jobs to not complete
      • Underscores now replace periods and large jobs should no longer hang
  • Connections
    • Updated default Snowflake template connection properties
      • Corrected 'db' parameter placeholder on connection string versus former 'databaseName'
    • Added BigQuery connection troubleshooting information
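
The rule run-time limit described under Rules above implies that the Rule stage waits for the longest per-rule limit. The arithmetic can be sketched as follows; the function name is hypothetical, and treating an unset limit as None is an assumption made for illustration.

```python
DEFAULT_RUN_TIME_LIMIT_MINUTES = 30.0


def rule_stage_wait_minutes(rule_limits):
    """Return the longest per-rule limit; None stands for a limit that was not set."""
    effective = [
        DEFAULT_RUN_TIME_LIMIT_MINUTES if limit is None else limit
        for limit in rule_limits
    ]
    # A job without rules has nothing to wait for.
    return max(effective, default=0.0)


# Nine rules with a 30-minute limit and one rule with a 90-minute limit.
limits = [30.0] * 9 + [90.0]
print(rule_stage_wait_minutes(limits))
```

For the example from the release note, the result is 90.0 minutes.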

Known Limitations

  • Local files using NO_AGENT require a valid $SPARK_HOME on the machine where the web server is running.
  • Supported data types
    • CLOB datatypes are unsupported
  • Explorer
    • -Addlib not yet supported for Remote Files e.g. S3

[Informational Only] Changes To Metastore Made In 2022.01

ALTER TABLE owl_rule ADD COLUMN IF NOT EXISTS run_time_limit DOUBLE PRECISION NOT NULL DEFAULT 30.0;
ALTER TABLE owl_rule ADD COLUMN IF NOT EXISTS scoring_scheme INT4 NOT NULL DEFAULT 0;

ALTER TABLE job_log ALTER COLUMN stage TYPE character varying; -- stage set to varchar because RULE logs rule_nm into stage
ALTER TABLE job_log ALTER COLUMN log_desc TYPE character varying;
ALTER TABLE job_log ALTER COLUMN log_hint TYPE character varying;

2021.12

*Note to Standalone Collibra DQ Customer Upgrades*: We have upgraded to Log4J 2.17, please refer to standalone-upgrade.md for additional steps

Enhancements

  • Rules
    • Semantic and data concept management: Run Discovery feature
      • Run Discovery feature can be accessed from Catalog by selecting 'Data Concept' option from Actions or clicking the 'Run Discovery' button on the Rules tab of the DQ Job page. This will run a DQ Scan to detect the semantics assigned to the selected data concept
      • Algorithm now selects best match if column matches 2 or more data classes based on % match and name distance
    • *Tech Preview* [TP] Configurable rule break preview limit
      • Global default is 6 max rows per rule
      • Any change from 6 must be specified with previewLimit (API /v2/createrule) or in the Preview Limit field (UI)
      • Maximum of 50 from UI
    • Introducing additional Stat Rules including minPrecision, maxPrecision, minScale, maxScale
  • Behavior
    • Min and max value checks are now triggered for all numeric columns when selected, even if column contains zeroes in lookback period
    • AR column view graph now shows the Mean value for current day (runId). No re-run of DQ Job is necessary. The displayed Mean makes it clear that the % change is the % change from the mean, not runId - 1 day.
    • Findings in behaviors that were directly correlated to a row count shift as the root cause have been optimized, such that a major deviation in row count will no longer down-score related fields in the dataset to reduce noise
  • Catalog
    • Catalog now features intelligent ranking based on Recency, Most Scanned, User
  • Outliers
    • Dynamic minimum history allows for gaps in dates when establishing lookback period, which is established by history with row count > x (specified by user)
    • Fixed issue where outlier data preview graphics were not displayed
    • Fixed issue where outlier results did not honor the initial scope where clause, in particular for Remote Files (S3)
  • Connections
    • BigQuery: Enhanced support for cataloging host name
  • Pulse View
    • Pulse view can filter Connections and Users
    • Pulse view can serve as proxy verification on whether scheduled jobs were successfully completed
  • Profile
    • Viewable precision and scale statistics for double, float, and decimal data types
  • Shapes
    • Fixed issue where data shape preview was not available when the same shape is detected on the same row for different columns
  • Files
    • *Tech Preview* [TP] Users have ability to assign an agent when using temp file and local file Explorer paths
      • Known limitation: -master must be freeform appended to the agent or to each job
    • Support for multicharacter delimiters
    • Improved delimiter support to distinguish string commas versus actual CSV commas to align data to respective columns
  • Agent
    • Fixed issue where certain completed jobs could not be re-run on the DQ Jobs page. In other words, NO_AGENT was the only available option in the Agent dropdown. Now, users can select valid agents in the dropdown and this will persist for future scheduled jobs
  • Schedule
    • Implemented validation to enforce user to choose days when picking schedule to avoid Java error messages
  • Explorer
    • Fixed issue where '&' was not properly supported when adding additional parameters
  • API
    • JSESSIONID session time is configurable
    • Bearer token and JSESSIONID authentication paths are properly forked
  • Pattern
    • Patterns activity now shows Count (number of times the current dataset has the Pattern breaks). This Count is interpreted the same way as Outlier activity Count
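
To illustrate the Files item above about distinguishing commas inside strings from delimiter commas, here is a minimal Python sketch using the standard csv module. The sample data and the function name parse_rows are hypothetical, and the sketch does not show how Collibra DQ parses files internally.

```python
import csv
import io


def parse_rows(text, delimiter=","):
    """Parse delimited text; delimiters inside quoted strings stay part of the value."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar='"')
    return [row for row in reader]


# Hypothetical sample: the second field of the data row contains a comma inside quotes.
sample = 'id,name,city\n1,"Smith, Jane",Boston\n'
rows = parse_rows(sample)
```

Each parsed row keeps three columns, so the quoted comma does not shift values into the wrong column.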

2021.11

Enhancements

  • Rules
    • *Tech Preview* [TP] Semantics and data concepts management
      • The application now supports dynamic semantics checks. This allows you to create custom semantics that can be checked for when running a DQ check on a data set. Previously the application checked against a predefined set of semantics. You also have access to controls to organize and apply these semantics checks. The following is a list of changes:
        • There is a new data concepts management page. You can access it from Catalog or Admin Console. You can assign multiple semantics to a data concept.
        • When running a DQ check, you can select a data concept. The semantics assigned to this data concept will be checked against each column of the dataset.
        • You have a list of predefined semantics that are not editable. You also have the ability to create/edit/delete custom semantics.
        • The Repo on the Rules page has been added to the Rules Library, where semantics are viewable.
  • Resource Limits
    • You can edit the Performance Settings to set limits on executors, cores, memory, and cells. Users are warned when they submit a job that requires a lot of resources, and admins can control the maximum resources submitted.

Enhancements

  • Explorer
    • *Tech Preview* [TP] Dynamic query reload allows you to view JOIN query columns in other activities.
      • Users can update and reload the schema table with the custom query in the scope section by clicking the [Update Scope] button. This enables the new columns from the custom query in all activities (Profile, Outlier, Dupes, Patterns, Source)
      • Because the first tab is for composing the query, updating fields there would change the user's custom query. Therefore, after the scope table is updated, all areas except the "query" field in the first tab are locked to keep the query unchanged
    • Support for some special characters in table name.
    • Fixed an issue where additional libs were not saved properly on subsequent runs. Under DQ Job tag, use the -fllb boolean (union lookback) and the libsrc input box for the lib directory path (materializes as -addlib).
  • Connections
    • *Tech Preview* [TP] BigQuery Views and Joins
    • Add the following to the BigQuery connection properties:
viewsEnabled=true
  • API
    • You can perform multiple imports without conflicts.
    • Imports can be incremental: matching records are updated, new records are inserted, and other existing records are left as-is. You do not need to delete tables before running an import.
  • Profile
    • Fixed backrun timebin to work with weeks and quarters instead of days.
  • Outliers
  • Source
    • Fixed an issue where settings were not sticky for subsequent runs.
  • Security
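
The incremental import behavior described under API above (update matching records, insert new records, leave other existing records) can be sketched as follows. This is only an illustration of the merge semantics, not the application's implementation; the function name merge_import and the record fields rule_nm and points are hypothetical.

```python
def merge_import(existing, incoming, key="rule_nm"):
    """Return merged records: matches updated, new ones inserted, the rest left as-is."""
    merged = {rec[key]: dict(rec) for rec in existing}
    for rec in incoming:
        if rec[key] in merged:
            merged[rec[key]].update(rec)  # update the matching record
        else:
            merged[rec[key]] = dict(rec)  # insert a new record
    return list(merged.values())


# Hypothetical records: r2 exists in both sets, r3 is new, r1 is untouched.
existing = [{"rule_nm": "r1", "points": 1}, {"rule_nm": "r2", "points": 5}]
incoming = [{"rule_nm": "r2", "points": 10}, {"rule_nm": "r3", "points": 2}]
result = merge_import(existing, incoming)
```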

Patches

  • 2021.11.1 Explorer
    • Allow ampersand in metastore host name for additional parameters
    • In the example below, ampersand support is needed for the required SSL flags
metastore01.us-east1-b.c.customer-dq-prod.internal:5432/dev?sslmode=required&currentSchema=public
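
To show why the ampersand matters in the example above, the following Python sketch splits the value into its host part and its ampersand-separated parameters using the standard urllib.parse module. The function name split_host_and_params is hypothetical, and this is not how the Explorer itself processes the field.

```python
from urllib.parse import parse_qs


def split_host_and_params(value):
    """Split 'host:port/db?k=v&k2=v2' into the host part and a dict of parameters."""
    host_part, _, query = value.partition("?")
    params = {name: values[0] for name, values in parse_qs(query).items()}
    return host_part, params


example = (
    "metastore01.us-east1-b.c.customer-dq-prod.internal:5432/dev"
    "?sslmode=required&currentSchema=public"
)
host_part, params = split_host_and_params(example)
```

Without ampersand support, only the first parameter could be passed, so the SSL flag and the schema could not be set together.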

Known Limitations

  • Rules
    • Semantics and data concepts:
      • Not supported in pushdown mode
      • Exporting RegEx semantics not currently supported
    • While it is possible to create joins and cross-dataset rules using Freeform SQL, it is best practice to create a view and handle the join prior to running the DQ Job.
  • Behavior
    • Schema is not eligible for invalidate
  • Files
    • Local files using UPLOAD_PATH, UPLOAD_FILE_PATH, and temp files are only eligible to be deployed using the default NO_AGENT option. These are only intended for quick tests and not intended for production-scale use. Best practice is to use a remote file system connection (S3, Google storage or ADLS).
    • Delimiter support for special characters is limited. Supported file delimiters are comma, pipe, tab, semicolon, double quote and single quote. Custom delimiters will work for many characters, but not all combinations.
    • Temp files and NO_AGENT should have -master local[*] or -master spark://:7077 defined in freeform append of the agent options
  • DQ Job
    • When submitting jobs via API from a different machine with a different timezone, timezone discrepancies are not accounted for automatically. Best practice is to align each component to use UTC.
    • Jobs submitted via API with a run date that includes HH:MM in the -rd (run date) are submitted to the job queue and leave a remnant ‘STAGED’ job
  • Connections
    • Postgres limits max connections per Spark job. The default is 100. Refer to the official Postgres documentation for how to increase max_connections and shared_buffers.
    • BigQuery
      • Updating scope to include joins in BigQuery can only be materialized when tables are part of the same dataset collection
      • Should you receive an error for pre-existing BigQuery jobs, please add -dssafeoff to the cmd line or select ‘Allow Overwrite’ to enable this from Edit mode in the Explorer
  • Alerts
    • After an upgrade to 2021.11, you may need to set the environment variable ALERT_SCHEDULE_ENABLED=true in owl-env.sh and restart owl-web to enable email alerts to work again.
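
For the DQ Job limitations above, a client that submits jobs via API could compute the run date in UTC and omit the HH:MM portion. The sketch below assumes a date-only YYYY-MM-DD value is acceptable for -rd; the function name utc_run_date and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc_run_date(moment=None):
    """Format a date-only run date (YYYY-MM-DD) in UTC, with no HH:MM portion."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d")


# Hypothetical client clock: 23:30 on 2021-11-30 at UTC-5 is already 2021-12-01 in UTC.
local = datetime(2021, 11, 30, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
run_date = utc_run_date(local)
```

Converting before formatting avoids submitting a run date that is off by one day relative to a server running in UTC.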

2021.10

Enhancements

  • DQ Job
    • Refactored DQ Job Score to Gauge Chart
  • Explorer
    • Fixed issue where permissions are checked on datasets that do not yet exist
  • Connections
    • Sybase 'Test / Preview' now available
    • Updated web model of saving additional connection properties
    • Fixed scenario where editing connection yields null instead of empty for multiple values
  • Rules
    • Placeholder new searchable Rule Summary Page for Rule statistics / insights
  • Alerts
    • Updated Alert Mailer to TLS 1.2 to resolve Third Party Error exception
    • Fixed issue where alerts are deleted even when clicking cancel button
  • Behavior
    • Fixed issue where user must refresh to have invalidated item removed from UI
  • Search
    • Fixed search on Audit Datasets and Dataset Management page
  • Scorecards
    • Date ranges are now customizable
  • Validate Source
    • Added a 'trim' option for String columns when running source-target validation. Extra spaces in the cell are trimmed on both ends (left and right)
  • Dupes
    • Resolved issue with white spaces in column headers blocking duplicate detection
  • Security
    • Added configuration for setting the SAML_ENTITY_BASEURL, which sets the Consumer service url for the SP Metadata
  • Shapes
    • Fixed issue where custom values override even after toggling Shapes back to auto or off
  • Console
    • Fixed uncaught TypeError on login screen
    • Fixed GET timeout error on registration page
  • Export/Import API
    • Users will be able to run the export/import API calls to conduct multiple promotions on the repo, schedule, and rule tables.
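
The effect of the Validate Source 'trim' option described above can be sketched in Python as a comparison that strips spaces from both ends of string values. This is an illustration only; the function name values_match is hypothetical and does not reflect the product's internal comparison logic.

```python
def values_match(source, target, trim=False):
    """Compare a source and target cell, optionally trimming both ends of strings."""
    if trim and isinstance(source, str) and isinstance(target, str):
        return source.strip() == target.strip()
    return source == target


# Hypothetical cells that differ only by surrounding spaces.
without_trim = values_match("abc ", " abc")
with_trim = values_match("abc ", " abc", trim=True)
```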

Patches

  • 2021.10.1 Import / Export API without constraint conflicts
    • The import must exactly match the format of the export so that columns and values can be parsed to update records that already exist
owl_rule
owl_check_repo
job_schedule
rule_repo

Known Limitations

  • File sizes
    • Individual files greater than 5 GB will experience performance degradation in Explorer for Standalone installs. Best practice is to save in smaller chunks and use bypass schema in the Explorer if needed.
    • Individual files greater than 25 GB will experience performance degradation in Core for Standalone installs.
  • Files
    • Explorer / browser will generally have difficulty supporting > 250 columns in files
  • Profiling
    • Pushdown profiling on BigQuery, Redshift, Athena and Presto is available for specific datatypes.
    • Backrun option and flag will persist beyond the first run (-br). Please remove this flag if you do not want to backrun again.
  • Explorer
    • QUARTER and WEEK are not supported time bins in this release.
    • On non-csv files, Explorer will not automatically infer file types. Users must change file type to the required value and click Step 2 "Load File". Nothing will change in Step 1 "File Information". A future enhancement will be added to automatically check filetypes by reading the first file
    • Dataset names should not contain special characters
  • Rules
    • Out of the box semantic rules cannot be edited (STATECHECK, GENDERCHECK, etc). Users can still apply their own global rules which can be customized.
    • LinkId does not support alias columns that are not part of the -LinkId definition
  • Connections
    • Connection names should not contain spaces
  • Validate Source
    • Complex Validate Source queries can only be edited from the CMD line or JSON directly before hitting Run.
  • Security
    • Active Directory in Azure SQL can connect via LDAP (basic auth) or Kerberos.
  • S3 / GS / ADLS
    • Remote storage connections should be defined using the root bucket only.
  • Estimate Job is only available for files when Livy is being used.
  • Stop Job on jobs page is limited and does not work for all installation types.
  • BigQuery connector does not work with views
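
For the File sizes limitation above, one way to save a large CSV in smaller chunks is to stream its rows and repeat the header in each chunk. The sketch below is a minimal Python example; the function name split_rows and the sample data are hypothetical, and a real file would be passed as an open file handle rather than an in-memory string.

```python
import csv
import io


def split_rows(lines, rows_per_chunk):
    """Yield chunks of CSV rows, repeating the header row at the top of each chunk."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == rows_per_chunk:
            yield [header] + chunk
            chunk = []
    if chunk:
        yield [header] + chunk


# Hypothetical sample with three data rows, split into chunks of two rows.
sample = "id,val\n1,a\n2,b\n3,c\n"
chunks = list(split_rows(io.StringIO(sample), rows_per_chunk=2))
```

Each chunk can then be written to its own file with csv.writer.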

2021.09 (09-2021)

New Feature

  • Alert
    • Alert notification page displaying a searchable list of alert emails sent. More info on Email Alerts
  • Job Page
    • UI refresh
    • New chart with successful and failed jobs

Enhancements

  • Profile
    • When a column has only a few errors (e.g. 0.005% null), the issues are now highlighted clearly and visibly instead of being rounded up and displayed as 100.0%
  • Jobs
    • Enhanced query and file date templating and variable options. This allows easier scheduling and programmatic templating for common date variables
    • Fixed issue where the Job Template corrupted the time portion of $ on the last run of a replay
    • Refactor job actions column
  • Catalog
    • Completeness report refactor / consolidation to improve performance
  • Export
    • Outlier tab in DQ Job page (hoot page) displays linkIds, which are also included in the export
  • Security
    • Added property for authentication age to reduce token expiration
    • UI labels are now more generic when configuring a connection with a password manager script
  • Agent
    • Agent no longer shows as red if services are correctly running
  • Logging
    • Jobs log retention policy is now configurable in Admin Console -> App Config via "JOB_LOG_RETENTION_HR" (variable must be added along with value). If not added, the default is 72 hours
    • Platform logs retention policy is now configurable in Admin Console -> App Config via "PLATFORM_LOG_RETENTION_HR" (variable must be added along with value). If not added, the default is 24 hours
  • Outliers
    • Fixed connection properties behavior given how multiple custom properties are handled in Hive
    • Fixed outliers issue that ignored WHERE clause on remote files
  • Scorecards
    • Fixed missing search results issue in list view for Patterns type
  • Connections
    • New templates for Redshift and Solr
  • Connections Security
    • Ticket Granting Ticket (TGT) authentication for HDFS & Hive
      • You can now choose the TGT auth model for connections and point to a TGT file as an additional Kerberos authentication model
    • Kerberos Principal + Password Manager for Hive
      • You can now use a password manager script to fetch a Hive password for a principal to authenticate
    • S3 SAML Auth (TP)
      • DQ is configured to use SAML based authentication to S3 buckets with password manager or provided credentials. Testing is limited to OneLogin for SAML Provider in this tech preview release

Patches

  • 2021.09.1 Validate Source
  • 2021.09.2 Validate Source Large DB load
  • 2021.09.3 Save on datashapes from new DQ Job

2021.08 (08-2021)

Please note updated Collibra release name methodology

Enhancements

  • Explorer
    • Support for handling large tables
    • Implemented pagination function for navigation
    • Improved error handling for unsupported java data types
    • Fix preview for uploaded temp files
  • Collibra DQ V3 REST APIs
    • Additional rest APIs for easier programmatic job initiation, job response status, and job result decisioning (used in pipelines). Streamlined documentation and user inputs allow users to choose any language for their orchestration wrapper (python, c#, java, etc). More info on Collibra DQ Rest APIs
  • Patterns
    • Fix load query generation issue when WHERE clause is specified
  • Behaviors
    • Fix behavior score calculation after suppressing AR
    • Fix percent change calculations in behavior AR after retrain
    • Mean Value Drift [New Feature] Behaviors
  • Security
    • Introduce new role of ROLE_DATA_GOVERNANCE_MANAGER with ability to manage (create / update / delete) Business Units and Data Concepts. More info on Collibra DQ Security Roles
    • Relaxed field requirements for password manager connections for App ID, Safe, and Password Manager Name
  • Scorecard
    • Enhanced page loading speeds for scorecard pages
  • Rules
    • Rule activity is now more efficiently parallelized across available resources
  • Validate Source
    • Pie chart will highlight clearly visible ‘issue’ wedge for anything less than 100%
  • UI/UX
    • Updated with more distinct icon set

2021.07 (07-09-2021)

Please note updated Collibra release name methodology

Enhancements

  • *Tech Preview* [TP] Collibra Native DQ Connector
    • No-code, native integration between Collibra DQ and Collibra Catalog
  • UI/UX
    • Full redesign of web user experience and standardization to match Collibra layout
    • Search any dataset from any page
  • Hoot
    • Rules Display with [more] links to better use the page space
    • Auditing for changes per scan
  • Explorer
    • JDBC Filter enablement by just search input
  • Profile
    • Add more support for data-concepts from UI or future release
  • Behaviors
    • Down-training per issue type
    • AR user feedback loop (pass/fail) for learning phase
  • Scheduler
  • Security
    • SQL View data by role vs just Admin
  • Reports

2.15.0 (05-31-2021)

Enhancements

  • Hoot
    • Down-training per activity vs globally
  • Logging
    • Expose server logs on the jobs page from the agent and cluster
  • Explorer
    • Enhanced Experience for display of stats for database tables
    • Validation for Dupes section to ensure all input is validated before save
    • Support for edit mode with Dremio connections
    • Allow file scan skip-N to skip a number of rows if extra headers are present in the file
    • Support Livy sessions over SSL for files
  • Profile
    • Add quick click rules based on profile distribution stats
  • Behaviors
    • Down-training per issue type
  • Scheduler
    • Support for $rdEnd in the template
    • Auto update schedule template based on last successful run
    • Support S3 custom config values in scheduled template
  • Security
    • SAML Auth
    • Support for JWT Authentication to the Multi-Tenant management section
  • Multi-Tenant
    • Support for an alternate display name for each tenant to be displayed in the UI and login tenant selection
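
The Explorer skip-N option above skips a number of rows when extra header lines precede the real header. As a rough illustration of the idea (not the product's implementation), the Python sketch below drops the first N lines before parsing; the function name read_after_skipping and the sample content are hypothetical.

```python
import csv
import io
from itertools import islice


def read_after_skipping(lines, skip_n):
    """Skip the first skip_n lines, then parse the remaining lines as CSV rows."""
    reader = csv.reader(islice(lines, skip_n, None))
    return list(reader)


# Hypothetical file with two extra lines before the real header row.
sample = "Report generated 2021-05-31\nSource: export\nid,val\n1,a\n"
rows = read_after_skipping(io.StringIO(sample), skip_n=2)
```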

2.14.0 (03-31-2021)

Enhancements

  • Hoot
    • Edit mode on Password Manager supported connections
    • Edit mode on complex query
    • Behavior chart display of last 2 runs
  • Explorer
    • ValSrc auto Save
    • Remote File
      • Support for Google Cloud Storage (GCS)
      • Support for Google Big Query
      • Folder Scans in Val Src
      • Auto generate ds name
      • FullFile for S3
      • in file path naming
      • Estimate Jobs (Only on K8s)
      • Analyze Days (Only on K8s)
      • Preview Data (Only on K8s)
  • Connection
    • Store source name to connect a column to its source db/table/schema
    • Custom driver props for remote file connection
  • Profile
    • Filtergrams for Password Manager connections
    • Filtergrams for Alternate agent path connections
    • Filtergrams on S3/GCS data source (Only on K8s)
  • Rules
    • UX on page size options
  • Scheduler
    • Support multiple db timezone vs dataset timezone
  • Outliers
    • Notebook API returns true physical null value in DataFrame instead of string "null"
  • Shapes
    • Expanded options for numeric/alpha
    • Expanded options for length on alphanumerics