Collibra DQ improves your data lake

Warning: This documentation is archived and is no longer maintained.

Data lakes support analytics, which ultimately drive actions that increase revenue, support compliance, prevent churn, and so on. Whether an action happens in near real time or not, none of these actions should be performed without first running a DQ check. For example, can you trigger an action before checking the “GDPR Remove” list? A data quality check must always be the first step in any action. Collibra DQ with Schema Learned can perform 100+ DQ checks. Beyond those checks, it is Collibra DQ's Spark-based architecture, described in the points below, that enables innovation. Churn, credit, AML, and infosec checks developed in the data lake could be added to the DQ check on streaming data.
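
The "check first, act second" idea can be sketched in a few lines of plain Python. This is a minimal illustration only, not Collibra DQ's API: the function names (dq_check, act_if_clean), the record fields (customer_id, email), and the individual rules are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DQResult:
    """Outcome of the illustrative DQ check for one record."""
    passed: bool
    failures: list = field(default_factory=list)


def dq_check(record, gdpr_remove_ids):
    """Run a few hypothetical checks on a record before any action is taken."""
    failures = []
    customer_id = record.get("customer_id")
    if customer_id is None:
        failures.append("missing customer_id")
    elif customer_id in gdpr_remove_ids:
        # The "GDPR Remove" list is consulted before anything else happens.
        failures.append("customer is on the GDPR remove list")
    email = record.get("email") or ""
    if "@" not in email:
        failures.append("invalid email")
    return DQResult(passed=not failures, failures=failures)


def act_if_clean(record, gdpr_remove_ids, action):
    """Trigger the action only when the record passes every check."""
    result = dq_check(record, gdpr_remove_ids)
    if result.passed:
        action(record)
    return result


if __name__ == "__main__":
    sent = []
    removed = {2}  # hypothetical IDs on the remove list
    print(act_if_clean({"customer_id": 1, "email": "a@example.com"}, removed, sent.append))
    print(act_if_clean({"customer_id": 2, "email": "b@example.com"}, removed, sent.append))
    print(len(sent))
```

Other rules developed in the data lake (churn, credit, AML, infosec) would fit into the same gate as additional checks that append to the failure list.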

  • Data and Privacy in Place. Data never has to move for a DQ check. The latency saved by operating in place, the added hybrid flexibility, and the privacy maintained serve many new use cases that were not possible before. Operating in place also removes consolidation done purely for the sake of consolidation. DQ does not have to start by moving data into a data lake.
  • DQ or Any Rules Applied in the Stream. The rules learned by Collibra DQ can be applied back at the source, to data in the stream. Non-DQ rules learned in the data lake can also be added to the DQ check.
  • Self-Service and DQ Push-Down Fix. DQ can enable a self-service push-down fix (recommendation engine) for anything flagged at the source. The best time and place to fix a DQ problem is when and where it started. This enables tighter integration with Data Governance tools, since DQ is maintained once at the source, not downstream where corruption beyond just the data can occur.
  • Multi-cloud/On-prem/Hybrid. Collibra DQ can scan, alert, and report at the source, or it can operate natively on the target data lake, such as Databricks Delta on Azure, Snowflake on AWS, or Qubole on GCP. Why compromise DQ just because your data is not in one place? Why settle for a DQ strategy that only works if the data is first migrated or moved?
  • DQ Dashboards. Many DQ problems result from business rules being applied to the data improperly or too slowly. What is not caught by manual visual inspection or by a potentially outdated hand-written rule can only be flagged by AI and machine learning. Conversely, what does get flagged should be easy to triage and then fixed immediately with the aid of AI. The most important metric for a DQ Dashboard is the time to fix, not simply the overall DQ score.
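
To make the time-to-fix metric concrete, here is a small sketch that computes it from a list of flagged issues. The data layout (pairs of flagged and fixed timestamps) and the function names are assumptions made for this example; they do not describe how Collibra DQ stores or reports issues.

```python
from datetime import datetime
from statistics import mean


def mean_time_to_fix_hours(issues):
    """Average hours between an issue being flagged and being fixed.

    `issues` is a list of (flagged_at, fixed_at) pairs; fixed_at is None
    for issues that are still open. Returns None if nothing is fixed yet.
    """
    durations = [
        (fixed_at - flagged_at).total_seconds() / 3600
        for flagged_at, fixed_at in issues
        if fixed_at is not None
    ]
    return mean(durations) if durations else None


def open_issue_count(issues):
    """Number of flagged issues that have not been fixed yet."""
    return sum(1 for _, fixed_at in issues if fixed_at is None)


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    issues = [
        (datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 10)),  # fixed in 2 hours
        (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 13)),  # fixed in 4 hours
        (datetime(2024, 1, 2, 9), None),                      # still open
    ]
    print(mean_time_to_fix_hours(issues))
    print(open_issue_count(issues))
```

A dashboard built around this number rewards fast triage and repair: a source with a high score but issues that stay open for days would stand out, which a score alone would not show.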