Data Retention Policy
The Data Retention Policy lets admins purge data from Collibra Data Quality & Observability according to time-based or size-based policies, in both single-tenant and multi-tenant environments.
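The following policy types are available:
- By Size (options in the first table below)
- By Time (options in the second table below)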
| Option | Description |
|---|---|
| Row Count Threshold | The minimum number of rows in the data_preview table for the system to consider purging data from it. |
| Runs Threshold | The minimum number of DQ Job runs of a given dataset that exceeds the Rows Per Dataset and Row Count Threshold minimum requirements for the system to consider purging data from it. |
| Rows Per Dataset | The minimum number of rows of a dataset that exceeds the Runs Threshold and Row Count Threshold minimum requirements for the system to consider purging data from it. |
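The three By Size options work together as minimum requirements. The sketch below shows one possible reading, in which a dataset is considered for purging only when all three minimums are met. The class and function names are hypothetical, and treating "exceeds" as "greater than or equal to" is an assumption, not confirmed product behavior.

```python
from dataclasses import dataclass


@dataclass
class SizePolicy:
    """Hypothetical container for the three By Size options."""
    row_count_threshold: int  # Row Count Threshold (rows in data_preview)
    runs_threshold: int       # Runs Threshold (DQ Job runs of a dataset)
    rows_per_dataset: int     # Rows Per Dataset (rows of a dataset)


def is_purge_candidate(policy: SizePolicy,
                       data_preview_rows: int,
                       dataset_runs: int,
                       dataset_rows: int) -> bool:
    """Return True when all three minimums are met (assumed semantics)."""
    return (
        data_preview_rows >= policy.row_count_threshold
        and dataset_runs >= policy.runs_threshold
        and dataset_rows >= policy.rows_per_dataset
    )
```

With a policy of `SizePolicy(1000, 5, 100)`, a dataset with 4 runs would not be considered under this reading, even if both row minimums are met.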
| Option | Description |
|---|---|
| Retention by Fields | Options include: |
| Retention Days | The number of days data is retained before being automatically purged. |
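As a rough illustration of Retention Days, the sketch below computes a cutoff date and selects the runs that fall before it. It assumes that age is measured from the run date and that a run dated exactly on the cutoff is kept; the function names are hypothetical and the actual field used depends on the Retention by Fields setting.

```python
from datetime import date, timedelta


def purge_cutoff(today: date, retention_days: int) -> date:
    """Earliest date still retained under the policy (assumed semantics)."""
    return today - timedelta(days=retention_days)


def runs_to_purge(run_dates: list[date], today: date, retention_days: int) -> list[date]:
    """Return the run dates strictly older than the cutoff."""
    cutoff = purge_cutoff(today, retention_days)
    return [d for d in run_dates if d < cutoff]
```

For example, with a 30-day retention evaluated on 2024-03-31, the cutoff is 2024-03-01, so a run dated 2024-02-29 would be selected while runs dated 2024-03-01 or later would not.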
What is purged?
When data is purged from Collibra Data Quality & Observability, data from each dataset included in the purge is cleared on a rolling basis. When a dataset is cleaned as part of this process, the following tables are purged from the Metastore.
| Table | Table description | Data |
|---|---|---|
| Shared by many activities | | |
| data_preview | The drill-in records for rules, outliers, shapes, and so on. | dataset and runId |
| observation | The type of finding discovered during a DQ Job run. | dataset and runId |
| Profile | | |
| dataset_scan | The findings scores and pass and fail information for a given DQ Job run. | dataset and runId |
| dataset_field | The profiling stats relative to any given column for a database table scan. | dataset and runId |
| dataset_field_value | The top and bottom N values of a dataset, including the unique count. | dataset and runId |
| datashape | The data shape format, associated linkID, and assignments. | dataset and runId |
| Dataset Histogram | | |
| dataset_hist | The historical job run values, including the averages, medians, and quartile information for a given dataset. | dataset and runId |
| Dataset Correlation | | |
| dataset_corr | The correlation data between columns. | dataset and runId |
| Behavior | | |
| behavior | The profile data observed during a DQ Job run, including min, max, and stats relative to a given column. | dataset and runId |
| item_label | The status of a behavioral observation, for example, Validate, Invalidate, and Resolve. | dataset and runId |
| Rules | | |
| rule_breaks | The rule breaks and link IDs for a given dataset and DQ Job run. | dataset and runId |
| rule_output | The results of a rule observed during a DQ Job run. | dataset and runId |
| Outliers | | |
| outlier | The outlier column, type, value, confidence score, associated link ID, and assignments. | dataset, runId, and true |
| Validate Source | | |
| validate_source | The source dataset and its associated observation types and counts related to validate source correlation. | dataset and runId |
| dataset_schema_source | The schema differences between source and target datasets observed during a DQ Job run. | dataset and runId |
| Other | | |
| alert_output | The contents of the alert email that is sent when the conditions required to trigger an alert are met during a DQ Job run. | dataset and runId |
| dataset_activity | A rollup of all the aggregate stats for a given DQ Job run, including the time it takes to run per activity. | dataset and runId |
| hint | The DQ Job activity stages and the stage details that populate the Job log. For example, LOAD - 15520 rows loaded to Historical dataframe. | dataset and runId |
| dq_inbox | All processed findings with rank values to help calculate the final impact to the DQ Job score. | dataset and runId |
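Because the listed tables are cleared by dataset and runId, a purge of one run can be pictured as the same keyed delete repeated across tables. The sketch below demonstrates the idea against an in-memory SQLite database, not the actual Metastore. The column names `dataset` and `run_id`, the function names, and the three-table subset are assumptions made for illustration.

```python
import sqlite3

# Illustrative subset of the tables listed above; the real schema may differ.
PURGE_TABLES = ["data_preview", "observation", "rule_breaks"]


def purge_run(conn: sqlite3.Connection, dataset: str, run_id: str) -> dict[str, int]:
    """Delete one dataset run from each table; return rows removed per table."""
    removed = {}
    for table in PURGE_TABLES:
        # Table names come from the fixed list above; values are bound parameters.
        cur = conn.execute(
            f"DELETE FROM {table} WHERE dataset = ? AND run_id = ?",
            (dataset, run_id),
        )
        removed[table] = cur.rowcount
    conn.commit()
    return removed


def demo() -> dict[str, int]:
    """Build throwaway tables with two runs each, then purge one run."""
    conn = sqlite3.connect(":memory:")
    for table in PURGE_TABLES:
        conn.execute(f"CREATE TABLE {table} (dataset TEXT, run_id TEXT)")
        conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?)",
            [("orders", "2024-01-01"), ("orders", "2024-01-02")],
        )
    return purge_run(conn, "orders", "2024-01-01")
```

Keying every delete on the same pair keeps the purge of one run from touching other runs of the same dataset.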
Additionally, the following dataset run data is included in the purge:
| Element | Description | Data |
|---|---|---|
| Dataset | Any dataset containing data flagged as sensitive. | String dataset, Date runId, dataset, runId, and "admin" |
| PII | Personally identifiable information observed in datasets during DQ Job runs. | String dataset, String colName, dataset, and colName |
| MNPI | Material nonpublic information observed in datasets during DQ Job runs. | String dataset, String colName, dataset, and colName |
After a data purge, dataset and runId details are included in the Security Audit.