Data Retention Policy
The Data Retention Policy lets admins purge data from Collibra Data Quality & Observability according to time-based or size-based policies, in both single-tenant and multi-tenant environments.
- By Size
- By Time
Option | Description |
---|---|
Row Count Threshold | The minimum number of rows in the data_preview table for the system to consider purging data from it. |
Runs Threshold | The minimum number of DQ Job runs a dataset must have before the system considers purging data from it. The dataset must also meet the Rows Per Dataset and Row Count Threshold minimums. |
Rows Per Dataset | The minimum number of rows a dataset must have before the system considers purging data from it. The dataset must also meet the Runs Threshold and Row Count Threshold minimums. |
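The way the three size-based options combine can be sketched as a simple eligibility check. This is a minimal illustration, not the product's implementation: the `SizePolicy` class, the `is_purge_candidate` function, and the inclusive (`>=`) comparisons are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SizePolicy:
    # Hypothetical field names that mirror the three options described above.
    row_count_threshold: int  # minimum rows in the data_preview table
    runs_threshold: int       # minimum DQ Job runs for a dataset
    rows_per_dataset: int     # minimum rows stored for a dataset


def is_purge_candidate(policy, preview_rows, dataset_runs, dataset_rows):
    """Return True only when all three minimums are met.

    Assumption: the thresholds are inclusive and must all hold together.
    """
    return (
        preview_rows >= policy.row_count_threshold
        and dataset_runs >= policy.runs_threshold
        and dataset_rows >= policy.rows_per_dataset
    )


policy = SizePolicy(row_count_threshold=1000, runs_threshold=4, rows_per_dataset=200)
print(is_purge_candidate(policy, preview_rows=5000, dataset_runs=6, dataset_rows=350))
print(is_purge_candidate(policy, preview_rows=5000, dataset_runs=2, dataset_rows=350))
```

In this sketch, the second call returns `False` because the dataset has fewer runs than the Runs Threshold, even though the other two minimums are met.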
Option | Description |
---|---|
Retention by Fields | Options include: |
Retention Days | The number of days data is retained before being automatically purged. |
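As a rough illustration of the Retention Days option, the sketch below computes a cutoff date and selects the runs that fall before it. The function names are hypothetical, and the example assumes a run is purged when its date is strictly older than the cutoff. Which date field is compared depends on the Retention by Fields option, which this sketch does not model.

```python
from datetime import date, timedelta


def purge_cutoff(today: date, retention_days: int) -> date:
    """Return the oldest date that is still retained (assumed semantics)."""
    return today - timedelta(days=retention_days)


def runs_to_purge(run_dates, today, retention_days):
    """Return the run dates that are older than the retention window."""
    cutoff = purge_cutoff(today, retention_days)
    return [d for d in run_dates if d < cutoff]


runs = [date(2024, 2, 28), date(2024, 3, 1), date(2024, 3, 15)]
print(runs_to_purge(runs, today=date(2024, 3, 31), retention_days=30))
```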
What is purged?
When data is purged from Collibra Data Quality & Observability, data from each dataset included in the purge is cleared on a rolling basis. When a dataset is cleaned as part of this process, the following tables are purged from the Metastore.
Table | Table description | Data |
---|---|---|
Shared by many activities | | |
data_preview | The drill-in records for rules, outliers, shapes, and so on. | dataset and runId |
observation | The type of finding discovered during a DQ Job run. | dataset and runId |
Profile | | |
dataset_scan | The findings scores and pass and fail information for a given DQ Job run. | dataset and runId |
dataset_field | The profiling stats relative to any given column for a database table scan. | dataset and runId |
dataset_field_value | The top and bottom N values of a dataset, including the unique count. | dataset and runId |
datashape | The data shape format, associated linkID, and assignments. | dataset and runId |
Dataset Histogram | | |
dataset_hist | The historical job run values, including the averages, medians, and quartile information for a given dataset. | dataset and runId |
Dataset Correlation | | |
dataset_corr | The correlation data between columns. | dataset and runId |
Behavior | | |
behavior | The profile data observed during a DQ Job run, including min, max, and stats relative to a given column. | dataset and runId |
item_label | The status of a behavioral observation, for example, Validate, Invalidate, and Resolve. | dataset and runId |
Rules | | |
rule_breaks | The rule breaks and link IDs for a given dataset and DQ Job run. | dataset and runId |
rule_output | The results of a rule observed during a DQ Job run. | dataset and runId |
Outliers | | |
outlier | The outlier column, type, value, confidence score, associated link ID, and assignments. | dataset, runId, and true |
Validate Source | | |
validate_source | The source dataset and its associated observation types and counts related to validate source correlation. | dataset and runId |
dataset_schema_source | The schema differences between source and target datasets observed during a DQ Job run. | dataset and runId |
Other | | |
alert_output | The contents of the alert email that is sent when the conditions required to trigger an alert are met during a DQ Job run. | dataset and runId |
dataset_activity | A rollup of all the aggregate stats for a given DQ Job run, including the time it takes to run per activity. | dataset and runId |
hint | The DQ Job activity stages and the stage details that populate on the Job log. For example, LOAD - 15520 rows loaded to Historical dataframe. | dataset and runId |
dq_inbox | All processed findings with rank values to help calculate the final impact to the DQ Job score. | dataset and runId |
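Because most of these tables are purged by dataset and runId, the purge of a single dataset run can be pictured as one keyed delete per table. The sketch below is illustrative only: it uses an in-memory SQLite database as a stand-in for the Metastore, covers only three of the tables, and assumes hypothetical column names (`dataset`, `run_id`, `payload`) that are not taken from the actual schema.

```python
import sqlite3

# A subset of the table names listed above. Column names are assumptions.
PURGE_TABLES = ["data_preview", "observation", "rule_breaks"]


def purge_run(conn, dataset, run_id):
    """Delete one dataset run from each table and return rows deleted per table."""
    deleted = {}
    for table in PURGE_TABLES:
        # Table names come from the fixed list above, never from user input.
        cur = conn.execute(
            f"DELETE FROM {table} WHERE dataset = ? AND run_id = ?",
            (dataset, run_id),
        )
        deleted[table] = cur.rowcount
    conn.commit()
    return deleted


def demo():
    """Build a throwaway database with two runs per table, then purge one run."""
    conn = sqlite3.connect(":memory:")
    for table in PURGE_TABLES:
        conn.execute(f"CREATE TABLE {table} (dataset TEXT, run_id TEXT, payload TEXT)")
        conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?, ?)",
            [("orders", "2024-01-01", "a"), ("orders", "2024-01-02", "b")],
        )
    return conn, purge_run(conn, "orders", "2024-01-01")


conn, deleted = demo()
print(deleted)
```

Only the rows that match both the dataset and the run are removed, so the later run of the same dataset stays in place.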
Additionally, the following dataset run data is included in the purge:
Element | Description | Data |
---|---|---|
Dataset | Any dataset containing data flagged as sensitive. | String dataset, Date runId, dataset, runId, and "admin" |
PII | Personally identifiable information observed in datasets during DQ Job runs. | String dataset, String colName, dataset, and colName |
MNPI | Material nonpublic information observed in datasets during DQ Job runs. | String dataset, String colName, dataset, and colName |
After a data purge, dataset and runId details are included in the Security Audit.