Create profiles based on a table, view, or file.
Note You can scan the entire dataset or apply custom filtering to limit the depth (row filtering) and width (column selection) of the scan.
Select the Scope
You can find detailed instructions about selecting the scope in the Explorer section. You can run a row-limited scan, a scan filtered by time, or, if you have enough resources, a full table scan.
Select Options (or leave defaults)
Save / Run
View the Results
Collibra DQ automatically profiles data sets over time to enable drill-ins for detailed insights and automated data quality. A profile is the first step toward automated discovery. You can visualize segments of the data set and see how the data set changes over time.
Collibra DQ offers click or code options to run profiling.
Data Set Profile
Collibra DQ creates a detailed profile of each dataset under management. This profile will later be used to both provide insight and automatically identify data quality issues.
Collibra DQ can compute the profile of a data set using either Spark (default) or the data warehouse where the data lives (Profile Pushdown) as the engine. When the profile is computed by the data source's DBMS, you can choose between two levels of pushdown:
- Full Profile - Perform full profile calculation except for TopN
- Count - Only perform row and column counts
- SQL Server
Warning Pushdown and parallel JDBC cannot be used together. If you are using pushdown, do not select the parallel JDBC option.
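The idea behind pushdown can be illustrated with a minimal sketch in which the database, rather than the client, computes the aggregates. This is not Collibra DQ's implementation: the function names `count_pushdown` and `null_percent_pushdown` are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3


def count_pushdown(conn, table):
    """Return (row_count, column_count), computed entirely by the database.

    The table name is interpolated into the SQL text, so it must be a
    trusted identifier. PRAGMA table_info is SQLite-specific.
    """
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    column_count = len(conn.execute(f'PRAGMA table_info("{table}")').fetchall())
    return row_count, column_count


def null_percent_pushdown(conn, table, column):
    """Let the database compute the percentage of NULLs in one column."""
    sql = (
        f'SELECT 100.0 * SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) '
        f'/ COUNT(*) FROM "{table}"'
    )
    return conn.execute(sql).fetchone()[0]


# Hypothetical sample table standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "NY", 10.0), (2, None, 12.5), (3, "CA", 7.0), (4, "TX", 3.2)],
)
print(count_pushdown(conn, "orders"))
print(null_percent_pushdown(conn, "orders", "state"))
```

Only the aggregated results cross the connection, which is the main appeal of pushing the computation to the engine where the data lives.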
By gathering a variety of different statistics, Collibra DQ's profile can provide a great deal of insight about a data set.
To see the difference between baseline (historical) and current values, Collibra DQ provides a Delta % change column. In the Delta % change column, data is represented in a pie chart for quick visualization of the changes.
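As a rough illustration of a delta percentage, the sketch below computes the relative change between a baseline value and a current value. The exact formula Collibra DQ uses is not stated here, so treat this as an assumption. `delta_percent` is a hypothetical helper name.

```python
def delta_percent(baseline, current):
    """Relative change from baseline to current, as a percentage.

    Returns None when the baseline is zero, because the relative change
    is undefined in that case.
    """
    if baseline == 0:
        return None
    return 100.0 * (current - baseline) / abs(baseline)


# A metric that moved from a historical value of 200 to 250.
print(delta_percent(200, 250))  # 25.0
```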
To elaborate on the quality metrics:
The profile discovers the datatypes of each attribute and helps delineate the relative share of numeric versus non-numeric values it finds.
|Metric|Example|Description|
|---|---|---|
|Null||The percentage of data that has no value at all.|
|Empty|[""]|The percentage of data that has a string instance of zero length.|
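The distinction between Null and Empty can be made concrete with a small sketch. It assumes a column is available as a plain Python list in which `None` represents a missing value. The function name `null_and_empty_percent` is hypothetical.

```python
def null_and_empty_percent(values):
    """Return (percent_null, percent_empty) for one column of values.

    A null has no value at all (None); an empty is a string of length zero.
    """
    total = len(values)
    if total == 0:
        return 0.0, 0.0
    null_count = sum(1 for v in values if v is None)
    empty_count = sum(1 for v in values if v == "")
    return 100.0 * null_count / total, 100.0 * empty_count / total


column = ["a", None, "", "b", ""]
print(null_and_empty_percent(column))  # (20.0, 40.0)
```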
Profile includes the following statistics:
- Actual Datatype
- Discovered Datatypes
- Percent Null
- Percent Empty
- Percent Mixed Types
- TopN / BottomN
- Value Quartiles
- Minimum (String) Length
- Maximum (String) Length
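Several of these statistics can be sketched in a few lines for a single column of raw string values. This is only an illustration under stated assumptions: the type-discovery rule (try `int`, then `float`, else string) and the names `discover_type` and `profile_column` are hypothetical, not Collibra DQ's logic.

```python
import statistics
from collections import Counter


def discover_type(value):
    """Classify one raw string as 'int', 'float', or 'string'."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values):
    """Compute a few profile statistics, ignoring nulls and empty strings."""
    present = [v for v in values if v not in (None, "")]
    types = Counter(discover_type(v) for v in present)
    numeric = [float(v) for v in present if discover_type(v) != "string"]
    lengths = [len(v) for v in present]
    return {
        "discovered_types": dict(types),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        # Three cut points that split the numeric values into four groups.
        "quartiles": statistics.quantiles(numeric, n=4) if len(numeric) >= 2 else None,
    }


print(profile_column(["10", "20", "abc", None, "30", "40.5"]))
```

A column whose discovered types include more than one entry, as in this example, is the kind of column a mixed-types percentage would flag.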
From the Profile page in Catalog, you can view a TopN Values chart. The TopN Values chart represents the top 10 distinct values that appear most frequently.
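A TopN calculation of this kind can be sketched with the standard library's `Counter`. The helper name `top_n` is hypothetical, and skipping nulls is an assumption made for this example.

```python
from collections import Counter


def top_n(values, n=10):
    """Return the n most frequent distinct values with their counts."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(n)


states = ["NY", "CA", "NY", "TX", "NY", "CA"]
print(top_n(states, n=2))  # [('NY', 3), ('CA', 2)]
```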
Sensitive Data Detection and Data Class
Collibra DQ can automatically identify columns that contain common types of PII.
Note Collibra DQ is able to detect the following types of PII:
- ZIP CODE
- STATE CD
- CREDIT CARD
- IP ADDRESS
Once a sensitive data type is detected, you can tag the column from the Profile tab with the discovered sensitive data type and data class, and Collibra DQ automatically applies a rule. To remove a tag, click the tag and edit it from the Add Data Class/Sensitivity Label modal.
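To give a feel for how pattern-based detection of such data classes might work, the sketch below checks column values against three of the listed types. This is not Collibra DQ's detection logic. The detectors, the 80% match threshold, and the name `detect_data_class` are assumptions chosen for illustration.

```python
import ipaddress
import re

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")  # US ZIP or ZIP+4


def is_zip(value):
    return ZIP_RE.fullmatch(value) is not None


def is_ipv4(value):
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def is_credit_card(value):
    """Check length and the Luhn checksum, ignoring spaces and dashes."""
    digits = value.replace(" ", "").replace("-", "")
    if not (digits.isascii() and digits.isdigit()) or not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


DETECTORS = {"ZIP CODE": is_zip, "IP ADDRESS": is_ipv4, "CREDIT CARD": is_credit_card}


def detect_data_class(values, threshold=0.8):
    """Return the first label whose detector matches enough non-empty values."""
    present = [v for v in values if v]
    if not present:
        return None
    for label, check in DETECTORS.items():
        if sum(check(v) for v in present) / len(present) >= threshold:
            return label
    return None


print(detect_data_class(["10001", "94105-1234", "60601"]))
print(detect_data_class(["192.168.0.1", "10.0.0.7"]))
```

A threshold below 100% tolerates a few dirty values in an otherwise consistent column, at the cost of occasional false positives.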
The first step in many data science projects is to segment the data. Collibra DQ automatically does this with histograms on the Histogram tab.
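Segmenting a numeric column into a histogram can be sketched as follows. Equal-width bins are an assumption made for this example, since the binning strategy Collibra DQ uses is not described here. The function name `histogram` is hypothetical.

```python
def histogram(values, bins=5):
    """Split numeric values into equal-width bins.

    Returns a list of (lower_bound, upper_bound, count) tuples.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs in the last bin, not in a bin of its own.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


for lower, upper, count in histogram([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], bins=3):
    print(f"{lower:.1f} to {upper:.1f}: {count}")
```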
The correlation matrix on the Correlation tab lets you discover hidden relationships and measure the strength of those relationships.
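One common way to measure the strength of a relationship between two numeric columns is the Pearson correlation coefficient. The sketch below builds a small correlation matrix with it. Using Pearson is an assumption, since the measure behind the Correlation tab is not specified here. The names `pearson` and `correlation_matrix` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns None when either sequence is constant (correlation undefined).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each pair of column names to their Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


data = {
    "quantity": [1, 2, 3, 4],
    "total": [2, 4, 6, 8],      # rises with quantity
    "discount": [9, 7, 4, 1],   # falls as quantity rises
}
matrix = correlation_matrix(data)
print(round(matrix["quantity"]["total"], 3))
print(round(matrix["quantity"]["discount"], 3))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.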
After the data is profiled, users with the appropriate rights can use the Data Preview tab to get a glimpse of the dataset and basic insights, such as highlights of Shape issues, Outliers (when enabled), and the column Filtergram visualization.