About the data profile

You can evaluate the data profile of your Data Quality Job via the Profile tab in the Job Details page. Data profiling provides a detailed analysis of your data quality and historical trends. It forms the foundation of a robust data quality and observability strategy.

When you run a Data Quality Job, the scan results include rich insights such as column-level statistics, charts, and other data quality and observability metrics. These insights help you identify common patterns and emerging trends, offering a better understanding of your data's structure and contents.

The following screenshot and table highlight the various elements of the Profile tab.

screenshot of data quality job profile tab

Element number Element Description
Run history

Run history is a chart or list that shows high-level data profiling details from each run of a Data Quality Job, including:

  • The completion or failing status of a given run.
  • The row count evolution.
  • The date of a run.
Example You can view these details in a chart or list view. You can also click the passing or failing status icon to open the data profile of any run on its associated date.

screenshot of run history chart

Run metadata

Run metadata provides important context about a given run. The metadata includes:

  • The date and time of a Data Quality Job run.
  • The completion or failure status of a run.
  • The number of rows in the Data Quality Job.
Data profile statistics

The data profile statistics table captures what was observed in each column following a Data Quality Job run.

screenshot of column data profile stats

The statistics of the following columns are shown in the table:

  • Column Name
  • Defined Type
  • Inferred Type
  • Completeness
  • Values
  • Values +/- %
  • Nulls
  • Nulls +/- %
  • Empties
  • Empties +/- %
Descriptive column profile statistics

In the data profile statistics section, you can click Beak right icon in any row to expand the descriptive profiling statistics of a column.

Example In the screenshot, the most frequently observed string data type is "Double", with 100,401 occurrences. The least frequently observed string data type is "int". Hover your pointer over a bar on the chart to see the data type and its total occurrences.

screenshot of column drill down statistics

The following column-level statistics are available:

  • Min/max
  • Median
  • 1st quartile
  • 3rd quartile

Top Shapes shows the string data type cell values that appear the most and least frequently in the column. This can help identify inconsistent data types or unexpected string values, recognize frequently recurring patterns, and understand the distribution of data in the column.

Data profiling statistics

The following describes the data profiling statistics captured in the table on the Profile tab.

Column Description Calculation
Column Name The name of the column whose data profile statistics are represented in the same row in the data profile table. N/A
Defined Type

The data type of a column when the table was created.

Data types: Categorical and numeric

The data type is extracted directly from the database. Possible values include:

  • Bit
  • BIGINT
  • Boolean
  • Byte
  • CHAR
  • Date
  • Decimal
  • Double
  • Float
  • Integer
  • Long
  • LONGVARCHAR
  • LONGNVARCHAR
  • NCHAR
  • Numeric
  • NVARCHAR
  • Real
  • Short
  • Smallint
  • String
  • Time
  • Timestamp
  • Tinyint
  • VARCHAR
Inferred Type

The data type of a column that Collibra observes as possible deviations from the defined type. If a column does not contain a potential deviation, the data type matches the defined type.

Data types: Categorical and numeric

The actual data type is mapped to an inferred data type. The table below shows the mapping of the various data types.

Actual data type Inferred data type
Bit Boolean
BIGINT Long
Boolean Boolean
Byte Int
CHAR String
Date Date
Decimal Decimal
Double Double
Float Double
Integer Int
Long Long
LONGVARCHAR String
LONGNVARCHAR String
NCHAR String
Numeric Decimal
NVARCHAR String
Real Double
Short Int
Smallint Int
String String
Time Timestamp
Timestamp Timestamp
Tinyint Int
VARCHAR String
Completeness

The percentages of unique and missing values in a column. A chart shows the distribution of unique, null, and empty value percentages in a column. Hover your pointer over the chart to show the exact percentages.

Data types: Categorical and numeric

Null:
SELECT ROUND(COUNTIF(TRIM(<column_name>) = ‘ ‘)) / COUNT (*)*100,3) AS null_percentage FROM <dataset>

Empty:
SELECT ROUND(COUNTIF(TRIM(<column_name>) = ‘ ‘)) / COUNT (*) * 100, 2) AS empty_percentage FROM <dataset>

Mixed:
SELECT ROUND (COUNTIF(REGEXP_CONTAINS(<column_name, r’|A-Za-a|’) OR <column_name> IS NULL / COUNT(*) * 100, 5 AS mixed_percentages FROM <dataset>

Values

The number of unique values in a column.

The unique distribution, or cardinality, of a column. This can help you understand whether the values in a column continue to be unique, or if it contains the same values. For example, the expectation of an "ID" column might be that all values should be unique, whereas the expectation of a "date" column might be that all values should be the same.

Data types: Categorical and numeric

SELECT COUNT(DISTINCT <column name>) FROM <dataset>
Values +/- %

The percentage increase (+) or decrease (-) of rows with unique values since the previous run.

N/A
Nulls

The number of rows without any values.

Data types: Categorical and numeric

SELECT ROUND(COUNTIF(<column_name> IS NULL)) FROM <dataset>
Nulls +/- %

The percentage increase (+) or decrease (-) of rows in a column without any values since the previous run.

N/A
Empties

The number of rows for which the string lengths are zero.

Data types: Categorical and numeric

SELECT ROUND(COUNTIF(TRIM(<column_name>) = ‘ ‘)) FROM <dataset>
Empties +/- %

The percentage increase (+) or decrease (-) of rows in a column for which the string lengths are zero since the previous run.

N/A

Column-level statistics drill-down

The following table describes the statistics that are shown when you drill down into a column on the data profile table.

Statistic Description Calculation
Min/max

The smallest and largest values in a column.

Data type: Categorical and numeric

SELECT min(<column_name>) AS min, max(<column_name>) AS max FROM <dataset>
Median

The value that represents the 50th percentile of the values in a column, below which, 50% of the data falls.

Data type: Numeric

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY <column_name> AS first_quartile FROM <dataset>
1st Quartile

The value that represents the 25th percentile of the values in a column, below which 25% of data falls.

Data type: Numeric

SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY <column_name> AS first_quartile FROM <dataset>
3rd Quartile

The value that represents the 75th percentile of the values in a column, below which 75% of data falls.

Data type: Numeric

SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY <column_name> AS first_quartile FROM <dataset>

What's next?

About data quality monitors