About the data profile

Updated: June 4, 2025

You can evaluate the data profile of your Data Quality Job via the Profile tab in the Job Details page. Data profiling provides a detailed analysis of your data quality and historical trends. It forms the foundation of a robust data quality and observability strategy.

When you run a Data Quality Job, the scan results include rich insights such as column-level statistics, charts, and other data quality and observability metrics. These insights help you identify common patterns and emerging trends, offering a better understanding of your data's structure and contents.

The following screenshot and table highlight the various elements of the Profile tab.

Element number	Element	Description
	Run history	Run history is a chart or list that shows high-level data profiling details from each run of a Data Quality Job, including: The completion or failing status of a given run. The row count evolution. The date of a run. Example You can view these details in a chart or list view. You can also click the passing or failing status icon to open the data profile of any run on its associated date.
	Run metadata	Run metadata provides important context about a given run. The metadata includes: The date and time of a Data Quality Job run. The completion or failure status of a run. The number of rows in the Data Quality Job.
	Data profile statistics	The data profile statistics table captures what was observed in each column following a Data Quality Job run. The statistics of the following columns are shown in the table: Column Name Defined Type Inferred Type Completeness Values Values +/- % Nulls Nulls +/- % Empties Empties +/- %
	Descriptive column profile statistics	In the data profile statistics section, you can click in any row to expand the descriptive profiling statistics of a column. Example In the screenshot, the most frequently observed string data type is "Double", with 100,401 occurrences. The least frequently observed string data type is "int". Hover your pointer over a bar on the chart to see the data type and its total occurrences. The following column-level statistics are available: Min/max Median 1st quartile 3rd quartile Top Shapes shows the string data type cell values that appear the most and least frequently in the column. This can help identify inconsistent data types or unexpected string values, recognize frequently recurring patterns, and understand the distribution of data in the column.

Data profiling statistics

The following describes the data profiling statistics captured in the table on the Profile tab.

Column Description Calculation

Column Name The name of the column whose data profile statistics are represented in the same row in the data profile table. N/A

Defined Type

The data type of a column when the table was created.

Data types: Categorical and numeric

The data type is extracted directly from the database. Possible values include:

Bit
BIGINT
Boolean
Byte
CHAR
Date
Decimal
Double
Float
Integer
Long
LONGVARCHAR
LONGNVARCHAR
NCHAR
Numeric
NVARCHAR
Real
Short
Smallint
String
Time
Timestamp
Tinyint
VARCHAR

Inferred Type

The data type of a column that Collibra observes as possible deviations from the defined type. If a column does not contain a potential deviation, the data type matches the defined type.

Data types: Categorical and numeric

The actual data type is mapped to an inferred data type. The table below shows the mapping of the various data types.

Actual data type	Inferred data type
Bit	Boolean
BIGINT	Long
Boolean	Boolean
Byte	Int
CHAR	String
Date	Date
Decimal	Decimal
Double	Double
Float	Double
Integer	Int
Long	Long
LONGVARCHAR	String
LONGNVARCHAR	String
NCHAR	String
Numeric	Decimal
NVARCHAR	String
Real	Double
Short	Int
Smallint	Int
String	String
Time	Timestamp
Timestamp	Timestamp
Tinyint	Int
VARCHAR	String

Completeness

The percentages of unique and missing values in a column. A chart shows the distribution of unique, null, and empty value percentages in a column. Hover your pointer over the chart to show the exact percentages.

Data types: Categorical and numeric

Null:
SELECT ROUND(COUNTIF(TRIM(<column_name>) = ‘ ‘)) / COUNT (*)*100,3) AS null_percentage FROM <dataset>

Empty:
SELECT ROUND(COUNTIF(TRIM(<column_name>) = ‘ ‘)) / COUNT (*) * 100, 2) AS empty_percentage FROM <dataset>

Mixed:
SELECT ROUND (COUNTIF(REGEXP_CONTAINS(<column_name, r’|A-Za-a|’) OR <column_name> IS NULL / COUNT(*) * 100, 5 AS mixed_percentages FROM <dataset>

Values

The number of unique values in a column.

The unique distribution, or cardinality, of a column. This can help you understand whether the values in a column continue to be unique, or if it contains the same values. For example, the expectation of an "ID" column might be that all values should be unique, whereas the expectation of a "date" column might be that all values should be the same.

Data types: Categorical and numeric

SELECT COUNT(DISTINCT <column name>) FROM <dataset>

Values +/- %

The percentage increase (+) or decrease (-) of rows with unique values since the previous run.

N/A

Nulls

The number of rows without any values.

Data types: Categorical and numeric

SELECT ROUND(COUNTIF(<column_name> IS NULL)) FROM <dataset>

Nulls +/- %

The percentage increase (+) or decrease (-) of rows in a column without any values since the previous run.

N/A

Empties

The number of rows for which the string lengths are zero.

Data types: Categorical and numeric

SELECT ROUND(COUNTIF(TRIM(<column_name>) = ‘ ‘)) FROM <dataset>

Empties +/- %

The percentage increase (+) or decrease (-) of rows in a column for which the string lengths are zero since the previous run.

N/A

Column-level statistics drill-down

The following table describes the statistics that are shown when you drill down into a column on the data profile table.

Statistic	Description	Calculation
Min/max	The smallest and largest values in a column. Data type: Categorical and numeric	`SELECT min(<column_name>) AS min, max(<column_name>) AS max FROM <dataset>`
Median	The value that represents the 50th percentile of the values in a column, below which, 50% of the data falls. Data type: Numeric	`SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY <column_name> AS first_quartile FROM <dataset>`
1st Quartile	The value that represents the 25th percentile of the values in a column, below which 25% of data falls. Data type: Numeric	`SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY <column_name> AS first_quartile FROM <dataset>`
3rd Quartile	The value that represents the 75th percentile of the values in a column, below which 75% of data falls. Data type: Numeric	`SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY <column_name> AS first_quartile FROM <dataset>`

What's next?

About data quality monitors