Profile (automatic)

Create profiles based on a table, view, or file.

Note You can scan the entire dataset or apply custom filtering to limit the depth (rows) and width (columns) of the scan.

Select the Scope

For detailed instructions about selecting the scope, see the Explorer section. You can limit the scan by number of rows or by time, or run a full table scan if you have enough resources.

Select Options (or leave defaults)

Save / Run

Profile is on by default and is part of onboarding a dataset.

View the Results

Automatically Profile

Collibra DQ automatically profiles data sets over time to enable drill-ins for detailed insights and automated data quality. A profile is the first step toward further automated discovery. You can visualize segments of the data set and see how the data set changes over time.

Collibra DQ offers click or code options to run profiling.

Data Set Profile

Collibra DQ creates a detailed profile of each dataset under management. This profile will later be used to both provide insight and automatically identify data quality issues.

setting up a data set profile

Pushdown Profiling

Collibra DQ can compute the profile of a data set either with Spark (the default) or by using the data warehouse where the data lives as the engine (Profile Pushdown). When the profile is computed by the data source's DBMS, you can choose between two levels of pushdown:

  • Full Profile - Perform full profile calculation except for TopN
  • Count - Only perform row and column counts
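
To make the two levels concrete, the following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the warehouse. The table `orders`, its columns, and both helper functions are hypothetical, and the SQL that Collibra DQ actually generates is not shown in this document. The sketch only illustrates the idea that the aggregation runs inside the database, so just the summary values are returned.

```python
import sqlite3

def count_pushdown(conn, table):
    """Count level: the database returns only row and column counts."""
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    # A zero-row SELECT still exposes the column metadata.
    columns = len(conn.execute(f"SELECT * FROM {table} LIMIT 0").description)
    return {"rows": rows, "columns": columns}

def full_profile_pushdown(conn, table, column):
    """Full Profile level (sketch): per-column aggregates, TopN excluded."""
    # Identifiers are interpolated directly; this sketch assumes trusted names.
    sql = (
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}), "
        f"MIN({column}), MAX({column}), AVG({column}) FROM {table}"
    )
    total, non_null, distinct, lo, hi, mean = conn.execute(sql).fetchone()
    return {
        "percent_null": 100.0 * (total - non_null) / total if total else 0.0,
        "cardinality": distinct,
        "min": lo,
        "max": hi,
        "mean": mean,
    }

# Hypothetical data standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 30.0), (3, None), (4, 30.0)],
)
print(count_pushdown(conn, "orders"))
print(full_profile_pushdown(conn, "orders", "amount"))
```

In both cases only a single summary row leaves the database, which is the main reason to prefer pushdown when the data set is large.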

Note Profile Pushdown is supported on the following database systems:
  • Impala
  • Hive
  • Snowflake
  • Presto
  • Teradata
  • SQL Server
  • PostgreSQL
  • Redshift
  • MySQL
  • Oracle
  • DB2

Warning Pushdown and parallel JDBC cannot be used together. If you are using pushdown, do not select the parallel JDBC option.

Profile

Profile Insights

a view of the profile insights page

By gathering a variety of different statistics, Collibra DQ's profile can provide a great deal of insight about a data set.

To see the difference between baseline (historical) and current values, Collibra DQ provides a Delta % change column. In the Delta % change column, data is represented in a pie chart for quick visualization of the changes.
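
This page does not state how the Delta % change value is calculated. A common definition, assumed here, is the relative change of the current value against the baseline. The function name `delta_pct` is hypothetical.

```python
def delta_pct(baseline, current):
    """Relative change from baseline to current, in percent (assumed formula)."""
    if baseline == 0:
        # Relative change is undefined when the baseline is zero.
        return None
    return (current - baseline) * 100.0 / abs(baseline)

# Example: mean moved from 200 (historical) to 250 (current run).
print(delta_pct(200, 250))
```

A positive result means the current value is above the baseline, and a negative result means it is below.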

To elaborate on the quality metrics:

The profile discovers a datatype for each column and then reports how much of the column's data matches that type, distinguishing numeric from non-numeric values.

Metric (Type): Description
  • Filled ([1] Integer): The percentage of data that is numeric (or non-numeric) in a numeric (or non-numeric) discovered column.
  • Mixed ([String] Integer): The percentage of data that is non-numeric (or numeric) in a numeric (or non-numeric) discovered column.
  • Null ([]): The percentage of data that has no value at all.
  • Empty ([""]): The percentage of data that has a string instance of zero length.
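
The four metrics above can be sketched for a single column as follows. This is an illustration under stated assumptions, not Collibra DQ's implementation: the discovered type is taken to be whichever of numeric or non-numeric is more common among the present values, and the helper names `is_numeric` and `quality_metrics` are hypothetical.

```python
def is_numeric(value):
    """True if the value can be parsed as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def quality_metrics(values):
    """Return Filled, Mixed, Null, and Empty as percentages of all rows."""
    total = len(values)
    if total == 0:
        return {}
    nulls = sum(1 for v in values if v is None)
    empties = sum(1 for v in values if v == "")
    present = [v for v in values if v is not None and v != ""]
    numeric = sum(1 for v in present if is_numeric(v))
    non_numeric = len(present) - numeric
    # Assumption: the majority kind decides the discovered type.
    discovered = "numeric" if numeric >= non_numeric else "non-numeric"
    if discovered == "numeric":
        filled, mixed = numeric, non_numeric
    else:
        filled, mixed = non_numeric, numeric
    return {
        "discovered": discovered,
        "filled": 100.0 * filled / total,
        "mixed": 100.0 * mixed / total,
        "null": 100.0 * nulls / total,
        "empty": 100.0 * empties / total,
    }

print(quality_metrics(["1", "2", "3", "x", None, "", "4", "5"]))
```

With this definition the four percentages always sum to 100, because every row is counted in exactly one bucket.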

    Profile includes the following statistics:

    • Actual Datatype
    • Discovered Datatypes
    • Percent Null
    • Percent Empty
    • Percent Mixed Types
    • Cardinality
    • Minimum
    • Maximum
    • Mean
    • TopN / BottomN
    • Value Quartiles
    • Minimum (String) Length
    • Maximum (String) Length
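
As a rough illustration of several of the statistics listed above, the sketch below computes them for one column with Python's standard library. The function `profile_column` and its output keys are hypothetical, and the quartile method (the `statistics.quantiles` default) is an assumption, since this page does not say which method Collibra DQ uses.

```python
import statistics

def profile_column(values):
    """Compute a few basic profile statistics for one column (sketch)."""
    present = [v for v in values if v is not None]
    numbers = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    strings = [v for v in present if isinstance(v, str)]
    profile = {
        "percent_null": 100.0 * (len(values) - len(present)) / len(values)
        if values else 0.0,
        "cardinality": len(set(present)),
    }
    if numbers:
        profile["min"] = min(numbers)
        profile["max"] = max(numbers)
        profile["mean"] = statistics.mean(numbers)
    if len(numbers) >= 2:
        # Three cut points that split the data into four groups.
        profile["quartiles"] = statistics.quantiles(numbers, n=4)
    if strings:
        profile["min_length"] = min(len(s) for s in strings)
        profile["max_length"] = max(len(s) for s in strings)
    return profile

print(profile_column([1, 2, 3, 4, 5, 6, 7, None]))
print(profile_column(["a", "abc", ""]))
```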

    TopN Values

    From the Profile page in Catalog, you can view a TopN Values chart. The TopN Values chart represents the top 10 distinct values that appear most frequently.

    TopN values
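
The idea behind a TopN chart can be sketched with `collections.Counter`. The helper names `top_n` and `bottom_n` are hypothetical, and skipping null values is an assumption made for the example.

```python
from collections import Counter

def top_n(values, n=10):
    """Return the n most frequent distinct values with their counts."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(n)

def bottom_n(values, n=10):
    """Return the n least frequent distinct values with their counts."""
    counts = Counter(v for v in values if v is not None)
    return sorted(counts.items(), key=lambda item: item[1])[:n]

states = ["NY", "CA", "NY", "TX", "NY", "CA", None]
print(top_n(states, 2))
print(bottom_n(states, 1))
```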

    Sensitive Data Detection and Data Class

    Collibra DQ can automatically identify columns that contain common types of PII.

    Note Collibra DQ can detect the following types of PII:

    • EMAIL
    • PHONE
    • ZIP CODE
    • STATE CD
    • CREDIT CARD
    • GENDER
    • SSN
    • IP ADDRESS
    • EIN

    Sensitive data
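
One simple way to detect such data classes is to test column values against a pattern per class and label the column when most values match. The sketch below takes that approach for four of the listed types. The regular expressions, the 80% threshold, and the function `detect_data_class` are all assumptions made for illustration; they are deliberately loose and are not the patterns Collibra DQ uses.

```python
import re

# Illustrative patterns only; real detectors are usually stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "IP ADDRESS": re.compile(r"(?:\d{1,3}\.){3}\d{1,3}"),
    "ZIP CODE": re.compile(r"\d{5}(?:-\d{4})?"),
}

def detect_data_class(values, threshold=0.8):
    """Return the first label whose pattern matches enough values, else None."""
    present = [str(v) for v in values if v not in (None, "")]
    if not present:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in present if pattern.fullmatch(v))
        if hits / len(present) >= threshold:
            return label
    return None

print(detect_data_class(["ann@example.com", "bob@example.org", None]))
print(detect_data_class(["123-45-6789", "987-65-4321"]))
```

Using a threshold rather than requiring every value to match lets a column still be classified when it contains a few malformed entries.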

    Once sensitive data is detected, you can tag the column from the Profile tab with the discovered sensitive data type and data class, and Collibra DQ automatically applies a rule. To remove a tag, click the tag and edit it in the Add Data Class/Sensitivity Label modal.

    Histograms

    The first step in many data science projects is to segment the data. Collibra DQ automatically does this with histograms on the Histogram tab.

    Histogram
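
Segmenting a numeric column into a histogram can be sketched as equal-width binning. The function `histogram` and the choice of equal-width bins are assumptions for illustration; this page does not describe how Collibra DQ chooses its bins.

```python
def histogram(values, bins=5):
    """Split numeric values into equal-width bins; return (low, high, count)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so a single bin holds everything.
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls on the upper edge; keep it in the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return [
        (lo + i * width, lo + (i + 1) * width, count)
        for i, count in enumerate(counts)
    ]

for low, high, count in histogram([0, 1, 2, 3, 4, 5, 6, 7, 8, 10], bins=5):
    print(f"{low:>5.1f} to {high:>5.1f}: {'#' * count}")
```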

    Correlation

    The correlation matrix on the Correlation tab lets you discover hidden relationships and measure the strength of those relationships.

    Correlation matrix
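
A correlation matrix of this kind can be sketched with the Pearson coefficient, which ranges from -1 (perfect inverse relationship) to 1 (perfect direct relationship). Pearson is an assumption here, since this page does not name the coefficient Collibra DQ uses, and the function and column names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation.
        return None
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlation for a dict of column name to values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

data = {
    "price": [1.0, 2.0, 3.0, 4.0],
    "tax": [0.1, 0.2, 0.3, 0.4],
    "discount": [4.0, 3.0, 2.0, 1.0],
}
for name, row in correlation_matrix(data).items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```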

    Data Preview

    After the data is profiled, the Data Preview tab gives users with appropriate rights a glimpse of the dataset and basic insights, such as highlights of Shape issues, Outliers (when enabled), and column Filtergram visualization.

    Data Preview