Configure the profiling options via Edge

Important 

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Data Intelligence Platform! You can learn more about this latest UI in the UI overview.

Through the profiling options, you can define:

  • Whether you want to start the profiling process automatically after each synchronization.
  • The default profiling behavior for the schemas, such as whether the profiling is based on all data or on a random subset of the data.
  • Whether specific schemas do not use the default behavior but instead have their own behavior.
  • Which schemas you want to profile.
  • Which tables and table types you want to profile.

Note The Unified Data Classification process does not automatically run at the same time as profiling. You need to activate the classification process separately.

Before you begin

Required permissions

Steps

  1. Open a Database asset page.
  2. In the tab bar, click Configuration.
  3. Click the Profiling tab.
    The Profiling options open.

    Tip Only the synchronized schemas are available in the list.

  4. In the Default Rule section, click Edit.
  5. Enter the required information.
    Option | Description
    Automatically Profile after Metadata synchronization

    Enable to automatically profile columns every time the synchronization process of one or more schemas finishes.

    Profiling may take a long time. You can also add a schedule to profile at regular intervals.

    Select Rows to Profile
    Do Not Profile (unless specified in the schema-specific rule)

    Select if you don't want to define a default profiling behavior for the schemas.

    Important Use this option if you only want to profile some of the schemas.
    If you select this option, Collibra only profiles the schemas for which a specific profiling rule has been defined.

    All Rows

    Select to profile the schemas based on all data by default. This is also called full scan.
    Random Rows

    Select to profile the schemas based on a subset of the data by default. This is also called partial scan.
    If you select this option, the Maximum Number of Rows field becomes available. Enter the maximum number of rows that you want to use for profiling. By default, the maximum number of rows is 20,000.

    Note 
    • The value must be between 100 and 1,000,000. Your data source creates the set of data to profile from that number of rows.
    • If you enter a value that is greater than the number of rows in the data source, the entire data source is used to profile the data.

    Warning Only some data sources support the use of random rows (partial scan). To verify if your data source allows it, go to Collibra-provided JDBC drivers.

    Exclude Table Types

    A comma-separated list of table types that you don't want to profile.

    Example VIEW, MATERIALIZED VIEW

    Note 

    The list is not case-sensitive. The values are automatically converted to uppercase after you confirm the list.

    For data sources that support the use of random rows, the Random Rows option is selected by default. For data sources that don't support it, the Do Not Profile (unless specified in the schema-specific rule) option is selected by default.

  6. Click Save.
  7. If you want to define a specific profiling rule for a schema:
    1. In the Available Schemas section, select the schema.
      The schema-specific information opens.
    2. Do one of the following:
      • To create a new rule, click Add Rule.
      • To edit an existing rule, click Edit.
    3. Enter the required information.
      Option | Description
      Include Tables

      A comma-separated list with names of the tables that you want to profile.

      • The default value is *, which means all registered tables are taken into account.
      • You can use * as a wildcard. For example, CUSTOMER*.
      • If the name of a table contains a special character, such as . + * \ ? ^ $ ( ) [ ] { } |, add a \ before the special character so that it is evaluated correctly. For example, *CUSTOMER\+*.
      • The Include Tables field is processed before the Exclude Tables field.
      Example 
      • Out of all registered tables in a schema, you want to profile only the table with name "CUSTOMERS" and the tables with a name that starts with "ORDER".
        To do this:
        In the Include Tables field, enter: CUSTOMERS,ORDER*.
      • Out of all registered tables in a schema, you want to profile only the tables with a name that contains "SKU".
        To do this:
        In the Include Tables field, enter: *SKU*.
      Exclude Tables

      A comma-separated list of the names of the tables you don't want to profile.

      • By default, this field is empty.
      • You can use * as a wildcard.
      • If the name of a table contains a special character, such as . + * \ ? ^ $ ( ) [ ] { } |, add a \ before the special character so that it is evaluated correctly. For example, *SKU\+*.
      • The Include Tables field is processed before the Exclude Tables field.

      You can use the Exclude Tables field to do the following:

      • Profile all registered tables except the ones defined in the Exclude Tables field.
      • Profile all tables as defined in the Include Tables field, with the exception of tables that are listed in the Exclude Tables field.
      Example 
      • Out of all registered tables in a schema, you don't want to profile a table with the name "LAST_NAME".
        To do this:
        In the Include Tables field, enter: * and in the Exclude Tables field, enter: LAST_NAME.
      • Out of all registered tables in a schema, you want to profile the tables with a name that starts with "SKU", but exclude the tables with a name that contains "bkp".
        To do this:
        In the Include Tables field, enter: SKU* and in the Exclude Tables field, enter: *bkp*.
      Do Not Profile

      Select to indicate you don't want to profile this schema.
      This option is useful if you want to exclude a schema from the profiling process.

      All Rows

      Select to profile the schema based on all data. This is also called full scan.
      Random Rows

      Select to profile the schema based on a subset of the data. This is also called partial scan.
      If you select this option, the Maximum Number of Rows field appears. Enter the maximum number of rows you want to use for profiling. By default, the maximum number of rows is 20,000.

      Note 
      • The value must be between 100 and 1,000,000. Your data source creates the set of data to profile from that number of rows.
      • If you enter a value that is greater than the number of rows in the data source, the entire data source is used to profile the data.

      Warning Only some data sources support the use of random rows. To verify if your data source allows it, go to Collibra-provided JDBC drivers.

      Exclude Table Types

      A comma-separated list of table types that you don't want to profile.

      Example VIEW, MATERIALIZED VIEW

      Note 

      The list is not case-sensitive. The values are automatically converted to uppercase after you confirm the list.

      For data sources that support the use of random rows, the Random Rows option is selected by default. For data sources that don't support it, the Do Not Profile option is selected by default.

    4. Click Save.
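
Example To check which tables a combination of Include Tables and Exclude Tables values would select before you save a rule, you can reason through it with a small script. The following Python sketch is illustrative only and is not Collibra code. It assumes that an unescaped * matches any run of characters, that a \ makes the next character literal, that matching is case-sensitive, and that a pattern must match the whole table name. The function names (pattern_to_regex, split_patterns, select_tables) and the sample table names are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate one Include/Exclude entry into a regular expression.

    Assumed semantics (not confirmed by Collibra): an unescaped * matches
    any run of characters, and a backslash followed by a character matches
    that character literally.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # escaped character is literal
            i += 2
        elif ch == "*":
            parts.append(".*")  # wildcard
            i += 1
        else:
            parts.append(re.escape(ch))  # every other character is literal
            i += 1
    return re.compile("".join(parts))


def split_patterns(field: str) -> list[str]:
    """Split a comma-separated field value and drop empty entries."""
    return [p.strip() for p in field.split(",") if p.strip()]


def select_tables(tables, include="*", exclude=""):
    """Apply the Include Tables patterns first, then remove Exclude Tables matches."""
    inc = [pattern_to_regex(p) for p in split_patterns(include)]
    exc = [pattern_to_regex(p) for p in split_patterns(exclude)]
    included = [t for t in tables if any(r.fullmatch(t) for r in inc)]
    return [t for t in included if not any(r.fullmatch(t) for r in exc)]


# Hypothetical registered tables in one schema.
tables = ["CUSTOMERS", "ORDER_2023", "ORDER_LINES", "SKU_MASTER", "SKU_bkp", "LAST_NAME"]

print(select_tables(tables, include="CUSTOMERS,ORDER*"))
# ['CUSTOMERS', 'ORDER_2023', 'ORDER_LINES']
print(select_tables(tables, include="SKU*", exclude="*bkp*"))
# ['SKU_MASTER']
print(select_tables(["CUSTOMER+1", "CUSTOMER1"], include=r"*CUSTOMER\+*"))
# ['CUSTOMER+1']
```

Because the include patterns are applied before the exclude patterns, a table that matches both lists is not selected.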

What's next?

You can now profile the data manually or automatically, or add a schedule to profile at regular intervals.
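
Example The rules described above for the Maximum Number of Rows field, the Exclude Table Types field, and the relationship between the default rule and a schema-specific rule can be summarized in a short Python sketch. This is an illustration of the documented behavior under stated assumptions, not Collibra code: the function names and the mode strings (DO_NOT_PROFILE, ALL_ROWS, RANDOM_ROWS) are hypothetical, and the sketch assumes that an out-of-range row limit is rejected.

```python
from typing import Optional

MIN_ROWS = 100          # documented lower bound
MAX_ROWS = 1_000_000    # documented upper bound
DEFAULT_ROWS = 20_000   # documented default


def rows_to_sample(source_row_count: int, max_rows: int = DEFAULT_ROWS) -> int:
    """Return how many rows a partial scan would be drawn from."""
    if not MIN_ROWS <= max_rows <= MAX_ROWS:
        raise ValueError(
            f"Maximum Number of Rows must be between {MIN_ROWS} and {MAX_ROWS}"
        )
    # A limit above the size of the data source means all rows are used.
    return min(max_rows, source_row_count)


def normalize_table_types(field: str) -> list[str]:
    """Split the comma-separated list and convert each entry to uppercase."""
    return [t.strip().upper() for t in field.split(",") if t.strip()]


def effective_mode(default_mode: str, schema_mode: Optional[str] = None) -> str:
    """A schema-specific rule, when defined, takes precedence over the default rule."""
    return schema_mode if schema_mode is not None else default_mode


print(rows_to_sample(5_000_000))                        # 20000
print(rows_to_sample(500, 1_000))                       # 500
print(normalize_table_types("view, Materialized View")) # ['VIEW', 'MATERIALIZED VIEW']
print(effective_mode("DO_NOT_PROFILE", "ALL_ROWS"))     # ALL_ROWS
print(effective_mode("DO_NOT_PROFILE"))                 # DO_NOT_PROFILE
```

With DO_NOT_PROFILE as the default mode, only schemas that have their own rule resolve to a profiling mode, which matches the use case of profiling only some of the schemas.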