Important Mapping is only available in Pullup mode. If you are using Pushdown mode, proceed to the next available step.

Mapping

Mapping allows you to discover inconsistencies between the columns of one database, file, or remote file and those of another. Mapping checks for inconsistencies such as row count, schema, and value differences in data copied from data sources to target tables. By mapping data from a source system to a different target system, you can detect when the two systems drift out of sync with one another.
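The three kinds of checks described above can be illustrated with a minimal Python sketch. This is not how Collibra DQ implements Mapping (the product runs on Spark); the function and dataset names here are hypothetical, purely to show what row-count, schema, and value comparisons look like.

```python
# Illustrative sketch only: the kinds of checks a source-to-target mapping
# performs. The names and data below are hypothetical examples.

def compare_datasets(source_rows, target_rows):
    """Report row-count, schema, and cell-value differences between two
    datasets, each given as a list of dicts keyed by column name."""
    report = {}

    # Row count check
    report["row_count_diff"] = len(source_rows) - len(target_rows)

    # Schema check: compare column names on both sides
    src_cols = set(source_rows[0]) if source_rows else set()
    tgt_cols = set(target_rows[0]) if target_rows else set()
    report["missing_in_target"] = sorted(src_cols - tgt_cols)
    report["missing_in_source"] = sorted(tgt_cols - src_cols)

    # Value check on the shared columns, row by row
    shared = src_cols & tgt_cols
    mismatches = 0
    for s, t in zip(source_rows, target_rows):
        mismatches += sum(1 for c in shared if s[c] != t[c])
    report["cell_mismatches"] = mismatches
    return report

source = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
target = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace H."}]
print(compare_datasets(source, target))
```

Here the one edited name in the target produces a single cell mismatch, while the row counts and schemas agree.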

Note Mapping is disabled by default.

Steps

  1. On the stepper, click Mapping.
    The Mapping step opens.
  2. Select one of the following options and follow the steps accordingly.

    Database Mapping screenshot

    1. Select the Database option.
    2. Under the Select a Data Source section, click the Select a resource dropdown menu.
      A list of available database connections appears.
    3. Select the source database from the list of connections to which your target columns will map for source validation.
    4. Tip You can click the Dataset Overview icon to open the Data Preview, where you can preview the data you selected.

    5. Click Start Mapping.
      The Column Selector modal appears. The Column Selector shows the available columns in your target dataset and includes details about the data type present in each column.
    6. Note The Column Selector step allows advanced SQL users to optionally click the three dots icon in the upper right corner to switch to SQL View for manual SQL compilation.

    7. Optionally click Enable Timeslice and select a Timeslice Column from the dropdown menu.
      This adds a ${rd} variable to your query, whose datetime value is substituted at runtime.
    8. Select the columns to include in the mapping using the checkbox options next to each column name.
    9. Note All columns are selected by default.

    10. Click To Mapper in the lower right corner.
      The Source to Target Mapper appears. The Source to Target Mapper allows you to map source columns on the left side of the screen to their targets on the right.
    11. Note Collibra DQ automaps columns it detects as matches between the source and target databases.

    12. In the Target Column, click the dropdown menu within a given row and choose the columns you wish to map to the source column within the corresponding row. If the automapping that Collibra DQ suggested is correct, you can proceed to the next step without making modifications.
    13. Click Validate Overlay.
      The Overlay Preview appears. The Overlay Preview shows you the source to target mapping based on your selections in the previous steps. A helpful legend is available in the upper right corner of the screen.
    14. Optionally select the Key checkbox at the top of any of the columns to assign them as key columns.
    15. Click Validate.
      The main Mapping screen displays the number of mapped columns in the stepper.
    1. Select the File option.
    2. Under the Select File Location section, select a file protocol option from the Protocol dropdown menu, then enter the folder in the Folder input field where the file is stored. Alternatively, click the Use Temporary file dropdown menu to use a temp file.
    3. Under the Select a Data Source section, click the Select a resource dropdown menu.
      A list of available file-based connections appears.
    4. Select the source file from the list of file-based connections to which your target columns will map for source validation.
    5. Tip You can click the Dataset Overview icon to open the Data Preview, where you can preview the data you selected.

    6. Click the Delimiter dropdown menu and select a delimiter option reflective of how columns are separated in your file.
    7. Optionally select the Add Header option and add comma-separated column names in the Column Names input field below.
    8. Optionally select Bypass Schema Evaluation if Spark is unavailable.
    9. Note Bypass Schema Evaluation is a way to avoid loading a SparkSession for CSV and TXT files. You should only select this option when Spark is unavailable or you are unable to load Spark.

    1. Select the Remote file option.
    2. Under the Select a Data Source section, click the Select a resource dropdown menu.
      A list of available file-based connections appears.
    3. Select the source remote file from the list of file-based connections to which your target columns will map for source validation.
    4. Tip You can click the Dataset Overview icon to open the Data Preview, where you can preview the data you selected.

    5. Click the Delimiter dropdown menu and select a delimiter option reflective of how columns are separated in your file.
    6. Collibra DQ automatically detects the extension type, but you can manually select one by clicking the Extension dropdown and selecting an option.
    7. Note Depending on the extension type, there may be additional options that appear when you make your selection. See the table below for reference.

      Option (applicable file types): Description

      Skip Lines (.csv, .txt): Add a numerical value to instruct Collibra DQ on the number of lines (rows) to skip. These are the lines in your file after the non-data information ends and before the data to include in your job begins.

      Add Header (.csv, .txt): Denotes that the file contains a header row.

      Column Names (.csv, .txt): A comma-separated list of the header column names. For example, Date,Name,Title.

      Bypass Schema Evaluation (.csv, .txt): A method to avoid loading a SparkSession for CSV and TXT files. You should only select this option when Spark is unavailable or you are unable to load Spark.

      Flatten (.json, .xml): Refers to the level of nesting in your file.

      Multiline (.json, .xml): Refers to file content that spans more than one line.
    8. Click Load Schema.

  3. Click Start Mapping.

    The Column Selector modal appears. The Column Selector shows the available columns in your target dataset and includes details about the data type present in each column.
  4. Note The Column Selector step allows advanced SQL users to optionally click the three dots icon in the upper right corner to switch to SQL View for manual SQL compilation.

  5. Optionally click Enable Timeslice and select a Timeslice Column from the dropdown menu.
    This adds a ${rd} variable to your query, whose datetime value is substituted at runtime.
  6. Select the columns to include in the mapping using the checkbox options next to each column name.
  7. Note All columns are selected by default.

  8. Click To Mapper in the lower right corner.
    The Source to Target Mapper appears. The Source to Target Mapper allows you to map source columns on the left side of the screen to their targets on the right.
  9. Note Collibra DQ automaps columns it detects as matches between the source and target.

  10. In the Target Column, click the dropdown menu within a given row and choose the columns you wish to map to the source column within the corresponding row. If the automapping that Collibra DQ suggested is correct, you can proceed to the next step without making modifications.
  11. Click Validate Overlay.
    The Overlay Preview appears. The Overlay Preview shows you the source to target mapping based on your selections in the previous steps. A helpful legend is available in the upper right corner of the screen.
  12. Optionally select the Key checkbox at the top of any of the columns to assign them as key columns.
  13. Click Validate.
    The main Mapping screen displays the number of mapped columns in the stepper.
  14. Continue to the next step on the stepper or click the right arrow on the right side of the page.
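The Timeslice option in the steps above adds a ${rd} placeholder whose datetime value is filled in at run time. As a rough sketch of that substitution (the ${rd} name comes from the product, but the query, table name, and substitution code below are illustrative assumptions, not the product's implementation):

```python
# Hypothetical sketch of runtime substitution for a ${rd} Timeslice
# placeholder. The table name "orders" and the query are made up.
from datetime import datetime
from string import Template

query_template = Template(
    "SELECT * FROM orders WHERE order_date >= '${rd}'"
)

# At run time, the current run date replaces ${rd} in the query text.
run_date = datetime(2024, 1, 15).strftime("%Y-%m-%d")
query = query_template.substitute(rd=run_date)
print(query)  # SELECT * FROM orders WHERE order_date >= '2024-01-15'
```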

Mapping Configuration

You can click the three dots icon in the upper right corner of the Mapping main screen to apply different configuration options. The following table shows the available options.

Option Description
Pushdown Count Runs COUNT(*) FROM <database> to limit the rows included in the mapping query to only those that meet the Pushdown Count criteria. This can increase the efficiency of your DQ job by preventing data that does not meet the Pushdown Count criteria from being included in the Spark compute tier.
Check case Checks for case sensitivity when comparing source and target schemas by creating a new mapping to instruct Collibra DQ about the case-sensitive source to target mapping.
Order Checks for schema order consistency when comparing source and target schemas.
Check type Checks for schema type consistency when comparing source and target schemas.
Validate values

Unavailable when Pushdown Count is enabled. Checks for cell value when comparing source and target cells. If no rows are present in either the source or target data source, the cell value comparison activity is skipped when the job runs.

Note This requires a large amount of computational resources.

Validate for matches Unavailable when Pushdown Count is enabled. Checks for matching instead of mismatching cell values when comparing source and target cells.
Trim Unavailable when Pushdown Count is enabled. Removes leading and trailing spaces from strings to improve data cleanliness when comparing source and target cells.
Missing Keys Unavailable when Pushdown Count is enabled. Shows key values that are found in one data source but missing in the other.
Ignore Precision Ignores differences in precision (the total number of digits) in numeric data type columns when comparing source and target schemas.
Strict Source Downscoring Bases the downscore value on the number of cell mismatches across all columns. For example, if there are 20 cell mismatches, then 20 points are deducted from the overall data quality score.
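The Strict Source Downscoring arithmetic above can be sketched in a few lines. The deduction of one point per cell mismatch comes from the description; the base score of 100 and the floor at zero are assumptions for this illustration.

```python
# Illustrative sketch of Strict Source Downscoring: each cell mismatch
# across all mapped columns deducts one point from the overall score.
# A base score of 100 and a floor of 0 are assumed here, not documented.

def strict_source_downscore(cell_mismatches, base_score=100):
    return max(base_score - cell_mismatches, 0)

print(strict_source_downscore(20))  # 20 mismatches -> score of 80
```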