Advanced

A DQ Check is a bash script that serves as the launch point for any DQ Job that scans a dataset. A dataset can be a flat file, such as a text, JSON, or Parquet file, or a table from any number of databases, such as Oracle, Postgres, MySQL, Greenplum, DB2, SQL Server, or Teradata.

Example: run a data quality check on any file by setting the file path (-f) and the delimiter (-d).

./owlcheck -ds stock_trades -rd 2019-02-03 -f /path/to/file.csv -d ,

Example output is below. A hoot is a valid JSON response.

{
  "dataset": "stock_trades",
  "runId": "2019-02-03",
  "score": 100,
  "behaviorScore": 0,
  "rows": 477261,
  "passFail": 1,
  "peak": 1,
  "dayOfWeek": "Sun",
  "avgRows": 0,
  "cols": 5,
  "activeRules": 0,
  "activeAlerts": 0,
  "runTime": "00:00:23",
  "dqItems": {},
  "datashapes": [],
  "validateSrc": [],
  "alerts": [],
  "prettyPrint": true
}
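
Because a hoot is valid JSON, a job control framework can gate downstream steps on it. Below is a minimal sketch, assuming the hoot has been captured to a file named hoot.json, that jq is installed, and that a passFail of 1 indicates a passing run; the score threshold of 75 is an arbitrary example value.

# gate a pipeline step on the hoot from a DQ Check
score=$(jq -r '.score' hoot.json)
passFail=$(jq -r '.passFail' hoot.json)

# assumes passFail=1 means pass; 75 is an arbitrary threshold
if [ "$passFail" -ne 1 ] || [ "$score" -lt 75 ]; then
  echo "DQ Check failed: score=$score passFail=$passFail"
  exit 1
fi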

Monthly Data

Sometimes you may want to run monthly profiles with aggregated data. In this case, the scheduling tool can supply the run date as a variable such as $runDate and the end date as $endDate. Example bash snippets are below.

echo "Hello World Collibra DQ"

runDate=$(date +"%Y-%m-%d")
endDate=$(date -d "$runDate +1 month" +%Y-%m-%d)

echo $runDate
echo $endDate

./owlcheck \
-q "select * from table where date >= '$runDate' and date < '$endDate' " \
-ds example \
-rd $runDate \
-tbin MONTH
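
Note that date -d is GNU date syntax; on BSD or macOS, use the -v flag (for example, date -v+1m) or install GNU coreutils and call gdate.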

Monthly BackRun (Using Collibra Data Quality's built-in Monthly)

Collibra Data Quality has 2 convenient features here:

  1. The use of the built-in ${rd} and ${rdEnd} variables removes the need for any shell scripting.
  2. Using -br, DQ will replay 20 months of data using this template automatically.
./owlcheck \
-q "select * from table where date >= '${rd}' and date < '${rdEnd}' " \
-ds example \
-rd 2019-01-01 \
-rdEnd 2019-02-01 \
-tbin MONTH \
-br 20
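
For reference, the backrun is roughly equivalent to the hand-rolled loop below. This is only a sketch: it assumes GNU date and assumes the backrun walks backward from the supplied run date, replaying one monthly window per iteration.

# rough equivalent of -br 20: replay the 20 monthly windows ending at the run date
for i in $(seq 19 -1 0); do
  rd=$(date -d "2019-01-01 -$i month" +%Y-%m-%d)
  rdEnd=$(date -d "$rd +1 month" +%Y-%m-%d)

  ./owlcheck \
    -q "select * from table where date >= '$rd' and date < '$rdEnd' " \
    -ds example \
    -rd "$rd" \
    -rdEnd "$rdEnd" \
    -tbin MONTH
done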

Daily Data

One of the most common patterns is data that loads or runs once a day. A job control framework can pass in the run date, or you can pull it from the shell.

echo "Hello World Collibra DQ"

runDate=$(date +"%Y-%m-%d")
echo $runDate

./owlcheck \
-q "select * from table where date = '$runDate' " \
-ds example \
-rd $runDate \
-tbin DAY

Daily Data (Using Collibra Data Quality's built-in Daily)

./owlcheck \
-q "select * from table where date = '${rd}' " \
-ds example \
-rd 2019-03-14

Daily Data with Timestamp instead of Date

./owlcheck \
-q "select * from table where TS >= '${rd} 00:00:00' and TS <= '${rd} 23:59:59' " \
-ds example \
-rd 2019-03-14

OR Timestamp using ${rdEnd}

./owlcheck \
-q "select * from table where TS >= '${rd} 00:00:00' and TS < '${rdEnd} 00:00:00' " \
-ds example \
-rd 2019-03-14 \
-rdEnd 2019-03-15 \
-tbin DAY

Hourly Data

./owlcheck \
-q "select * from table where TS >= '${rd}' and TS < '${rdEnd}' " \
-ds example \
-rd    "2019-03-14 09:00:00" \
-rdEnd "2019-03-14 10:00:00" \
-tbin HOUR

DQ Check Template with Service Hook

The best practice is to create a generic job that is repeatable for every DQ Check. Below is an example that first fetches the DQ Check from Collibra Data Quality using a REST call and then runs the response.

curl -X GET "http://$host/v2/getowlchecktemplate?dataset=lake.loan_customer" \
-H "accept: application/json"

The above REST call returns the below DQ Check. It is left up to the job control to replace ${rd} with the date from the job control system. You can use Collibra DQ's built-in scheduler to avoid these steps.

./owlcheck \
-lib "/home/danielrice/owl/drivers/mysql/" \
-cxn mysql \
-q "select * from lake.loan_customer where load_dt = '${rd}' " \
-key post_cd_num -ds lake.loan_customer \
-rd ${rd} \
-dc load_dt -dl -dlkey usr_name,post_cd_num -dllb 5 \
-tbin DAY -by DAY -dupe -dupeinc ip_address_home,usr_name -dupecutoff 85 \
-fpgon -fpgkey usr_name,post_cd_num -fpgdc load_dt -fpglb 5 -fpgtbin DAY \
-loglevel INFO \
-h $host:5432/owltrunk \
-owluser {user}

REST API Endpoint

The easiest option is the runtemplate endpoint, which you can call from the command line or a job control system. This endpoint retrieves the DQ Check saved in Collibra, so the client does not need to know the DQ Check details.

https://$host/v2/runtemplate?dataset=lake.spotify

RunTemplate

Parameters (Path)

  dataset (string): name of the dataset. Maps to -ds or opt.dataset.
  rd (string): yyyy-MM-dd format; a time or timezone can be added. If not passed in, the current day is used.
  rdEnd (string): yyyy-MM-dd format; a time or timezone can be added. If not passed in, it is not used.
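
For example, a fully parameterized request that overrides both dates could look like the following (URL-encode any values that contain spaces):

https://$host/v2/runtemplate?dataset=lake.spotify&rd=2019-10-17&rdEnd=2019-10-18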
Responses

200

{
  "msg": "Success, DQ Check is Running as process 13996",
  "pid": "13996",
  "runid": "2017-01-01",
  "starttime": "Thu Oct 17 13:27:01 EDT 2019",
  "cmd": "cmd": "-ds lake.spotify -rd 2019-10-17   -q \"select * from lake.spotify\" -cxn mysql -lib /opt/owl/drivers/mysql/ -drivermemory 2G -histoff   -owluser {user}",
  "dataset": "lake.spotify"
}

cURL example for the above REST call

TOKEN=$(curl -s -X POST http://$host/auth/signin -H "Content-Type:application/json" -d "{\"username\":\"$username\", \"password\":\"$password\"}" | jq -r '.token')

curl -i -H 'Accept: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  http://$host/v2/runtemplate?dataset=lake.spotify
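
The response can be gated on just like a hoot. Below is a minimal sketch, assuming jq is installed, that captures the process id from the JSON response for later monitoring:

# launch the template and pull the process id out of the response
response=$(curl -s -H 'Accept: application/json' \
  -H "Authorization: Bearer ${TOKEN}" \
  "http://$host/v2/runtemplate?dataset=lake.spotify")

pid=$(echo "$response" | jq -r '.pid')
echo "DQ Check launched as process $pid"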

Bash Script

A generic, repeatable DQ Check script for job schedulers that hooks into Collibra to get the template.

#1 authenticate
curl -s -X POST -d username={user} -d password={password} http://$OWL_HOST/login -c cookies.txt

#2 get template
owlcheck_args=$(curl -b cookies.txt -H "accept: application/json" -X GET http://$OWL_HOST/v2/getowlcheckcmdlinebydataset\?dataset=insurance | sed 's/.*\[\(.*\)\]/\1/' | sed -e "s/^\"//" -e "s/\"$//"  | sed 's/\\\"\(.*\)\\\"/\x27\1\x27/')

#3 replace ${rd} with job_run_date
job_run_date="2019-03-14 10:00:00"
owlcheck_args=${owlcheck_args//'${rd}'/$job_run_date}

#4 run owlcheck
eval owlcheck $owlcheck_args
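
The eval in step 4 re-parses $owlcheck_args so that the quoted -q query is passed to owlcheck as a single argument instead of being split on spaces.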

For more information on Collibra Data Quality's scheduler, visit the DQ Job Cron page.