Advanced
A DQ Check is a bash script that serves as the launch point for any DQ Job that scans a data set. A data set can be a flat file (text, JSON, Parquet, etc.) or a table from any number of databases, such as Oracle, Postgres, MySQL, Greenplum, DB2, SQL Server, or Teradata.
Example: run a data quality check on any file by setting the file path.
./owlcheck -ds stock_trades -rd 2019-02-23 -f /path/to/file.csv -d ,
Example output below. The response, called a hoot, is valid JSON:
{
"dataset": "stock_trades",
"runId": "2019-02-23",
"score": 100,
"behaviorScore": 0,
"rows": 477261,
"passFail": 1,
"peak": 1,
"dayOfWeek": "Sun",
"avgRows": 0,
"cols": 5,
"activeRules": 0,
"activeAlerts": 0,
"runTime": "00:00:23",
"dqItems": {},
"datashapes": [],
"validateSrc": [],
"alerts": [],
"prettyPrint": true
}
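Because the hoot is plain JSON, a scheduler can gate downstream steps on it. A minimal sketch using jq, assuming the hoot has been captured to a file and that 75 is an illustrative minimum acceptable score (not a product default):

```shell
# Sample hoot trimmed to the fields used below (illustrative values).
cat > hoot.json <<'EOF'
{"dataset": "stock_trades", "runId": "2019-02-23", "score": 100, "passFail": 1}
EOF

# Extract the score and fail the job when it drops below the threshold.
score=$(jq -r '.score' hoot.json)
if [ "$score" -lt 75 ]; then
  echo "DQ score $score is below threshold; failing job"
  exit 1
fi
echo "DQ score $score OK"
```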
Monthly Data
Sometimes you may want to run monthly profiles with aggregated data. In this case, the scheduling tool can supply the run date as a variable such as $runDate and the end date as $endDate. Bash/shell examples below.
echo "Hello World Collibra DQ"
runDate=$(date +"%Y-%m-%d")
endDate=$(date -d "$runDate +1 month" +%Y-%m-%d)
echo $runDate
echo $endDate
./owlcheck \
-q "select * from table where date >= '$runDate' and date < '$endDate' " \
-ds example \
-rd $runDate \
-tbin MONTH
Monthly BackRun (Using Collibra Data Quality's built-in Monthly)
Collibra Data Quality has 2 convenient features here:
- The use of the built-in ${rd} and ${rdEnd} variables removes the need for any shell scripting.
- Using -br, DQ will replay 20 months of data using this template automatically.
./owlcheck \
-q "select * from table where date >= '${rd}' and date < '${rdEnd}' " \
-ds example \
-rd 2019-01-01 \
-rdEnd 2019-02-01 \
-tbin MONTH \
-br 20
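To picture what -br 20 replays, the loop below prints 20 consecutive monthly [rd, rdEnd) windows starting at the -rd above. This is illustrative only and assumes GNU date; owlcheck performs the equivalent replay internally, and the exact order is handled by the tool.

```shell
# Illustrative only: generate the 20 monthly [rd, rdEnd) windows that a
# -br 20 monthly template covers. Requires GNU date (-d).
start="2019-01-01"
for i in $(seq 0 19); do
  rd=$(date -d "$start +$i month" +%Y-%m-%d)
  rdEnd=$(date -d "$start +$((i+1)) month" +%Y-%m-%d)
  echo "window $((i+1)): rd=$rd rdEnd=$rdEnd"
done
```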
Daily Data
One of the most common patterns is loading data or running a check once a day. A job control framework can pass in this value, or you can pull it from the shell.
echo "Hello World Collibra DQ"
runDate=$(date +"%Y-%m-%d")
echo $runDate
./owlcheck \
-q "select * from table where date = '$runDate' " \
-ds example \
-rd $runDate \
-tbin DAY
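For unattended daily runs, the snippet above can live in a small wrapper script invoked by cron. The script path, log path, and schedule below are hypothetical:

```shell
#!/bin/bash
# Hypothetical wrapper, e.g. /opt/owl/bin/dq_daily.sh: compute today's
# date and run the daily check with it.
runDate=$(date +"%Y-%m-%d")
echo "running owlcheck for $runDate"
# ./owlcheck -q "select * from table where date = '$runDate' " \
#   -ds example -rd "$runDate" -tbin DAY

# Hypothetical crontab entry (crontab -e) to run at 02:00 every day:
# 0 2 * * * /opt/owl/bin/dq_daily.sh >> /var/log/owlcheck_daily.log 2>&1
```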
Daily Data (Using Collibra Data Quality's built-in Daily)
./owlcheck \
-q "select * from table where date = '${rd}' " \
-ds example \
-rd 2019-03-14
Daily Data with Timestamp instead of Date
./owlcheck \
-q "select * from table where TS >= '${rd} 00:00:00' and TS <= '${rd} 23:59:59' " \
-ds example \
-rd 2019-03-14
OR Timestamp using ${rdEnd}
./owlcheck \
-q "select * from table where TS >= '${rd} 00:00:00' and TS < '${rdEnd} 00:00:00' " \
-ds example \
-rd 2019-03-14 \
-rdEnd 2019-03-15 \
-tbin DAY
Hourly Data
./owlcheck \
-q "select * from table where TS >= '${rd}' and TS < '${rdEnd}' " \
-ds example \
-rd "2019-03-14 09:00:00" \
-rdEnd "2019-03-14 10:00:00" \
-tbin HOUR
DQ Check Template with Service Hook
The best practice is to make a generic job that is repeatable for every DQ Check. Below is an example that first fetches the DQ Check from Collibra Data Quality with a REST call and then runs the response.
curl -X GET "http://$host/v2/getowlchecktemplate?dataset=lake.loan_customer" \
-H "accept: application/json"
The above REST call returns the below DQ Check. It is left up to the Job Control to replace ${rd} with the date from the Job Control system. You can use Collibra DQ's built-in scheduler to avoid these steps.
./owlcheck \
-lib "/home/danielrice/owl/drivers/mysql/" \
-cxn mysql \
-q "select * from lake.loan_customer where load_dt = '${rd}' " \
-key post_cd_num -ds lake.loan_customer \
-rd ${rd} \
-dc load_dt -dl -dlkey usr_name,post_cd_num -dllb 5 \
-tbin DAY -by DAY -dupe -dupeinc ip_address_home,usr_name -dupecutoff 85 \
-fpgon -fpgkey usr_name,post_cd_num -fpgdc load_dt -fpglb 5 -fpgtbin DAY \
-loglevel INFO \
-h $host:5432/owltrunk \
-owluser {user}
REST API End Point
The easiest option is to use the runtemplate endpoint to make requests from the command line or a Job Control system. This endpoint runs the DQ Check saved in Collibra instead of requiring the client to know the DQ Check details.
https://$host/v2/runtemplate?dataset=lake.spotify
RunTemplate
Parameters
dataset | string | name of the dataset. -ds OR opt.dataset
rd | string | yyyy-MM-dd format; can include a time or timezone. If not passed in, the current day is used
rdEnd | string | yyyy-MM-dd format; can include a time or timezone. If not passed in, it is not used

Responses
200
{
"msg": "Success, DQ Check is Running as process 13996",
"pid": "13996",
"runid": "2019-10-17",
"starttime": "Thu Oct 17 13:27:01 EDT 2019",
"cmd": "-ds lake.spotify -rd 2019-10-17 -q \"select * from lake.spotify\" -cxn mysql -lib /opt/owl/drivers/mysql/ -drivermemory 2G -histoff -owluser {user}",
"dataset": "lake.spotify"
}
cURL example for the above REST call
TOKEN=$(curl -s -X POST http://$host/auth/signin -H "Content-Type:application/json" -d "{\"username\":\"$username\", \"password\":\"$password\"}" | jq -r '.token')
curl -i -H 'Accept: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
http://$host/v2/runtemplate?dataset=lake.spotify
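The response can be captured and parsed so the scheduler records which process is running. A sketch that reuses the 200 response shown above; the jq filters are the only addition, and the commented curl call stands in for the live request:

```shell
# Sample response mirroring the 200 example above.
response='{"msg": "Success, DQ Check is Running as process 13996", "pid": "13996", "runid": "2019-10-17", "dataset": "lake.spotify"}'

# In practice, capture it live instead:
# response=$(curl -s -H "Authorization: Bearer ${TOKEN}" \
#   "http://$host/v2/runtemplate?dataset=lake.spotify")

pid=$(echo "$response" | jq -r '.pid')
msg=$(echo "$response" | jq -r '.msg')
echo "started DQ check pid=$pid: $msg"
```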
Bash Script
A generic and repeatable DQ Check script for job schedulers that hooks into Collibra to get the template.
#1 authenticate
curl -s -X POST -d username={user} -d password={password} http://$OWL_HOST/login -c cookies.txt
#2 get template
owlcheck_args=$(curl -b cookies.txt -H "accept: application/json" -X GET http://$OWL_HOST/v2/getowlcheckcmdlinebydataset\?dataset=insurance | sed 's/.*\[\(.*\)\]/\1/' | sed -e "s/^\"//" -e "s/\"$//" | sed 's/\\\"\(.*\)\\\"/\x27\1\x27/')
#3 replace ${rd} with job_run_date
job_run_date="2019-03-14 10:00:00"
owlcheck_args=${owlcheck_args//'${rd}'/$job_run_date}
#4 run owlcheck
eval owlcheck $owlcheck_args
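The four steps above can be wrapped in a function so one script serves every dataset. A sketch under stated assumptions: run_dq_check is a hypothetical name, OWL_USER and OWL_PASS are assumed to be exported by the scheduler alongside OWL_HOST, and the endpoints are the same ones used above.

```shell
#!/bin/bash
# Sketch: the authenticate/fetch/substitute/run steps as one reusable
# function. OWL_HOST, OWL_USER, and OWL_PASS are assumed to be exported
# by the scheduler; run_dq_check is a hypothetical name.
run_dq_check() {
  local dataset="$1" run_date="$2" args
  # 1 authenticate
  curl -s -X POST -d "username=$OWL_USER" -d "password=$OWL_PASS" \
    "http://$OWL_HOST/login" -c cookies.txt > /dev/null
  # 2 get the saved template for this dataset
  args=$(curl -s -b cookies.txt -H "accept: application/json" \
    "http://$OWL_HOST/v2/getowlcheckcmdlinebydataset?dataset=$dataset" \
    | sed 's/.*\[\(.*\)\]/\1/' | sed -e 's/^"//' -e 's/"$//')
  # 3 replace ${rd} with the supplied run date
  args=${args//'${rd}'/$run_date}
  # 4 run owlcheck
  eval owlcheck "$args"
}

# Usage (not executed here):
# run_dq_check insurance "2019-03-14 10:00:00"
```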
For more Information on Collibra Data Quality's Scheduler, visit the DQ Job Cron page.