Job Command Line Parameters
This topic describes how to use command line parameters to run jobs. It covers managing Spark resources from the command line and provides a reference of the parameters available in the job run command.
Managing Spark resources from the command line
Scale linearly with your data
Scale linearly with your data by adding executors and/or memory. For example:
-f "file:///Users/home/salary_data.csv" \
-d "," \
-rd "2018-01-08" \
-ds "salary_data"
-numexecutors 2 \
-executormemory 2g
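These flags are appended to the normal job run command. As a minimal sketch, assuming the standard owlcheck launcher under /opt/owl/bin (adjust the path, file location, and sizing to your environment):
/opt/owl/bin/owlcheck \
-f "file:///Users/home/salary_data.csv" \
-d "," \
-rd "2018-01-08" \
-ds "salary_data" \
-numexecutors 2 \
-executormemory 2g \
-drivermemory 2g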
Yarn Master
Spark Master
Collibra DQ can also run against a Spark master by passing the -master option with a spark://<host>:<port> URL.
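A minimal sketch (the master host and port are placeholders for your environment):
-f "file:///Users/home/salary_data.csv" \
-d "," \
-rd "2018-01-08" \
-ds "salary_data" \
-master spark://<spark-master-host>:7077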
Spark Standalone
Collibra DQ typically runs in Spark standalone mode, but by default it does not distribute the processing beyond the hardware it was activated on. The options below control how the job is sized and where it runs; a combined example follows the table.
Options | Description |
---|---|
deploymode | The Spark deploymode option. For example, cluster. |
drivermemory | The driver memory of your local Spark instance in gigabytes. |
executorcores | Spark executor cores. |
executormemory | The total Spark executor memory in gigabytes, for example, 3G. |
master | Overrides local[*] |
sparkprinc | Kerberos principal name, example: [email protected]. |
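As a sketch only (the values are illustrative, not recommendations), these options combine on a single job run to size the driver and executors against a standalone master:
-ds "salary_data" \
-rd "2018-01-08" \
-f "file:///Users/home/salary_data.csv" \
-d "," \
-master spark://<spark-master-host>:7077 \
-deploymode cluster \
-drivermemory 3g \
-executorcores 2 \
-numexecutors 2 \
-executormemory 3g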
Use spark-submit directly, bypassing DQCheck
spark-submit \
--driver-class-path /opt/owl/drivers/postgres42/postgresql-42.2.4.jar \
--driver-library-path /opt/owl/drivers/postgres42/postgresql-42.2.4.jar \
--driver-memory 3g --num-executors 2 --executor-memory 1g \
--master spark://Kirks-MBP.home:7077 \
--class com.owl.core.cli.OwlCheck /opt/owl/bin/owl-core-trunk-jar-with-dependencies.jar \
-u user -p pass -c jdbc:postgresql://xyz.chzid9w0hpyi.us-east-1.rds.amazonaws.com/postgres \
-ds accounts -rd 2019-05-05 -dssafeoff -q "select * from accounts" \
-driver org.postgresql.Driver -lib /opt/owl/drivers/postgres42/
Parallel JDBC Spark-Submit
spark-submit \
--driver-class-path /opt/owl/drivers/postgres42/postgresql-42.2.4.jar \
--driver-library-path /opt/owl/drivers/postgres42/postgresql-42.2.4.jar \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/owl/config/log4j-TRACE.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/owl/config/log4j-TRACE.properties \
--files /opt/owl/config/log4j-TRACE.properties \
--driver-memory 2g --num-executors 2 --executor-memory 1g --master spark://Kirks-MBP.home:7077 \
--class com.owl.core.cli.OwlCheck /opt/owl/bin/owl-core-trunk-jar-with-dependencies.jar \
-u us -p pass -c jdbc:postgresql://xyz.chzid9w0hpyi.us-east-1.rds.amazonaws.com/postgres \
-ds aumdt -rd 2019-05-05 -dssafeoff -q "select * from aum_dt" \
-driver org.postgresql.Driver -lib /opt/owl/drivers/postgres42/ \
-connectionprops fetchsize=6000 -master spark://Kirks-MBP.home:7077 \
-corroff -histoff -statsoff \
-columnname updt_ts -numpartitions 4 -lowerbound 1557597987353 -upperbound 1557597999947
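The -columnname, -numpartitions, -lowerbound, and -upperbound options map to Spark's JDBC partitioning: the query is split into numpartitions range predicates on the chosen column so each executor fetches one slice in parallel. A rough sketch of the predicates Spark generates for the example above (the exact form depends on the Spark version; stride = (upperbound - lowerbound) / numpartitions):
-- partition 1: updt_ts < lowerbound + stride (rows with a NULL updt_ts also land here)
-- partition 2: updt_ts >= lowerbound + stride AND updt_ts < lowerbound + 2*stride
-- partition 3: updt_ts >= lowerbound + 2*stride AND updt_ts < lowerbound + 3*stride
-- partition 4: updt_ts >= lowerbound + 3*stride
Tune -connectionprops fetchsize to what the source database can serve comfortably; a larger fetch size reduces round trips but uses more memory per partition.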
Command Line Reference
The following table describes the parameters available for a job run command.
Parameter | Description |
---|---|
adddc | Add Date Column as Run Id that is used |
addlib | Additional library directory to be added to the classpath for the DQ Job (spark-submit) |
agentjobid | Internal use only |
agg | Grouping function for flexibility |
aggq | select * from dataset where |
alertemail | Automatically add an alert with a score greater than 75 to the email value supplied |
alias | Dataset name alias, example: userTable or users or user_file |
archivecxn | Connection name to archive break records |
avro | avro file data flag |
avroschema | avro schema file |
bd | Column count if you want to group by a particular set of values for behavioral statistics |
bdcol | Behavioral function if you want to aggregate for behavioral statistics |
bdfunc | Behavioral function if you want to aggregate for behavioral statistics |
bdgrp | Behavioral group to dynamically collect stats if you want to group by a particular set of values for behavioral statistics |
behaviorscoreoff | Turn off behavior scoring |
bhemptyoff | Behavior empty check detection off |
bhlb | The behavior lookback period where the entered value represents the number of days. For example, a value of 12 looks back 12 days of data. |
bhmaxoff | Behavior max value detection off |
bhmaxon | Behavior max value detection on |
bhmeanoff | Behavior mean value detection off |
bhmeanon | Behavior mean value detection on |
bhminoff | Behavior min value detection off |
bhminon | Behavior min value detection on |
bhminsupport | Behavior min support, set to 4 by default, min number of days to learn from, learning phase |
bhnulloff | Behavior null check detection off |
bhrowoff | Behavior row count detection off |
bhsensitivity | Behavior sensitivity: NARROW , NULL , WIDE |
bhtimeoff | Behavior load time detection off |
bhtimeon | Behavior load time detection on |
bhuniqueoff | Behavior unique detection off |
bhuniqueon | Behavior unique detection on |
br | Number of back runs to fill training history, should be an integer value |
brbin | Time bin for back runs, example: -brbin DAY |
bt | Back-tick character (`) to escape SQL queries when returning to database |
by | Compare by DAY , HOUR , or MIN . |
c | jdbc://hive:3306 (connection URL) |
cacheoff | Turn caching off. Caching is on by default. It can be turned off if the dataset is too large or cache optimization is not desired. |
cardoff | Turn off profiling section of owlcheck |
categoricallimit | Limit for categorical outliers stored |
categoricallimitui | Limit for categorical outliers displayed |
categoricalscore | Score for each categorical outlier |
catoff | Disables categorical outlier detection. |
caton | Turn on categorical outliers |
catOutAlgo | Specify ML algorithm for categorical outliers. Default: "" (no ML) |
catOutAlgoParams | Optional params for catOutAlgo to override Owl-suggested params. Default: "". E.g. "k=5,initSteps=5" |
catOutBottomN | Max number of categorical outliers in a column |
catOutConfType | Method to use to calculate likelihood of category level |
catOutMaxCategoryN | Maximum number of categories within key that will trigger homogenous past categorical outlier case |
catOutMaxConf | Confidence upper bound to qualify as an outlier |
catOutMaxFreqPercentile | Frequency percentile upper bound to qualify as an outlier |
catOutMinFreq | Minimum frequency needed to be considered an outlier. Raise to make less sensitive |
catOutMinVariance | Minimum frequency count variance (within key) required to be considered an outlier. Set to a negative value to be more sensitive |
catOutParallelOff | Turn off parallel column-wise processing of categorical outliers |
catOutTopN | Number of top frequently appearing levels in a column to include in preview |
columnname | Column name to split on for spark JDBC |
concat | No arguments. Concatenate option for categorical outliers columns |
conf | The Spark configuration option. For example, spark.kubernetes.memoryOverheadFactor=0.4,spark.kubernetes.executor.podTemplateFile=local:///opt/owl/config/k8s-executor-template.yml,spark.kubernetes.driver.podTemplateFile=local:///opt/owl/config/k8s-driver-template.yml |
connectionprops | key=value,hive.resultset.use.unique.column.names=false |
connectionpropssrc | key=value,hive.resultset.use.unique.column.names=false |
corefetchmode | Let core fetch the query from the metastore instead of using the one passed in on the command line |
corroff | Dataset correlation flag force off |
corron | Dataset correlation flag force on |
cxn | The name of the saved database or file connection from which your dataset originates. |
d | Delimiter, ',' |
dataconceptid | Identifier of the group of semantic rules by datatype |
datashapeexc | Exclude a column from data shapes discovery |
datashapegranular | Check length for alphanumeric fields, and independent check for numbers and letters |
datashapeinc | Include a column that has been excluded from data shapes discovery |
datashapelimit | Limit for datashapes stored |
datashapelimitui | Limit for datashapes displayed |
datashapemaxcolsize | Maximum length of a string column before it is disqualified from shapes detection |
datashapemaxpercol | Maximum number of shapes per column before column is ignored during shapes processing |
datashapeoff | Turn DataShape Activity Off |
datashapescore | Score for each datashape |
datashapesense | Maximum occurrence rate (%) to be considered a shape |
dateoff | Turn date detection off. In some cases date detection is a costly operation with little value |
dblb | DB lookback to check owl check history for previous histories |
dc | The date column for outlier detection. |
delta | Delta file data flag |
deploymode | The Spark deploymode option. For example, cluster. |
depth | The depth of duplicate discovery between 1-3, increasing runtime non-linearly. The default value is 1. |
df | Date Format, example: yyyy-MM-dd |
diff | Percentage difference between two days to do a reference for keys missing |
divisor | Divisor for unix timestamp. s for seconds or ms for milliseconds. Default is ms . |
dl | Deep learning. This enables the outliers activity. |
dlcombine | When numerical outlier appears more than once, combine them as single outlier |
dlcombineoff | When numerical outlier appears more than once, do not combine them as single outlier |
dlexc | Deep learning col exclusion, example: open,close,high,volume |
dlinc | The column limit for deep learning. This can be a comma delimited list of columns to include in your job. For example, if you want to include columns called account_id, date, and frequency, the correct syntax would be account_id,date,frequency . |
dlkey | The natural key for deep learning. This is the column in your dataset that you set as the key column. |
dllb | The deep learning lookback period where the entered value represents the number of days included in the outlier activity lookback. |
dlminhist | Minimum records for outlier history, default dllb - 2 |
dlmulti | Pass multiple dlkey=dlinc key value pairs. Split by pipe for multiple |
dn | Driver name, example: org.apache.phoenix.jdbc.PhoenixDriver |
dpoff | Do not store data preview records |
dprev | Data preview turned off, same as onReadOnly |
dq | Double-quote character (") to escape SQL queries when returning to database |
driver | The driver class name of a custom driver. |
drivermemory | The driver memory of your local Spark instance in gigabytes. |
ds | The name of the dataset. |
dssafeoff | Best practice naming convention flag, provides a globally unique and meaningful natural key to all datasets |
dupe | Enables the dupe activity. |
dupeapprox | Approximate groupings default value =1 [0-3] |
dupecutoff | The duplicate score lower boundary for non-exact matching percentage. For example, if you set the dupecutoff value to 40, then the lowest percentage of a potential duplicate match would be 40%. This can be used in conjunction with -dupepermatchupperlimit to specify a range of matches. Note: If Exact Match is enabled, this value cannot be set. |
dupeexc | Duplicate record detection, column exclusion list |
dupeinc | The column limit for duplicate record detection. This can be a comma delimited list of columns to include in your job. For example, if you want to include columns called account_id, date, and frequency, the correct syntax would be account_id,date,frequency . |
dupelb | Duplicate lower bounds on percent match, default [85] |
dupelimit | Limit for dupe rows stored |
dupelimitui | Limit for dupe rows displayed |
dupenocase | Duplicate record case sensitivity off |
dupeonly | Only run duplicate section |
dupepermatchupperlimit | The duplicate score upper boundary for non-exact matching percentage, set to 100 by default. |
dupescore | Score for each duplicate record |
dupesperdupe | Max dupes to calculate per duplicate match |
dupetruecase | Enables case sensitivity. |
dupeub | Duplicate upper bounds on percent match, default [100] |
ec | Add custom escape character to escape SQL queries when returning to database |
encoding | Load file charset encoding other than UTF-8 |
erlq | Explicit k,v string of rule_name and rule sql for secondary datasets |
executorcores | Spark executor cores |
executormemory | The total Spark executor memory in gigabytes, for example, 3G. |
f | File path for load, /dir/filename.csv |
files | Pass additional spark files for distribution on cluster |
filter | Only use rows containing this value |
filtergram | filtergram |
filternot | Only use rows not containing this value |
flatten | Option to flatten json and explode arrays |
fllb | File Lookback to check owl check history for previous files |
fllbminrow | Minimum number of rows (inclusive) that owl check history needs to be considered for File Lookback. Default 0 (which includes all owlchecks) |
fpgbucketlimit | Limit bucket size for Pattern algorithm, example: -fpgbucketlimit 20000 |
fpgconfidence | Minimum occurrence rate at which an association rule has to be found to be true |
fpgdc | The column in your dataset that you set as the date column. |
fpgdupeoff | Pattern mining: do not remove duplicate columns. Helps performance but impacts quality |
fpgexc | Pattern mining is expensive; use this input to limit the observed columns |
fpginc | The column limit for pattern mining. This can be a comma delimited list of columns to include in your job. For example, if you want to include columns called account_id, date, and frequency, the correct syntax would be account_id,date,frequency. Because pattern mining is expensive, limiting the number of columns in your query can be an effective way to control costs. |
fpgkey | The natural key for pattern mining. This is the column in your dataset that you set as the key column. |
fpglb | The lookback period where the entered value represents the number of days included in the pattern activity lookback. |
fpglimit | Limit for frequent pattern mining results stored |
fpglimitui | Limit for frequent pattern mining results displayed |
fpgmatchoff | Turn off match for only patterns that appear in today dataset scope |
fpgmulti | Pass multiple fpgkey=fpginc key value pairs. Split by pipe for multiple |
fpgon | Enables the pattern (mining) activity. |
fpgq | Select * from file (sql) |
fpgscore | Score for pattern mining records |
fpgsupport | Minimum occurrence rate for an itemset to be identified as frequent |
fpgtbin | Time bin for pattern mining, example: -fpgtbin DAY |
fq | Select * from file (sql) |
fullfile | Use entire file for lookbacks instead of just filequery |
h | The hostname where CDQ is installed. This option is for running DQ jobs remotely. |
header | Comma delimited list of headers: fname,lname,price |
headercheckoff | Turn off the check of headers for invalid characters |
help | Print this message |
histlimit | Limit for histograms stored |
histlimitui | Limit for histograms displayed |
histoff | Dataset histogram flag force off |
histon | Dataset histogram flag force on |
hive | Turn on native hive for Hive non JDBC recommended |
hivehwc | Use hive warehouse connector to access data in HDP3.x Hive Warehouse |
hootonly | Only display hoot at stdout |
hootprettyoff | Hoot json pretty print flag off |
host | Owl metadata store host |
hudi | hudi file data flag |
in | Validate distinct column values against another dataset |
inferschemaoff | Turn off inferschema when loading files |
iot | Automatically store a numeric column without specifying a tsk, tsv |
jars | Spark - Comma-separated list of jars to include on the driver and executor classpaths. |
jdbckeytab | Path and location to jdbc principal keytab file |
jdbcprinc | Kerberos principal name specifically to connect to Kerberized JDBC, example: [email protected] |
jdbctgt | Path and location to jdbc principal tgt file |
jobschemaname | Mainly needed for Big Query, but can be used for any database to set the schema name explicitly versus parsing it out of the sql query later |
json | Json data flag |
kafka | Indicates that the target data source is Kafka |
kafkabroker | Kafka port, example: 9092 |
kafkagroup | Kafka consumer group, example: machine-group |
kafkakeyserde | Optional --kafka_key_deserializer org_apache_kafka_common_serialization_StringSerializer |
kafkaport | Kafka host, example: localhost |
kafkasasl | Enable kafka SASL (Kerberos) authentication. If this option is set, also set kafkasaslservice flag |
kafkasaslservice | The name of the SASL service for authentication |
kafkassl | Enable kafka SSL 1-way and/or 2-way ssl. If this option is set, also set ssltruststore/sslkeystore flags |
kafkatopic | Topic, topic name, example: test_stream |
kafkavalserde | Optional --kafka_value_deserializer org_apache_kafka_common_serialization_StringSerializer |
kerbpwd | Kerberos password to acquire TGT |
key | Primary key or unique key, typically business key, example : sym,exch (compound use comma) |
keyDelim | Delimiter for primary key or unique key when concatenating the values to single string, example: sym,exch -> sym~~exch |
lib | The library class directory, for example, /opt/owl/drivers/postgres/. |
libsrc | Library class directory for val src cxn |
lickey | Passes lickey from owlcheck to owl-core |
linkid | linkid is for client datasets to pass their primary_key or record_id field so that when Owl serves back the results they are linked to the original dataset |
logfile | Allow user to add their own custom logging |
loglevel | The logging level. This can be either INFO or DEBUG . |
lookbacklimit | Limit for lookback intervals |
lower | Median Q2 multiplier to impact lower boundary |
lowerbound | Number or Timestamp |
maps | Contains maps in json that requires extra handling |
master | Overrides local[*] |
matches | (Deprecated) Show matches in validate source instead of mismatches |
maxcolumnlimit | Limit for max columns |
minintervals | Minimum streaming intervals for profiling |
mixedjson | Contains non-json and json flag |
mu | Measurement unit for outliers |
multiline | Multiline json flag |
notin | Validate distinct column values against another dataset |
nulls | -nulls 'null' treats 'null' as NULL |
numericlimit | Limit for numeric outliers stored |
numericlimitui | Limit for numeric outliers displayed |
numericscore | Score for numeric outliers |
numexecutors | The number of Spark executors. |
numpartitions | Number of partitions or splits for parallel queries, default 3 |
obslimit | Limit for observations stored |
obslimitui | Limit for observations displayed |
obsscore | Score for observation record |
opt | key=value, [escape='', quote='', timestampFormat='yyyy-MM-dd' ] |
optpath | /file/path/to/dsOption.properties [escape=value] |
orc | orc file data flag |
otlrq | Select * from file (sql) |
outlierlimit | Limit for outliers stored |
outlierlimitui | Limit for outliers displayed |
outlieronly | Only run outlier section |
outlierscore | Score for mismatching source to target records |
owluser | The username of the CDQ user running the job. |
p | Password |
packages | Spark - Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version |
parallel | Turn on parallel task execution vs sequential (default). Performance advantage but uses more hardware resources |
parquet | Parquet file data flag |
partitionnum | Number of partitions calculated by estimator/override by user |
passfail | Set the failing score, example: 75 |
passfaillimit | Limit for passing or failing runs |
patternonly | Only run pattern mining section |
pgpassword | Password for Owl's postgres metastore |
pguser | Username for Owl's postgres metastore |
pipeline | List of activities to analyze |
plan | Turn on the execution plan. Describes the execution plan |
port | Owl metadata store port |
postclearcache | Delay clear cache process to the end of owlcheck |
precisionoff | Turn Profile Precision Off, do not calculate the length of doubles |
profile2 | Run inline version of column stats |
profileonly | Only run profile and shape section |
profilepushdown | Compute profile in the target database |
profileStringLength | Profile min/max length for String type columns on |
profoff | Turn off profiling section of owlcheck |
pwdmgr | Look up a password manager password via script and obtain the password for the JDBC connection |
q | The SQL query of your job. For example, select * from [table] . |
q1 | The lower quartile boundary impact (IQR) value between 0-0.45. If this is not specified, the lower quartile is 0.15 by default. |
q3 | The upper quartile boundary impact (IQR) value between 0.55-1. If this is not specified, the upper quartile is 0.85 by default. |
qhist | Select * from table (sql) |
queue | YARN queue name |
rc | Record detection |
rcBy | Record compare by function |
rcDateCol | Record detection date column |
rcKeys | Record detection keys |
rcTbin | Record detection time bin |
rd | The run date of your job in either yyyy-MM-dd or yyyy-MM-dd HH:mm format. |
rdAdj | Adjusts the run date (rd) value for replacement date variables yyyy:MM:dd HH:mm:ss, formatting XX:NNN (example dd:-2 overrides the run date by subtracting 2 days) |
rdEnd | End date for query ranges t_date >= ${rd} and t_date < ${rdEnd} , must be in format 'yyyy-MM-dd' or for incremental use Hours or Minutes yyyy-MM-dd HH:mm |
readonly | Do not connect to the meta store; good for testing or trials |
record | Validate distinct column values against runs |
recordoff | Check for records that were added or dropped from dataset |
repartitionoff | Do not repartition |
rlc | Rule secondary src jdbc://hive:3306 (connection URL) |
rld | Rule secondary src driver path |
rlds | srcDataset (silo.account) |
rlp | Rule secondary src password |
rlq | Rule secondary src SQL |
rlu | Rule secondary src username |
rootds | Context based predictions: you can assign a root dataset, example: user -> userLoan, user -> userCredit. rootds = user |
rulename | Only for rules validation testing to run single rule |
ruleserial | Run rules in serial mode |
rulesoff | Rules section flag off |
rulesonly | Only run the rules section |
schemaregistrypass | Password to login to schema registry where stream schema can be found |
schemaregistryurl | url of schema registry where stream schema can be found |
schemaregistryuser | Username to login to schema registry where stream schema can be found |
schemascore | Score for schema changes |
scorecardsize | Limit for size of scorecard displayed |
sdriver | Classname for custom secondary driver entered by user for complex rule |
selectall | Select * override cols |
semanticoff | Semantic forced off |
semanticon | Semantic forced on |
skipfirstrow | Indicates that the first row contains header values |
skiplines | Skip first N lines of a csv file before loading |
sourceonly | Only run validate source section |
sp | Sample percentage [0.0 - 1.0], default value 1.0 = 100% |
sparkkeytab | Path and location to keytab file |
sparkprinc | Kerberos principal name, example: [email protected] |
sq | Single quote (') character to escape SQL queries when returning to database |
srcauto | Auto generates validate source params from owl check history. Only needs -srcds and -valsrcfq or -q |
srcavro | avro file data flag for source |
srcavroschema | Validate source avro schema file |
srcc | jdbc://hive:3306 (connection URL) |
srccxn | Instead of providing the user, password, and connection URL for a connection, provide the saved connection name for validate source |
srcd | src driver oracle.driver.JDBC |
srcdel | Source delimiter , |
srcdelta | Delta file data flag for source |
srcds | srcDataset (silo.account) |
srcencoding | Load source file charset encoding other than UTF-8 |
srcfile | Validate source file |
srcflatten | Option to flatten json and explode arrays for source |
srcfullfile | Use entire file for lookbacks instead of just filequery for source |
srcheader | Validate source header for a file |
srchive | -srchive for validate source on Hive using HCat non JDBC |
srcinferschemaoff | Turn off inferschema when loading files or source |
srcjson | json data flag for source |
srcjsonmaps | Contains maps in json that requires extra handling for source |
srcmixedjson | Validate source contains non-json and json flag |
srcmultiline | Multiline json flag for source |
srcorc | orc file data flag for source |
srcp | src password |
srcparquet | Parquet file data flag for source |
srcpwdmgr | Lookup a password manager password via script and obtain the password for the JDBC connection |
srcq | src SQL |
srcskiplines | Skip first N lines of a source csv file before loading |
srcu | src username |
srcxml | Xml data flag for source |
srcxmlrowtag | Xml Row Tag for source |
sslciphers | Comma separated list of valid ciphers for the target secure socket connection |
ssldisablehostverify | Disable SSL hostname verification when deciding whether to trust the host's certificate |
sslkeypass | ssl key password (Only required when ssl key stored in keystore has a password) |
sslkeystore | Location of the ssl keystore |
sslkeystorepass | ssl keystore password |
sslkeystoretype | Type of the ssl keystore (Default: JKS) |
ssltruststore | Location of the ssl truststore |
ssltruststorepass | ssl truststore password |
ssltruststoretype | Type of the ssl truststore (Default: JKS) |
statsoff | Column stats flag off, on by default |
stock | Optimized for stock data, price history |
stream | Indicates that the target data source is a stream of data |
streamformat | Format, example: csv,avro,json,xml |
streaminterval | Interval, in second format, example: 10 |
streammaxlull | The maximum time in seconds that a stream should not be empty |
streamprops | key=value,hive.resultset.use.unique.column.names=false |
streamschema | col:integer,col1:double,col2:string,col3:long |
streamtype | Type of stream. Possible values: Kafka |
stringmode | All data types forced to strings for type safe processing |
t1 | Select * from @dataset.column (sql) |
t1q | Select * from @dataset.column (sql) |
tbin | MIN -> minute [14:27], HOUR -> hour military [13], DAY -> [05], SEC -> Second [14:27:35] |
tbq | Select * from dataset where time_bin = '2018-12-10 10'. Ex: for time bin outliers override. |
tc | Time Column for cases when date time are separate |
timestamp | Converts timestamp column to date format. Uses -dc date column flag as column to convert. Must be accompanied by the -transform flag to transform string to DateType/TimestampType |
todq | Select * from dataset where time_bin = '2018-12-10 10', example: for today override. |
transform | Transform expressions. Can be a single expression or multiple delimited by | . Example: colname=cast(colname as string),colname2=colname2(cast as date) |
ts | Flag this dataset as a Time-Series dataset |
u | Username |
ubon | Use boundaries flag on |
upper | Median Q2 multiplier to impact higher boundary |
upperbound | Number or Timestamp |
usespark | usespark flag, forces spark, intended for datasets > 30 mil rows |
usesql | usesql implies to use the -q select * from table where etc as a subselect of the partitioning |
usetemplate | Does not require cmd line params uses saved properties, can override by adding them |
validateschemaorderon | Validate source column name order |
validatevalues | Validate source matches on cell values and show mismatches |
validatevaluesfilter | Spark sql where clause to limit rows to validate values, example: "id = 123" |
validatevaluesignoreempty | Validate value ignores empty string as an issue |
validatevaluesignorenull | Validate values ignores null as an issue |
validatevaluesignoreprecision | Validate value ignore precision for decimal values |
validatevaluesshowall | Validate values shows findings for all columns instead of one per row |
validatevaluesshowmissingkeys | Provide options to show missing keys on both target and source for validating source with key case |
validatevaluesthreshold | Validate value threshold ratio. Default .9 (=90%) |
validatevaluesthresholdstrictdownscore | Validate values turn on strict downscore for threshold category |
validatevaluestrimon | Provide options to trim extra space for source target join and cell to cell comparison |
valsrccaseon | Validate source column name case sensitivity off |
valsrcexc | Validate source column exclusion list for target dataset |
valsrcexcsrc | Validate source column exclusion list for source dataset |
valsrcfq | Validate source file query |
valsrcinc | Validate source column inclusion list for target dataset |
valsrcincsrc | Validate source column inclusion list for source dataset |
valsrcjoinonly | Skip validate source row count, schema comparison and validate values, pair use with -postclearcache |
valsrckey | Validate source column key list for target dataset |
valsrclimit | Limit for validate source stored |
valsrclimitui | Limit for validate source displayed |
valsrcmap | Validate source file column mapping (sourceCol=targetCol,sourceCol2=targetCol2) |
valsrcpdc | Push row count to source database |
valsrctypeoff | Validate source don't check for schema type |
version | 0.1 |
vs | Turn on validate source |
where | Allows you to place a common where clause and still accept partitioning |
xml | Xml data flag |
xmlRowTag | Xml Row Tag |
zfn | Zero fill null, NULL values will be 0.0 |
zkhost | Zookeeper host |
zkpath | Zookeeper path |
zkport | Zookeeper port |
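Putting it together, a sketch of a job run command that combines several of the parameters above (the connection details, dataset name, column names, and thresholds are placeholders for illustration, not defaults):
-ds "accounts" \
-rd "2019-05-05" \
-q "select * from accounts where updt_ts >= '${rd}'" \
-c "jdbc:postgresql://<db-host>:5432/postgres" \
-u user -p pass \
-driver org.postgresql.Driver \
-lib /opt/owl/drivers/postgres42/ \
-dc updt_ts \
-dl -dlkey account_id -dllb 30 \
-dupe -dupeinc account_id,account_name -dupecutoff 85 \
-fpgon -fpgkey account_id -fpginc account_id,account_type \
-loglevel INFO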