Intraday Positions
Warning: This documentation is archived and is no longer maintained.
Financial organizations commonly receive a steady stream of files containing hourly or minute-level data, and these files may trail the market in near real time. For example, hourly position files might be laid out as follows:
positions/
|--2019/
   |--01/
      |--22/
         position_2019_01_22_09.csv
         position_2019_01_22_10.csv
         position_2019_01_22_11.csv
         position_2019_01_22_12.csv
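The layout above implies a simple mapping from a timestamp to a file path. As a minimal sketch, assuming the root directory is `positions` and the naming follows the listing exactly, the path for a given hour could be derived as shown here. `PositionPaths` and `hourlyFile` are hypothetical names used for illustration and are not part of Collibra DQ.

```scala
import java.time.LocalDateTime

// Hypothetical helper: not part of Collibra DQ.
object PositionPaths {
  /** Builds <root>/yyyy/MM/dd/position_yyyy_MM_dd_HH.csv for the hour containing `ts`. */
  def hourlyFile(root: String, ts: LocalDateTime): String = {
    val (y, m, d, h) = (ts.getYear, ts.getMonthValue, ts.getDayOfMonth, ts.getHour)
    // Zero-pad every component so the path matches the directory listing.
    f"$root/$y%04d/$m%02d/$d%02d/position_$y%04d_$m%02d_$d%02d_$h%02d.csv"
  }
}
```

For example, `PositionPaths.hourlyFile("positions", LocalDateTime.of(2019, 1, 22, 9, 0))` yields the path of the 9am file in the listing.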
File Contents @ 9am
TIME | COMPANY | TICK | SIDE | QTY |
---|---|---|---|---|
2019-01-22 09:00 | T&G | xyz | LONG | 300 |
2019-01-22 09:00 | Fisher | abc | SHORT | 20 |
2019-01-22 09:00 | TradeServ | def | LONG | 120 |
File Contents @ 10am
TIME | COMPANY | TICK | SIDE | QTY |
---|---|---|---|---|
2019-01-22 10:00 | T&G | xyz | LONG | 280 |
2019-01-22 10:00 | BlackTR | ghi | SHORT | 45 |
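To make the hour-over-hour movement in the two sample files concrete, the sketch below compares the 9am and 10am snapshots in plain Scala. It only illustrates the kind of change present in the data (quantity shifts, companies appearing or disappearing). It is not how Collibra DQ computes outliers or duplicates, and `Position` and `SnapshotDiff` are hypothetical names introduced for this example.

```scala
// Hypothetical illustration of the sample data; not Collibra DQ code.
final case class Position(time: String, company: String, tick: String, side: String, qty: Int)

object SnapshotDiff {
  val at9: Seq[Position] = Seq(
    Position("2019-01-22 09:00", "T&G", "xyz", "LONG", 300),
    Position("2019-01-22 09:00", "Fisher", "abc", "SHORT", 20),
    Position("2019-01-22 09:00", "TradeServ", "def", "LONG", 120))

  val at10: Seq[Position] = Seq(
    Position("2019-01-22 10:00", "T&G", "xyz", "LONG", 280),
    Position("2019-01-22 10:00", "BlackTR", "ghi", "SHORT", 45))

  /** Quantity change (current minus previous) for companies present in both snapshots. */
  def qtyChanges(prev: Seq[Position], curr: Seq[Position]): Map[String, Int] = {
    val prevQty = prev.map(p => p.company -> p.qty).toMap
    curr.flatMap(c => prevQty.get(c.company).map(q => c.company -> (c.qty - q))).toMap
  }

  /** Companies in the previous snapshot that are missing from the current one. */
  def dropped(prev: Seq[Position], curr: Seq[Position]): Set[String] =
    prev.map(_.company).toSet.diff(curr.map(_.company).toSet)

  /** Companies in the current snapshot that were not in the previous one. */
  def added(prev: Seq[Position], curr: Seq[Position]): Set[String] =
    curr.map(_.company).toSet.diff(prev.map(_.company).toSet)
}
```

Between 9am and 10am, T&G's quantity falls by 20, Fisher and TradeServ drop out, and BlackTR appears.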
Collibra DQ Pipeline
// Part of your pipeline ingests files that have the date and hour encoded
// in the file name. How do you process those files using Collibra DQ?
//
// Format: <name>_<year>_<month>_<day>_<hour>.csv
// Assumes the Collibra DQ classes (OwlOptions, OwlUtils, TimeBin) and a
// SparkSession named `spark` are already in scope.
val file = new java.io.File("positions/2019/01/22/position_2019_01_22_09.csv") // <set this>
// Configure Collibra DQ.
val opt = new OwlOptions
opt.dataset = "positions"
opt.load.delimiter = ","
opt.load.fileQuery = "select * from dataset"
opt.load.filePath = file.getPath
opt.outlier.on = true
opt.outlier.key = Array("COMPANY")
opt.outlier.timeBin = TimeBin.HOUR
opt.dupe.on = true
opt.dupe.include = Array("COMPANY", "TICK")
opt.dupe.exactMatch = true
// Parse the filename to construct the run date (-rd) that will be passed
// to Collibra DQ.
val name = file.getName.split('.').head
val parts = name.split("_")
// The last four tokens are year, month, day, and hour.
val date = parts.takeRight(4).take(3).mkString("-")
val hour = parts.last
// Must be in format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm', so append the minutes.
val rd = s"${date} ${hour}:00"
// Tell Collibra DQ which run date this data belongs to.
opt.runId = rd
// Create a DataFrame from the file.
val df = OwlUtils.load(opt.load.filePath, opt.load.delimiter, spark)
// Instantiate an OwlContext with the dataframe and our custom configuration.
val owl = OwlUtils.OwlContext(df, spark, opt)
// Make sure Collibra DQ has catalogued the dataset.
owl.register(opt)
// Let Collibra DQ do the rest!
owl.owlCheck
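The filename parsing in the pipeline above can be isolated and validated before the run date is handed to Collibra DQ. The following is a minimal sketch, assuming file names end in `_<year>_<month>_<day>_<hour>.csv` as in the listing at the top of this page. `RunDateParser` and `runDateFromFileName` are hypothetical names, and the sketch uses only `java.time`, not any Collibra DQ API.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper: not part of Collibra DQ.
object RunDateParser {
  private val runDateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm")

  /** Returns Some("yyyy-MM-dd HH:mm") for names like position_2019_01_22_09.csv, else None. */
  def runDateFromFileName(fileName: String): Option[String] = {
    // Drop the extension, then split the stem on underscores.
    val stem = fileName.takeWhile(_ != '.')
    // The last four tokens are expected to be year, month, day, and hour.
    stem.split("_").takeRight(4) match {
      case Array(year, month, day, hour) =>
        // LocalDateTime.of rejects non-numeric or out-of-range values via exceptions,
        // which Try converts into None.
        Try(LocalDateTime.of(year.toInt, month.toInt, day.toInt, hour.toInt, 0))
          .toOption
          .map(_.format(runDateFormat))
      case _ => None
    }
  }
}
```

Returning an `Option` lets the caller skip or log files whose names do not match the expected pattern instead of passing a malformed run date to `opt.runId`.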
DQ Coverage for Position data
- Schema evolution
- Profiling
- Correlation analysis
- Segmentation
- Outlier detection
- Duplicate detection
- Pattern mining