Provenance for Multi-Source Datasets

EmVIS Lab Meeting, 9 May 2024

Cynthia A. Huang

Department of Econometrics and Business Statistics, MBUS

Introduction

About Me!

  • 👩‍🎓 PhD Candidate in Econometrics & Business Statistics, Monash Business School
  • 🙏 Supervised by: Rob Hyndman (EBS), Simon Angus (Econ) and Sarah Goodwin!
  • 👋 Affiliated at Monash with:
    • NUMBATS (Non-Uniform Monash Business Analytics Team)
    • SoDa Labs – alternative data for social science insights
    • MDFI (Monash Data Futures Institute)
  • 💱 Previously: Economics at Unimelb

About Me!

  • 📊 Interested in adapting data for research purposes:
    • web-scraped retail product & price data for public health research
    • mining Wikipedia edit metadata for evidence of state-sponsored misinformation
    • train/test sampling strategies for satellite (image) deep learning
  • 👩‍🎓 Provenance and statistical imputation for multi-source datasets
    • 🇩🇪 SoDa Statistiks at LMU in June
    • 🇨🇦 InfoVis group at UBC from Jul-Oct
  • 🧗🏻‍♀️ climber, 🎙️ regular host on The Random Sample podcast

Visualising Multi-Source Harmonisation Logic

  • Crossmaps Framework
  • CS Research (Collaboration) Opportunities
    • Data provenance communication
    • Interactive data merging (multi-table)
  • Other downstream projects

 

[Diagram: Domain Problem | Provenance Model | Visual Encoding | Interactive Tools]

 

Crossmaps Framework

Ex-Post Harmonisation 1/2

Ex-post (or retrospective) data harmonization refers to procedures applied to already collected data to improve the comparability and inferential equivalence of measures from different studies (Kołczyńska 2022).

Figure 1: Procedures in ex-post harmonisation

Ex-Post Harmonisation 2/2

Defining or selecting mappings between classifications or taxonomies,

Implementing and validating mappings on given data,

Documenting and analysing the implemented mapping.
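
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the crossmaps implementation. It assumes a mapping is stored as rows of (source code, target code, weight), where each weight is the share of a source value passed to a target and the weights for each source sum to one. The codes, the `CROSSMAP` table and the function names are all hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Step 1 (define): a hypothetical mapping between two classifications.
# Each row is (source_code, target_code, weight).
CROSSMAP = [
    ("A1", "X", Fraction(1)),
    ("A2", "X", Fraction(1, 2)),
    ("A2", "Y", Fraction(1, 2)),
    ("A3", "Y", Fraction(1)),
]


def validate(crossmap, source_codes):
    """Step 2a (validate): every source code is mapped and its weights sum to one."""
    totals = defaultdict(Fraction)
    for source, _target, weight in crossmap:
        totals[source] += weight
    missing = sorted(set(source_codes) - set(totals))
    unbalanced = sorted(s for s, total in totals.items() if total != 1)
    if missing or unbalanced:
        raise ValueError(f"missing codes: {missing}; weights not summing to 1: {unbalanced}")


def apply_crossmap(crossmap, values):
    """Step 2b (implement): redistribute source values to target codes."""
    validate(crossmap, values)
    out = defaultdict(Fraction)
    for source, target, weight in crossmap:
        if source in values:
            out[target] += weight * values[source]
    return dict(out)


def describe(crossmap):
    """Step 3 (document): one readable line per mapping link."""
    return [f"{s} -> {t} (weight {w})" for s, t, w in crossmap]


harmonised = apply_crossmap(CROSSMAP, {"A1": 10, "A2": 4, "A3": 6})
```

Because the mapping is kept as data rather than buried in wrangling code, the same table drives validation, transformation and documentation. Using exact fractions instead of floats means the "weights sum to one" check is not affected by rounding.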

Current Approach: Input/Output Comparison

Proposed Alternative: Input & Function Capture

Implications (ASC23 Poster)

https://www.cynthiahqy.com/research/assets/asc-poster.pdf

Contributions

  • non-hierarchical provenance task abstraction (cf. Bors et al. 2019)
  • compare with inference/diff-based solutions:
    • smallsets, anteater, COMANTICS etc.
    • translation -> summary/sense-making
  • unified approach to:
    • transformation auditing (data quality)
    • transformation documentation (data provenance)
    • data imputation modelling (statistical robustness)

Future Work

Visualisation Concepts

Transformation Logic:

Merged Dataset Quality:

Provenance Communication & Tools

 

[Diagram: Domain Problem | Provenance Model | Visual Encoding | Interactive Tools]

 

  • layouts & ordering (similar to sankey diagram optimisation)
  • encoding distribution weights & relation styles
  • properties of merged dataset

Data Wrangling Tools

  • implementing graph, matrix & table representation in R, with symbolic (fractional) weights
  • extracting mapping logic from existing scripts
    • manipulate data input
    • parse AST into computational graph
  • authoring and auditing interfaces for non-technical collaborators
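
As a rough illustration of the first bullet, the same mapping can be held as a table, a graph or a matrix with exact fractional weights. The slides propose doing this in R. The sketch below uses Python only to show the idea, and the `TABLE` data and function names are hypothetical.

```python
from fractions import Fraction

# Table view: one row per (source, target, weight) link (hypothetical data).
TABLE = [
    ("A1", "X", Fraction(1)),
    ("A2", "X", Fraction(1, 3)),
    ("A2", "Y", Fraction(2, 3)),
]


def to_graph(table):
    """Graph view: source node -> {target node: weight}."""
    graph = {}
    for source, target, weight in table:
        graph.setdefault(source, {})[target] = weight
    return graph


def to_matrix(table):
    """Matrix view: rows are sources, columns are targets, zero where unlinked."""
    sources = sorted({s for s, _, _ in table})
    targets = sorted({t for _, t, _ in table})
    graph = to_graph(table)
    matrix = [[graph[s].get(t, Fraction(0)) for t in targets] for s in sources]
    return sources, targets, matrix


graph = to_graph(TABLE)
sources, targets, matrix = to_matrix(TABLE)
```

With exact fractions, a weight such as 1/3 stays symbolic instead of becoming 0.333..., so each matrix row can be checked to sum to exactly one.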

Other Projects

  • summary properties of the computational graph as a data imputation model
  • provenance model for causal loop diagrams
  • multiverse analysis of alternative mapping decisions
  • provenance abstraction for statistical sampling vs. database queries

References

Bors, Christian, John Wenskovitch, Michelle Dowling, Simon Attfield, Leilani Battle, Alex Endert, Olga Kulyk, and Robert S. Laramee. 2019. “A Provenance Task Abstraction Framework.” IEEE Computer Graphics and Applications 39 (6): 46–60. https://doi.org/10.1109/MCG.2019.2945720.
Kołczyńska, Marta. 2022. “Combining Multiple Survey Sources: A Reproducible Workflow and Toolbox for Survey Data Harmonization.” Methodological Innovations 15 (1): 62–72. https://doi.org/10.1177/20597991221077923.