Ex-Post Harmonisation and “Statistical” Data Provenance

SoDa Lab Meeting, LMU Munich, 16 Jun 2024

Cynthia A. Huang

Department of Econometrics and Business Statistics, Monash Business School


About Me!

  • 💱 Previously:
    • Economics at the University of Melbourne
    • Tutoring undergraduate economics
    • Assisting with data collection & curation for empirical economists
  • 👩🏻 Outside of Research:
    • 🧗🏻‍♀️ Climbing, 🧘🏻‍♀️ Yoga, 👩🏻‍🍳 Foodie
    • 🎙️ Regular host on The Random Sample podcast

  • 📊 Research Interests
    • 🌰 Statistically sound, well-documented and low-friction adaptation of “alternative” data for research purposes.
    • 🖇️ Data provenance models that capture both statistical decisions and computational implementation details.
  • 👩‍🎓 Thesis: Unified Statistical Principles and Computational Tools for Data Harmonisation and Provenance
    • Conceptual framework for redistributing numeric mass between categories in related statistical classifications
    • Software implementation in R

  • 📋 Collaborative work:
    • Review of Data Provenance approaches across CS and Statistics
    • Adapting web-scraped retail product & price data for public health research
    • Human in the Loop verification for data extraction from spreadsheets using Generative AI
  • 💡 Reproducible and reusable research and teaching tools:

Thesis Background & Motivation

Harmonising and Integrating Data

  • Opportunities to combine existing data for analysis abound,
  • The literature spans a spectrum from conceptual to applied,
  • with keywords such as data preprocessing, cleaning, fusion, integration, harmonisation etc.

Aspects of Ex-Post Harmonisation

Defining or selecting mappings between classifications or taxonomies,

Implementing and validating mappings on given data,

Documenting and analysing the implemented mapping.

Existing Conceptual Contributions

Existing Applied Contributions

Ex-Post Harmonisation of Aggregate Statistics

Stylised Example

Example: ANZSCO22 and ISCO8 Occupation Codes

Current Approach: Input/Output Comparison

Proposed Alternative: Input & Function Capture

Proposed Approach: Task Abstraction

The crossmap transform takes (data input):

  • a shared mass array: numeric values which form a conceptually shared mass and are indexed by a specific set of keys (e.g. occupation codes)

applies (function):

  • a redistribution of the numeric values onto a new set of index keys, according to a mapping between the source and target keys, the crossmap

and produces (output):

  • a counter-factual/imputed shared mass array indexed by the target keys
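
As a rough illustration of the input, function and output described above, here is a minimal sketch in Python. It is not the R implementation mentioned later in the talk. The function name `apply_crossmap`, the codes `A1`..`B2` and the counts are all hypothetical, and the weights are assumed to be the share of each source value sent to each target.

```python
from collections import defaultdict


def apply_crossmap(shared_mass, crossmap):
    """Redistribute values indexed by source keys onto target keys.

    shared_mass: dict mapping source key -> numeric value
    crossmap: list of (source_key, target_key, weight) links
    """
    out = defaultdict(float)
    for source, target, weight in crossmap:
        # Each link passes a share of the source value to its target
        out[target] += shared_mass[source] * weight
    return dict(out)


# Hypothetical codes and counts, for illustration only
counts = {"A1": 100.0, "A2": 40.0, "A3": 60.0}
links = [
    ("A1", "B1", 1.0),  # one-to-one recode
    ("A2", "B1", 0.5),  # A2 is split evenly between B1 and B2
    ("A2", "B2", 0.5),
    ("A3", "B2", 1.0),
]

harmonised = apply_crossmap(counts, links)
```

Under these assumptions the total mass is unchanged by the transform; only its allocation across keys differs.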

Insights from Equivalent Encodings

Crossmaps can be encoded as:

  • Computational graphs: multi-partite graph visualisation
  • Linear mappings: matrix multiplication constraints
  • Edge lists: rectangular data wrangling tools

🟢 Framework Implications


  • Domain Problem: Ex-Post Harmonisation
  • Provenance Model: Crossmaps
  • Documenting & Auditing
  • Interactive Tools
  • Data Imputation Models
  • Floating Point Computation
  • Visual Encoding
  • Sensitivity and Robustness Analysis


🟠 Conceptual and Statistical Implications

Crossmap (graph) properties could be used to quantify and explore:

  • How does the degree and extent of imputation differ between crossmaps?
  • How robust are downstream results to alternative harmonisation designs?
  • How much imputation has been performed on a given dataset with a given crossmap?
  • Which observations in a harmonised dataset have undergone the most (or least) transformation?
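
One possible way to make these questions concrete is sketched below in Python. The two summaries used here (the share of total mass that passes through split links, and the share of each target value that was imputed via a split) are hypothetical illustrations, not measures defined in the talk. The function name, codes and counts are likewise assumptions.

```python
def imputation_summary(shared_mass, crossmap):
    """Summarise how much mass moved through split (weight < 1) links.

    Returns (overall_share, per_target_share), where per_target_share maps
    each target key to the fraction of its value that came from split links.
    """
    total = sum(shared_mass.values())
    split_mass = 0.0
    target_total = {}
    target_split = {}
    for source, target, weight in crossmap:
        moved = shared_mass[source] * weight
        target_total[target] = target_total.get(target, 0.0) + moved
        if weight < 1:
            # Mass on a split link had to be apportioned, i.e. imputed
            split_mass += moved
            target_split[target] = target_split.get(target, 0.0) + moved
    per_target = {
        t: target_split.get(t, 0.0) / target_total[t]
        for t in target_total
        if target_total[t] > 0
    }
    return split_mass / total, per_target


# Hypothetical codes and counts, for illustration only
counts = {"A1": 100.0, "A2": 40.0, "A3": 60.0}
links = [
    ("A1", "B1", 1.0),
    ("A2", "B1", 0.5),
    ("A2", "B2", 0.5),
    ("A3", "B2", 1.0),
]
overall, by_target = imputation_summary(counts, links)
```

In this toy case a fifth of the total mass passes through split links, and target `B2` carries a larger imputed share than `B1`. Comparing such summaries across alternative crossmaps is one way the robustness questions above might be approached.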

🔵 Computational and Design Implications

  • data provenance documentation
    • multi-partite graph layouts
    • graph summaries
  • extracting mapping logic from existing scripts
    • manipulate data input
    • parse AST into computational graph
  • authoring and auditing interfaces
    • interactive (multi-table) data merging
    • workflow constraints (missing values etc.)

Discussion & Future Work

Current: Software Implementation


  • Presenting at UseR! (Jul 8-11)
  • Will be on CRAN (soon), with accompanying R Journal paper

Package goals:

  • implements graph, matrix & table representation in R, with symbolic (fractional) weights
  • worked examples in vignettes
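
A small sketch of why symbolic (fractional) weights can matter, written in Python with the standard `fractions` module as a stand-in for whatever the R package does internally. The ten-way equal split is a hypothetical example. The point is only that binary floating point weights can fail an exact "weights sum to one" check that exact fractions pass.

```python
from fractions import Fraction

n_targets = 10  # hypothetical: one source code split evenly across ten targets

float_weights = [0.1] * n_targets
exact_weights = [Fraction(1, 10)] * n_targets

float_total = sum(float_weights)
exact_total = sum(exact_weights)

# The float check fails because 0.1 has no exact binary representation
float_check = float_total == 1.0
# The fractional check passes because the arithmetic is exact
exact_check = exact_total == 1
```

With floating point weights, validation would need a tolerance; with fractional weights the constraint can be checked exactly.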

Soon: Review of Data Provenance Approaches

  • Joint work with PhD Candidate Francis Nguyen, supervised by Prof. Tamara Munzner at the InfoVis group in Dept. Computer Science, University of British Columbia
  • Aiming to describe approaches to data provenance across:
    • statistical theory
    • statistical computing
    • database systems
    • data analytics and visualisation

Publication Venues?

🤔 Where to publish & share work on data harmonisation, provenance and quality?

  • Data Science: ACM/IMS Journal of Data Science*, Harvard Data Science Review, ???
  • CS/HCC: IEEE VIS*, CHI, ???
  • Statistics & Statistical Programming: R Journal*, JSS (Journal of Statistical Software), JCGS (Journal of Computational and Graphical Statistics)
  • Applied Venues: e.g. “Data Reviews” in Australian Economic Review

Thanks for Listening!

Connect with me (and other cool Monash folks):

  • 🇩🇪 LMU until Weds, June 26
  • 🇦🇹 UseR!, Salzburg (Jul 8-11)
  • 🇺🇸 JSM, Portland (Aug 3-9)
  • 🇺🇸 posit::conf(2024), Seattle (Aug 12-14)
  • 🇨🇦 UBC, Vancouver (Jul-Nov)
  • 🌏 ???, March 2025 onwards…

Or online: @cynthiahqy & cynthiahqy.com


