Ex-Post Harmonisation and “Statistical” Data Provenance

SoDa Lab Meeting, LMU Munich, 16 Jun 2024

Cynthia A. Huang

Department of Econometrics and Business Statistics, Monash Business School

Introduction

About Me!

  • 💱 Previously:
    • Economics at the University of Melbourne
    • Tutoring undergraduate economics
    • Assisting with data collection & curation for empirical economists
  • 👩🏻 Outside of Research:
    • 🧗🏻‍♀️ Climbing, 🧘🏻‍♀️ Yoga, 👩🏻‍🍳 Foodie
    • 🎙️ Regular host on The Random Sample podcast

About Me!

  • 📊 Research Interests
    • 🌰 Statistically sound, well-documented and low-friction adaptation of “alternative” data for research purposes.
    • 🖇️ Data provenance models that capture both statistical decisions and computational implementation details.
  • 👩‍🎓 Thesis: Unified Statistical Principles and Computational Tools for Data Harmonisation and Provenance
    • Conceptual framework for redistributing numeric mass between categories in related statistical classifications
    • Software implementation in R

About Me!

  • 📋 Collaborative work:
    • Review of Data Provenance approaches across CS and Statistics
    • Adapting web-scraped retail product & price data for public health research
    • Human-in-the-loop verification for data extraction from spreadsheets using Generative AI
  • 💡 Reproducible and reusable research and teaching tools:

Thesis Background & Motivation

Harmonising and Integrating Data

  • Opportunities to combine existing data for analysis abound.
  • The literature spans a spectrum from conceptual to applied, under keywords such as data preprocessing, cleaning, fusion, integration and harmonisation.

Aspects of Ex-Post Harmonisation

Defining or selecting mappings between classifications or taxonomies,

Implementing and validating mappings on given data,

Documenting and analysing the implemented mapping.
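
The three aspects above can be sketched on a toy example. This is only an illustrative sketch, not the thesis software (which is implemented in R): the category codes, weights, and function names below are hypothetical. It assumes a mapping is stored as a table of (source, target, weight) links, where each weight is the share of a source category's value sent to a target category.

```python
from collections import defaultdict

# 1. Define a mapping (hypothetical codes and weights):
#    (source_code, target_code) -> share of the source value sent to target.
crosswalk = {
    ("A1", "X"): 1.0,   # A1 maps wholly to X
    ("A2", "X"): 0.5,   # A2 is split evenly between X and Y
    ("A2", "Y"): 0.5,
    ("A3", "Y"): 1.0,   # A3 maps wholly to Y
}

# Hypothetical aggregate values recorded under the source classification.
source_values = {"A1": 100.0, "A2": 40.0, "A3": 60.0}


def validate(crosswalk, source_values, tol=1e-9):
    """2a. Validate: weights are non-negative and, for every source
    category present in the data, sum to one (no mass lost or created)."""
    totals = defaultdict(float)
    for (src, _tgt), weight in crosswalk.items():
        if weight < 0:
            raise ValueError(f"negative weight for source {src!r}")
        totals[src] += weight
    for src in source_values:
        if abs(totals.get(src, 0.0) - 1.0) > tol:
            raise ValueError(f"weights for source {src!r} do not sum to 1")


def apply_crosswalk(crosswalk, source_values):
    """2b. Implement: redistribute each source value across its targets."""
    validate(crosswalk, source_values)
    out = defaultdict(float)
    for (src, tgt), weight in crosswalk.items():
        if src in source_values:
            out[tgt] += weight * source_values[src]
    return dict(out)


harmonised = apply_crosswalk(crosswalk, source_values)

# 3. Document: the crosswalk table itself records every link and weight
#    used, so the transformation can be inspected after the fact.
for (src, tgt), weight in sorted(crosswalk.items()):
    print(f"{src} -> {tgt}: {weight:.2f}")
print(harmonised)
```

Under these assumptions the total across target categories equals the total across source categories, which is the property the validation step is meant to protect.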

Existing Conceptual Contributions

Existing Applied Contributions

Ex-Post Harmonisation of Aggregate Statistics

Stylised Example