Department of Econometrics and Business Statistics, Monash University
supervised by Rob J Hyndman, Sarah Goodwin and Simon Angus
Overview
Background & Motivation
Ex-Post Harmonisation
Occupation Codes (ANZSCO22) Example
Proposed Contributions
Cross-Taxonomy Transformation
Crossmap Information Structure
Crossmap Visualisations
Implications and Future Work
Validate data quality and document preprocessing decisions
Explore imputation properties
Background & Motivation
When do we encounter category recoding and redistribution?
Ex-Post Harmonisation 1/2
Ex-post (or retrospective) data harmonization refers to procedures applied to already collected data to improve the comparability and inferential equivalence of measures from different studies (Kołczyńska 2022)
Ex-Post Harmonisation 2/2
Defining or selecting mappings between classifications or taxonomies,
Implementing and validating mappings on given data,
Documenting and analysing the implemented mappings.
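The validation step can be sketched in a few lines of base R. This is a minimal sketch, not the author's implementation: the `crossmap` table, its links and its weights are illustrative assumptions, not an official ANZSCO-ISCO concordance. It checks that the outgoing weights of each source code sum to one and that every observed source code is covered by the mapping.

```r
# Hypothetical crossmap as an edge list: one row per source-target link.
# Links and weights are illustrative assumptions, not an official concordance.
crossmap <- data.frame(
  anzsco22 = c("111111", "111211", "111211", "111211",
               "111212", "111311", "111312", "111399"),
  isco8    = c("1120", "1112", "1114", "1120",
               "0110", "1111", "1111", "1111"),
  weight   = c(1, 0.25, 0.25, 0.5, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Source codes observed in the data to be transformed (taken from Table 1).
observed_codes <- c("111111", "111211", "111212",
                    "111311", "111312", "111399")

# Check 1: outgoing weights sum to one for every source code,
# so no source value is lost or double-counted.
weight_sums <- tapply(crossmap$weight, crossmap$anzsco22, sum)
weights_ok <- all(abs(weight_sums - 1) < 1e-8)

# Check 2: every observed source code has at least one link in the crossmap.
coverage_ok <- all(observed_codes %in% crossmap$anzsco22)

stopifnot(weights_ok, coverage_ok)
```

Running the checks before the transformation means a faulty mapping fails loudly instead of silently changing totals.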
Example Harmonisation 1/3
Typical cases: labour statistics, macroeconomic and trade data, census and election data.
Table 1: Stylised ANZSCO22 occupation counts from a total of 2000 observed individuals
anzsco22  anzsco22_descr                        count
111111    Chief Executive or Managing Director   1000
111211    Corporate General Manager               500
111212    Defence Force Senior Officer             40
111311    Local Government Legislator             300
111312    Member of Parliament                    150
111399    Legislators nec                          10
Example Harmonisation 2/3
Australian and New Zealand Standard Classification of Occupations (ANZSCO)
# A tibble: 6 × 2
  anzsco22 anzsco22_descr
  <chr>    <chr>
1 111111   Chief Executive or Managing Director
2 111211   Corporate General Manager
3 111212   Defence Force Senior Officer
4 111311   Local Government Legislator
5 111312   Member of Parliament
6 111399   Legislators nec
International Standard Classification of Occupations (ISCO)
# A tibble: 5 × 2
  isco8 isco8_descr
  <chr> <chr>
1 1112  Senior government officials
2 1114  Senior officials of special-interest organizations
3 1120  Managing directors and chief executives
4 0110  Commissioned armed forces officers
5 1111  Legislators
Example Harmonisation 3/3
Possible “relations” from source ANZSCO22 to target ISCO8 codes
Use visual channels such as layout/ordering, text style, line style, colour saturation, and annotations to highlight key preprocessing decisions:
which data are split vs. not split?
what are the split proportions?
what is the composition of the transformed data?
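The numbers behind these three questions can be derived directly from a crossmap and the source data. The base R sketch below uses the Table 1 counts together with a hypothetical crossmap; the links and weights are assumptions made for illustration only.

```r
# Table 1 counts.
counts <- data.frame(
  anzsco22 = c("111111", "111211", "111212", "111311", "111312", "111399"),
  count    = c(1000, 500, 40, 300, 150, 10),
  stringsAsFactors = FALSE
)

# Hypothetical crossmap (illustrative links and weights only).
crossmap <- data.frame(
  anzsco22 = c("111111", "111211", "111211", "111211",
               "111212", "111311", "111312", "111399"),
  isco8    = c("1120", "1112", "1114", "1120",
               "0110", "1111", "1111", "1111"),
  weight   = c(1, 0.25, 0.25, 0.5, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Which data are split? A source code is split if it has more than one link.
crossmap$n_links  <- ave(crossmap$weight, crossmap$anzsco22, FUN = length)
crossmap$is_split <- crossmap$n_links > 1

# What are the split proportions? Keep only the links of split source codes.
split_props <- crossmap[crossmap$is_split, c("anzsco22", "isco8", "weight")]

# What is the composition of the transformed data?
# Cross-tabulate the redistributed counts by target (rows) and source (columns).
joined <- merge(counts, crossmap, by = "anzsco22")
joined$part <- joined$count * joined$weight
composition <- xtabs(part ~ isco8 + anzsco22, data = joined)
```

Under these assumed weights, only 111211 is split, and target 1120 is composed of 1000 observations from 111111 plus 250 from 111211.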
Implications & Future Work
What else can the crossmap approach reveal or illuminate?
Tracking and Quantifying Data Imputation
Valid transformation logic doesn’t guarantee the quality or usability of transformed data
Crossmaps allow us to visualise the extent and quantify the degree of imputation
We can also extract transformation logic from existing scripts
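One possible way to quantify the degree of imputation is the share of transformed values that arrive through split links (weight below one). This is an assumed measure chosen for illustration, not a definition from the talk, and the crossmap links and weights are again hypothetical.

```r
# Table 1 counts.
counts <- data.frame(
  anzsco22 = c("111111", "111211", "111212", "111311", "111312", "111399"),
  count    = c(1000, 500, 40, 300, 150, 10),
  stringsAsFactors = FALSE
)

# Hypothetical crossmap (illustrative links and weights only).
crossmap <- data.frame(
  anzsco22 = c("111111", "111211", "111211", "111211",
               "111212", "111311", "111312", "111399"),
  isco8    = c("1120", "1112", "1114", "1120",
               "0110", "1111", "1111", "1111"),
  weight   = c(1, 0.25, 0.25, 0.5, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

joined <- merge(counts, crossmap, by = "anzsco22")
joined$part <- joined$count * joined$weight

# Assumed measure: a contribution counts as imputed if its link weight is
# below one, i.e. the source value was redistributed rather than recoded.
joined$imputed <- joined$weight < 1

# Overall share of the transformed total that is imputed.
overall_share <- sum(joined$part[joined$imputed]) / sum(joined$part)

# Share of each target category's value that is imputed.
total_by_target   <- tapply(joined$part, joined$isco8, sum)
imputed_by_target <- tapply(joined$part * joined$imputed, joined$isco8, sum)
share_by_target   <- imputed_by_target / total_by_target
```

With these assumed weights, a quarter of all observations are redistributed, targets 1112 and 1114 consist entirely of imputed values, and target 1120 is one-fifth imputed.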
Comprehension of Preprocessing Decisions
isic-non-std-split.R [59 lines]
# function: split values between isic code in isiccomb group
split_isiccomb <- function(threefour_df) {
  #' Helper function to split isiccomb values across isic codes
  #' @param threefour_df df with 3/4 digit values across isic & isiccomb

  # make list for interim tables
  interim <- list()

  # extract rows with isiccomb codes
  interim$isiccomb.rows <- threefour_df %>%
    filter(., str_detect(isiccomb, "[:alpha:]"))

  # test that we are not losing any data through spliting
  test_that("No `country,year` has more than one recorded `value` per `isiccomb` group", {
    rows_w_many_values_per_isiccomb <- interim$isiccomb.rows %>%
      group_by(country, year, isiccomb) %>%
      ## get no of recorded (not NA) values for given `country, year, isiccomb`
      summarise(n_obs = sum(!is.na(value))) %>%
      filter(n_obs != 1) %>%
      nrow()
    expect_true(rows_w_many_values_per_isiccomb == 0)
  })

  # calculate average value over isiccomb group for each country, year
  interim$isiccomb.avg <- interim$isiccomb.rows %>%
    # group isiccomb rows, replace na with 0 for averaging
    group_by(country, year, isiccomb) %>%
    mutate(value = replace_na(value, 0)) %>%
    # split combination value over standard isic codes in isiccomb group
    summarise(
      avg.value = mean(value),
      ## checking variables
      n_isic = n_distinct(isic),
      n_rows = n()
    ) %>%
    mutate(row_check = (n_isic == n_rows))

  # return(interim$isiccomb.avg)

  ## check n_isic == n_rows
  test_that("isiccomb split average is calculated with correct denominator", {
    expect_true(all(interim$isiccomb.avg$row_check))
  })

  # output processed data
  final <- left_join(threefour_df, interim$isiccomb.avg, by = c("country", "year", "isiccomb")) %>%
    rename(value.nosplit = value) %>%
    mutate(
      value = coalesce(avg.value, value.nosplit),
      split.isiccomb = !is.na(avg.value)
    ) %>%
    select(country, year, isic, isiccomb, value, value.nosplit, split.isiccomb) # not checking variables

  return(final)
}
We can use quantitative user studies to explore:
Which representations of transformation logic are best for communicating data preprocessing decisions?
Are crossmap visualisations more easily interpreted than code or table representations?
Does effectiveness differ by audience (e.g. replicators, peer reviewers, non-technical domain experts)?
Thanks! Any Questions?
Final remarks
Ex-Post Harmonisation is a complex form of data imputation!
Visualisation can be used to communicate important data preprocessing decisions
Designing visualisations can also lead to new statistical insights
I’m looking for (imputation) case studies and (comprehension) experiment participants!
Hulliger, Beat. 1998. “Linking of Classifications by Linear Mappings.” Journal of Official Statistics 14 (January): 255–66.
Kołczyńska, Marta. 2022. “Combining Multiple Survey Sources: A Reproducible Workflow and Toolbox for Survey Data Harmonization.” Methodological Innovations 15 (1): 62–72. https://doi.org/10.1177/20597991221077923.
Zhou, Xiantian, and Carlos Ordonez. 2020. “Matrix Multiplication with SQL Queries for Graph Analytics.” In 2020 IEEE International Conference on Big Data (Big Data), 5872–73. Atlanta, GA, USA: IEEE. https://doi.org/10.1109/BigData50022.2020.9378275.
Acknowledgements
Thank you to Laura Puzzello for her ongoing support and funding of earlier iterations of this work. Many thanks also to Rob Hyndman, Sarah Goodwin, Simon Angus, Patrick Li, Emi Tanaka and my other colleagues at Monash EBS and Monash SoDa Labs for their helpful guidance, feedback and suggestions. The author is supported in part by top-up scholarships from Monash Data Futures Institute and the Statistical Society of Australia.
Appendix
Equivalent Representation Definitions
Crossmaps can be represented or conceptualised in the following forms:
Weighted Bi-Partite Graph
Edge weights represent the proportion of source node value to be redistributed to target node.
Linear Mapping / Bi-Adjacency Matrix
Makes explicit non-correspondence between source-target pairs (represented as zeroes). The transformation matrix has the same constraints as a Markov chain transition matrix.
Edge List Table / Adjacency List
Facilitates implementation of data transformations using database join, mutate and summarise operations.
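The equivalence of the edge list and matrix forms can be checked on the stylised ANZSCO22 counts. The base R sketch below applies one hypothetical crossmap both ways; the links and weights are illustrative assumptions, and `merge()` and `tapply()` stand in for the join and summarise operations mentioned above.

```r
# Table 1 counts, named by ANZSCO22 source code.
x <- c("111111" = 1000, "111211" = 500, "111212" = 40,
       "111311" = 300, "111312" = 150, "111399" = 10)

# Edge list form. Links and weights are illustrative assumptions.
crossmap <- data.frame(
  anzsco22 = c("111111", "111211", "111211", "111211",
               "111212", "111311", "111312", "111399"),
  isco8    = c("1120", "1112", "1114", "1120",
               "0110", "1111", "1111", "1111"),
  weight   = c(1, 0.25, 0.25, 0.5, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Edge list route: join, mutate, summarise.
counts <- data.frame(anzsco22 = names(x), count = unname(x),
                     stringsAsFactors = FALSE)
joined <- merge(counts, crossmap, by = "anzsco22")   # join
joined$part <- joined$count * joined$weight          # mutate
y_edges <- tapply(joined$part, joined$isco8, sum)    # summarise

# Matrix route: build the bi-adjacency matrix from the same edge list.
# Zero entries make non-correspondence between codes explicit.
B <- matrix(0, nrow = length(x), ncol = length(y_edges),
            dimnames = list(names(x), names(y_edges)))
B[cbind(crossmap$anzsco22, crossmap$isco8)] <- crossmap$weight

# Same constraint as a Markov transition matrix: every row sums to one.
rows_sum_to_one <- all(abs(rowSums(B) - 1) < 1e-8)

# Transformed counts are t(B) %*% x.
y_matrix <- drop(crossprod(B, x))
```

Both routes give the same ISCO8 counts (for example, 1250 for code 1120 under these assumed weights), and the total of 2000 observations is preserved because every row of `B` sums to one.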
Future Visualisation Work
Scaling to larger crossmaps via existing graph visualisation tools and idioms
Interactivity (e.g. tooltips for category description labels)
Filter by graph properties (e.g. leave out one-to-unique links)
Visualising multiple related crossmaps?
Multiple crossmaps simultaneously (e.g. for multiple countries in the same year)
Multiple crossmaps sequentially (e.g. for multiple years)