Evaluating Record Linkage Algorithms Using Complete-Count U.K. Census Data

Krzysztof Karbownik, Northwestern University
Anthony Wray, University of Southern Denmark

As complete-count censuses have become increasingly available, researchers have developed methods for linking individuals across data sets, including fully automated (Abramitzky et al. 2019), machine learning (Feigenbaum, 2016), and human decision-intensive (Bailey et al. 2019) approaches, or a combination of the three (Price et al. 2019). Because applications of these linkage methods have primarily involved U.S. records, it is unclear whether the algorithms perform equally well in contexts where the distribution of name commonness, the quality of transcription, and the availability of linking variables differ. We evaluate these linkage methods when applied to complete-count censuses from Britain, which has the second-largest historical census microdata collection (Ruggles, 2014). U.K. censuses differ from U.S. censuses in important ways that present additional challenges for data linkage. First and foremost, U.K. censuses report both county and parish of birth, whereas U.S. censuses report birthplace only at the state level. While state and county are comparable in terms of data quality and number of unique entries, the birth parish field contains over 500,000 unique strings representing over 10,000 parishes, whose names and boundaries change over time. We develop modified versions of the standard linkage algorithms that account for changes in geographic units over time. Furthermore, we consider the application of record-linkage algorithms to family history research, where the objective is to link families across censuses and algorithms can therefore include household-level matching variables such as the identity and birthplaces of household members. Standard linking algorithms ignore such contextual variables because of concerns about the representativeness of the constructed sample, but this information can increase the match rate and reduce false positives.
Moreover, when estimating sibling fixed-effects models, the use of household-level variables will improve match performance while leaving the internal validity of within-household estimates unaffected (Karbownik and Wray, 2019).
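
As a rough illustration of how household-level variables might break ties between otherwise identical candidate links, the sketch below blocks on name, birth year, and a normalized birth parish, then ranks candidates by how many co-resident household members they share. Everything here is a hypothetical assumption made for illustration: the `Person` layout, the `PARISH_ALIASES` lookup, the two-year age tolerance, and the scoring rule. It is not the algorithm evaluated in the paper.

```python
from dataclasses import dataclass

# Hypothetical lookup from raw parish strings to one canonical parish name.
PARISH_ALIASES = {
    "st. mary lambeth": "lambeth",
    "lambeth st mary": "lambeth",
}

def normalize_parish(raw: str) -> str:
    """Lowercase, collapse whitespace, then map known variants to one name."""
    key = " ".join(raw.lower().split())
    return PARISH_ALIASES.get(key, key)

@dataclass(frozen=True)
class Person:
    first: str
    last: str
    birth_year: int
    parish: str
    # (first name, birth year) pairs for co-resident household members
    household: frozenset = frozenset()

def is_candidate(a: Person, b: Person, year_tol: int = 2) -> bool:
    """Individual-level blocking: same name, similar age, same parish."""
    return (
        a.first == b.first
        and a.last == b.last
        and abs(a.birth_year - b.birth_year) <= year_tol
        and normalize_parish(a.parish) == normalize_parish(b.parish)
    )

def household_overlap(a: Person, b: Person, year_tol: int = 2) -> int:
    """Count members of a's household with a same-named, similar-aged member in b's."""
    return sum(
        any(n1 == n2 and abs(y1 - y2) <= year_tol for n2, y2 in b.household)
        for n1, y1 in a.household
    )

def link(record: Person, pool: list):
    """Return the single best-scoring candidate, or None if there is none or a tie."""
    candidates = [p for p in pool if is_candidate(record, p)]
    if not candidates:
        return None
    scored = sorted(candidates, key=lambda p: household_overlap(record, p), reverse=True)
    if len(scored) > 1 and household_overlap(record, scored[0]) == household_overlap(record, scored[1]):
        return None  # still ambiguous, so decline to link
    return scored[0]
```

In this sketch, two candidates who agree on name, age, and parish are indistinguishable without household information and `link` returns `None`; once co-residents are supplied, the candidate with the larger household overlap is chosen.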

No extended abstract or paper available

Presented in Session 131. Evaluating Record Linkage Methods