Historical record linking
Census Linking Project
We are excited to share our Census Linking Project, which posts links between all historical censuses (1850-1940) using various linking algorithms. Our goal is to reduce barriers and open up new research possibilities by providing customizable linked historical datasets to the broad research community. We hope that the availability of these cross-walks will cut down on the use of NBER server time by users of the complete-count census data. We are grateful to the Economic History Association for awarding us in 2024 the Engerman-Goldin Prize for "contributions made in the previous six years" in "creating, compiling, and sharing data and information with scholars."
This webpage is designed to discuss various approaches to historical record linking and to share the codes required to implement these approaches. We also post here replications of our published papers, using the newly-available complete-count census data and employing newly-developed matching methods. Because the field of historical record linking is rapidly evolving, we will be updating this page regularly with code that implements the latest approaches.
If you have any questions or feedback about the code, please be in touch with us at ranabr@stanford.edu (Ran Abramitzky), lboustan@princeton.edu (Leah Boustan), and/or kaeriksson@ucdavis.edu (Katherine Eriksson).
Linking approaches and codes
Here we introduce the various approaches to historical record linkage and the codes that implement them.
1. The ABE fully automated approach. This approach (Abramitzky, Boustan and Eriksson (ABE) 2012, 2014, 2017) is a fully automated method for linking historical datasets (e.g. complete-count Censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact. Here are the codes to implement the ABE approach. And here is R code to implement the ABE approach (developed by Ugur Yildirim). Coming soon: a fully automated approach that uses extra information in linking.
2. A fully automated probabilistic approach. This approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine the distances in reported names and ages between each pair of potentially matching records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis. Here are the codes and Stata command to implement this approach.
3. A machine learning approach. This approach (due to Feigenbaum 2016) teaches an algorithm to replicate how a well-trained and consistent researcher would create a linked sample across sources. The method begins by extracting a subset of possible matches for each record, and then uses training data to tune a matching algorithm that attempts to minimize both false positives and false negatives. For information about the implementation of this approach, please contact James Feigenbaum.
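To make the logic of approach 1 concrete, here is a minimal Python sketch of an ABE-style link. It is an illustration only, not the code posted on this page. The record fields (id, first, last, birth_year), the function names, and the default birth-year band of plus or minus two years are all assumptions of the sketch. It matches on lightly cleaned raw names only; NYSIIS standardization and Jaro-Winkler distance are left out, as is any other matching information the full method may use.

```python
from collections import Counter, defaultdict


def clean(name):
    """Lowercase a name and keep letters only (a stand-in for fuller standardization)."""
    return "".join(ch for ch in name.lower() if ch.isalpha())


def link_one_way(source, target, max_gap=2):
    """Return {source_id: target_id} for source records with exactly one candidate.

    Records are dicts with hypothetical keys 'id', 'first', 'last', 'birth_year'.
    """
    def key(rec):
        return (clean(rec["first"]), clean(rec["last"]), rec["birth_year"])

    # Index target records by cleaned name and birth year.
    index = defaultdict(list)
    for rec in target:
        index[key(rec)].append(rec["id"])

    # Source records that share a name and birth year are not linked at all.
    source_counts = Counter(key(rec) for rec in source)

    links = {}
    for rec in source:
        first, last, year = key(rec)
        if source_counts[(first, last, year)] > 1:
            continue
        # Try an exact birth-year match first, then widen the band one year at a time.
        for gap in range(max_gap + 1):
            years = sorted({year - gap, year + gap})
            candidates = [tid for y in years for tid in index.get((first, last, y), [])]
            if len(candidates) == 1:
                links[rec["id"]] = candidates[0]
            if candidates:
                break  # stop at the first band with any candidate; ties are dropped
    return links


def link_both_ways(a, b, max_gap=2):
    """Keep only links found in both directions (a to b and b to a)."""
    forward = link_one_way(a, b, max_gap)
    backward = link_one_way(b, a, max_gap)
    return {s: t for s, t in forward.items() if backward.get(t) == s}


# Toy data, invented for illustration.
census_a = [
    {"id": "a1", "first": "John", "last": "Smith", "birth_year": 1870},
    {"id": "a2", "first": "Mary", "last": "Jones", "birth_year": 1875},
    {"id": "a3", "first": "Peter", "last": "Olsen", "birth_year": 1880},
]
census_b = [
    {"id": "b1", "first": "john", "last": "smith", "birth_year": 1870},
    {"id": "b2", "first": "Mary", "last": "Jones", "birth_year": 1876},
    {"id": "b3", "first": "Peter", "last": "Olsen", "birth_year": 1880},
    {"id": "b4", "first": "Peter", "last": "Olsen", "birth_year": 1880},
]
print(link_both_ways(census_a, census_b))
```

Setting max_gap=0 gives the stricter variant that requires an exact match on age. In the toy data, the two Peter Olsen records in the second census are tied, so that person is left unlinked rather than guessed.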
Here is our paper that evaluates and discusses Best Practices for Automated Linking of Historical Data (comments are very welcome!).
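As a rough illustration of the EM step in approach 2 above, the Python sketch below fits a two-class mixture (match versus non-match) to candidate record pairs. It is a simplified stand-in, not the posted code or Stata command. It assumes each pair has already been reduced to 0/1 agreement indicators (for example: first names close, last names close, ages equal), and that these indicators are independent within each class. The published method works with distances in names and ages rather than binary flags. The function name, starting values, and toy data are all hypothetical.

```python
def _clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1.0 - eps)


def _posterior(pair, p, m, u):
    """Probability that a pair is a true match, given the current parameters."""
    like_match, like_non = p, 1.0 - p
    for k, agree in enumerate(pair):
        like_match *= m[k] if agree else 1.0 - m[k]
        like_non *= u[k] if agree else 1.0 - u[k]
    return like_match / (like_match + like_non)


def em_match_probabilities(pairs, n_iter=200, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate match probabilities for candidate pairs with the EM algorithm.

    pairs: list of equal-length tuples of 0/1 agreement indicators.
    Returns (posteriors, p, m, u), where p is the estimated share of matches,
    m[k] is P(agree on k | match) and u[k] is P(agree on k | non-match).
    """
    n, n_features = len(pairs), len(pairs[0])
    m = [m_init] * n_features
    u = [u_init] * n_features
    for _ in range(n_iter):
        # E-step: posterior match probability for every pair.
        w = [_posterior(g, p, m, u) for g in pairs]
        # M-step: re-estimate parameters as weighted averages.
        total = sum(w)
        p = _clamp(total / n)
        for k in range(n_features):
            agree_match = sum(wi * g[k] for wi, g in zip(w, pairs))
            agree_non = sum((1.0 - wi) * g[k] for wi, g in zip(w, pairs))
            m[k] = _clamp(agree_match / total)
            u[k] = _clamp(agree_non / (n - total))
    return [_posterior(g, p, m, u) for g in pairs], p, m, u


# Toy agreement patterns, invented for illustration.
toy_pairs = (
    [(1, 1, 1)] * 20 + [(0, 0, 0)] * 60 + [(1, 0, 0)] * 8
    + [(0, 1, 0)] * 8 + [(0, 0, 1)] * 8 + [(1, 1, 0)] * 4
)
probs, share, m, u = em_match_probabilities(toy_pairs)
print(round(probs[0], 3), round(probs[20], 3), round(share, 3))
```

A decision rule would then act on the estimated probabilities, for example by keeping a pair only if its probability exceeds a chosen threshold and clearly beats the next-best candidate for the same record. The specific rule and threshold are choices for the analyst and are not fixed by this sketch.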
Self-replication
Our published papers were based on smaller samples available at the time and on "first generation" matching methods. Here we post replications of our published papers using newly-available data and employing newly-developed matching methods. Specifically, complete-count Census data have recently been made available through a partnership with Ancestry.com and the Minnesota Population Center and can now be accessed via the NBER. Scholars with interest in the complete-count US Census data should contact the Minnesota Population Center to inquire about access.
We replicate the following papers using approaches 1 and 2 above. The specific matching methods we used in all replications are outlined in detail here.
1. To the New World and Back Again: Return Migrants in the Age of Mass Migration (ILRR 2017)
Match Norwegian-born men aged 28-45 and living in either Norway or the US in 1910 to the 1900 Norwegian Census.
2. Europe’s Tired, Poor, Huddled Masses: Self-Selection and Economic Outcomes in the Age of Mass Migration (AER 2012)
Match Norwegian-born men from the 1900 US Census and the 1900 Norwegian Census back to their childhood household in the 1865 Norwegian Census.
3. A Nation of Immigrants: Assimilation and Economic Outcomes in the Age of Mass Migration (JPE 2014)
Match immigrants from 16 sending countries and a comparison group of natives from the 1900 Census to the 1910 Census and the 1920 Census.
4. Have the Poor Always Been Less Likely to Migrate? (JDE 2013)
Match Norwegian-born men from the 1900 US Census and the 1900 Norwegian Census back to their childhood household in the 1865 Norwegian Census.