
Historical record linking

This webpage is designed to discuss various approaches to historical record linking and to share the codes required to implement these approaches. We also post here replications of our published papers, using the newly-available complete-count census data and employing newly-developed matching methods. Because the field of historical record linking is rapidly evolving, we will be updating this page regularly with code that implements the latest approaches.

If you have any questions or feedback about the code, please be in touch with us at ranabr@stanford.edu (Ran Abramitzky),  lboustan@princeton.edu (Leah Boustan), and/or kaeriksson@ucdavis.edu (Katherine Eriksson).

 

Matching approaches and codes

Here we introduce the various approaches to historical record linkage and the codes that implement them. 

1. The basic fully automated approach. We suggest a fully automated method for linking historical datasets (e.g. complete-count Censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, we suggest testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, we suggest testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact. Here are the codes to implement this approach.
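The logic of the basic approach can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the posted codes: the record fields (`id`, `first`, `last`, `birth_year`), the function names, and the choice to widen the age gap one year at a time are all hypothetical, and `clean_name` is a simple stand-in for NYSIIS standardization or a Jaro-Winkler comparison.

```python
from collections import Counter, defaultdict


def clean_name(name):
    """Lowercase and keep letters only (a stand-in for NYSIIS or similar)."""
    return "".join(ch for ch in name.lower() if ch.isalpha())


def index_by_name(records):
    """Group records by their cleaned (first, last) name pair."""
    index = defaultdict(list)
    for rec in records:
        index[(clean_name(rec["first"]), clean_name(rec["last"]))].append(rec)
    return index


def has_rival(rec, group, window):
    """True if another same-name record is born within `window` years."""
    return any(
        other is not rec and abs(other["birth_year"] - rec["birth_year"]) <= window
        for other in group
    )


def link(source, target, max_age_gap=2, unique_window=2):
    """Return (source_id, target_id) pairs.

    Assumed rule: a record must be unique by name within +/- unique_window
    years in its own dataset; candidates are tried at age gap 0, then 1, ...
    and the first gap with exactly one candidate yields the link.
    """
    src_index = index_by_name(source)
    tgt_index = index_by_name(target)
    links = []
    for key, src_group in src_index.items():
        candidates = tgt_index.get(key, [])
        for rec in src_group:
            if has_rival(rec, src_group, unique_window):
                continue  # not unique in the source dataset
            for gap in range(max_age_gap + 1):
                hits = [c for c in candidates
                        if abs(c["birth_year"] - rec["birth_year"]) == gap]
                if len(hits) == 1 and not has_rival(hits[0], candidates, unique_window):
                    links.append((rec["id"], hits[0]["id"]))
                if hits:
                    break  # linked, or ambiguous at this gap: stop widening
    # Drop any target record claimed by more than one source record.
    claimed = Counter(t for _, t in links)
    return [(s, t) for s, t in links if claimed[t] == 1]


source = [
    {"id": "a1", "first": "John", "last": "Smith", "birth_year": 1870},
    {"id": "a2", "first": "Ole", "last": "Hansen", "birth_year": 1865},
    {"id": "a3", "first": "Ole", "last": "Hansen", "birth_year": 1866},
    {"id": "a4", "first": "Anna", "last": "Berg", "birth_year": 1880},
]
target = [
    {"id": "b1", "first": "JOHN", "last": "Smith", "birth_year": 1871},
    {"id": "b2", "first": "Ole", "last": "Hansen", "birth_year": 1865},
    {"id": "b3", "first": "Anna", "last": "Berg", "birth_year": 1880},
    {"id": "b4", "first": "Anna", "last": "Berg", "birth_year": 1895},
]

links_banded = link(source, target)                # age may differ by up to 2 years
links_exact = link(source, target, max_age_gap=0)  # exact-age robustness check
```

In this toy data the two Ole Hansen records are dropped because they are not unique within the five-year window, and setting `max_age_gap=0` corresponds to the exact-age robustness check.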

2. A fully automated probabilistic approach. We suggest a fully automated probabilistic method for linking historical datasets. For each pair of potentially matching records, we combine the distances in reported names and ages into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis. Here are the codes and Stata command to implement this approach.
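To make the EM step concrete, here is a minimal sketch of a two-class mixture (match versus non-match) fitted to discretized distances. It is not the posted Stata command. It assumes that each candidate pair has already been reduced to one discrete level per field (0 = closest agreement), that the fields are conditionally independent given match status, and that the starting values below are reasonable. The function name, the level coding, and the toy data are all hypothetical.

```python
def normalize(weights):
    """Scale a list of positive weights so that it sums to one."""
    total = sum(weights)
    return [w / total for w in weights]


def em_match_probabilities(pairs, n_levels, iterations=50, smoothing=1e-6):
    """Estimate P(match | distances) for each candidate pair.

    pairs    : list of tuples, one discrete distance level per field
    n_levels : number of possible levels for each field
    """
    n_fields = len(n_levels)
    p_match = 0.1  # assumed starting share of true matches
    # Starting guesses: matches favor low levels, non-matches high levels.
    m = [normalize([k_max - k for k in range(k_max)]) for k_max in n_levels]
    u = [normalize([k + 1 for k in range(k_max)]) for k_max in n_levels]
    posteriors = []
    for _ in range(iterations):
        # E-step: posterior match probability for each pair.
        posteriors = []
        for pair in pairs:
            like_m = p_match
            like_u = 1.0 - p_match
            for f, level in enumerate(pair):
                like_m *= m[f][level]
                like_u *= u[f][level]
            posteriors.append(like_m / (like_m + like_u))
        # M-step: re-estimate the match share and the level distributions.
        total_m = sum(posteriors)
        total_u = len(pairs) - total_m
        p_match = total_m / len(pairs)
        for f in range(n_fields):
            for k in range(n_levels[f]):
                w_m = sum(w for w, pair in zip(posteriors, pairs) if pair[f] == k)
                w_u = sum(1.0 - w for w, pair in zip(posteriors, pairs) if pair[f] == k)
                m[f][k] = (w_m + smoothing) / (total_m + smoothing * n_levels[f])
                u[f][k] = (w_u + smoothing) / (total_u + smoothing * n_levels[f])
    return posteriors


# Toy data: fields are (first-name distance, last-name distance, age gap).
pairs = (
    [(0, 0, 0)] * 40 + [(1, 0, 0)] * 5 + [(0, 1, 0)] * 5 + [(0, 0, 1)] * 5
    + [(2, 2, 1)] * 50 + [(2, 1, 2)] * 50 + [(1, 2, 2)] * 50 + [(2, 2, 2)] * 200
)
probs = em_match_probabilities(pairs, n_levels=(3, 3, 3))
```

A decision rule could then keep, for example, only pairs whose estimated probability exceeds a chosen threshold and is clearly higher than that of the next-best candidate for the same record; the specific cutoffs would be a choice for the researcher.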

3. A machine learning approach. This approach (due to Feigenbaum 2016) teaches an algorithm to replicate how a well-trained and consistent researcher would create a linked sample across sources. The method begins by extracting a subset of possible matches for each record, and then uses training data to tune a matching algorithm that attempts to minimize both false positives and false negatives. For information about the implementation of this approach, please contact James Feigenbaum.

Here is a (preliminary) presentation of our paper that discusses Best Practices for Automated Linking Using Historical Data (comments are very welcome!)

Replication tables

Our published papers were based on smaller samples available at the time and on "first generation" matching methods. Here we post replications of our published papers using newly-available data and employing newly-developed matching methods. Specifically, complete-count Census data have recently been made available through a partnership with Ancestry.com and the Minnesota Population Center and can now be accessed via the NBER. Scholars with interest in the complete-count US Census data should contact the Minnesota Population Center to inquire about access.

We replicate the following papers using approaches 1 and 2 above. The specific matching methods we used in all replications are outlined in detail here.

1. To the New World and Back Again: Return Migrants in the Age of Mass Migration (ILRR 2017)

Match Norwegian-born men aged 28-45 who were living in either Norway or the US in 1910 to the 1900 Norwegian Census.

2. Europe’s Tired, Poor, Huddled Masses: Self-Selection and Economic Outcomes in the Age of Mass Migration (AER 2012)

Match Norwegian-born men from the 1900 US Census and the 1900 Norwegian Census back to their childhood household in the 1865 Norwegian Census.

3. A Nation of Immigrants: Assimilation and Economic Outcomes in the Age of Mass Migration (JPE 2014) (to come)

Match immigrants from 16 sending countries and a comparison group of natives from the 1900 Census to the 1910 and 1920 Censuses.

  • Replication summary
  • Replication tables
  • Readme and codes

4. Have the Poor Always Been Less Likely to Migrate? (JDE 2013) (to come)

Match Norwegian-born men from the 1900 US Census and the 1900 Norwegian Census back to their childhood household in the 1865 Norwegian Census.