Individual Submission Summary

Probabilistic Record Linkage Using Pre-trained Text Embeddings

Sat, September 7, 10:00 to 11:30am, Pennsylvania Convention Center (PCC), 106B

Abstract

When merging data from different sources, social scientists often rely on fuzzy string matching to determine whether two records refer to the same entity. But for many social science applications, the most commonly used string distance metrics are imperfect, because they capture lexical similarity rather than similarity of meaning (e.g., Jim is a better match with James than with Tim; USN is a better match with Navy than with USPS). Pre-trained text embeddings, by contrast, offer a fast and scalable method for determining whether two strings have similar meaning. In this paper, I show that incorporating embedding-based similarity measures into a probabilistic record linkage procedure yields considerable gains in accuracy and efficiency. Across three applications, I show that these performance gains can be achieved with only minimal alterations to existing record linkage workflows, and I provide open-source statistical software for researchers to implement the proposed method.
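
The lexical-versus-semantic gap described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's software: the function names `levenshtein` and `cosine_similarity` are hypothetical, edit distance stands in for "commonly used string distance metrics" in general, and cosine similarity is assumed here as one common way to compare embedding vectors. No embedding model is loaded; the vectors would have to come from whatever pre-trained model a researcher chooses.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors (assumed comparison for embeddings)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Lexical distance ranks the wrong candidate first in both of the abstract's examples.
print(levenshtein("jim", "tim"))    # 1  (closer lexically, different name)
print(levenshtein("jim", "james"))  # 3  (farther lexically, same name)
print(levenshtein("usn", "usps"))   # 2
print(levenshtein("usn", "navy"))   # 4
```

In a workflow of the kind the abstract describes, `cosine_similarity` would be applied to the embedding vectors of the two strings, and the resulting score would enter the linkage model in place of, or alongside, the edit-distance score.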

Author