Individual Submission Summary

Validating Automatic Text Digitization

Thu, September 5, 10:00 to 11:30am, Pennsylvania Convention Center (PCC), 112A

Abstract

Advances in optical character recognition (OCR) and related technologies, a suite of computational tools for digitizing raw document scans, have increased researchers' ability to construct large corpora from historical source material. These technologies can dramatically expand the temporal and topical breadth of text analysis in the social sciences, particularly for researchers working with text data that predates widespread digital word processing. While these tools are promising, there are no agreed-upon best practices for making and validating the many modeling and data-processing choices researchers face when setting up OCR-based text digitization pipelines. This paper takes three steps toward establishing such practices. First, it offers an overview of best practices for researchers digitizing new source material. Second, it outlines a procedure for choosing an optimal digitization pipeline, comprising the choice of OCR model along with pre- and post-processing decisions. Third, it examines the efficacy of the proposed selection procedure using simulations of downstream text analysis tasks (text classification, topic modeling, and dictionary methods). We demonstrate our approach with a manually digitized ground-truth dataset of Papal political and policy documents from the 6th to 18th centuries.
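
To make the pipeline-selection step concrete, below is a minimal sketch in Python of how candidate pipelines might be scored against a manually transcribed ground truth using character error rate (CER). The pipeline labels and callables here are hypothetical illustrations, not the paper's actual procedure, which also validates pipelines via downstream analysis tasks rather than CER alone.

def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    # Character error rate: edit distance normalized by reference length.
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

def select_pipeline(scans, references, pipelines):
    # pipelines: dict mapping a label to a callable (scan -> digitized text).
    # Returns the label of the pipeline with the lowest mean CER on the corpus.
    def mean_cer(fn):
        return sum(cer(fn(s), r) for s, r in zip(scans, references)) / len(references)
    return min(pipelines, key=lambda name: mean_cer(pipelines[name]))

# Example usage (pipeline callables are hypothetical placeholders):
# best = select_pipeline(scans, transcriptions,
#                        {"model_a+binarize": pipeline_a, "model_b+spellfix": pipeline_b})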

Authors