Advances in optical character recognition (OCR) and related technologies, a suite of computational tools for digitizing raw document scans, have increased researchers' ability to construct large corpora from historical source material. These technologies can dramatically expand the temporal and topical breadth of text analysis in the social sciences, particularly for researchers working with text data that predates widespread digital word processing. While these tools are promising, there are no agreed-upon best practices for using and validating the many modeling and data-processing choices available to researchers when setting up OCR-based text digitization frameworks. This paper takes three steps toward such a framework. First, it offers an overview of best practices for researchers digitizing new source material. Second, it outlines a procedure for choosing an optimal digitization pipeline, consisting of the choice of OCR model as well as pre- and post-processing decisions. Third, it examines the efficacy of our proposed selection procedure using simulations of downstream text analysis tasks (text classification, topic modeling, and dictionary methods). We demonstrate our approach with a manually digitized ground-truth dataset of Papal political and policy documents from the 6th to 18th centuries.
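To make the pipeline-selection idea concrete, the sketch below shows one minimal way such a search could be set up, assuming the pytesseract and Pillow libraries and hypothetical file names. It scores candidate combinations of pre-processing steps and OCR settings against a manually transcribed ground truth using character error rate; the paper's procedure instead evaluates candidates on downstream tasks (classification, topic modeling, dictionary methods), so this is an illustration of the loop's structure rather than the authors' implementation.

```python
# Minimal sketch of a pipeline-selection loop (hypothetical filenames and settings;
# pytesseract and Pillow assumed to be installed).
from itertools import product

from PIL import Image, ImageOps
import pytesseract


def preprocess(image, grayscale, binarize, threshold=160):
    """Apply one candidate combination of pre-processing steps to a scan."""
    if grayscale or binarize:
        image = ImageOps.grayscale(image)
    if binarize:
        image = image.point(lambda p: 255 if p > threshold else 0)
    return image


def character_error_rate(hypothesis, reference):
    """Levenshtein distance between OCR output and ground truth, per reference character."""
    prev = list(range(len(reference) + 1))
    for i, h in enumerate(hypothesis, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = curr
    return prev[-1] / max(len(reference), 1)


# Candidate pipelines: pre-processing flags crossed with Tesseract page-segmentation modes.
candidates = list(product([True, False], [True, False], ["--psm 3", "--psm 6"]))

# pages: (scan path, manually transcribed ground-truth text) pairs; filenames are hypothetical.
pages = [("scan_001.png", open("truth_001.txt").read())]

best = None
for grayscale, binarize, config in candidates:
    errors = []
    for path, truth in pages:
        img = preprocess(Image.open(path), grayscale, binarize)
        text = pytesseract.image_to_string(img, config=config)
        errors.append(character_error_rate(text, truth))
    mean_cer = sum(errors) / len(errors)
    if best is None or mean_cer < best[0]:
        best = (mean_cer, grayscale, binarize, config)

print("Best pipeline:", best)
```

In the full procedure described above, the final scoring step would be replaced by performance on the simulated downstream tasks rather than raw character error rate.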