Individual Submission Summary

What a DRAG! Data from Retrieval-Augmented Generation

Fri, September 6, 8:00 to 9:30am, Pennsylvania Convention Center (PCC), 110B

Abstract

Large language models (LLMs), popularized by the release of ChatGPT, are beginning to change nearly every industry. Despite this, the use of these “foundation models” is still in its infancy within political science. When these models have been used, it has typically been for tasks such as sentiment analysis, persuasive text generation, and voter profile generation or survey piloting. To date, almost no work has examined how these tools can be leveraged to extract raw data from difficult or disparate sources. In this paper, I explore how LLMs can be used to extract raw data from large texts using a method called Retrieval-Augmented Generation (RAG). RAG, a framework for improving the accuracy of large language models by supplementing an LLM’s internal information with external text, can be used to comb through long texts, or large numbers of texts, and extract specified information that can then be converted into quantitative data. To test the viability of this method, I use RAG to extract information about city institutional structures from municipal codes. These municipal codes, which often stretch into the hundreds of pages, contain detailed information about city government. Because of their length, however, few studies interact with these documents directly. Using the results of a survey of city clerks as a comparison for close to 1,000 cities, I demonstrate that even simple implementations of RAG can collect largely accurate information from a document. I discuss which types of information and formats are most likely to be successful and suggest other use cases within political science that could benefit from RAG. Large language models offer new opportunities to categorize and interact with the information contained in documents that were previously too unwieldy to quantify.
This paper discusses one such opportunity, the Retrieval-Augmented Generation framework: how it can be used to manage large amounts of unwieldy text, best practices for implementation, and potential use cases beyond the example of municipal codes.
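
As a rough illustration of the retrieval step that RAG relies on, the following Python sketch splits a long document into overlapping passages, ranks them against a question, and assembles the top passages into a prompt. It is not the paper's implementation, which the abstract does not specify. The function names, the chunk sizes, and the keyword-overlap scoring are all assumptions chosen to keep the example self-contained; a real pipeline would more likely rank passages with embedding similarity, and the call to the language model itself is left out.

```python
import re
from collections import Counter


def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words (hypothetical defaults)."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def tokenize(s):
    """Lowercase a string and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", s.lower())


def score(query, chunk):
    """Count query tokens that also appear in the chunk.

    Assumption: simple keyword overlap stands in for the embedding
    similarity that a production RAG system would typically use.
    """
    q = Counter(tokenize(query))
    c = Counter(tokenize(chunk))
    return sum(min(q[t], c[t]) for t in q)


def retrieve(query, chunks, k=2):
    """Return the k chunks with the highest overlap score."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]


def build_prompt(question, passages):
    """Combine retrieved passages and the question into one prompt string.

    The resulting string would be sent to an LLM; that call is omitted here.
    """
    context = "\n---\n".join(passages)
    return (
        "Answer using only the passages below.\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy passages standing in for sections of a municipal code (invented for illustration).
passages = [
    "Section 2. The mayor shall be elected at large for a term of four years.",
    "Section 9. Dogs must be leashed in public parks.",
    "Section 3. The council shall appoint a city manager.",
]
question = "Is the mayor elected or appointed by the council?"
top = retrieve(question, passages, k=2)
prompt = build_prompt(question, top)
```

Only the passages most relevant to the question reach the prompt, which is what lets a model work with a document far longer than it could usefully read in one pass. The model's answer to a narrowly worded question (for example, "elected" or "appointed") could then be coded as a categorical variable.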

Author