Individual Submission Summary

Quantifying Narrative Reuse across Languages

Thu, September 5, 10:00 to 11:30am, Pennsylvania Convention Center (PCC), 113A

Abstract

How can one trace the spread of information, ideas, and narratives across the world using text data? Social scientists have long sought to answer this question, which requires identifying pairs of documents that contain statements with the same underlying meaning about the same subject. Past approaches that rely on n-gram matching or topic modeling have yielded only a loose approximation of this ideal. We propose a two-step method to track the global diffusion of information. First, we apply locality-sensitive hashing (LSH), a highly scalable technique, to cross-language embedded representations of text derived from a large language model (LLM), which generates a relatively small number of candidate pairs. Second, we fine-tune an instruct-trained LLM to identify the candidate pairs of sentences that actually contain the same idea. Creating a gold-standard labeled data set to evaluate performance on this pairwise problem is extremely difficult; we do so by creating a data set of thousands of benchmark sentence pairs that contain iterations of equivalent and different statements about the same and different topics. Our method has far higher recall than verbatim text reuse methods and is more precise than topic modeling.
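
The candidate-generation step can be illustrated with a minimal sketch. The abstract does not say which LSH family is used, so this example assumes random-hyperplane (SimHash-style) hashing, which approximates cosine similarity. The function name lsh_candidate_pairs, its parameters, and the synthetic embeddings are all hypothetical stand-ins; in practice the vectors would come from a cross-language embedding model, and the resulting candidates would then be passed to the fine-tuned LLM for verification.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def lsh_candidate_pairs(embeddings, n_tables=8, n_bits=12, seed=0):
    """Return index pairs (i, j), i < j, that share a bucket in at least one table.

    Assumption: random-hyperplane LSH; the hash family and parameter values
    are illustrative, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    _, dim = embeddings.shape
    candidates = set()
    for _ in range(n_tables):
        # Each table draws its own hyperplanes; the sign pattern is the hash key.
        planes = rng.standard_normal((dim, n_bits))
        bits = (embeddings @ planes) > 0
        buckets = defaultdict(list)
        for idx, row in enumerate(bits):
            buckets[row.tobytes()].append(idx)
        # Only documents that collide in a bucket become candidate pairs.
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates


# Synthetic stand-in for sentence embeddings: 200 unrelated vectors plus
# one near-duplicate of sentence 0 (for example, its translation).
rng = np.random.default_rng(42)
unrelated = rng.standard_normal((200, 64))
near_duplicate = unrelated[0] + 0.01 * rng.standard_normal(64)
vectors = np.vstack([unrelated, near_duplicate])

pairs = lsh_candidate_pairs(vectors)
total_pairs = len(vectors) * (len(vectors) - 1) // 2
print(f"{len(pairs)} candidate pairs out of {total_pairs} possible pairs")
print("near-duplicate pair retained:", (0, 200) in pairs)
```

Using several hash tables raises the chance that a truly similar pair collides at least once, while using more bits per table shrinks the buckets and so reduces the number of spurious candidates. The right trade-off depends on the corpus and the embedding model.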

This approach can be applied to the study of propaganda, misinformation, and the diffusion of innovations. In this paper, we apply it to show how U.S. media sources reuse information from Russian state media in the context of the 2022 Russian invasion of Ukraine, for example, accusations that Ukraine is developing bioweapons.

Authors