The Power of Transfer Learning: Using New LLMs for Large, Multilingual Datasets

Thu, September 5, 10:00 to 11:30am, Pennsylvania Convention Center (PCC), 112A

Abstract

Most recent work in political science that attempts to classify text along various dimensions relies on supervised models, in which human-coded text is used to train a classifier to sort the rest of the corpus into relevant categories. While this approach can produce reasonably high levels of classification accuracy (Barberá et al. 2021), it requires large corpora of hand-labeled data to train the models. These costs multiply quickly if multiple classification tasks are needed (e.g., categorizing salience, stance, and political conflict in a text) or if the corpus involves multiple languages. Such cost and time constraints have made multi-dimensional, multilingual text-derived datasets rare.

This paper examines the potential to greatly reduce the training data required for a classifier by tapping into transfer learning from the recent explosion of large, pre-trained models. We test the value of several new and popular multilingual Large Language Models (LLMs), such as XLM-RoBERTa, GPT, and mDeBERTa, for classifying news articles on multiple dimensions, in multiple languages, and from multiple countries. Compared to earlier transformer-based LLMs, such as the base BERT models, GPT and mDeBERTa claim expanded multilingual flexibility and better zero-shot classification performance. We evaluate the performance of these transformer-based transfer learning models on a human-coded dataset of news articles about government Covid response drawn from identical time periods for the United Kingdom, the United States, Mexico, and Turkey.
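To make the zero-shot setup concrete, a minimal sketch follows, assuming the Hugging Face transformers library and a publicly available mDeBERTa NLI checkpoint. The model name, example article, and candidate labels are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of zero-shot classification with a multilingual model.
# The checkpoint and labels below are assumptions for illustration only.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",  # assumed public checkpoint
)

# The same English label set can be applied to text in another language
# without any task-specific training data.
article_es = "El gobierno anunció nuevas restricciones para frenar la pandemia."
result = classifier(
    article_es,
    candidate_labels=[
        "supportive of the government response",
        "critical of the government response",
        "neutral",
    ],
)
print(result["labels"][0], result["scores"][0])  # top label and its score
```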

Compared to previous approaches, mDeBERTa and GPT outperform both earlier language models and traditional machine learning-based classification models. Most importantly for comparative researchers, we demonstrate that these models can be fine-tuned in English alone and still achieve state-of-the-art performance in other languages. These results appear robust to the target language (English, Spanish, or Turkish) and to moderate changes in model hyperparameters. We also find no substantive difference in performance between the closed-source, and potentially cost-prohibitive, GPT model and the most recent release of the open-source multilingual mDeBERTa model when the two are fine-tuned on identical datasets.
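The English-only fine-tuning strategy might look like the following sketch, which trains a multilingual encoder on English-labeled examples and evaluates it on text in other languages. The dataset contents, label scheme, and hyperparameters are illustrative assumptions; the paper's exact setup may differ.

```python
# A minimal sketch of cross-lingual transfer: fine-tune on English-labeled
# articles, then evaluate on other languages never seen during training.
# All data and hyperparameters here are hypothetical placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "microsoft/mdeberta-v3-base"  # multilingual encoder (assumed choice)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

# Hypothetical English training data; real work would use the coded corpus.
train = Dataset.from_dict({
    "text": ["The government's response was widely praised by health experts.",
             "Critics said the lockdown policy failed to slow the outbreak."],
    "label": [1, 0],
}).map(tokenize, batched=True)

# Hypothetical Spanish and Turkish test data, unseen during fine-tuning.
test = Dataset.from_dict({
    "text": ["La respuesta del gobierno fue elogiada por los expertos.",
             "Hükümetin tedbirleri muhalefet tarafından eleştirildi."],
    "label": [1, 0],
}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8, report_to="none"),
    train_dataset=train,
)
trainer.train()
print(trainer.evaluate(eval_dataset=test))  # cross-lingual transfer check
```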

Overall, the results of this study suggest that either mDeBERTa or GPT can classify multilingual text data with a high degree of accuracy and precision using only a small English-language training corpus. By leveraging transfer learning, these new pre-trained models have significant implications for political science research, particularly for projects that require the analysis of multilingual datasets. Tools such as mDeBERTa and GPT, best known publicly for text generation, should prove highly valuable to researchers seeking to analyze text corpora across domains and languages.
