Individual Submission Summary

GPT-4 and LLaMa 2 as Expert Coders in Social Science Tasks

Fri, September 6, 8:00 to 9:30am, Pennsylvania Convention Center (PCC), 110B

Abstract

Large Language Models (LLMs) have revolutionized how social scientists approach text as data. Although still narrowly applied, researchers have seen the benefits of replacing conventional supervised machine-learning models with LLMs in tasks common in political science, such as sentiment analysis, ideological scaling, and topic modeling (see Ornstein, Blasingame, and Truscott 2023). In this paper, we test the application of LLMs as expert coders in complex labelling tasks, and the performance of their output as training sets for supervised machine-learning models. We extend previous work in two important ways: 1) we compare the performance of two leading Generative Pre-trained Transformer (GPT) models, OpenAI's ChatGPT-4 (pay-walled) and Meta's LLaMa 2 (open-source), to that of expert coders in multi-language labelling tasks relevant to political science; and 2) we compare the performance of the resulting data when used as training sets for Transformers-based machine-learning models (e.g., XLM-RoBERTa). To do this, we use multi-language human-labelled sentences from the Manifesto Project as a baseline against which to compare our LLMs of interest. We then compare the performance of each labelled dataset as a training set for an XLM-RoBERTa model. Initial results suggest that GPT models perform at least as well as expert coders in complex labelling tasks. Further, the training sets produced by the GPT models outperform the human-labelled training set when training a Transformers-based model. We argue that the greater internal consistency of the training sets produced by GPT models boosts the predictive power of the machine-learning model. Furthermore, we find no significant difference in the performance of ChatGPT-4 and LLaMa 2, yet LLaMa 2 costs a fraction (technically nothing) of what either ChatGPT-4 or human coders cost. Finally, we provide guidelines for best practices when using GPT models as part of downstream tasks, and when using LLMs more broadly. We also provide suggestions on tasks where GPT models can be particularly beneficial for researchers (e.g., labelling rare events in data).
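For readers who want to experiment with this kind of workflow, below is a minimal Python sketch, not the authors' code, of the two-stage pipeline the abstract describes: prompting an LLM to code sentences into categories, then fine-tuning XLM-RoBERTa on the resulting labels. It assumes the OpenAI Python client (v1+) and the Hugging Face transformers and datasets libraries; the category list, prompt wording, example sentences, and hyperparameters are all illustrative assumptions, not details from the paper.

# Minimal sketch of the LLM-as-coder -> fine-tune pipeline described above.
# Assumptions (not from the paper): category list, prompt wording, example
# sentences, hyperparameters. Requires: openai>=1.0, transformers, datasets.

from datasets import Dataset
from openai import OpenAI
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

CATEGORIES = ["economy", "welfare", "environment"]  # illustrative subset of codes

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_label(sentence: str) -> int:
    """Ask the model to assign one category to a sentence (prompt is illustrative)."""
    prompt = (
        "You are an expert coder of political texts. Assign the sentence to "
        f"exactly one of these categories: {', '.join(CATEGORIES)}. "
        f"Answer with the category name only.\n\nSentence: {sentence}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output helps internal consistency
    )
    answer = resp.choices[0].message.content.strip().lower()
    return CATEGORIES.index(answer) if answer in CATEGORIES else 0  # crude fallback


# Stage 1: have the LLM code a (here, toy) corpus of manifesto-style sentences.
sentences = [
    "We will cut taxes for small businesses.",
    "Universal healthcare must be guaranteed for all citizens.",
]
records = [{"text": s, "label": llm_label(s)} for s in sentences]

# Stage 2: fine-tune XLM-RoBERTa on the LLM-labelled data.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(CATEGORIES)
)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)


train_dataset = Dataset.from_list(records).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="xlmr-llm-coded",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()

In practice, the LLM-produced labels would first be validated against the human-coded Manifesto Project baseline before being used as training data; the toy corpus above stands in for that larger multi-language dataset.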

Authors