Individual Submission Summary

Beyond Human Judgment: How to Evaluate Language Model Uncertainty

Thu, September 5, 12:00 to 1:30pm, Pennsylvania Convention Center (PCC), 112A

Abstract

Language models (LMs) have become a vital component of the political science research workflow. Comparatively little work exists, however, on estimating LM uncertainty and incorporating that variance into downstream analysis. We show that this gap leads scholars to misrepresent estimated quantities as known, resulting in attenuation bias and/or a loss of efficiency in subsequent regression coefficients. Crucially, the usual fixes for estimating uncertainty, e.g., repeated independent judgments by human coders, are not available. Moreover, in many cases it is unclear whether humans (researchers or coders) or LMs are more expert. To calibrate the problem empirically, we compare LM classifications and their associated reliability to "gold standard" data for which human inter-coder reliability and confidence are known. We then provide a framework for incorporating the uncertainty we elicit from LMs into subsequent analysis. Finally, we offer best-practice advice for 'edge cases' in which the researcher lacks expertise and must rely (fully) on the LM's judgment.
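The abstract does not describe the framework's implementation, but the core idea, propagating LM classification uncertainty into a downstream regression rather than plugging in labels as if they were known, can be illustrated with a minimal sketch. The sketch below assumes the LM can be prompted to return a class probability for each document; all variable names (lm_class_probs, covariate, outcome) and the simulated data are hypothetical, and the Rubin-style pooling shown is one standard way to combine estimates across repeated label draws, not necessarily the authors' approach.

```python
# Illustrative sketch only: propagating elicited LM classification uncertainty
# into a downstream regression via repeated label draws and Rubin-style pooling.
# All names and data here are hypothetical placeholders, not from the paper.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: an outcome, a covariate, and an LM-elicited probability
# that each document belongs to the positive class.
covariate = rng.normal(size=n)
lm_class_probs = rng.uniform(0, 1, size=n)
outcome = 1.5 * (lm_class_probs > 0.5) + 0.5 * covariate + rng.normal(size=n)

# Treating the LM label as known (plug-in) ignores classification uncertainty.
# Instead, draw labels from the elicited probabilities M times, refit, and pool.
M = 100
coefs, variances = [], []
for _ in range(M):
    drawn_label = rng.binomial(1, lm_class_probs)           # one plausible labeling
    X = sm.add_constant(np.column_stack([drawn_label, covariate]))
    fit = sm.OLS(outcome, X).fit()
    coefs.append(fit.params[1])                              # coefficient on the label
    variances.append(fit.bse[1] ** 2)                        # its sampling variance

coefs = np.array(coefs)
within = np.mean(variances)                  # average within-draw variance
between = np.var(coefs, ddof=1)              # variance across labelings
total_var = within + (1 + 1 / M) * between   # Rubin's rules for total variance

print(f"pooled coefficient: {coefs.mean():.3f}")
print(f"pooled std. error:  {np.sqrt(total_var):.3f}")
```

Compared with a single plug-in regression, the pooled standard error reflects both sampling variance and the variance induced by uncertain LM labels, which is the kind of downstream correction the abstract argues is usually omitted.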

Authors