Individual Submission Summary

Interpretable LDA Topic Models with Near-Optimal Posterior Probability

Thu, September 5, 10:00 to 11:30am, Pennsylvania Convention Center (PCC), 112A

Abstract

Despite their widespread popularity, LDA topic models have three well-known problems. First, unlike other mainstream methods in the social science toolkit, mainstream LDA solvers have no provable guarantees, meaning they find only locally modal topics that can fit the data arbitrarily poorly compared with the topics that would maximize the log-posterior objective function. Second, LDA solvers are interpretively opaque, and understanding their results requires researchers to manually contrive descriptions of what each topic is ‘about’. Third, while the LDA model itself is compatible with causal inference frameworks that would allow researchers to study topics as treatments or outcomes, standard LDA solvers violate the independence assumptions necessary to do so.

In this paper, we show how to (1) obtain solutions to LDA models that provably fit the text data near-optimally (i.e., that nearly maximize an LDA log-posterior objective function). We then show that these near-optimal solutions can be obtained while also (2) guaranteeing that each resulting topic is rigorously interpretable via a near-optimally selected keyword and a known co-occurrence function. We also show that it is possible to obtain these solutions (3) without violating the independence assumptions required to use the topics in downstream causal inference methods, and (4) in logarithmic parallel computation time.

Our approach provides researchers with a first practical means to run topic models with provable guarantees and interpretability properties similar to those of linear regression and other workhorse methods in the political methodology toolkit. We also show that this approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA solvers across a diverse set of text datasets and a wide range of input and evaluation parameters.

Author