Individual Submission Summary

Extracting Keywords from Unlabeled Corpora Using Word Embeddings

Thu, September 5, 10:00 to 11:30am, Pennsylvania Convention Center (PCC), 112A

Abstract

Researchers frequently need to extract information, such as events or target topics, from large corpora. One common solution is to apply semantically related keywords to identify tweets, news articles, or other documents of interest. However, dictionaries relevant to the topic, event, or language of interest rarely both exist and are accessible. Moreover, existing algorithms for extracting dictionaries require many user-provided seed words or hand-coded documents to generate useful results, and they do not incorporate contextual information from natural language. In this paper, I present a novel algorithm, conclust, that extracts keywords from unlabeled text using a small number of user-provided seed words and a fitted word embeddings model. Compared to existing methods of lexicon extraction, conclust requires few seed words, is computationally efficient, and takes word context into account. I describe the algorithm's properties and benchmark its performance against existing methods of lexical dictionary extraction, comparing user labor, conceptual clarity, and the ability to replicate existing keyword dictionaries.
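
The abstract does not spell out how conclust works, so the following is only a minimal sketch of the general idea it builds on: expanding a few seed words into candidate keywords by ranking vocabulary words on their cosine similarity to the seeds in an embedding space. It should not be read as the conclust algorithm itself. The function names (`cosine`, `expand_seeds`), the `top_k` parameter, and the three-dimensional toy vectors are all hypothetical; in practice the vectors would come from a word embeddings model fitted on the corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_seeds(embeddings, seeds, top_k=3):
    """Rank non-seed words by their mean cosine similarity to the seed words.

    Illustrative only: this is generic nearest-neighbor seed expansion,
    not the conclust procedure. Assumes every seed is in `embeddings`.
    """
    scores = {}
    for word, vec in embeddings.items():
        if word in seeds:
            continue  # seeds are already in the dictionary
        sims = [cosine(vec, embeddings[s]) for s in seeds]
        scores[word] = sum(sims) / len(sims)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Hypothetical toy vectors standing in for a fitted embeddings model.
TOY_EMBEDDINGS = {
    "protest": [0.9, 0.1, 0.0],
    "riot": [0.8, 0.2, 0.1],
    "march": [0.7, 0.3, 0.0],
    "demonstration": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.1, 0.9],
    "recipe": [0.1, 0.0, 0.8],
}

candidates = expand_seeds(TOY_EMBEDDINGS, seeds=["protest", "riot"], top_k=2)
print(candidates)
```

With these toy vectors, the two seeds pull in the words that point in a similar direction, while unrelated words such as "banana" score low and are left out.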

Author