Individual Submission Summary

Human Rights Documents as Data: Comparing and Combining Human and Machine Coding

Thu, September 5, 12:30 to 1:00pm, Pennsylvania Convention Center (PCC), Hall A (iPosters)

Abstract

When extracting quantifiable data from text documents that require interpretation, the “gold standard” of coding data by hand remains valuable, but it is expensive and time-consuming. In recent years, advanced machine-learning approaches have been shown to overcome some of the major limitations of human-coded data (especially cost and scalability, human error, and inter-coder reliability) while maintaining relatively high accuracy.
In this research note, we describe lessons learned from combining a large hand-coded set of early Amnesty International Urgent Action (UA) documents (1975-2007) with a similarly large keyword-coded set of such documents from a later, overlapping time period (1991-2023). Some documents were thus coded by hand, some by machine, and some by both methods. As a result, we were able to compare and combine the results of hand coding with relatively simple machine-supported approaches, such as automated keyword searches, in a way that built on the strengths of both methods to increase accuracy and scalability, while also perhaps being more widely employable than advanced statistical models.
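
To illustrate what "relatively simple machine-supported approaches" can look like in practice, the following is a minimal sketch (not the authors' actual pipeline) of keyword-based coding of UA documents. The document text and the keyword list are illustrative assumptions; in the project the search terms were refined iteratively.

```python
# Hypothetical keyword list for flagging UA documents that may concern
# human rights activists under threat; real terms would be refined over time.
ACTIVIST_KEYWORDS = [
    "human rights defender",
    "human rights activist",
    "trade union leader",
    "journalist",
]

def keyword_code(text: str, keywords=ACTIVIST_KEYWORDS) -> bool:
    """Return True if any keyword appears in the document text."""
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)

# Example usage with a toy document.
doc = "URGENT ACTION: A human rights defender was detained on 3 May ..."
print(keyword_code(doc))  # True
```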
Our project had the specific objective of identifying and analyzing the subset of UA documents concerning human rights activists under threat. While we found that the accuracy of a keyword search by itself was limited, cross-checking results in the corpus of documents included in both datasets allowed us to progressively refine the search and to reduce both false positives and false negatives in the automated results. At the same time, this approach helped identify human errors and inconsistencies in the hand-coded data and allowed us to scale up the dataset to cover 15 additional years with comparatively few additional resources.
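
A minimal sketch of the cross-check idea, under the assumption that hand-coded and keyword-coded labels are available as simple mappings from document IDs to binary labels (not the authors' actual data format): disagreements on the overlapping documents point to keywords that need refining and to hand-coded entries that merit review.

```python
def cross_check(hand_labels: dict, machine_labels: dict):
    """Compare labels on documents present in both datasets.

    hand_labels / machine_labels map document IDs to True (activist case)
    or False. Returns document IDs to inspect on each side.
    """
    overlap = hand_labels.keys() & machine_labels.keys()
    # Machine says "activist case" but hand coding does not: candidate false positives.
    false_positives = [d for d in overlap if machine_labels[d] and not hand_labels[d]]
    # Hand coding says "activist case" but machine does not: candidate false negatives.
    false_negatives = [d for d in overlap if hand_labels[d] and not machine_labels[d]]
    return false_positives, false_negatives

# Toy example with hypothetical document IDs.
hand = {"UA-001": True, "UA-002": False, "UA-003": True}
machine = {"UA-001": True, "UA-002": True, "UA-003": False}
fp, fn = cross_check(hand, machine)
print(fp, fn)  # ['UA-002'] ['UA-003']
```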
As new human rights texts become available digitally and machine-coding capability evolves, our experience is of interest for three reasons. First, as researchers we used the comparison to reliably update and back-fill material that was partially hand-coded, while also using the hand coding as a check to validate and improve the new automated work. Second, even as digitized text becomes increasingly available, the ability of human rights researchers to make use of complex computational methods to code the documents at their disposal may remain limited; we provide some suggestions for taking advantage of keyword searches and other simple approaches. Third, in working with a set of human rights documents covering almost half a century, we leveraged the advantages of the two approaches in coding for the different kinds of language used to identify activists (as subjects of human rights protection) over time.

Authors