Published online by Cambridge University Press: 25 July 2019
Word vectorization is an emerging text-as-data method that shows great promise for automating the analysis of semantics—here, the cultural meanings of words—in large volumes of text. Yet successes with this method have largely been confined to massive corpora where the meanings of words are presumed to be fixed. In political science applications, however, many corpora are comparatively small and many interesting questions hinge on the recognition that meaning changes over time. Together, these two facts raise vexing methodological challenges. Can word vectors trace the changing cultural meanings of words in typical small corpora use cases? I test four time-sensitive implementations of word vectors (word2vec) against a gold standard developed from a modest data set of 161 years of newspaper coverage. I find that one implementation method clearly outperforms the others in matching human assessments of how public dialogues around equality in America have changed over time. In addition, I suggest best practices for using word2vec to study small corpora for time series questions, including bootstrap resampling of documents and pretraining of vectors. I close by showing that word2vec allows granular analysis of the changing meaning of words, an advance over other common text-as-data methods for semantic research questions.
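The bootstrap resampling of documents mentioned above can be illustrated with a minimal sketch. It is not the paper's replication code. The function names and the toy corpus are hypothetical, and a simple co-occurrence rate stands in for the real statistic, which would plausibly be something like a cosine similarity between two word vectors from a word2vec model retrained on each resample. The idea is to redraw the documents of one time slice with replacement many times and summarize the statistic across draws, so that no single document dominates the estimate in a small corpus.

```python
import random
import statistics


def resample_documents(documents, rng):
    """Draw a bootstrap sample: same size as the input, drawn with replacement."""
    return [rng.choice(documents) for _ in range(len(documents))]


def cooccurrence_rate(documents, target, context):
    """Stand-in statistic (an assumption, not the paper's measure):
    the share of documents containing `target` that also contain `context`."""
    with_target = [doc for doc in documents if target in doc]
    if not with_target:
        return 0.0
    return sum(context in doc for doc in with_target) / len(with_target)


def bootstrap_estimate(documents, statistic, n_samples=200, seed=0):
    """Recompute `statistic` on `n_samples` resampled corpora and
    return the mean and standard deviation across resamples."""
    rng = random.Random(seed)
    values = [statistic(resample_documents(documents, rng)) for _ in range(n_samples)]
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical tokenized documents from a single time slice.
toy_slice = [
    ["equality", "race", "rights"],
    ["equality", "gender", "vote"],
    ["treaty", "trade"],
    ["equality", "race", "schools"],
]

mean, spread = bootstrap_estimate(
    toy_slice, lambda docs: cooccurrence_rate(docs, "equality", "race")
)
print(f"mean={mean:.3f}, sd={spread:.3f}")
```

In an actual application, the `statistic` argument would wrap model training on the resampled documents. The spread across resamples then gives a rough sense of how stable the estimate is for that time slice.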
Author’s note: Replication materials for this paper are available (Rodman 2019). This work was supported by the Center for American Politics and Public Policy (CAPPP) at the University of Washington and by the National Science Foundation [#1243917]. I am grateful for the invaluable advice and feedback received at various stages of this project from Chris Adolph, Jeffrey Arnold, Andreu Casas, Ryan Eastridge, Aziz Khan, Brendan O’Connor, Brandon Stewart, Rebecca Thorpe, Nora Webb Williams, and John Wilkerson, as well as from participants at the Ninth Annual Conference on New Directions in Analyzing Text as Data (TADA 2018). The paper was also much improved by thoughtful editorial and reviewer feedback at Political Analysis. Allyson McKinney and Molly Quinton contributed cheerful and diligent research assistance. This project further benefited from statistical and computational consulting provided by the Center for Statistics and the Social Sciences as well as the Center for Social Science Computation and Research, both at the University of Washington.
Contributing Editor: Daniel Hopkins