1. Introduction
In the past decade, computer-assisted text analysis has evolved from a rarely used technology to an increasingly ubiquitous approach in political science. Access to massive volumes of digital text, combined with advances in computing and algorithm design, makes text analysis easier to implement and more relevant in research domains as diverse as state repression (Gohdes, 2020), news exposure (Guess, 2021; Stier et al., 2022), political advertising (Fowler et al., 2021), congressional committees (Casas et al., 2020; Park, 2021), and more.
One common task is measuring important latent concepts embedded within text via supervised machine learning. Generically, we imagine a researcher who wishes to measure a predefined construct, such as negativity (Fowler et al., 2021), grandstanding (Park, 2021), or Islamophobia (Alrababa’h et al., 2021). The goal is to create a mapping from the written text to a valid quantitative measure. However, doing so requires multiple steps including data partitioning, label acquisition, text pre-processing, and model fitting. At each stage, researchers must make choices, many of which are consequential to the results. Thus, creating a measure can be conceptualized as a set of interconnected procedures. It is a pipeline, where the outputs of each stage are passed on to the next so that the final result reflects the cumulative decisions made along the way.
Although these methods are powerful and increasingly accurate, the complexity of this pipeline, combined with current reporting standards, presents a problem. To begin with, few of the decisions made at each stage are justified to readers, and often they are not even described. This lack of transparency is particularly problematic given the sequential nature of the pipeline; information about one step of the process in isolation can be uninformative or even deceptive. Worse, standard practice provides readers with little objective evidence about the validity of the procedures or even the final outputs. Together, this means that far too often readers are presented with statistical analyses of text-based measures where the measurement procedure has not been adequately explained, evaluated, or validated.
To address this limitation, this article introduces a conceptual framework for understanding, reporting, and validating the text-to-measure pipeline. Inspired by the total survey error approach in public opinion research (e.g., Groves and Lyberg, 2010; Althaus et al., 2022), we take a holistic view to identify potential sources of error in the pipeline and clarify what should be reported. We systematically outline each stage and provide detailed guidelines, aiming to establish a reporting standard that enhances transparency and credibility. Additionally, we emphasize the importance of rigorous validation, echoing Grimmer and Stewart (2013a): “validate, validate, validate.” Specifically, we highlight three key stages where transparent assessment of performance is crucial: labeling, model fitting, and out-of-sample imputation.
Our goal is not to criticize any prior research. Nor do we intend to provide definitive “best practices,” since the field of text-as-data is too diverse and methods are evolving too rapidly for that to be feasible. Best practices for any particular measurement exercise are too context-dependent for any advice to be authoritative, although we do provide a set of reporting standards that can serve as a reasonable default. Rather, this paper adds to the ongoing efforts in the computational social science community to establish frameworks for conceptualizing, reporting, and evaluating text-as-data methods (Grimmer and Stewart, 2013a; Ying et al., 2021; Grimmer et al., 2022; Kapoor et al., 2022), specifically focusing on the use of supervised learning to measure latent concepts. We provide an approachable and coherent explanation of the various steps involved, the key decision points for researchers, and a framework to make empirical results more credible and transparent to readers.
The rest of the paper proceeds as follows. The next section provides a brief overview of the measurement pipeline—including sampling, label acquisition, text representation, model fitting, and out-of-sample imputation—and then reviews current practices in the field in terms of whether and how these stages are reported and evaluated. We find that there is a great deal of heterogeneity in reporting practices and that robust validation at any stage of the pipeline is surprisingly rare, even in the field’s top journals. To address this, Section 3 presents a guideline for researchers, including the key decisions to be reported and the validation analyses to be undertaken. Section 4 then illustrates our framework with an application to U.S. Senate confirmation hearings held from 1997 to 2019. Specifically, we build and validate a new measure of tone in over 89,000 statements made by senators in these hearings. We conclude with a brief discussion of how these recommendations might shift as the field more widely adopts alternative measurement strategies based on large language models (LLMs).
2. A supervised learning pipeline for measurement
To limit the scope of this discussion, we make several assumptions about the research goal or “use case” we address. First, we assume that the researcher has a predefined concept of interest, thus excluding exploratory techniques such as unsupervised topic models; second, that the goal is to create a valid measure at the document level, such that each speech, tweet, or article is scored; and third, that the researcher wishes to use a supervised learning method.
2.1. Pipeline overview
The goal in the text-to-measure pipeline is to train a model using a subset of the data so that values for the latent variable in the rest of the corpus can be imputed. A schematic for this exerciseFootnote 1 is shown in Figure 1. First, we select a subset of the data to use to fit a supervised model. At this stage, we also set aside documents for validation. Identifying validation sets up front is a crucial step, which we emphasize at multiple stages in our proposed guideline. Second, we label the documents in the training set on the latent trait of interest. In the simplest case, this is done manually by research assistants, crowdsourced workers, or generative LLMs. Third, we pre-process the text and extract a set of features that can be represented numerically. For example, we might remove punctuation and create a term-document matrix (see Online Supplementary Material Section B for more details). More advanced pre-processing steps are possible, such as named entity recognition, which identifies important persons or places in the text (e.g., Fowler et al., 2021). Fourth, we fit some machine learning algorithm(s) to accurately predict the labels as a function of the inputs. This stage typically requires the selection of hyperparameters. Finally, we impute the latent trait for the entire corpus with the fitted model.

Figure 1. Summary of the supervised learning pipeline.
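To make these five steps concrete, the following R sketch runs the pipeline end to end on a toy corpus. The documents, labels, and object names are hypothetical, and the specific tools (a quanteda term-document matrix and an e1071 support vector machine) are illustrative choices rather than recommendations.

```r
library(quanteda)  # text representation
library(e1071)     # support vector machine

txt <- c("we thank the nominee for her distinguished service",
         "i appreciate your leadership and experience",
         "the report you quoted is simply not correct",
         "this regulation violates the standard the court set",
         "congratulations on an impressive and productive career",
         "i am troubled by these repeated and serious problems",
         "your qualifications and record are outstanding",
         "explain why the numbers in your testimony do not add up")

labeled_ids   <- 1:6   # Step 1: partition -- label a subset, hold the rest out
unlabeled_ids <- 7:8

y <- c(1, 1, 0, 0, 1, 0)   # Step 2: toy "tone" labels for the training documents

# Step 3: pre-process and build a term-document frequency representation
X <- as.matrix(dfm(tokens(txt, remove_punct = TRUE)))

# Step 4: fit a supervised learner on the labeled rows
fit <- svm(x = X[labeled_ids, ], y = y, scale = FALSE)

# Step 5: impute the latent trait for the remaining documents
predict(fit, X[unlabeled_ids, ])
```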
This approach is attractive for several reasons. Most importantly, supervised learning is much more cost-effective than labeling the entire corpus through a manual or crowdsourced process. Although those approaches are possible for tasks like open-text responses (e.g., Bøggild et al., 2021), the size of contemporary text corpora can quickly exceed the budget of any research team. Fowler et al. (2021), for instance, analyzed over 400,000 Facebook ads. In our own example below, the corpus includes 89,279 statements. In addition, supervised learning makes it relatively easy to impose the researcher’s question on the corpus: it builds a model designed to precisely estimate a predefined concept rather than to make broad discoveries within the corpus. This makes supervised learning output more interpretable for downstream tasks, such as theory testing, and helps it perform much better than dictionary methods (Grimmer and Stewart, 2013a).
The downside is that the end-to-end process can become elaborate; to some, it may even seem convoluted. Furthermore, at each step, researchers must make many decisions, often with little guidance. This adds many “researcher degrees of freedom” to the process and makes reporting results cumbersome. Researchers should therefore consider the following questions. Which decisions need to be detailed for readers? What evidence should be provided to give readers confidence in the procedures and the validity of the measure?
2.2. Current practices for reporting and validation
To motivate our discussion in the following sections, we first review current practices in the field in terms of how these steps are reported and validated. To do so, we searched the 2020–2023 volumes of the American Political Science Review (APSR) and the American Journal of Political Science (AJPS) for all articles that used supervised learning of texts to create measures of latent concepts. In total, we identified 14 articles and reviewed each along with its appendices.Footnote 2 The results are shown in Table 1.
Table 1. A review of current reporting and validation practices applying the text-to-measure pipeline

Analysis of transparency in articles applying the text-to-measure pipeline with supervised machine learning in the APSR and AJPS (2020–2023). The total number of articles varies across rows because not all categories are relevant to all projects.
First, most cases (10/13) provide at least some details of how a subset was selected for scoring.Footnote 3 However, in only three out of 14 articles were any observations “held out” in advance for downstream validation, a striking finding given the importance of out-of-sample validation in supervised learning methods.
Second, we examine how document scores were assigned and evaluated. Only eight out of 14 articles reported any details for labeling/scoring. In two other cases, the articles reference existing coding schemes that are not explained in the article itself (Fowler et al., 2021; Wahman et al., 2021). It is even less common for articles to provide assessments of these scores. Only four out of 14 (29%) evaluated the quality of their labels. Alrababa’h et al. (2021) report inter-coder reliability. Schub (2022) reports inter-coder reliability and provides two example posts to support the face validity of the labels. Only Zubek et al. (2021) and Emeriau (2023) report steps to validate the labels more extensively.
Next, it is common to describe text pre-processing steps: ten out of 14 articles provided at least some details. Model validation of some sort is almost universal (13/14), mostly appearing as some form of within-sample fit statistic (e.g., precision and recall for binary classifiers). However, other forms of model validationFootnote 4 were rarer, appearing in six of 14 articles (43%). Two examples are Schub (2022) and Stier et al. (2022), which report highly predictive word stems. Four other papers took similar steps.Footnote 5
In addition, assessments of fit were almost always conducted within the training sample.Footnote 6 This raises concerns because in many cases the machine learning model was selected or tuned based on performance within that same sample. For example, Casas et al. (2020) fits a large number of models and chooses the most accurate to create an ensemble. If models are tuned and evaluated on the same sample, the risk of overfitting increases. However, in our review, only Hager and Hilbig (2020), Wahman et al. (2021), and Emeriau (2023) preserved cases for out-of-sample assessments.Footnote 7 Moreover, whether the tuning and testing were done on the same samples is difficult to assess because the tuning is explained or justified in only nine out of 14 articles.Footnote 8
Finally, and most critically, we examined whether the final scores—the measure to be used in downstream analyses—were validated. Stier et al. (2022), for instance, compares the final scores to a researcher-created metric to assess convergent validity. Anastasopoulos and Bertelli (2020) provides a validation by replicating basic findings from the existing literature. Gohdes (2020) provides the reader with a random selection of documents along with their predicted scores to assess face validity. Zubek et al. (2021) adopts a combination of these strategies.Footnote 9 Emeriau (2023) shows that the scores correlate with document metadata as expected by theory. However, in the remaining articles—nine out of 14 (64%) in the field’s top journals—no validation of the final measure is provided, despite the fact that these measures often serve as a main explanatory or dependent variable.Footnote 10
In all, we found that current reporting standards for the text-to-measure pipeline are surprisingly sparse. While some steps are commonly reported, others are rare, including (surprisingly) assessing the quality of the labels and final measures. In other words, in many articles neither the model inputs nor the model outputs are validated, leaving readers to judge the validity of the measure based only on in-sample predictive performance, which, in the absence of validated labels, does not actually speak to the quality of the measure.
We suspect that researchers are conducting additional analyses, but they are not included in the manuscript since there are few clear expectations about what should be reported. To address this, in the next section we articulate a framework for self-assessment and reporting of the text-to-measure pipeline.
3. Transparently reporting the pipeline
With the general schematic shown in Figure 1 in mind, we now discuss the most critical aspects of the text-to-measure pipeline that researchers should report. For convenience, we provide a detailed guideline in Figure 2 that captures these key considerations.

Figure 2. A guideline for reporting and validating the text-to-measure pipeline.
First, researchers should subset the overall corpus for labeling and validation. This subsetting procedure should be reported since this decision can significantly affect downstream outcomes. Ideally, the readers should be able to assess the representativeness, size, and diversity of the labeled corpus and validation sets.
Random sampling is a simple strategy to ensure representativeness. If the labeled set is not representative, it can introduce subtle distortions. In our application below, for instance, selecting only recent congressional documents might overemphasize temporally specific features (e.g., “trump” or “vaccine”) that are not dispositive for earlier periods. Similarly, Anastasopoulos and Bertelli (2020) notes that relying solely on major legislation texts could be consequential. However, researchers may choose to over-sample specific traits or use block sampling based on existing dictionaries for a balanced training set. In some cases, such as Guess (2021), the sample is built on the basis of available labels (news articles from specific web domains).
The optimal number of documents to score depends on several factors: the sampling strategy, the distribution and complexity of the latent trait, the text feature processing, and the learning algorithms. Although it is difficult to pinpoint the ideal size of the training set for these reasons, we suggest starting by labeling at least 300 observations. Note, however, that in many cases far more will be needed. One can then use a “learning curve” approach, adding observations until out-of-sample prediction rates plateau (Mohr and van Rijn, 2022), while still holding out reasonably sized validation sets.
Researchers should also report the diversity of the labeled subset along key traits. This is crucial because any feature absent from the labeled set cannot inform the learning algorithms about its mapping into the latent space. At a minimum, researchers should disclose the variation of the labeled set along the latent dimension of interest. For instance, if the labels are categorical, it is important to report how many observations from each class are included in the labeled set (e.g., with a histogram). For continuous measures, researchers should report measures of central tendency (e.g., median and mean) and spread (e.g., variance, range, and/or interquartile range).
Given sampled data that are representative, diverse, and sufficiently large, we suggest separating them into four groups for model fitting and validation. We present an example of this grouping strategy in the application section (see Table 2). The objective is to let the first group serve as the large, baseline training group and to reserve the other three smaller groups to validate the labels, the fitted models, and the final measure.
Table 2. Partitioning of the labeled data for training and three forms of validation

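As a concrete illustration of this partitioning, the R sketch below draws a simple random sample and splits it into four groups; the corpus and group sizes mirror the application in Section 4, and all object names are hypothetical.

```r
set.seed(2025)
n_corpus <- 89279                  # corpus size from the application below
sampled  <- sample(n_corpus, 3600) # simple random sample for labeling/validation

group  <- rep(c("A", "B", "C", "D"), times = c(2900, 100, 500, 100))
groups <- split(sampled, group)
lengths(groups)                    # A: training; B-D: three validation stages

# Reporting the diversity of the labeled subset (assuming a vector of labels):
# table(labels)                    # class counts for categorical labels
# summary(labels); sd(labels)      # central tendency and spread for continuous labels
```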
Second, researchers should detail their labeling process: who did the coding (e.g., researchers, trained research assistants, or online workers), what rules were followed, and any steps taken to improve label quality (e.g., multiple coding). They should assess both reliability, meaning the consistency of the labels (e.g., inter-coder statistics such as Krippendorff’s alpha for categorical variables or Pearson’s correlation coefficient for continuous variables, among many options), and validity. Validity can be established through face validity (providing readers with specific documents and their assigned labels), convergent validity (showing that the labels correspond to related measures as expected), or another strategy, such as comparing the labels to a gold-standard coding, if one is available or can be constructed by the researchers, and reporting accuracy rates (e.g., F1 or Root Mean Square Error (RMSE)) (Grimmer and Stewart, 2013b).
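The sketch below illustrates these reliability and validity checks on toy labels. The irr and caret packages are convenient options rather than requirements, and all values are invented for illustration.

```r
library(irr)    # Krippendorff's alpha
library(caret)  # confusion matrix, precision/recall/F1

# Toy categorical labels from two coders for ten documents
# (1 = negative, 2 = neutral, 3 = positive)
coder1 <- c(3, 1, 3, 2, 1, 3, 1, 2, 3, 1)
coder2 <- c(3, 1, 3, 1, 1, 3, 1, 2, 3, 3)
kripp.alpha(rbind(coder1, coder2), method = "nominal")  # inter-coder reliability

# Validity against a gold-standard coding (here coder2 stands in for the gold standard)
confusionMatrix(factor(coder1), factor(coder2), mode = "prec_recall")

# For continuous labels, Pearson's r and RMSE play the analogous roles
s1 <- c(0.2, -1.1, 0.8, 0.1, -0.6)
s2 <- c(0.4, -0.9, 0.7, -0.2, -0.5)
cor(s1, s2)
sqrt(mean((s1 - s2)^2))
```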
Third, researchers must address the fundamental challenge of reducing the textual input’s high-dimensional feature space to a manageable size while preserving essential semantic information. The most common approach begins with basic pre-processing steps: tokenization to create a term-document frequency (TDF) matrix (Grimmer and Stewart, 2013a), removal of stopwords, and stemming.Footnote 11 More sophisticated dimensionality reduction techniques, such as word embeddings (Mikolov et al., 2013; Rodriguez and Spirling, 2022), can capture complex semantic relationships in a lower-dimensional space. When selecting a feature representation strategy, researchers must balance two competing concerns: computational tractability and information preservation. Over-aggressive dimensionality reduction can lead to sparse representation in the labeled corpus, potentially eliminating crucial semantic distinctions. To navigate this trade-off, researchers can either empirically evaluate different feature representations through cross-validation and select the best-performing approach or, as we demonstrate in our application, combine models trained on different feature representations to optimize performance. When researchers lack a clear reason to favor any single representation, the ensemble approach to model building used below will typically yield more accurate predictions than any single model. However this trade-off is navigated, researchers should report these steps to readers.
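As one concrete, and by no means unique, implementation of the basic bag-of-words route, the R sketch below builds a term-document frequency matrix with quanteda on two toy documents; each step is optional and, whichever choices are made, should be reported.

```r
library(quanteda)

txt <- c("We thank the nominee for her distinguished record of public service.",
         "The figures in this report are simply not correct.")

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)  # tokenize
toks <- tokens_tolower(toks)                                     # lowercase
toks <- tokens_remove(toks, stopwords("en"))                     # drop stopwords
toks <- tokens_wordstem(toks)                                    # stem
toks <- tokens_ngrams(toks, n = 1:2)                             # unigrams + bigrams
tdf  <- dfm(toks)                                                # term-document frequency matrix

# Optionally keep only the most frequent features (e.g., the top 5,000 in a real corpus)
tdf <- dfm_keep(tdf, names(topfeatures(tdf, 5000)))
```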
Fourth, researchers train models to predict labels for unlabeled documents. They should disclose model fitting procedures and justify hyperparameter choices. Model validation can involve assessing predictive accuracy within the labeled set; however, for complex, tuned models, this risks overfitting. To mitigate this, researchers must evaluate predictive accuracy on a held-out validation set. The appropriate accuracy metrics depend on the type of measure. For continuous measures, RMSE and Pearson’s correlation coefficient can be used. For categorical measures, accuracy, F1-score, recall, and precision are widely used, although other options exist (e.g., Cohen’s kappa). To ensure transparency, we suggest always reporting a confusion matrix for categorical measures so that readers can compute alternative accuracy metrics if needed.Footnote 12 In addition, reporting face validity can give readers more confidence that the model’s predicted scores capture the latent trait. For example, the model in Stier et al. (2022) for identifying political news articles heavily weighted words like “trump,” “president,” “house,” “mueller,” “democrats,” and “campaign,” reassuring readers that predictions are driven by sensible features.
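The following sketch on simulated data separates hyperparameter tuning (cross-validation inside the training portion) from the final held-out assessment and reports a confusion matrix. The caret interface and the svmRadial learner (which also requires the kernlab package) are illustrative choices, not recommendations.

```r
library(caret)  # the svmRadial method below also requires the kernlab package

set.seed(1)
X <- matrix(rnorm(300 * 20), 300, 20)
colnames(X) <- paste0("f", 1:20)
y <- factor(ifelse(X[, 1] + rnorm(300) > 0, "pos", "neg"))

# Hold out 20% of the labeled data before any tuning takes place
in_train <- createDataPartition(y, p = 0.8, list = FALSE)

# Tune hyperparameters by 10-fold cross-validation *within* the training portion only
fit <- train(x = X[in_train, ], y = y[in_train], method = "svmRadial",
             trControl = trainControl(method = "cv", number = 10), tuneLength = 3)

# Evaluate on the held-out rows the model never saw during tuning
pred <- predict(fit, X[-in_train, ])
confusionMatrix(pred, y[-in_train], mode = "prec_recall")
```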
Finally, scores are imputed for all documents to create the final measure. The quality of this measure depends on all previous steps, sometimes in ambiguous ways. Therefore, we recommend multiple validation methods including face validity, convergent validity, and predictive validity. We provide explicit examples in the following section (see also Grimmer and Stewart, 2013a; Goet, 2019).
In summary, this section provides a framework for transparently reporting key decisions and validation steps in a supervised text-to-measure pipeline. By following the guideline in Figure 2, researchers ensure readers understand critical choices at each stage: subsetting the corpus, labeling training data, transforming textual features, modeling feature-label relationships, and imputing scores. This transparency enhances reproducibility and allows readers to critically evaluate the measure’s validity and reliability.
Our guidelines emphasize the importance of thoroughly reporting the details of the measurement process and its validation. Ideally, this information should be included in the main text of the study wherever possible, especially if the measure is a key explanatory or dependent variable. Depending on the centrality of the measurement to the research and the space constraints of journal articles, many of these details may need to be provided in the appendix. However, we encourage researchers to present at least the three-stage validation metrics in the main text to ensure transparency and rigor.
To illustrate these principles in practice, we now turn to a concrete example: measuring the tone of senators’ statements in 916 confirmation hearings. By walking through each pipeline step, we demonstrate how careful reporting can bolster confidence in the final measure and provide insights for future applications.
4. Application: Senate confirmation hearings
In this section, we turn to an example that illustrates our framework as well as a specific set of decisions to operationalize our recommendations. Again, we emphasize that we view these particular choices as appropriate for our application. In other contexts, other decisions may be preferable. However, we believe that this example will serve as a template that other researchers can build from, alter, and revise to suit their own purposes.
In this example, we measure the positive or negative tone of senators’ statements in 916 US Senate confirmation or nomination committee hearing transcripts. These are from the 105th-115th Congresses and include both bureaucratic and judicial nominations. We collected these transcripts from the Government Publishing Office and Congress.gov. Typically, a committee hearing transcript records statements made by both members of Congress and witnesses. We focus only on statements made by committee members.Footnote 13 In total, our corpus includes 89,279 member statements.
Our goal is to estimate the tone of the questions senators asked of nominees. An example is the following statement:
“If you are confirmed, would you be willing to hold a moratorium until you are sure that all of the local law enforcement officers have been properly trained and understand what the authority is under 287(g), which is not to stop people just based on the color of their skin? I am wondering whether or not you would be able to move in that direction.”
By our definition, a positive statement either expresses support for a nominee (e.g., by praising their qualifications, achievements, or character) or provides a favorable description of a situation (such as highlighting policy successes).
4.1. Partitioning
We used a simple random sampling scheme to select 3,600 paragraphs for labeling and/or validity testing.Footnote 14 The paragraphs were chosen through the following procedure: statements containing more than 120 words and composed of multiple paragraphs were broken down into paragraphs; paragraphs that were too short (containing fewer than fifty words) were combined with adjacent paragraphs in the same statement; subsequently, among those containing fifty to 120 words,Footnote 15 3,600 paragraphs were chosen at random. The 3,000 documents used to train our final model (Groups A and B described below) include statements from 781 unique hearings and 200 unique senators (45.7% Republican and 55.17% Democrat).
These 3,600 paragraphs were divided into four subsets of 2,900 (Group A), 100 (Group B), 500 (Group C), and 100 (Group D) paragraphs, respectively. Table 2 summarizes our partitioning strategy. The first three sets (n = 3,500) were labeled by Amazon Mechanical Turk (MTurk) online workers using a common procedure described below. We also hand-labeled 200 documents (Groups B and D) on a five-point scale for downstream validation.Footnote 16 The 500 paragraphs in Group C were labeled but held out from the model building, and the 100 documents in Group D were excluded from the crowdsourced labeling and model building entirely.
The partitioning strategy we propose is valuable in that it allows us to perform validations at three different stages. First, we can validate the crowdsourced labels by comparing the crowdsourced and expert labels for Group B. Second, we can validate our model using a held-out sample by training the model on Groups A and B only and comparing the model’s predictions for Group C to its crowdsourced labels. Third, we can validate our final measure using a held-out sample by comparing the model’s predictions for Group D with our own coding of those documents. We explain these procedures in more detail in the following sections.
4.2. Labeling
We used MTurk workers to create a measure of tone based on pairwise comparisons, following Carlson and Montgomery (2017). Briefly, we randomly paired documents in the sample and asked workers to choose the one that more strongly exhibited positivity. We created 25,000 random pairs to generate the same number of human intelligence tasks (HITs). Each document appeared in the pairwise comparisons with a similar frequency. We then used the Bradley-Terry model to estimate the latent position of each document on a continuous scale, which we refer to as the crowdsourced score of the documents. Figure C3 in the Online Supplementary Material shows the distribution of the measure. We explain the coding instructions and worker management in Online Supplementary Material Section C. We assess the reliability of the measure by comparing estimates generated using only the first half of the HITs to estimates generated from the second half. These estimates are correlated at 0.593. This may seem low, but it should be interpreted in light of the fact that the pairwise-comparison approach is sensitive to the volume of comparisons, which this check cuts in half.
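For readers unfamiliar with this scoring step, the sketch below fits a Bradley-Terry model to toy pairwise comparisons using the BradleyTerry2 package; it is a generic illustration of the idea, not the exact implementation used here, and all comparison outcomes are invented.

```r
library(BradleyTerry2)

# Toy pairwise comparisons: each row records which of two documents a worker
# judged more positive (all values hypothetical)
doc_ids <- paste0("doc", 1:5)
pairs <- data.frame(
  doc1 = factor(c("doc1", "doc2", "doc3", "doc4", "doc1", "doc2", "doc3", "doc1"),
                levels = doc_ids),
  doc2 = factor(c("doc2", "doc3", "doc4", "doc5", "doc3", "doc4", "doc5", "doc5"),
                levels = doc_ids),
  win1 = c(1, 1, 0, 1, 0, 1, 0, 1),   # 1 if the first document was chosen
  win2 = c(0, 0, 1, 0, 1, 0, 1, 0)
)

bt <- BTm(outcome = cbind(win1, win2), player1 = doc1, player2 = doc2, data = pairs)
BTabilities(bt)   # estimated latent positivity of each document

# Split-half reliability: refit the model on each half of the comparisons and
# correlate the two sets of estimated abilities.
```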
Next, we validate whether our labels capture the concept of interest, since low-quality labels can lead to a low-quality final measure. One option is to check the face validity of the labels. Table C2 shows 10 randomly chosen paragraphs along with their estimated scores.Footnote 17 These generally confirm that more negative statements are assigned lower values.Footnote 18
Second, we assess the convergent validity of the labels by comparing them with our expert coding. We hand-coded 100 of these documents (Group B in Table 2) on a five-point scale. The correlation coefficient between the two scores is 0.808 (see Figure C4), again suggesting that the labels are a valid measure.
4.3. Feature representation
As we had no a priori reason to prefer any specific pre-processing approach, we created an ensemble of models based on term-document frequency representations (TDF), embeddings, and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019).
For the TDF, we remove punctuation and numbers, lowercase, stem, remove stopwords,Footnote 19 include unigrams and bigrams, and remove infrequent words by retaining only the 5,000 features (unigrams and bigrams) that appear most frequently in the corpus. To choose among these pre-processing alternatives, we fit a Support Vector Machine (SVM) model in R and, at each step, selected the alternative with the best predictive power as measured by 10-fold cross-validation in the training sample. The validation statistics we relied on are the out-of-fold RMSE and the correlation coefficient between the crowdsourced labels and the model predictions. The end result was a TDF representation with 5,025 features.Footnote 20 Of these, 4,921 (97.93%) features are contained in the training set.Footnote 21
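A schematic version of this comparison is sketched below: a helper computes the out-of-fold RMSE and correlation for one document-feature matrix and would be called once per candidate pre-processing specification. The e1071 SVM and the simulated data are stand-ins.

```r
library(e1071)

# Out-of-fold RMSE and Pearson's r for one candidate pre-processing specification;
# X is a document-feature matrix, y the crowdsourced scores
cv_score <- function(X, y, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = length(y)))
  preds <- numeric(length(y))
  for (f in seq_len(k)) {
    fit <- svm(x = X[folds != f, , drop = FALSE], y = y[folds != f])
    preds[folds == f] <- predict(fit, X[folds == f, , drop = FALSE])
  }
  c(rmse = sqrt(mean((preds - y)^2)), r = cor(preds, y))
}

# Simulated stand-in data so the sketch runs on its own
set.seed(1)
X <- matrix(rnorm(300 * 10), 300, 10)
y <- X[, 1] + rnorm(300)
cv_score(X, y)   # in practice: one call per pre-processing alternative
```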
For embeddings, we use the word2vec model, an unsupervised learning model also known as a distributed vector representation of words (Mikolov et al., 2013). This model assigns each word a numeric vector that captures the context in which the word appears in a corpus by measuring its relationship with the surrounding words. We first took pre-processing steps, including removing punctuation, numbers, and stopwords, lowercasing, and stemming. To tune the embeddings for our task, we varied four key parameters, generating ninety-six different combinations: 400, 500, and 600 for the dimension of a word vector; 5, 7, 10, and 12 for the window size; 10, 15, 30, and 45 epochs; and only unigrams versus both uni- and bigrams. Among the ninety-six specifications, the one with 400 dimensions, a window size of 5, 15 epochs, and only unigrams performed best in our cross-validation.
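A toy version of this step, assuming the word2vec R package (the text does not commit to a particular implementation), looks roughly as follows; the documents are invented and the dimensions are scaled down.

```r
library(word2vec)  # one possible implementation; others would work equally well

txt <- c("we thank the nominee for her distinguished service",
         "i appreciate your leadership and experience",
         "the report you quoted is simply not correct",
         "this regulation violates the standard the court set")

# Train a small toy model; at application scale the grid varied
# dim (400/500/600), window (5/7/10/12), and epochs (10/15/30/45)
m <- word2vec(x = txt, type = "cbow", dim = 25, window = 5, iter = 15, min_count = 1)

# Document-level features: average the word vectors within each document
doc_vectors <- doc2vec(m, newdata = txt)
dim(doc_vectors)   # one row per document, one column per embedding dimension
```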
BERT is an unsupervised deep learning model that was pre-trained on a large corpus to capture representations of words and sentences and their semantic relationships. It can be fine-tuned with a small amount of data at the user’s end (Devlin et al., 2019). As a tuning parameter, we varied the number of epochs and chose the value that resulted in the smallest loss.Footnote 22
4.4. Fitting learners
To fit our model, we combined Groups A and B in Table 2 to create a training set of 3,000 statements.Footnote 23 We fit 13 models in total and combined them using ensemble Bayesian model averaging (Montgomery et al., 2012). Table D4 provides a complete list of these learners and shows their predictive performance, calculated via cross-validation within our 3,000-document training set.Footnote 24
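We do not reproduce the EBMA estimation here. The sketch below shows a simpler stacking-style alternative in the same spirit: non-negative, normalized weights are derived from (simulated) out-of-fold predictions of several component learners.

```r
library(nnls)   # non-negative least squares

# P: out-of-fold predictions from each component learner (columns); y: labels.
# Simulated stand-in values; in practice P comes from cross-validated predictions.
set.seed(1)
y <- rnorm(200)
P <- cbind(m1 = y + rnorm(200, sd = 0.4),
           m2 = y + rnorm(200, sd = 0.8),
           m3 = rnorm(200))               # an uninformative learner

w <- nnls(P, y)$x
w <- w / sum(w)                            # non-negative weights that sum to one
round(w, 2)

# The combined measure is the weighted average of the component predictions
ensemble_pred <- as.numeric(P %*% w)
```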
Assessing the model itself is a second validation point. However, we have used the training data set to (i) determine the best approaches for text pre-processing, (ii) tune individual models, and (iii) create ensemble weights for the final model. Reusing the training data in this way—even with techniques like cross validation—risks overfitting.
Therefore, we used our model to predict the scores of the 500 documents held back for model validation (Group C). The RMSE for this set was 0.538, which compares favorably to the RMSE of the best single learning algorithm, 0.604, achieved by the support vector machine using the doc2vec matrix.Footnote 25 The Pearson’s correlation coefficient between the crowdsourced scores and the ensemble scores is 0.757 (see Figure D6). In all, these results suggest that the ensemble learner can accurately, if imperfectly, predict crowdsourced labels from textual features.
In addition, we examined the most frequent word stems representing positive and negative tones, obtained through the following procedure. First, we selected paragraphs whose scores fell in the top quartile (most positive statements) and the bottom quartile (most negative statements) of the entire corpus, excluding the scored set,Footnote 26 and extracted the 300 most frequent word stems from each side.Footnote 27 Then, we excluded 178 overlapping features to create a list of the most frequent and mutually exclusive stems.
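On toy data, this quartile-and-set-difference procedure looks roughly as follows; the documents and scores are invented, and quanteda's topfeatures() is simply one convenient way to extract frequent stems.

```r
library(quanteda)

# Toy documents and predicted tone scores (hypothetical values)
txt <- c("thank you for your distinguished service and leadership",
         "i appreciate your experience and am proud to support you",
         "the report is not correct and the numbers do not add up",
         "this rule violates the constitution and the standard the court set")
scores <- c(0.9, 0.8, -0.7, -0.9)

d <- dfm(tokens_wordstem(tokens(txt, remove_punct = TRUE)))

top_pos <- names(topfeatures(d[scores >= quantile(scores, 0.75), ], 300))
top_neg <- names(topfeatures(d[scores <= quantile(scores, 0.25), ], 300))

# Keep only stems that are frequent on one side but not the other
setdiff(top_pos, top_neg)
setdiff(top_neg, top_pos)
```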
We present the top 50 words from each category in Table D5 in the Online Supplementary Material and find the following patterns: positive statements feature expressions of appreciation (e.g., “thank-much”, “appreci”), greetings (e.g., “pleas”, “honor”), endorsement (e.g., “proud”, “friend”) and, most importantly, compliments on nominees’ ability (e.g., “experi”, “energi”, “distinguish”, “leadership”) or work-related experience (e.g., “ambassador”, “develop”, “director”, “manag”, “univers”); in contrast, negative statements tend to reference problems (e.g., “problem”, “matter”), personal views (e.g., “agre”, “opinion”, “view”), money (e.g., “$”, “money”), fact-checks (e.g., “correct”, “report”, “percent”, “quote”), and rules (e.g., “constitut”, “rule”, “regul”, “standard”, “suprem-court”). Given the context, these results give us further confidence that the model is leveraging appropriate textual features to measure the latent trait.
4.5. Final output
Model validation provides evidence that our learner can accurately predict the labels, and we took steps earlier to validate the labels themselves. However, we can further assess the validity of the final measure itself.
To assess the convergent validity of the measure, we turn to the 100 paragraphs that we scored on a five-point scale but held back from all of the previous steps (Group D). The correlation coefficient between the two measurements is 0.743, suggesting that the expert ratings and the ensemble predictions are picking up the same latent trait (see Figure D7).
To assess the face validity of the measure, Table 3 provides five examples of scored statements along with the five-point expert coding and the scores generated by the ensemble prediction. These were randomly selected to ensure that one document from each of the five categories was included. In these examples, the two scores track each other closely: statements scoring lower convey a more negative tone than those scoring higher, which supports the face validity of our ensemble prediction scores.
Table 3. Five random sample statements

Finally, we check predictive validity by analyzing whether members’ speaking tone in Senate confirmation hearings correlates with institutional factors in theoretically expected directions. First, we expect members’ speaking tone to become more negative over time due to the increasing level of partisan polarization (McCarty, 2019) and intensified partisan conflict in Congress (Lee, 2016). Second, members’ questions asked of appointees across party lines should be more negative than questions asked of appointees from their own party. Third, we expect tone to be more negative in judicial hearings than in nominations to the bureaucracy.Footnote 28
To facilitate substantive interpretation of changes in members’ speaking tone, we rescaled the ensemble prediction scores to range from 0 to 100. The rescaled measure has a mean of 45.08 and a standard deviation of 15.58. Using this measure, we find support for most of our expectations. First, Figure 3a presents the average tone of members in each congress with 95% confidence intervals. We can see a slight downward trend, especially in the 115th Congress, the first two years of the Trump administration. However, the pattern is not as strong as expected.

Figure 3. Changes in members’ tone (a) By congress. (b) By appointment type.
Second, we analyze whether the same member tends to be tougher on judicial nominees than on bureaucratic nominees. We include only the statements made by those who participated in confirmation hearings for both types of appointments, which totals 79,158 statements made by 161 senators. We then compare the average tone of member statements made in bureaucratic nomination hearings with those from judicial nomination hearings. These statistics are presented in Figure 3b with 95% confidence intervals. As expected, senators tend to speak more negatively to judicial appointees (43.557) than to bureaucratic nominees (45.782).
Third, we calculate the average tone for Democrats and Republicans, respectively, for each presidential administration.Footnote 29 Figure 4 shows that senators are generally more negative toward those nominated by presidents from the opposite party. The partisan gap is visible in all four administrations. It is interesting to note that, first, the partisan gap grows over time, and second, the tones of both parties become somewhat more negative across administrations. Both findings are consistent with the intensified partisan conflict reported in the literature (Lee, 2016).

Figure 4. Tone by party and administration.
One potential question is whether taking these steps actually matters for the quality of the measure. In part, this question is tangential to our argument, in that our goal is not merely to build superior measures but to communicate the quality of the measure to readers transparently and robustly. Nonetheless, it is informative to compare our measure to alternative strategies. Thus, in Section E, we pursue three alternative strategies—a dictionary-based sentiment measure, a pre-trained sentiment classifier, and our own method with various shortcuts—and show that the full measurement procedure outperforms these alternatives.
5. Considerations for large language models
Before concluding, we consider how our recommendations apply to large language models (LLMs). While the use of LLMs reduces reliance on labeled data, the “black box” nature of LLMs heightens the need for clear disclosure and rigorous validation. Thus, our framework remains essential, though certain aspects require adaptation.
Two key scenarios illustrate this. For small corpora, LLMs can generate labels for the entire dataset, reducing the role of supervised learning. For large corpora, full LLM processing may be computationally costly, making hybrid approaches more practical. In such cases, LLMs can generate high-quality training data for traditional supervised models, necessitating careful documentation of their role in label creation.
In both scenarios, LLM-generated labels require methodological transparency. Researchers must document prompting strategies by reporting the exact prompts used, providing example outputs, and systematically testing prompt effectiveness. Additional techniques, such as varying prompts to test stability or comparing results across different LLMs or across time to account for model updates, may be necessary. Researchers should also document the version of the model and parameter settings while acknowledging LLM limitations, such as biases and replication challenges.
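One way to keep such records is sketched below: the exact prompt text, model identifier, and parameters are stored alongside each label, and labels are compared across prompt variants. The query_llm() function is a stub standing in for whichever API client a researcher actually uses, and all names and values are hypothetical.

```r
# Stub standing in for a real API call; replace with the provider's client of choice
query_llm <- function(prompt, model, temperature) {
  sample(c("positive", "negative", "neutral"), 1)
}

prompt_variants <- c(
  v1 = "Classify the tone of this Senate statement as positive, negative, or neutral: %s",
  v2 = "Is the senator's tone toward the nominee positive, negative, or neutral? %s"
)
doc <- "I appreciate your distinguished record of public service."

records <- lapply(names(prompt_variants), function(v) {
  data.frame(prompt_version = v,
             prompt         = prompt_variants[[v]],
             model          = "model-name-and-version",   # document the exact model used
             temperature    = 0,
             label          = query_llm(sprintf(prompt_variants[[v]], doc),
                                        model = "model-name-and-version",
                                        temperature = 0))
})
do.call(rbind, records)   # compare labels across prompt variants for stability
```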
As with human-generated labels, LLM labels should undergo rigorous validation processes including face validity checks and comparison with expert coding to ensure alignment with intended concepts. Researchers should maintain held-out validation datasets to assess label consistency and alignment with human judgments, ensuring an out-of-sample test of the prompt and model. As these methods continue to advance, establishing rigorous standards for documentation and validation will be critical to ensuring scientific credibility.
6. Conclusion
Social scientists increasingly use supervised learning to study latent concepts in large data sets. Validating new measurements has long been crucial, yet our review of recent articles using supervised learning finds inconsistent reporting and validation standards. To address this, we propose a framework that identifies key decision points to be reported and multiple stages that require validation—namely, the label, the prediction model, and the final measure. While not every process must follow all steps, our guidelines establish general standards for transparency and reliability in supervised learning-based measurement.
Our framework can easily be extended to other types of data. Importantly, while our specific example involves a continuous measure, the general guidance we provide works equally well for categorical outcomes, with alternative predictive performance metrics suited to categorical data. In addition, our guideline extends to research using non-text data, such as images, video, and audio. While the specific steps will differ across these domains, researchers face similar underlying issues, such as how to choose subsets, acquire labels, pre-process inputs, train models, and validate results.
As computational tools evolve, the core challenge remains: researchers must clearly communicate their measurement procedures and validate results. Our proposed standards are not restrictive but establish a shared vocabulary and baseline expectations for assessing new measures. Just as reporting standards exist for surveys and experimental designs, clear guidelines for supervised learning will support scientific progress. Our framework lays this foundation, adapting to rapid technological advancements while emphasizing transparency and validation. As new tools emerge, maintaining these commitments will be essential.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2025.10042. Replication material is available at https://doi.org/10.7910/DVN/AFBW80.