1. Introduction
In the past decade, computer-assisted text analysis has evolved from a rarely used technology to an increasingly ubiquitous approach in political science. Access to massive volumes of digital text, combined with advances in computing and algorithm design, makes text analysis easier to implement and more relevant in research domains as diverse as state repression (Gohdes, 2020), news exposure (Guess, 2021; Stier et al., 2022), political advertising (Fowler et al., 2021), congressional committees (Casas et al., 2020; Park, 2021), and more.
One common task is measuring important latent concepts embedded within text via supervised machine learning. Generically, we imagine a researcher who wishes to measure a predefined construct, such as negativity (Fowler et al., 2021), grandstanding (Park, 2021), or Islamophobia (Alrababa’h et al., 2021). The goal is to create a mapping from the written text to a valid quantitative measure. However, doing so requires multiple steps including data partitioning, label acquisition, text pre-processing, and model fitting. At each stage, researchers must make choices, many of which are consequential to the results. Thus, creating a measure can be conceptualized as a set of interconnected procedures. It is a pipeline, where the outputs of each stage are passed on to the next so that the final result reflects the cumulative decisions made along the way.
Although these methods are powerful and increasingly accurate, the complexity of this pipeline, combined with current reporting standards, presents a problem. To begin with, few of the decisions made at each stage are justified to readers, and often they are not even described. This lack of transparency is particularly problematic given the sequential nature of the pipeline; information about one step of the process in isolation can be uninformative or even deceptive. Worse, standard practice provides readers with little objective evidence about the validity of the procedures or even the final outputs. Together, this means that far too often readers are presented with statistical analyses of text-based measures where the measurement procedure has not been adequately explained, evaluated, or validated.
To address this limitation, this article introduces a conceptual framework for understanding, reporting, and validating the text-to-measure pipeline. Inspired by the total survey error approach in public opinion research (e.g., Groves and Lyberg, 2010; Althaus et al., 2022), we take a holistic view to identify potential sources of error in the pipeline and clarify what should be reported. We systematically outline each stage and provide detailed guidelines, aiming to establish a reporting standard that enhances transparency and credibility. Additionally, we emphasize the importance of rigorous validation, echoing Grimmer and Stewart (2013a): “validate, validate, validate.” Specifically, we highlight three key stages where transparent assessment of performance is crucial: labeling, model fitting, and out-of-sample imputation.
Our goal is not to criticize any prior research. Nor do we intend to provide definitive “best practices,” since the field of text-as-data is too diverse and methods are evolving too rapidly for that to be feasible. Best practices for any particular measurement exercise are too context-dependent for any advice to be authoritative, although we do provide a set of reporting standards that can serve as a reasonable default. Rather, this paper adds to the ongoing efforts in the computational social science community to establish frameworks for conceptualizing, reporting, and evaluating text-as-data methods (Grimmer and Stewart, 2013a; Ying et al., 2021; Grimmer et al., 2022; Kapoor et al., 2022), specifically focusing on the use of supervised learning to measure latent concepts. We provide an approachable and coherent explanation of the various steps involved, the key decision points for researchers, and a framework to make empirical results more credible and transparent to readers.
The rest of the paper proceeds as follows. The next section provides a brief overview of the measurement pipeline—including sampling, label acquisition, text representation, model fitting, and out-of-sample imputation—and then reviews current practices in the field in terms of whether and how these stages are reported and evaluated. We find that there is a great deal of heterogeneity in reporting practices and that robust validation at any stage of the pipeline is surprisingly rare, even in the field’s top journals. To address this, Section 3 presents a guideline for researchers, including the key decisions to be reported and the validation analyses to be undertaken. Section 4 then illustrates our framework with an application to U.S. Senate confirmation hearings held from 1997 to 2019. Specifically, we build and validate a new measure of tone in over 89,000 statements made by senators in these hearings. We conclude with a brief discussion of how these recommendations might shift as the field more widely adopts alternative measurement strategies based on large language models (LLMs).
2. A supervised learning pipeline for measurement
To limit the scope of this discussion, we make several assumptions about the research goal or “use case” we address. First, we assume that the researcher has a predefined concept of interest, thus excluding exploratory techniques such as unsupervised topic models; second, that the goal is to create a valid measure at the document level, such that each speech, tweet, or article is scored; and third, that the researcher wishes to use a supervised learning method.
2.1. Pipeline overview
The goal in the text-to-measure pipeline is to train a model using a subset of the data so that values for the latent variable in the rest of the corpus can be imputed. A schematic for this exerciseFootnote 1 is shown in Figure 1. First, we select a subset of the data to use to fit a supervised model. At this stage, we also set aside documents for validation. Identifying validation sets up front is a crucial step, which we emphasize at multiple stages in our proposed guideline. Second, we label the documents in the training set on the latent trait of interest. In the simplest case, this is done manually by research assistants, crowdsourced workers, or generative LLMs. Third, we pre-process the text and extract a set of features that can be represented numerically. For example, we might remove punctuation and create a term-document matrix (see Online Supplementary Material Section B for more details). More advanced pre-processing steps are possible, such as named entity recognition, which identifies important persons or places in the text (e.g., Fowler et al., 2021). Fourth, we fit some machine learning algorithm(s) to accurately predict the labels as a function of the inputs. This stage typically requires the selection of hyperparameters. Finally, we impute the latent trait for the entire corpus with the fitted model.

Figure 1. Summary of the supervised learning pipeline.
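To make these five steps concrete, the following R sketch runs the pipeline end to end on a toy corpus. The documents, labels, and object names are hypothetical, and the specific tools (a quanteda term-document matrix and an e1071 support vector machine) are illustrative choices rather than recommendations.

```r
library(quanteda)  # text representation
library(e1071)     # support vector machine

txt <- c("we thank the nominee for her distinguished service",
         "i appreciate your leadership and experience",
         "the report you quoted is simply not correct",
         "this regulation violates the standard the court set",
         "congratulations on an impressive and productive career",
         "i am troubled by these repeated and serious problems",
         "your qualifications and record are outstanding",
         "explain why the numbers in your testimony do not add up")

labeled_ids   <- 1:6   # Step 1: partition -- label a subset, hold the rest out
unlabeled_ids <- 7:8

y <- c(1, 1, 0, 0, 1, 0)   # Step 2: toy "tone" labels for the training documents

# Step 3: pre-process and build a term-document frequency representation
X <- as.matrix(dfm(tokens(txt, remove_punct = TRUE)))

# Step 4: fit a supervised learner on the labeled rows
fit <- svm(x = X[labeled_ids, ], y = y, scale = FALSE)

# Step 5: impute the latent trait for the remaining documents
predict(fit, X[unlabeled_ids, ])
```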
This approach is attractive for several reasons. Most importantly, supervised learning is much more cost-effective than labeling the entire corpus through a manual or crowdsourced process. Although those approaches are possible for tasks like open-text responses (e.g., Bøggild et al., 2021), the size of contemporary text corpora can quickly exceed the budget of any research team. Fowler et al. (2021), for instance, analyzed over 400,000 Facebook ads. In our own example below, the corpus includes 89,279 statements. In addition, supervised learning makes it relatively easy to impose the researcher’s question on the corpus: it builds a model designed to precisely estimate a predefined concept rather than to make broad discoveries within the corpus. This makes supervised learning output more interpretable for downstream tasks, such as theory testing, and helps it perform much better than dictionary methods (Grimmer and Stewart, 2013a).
The downside is that the end-to-end process can become elaborate; to some, it may even seem convoluted. Furthermore, at each step, researchers must make many decisions, often with little guidance. This adds many “researcher degrees of freedom” to the process and makes reporting results cumbersome. Researchers should therefore consider the following questions. Which decisions need to be detailed for readers? What evidence should be provided to give readers confidence in the procedures and the validity of the measure?
2.2. Current practices for reporting and validation
To motivate our discussion in the following sections, we first review current practices in the field in terms of how these steps are reported and validated. To do so, we searched the 2020–2023 volumes of the American Political Science Review (APSR) and the American Journal of Political Science (AJPS) for all articles that used supervised learning of texts to create measures of latent concepts. In total, we identified 14 articles and reviewed each along with its appendices.Footnote 2 The results are shown in Table 1.
Table 1. A review of current reporting and validation practices applying the text-to-measure pipeline

Analysis of transparency in articles applying the text-to-measure pipeline with supervised machine learning in the APSR and AJPS (2020–2023). The total number of articles varies across rows because not all categories are relevant to all projects.
First, most cases (10/13) provide at least some details of how a subset was selected for scoring.Footnote 3 However, in only three out of 14 articles were any observations “held out” in advance for downstream validation, a striking finding given the importance of out-of-sample validation in supervised learning methods.
Second, we examine how document scores were assigned and evaluated. Only eight out of 14 articles reported any details for labeling/scoring. In two other cases, the articles reference existing coding schemes that are not explained in the article itself (Fowler et al., 2021; Wahman et al., 2021). It is even less common for articles to provide assessments of these scores. Only four out of 14 (29%) evaluated the quality of their labels. Alrababa’h et al. (2021) report inter-coder reliability. Schub (2022) reports inter-coder reliability and provides two example posts to support the face validity of the labels. Only Zubek et al. (2021) and Emeriau (2023) report steps to validate the labels more extensively.
Next, it is common to describe text pre-processing steps: ten out of 14 articles provided at least some details. Model validation of some sort is almost universal (13/14), mostly appearing as some form of within-sample fit statistic (e.g., precision and recall for binary classifiers). However, other forms of model validationFootnote 4 were rarer, appearing in six of 14 articles (43%). Two examples are Schub (2022) and Stier et al. (2022), which report highly predictive word stems. Four other papers took similar steps.Footnote 5
In addition, assessments of fit were almost always conducted within the training sample.Footnote 6 This raises concerns because in many cases the machine learning model was selected or tuned based on performance within that same sample. For example, Casas et al. (2020) fits a large number of models and chooses the most accurate to create an ensemble. If models are tuned and evaluated on the same sample, the risk of overfitting increases. However, in our review, only Hager and Hilbig (2020), Wahman et al. (2021), and Emeriau (2023) preserved cases for out-of-sample assessments.Footnote 7 Moreover, whether the tuning and testing were done on the same samples is difficult to assess because the tuning is explained or justified in only nine out of 14 articles.Footnote 8
Finally, and most critically, we examined whether the final scores—the measure to be used in downstream analyses—were validated. Stier et al. (2022), for instance, compares the final scores to a researcher-created metric to assess convergent validity. Anastasopoulos and Bertelli (2020) provides a validation by replicating basic findings from the existing literature. Gohdes (2020) provides the reader with a random selection of documents along with their predicted scores to assess face validity. Zubek et al. (2021) adopts a combination of these strategies.Footnote 9 Emeriau (2023) shows that the scores correlate with document metadata as expected by theory. However, in the remaining articles—nine out of 14 (64%) in the field’s top journals—no validation of the final measure is provided, despite the fact that these measures often serve as a main explanatory or dependent variable.Footnote 10
In all, we found that current reporting standards for the text-to-measure pipeline are surprisingly sparse. While some steps are commonly reported, others are rare, including (surprisingly) assessing the quality of the labels and final measures. In other words, in many articles neither the model inputs nor the model outputs are validated, leaving readers to judge the validity of the measure based only on in-sample predictive performance, which, in the absence of validated labels, does not actually speak to the quality of the measure.
We suspect that researchers are conducting additional analyses, but they are not included in the manuscript since there are few clear expectations about what should be reported. To address this, in the next section we articulate a framework for self-assessment and reporting of the text-to-measure pipeline.
3. Transparently reporting the pipeline
With the general schematic shown in Figure 1 in mind, we now discuss the most critical aspects of the text-to-measure pipeline that researchers should report. For convenience, we provide a detailed guideline in Figure 2 that captures these key considerations.

Figure 2. A guideline for reporting and validating the text-to-measure pipeline.
First, researchers should subset the overall corpus for labeling and validation. This subsetting procedure should be reported since this decision can significantly affect downstream outcomes. Ideally, the readers should be able to assess the representativeness, size, and diversity of the labeled corpus and validation sets.
Random sampling is a simple strategy to ensure representativeness. If the labeled set is not representative, it can introduce subtle distortions. In our application below, for instance, selecting only recent congressional documents might overemphasize temporally specific features (e.g., “trump” or “vaccine”) that are not dispositive for earlier periods. Similarly, Anastasopoulos and Bertelli (2020) notes that relying solely on major legislation texts could be consequential. However, researchers may choose to over-sample specific traits or use block sampling based on existing dictionaries for a balanced training set. In some cases, such as Guess (2021), the sample is built on the basis of available labels (news articles from specific web domains).
The optimal number of documents to score depends on several factors: the sampling strategy, the distribution and complexity of the latent trait, the text feature processing, and the learning algorithms. Although it is difficult to pinpoint the ideal size of the training set for these reasons, we suggest starting by labeling at least 300 observations. Note, however, that in many cases far more will be needed. One can then use a “learning curve” approach, adding observations until out-of-sample prediction rates plateau (Mohr and van Rijn, 2022), while still holding out reasonably sized validation sets.
Researchers should also report the diversity of the labeled subset along key traits. This is crucial because any feature absent from the labeled set cannot inform the learning algorithms about its mapping into the latent space. At a minimum, researchers should disclose the variation of the labeled set along the latent dimension of interest. For instance, if the labels are categorical, it is important to report how many observations from each class are included in the labeled set (e.g., with a histogram). For continuous measures, researchers should report measures of central tendency (e.g., median and mean) and spread (e.g., variance, range, and/or interquartile range).
Given sampled data that are representative, diverse, and sufficiently large, we suggest separating them into four groups for model fitting and validation. We present an example of this grouping strategy in the application section (see Table 2). The objective is to let the first group serve as the large, baseline training group and to reserve the other three smaller groups to validate the labels, the fitted models, and the final measure.
Table 2. Partitioning of the labeled data for training and three forms of validation

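As a concrete illustration of this partitioning, the R sketch below draws a simple random sample and splits it into four groups; the corpus and group sizes mirror the application in Section 4, and all object names are hypothetical.

```r
set.seed(2025)
n_corpus <- 89279                  # corpus size from the application below
sampled  <- sample(n_corpus, 3600) # simple random sample for labeling/validation

group  <- rep(c("A", "B", "C", "D"), times = c(2900, 100, 500, 100))
groups <- split(sampled, group)
lengths(groups)                    # A: training; B-D: three validation stages

# Reporting the diversity of the labeled subset (assuming a vector of labels):
# table(labels)                    # class counts for categorical labels
# summary(labels); sd(labels)      # central tendency and spread for continuous labels
```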
Second, researchers should detail their labeling process: who did the coding (e.g., researchers, trained research assistants, or online workers), what rules were followed, and any steps taken to improve label quality (e.g., multiple coding). They should assess both reliability, meaning the consistency of the labels (e.g., inter-coder statistics such as Krippendorff’s alpha for categorical variables or Pearson’s correlation coefficient for continuous variables, among many options), and validity. Validity can be established through face validity (providing readers with specific documents and their assigned labels), convergent validity (showing that the labels correspond to related measures as expected), or another strategy, such as comparing the labels to a gold-standard coding, if one is available or can be constructed by the researchers, and reporting accuracy rates (e.g., F1 or Root Mean Square Error (RMSE)) (Grimmer and Stewart, 2013b).
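The sketch below illustrates these reliability and validity checks on toy labels. The irr and caret packages are convenient options rather than requirements, and all values are invented for illustration.

```r
library(irr)    # Krippendorff's alpha
library(caret)  # confusion matrix, precision/recall/F1

# Toy categorical labels from two coders for ten documents
# (1 = negative, 2 = neutral, 3 = positive)
coder1 <- c(3, 1, 3, 2, 1, 3, 1, 2, 3, 1)
coder2 <- c(3, 1, 3, 1, 1, 3, 1, 2, 3, 3)
kripp.alpha(rbind(coder1, coder2), method = "nominal")  # inter-coder reliability

# Validity against a gold-standard coding (here coder2 stands in for the gold standard)
confusionMatrix(factor(coder1), factor(coder2), mode = "prec_recall")

# For continuous labels, Pearson's r and RMSE play the analogous roles
s1 <- c(0.2, -1.1, 0.8, 0.1, -0.6)
s2 <- c(0.4, -0.9, 0.7, -0.2, -0.5)
cor(s1, s2)
sqrt(mean((s1 - s2)^2))
```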
Third, researchers must address the fundamental challenge of reducing the textual input’s high-dimensional feature space to a manageable size while preserving essential semantic information. The most common approach begins with basic pre-processing steps: tokenization to create a term-document frequency (TDF) matrix (Grimmer and Stewart, 2013a), removal of stopwords, and stemming.Footnote 11 More sophisticated dimensionality reduction techniques, such as word embeddings (Mikolov et al., 2013; Rodriguez and Spirling, 2022), can capture complex semantic relationships in a lower-dimensional space. When selecting a feature representation strategy, researchers must balance two competing concerns: computational tractability and information preservation. Over-aggressive dimensionality reduction can lead to sparse representation in the labeled corpus, potentially eliminating crucial semantic distinctions. To navigate this trade-off, researchers can either empirically evaluate different feature representations through cross-validation and select the best-performing approach or, as we demonstrate in our application, combine models trained on different feature representations to optimize performance. When researchers lack a clear reason to favor any single representation, the ensemble approach to model building used below will typically yield more accurate predictions than any single model. However this trade-off is navigated, researchers should report these steps to readers.
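As one concrete, and by no means unique, implementation of the basic bag-of-words route, the R sketch below builds a term-document frequency matrix with quanteda on two toy documents; each step is optional and, whichever choices are made, should be reported.

```r
library(quanteda)

txt <- c("We thank the nominee for her distinguished record of public service.",
         "The figures in this report are simply not correct.")

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)  # tokenize
toks <- tokens_tolower(toks)                                     # lowercase
toks <- tokens_remove(toks, stopwords("en"))                     # drop stopwords
toks <- tokens_wordstem(toks)                                    # stem
toks <- tokens_ngrams(toks, n = 1:2)                             # unigrams + bigrams
tdf  <- dfm(toks)                                                # term-document frequency matrix

# Optionally keep only the most frequent features (e.g., the top 5,000 in a real corpus)
tdf <- dfm_keep(tdf, names(topfeatures(tdf, 5000)))
```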
Fourth, researchers train models to predict labels for unlabeled documents. They should disclose model fitting procedures and justify hyperparameter choices. Model validation can involve assessing predictive accuracy within the labeled set; however, for complex, tuned models, this risks overfitting. To mitigate this, researchers must evaluate predictive accuracy on a held-out validation set. The appropriate accuracy metrics depend on the type of measure. For continuous measures, RMSE and Pearson’s correlation coefficient can be used. For categorical measures, accuracy, F1-score, recall, and precision are widely used, although other options exist (e.g., Cohen’s kappa). To ensure transparency, we suggest always reporting a confusion matrix for categorical measures so that readers can compute alternative accuracy metrics if needed.Footnote 12 In addition, reporting face validity can give readers more confidence that the model’s predicted scores capture the latent trait. For example, the model in Stier et al. (2022) for identifying political news articles heavily weighted words like “trump,” “president,” “house,” “mueller,” “democrats,” and “campaign,” reassuring readers that predictions are driven by sensible features.
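The following sketch on simulated data separates hyperparameter tuning (cross-validation inside the training portion) from the final held-out assessment and reports a confusion matrix. The caret interface and the svmRadial learner (which also requires the kernlab package) are illustrative choices, not recommendations.

```r
library(caret)  # the svmRadial method below also requires the kernlab package

set.seed(1)
X <- matrix(rnorm(300 * 20), 300, 20)
colnames(X) <- paste0("f", 1:20)
y <- factor(ifelse(X[, 1] + rnorm(300) > 0, "pos", "neg"))

# Hold out 20% of the labeled data before any tuning takes place
in_train <- createDataPartition(y, p = 0.8, list = FALSE)

# Tune hyperparameters by 10-fold cross-validation *within* the training portion only
fit <- train(x = X[in_train, ], y = y[in_train], method = "svmRadial",
             trControl = trainControl(method = "cv", number = 10), tuneLength = 3)

# Evaluate on the held-out rows the model never saw during tuning
pred <- predict(fit, X[-in_train, ])
confusionMatrix(pred, y[-in_train], mode = "prec_recall")
```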
Finally, scores are imputed for all documents to create the final measure. The quality of this measure depends on all previous steps, sometimes in ambiguous ways. Therefore, we recommend multiple validation methods including face validity, convergent validity, and predictive validity. We provide explicit examples in the following section (see also Grimmer and Stewart, 2013a; Goet, 2019).
In summary, this section provides a framework for transparently reporting key decisions and validation steps in a supervised text-to-measure pipeline. By following the guideline in Figure 2, researchers ensure readers understand critical choices at each stage: subsetting the corpus, labeling training data, transforming textual features, modeling feature-label relationships, and imputing scores. This transparency enhances reproducibility and allows readers to critically evaluate the measure’s validity and reliability.
Our guidelines emphasize the importance of thoroughly reporting the details of the measurement process and its validation. Ideally, this information should be included in the main text of the study wherever possible, especially if the measure is a key explanatory or dependent variable. Depending on the centrality of the measurement to the research and the space constraints of journal articles, many of these details may need to be provided in the appendix. However, we encourage researchers to present at least the three-stage validation metrics in the main text to ensure transparency and rigor.
To illustrate these principles in practice, we now turn to a concrete example: measuring the tone of senators’ statements in 916 confirmation hearings. By walking through each pipeline step, we demonstrate how careful reporting can bolster confidence in the final measure and provide insights for future applications.
4. Application: Senate confirmation hearings
In this section, we turn to an example that illustrates our framework as well as a specific set of decisions to operationalize our recommendations. Again, we emphasize that we view these particular choices as appropriate for our application. In other contexts, other decisions may be preferable. However, we believe that this example will serve as a template that other researchers can build from, alter, and revise to suit their own purposes.
In this example, we measure the positive or negative tone of senators’ statements in 916 US Senate confirmation or nomination committee hearing transcripts. These are from the 105th-115th Congresses and include both bureaucratic and judicial nominations. We collected these transcripts from the Government Publishing Office and Congress.gov. Typically, a committee hearing transcript records statements made by both members of Congress and witnesses. We focus only on statements made by committee members.Footnote 13 In total, our corpus includes 89,279 member statements.
Our goal is to estimate the tone of the questions senators asked of nominees. An example is the following statement:
“If you are confirmed, would you be willing to hold a moratorium until you are sure that all of the local law enforcement officers have been properly trained and understand what the authority is under 287(g), which is not to stop people just based on the color of their skin? I am wondering whether or not you would be able to move in that direction.”
By our definition, a positive statement either expresses support for a nominee (e.g., by praising their qualifications, achievements, or character) or provides a favorable description of a situation (such as highlighting policy successes).
4.1. Partitioning
We used a simple random sampling scheme to select 3,600 paragraphs for labeling and/or validity testing.Footnote 14 The paragraphs were chosen through the following procedure: statements containing more than 120 words and composed of multiple paragraphs were broken down into paragraphs; paragraphs that were too short (containing fewer than fifty words) were combined with adjacent paragraphs in the same statement; subsequently, among those containing fifty to 120 words,Footnote 15 3,600 paragraphs were chosen at random. The 3,000 documents used to train our final model (Groups A and B described below) include statements from 781 unique hearings and 200 unique senators (45.7% Republican and 55.17% Democrat).
These 3,600 paragraphs were divided into four subsets of 2,900 (Group A), 100 (Group B), 500 (Group C), and 100 (Group D) paragraphs, respectively. Table 2 summarizes our partitioning strategy. The first three sets (n = 3,500) were labeled by Amazon Mechanical Turk (MTurk) online workers using a common procedure described below. We also hand-labeled 200 documents (Groups B and D) on a five-point scale for downstream validation.Footnote 16 The 500 paragraphs in Group C were labeled but held out from the model building, and the 100 documents in Group D were excluded from the crowdsourced labeling and model building entirely.
The partitioning strategy we propose is valuable in that it allows us to perform validations at three different stages. First, we can validate the crowdsourced labels by comparing the crowdsourced and expert labels for Group B. Second, we can validate our model using a held-out sample by training the model on Groups A and B only and comparing the model’s predictions for Group C to its crowdsourced labels. Third, we can validate our final measure using a held-out sample by comparing the model’s predictions for Group D with our own coding of those documents. We explain these procedures in more detail in the following sections.
4.2. Labeling
We used MTurk workers to create a measure of tone based on pairwise comparisons, following Carlson and Montgomery (2017). Briefly, we randomly paired documents in the sample and asked workers to choose the one that more strongly exhibited positivity. We created 25,000 random pairs to generate the same number of human intelligence tasks (HITs). Each document appeared in the pairwise comparisons with a similar frequency. We then used the Bradley-Terry model to estimate the latent position of each document on a continuous scale, which we refer to as the crowdsourced score of the documents. Figure C3 in the Online Supplementary Material shows the distribution of the measure. We explain the coding instructions and worker management in Online Supplementary Material Section C. We assess the reliability of the measure by comparing estimates generated using only the first half of the HITs to estimates generated from the second half. These estimates are correlated at 0.593. This may seem low, but it should be interpreted in light of the fact that the pairwise-comparison approach is sensitive to the volume of comparisons, which this check cuts in half.
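For readers unfamiliar with this scoring step, the sketch below fits a Bradley-Terry model to toy pairwise comparisons using the BradleyTerry2 package; it is a generic illustration of the idea, not the exact implementation used here, and all comparison outcomes are invented.

```r
library(BradleyTerry2)

# Toy pairwise comparisons: each row records which of two documents a worker
# judged more positive (all values hypothetical)
doc_ids <- paste0("doc", 1:5)
pairs <- data.frame(
  doc1 = factor(c("doc1", "doc2", "doc3", "doc4", "doc1", "doc2", "doc3", "doc1"),
                levels = doc_ids),
  doc2 = factor(c("doc2", "doc3", "doc4", "doc5", "doc3", "doc4", "doc5", "doc5"),
                levels = doc_ids),
  win1 = c(1, 1, 0, 1, 0, 1, 0, 1),   # 1 if the first document was chosen
  win2 = c(0, 0, 1, 0, 1, 0, 1, 0)
)

bt <- BTm(outcome = cbind(win1, win2), player1 = doc1, player2 = doc2, data = pairs)
BTabilities(bt)   # estimated latent positivity of each document

# Split-half reliability: refit the model on each half of the comparisons and
# correlate the two sets of estimated abilities.
```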
Next, we validate whether our labels capture the concept of interest, since low-quality labels can lead to a low-quality final measure. One option is to check the face validity of the labels. Table C2 shows 10 randomly chosen paragraphs along with their estimated scores.Footnote 17 These generally confirm that more negative statements are assigned lower values.Footnote 18
Second, we assess the convergent validity of the labels by comparing them with our expert coding. We hand-coded 100 of these documents (Group B in Table 2) on a five-point scale. The correlation coefficient between the two scores is 0.808 (see Figure C4), again suggesting that the labels are a valid measure.
4.3. Feature representation
As we had no a priori reason to prefer any specific pre-processing approach, we created an ensemble of models based on term-document frequency representations (TDF), embeddings, and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019).
For the TDF, we remove punctuation and numbers, lowercase, stem, remove stopwords,Footnote 19 include unigrams and bigrams, and remove infrequent words by retaining only the 5,000 features (unigrams and bigrams) that appear most frequently in the corpus. To choose among these pre-processing alternatives, we fit a Support Vector Machine (SVM) model in R and, at each step, selected the alternative with the best predictive power as measured by 10-fold cross-validation in the training sample. The validation statistics we relied on are the out-of-fold RMSE and the correlation coefficient between the crowdsourced labels and the model predictions. The end result was a TDF representation with 5,025 features.Footnote 20 Of these, 4,921 (97.93%) features are contained in the training set.Footnote 21
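A schematic version of this comparison is sketched below: a helper computes the out-of-fold RMSE and correlation for one document-feature matrix and would be called once per candidate pre-processing specification. The e1071 SVM and the simulated data are stand-ins.

```r
library(e1071)

# Out-of-fold RMSE and Pearson's r for one candidate pre-processing specification;
# X is a document-feature matrix, y the crowdsourced scores
cv_score <- function(X, y, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = length(y)))
  preds <- numeric(length(y))
  for (f in seq_len(k)) {
    fit <- svm(x = X[folds != f, , drop = FALSE], y = y[folds != f])
    preds[folds == f] <- predict(fit, X[folds == f, , drop = FALSE])
  }
  c(rmse = sqrt(mean((preds - y)^2)), r = cor(preds, y))
}

# Simulated stand-in data so the sketch runs on its own
set.seed(1)
X <- matrix(rnorm(300 * 10), 300, 10)
y <- X[, 1] + rnorm(300)
cv_score(X, y)   # in practice: one call per pre-processing alternative
```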
For embeddings, we use the word2vec model, an unsupervised learning model also known as a distributed vector representation of words (Mikolov et al., 2013). This model assigns each word a numeric vector that captures the context in which the word appears in a corpus by measuring its relationship with the surrounding words. We first took pre-processing steps, including removing punctuation, numbers, and stopwords, lowercasing, and stemming. To tune the embeddings for our task, we varied four key parameters, generating ninety-six different combinations: 400, 500, and 600 for the dimension of a word vector; 5, 7, 10, and 12 for the window size; 10, 15, 30, and 45 epochs; and only unigrams versus both uni- and bigrams. Among the ninety-six specifications, the one with 400 dimensions, a window size of 5, 15 epochs, and only unigrams performed best in our cross-validation.
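A toy version of this step, assuming the word2vec R package (the text does not commit to a particular implementation), looks roughly as follows; the documents are invented and the dimensions are scaled down.

```r
library(word2vec)  # one possible implementation; others would work equally well

txt <- c("we thank the nominee for her distinguished service",
         "i appreciate your leadership and experience",
         "the report you quoted is simply not correct",
         "this regulation violates the standard the court set")

# Train a small toy model; at application scale the grid varied
# dim (400/500/600), window (5/7/10/12), and epochs (10/15/30/45)
m <- word2vec(x = txt, type = "cbow", dim = 25, window = 5, iter = 15, min_count = 1)

# Document-level features: average the word vectors within each document
doc_vectors <- doc2vec(m, newdata = txt)
dim(doc_vectors)   # one row per document, one column per embedding dimension
```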
BERT is an unsupervised deep learning model that was pre-trained on a large corpus to capture representations of words and sentences and their semantic relationships. It can be fine-tuned with a small amount of data at the user’s end (Devlin et al., 2019). As a tuning parameter, we varied the number of epochs and chose the value that resulted in the smallest loss.Footnote 22
4.4. Fitting learners
To fit our model, we combined Groups A and B in Table 2 to create a training set of 3,000 statements.Footnote 23 We fit 13 models in total and combined them using ensemble Bayesian model averaging (Montgomery et al., 2012). Table D4 provides a complete list of these learners and shows their predictive performance, calculated via cross-validation within our 3,000-document training set.Footnote 24
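We do not reproduce the EBMA estimation here. The sketch below shows a simpler stacking-style alternative in the same spirit: non-negative, normalized weights are derived from (simulated) out-of-fold predictions of several component learners.

```r
library(nnls)   # non-negative least squares

# P: out-of-fold predictions from each component learner (columns); y: labels.
# Simulated stand-in values; in practice P comes from cross-validated predictions.
set.seed(1)
y <- rnorm(200)
P <- cbind(m1 = y + rnorm(200, sd = 0.4),
           m2 = y + rnorm(200, sd = 0.8),
           m3 = rnorm(200))               # an uninformative learner

w <- nnls(P, y)$x
w <- w / sum(w)                            # non-negative weights that sum to one
round(w, 2)

# The combined measure is the weighted average of the component predictions
ensemble_pred <- as.numeric(P %*% w)
```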
Assessing the model itself is a second validation point. However, we have used the training data set to (i) determine the best approaches for text pre-processing, (ii) tune individual models, and (iii) create ensemble weights for the final model. Reusing the training data in this way—even with techniques like cross validation—risks overfitting.
Therefore, we used our model to predict the scores of the 500 documents held back for model validation (Group C). The RMSE for this set was 0.538, which compares favorably to the RMSE of the best single learning algorithm, 0.604, achieved by the support vector machine using the doc2vec matrix.Footnote 25 The Pearson’s correlation coefficient between the crowdsourced scores and the ensemble scores is 0.757 (see Figure D6). In all, these results suggest that the ensemble learner can accurately, if imperfectly, predict crowdsourced labels from textual features.
In addition, we examined the most frequent word stems representing positive and negative tones, obtained through the following procedure. First, we selected paragraphs whose scores fell in the top quartile (most positive statements) and the bottom quartile (most negative statements) of the entire corpus, excluding the scored set,Footnote 26 and extracted the 300 most frequent word stems from each side.Footnote 27 Then, we excluded 178 overlapping features to create a list of the most frequent and mutually exclusive stems.
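On toy data, this quartile-and-set-difference procedure looks roughly as follows; the documents and scores are invented, and quanteda's topfeatures() is simply one convenient way to extract frequent stems.

```r
library(quanteda)

# Toy documents and predicted tone scores (hypothetical values)
txt <- c("thank you for your distinguished service and leadership",
         "i appreciate your experience and am proud to support you",
         "the report is not correct and the numbers do not add up",
         "this rule violates the constitution and the standard the court set")
scores <- c(0.9, 0.8, -0.7, -0.9)

d <- dfm(tokens_wordstem(tokens(txt, remove_punct = TRUE)))

top_pos <- names(topfeatures(d[scores >= quantile(scores, 0.75), ], 300))
top_neg <- names(topfeatures(d[scores <= quantile(scores, 0.25), ], 300))

# Keep only stems that are frequent on one side but not the other
setdiff(top_pos, top_neg)
setdiff(top_neg, top_pos)
```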
We present the top 50 words from each category in Table D5 in the Online Supplementary Material and find the following patterns: positive statements feature expressions of appreciation (e.g., “thank-much”, “appreci”), greetings (e.g., “pleas”, “honor”), endorsement (e.g., “proud”, “friend”) and, most importantly, compliments on nominees’ ability (e.g., “experi”, “energi”, “distinguish”, “leadership”) or work-related experience (e.g., “ambassador”, “develop”, “director”, “manag”, “univers”); in contrast, negative statements tend to reference problems (e.g., “problem”, “matter”), personal views (e.g., “agre”, “opinion”, “view”), money (e.g., “$”, “money”), fact-checks (e.g., “correct”, “report”, “percent”, “quote”), and rules (e.g., “constitut”, “rule”, “regul”, “standard”, “suprem-court”). Given the context, these results give us further confidence that the model is leveraging appropriate textual features to measure the latent trait.
4.5. Final output
Model validation provides evidence that our learner can accurately predict the labels, and we took steps earlier to validate the labels themselves. However, we can further assess the validity of the final measure itself.
To assess the convergent validity of the measure, we turn to the 100 paragraphs that we scored on a five-point scale but held back from all of the previous steps (Group D). The correlation coefficient between the two measurements is 0.743, suggesting that the expert ratings and the ensemble predictions are picking up the same latent trait (see Figure D7).
To assess the face validity of the measure, Table 3 provides five examples of scored statements along with the five-point expert coding and the scores generated by the ensemble prediction. These were randomly selected to ensure that one document from each of the five categories was included. In these examples, the two scores track each other closely: statements scoring lower convey a more negative tone than those scoring higher, which supports the face validity of our ensemble prediction scores.
Table 3. Five random sample statements

Finally, we check predictive validity by analyzing whether members’ speaking tone in Senate confirmation hearings correlates with institutional factors in theoretically expected directions. First, we expect members’ speaking tone to become more negative over time due to the increasing level of partisan polarization (McCarty, 2019) and intensified partisan conflict in Congress (Lee, 2016). Second, members’ questions asked of appointees across party lines should be more negative than questions asked of appointees from their own party. Third, we expect tone to be more negative in judicial hearings than in nominations to the bureaucracy.Footnote 28
To facilitate substantive interpretation of changes in members’ speaking tone, we rescaled the ensemble prediction scores to range from 0 to 100. The rescaled measure has a mean of 45.08 and a standard deviation of 15.58. Using this measure, we find support for most of our expectations. First, Figure 3a presents the average tone of members in each congress with 95% confidence intervals. We can see a slight downward trend, especially in the 115th Congress, the first two years of the Trump administration. However, the pattern is not as strong as expected.

Figure 3. Changes in members’ tone (a) By congress. (b) By appointment type.
Second, we analyze whether the same member tends to be tougher on judicial nominees than on bureaucratic nominees. We include only the statements made by those who participated in confirmation hearings for both types of appointments, which totals 79,158 statements made by 161 senators. We then compare the average tone of member statements made in bureaucratic nomination hearings with those from judicial nomination hearings. These statistics are presented in Figure 3b with 95% confidence intervals. As expected, senators tend to speak more negatively to judicial appointees (43.557) than to bureaucratic nominees (45.782).
Third, we calculate the average tone for Democrats and Republicans, respectively, for each presidential administration.Footnote 29 Figure 4 shows that senators are generally more negative toward those nominated by presidents from the opposite party. The partisan gap is visible in all four administrations. It is interesting to note that, first, the partisan gap grows over time, and second, the tones of both parties become somewhat more negative across administrations. Both findings are consistent with the intensified partisan conflict reported in the literature (Lee, 2016).

Figure 4. Tone by party and administration.
One potential question is whether taking these steps actually matters for the quality of the measure. In part, this question is tangential to our argument, in that our goal is not merely to build superior measures but to communicate the quality of the measure to readers transparently and robustly. Nonetheless, it is informative to compare our measure to alternative strategies. Thus, in Section E, we pursue three alternative strategies—a dictionary-based sentiment measure, a pre-trained sentiment classifier, and our own method with various shortcuts—and show that the full measurement procedure outperforms these alternatives.
5. Considerations for large language models
Before concluding, we consider how our recommendations apply to large language models (LLMs). While the use of LLMs reduces reliance on labeled data, the “black box” nature of LLMs heightens the need for clear disclosure and rigorous validation. Thus, our framework remains essential, though certain aspects require adaptation.
Two key scenarios illustrate this. For small corpora, LLMs can generate labels for the entire dataset, reducing the role of supervised learning. For large corpora, full LLM processing may be computationally costly, making hybrid approaches more practical. In such cases, LLMs can generate high-quality training data for traditional supervised models, necessitating careful documentation of their role in label creation.
In both scenarios, LLM-generated labels require methodological transparency. Researchers must document prompting strategies by reporting the exact prompts used, providing example outputs, and systematically testing prompt effectiveness. Additional techniques, such as varying prompts to test stability or comparing results across different LLMs or across time to account for model updates, may be necessary. Researchers should also document the version of the model and parameter settings while acknowledging LLM limitations, such as biases and replication challenges.
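One way to keep such records is sketched below: the exact prompt text, model identifier, and parameters are stored alongside each label, and labels are compared across prompt variants. The query_llm() function is a stub standing in for whichever API client a researcher actually uses, and all names and values are hypothetical.

```r
# Stub standing in for a real API call; replace with the provider's client of choice
query_llm <- function(prompt, model, temperature) {
  sample(c("positive", "negative", "neutral"), 1)
}

prompt_variants <- c(
  v1 = "Classify the tone of this Senate statement as positive, negative, or neutral: %s",
  v2 = "Is the senator's tone toward the nominee positive, negative, or neutral? %s"
)
doc <- "I appreciate your distinguished record of public service."

records <- lapply(names(prompt_variants), function(v) {
  data.frame(prompt_version = v,
             prompt         = prompt_variants[[v]],
             model          = "model-name-and-version",   # document the exact model used
             temperature    = 0,
             label          = query_llm(sprintf(prompt_variants[[v]], doc),
                                        model = "model-name-and-version",
                                        temperature = 0))
})
do.call(rbind, records)   # compare labels across prompt variants for stability
```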
As with human-generated labels, LLM labels should undergo rigorous validation processes including face validity checks and comparison with expert coding to ensure alignment with intended concepts. Researchers should maintain held-out validation datasets to assess label consistency and alignment with human judgments, ensuring an out-of-sample test of the prompt and model. As these methods continue to advance, establishing rigorous standards for documentation and validation will be critical to ensuring scientific credibility.
6. Conclusion
Social scientists increasingly use supervised learning to study latent concepts in large data sets. Validating new measurements has long been crucial, yet our review of recent articles using supervised learning finds inconsistent reporting and validation standards. To address this, we propose a framework that identifies key decision points to be reported and multiple stages that require validation—namely, the label, the prediction model, and the final measure. While not every process must follow all steps, our guidelines establish general standards for transparency and reliability in supervised learning-based measurement.
Our framework can easily be extended to other types of data. Importantly, while our specific example involves a continuous measure, the general guidance we provide works equally well for categorical outcomes, with alternative predictive performance metrics suited to categorical data. In addition, our guideline extends to research using non-text data, such as images, video, and audio. While the specific steps will differ across these domains, researchers face similar underlying issues, such as how to choose subsets, acquire labels, pre-process inputs, train models, and validate results.
As computational tools evolve, the core challenge remains: researchers must clearly communicate their measurement procedures and validate results. Our proposed standards are not restrictive but establish a shared vocabulary and baseline expectations for assessing new measures. Just as reporting standards exist for surveys and experimental designs, clear guidelines for supervised learning will support scientific progress. Our framework lays this foundation, adapting to rapid technological advancements while emphasizing transparency and validation. As new tools emerge, maintaining these commitments will be essential.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2025.10042. Replication material is available at https://doi.org/10.7910/DVN/AFBW80.