Introduction
Political speech is enormously varied. While political speech is often tedious and boring, some speeches are enthralling and memorable. Whether a speech is dull or captivating depends in no small part on its delivery. Yet, even though the delivery is key in determining how a speech resonates with the public, we lack an understanding of when and why political actors make passionate appeals. In an effort to help fill this gap, this paper focuses on the non-verbal characteristics of political speech, which have rarely been the subject of systematic examination. This gap is due to measurement challenges that researchers face when analyzing non-verbal speech characteristics. Key facets of the delivery cannot be studied using the written record but can only be gauged from video footage. While political scientists have developed a rich toolbox for analyzing the textual features of political speech, the field lacks tools for capturing the non-verbal characteristics of political speech.
To overcome this limitation, we adopt methodological innovations from the field of computer vision for analyzing video data on a large scale. Building on these innovations, we develop and apply an automated video analysis model to analyze video recordings of political speeches. Specifically, we train a convolutional neural network (CNN) to analyze video footage from the US House of Representatives. The CNN is trained to detect gesturing, facial expressions, and pitch to gauge the emphasis in legislative speech.
To understand variation in speech delivery, we develop the argument that legislators adopt particular delivery styles to signal to their constituents. Scholars have long viewed political speeches as signaling tools (Mayhew 1974). Yet, the utility of such signals depends on whether they reach their intended audience. As most political speeches go unnoticed, we argue that legislators rely on emphatic appeals to make their signaling efforts more visible. Particularly in the current media environment, it is imperative that legislators deliver good soundbites to make it past the media gatekeepers or to go viral on social media (Esser 2008; Larsson 2020; Negrine and Lilleker 2002; Strömbäck 2008). Legislators are aware of this bottleneck and they make fiery appeals when they want to signal their positions. Whether legislators try to send such a signal depends on public opinion. When public opinion is aligned with their policy stance, we expect legislators to make an effort to highlight their position. Thus, in line with existing work on the responsiveness of political speech to public opinion (Bäck and Debus 2018; Baumann et al. 2015; Hill and Hurley 2002), we argue that legislators are not only mindful of public opinion in what they say but also in how they say it.
To assess the effect of public opinion on non-verbal speech characteristics, we link the speech emphasis with survey data. We use multilevel regression and post-stratification, as well as Bayesian additive regression trees and post-stratification to estimate district preferences on a series of bills from the 111th to the 115th Congress (2009–18) using data from the Cooperative Congressional Election Study (Bisbee 2019; Warshaw and Rodden 2012). The results provide support for the notion that legislators employ emphatic appeals to signal their policy positions when they are aligned with constituency opinion.
The findings have important implications for our understanding of legislative speech. Our study is one of few contributions that consider the non-verbal characteristics of legislative speech. In addition to showing that non-verbal speech characteristics contain valuable cues for political research, we highlight how these characteristics are shaped by strategic considerations. Among others, these findings are relevant for researchers trying to gauge the substance of political conflict from speeches (Lauderdale and Herzog 2016; Monroe et al. 2008; Proksch and Slapin 2009). Incorporating the non-verbal characteristics into these efforts can generate novel insights as speech emphasis can help distinguish key policy statements from everyday speech.
Moving Beyond the Textual Features of Political Speech
Political speech is an area of intense research. Particularly the digitization of parliamentary records has helped expand our understanding of the use (Maltzman and Sigelman 1996; Morris 2001; Proksch and Slapin 2012) and substance of parliamentary speech (Hill and Hurley 2002; Morris 2001; Quinn et al. 2010). Despite the undeniable value of these efforts, they are subject to limitations. Efforts to categorize political speech have almost exclusively focused on textual features. While the text of speeches is sufficient for many research questions, political speech has important dimensions that are difficult to capture based on the textual features alone. Key among the characteristics that are typically disregarded in the analysis of speech is the delivery. Speeches are not generally given for the written record. They are a form of political communication where the delivery is central to their intent and effects. While some of the non-textual features may shine through in the written record, much will be lost in the transcription. For example, comparing the written record with video recordings of speeches in the Canadian House of Commons, Cochrane et al. (2022) show that emotional arousal cannot be extracted from the text alone. Succinctly put, researchers have learned a lot about what legislators say but little about how they say it.
Based on the idea that delivery matters for understanding political speech, some contributions have attempted to quantify the non-textual aspects of political speech (Banning and Coleman 2009; Bucy 2016; Wasike 2019) and how they shape perceptions of the speaker (Burgoon et al. 1990; Koppensteiner and Grammer 2010; Masters and Sullivan 1989). These efforts have been constrained by the difficulty and labor-intensity of manually coding speech recordings. One promising way forward for this research is to build on the recent advances for the automated analysis of audio and video data and to apply these innovations to the ever more widely available digitized recordings of legislative speech.
A nascent literature has begun to employ these tools for researching the non-verbal characteristics of legislative speech and other recordings of political interest. While there are several studies focusing on audio recordings (Dietrich, Hayes, and O’Brien 2019; Dietrich, Enos, and Sen 2019; Knox and Lucas 2021; Rittmann 2024), applications studying video recordings are few and far between. For example, Dietrich (2021) uses video recordings from the US House of Representatives to analyze political polarization. Studying plenary shots, Dietrich finds that legislators have become less likely to mingle across party lines on the House floor as polarization has gone up. Boussalis and Coan (2021) and Boussalis et al. (2021) use computer vision to extract facial expressions of candidates during televised election debates in the United States and Germany, and find that candidates’ emotive displays affect viewers’ evaluations. Finally, Neumann et al. (2022) analyze politicians’ body language in televised ads in the United States, showing that men use more assertive gestures than women.
While existing studies constitute valuable efforts to move beyond the textual features of political speech, the current research agenda using audio and video data is fairly narrow. Due to the novelty of the data and tools for studying digitized audio and video data, the research is heavily invested in validation efforts and in exploring descriptive relationships between actor characteristics and non-verbal political behavior. What is lacking are systematic efforts to situate the new measures in conventional research programs. Indeed, the fact that previous contributions have found substantial variation in non-verbal communication underscores the need for research aimed at explaining variation in the non-verbal characteristics of legislative speech. To this end, we develop and test a theoretical account for emphatic legislative speech.
Signaling Through Emphatic Legislative Speech
The starting point for our theoretical account of emphatic legislative speech is that we conceive of speeches as signals. The signaling function of legislative speech is well established in research on political communication (see, for example, Grimmer 2013; Highton and Rocca 2005; Hill and Hurley 2002), yet there is a notable gap in existing accounts of speech signaling. While there is little doubt that legislators are mindful of public preferences when speaking in public, it has often remained unacknowledged how unlikely it is that such signals will reach their intended audience.
The motivating idea for this contribution is that legislators are aware that the vast majority of speech signals will go unnoticed by the public, which is why legislators can and do make efforts to make their contributions more visible. Particularly in the current media environment, legislators can rely on a push strategy by posting their speeches on their websites or on social media. Legislators can also pursue a pull strategy by trying to create ‘broadcastable’ moments, hoping that their contributions get picked up by traditional or social media. In practice, push and pull strategies will go hand in hand in that legislators try to create good soundbites, which they then post on social media in the hopes that their messages might go viral.
In trying to create good soundbites, legislators can rely on a host of rhetorical strategies that have been elaborated since antiquity. While captivating oratory can never be separated from the substance of what is being said, it is never a mere textual matter either. Instead, there are a great number of non-verbal characteristics that distinguish a good speech from a bad one. This might cover physical aspects such as gesturing, pose, and body movement, as well as auditory features such as pace, pitch, and volume.
We expect that legislators deliberately rely on these techniques to improve the odds that their signal reaches its intended audience. Notably, we are not interested in whether a signal is actually perceived but whether legislators predictably vary their speech delivery, suggesting a deliberate use of these techniques for strategic ends. In this sense, we conceptualize speech delivery as a deliberate effort on the part of legislators rather than an unconscious indicator of legislators’ true beliefs. To be sure, conceptualizing speech delivery as deliberate should not be equated with insincere. It is easily conceivable that legislators often feel strongly about an issue, resulting in a forceful delivery. Yet, the notion of deliberate emphasis suggests that, for the most part, professional politicians can choose to hide their true feelings when giving a speech if they consider doing so politically advantageous. With regard to our theoretical interest, then, we assume that legislators can choose to send a signal by way of an emphatic delivery.
There is tentative evidence that emphatic and emotional appeals are in fact more likely to be perceived by the public. Given the difficulty of systematically quantifying the emphasis in political speech (Cochrane et al. 2022), there is little direct evidence on this question. The most direct study linking speech emphasis and media visibility is presented by Dietrich et al. (2018). Analyzing audio data from floor speeches in the US House, the authors show that more emotionally charged speeches are more likely to be broadcast and receive media coverage. Beyond the direct evidence, there are various studies on public engagement with political messages, which consistently show that greater emotional intensity predicts the success of political messages on social media (Brady et al. 2017; Heiss et al. 2019; Nave et al. 2018; Peeters et al. 2023), as well as showing that legislators’ rhetorical skills and emotional appeals predict their visibility in traditional media (Amsalem et al. 2017; Lupacheva and Mölder 2024; Maier and Nai 2020; Sheafer 2001, 2008; Sheafer and Wolfsfeld 2004; Wolfsfeld and Sheafer 2006).
For a first test of our new measure of legislative speech emphasis, we rely on one of the most well-established context factors in research on legislative behavior – the effect of public preferences. We expect that legislators only choose an emphatic delivery when their preferences are aligned with the preferences of their electorate. Only under conditions of alignment should we expect legislators to go out of their way to try to send a signal about where they stand politically.Footnote 1
In formulating this expectation, we are able to add nuance to research on legislative signaling. Arguably the most politically consequential signal that legislators can send is their plenary vote. While there is ample evidence that public preferences shape legislators’ voting record, legislators often find themselves voting against their district, either because of strongly held personal beliefs or, more commonly, because of pressures from the party leadership, where the pressure to fall in line with the party should be especially strong in the current climate of political polarization. Consequently, while legislators may often find themselves voting against their district, there is little reason to expect legislators to advertise that fact to their electorate by delivering an emphatic speech.
Even though there is little reason to expect legislators to emphatically highlight a vote against the district majority, one might wonder whether legislators with unpopular positions would not be better off not taking the floor at all. While trying to fly under the radar is not an unreasonable strategy, it can also prove dangerous when an unpopular vote comes to light. Therefore, legislators who find themselves voting against their district may prefer to take the floor to explain their vote. At the same time, it would rarely be wise to call attention to an unpopular stance by creating a good soundbite. We can therefore summarize that legislators whose vote is aligned with the preferences of their district are more likely to give an emphatic appeal than legislators with an unpopular stance. As an empirical matter, this expectation also helps distinguish between a deliberate and an unconscious account of speech delivery. If public preferences systematically shape legislators’ speech delivery, this is unlikely to result from legislators’ true beliefs that accidentally shine through in their delivery.
Research Design
Are legislators more likely to deliver emphatic speeches when district preferences on a bill align with their vote? To test this proposition, we study debates on twenty-five pieces of legislation in the 111th–115th US House of Representatives (2009–18). The sample was selected using four criteria. Bills were selected if they were politically salient, if public opinion data on the bills was available, if there was partisan conflict, and if public opinion towards the bill varied between congressional districts. To select the sample, we compiled a list of survey items in the Cooperative Congressional Election Study (CCES) between 2010 and 2018 where respondents were asked to indicate their preferences on specific pieces of legislation. We then matched these questions to bills in the House of Representatives. These bills overlap to a large extent with votes that were classified as ‘key votes’ by Congressional Quarterly and cover a wide range of domestic and foreign policy issues (Ansolabehere and Jones 2010). From this sample, we discarded bills that were passed without partisan conflictFootnote 2 and bills without variation of district-level opinion. The restriction to partisan votes is plausible as voters are more likely to hold or be able to form preferences on important and controversial issues. For the same reason, signaling is a more promising strategy on important and controversial issues, as speeches on irrelevant or undisputed bills are unlikely to be observed by the public. As a practical matter, more speeches are delivered on important and partisan bills.Footnote 3 Table 1 lists the twenty-five bills in the resulting sample.
Table 1. Summary statistics on the bills

Note: Speakers refers to the number of speakers during all debates on a bill. House vote presents the result of the final vote on the bill. Mean provides the average emphasis scores across all speeches on a bill, s.d. is the associated standard deviation.
In the remainder of this section, we first introduce the dependent variable, the emphasis in legislative speech, and how it can be automatically gauged from plenary video recordings. Next, we discuss the estimation of district preferences on the bills as the independent variable.
Measuring Emphasis in Legislative Speech Using Automated Video Analysis
To study the emphasis in legislative speech, we analyze video recordings of key debates in the House of Representatives. The video recordings of the debates were compiled from the online archives of the House. The sample contains video recordings of all debates on the twenty-five pieces of legislation. We manually discard irrelevant sequences to ensure that we only analyze footage where the camera fully captures the speaker.Footnote 4 This results in seventy-seven hours of video footage comprising 2,341 speeches by 543 legislators. Table A2 in the Online Appendix lists the debates and the pieces of legislation.
We employ computer vision to measure the speech emphasis. Specifically, we generalize manual annotations based on a set of training videos to all videos in the sample. We start by drawing an additional sample of 245 speeches by 116 legislators on 37 bills from the 115th Congress as training/test data. We manually selected 245 speeches rather than choosing them at random to ensure sufficient variation in emphasis across our training and test data. This approach allowed us to incorporate a mix of high-, mid-, and low-emphasis speeches in both datasets. The resulting videos were split into 184 training and 61 test videos. Four trained coders were tasked with annotating the emphasis in the speeches using a seven-point Likert scale, ranging from −3 (low emphasis) to +3 (high emphasis), for every non-overlapping two-second segment.Footnote 5
Every video was annotated by two randomly selected coders to better judge the emphasis in the videos and to evaluate inter-rater agreement. As continuous annotation of video data is subject to different reaction times and mental processing speeds, annotations can move out of sync. Therefore, we align the annotation sequences by the two coders using the mean absolute error distance. This alignment shifts values by at most two seconds, that is, by one segment. Table 2 summarizes the key values of the manually annotated dataset. On average, the two annotators deviate by less than 0.5 scale points based on the mean absolute error across all two-second segments in the training and test data. Additionally, we report Lin’s concordance correlation coefficient and Pearson’s correlation coefficient as common measures for inter-rater agreement.Footnote 6
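The alignment step can be illustrated with a minimal Python sketch. It assumes that each coder's annotations are stored as a list of scores, one per two-second segment, and that alignment means choosing the shift of at most one segment that minimizes the mean absolute error over the overlapping segments. The function names and the handling of the non-overlapping ends are hypothetical; the text does not specify the exact procedure.

```python
def mean_absolute_error(a, b):
    """Mean absolute difference between two equally long score lists."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equally long")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def best_shift(coder_a, coder_b, max_shift=1):
    """Return (shift, mae) for the shift that minimizes the MAE.

    A shift of +1 means coder_b reacts one segment later than coder_a,
    so coder_b[i + 1] is compared with coder_a[i]. Only overlapping
    segments enter the error (an assumption made for this sketch).
    """
    best = None
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, b = coder_a[:len(coder_a) - shift], coder_b[shift:]
        else:
            a, b = coder_a[-shift:], coder_b[:len(coder_b) + shift]
        n = min(len(a), len(b))
        err = mean_absolute_error(a[:n], b[:n])
        # Keep the first shift that attains the lowest error.
        if best is None or err < best[1]:
            best = (shift, err)
    return best


# Toy annotations: coder B follows coder A with a lag of one segment.
coder_a = [0, 1, 3, 2, 0, -1]
coder_b = [0, 0, 1, 3, 2, 0]
shift, error = best_shift(coder_a, coder_b)
```

In this toy example the unshifted sequences disagree on most segments, whereas shifting coder B by one segment removes the disagreement entirely.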
Table 2. Summary metrics for the annotated data set and model evaluation

*Note: predictions drawn from a clipped standard normal distribution, 1,000 runs.
Predicting speech emphasis using a convolutional neural network
To estimate emphasis scores for speeches outside the manually annotated set, we use the training data to train a multimodal convolutional neural network using audio and video inputs. The goal of the network is to assign emphasis scores for each two-second segment. As context information is useful for predicting the current emphasis state of a speaker, we include the surrounding two-second segments for the prediction. Thus, the model takes an input of six seconds of audio and video data for each two-second segment prediction. Before feeding the data into the network, we perform a series of preprocessing steps which we describe in Online Appendix C.
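The context logic can be sketched in a few lines of Python. The sketch assumes one preceding and one following two-second segment as context, which yields the six-second input described above. How the first and last segment of a speech are handled is not stated here, so clamping to the nearest available segment is purely an assumption, and `context_windows` is a hypothetical helper name.

```python
def context_windows(num_segments, context=1):
    """List, for every segment, the indices of the segments fed to the model.

    With context=1 each window holds three two-second segments, that is,
    six seconds of input per prediction. Indices beyond the speech are
    clamped to the nearest valid segment (an assumption of this sketch).
    """
    windows = []
    for i in range(num_segments):
        window = [
            min(max(j, 0), num_segments - 1)
            for j in range(i - context, i + context + 1)
        ]
        windows.append(window)
    return windows


# An eight-second speech has four segments and therefore four windows.
windows = context_windows(4)
```

Each inner list names the segments whose audio and video would be concatenated before being passed to the network for the prediction of the middle segment.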
To predict the speech emphasis, we employ a convolutional neural network.Footnote 7 CNNs typically comprise two stages: feature learning and prediction. In the first stage, feature learning, the convolutional base of the model learns hierarchies of modular patterns in the input data. These features are represented in feature vectors that constitute the output of the convolutional base. In the second stage, prediction, this feature vector is fed into a second neural network, which uses the features from the first stage to predict outcome values, in our case, emphasis scores. If trained on a large enough dataset, features learned in the convolutional base are sufficiently generic to be useful for a wide variety of classification tasks. Therefore, especially in cases with small training data, pre-trained networks are commonly used for feature extraction and have proven to be highly effective (Carreira and Zisserman 2017; Chollet and Allaire 2018).
As our manual training data is limited, we use a pre-trained network to extract features for the video input. Specifically, we use a state-of-the-art pseudo-3D-Resnet CNN (Qiu et al. 2017). This network is pre-trained on the Kinetics data set, which is commonly used for human action recognition (Kay et al. 2017). The network takes 299 × 299 pixel images as input and generates a 2,048-dimensional feature vector. For the audio input, we use a Soundnet-like subnetwork (Aytar et al. 2016). This network takes the audio input and generates a 512-dimensional feature vector.
After passing the video and audio inputs through these two networks in the feature learning stage, we obtain two feature vectors that summarize the video and audio inputs. In the next step, we combine both vectors and pass them to two fully connected layers. The final layer produces values in the [−1, +1] range. To match the output to the original emphasis scale, we multiply the predicted values by 3 to obtain scores ranging from −3 to +3. Figure 1 visualizes the network architecture.
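The prediction stage can be sketched in plain Python as follows. The sketch is illustrative only: it assumes that the two feature vectors are concatenated, that the hidden layer uses a ReLU activation, and that the final layer uses a tanh activation to reach the [−1, +1] range. The hidden layer size, the random weights, and all function names are hypothetical; a real implementation would use a deep learning library and trained weights.

```python
import math
import random

VIDEO_DIM, AUDIO_DIM = 2048, 512   # feature sizes named in the text
HIDDEN_DIM = 16                    # hypothetical hidden layer size


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def predict_emphasis(video_vec, audio_vec, params):
    """Fuse both feature vectors and map them to the -3..+3 emphasis scale."""
    fused = list(video_vec) + list(audio_vec)        # concatenation (assumed)
    hidden = [max(0.0, h) for h in dense(fused, params["w1"], params["b1"])]
    (raw,) = dense(hidden, params["w2"], params["b2"])
    return 3.0 * math.tanh(raw)                      # tanh in [-1, 1], times 3


def random_params(seed=0):
    """Untrained placeholder weights, used only to make the sketch runnable."""
    rng = random.Random(seed)
    in_dim = VIDEO_DIM + AUDIO_DIM
    return {
        "w1": [[rng.gauss(0, 0.01) for _ in range(in_dim)]
               for _ in range(HIDDEN_DIM)],
        "b1": [0.0] * HIDDEN_DIM,
        "w2": [[rng.gauss(0, 0.1) for _ in range(HIDDEN_DIM)]],
        "b2": [0.0],
    }


rng = random.Random(1)
video_features = [rng.random() for _ in range(VIDEO_DIM)]
audio_features = [rng.random() for _ in range(AUDIO_DIM)]
score = predict_emphasis(video_features, audio_features, random_params())
```

Because the output activation is bounded, the rescaled score can never leave the range of the annotation scale, whatever the input features are.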

Figure 1. Convolutional neural network architecture.
To train the model, we use a mean absolute error loss function. Training minimizes the distance between the values predicted by the model and the mean emphasis scores provided by the human annotators. To prevent the neural network from overfitting, we add dropout to the fully connected layers in the second stage (Srivastava et al. 2014).Footnote 8 We use the Adam algorithm to optimize the model’s parameters (Kingma and Ba 2014).
Applying the model to footage of key vote debates
We apply the trained model to the video footage from all debates in our sample. The network predicts emphasis scores for each two-second sequence. For example, a one-minute speech contains thirty consecutive emphasis scores. To generate one emphasis score per speech, it is necessary to aggregate the individual scores. The simplest approach would be to calculate the average emphasis of each speech. Such an approach would ignore that speeches differ in length. This means that a multi-minute speech with thirty seconds of intense delivery would score lower than a one-minute speech with the same sequence. In line with the argument that legislators attempt to signal their issue positions by giving passionate speeches in the hopes of being amplified by traditional or social media, it is sensible to focus on shorter sequences within speeches. Legislators are aware that only short sequences of their speeches may be picked up and broadcast to the public. Therefore, it is sufficient to deliver a short but high-intensity appeal as part of a longer speech, such that short and long speeches with the same high-intensity sequence should score the same. Consequently, we select the thirty-second sequence with the highest within-speech average emphasis to score the speeches.Footnote 9 For the same reason, if a legislator delivered more than one speech on a bill, we select the speech with the highest emphasis score for the analysis. We explain this aggregation procedure in more detail in Online Appendix D.
Table 1 provides summary statistics for the resulting data. The number of legislators who delivered speeches on a bill ranges from 13 to 231. Mean emphasis scores range from −0.11 (Kate’s Law) to 0.72 (End Don’t Ask Don’t Tell Act). Figure 2 provides additional information on the distribution of the emphasis scores across all debates, which range from −1.5 to 2.2. Thus, the distribution does not reach the extremes of the emphasis scale, running from −3 to +3. This is unsurprising as we average over 30-second segments. The distribution can be characterized as approximately normal with a mean of 0.29 and a standard deviation of 0.70.

Figure 2. Visualization of the emphasis scores and their distribution.
Note: Each line in the upper panel depicts the estimated speech emphasis over the course of the thirty-second sequences. The three highlighted sequences depict the emphasis scores of the speeches with the highest and lowest average emphasis scores (by Rosa L. DeLauro and John Conyers, Jr.) and the speech with the highest within-speech variance (by John Lewis). The video frames give an impression of how increased levels of gesturing and facial expression are linked to higher estimated emphasis scores. The density curve on the right depicts the distribution of the average emphasis scores as used in the analysis. The two boxplots represent the distributions of average emphasis scores by Democrats (D) and Republicans (R).
Model evaluation
We now turn to the evaluation of the neural network. We present the results of two validation exercises. First, we apply the trained model to the held-out test set of sixty-one videos and compare the model predictions with the human annotations throughout all two-second segments within those speeches. In addition, we apply our aggregation algorithm to all of these speeches using both the model predictions and the human annotations and compare the results. Table 2 shows the results of this comparison, along with the results based on random guessing (drawing values from a clipped standard normal distribution) and zero guessing (predicting an emphasis score of 0). As evaluation criteria, we compute mean absolute errors (MAE), Lin’s concordance correlation coefficient (CCC), and Pearson’s correlation coefficient (PCC).
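For readers less familiar with the three evaluation criteria, the following Python sketch computes them from two lists of scores. It uses the standard definitions with population moments; returning zero for a constant input mirrors the convention applied to zero guessing, and the function names and toy values are hypothetical.

```python
import math


def mae(pred, truth):
    """Mean absolute error between predictions and annotations."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def _moments(x, y):
    """Means, population variances, and population covariance of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return mean_x, mean_y, var_x, var_y, cov


def pearson(x, y):
    """Pearson's correlation coefficient; zero if either input is constant."""
    _, _, var_x, var_y, cov = _moments(x, y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0


def concordance(x, y):
    """Lin's concordance correlation coefficient.

    Unlike Pearson's coefficient, it also penalizes differences in
    location and scale between the two series.
    """
    mean_x, mean_y, var_x, var_y, cov = _moments(x, y)
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    return 2 * cov / denom if denom > 0 else 0.0


# Toy data: predictions that track the annotations but sit one point too high.
annotations = [-1.0, 0.0, 1.0, 2.0]
predictions = [0.0, 1.0, 2.0, 3.0]
```

In the toy data the predictions are perfectly correlated with the annotations, so Pearson's coefficient equals one, while the constant offset of one scale point lowers the concordance coefficient and shows up directly in the MAE.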
Unsurprisingly, the correlations are essentially zero under random guessing. For zero guessing, the correlation is defined as zero as a constant cannot correlate with a variable. For both random and zero guessing, we observe MAE values close to one standard deviation of the underlying label distribution. The neural network achieves considerably lower MAE values. Based on the MAE metric, the machine prediction is 0.552 scale points off from the human annotation when considering two-second segments. This figure is close to the human interrater MAE. For thirty-second aggregates, the MAE decreases further to 0.460, suggesting that aggregation helps average out random noise in the data. Unlike the guessed values, the model predictions show high correlations for both the CCC and the PCC metric. As before, the correlation between the machine prediction and the human annotators is in the same range as the correlation between the human annotators. We thus conclude that the neural network reliably predicts the speech emphasis.
Our second validation is based on coders’ ratings of thirty- rather than two-second segments. We generated 150 speech pairs based on stratified samples from all thirty-second sequences we used in the analysis.Footnote 10 For each pair, we asked two coders to indicate which of the speakers displayed a higher level of emphasis, and compared their ratings with the model ratings. Coder and model ratings are considered aligned when the speech identified as more emphatic by the coder receives a higher emphasis score from the model. If the model assigns nearly identical scores to both speeches, we would expect coder ratings to align with the model predictions in around 50 per cent of the cases. As the difference in model emphasis scores increases, we expect coder ratings to be more likely to align with the model predictions.
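The alignment criterion for the pairwise comparisons can be written down as a short Python sketch. The data layout (a coder choice of "a" or "b" plus the two model scores), the function names, the skipping of exact ties, and the toy values are all assumptions made for illustration.

```python
def model_prefers(score_a, score_b):
    """Return which speech the model rates as more emphatic, or None on a tie."""
    if score_a == score_b:
        return None
    return "a" if score_a > score_b else "b"


def alignment_rate(pairs):
    """Share of pairs where the coder's pick matches the model's ordering.

    `pairs` holds tuples of (coder_choice, score_a, score_b). Pairs with
    identical model scores are skipped (an assumption of this sketch).
    """
    decided = [
        (choice, model_prefers(a, b))
        for choice, a, b in pairs
        if model_prefers(a, b) is not None
    ]
    if not decided:
        raise ValueError("no pairs with distinct model scores")
    return sum(choice == pick for choice, pick in decided) / len(decided)


# Toy pairs, not taken from the data used in the analysis.
toy_pairs = [
    ("a", 1.2, -0.3),  # large score gap, coder and model agree
    ("b", 0.1, 0.9),   # coder and model agree
    ("a", 0.2, 0.3),   # small score gap, coder and model disagree
    ("b", 0.5, 0.5),   # tie in model scores, skipped
]
rate = alignment_rate(toy_pairs)
```

Computing this rate separately for pairs with small and large score differences gives a simple descriptive counterpart to the expectation stated above.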
Naturally, the two coders do not always agree on which of two speeches is more emphatic. Overall, they disagree on twenty-three of the 150 speech pairs. Panel A of Figure 3 plots the probability of agreement between the coders against the difference of predicted emphasis scores by the model. It shows that the probability of the two coders agreeing on the emphasis ordering of a speech pair increases as the speeches are rated more distinct by the model.
Panel B focuses on speech pairs where both coders agree and compares their ratings to the model predictions. As expected, coder and model ratings almost always align when the model predicts a substantial difference in emphasis between the two speeches. However, when the model predicts smaller differences in emphasis, coder ratings are more likely to diverge from the model predictions.
Panel C addresses the fact that the model is trained on speeches from the 115th legislative term, while the analysis includes speeches from the 111th to 115th term. These terms include speakers who were not in the training data, and speaking styles may have evolved, making it more difficult for the model to assess speeches from before the 115th term. To assess whether this is indeed the case, Panel C differentiates between speech pairs that include at least one speech from the 115th legislative term, and that are based on speeches from earlier terms. Indeed, the probability of alignment between coder and model ratings decreases slightly for pairs based on speeches prior to the 115th House. To address this, we include a dummy variable in our analyses, indicating whether a speech is from the 115th term or before.

Figure 3. Model evaluation based on pairwise comparisons.
Note: Predicted probabilities and confidence intervals are based on bivariate logistic models, regressing agreement on the difference in predicted emphasis between two speeches. Panel A shows the predicted probability that the two coders agree on which speaker displays greater emphasis. Agreement increases as the model predicts less similar emphasis levels between the two speeches. Panel B is based on pairs where both coders agree on the more emphatic speech, displaying the predicted probability that their ratings align with the model predictions. Agreement between coders and the model increases as the model identifies greater differences in emphasis between the speeches. Panel C distinguishes between pairs that include at least one speech from the 115th legislative term and those that do not. Disagreement between the model and coder ratings is more likely for pairs without speeches from the 115th legislative term.
While the model demonstrates satisfactory performance in the validation exercises, it is not without error. This raises the question of when and why the model makes incorrect predictions. To explore this, we conducted additional quantitative and qualitative analyses of the variation in prediction errors across speeches in our test dataset. We summarize the key insights from this analysis below and provide further details in Appendix E.
First, the model rarely predicts values above +2 or below −2, suggesting it may be subject to some attenuation bias. Second, our assessment of the speeches with the highest average prediction error suggests that, compared to human annotators, the model is less sensitive to hand movements occurring in front of the body of the speaker and to gestures made with a closed rather than an open hand. In contrast, the model is more responsive to large, clearly visible hand movements. This observation is plausible as open hands are more visible than closed hands, and gestures stand out more clearly against the background when positioned beside the body of the speaker. Third, we observed that the model tends to overestimate the emphasis of speakers who naturally speak with a strong voice. This is understandable, as a naturally strong voice can resemble an emphatic one. The tendency may correlate with the gender of the speaker, which we account for by controlling for gender in the analysis.
Estimating District-Level Bill Preferences
The key independent variable is the extent to which legislators’ votes align with constituency preferences. We draw on survey data from multiple waves of the CCES to estimate district-level preferences. Each survey contains multiple questions on specific bills. Respondents are provided with the title and a short summary of the bill and are asked how they would have voted.Footnote 11 Matching each bill in our sample to a CCES item enables us to estimate district preferences towards the bills.Footnote 12
Despite the large sample size of the CCES, it is not designed to be representative of congressional district populations. Thus, simply disaggregating survey answers to estimate district-level preferences would likely yield biased estimates. To overcome this challenge, we rely on the widely used multilevel regression and post-stratification (MrP) for estimating district preferences (Gelman and Little 1997; Lax and Phillips 2009; Warshaw and Rodden 2012). To assess the reliability of the estimates, we complement the MrP approach with Bayesian additive regression trees and post-stratification (BARP) (Bisbee 2019). We employ MrP and BARP to estimate district-level preferences for the twenty-five bills in our sample. The estimates range from zero to one, where high values indicate high levels of support.Footnote 13
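The post-stratification step shared by both approaches amounts to a population-weighted average of cell-level predictions. The sketch below illustrates only that step and takes the cell predictions as given. The cell counts and predicted support values are hypothetical, and the sketch does not reproduce the multilevel or tree-based models that generate the predictions.

```python
def poststratify(cells):
    """Population-weighted mean of predicted support across demographic cells.

    cells: list of (population_count, predicted_support) tuples for one district
    """
    total = sum(count for count, _ in cells)
    if total <= 0:
        raise ValueError("district has no population in the given cells")
    return sum(count * support for count, support in cells) / total


# Hypothetical district with three demographic cells
district_cells = [(1000, 0.6), (3000, 0.4), (1000, 0.8)]
estimate = poststratify(district_cells)  # weighted support between 0 and 1
```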
In the next step, we match these estimates to the representatives’ voting records on the twenty-five bills. Substantively, we are interested in the extent to which legislators’ votes align with the preferences of their electorate. To compute an alignment score, we code roll call votes as one for legislators who voted in favor of a bill and zero for those who voted against it. We then calculate the absolute difference between legislators’ votes (YES=1, NO=0) and the estimated support for the bill in their districts and subtract this from the maximum distance, 1. We label this variable Vote–District Alignment. Values close to 1 indicate high levels of agreement between legislators and their districts, values close to 0 indicate low levels of agreement. Formally, writing $v_{it}$ for legislator i’s vote on the bill debated in t and $\hat{\theta}_{it}$ for the estimated support for that bill in their district:
$$ x_{it} = 1 - \left| v_{it} - \hat{\theta}_{it} \right| $$
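The alignment score described above can be computed as in the following minimal sketch. The function name `vote_district_alignment` and its argument names are hypothetical; the inputs are a roll call vote and an estimated share of district support between zero and one.

```python
def vote_district_alignment(voted_yes, district_support):
    """Return 1 minus the absolute distance between vote and district support.

    voted_yes: True for a YES vote (coded 1), False for a NO vote (coded 0)
    district_support: estimated share of the district supporting the bill, in [0, 1]
    """
    if not 0.0 <= district_support <= 1.0:
        raise ValueError("district_support must lie between 0 and 1")
    vote = 1.0 if voted_yes else 0.0
    return 1.0 - abs(vote - district_support)


# A NO vote in a district where 35 per cent support the bill (65 per cent oppose)
aligned = vote_district_alignment(False, 0.35)
# A NO vote in a district where 65 per cent support the bill
misaligned = vote_district_alignment(False, 0.65)
```

For a NO vote the score equals the share of the district opposed to the bill, and for a YES vote it equals the share in support.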
We present the distributions of the Vote–District Alignment variable by bill and vote choice in Figure 4. The bright density curves depict the distributions of the Vote–District Alignment for legislators who voted yes, the dark densities depict the distributions for legislators who voted no. For most bills, MrP and BARP result in similar distributions of the alignment between legislator vote and district preference. In most instances, the distribution for legislators who vote yes diverges from that for legislators who vote no. Consider the State Children’s Health Insurance Act (HR 2, 111th) as an example. Although there is variation between the districts, almost all districts were fairly supportive of the bill. Thus, the Vote–District Alignment scores for legislators who voted for the bill are significantly higher than the values for legislators who voted against the bill. This also means that we observe numerous instances where legislators’ votes were misaligned with their districts. Arguably, this finding can be traced back in no small part to the high levels of polarization and the resulting pressure to vote along party lines, such that legislators often find themselves between a rock and a hard place, where they either have to vote against the preferences of their constituents or risk upsetting the party leadership.

Figure 4. Distributions of alignment between legislator votes and district preferences.
District Preferences and Signaling in Legislative Speech
To estimate the effect of district preferences on the emphasis in legislative speech, we proceed in two steps. First, we establish the link between district preferences and speech emphasis by presenting evidence from multilevel models. Next, we increase the complexity of the statistical model to explore the underlying mechanism. Specifically, we investigate whether the association is driven by differences between the delivery styles of legislators from different districts, or by legislators varying their delivery depending on the extent to which their vote is aligned with district preferences. In Online Appendix H.2, we additionally show that our results hold when using a binary measure of vote–district alignment.
We begin by fitting several multilevel models to estimate the effect of district opinion on speech emphasis. The dependent variable is legislators’ speech emphasis. The independent variable of interest is the vote–district alignment, indicating the extent to which legislators’ floor votes are aligned with district preferences. Following the theoretical argument, we expect vote–district alignment to be positively related to speech emphasis. The model incorporates varying intercepts at the debate level to account for the hierarchical structure of the data, where speeches are nested in legislative debates on the different bills.Footnote 14
To adjust for potential confounding, we consider six control variables. We account for legislators’ party affiliation with an indicator variable for Republican legislators. As legislators who vote against the party line may face pressures not to signal this behavior, we include a binary variable that indicates whether a legislator’s vote is in line with the majority of their party. We control for seniority to account for legislators’ experience in delivering speeches. To account for the possibility that ideologically extreme members might deliver more emphatic speeches than moderate legislators, we include the absolute values of legislators’ DW-Nominate scores (Lewis et al. 2020). Finally, we control for gender to account for the possibility that male and female legislators differ in their presentational styles, and we include a variable indicating whether the speech was held in the 115th term to account for potential differences in measurement error. Table 3 presents eight model specifications. Models (1) to (4) depict the results for speeches in opposition to a bill, and models (5) to (8) depict the results for speeches in support of a bill. Models (1), (3), (5), and (7) are based on MrP estimates, while models (2), (4), (6), and (8) are based on BARP estimates. Models (1), (2), (5), and (6) are baseline models without covariate adjustment. Models (3), (4), (7), and (8) include control variables.
Table 3. Multilevel specifications with debate random effects. Parentheses report heteroskedasticity-consistent wild bootstrap standard errors (Modugno and Giannerini 2015; Loy et al. 2023).

Note: The dependent variable is the level of emphasis of a legislative speech. AIC: Akaike Information Criterion; BIC: Bayesian Information Criterion.
The results indicate that legislators alter their delivery style in reaction to public opinion, but only when they rise in opposition to a bill. Models (1) to (4) show a positive relation between vote–district alignment and legislators’ speech emphasis, suggesting that opposing legislators deliver more emphatic speeches when their electorate is also opposed to the bill. While the magnitude of the effect remains relatively stable between the models, the substantive interpretation of the effect is not straightforward. Consider two legislators who both rise in opposition to a bill. In legislator A’s district, 65 per cent of the voters are, like legislator A, opposed to the bill. This amounts to a vote-alignment score of 0.65. In contrast, 65 per cent of the voters in legislator B’s district support the bill, which means that only 35 per cent are aligned with their vote,Footnote 15 leading us to expect that legislator A delivers a more emphatic speech than legislator B. Based on the estimates from model (3), we expect legislator A to deliver a speech that scores 0.39 (s.e. = 0.07) points higher on the emphasis scale compared to legislator B.Footnote 16 This is equivalent to about 0.5 standard deviations of the distribution of speech emphasis among legislators who rise in opposition.
Turning to models (5) to (8) in Table 3, the results suggest that this finding does not generalize to legislators who deliver speeches in support of a bill. The estimated coefficients indicate that when legislators rise in support, they do not deliver their speech more or less emphatically depending on how strongly their district supports the bill. The finding that speeches in support of a bill are less affected by public opinion aligns well with recent research that has highlighted that opposition legislators are more prone to using emotional language (Gennaro and Ash 2022).
Before proceeding with the analysis, we should caution against interpreting the association between district preferences and speech delivery as causal. Establishing causality would require strong theoretical assumptions, especially the absence of confounding variables – assumptions we cannot confidently make given the observational nature of our study. The next section demonstrates that the association holds when considering only within-legislator variation, ensuring that the result is not driven by time-invariant confounders. While this partially addresses concerns about causality, it does not fully resolve them.
Individual Versus Macro-Level Explanation
Having shown evidence for a link between district opinion and speech emphasis among legislators who rise in opposition, we now proceed to test whether the proposed individual-level explanation is driving the findings. It is conceivable that legislators whose preferences are more consistently aligned with those of their constituents might be more likely to signal this alignment to their constituents by giving more emphatic soundbites overall. We are thus interested in differentiating such an explanation at the legislator level from an explanation where legislators are more emphatic when their position is aligned with their district in a particular debate.
We adjust the statistical model to study the isolated effects of both explanations. To that end, we partition the total variation of vote–district alignment into two parts: variation within districts and variation between districts. Within vote–district alignment is defined as the deviation of the vote-alignment on a specific bill from legislators’ average vote-alignment, which reflects the ‘legislator-in-debate’ level explanation. Between vote–district alignment is defined as legislators’ average vote-alignment, which reflects the generic legislator level explanation.
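The partition described above can be illustrated with a short sketch that splits each alignment score into a legislator mean (the between component) and a deviation from that mean (the within component). The function name `within_between` and the example records are hypothetical.

```python
from collections import defaultdict


def within_between(records):
    """Split alignment scores into within and between components.

    records: list of (legislator_id, alignment) tuples, one per speech
    Returns a list of (legislator_id, within, between) tuples in input order.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for legislator, alignment in records:
        totals[legislator] += alignment
        counts[legislator] += 1
    means = {leg: totals[leg] / counts[leg] for leg in totals}
    # within = deviation from the legislator's own mean; between = that mean
    return [(leg, x - means[leg], means[leg]) for leg, x in records]


# Hypothetical data: legislator "A" speaks in two debates, "B" in one
records = [("A", 0.5), ("A", 0.75), ("B", 0.3)]
components = within_between(records)
```

By construction, the within component sums to zero for every legislator, so it carries no information about differences between legislators.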
We specify a within–between random effects (REWB) model (Bell et al. 2019; Bell and Jones 2015) to estimate the effects of both variance components in the same model. The model is specified as follows:
$$ y_{it} = \beta _{0} + \beta _{1W}(x_{it} - \bar x_{i}) + \beta _{2B} \bar x_{i} + \sum _{j=1}^{k} \gamma _{j} z_{ji} + (\upsilon _{i} + \upsilon _{t} + \epsilon _{it}) $$
where $y_{it}$ represents legislator i’s emphasis in debate t on a specific bill. $x_{it}$ is the alignment between legislator i’s vote after debate t and the preference of their district. $\bar{x}_{i}$ is the average alignment between legislator i’s votes and the preference of their district. $\upsilon_{i}$ are random intercepts for legislator i and $\upsilon_{t}$ are random intercepts for debate t. $z_{ji}$ represent the same k individual-level control variables as before.
The model has several properties worth noting. Importantly, the independent variable of interest, vote-alignment, enters the model in two forms: First, in its de-meaned form $(x_{it} - \bar{x}_{i})$, that is, the deviation of the vote-alignment from legislators’ average vote-alignment. The coefficient $\beta_{1W}$ represents the average within effect of vote-alignment, that is, the expected change in a legislator’s speech emphasis caused by variation of preferences within their district. Thus, positive values of $\beta_{1W}$ would constitute evidence for the individual-level explanation, making $\beta_{1W}$ the main coefficient of interest. The coefficient is equivalent to individual fixed effects and is independent of differences in legislators’ delivery styles. Second, the model incorporates legislators’ average vote-alignment ($\bar{x}_{i}$) as a covariate. The coefficient $\beta_{2B}$ represents the average between effect of vote-alignment. This effect captures differences in emphasis between legislators with varying average levels of vote-alignment, that is, legislators with more or less average district support for their floor votes. Thus, positive values of $\beta_{2B}$ would constitute evidence for the macro-level explanation.
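The equivalence between the within coefficient and a fixed-effects estimate can be checked numerically in a deliberately simplified setting. The sketch below assumes no control variables and no debate-level random effects, and the data are hypothetical. Because the de-meaned regressor sums to zero within each legislator, it is orthogonal to the intercept and to the legislator means, so its least-squares slope can be computed on its own.

```python
from collections import defaultdict


def group_means(ids, values):
    """Mean of values for each legislator id."""
    totals, counts = defaultdict(float), defaultdict(int)
    for i, v in zip(ids, values):
        totals[i] += v
        counts[i] += 1
    return {i: totals[i] / counts[i] for i in totals}


def within_slope(ids, x, y):
    """Least-squares slope of y on de-meaned x (the within term, no controls)."""
    x_bar = group_means(ids, x)
    xw = [xi - x_bar[i] for i, xi in zip(ids, x)]
    return sum(a * b for a, b in zip(xw, y)) / sum(a * a for a in xw)


def fixed_effects_slope(ids, x, y):
    """Fixed-effects (within) estimator: de-mean both x and y by legislator."""
    x_bar, y_bar = group_means(ids, x), group_means(ids, y)
    xw = [xi - x_bar[i] for i, xi in zip(ids, x)]
    yw = [yi - y_bar[i] for i, yi in zip(ids, y)]
    return sum(a * b for a, b in zip(xw, yw)) / sum(a * a for a in xw)


# Hypothetical unbalanced panel: legislator "A" speaks three times, "B" twice
ids = ["A", "A", "A", "B", "B"]
x = [0.4, 0.5, 0.6, 0.2, 0.4]   # vote-district alignment
y = [0.1, 0.3, 0.5, 1.0, 1.4]   # speech emphasis
b_within = within_slope(ids, x, y)
b_fe = fixed_effects_slope(ids, x, y)
```

The two slopes coincide even though legislator "B" has a much higher average emphasis, which illustrates why the within coefficient is unaffected by stable differences in delivery style.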
The results in Table 4 provide support for both the individual and the macro-level mechanism. Starting with the individual mechanism, models (1) and (2) show that there is a positive within effect of vote–district alignment on speech emphasis. This result indicates that legislators who rise in opposition vary their delivery depending on how closely their vote is aligned with the preference of their district. Specifically, legislators who rise in opposition deliver more emphatic speeches as their districts become increasingly hostile to the bill. Figure 5 helps to assess the size of this effect. Consider a legislator who rises in opposition in two debates on different bills. In the first debate, 50 per cent of the voters in their district are – like them – opposed to the bill. In the second debate, 75 per cent are opposed, making their vote more aligned with public opinion in their district.Footnote 17 Based on the estimates from model (1), we would expect the legislator to deliver a speech that scores 0.19 points higher on the emphasis scale during the second debate compared to a speech during the first debate. This is equivalent to 0.45 standard deviations of the de-meaned speech emphasis.
Table 4. Within–between multilevel specifications with legislator and debate random effects. Parentheses report heteroskedasticity-consistent wild bootstrap standard errors (Modugno and Giannerini 2015; Loy et al. 2023).

Note: The dependent variable is the level of emphasis of a legislative speech. Parentheses show wild bootstrap standard errors.

Figure 5. First differences and 95 per cent confidence intervals illustrating the expected change of speech emphasis in response to increased alignment between a legislator’s No vote and public opinion in the district.
Note: Wild cluster bootstrap confidence intervals based on model (1) and model (2) in Table 4. The baseline value of the de-meaned district alignment is set to −0.28 (minimum for the MrP estimates), the mean level of vote-alignment is set to the empirical mean (0.53 for MrP, 0.51 for BARP), Republican is set to zero, vote with party is set to 1, seniority is set to its mean (15.7), gender is set to zero, ideology is set to the empirical mean (0.44).
The results also provide evidence for the macro-level mechanism. Both models show a positive between effect of vote–district alignment on speech emphasis (1.28 and 1.16 depending on the public opinion estimate). This indicates that legislators whose opposing vote consistently shows high alignment with their electorate tend to deliver more emphatic speeches compared to legislators with less support for their votes. To assess the substantive meaning of this effect, consider two otherwise similar legislators whose voters differ in their attitudes towards key pieces of legislation, where legislator A enjoys high alignment between her votes and public opinion and B does not. Suppose that the difference in vote alignment between legislators A and B amounts to 25 percentage points on average.Footnote 18 Model (1) predicts that this difference has implications for how legislators A and B present themselves on the floor: On average, legislator A would deliver speeches that score 0.32 points higher on the emphasis scale compared to legislator B. This amounts to 0.55 standard deviations of the mean emphasis score.
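As a worked check of this first difference, assume that all other covariates are held fixed, so that only the legislator’s mean alignment changes. Under the linear specification, the expected change in emphasis is then the between coefficient multiplied by the change in mean alignment. The symbol $\Delta$ is introduced here for illustration only:
$$ \Delta \hat{y} = \hat{\beta}_{2B} \cdot \Delta \bar{x}_{i} = 1.28 \times 0.25 = 0.32 $$
The same calculation with the BARP-based coefficient of 1.16 gives a difference of 0.29 points.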
Taken together, the results from the REWB model on legislators who rise in opposition provide evidence for both the individual-level and the macro-level mechanism. Regarding the individual-level mechanism, legislators deliver more emphatic speeches as their vote becomes more aligned with their district. At the same time, legislators whose votes consistently show high alignment with their districts deliver more emphatic speeches on average.
The results from models (3) and (4) echo the findings from the models in Table 3. They show that this finding does not generalize to legislators who rise in support of a bill. The estimated within and between effects are not distinguishable from zero at conventional levels of statistical significance.
Conclusion
Automated analyses of audio and video data have begun to make their way into political science research (Dietrich 2021; Dietrich, Enos, and Sen 2019; Dietrich, Hayes, and O’Brien 2019; Knox and Lucas 2021; Nyhuis et al. 2021). These techniques promise to bring about significant innovations in a number of research fields by allowing scholars to make better use of the massive amounts of digitized data and to move beyond the narrow focus on digitized political text. In research on legislative speech specifically, incorporating the new tools and data sources enables the systematic study of questions beyond the substance of speeches and a greater appreciation of the non-verbal aspects of political speech.
In this paper, we have built on these nascent efforts to explain variation in the delivery of legislative speech. We have argued that legislators are not only strategic in what they say, but also in how they say it. As legislators are aware that most speeches go all but unnoticed, they make conscious decisions about when to deliver emphatic speeches in order to increase their chances of being featured in the media. Constituency preferences are a key factor in explaining such signaling in legislative speech. As actors with a singular interest in re-election, legislators are only expected to highlight their positions when they align with the preferences of their constituents.
To assess whether the speech delivery is responsive to public opinion, we relied on automated video analyses to measure the extent to which legislators deliver emphatic speeches on twenty-five key bills in the 111th–115th US House of Representatives (2009–18). The analyses have shown consistent effects of constituency opinion on speech delivery. Across different model specifications, legislators rising in opposition to a bill were found to deliver more emphatic speeches, the more their districts are opposed to the measure.
Having shown evidence for the effect of district preferences on the non-verbal characteristics of legislative speech in the US House, one might wonder to what extent the effect generalizes to other legislatures. We would argue that the effect of constituency preferences on legislative speech should be strongest where legislators can expect to benefit the most from a personal vote (cf. Carey and Shugart 1995), such that the US House probably constitutes a most likely case for observing an effect of public preferences on speech signaling. To a somewhat lesser extent, one might expect that legislators are more mindful of their messaging in legislatures where rhetoric is more valued, such that an assembly such as the UK House of Commons is probably characterized by more emphatic appeals than its continental European counterparts (Yildirim 2025; Osnabrügge et al. 2021).
Despite consistent effects across different model specifications, a few limitations should be explicitly addressed. First, we argued that the Cooperative Congressional Election Study is useful for estimating district preferences on Congressional roll call votes and that attitudes towards specific bills are better suited for gauging constituency preferences and their effects on legislative speech than a general ideology measure. It should not be left unmentioned, however, that using these indicators comes at a price. As the CCES only features survey items on key congressional votes, our analysis is restricted to key debates, raising the question whether our findings generalize beyond debates on important bills. On the one hand, there is little reason to expect strategic legislators to go out of their way to signal their position when it clashes with the preferences of their constituents. On the other hand, speeches on inconsequential bills might generally be characterized by fewer emphatic appeals, which could result in fewer differences between speeches of legislators who agree with their constituents and those who do not. Future research could shed light on the question of whether our findings generalize beyond key bills by building on the present efforts and studying a broader sample of bills, while relying on a coarser measure of district ideology. Such research is greatly simplified by the promises of computer vision where trained neural networks can easily be deployed to study speeches on other bills.
Second, the observational nature of our study prevents us from making causal claims about the relationship between district preferences and emphatic speech delivery. This limitation is shared by many studies on the effect of public opinion on elite behavior, as public opinion can rarely be experimentally manipulated. However, future research might identify scenarios where naturally occurring exogenous variation in public opinion offers more credible support for the assumptions required to establish causal links between public opinion and speech delivery (cf. Hager and Hilbig 2020). Relatedly, our research design does not offer insights into how accurately legislators assess public opinion in their district before giving a speech, leaving unanswered questions about the precise mechanism underlying our findings.
Third, future research should also try to link the textual and the non-verbal characteristics of legislative speech more closely in order to gain additional insights into legislative speech. While our study has made first steps towards such an analysis by showing how the non-verbal characteristics are tied to position-taking in speeches, additional research could investigate which specific parts of speeches legislators choose to emphasize and which content features betray a high-energy delivery.
Finally, while our study is among the first to examine the determinants of emphatic legislative speech, it does not distinguish between kinesic and vocalic cues in shaping emphasis. Since kinesic cues are conveyed through video and vocalic cues through audio, future research on non-verbal components of political speech might benefit from disentangling their relative contributions. In particular, methodological investigations into how audio and video modalities influence model performance could provide valuable insights into the distinct informational content carried by gestures and vocal inflection. Such analyses could help researchers prioritize modalities when computational or analytical constraints necessitate focusing on just one modality.
Overall, the study of non-verbal characteristics with emerging computer vision tools holds enormous promise for research on political speech, legislative behavior, and more. The present contribution constitutes one of the first attempts to systematically trace and explain the non-verbal characteristics of legislative speech. In line with previous research, our findings underscore that legislators are conscious and strategic in their use of legislative speech and that such strategy is not exhaustively described by the substantive aspects of legislative speech. To further refine our understanding of speech delivery, we hope that the theoretical and empirical advances presented in this contribution elicit a growing interest in the analysis of non-verbal aspects of political speech.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0007123425100872.
Data Availability statement
Replication data for this article can be found in Harvard Dataverse at: https://doi.org/10.7910/DVN/5EWYTN.
Acknowledgments
We thank Morten Harmening, Felix Münchow, Marie-Lou Sohnius, Sam Känner, and Caro Krömer for excellent research assistance. We are grateful to Thomas Gschwend, Lukas Stoetzer, Christian Arnold, Seo-young Silvia Kim, Patrick Kraft, Marcel Neunhoeffer, Anna Adendorf, Oke Bahnsen, Sean Carey, Franziska Quoß, Viktoriia Semenova, Christoph Steinert, and the participants of the Severyns Ravenholt Seminar in Comparative Politics at the University of Washington, as well as the participants of the American Politics Research Group at the University of North Carolina at Chapel Hill for feedback on earlier versions of this manuscript.
Financial support
This project was supported by the Deutsche Forschungsgemeinschaft and the Sonderforschungsbereich 884. Further support is gratefully acknowledged from the research alliance ForDigital.
Competing interests
None.




