1 Introduction
How can we measure the core characteristics of democracy? This question has animated political science for nearly a quarter of a century (see, e.g., Bush Reference Bush2017; Claassen et al. Reference Claassen2024; Munck and Verkuilen Reference Munck and Verkuilen2002; Przeworski et al. Reference Przeworski, Alvarez, Cheibub and Limongi2000). It has also motivated a number of large-scale data collection projects, including Freedom House and Varieties of Democracy (V-Dem), that are increasingly used by scholars and policymakers to understand trends in democratization. A recent development in this field is the use of surveys to generate disaggregated measures of democracy, where country experts score regimes on a range of theoretically relevant features (Coppedge et al. Reference Coppedge2011). As well as being very resource intensive, these techniques have generated considerable debate about whether bias from human coders exaggerates the extent of democratic backsliding worldwide (Knutsen et al. Reference Knutsen2024; Little and Meng Reference Little and Meng2024; Widmann and Wich Reference Widmann and Wich2022). In this context, some democratization scholars have argued for the development of more “objective” measures of democracy, derived from repeated empirical observations of regime behavior (Little and Meng Reference Little and Meng2024).
We contribute to this debate by showing how news articles can be utilized to develop measures of media freedom, which is a key component of the indices that are used to study and monitor democratization and autocratic backsliding. Specifically, we focus on measuring the media’s ability to report events and opinions that are critical of the political executive—an essential feature of democratic life that appears in all major democracy datasets and is often the most important component of empirical indexes measuring media freedom. This item is particularly important given the changing nature of authoritarianism. With the rise of new types of autocratic governance, there has emerged a new form of media capture and control. Contemporary “informational autocrats” continue to police limits on acceptable political reporting, but also derive benefits from allowing certain forms of media coverage and critical commentary (Egorov, Guriev, and Sonin Reference Egorov, Guriev and Sonin2009; Guriev and Treisman Reference Guriev and Treisman2019; Walker and Orttung Reference Walker and Orttung2014). In these contexts, while some degree of free speech is permitted, reporting on events or opinions that criticize regime leaders or the top of the political power structure often constitutes a red line (Lorentzen Reference Lorentzen2014). It follows that the degree to which this red line is enforced offers a tangible measure of media freedom.
To capture changes in media criticism, we introduce a technique that builds on a recent advance in unsupervised word-embedding approaches: “A la Carte” (ALC) word embeddings (Arora et al. Reference Arora, Li, Liang, Ma and Risteski2018; Khodak et al. Reference Khodak, Saunshi, Liang, Ma, Stewart and Arora2018; Rodriguez, Spirling, and Stewart Reference Rodriguez, Spirling and Stewart2023). Our technique requires no human input or financial investment beyond the collection of media articles and the minimal computational costs of training an embedding layer, and it is applicable to any context with a national news media. It is also more granular and responsive to changes in political context than traditional methods for measuring media freedom, e.g., expert surveys, which typically measure developments in media criticism at the country-year level. To implement our method, we measure the distance in semantic space between a vector of target words, i.e., the names of political leaders or the titles of their offices, and language found in news media connoting either support or opposition. Drawing on both real and synthetic news media, we show how the proximity of our target words to language connoting opposition is interpretable as a robust measure of criticism. This innovation enables us to recover the level of critical news or opinion in the media at units of varying scales (e.g., articles, publications, or countries) and measures of time (e.g., days, weeks, and months), thus providing considerably more flexibility and granularity than the country-year measures that are currently available.
To validate our approach, we first draw on a large corpus of 8.5 million Arabic-language news media published over the period 2008–2019 generated from five countries in the Middle East and North Africa (MENA). This period, which coincides with the 2011 Arab Spring, witnesses democratization processes, sustained anti-regime protests, a military coup, and authoritarian backsliding. During our analysis period, three of the countries in our sample (Algeria, Morocco, and Saudi Arabia) maintained persistent and deeply entrenched autocratic politics, while two (Egypt and Tunisia) experienced democratic transitions. The variation in our cases allows us to determine whether our approach accurately recovers changes in political reporting that follow from structural political change. We demonstrate our approach using local-language media as this is the most relevant when understanding the degree to which print news media can openly report on events and opinions that are critical of regime elites.
The article proceeds in five parts. First, we outline how to construct our media criticism measure, and then compare our scores to V-Dem. As we show, our approach to quantifying criticism of the executive in national news media closely tracks the values recorded in expert surveys during periods of substantive political change. V-Dem performs less well in stable autocracies, missing time-varying changes in the level of media criticism. This suggests that expert surveys may capture large changes in easy-to-recognize cases, but can miss less dramatic developments in authoritarian contexts. Second, we demonstrate with a series of experiments that the technique can recover reliable estimates with sparse data. Third, we demonstrate the utility of media criticism scores derived from news media for both descriptive and causal research. For descriptive research, we demonstrate that changepoint analyses of our media criticism scores recover shifts in media freedom that align with case knowledge. For causal research, we demonstrate that comparison cases can be used to generate credible estimates of the effect of backsliding events such as military coups on media criticism. Fourth, we generate a series of synthetic articles across seven additional languages to demonstrate that the method extends to multiple linguistic domains. A battery of additional checks, including human validation and design-based supervised learning, underscores the validity and robustness of our approach. We conclude with a discussion of how the method may extend to multiple domains beyond media criticism.
2 Media Freedom and Its Measurement
Control of the media constitutes one of the most powerful weapons in the authoritarian arsenal (McMillan and Zoido Reference McMillan and Zoido2004). Media capture by the state leads to censorship of unfavorable news and events, the distortion of facts, and pro-government agenda setting (Field et al. Reference Field, Kliger, Wintner, Pan, Jurafsky and Tsvetkov2018; Woo Reference Woo1996). Research from diverse contexts suggests that media capture has important real-world consequences, including shifting policy attitudes to favor government positions, boosting party membership, increasing the vote share for pro-regime parties, inciting violence against political opponents, stifling collective action, and reducing aggregate political knowledge (Adena et al. Reference Adena, Enikolopov, Petrova, Santarosa and Zhuravskaya2015; Chen and Yang Reference Chen and Yang2019; Enikolopov, Petrova, and Zhuravskaya Reference Enikolopov, Petrova and Zhuravskaya2011; King, Pan, and Roberts Reference King, Pan and Roberts2013; Yanagizawa-Drott Reference Yanagizawa-Drott2014).
Many modern authoritarian regimes employ both direct and indirect means of control (Guriev and Treisman Reference Guriev and Treisman2019). These measures include preventing outlets from reporting on critical content by arresting journalists and editors, prosecuting media owners under the guise of national security laws, conducting punitive tax audits, manipulating government advertising, and imposing “seemingly reasonable” content restrictions (Simon Reference Simon2014). Contemporary “informational autocrats” also derive benefits from allowing some media criticism as this aids in the functioning of government and provides a veneer of political freedom (Guriev and Treisman Reference Guriev and Treisman2020; Walker and Orttung Reference Walker and Orttung2014). However, direct criticism of the executive branch of government—either in the form of hostile editorials, or coverage of events that directly criticize regime leaders such as protests—is rarely tolerated. Infractions can incur severe penalties, including sizable fines, imprisonment, and state-sanctioned violence (Carter and Carter Reference Carter and Carter2021; Lorentzen Reference Lorentzen2014).
Against this backdrop, the ability of media outlets to report on events and opinions critical of regime elites has become a key variable for the construction of composite indices of both media freedom and democracy (AMB 2022; FreedomHouse 2017; RSF 2022; Whitten-Woodring and James Reference Whitten-Woodring and James2012). To date, social scientists have relied mainly on panels of expert survey respondents to develop measures of media freedom. The V-Dem dataset is one of the most widely used and sophisticated examples of this approach, and it provides survey responses for the measurement of media criticism specifically (Coppedge et al. Reference Coppedge2021; Lührmann, Marquardt, and Mechkova Reference Lührmann, Marquardt and Mechkova2020). In their analysis of different indicators of media freedom, Solis and Waggoner (Reference Solis and Waggoner2021) find that the V-Dem variable measuring the ability of media outlets to criticize the government contributes the most information to their latent measure of media freedom. For these reasons, we focus on media criticism as a key determinant of media freedom overall. Given its wide uptake, we can also use the V-Dem measures of media criticism as an initial reference against which to compare the text-based estimates of media criticism.Footnote 1
3 Measuring Media Criticism
Our main approach to measuring media criticism exploits newly developed word-embedding approaches to project words that appear close to the mention of any leader onto a vector index of opposition and support. This enables us to determine whether the leader(s) are the subject of news of events or opinions that are more or less critical over time and means we can capture a core feature of media freedom—the degree to which criticism of the political executive is permitted.
To validate our approach, we begin by analyzing media reporting in five MENA countries—Algeria, Egypt, Morocco, Saudi Arabia, and Tunisia—over the period from 2010 to 2019, coinciding with the Arab Spring. Two of these countries—Egypt and Tunisia—saw substantial political change over our observation period; the other three saw relative stability. We refer to the former as our “change cases” and to the latter as our “stability cases.” We provide background information about happenings in each case in Section A of the Supplementary Material.
Researchers can increasingly access news media from autocratic contexts at scale. We draw on a set of Arabic-language news articles taken from news aggregation websites for each of the countries in the sample: https://www.djazairess.com/ (Algeria), https://www.masress.com/ (Egypt), https://www.maghress.com/ (Morocco), https://www.sauress.com/ (Saudi Arabia), and https://www.turess.com/ (Tunisia). Articles date from 2008–2019 for each of the countries in our sample. We only include sources that can credibly be characterized as news sources.Footnote 2 In total, across all five countries, we have 335 newspaper sources and a subsample of 8.5m unique news articles. We provide a full list of the newspaper sources and number of articles sampled for each country in Table B.1 in the Supplementary Material.
The smallest sample in our data is the Tunisia corpus, numbering around 1.7m news articles. To ensure comparable overall sample sizes for our analytical samples, we therefore set the total size of the Tunisia sample as our upper limit for sampling other countries. To identify passages related to the political executive, we use the names of the leaders in each of the countries during the time period of their rule. Each leader and the period of their rule are detailed in Table B.2 in the Supplementary Material.
3.1 Word Embedding
Word-embedding techniques represent an important recent advance in the large-scale analysis of text and, in particular, semantic meaning (Caliskan, Bryson, and Narayanan Reference Caliskan, Bryson and Narayanan2017; Charlesworth, Caliskan, and Banaji Reference Charlesworth, Caliskan and Banaji2022; Garg et al. Reference Garg, Schiebinger, Jurafsky and Zou2018). The basic requirement for training a word embedding layer is to convert a corpus of text into a term co-occurrence matrix. With this matrix we are then able to exploit pre-packaged algorithmic architectures to learn the pattern of co-occurrences and derive a distributional representation of each word in the corpus in vector space. To date, most practitioners use one of the GloVe (Pennington, Socher, and Manning Reference Pennington, Socher and Manning2014) or Word2Vec (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013) modeling approaches. In the following, we use GloVe to train our embedding layer. We do so by sampling a maximum of 1.5m news articles across all countries combined and estimating a single embedding layer using the combined data from all countries. We also detail below experiments in the minimum effective training data sample size for this procedure.
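As a rough illustration of the first step described above, the following Python sketch counts term co-occurrences within a six-token window. It is a toy example and not the authors' pipeline, which relies on the R packages quanteda and text2vec. The function name `cooccurrence_counts` and the two example documents are hypothetical, and the counts are unweighted, whereas GloVe implementations often weight co-occurrences by the distance between words.

```python
from collections import Counter


def cooccurrence_counts(docs, window=6):
    """Count symmetric term co-occurrences within `window` tokens.

    docs is a list of tokenized documents (lists of strings).
    Returns a Counter keyed by (word, context_word) pairs.
    """
    counts = Counter()
    for tokens in docs:
        for i, target in enumerate(tokens):
            # Scan to the right only and record both orderings,
            # which keeps the implied matrix symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(target, tokens[j])] += 1
                counts[(tokens[j], target)] += 1
    return counts


# Hypothetical, already pre-processed toy documents.
docs = [
    ["president", "faces", "opposition", "protest"],
    ["parliament", "voices", "support", "president"],
]
# A window of two is used here only so the toy output is easy to inspect.
counts = cooccurrence_counts(docs, window=2)
```

In a real application the resulting counts would be stored as a sparse matrix over the pruned vocabulary before being passed to the embedding algorithm.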
While a promising agenda, studying semantic change over time in this way is confronted with two problems: 1) computational inefficiency; and 2) identification. Training an embedding layer is computationally expensive. As such, examining shifts in the relationship between words (proximity in vector space) over covariates of interest (such as time) requires a large amount of compute power, especially for large corpora. A corollary problem is that when embedding layers are trained over different temporal units, this means that over-time comparisons are no longer robust due to lack of identification in the underlying vector space (Hamilton, Leskovec, and Jurafsky Reference Hamilton, Leskovec and Jurafsky2016; Rodriguez et al. Reference Rodriguez, Spirling and Stewart2023).
3.2 ALC Word Embeddings
A recent innovation by Khodak et al. (Reference Khodak, Saunshi, Liang, Ma, Stewart and Arora2018), and implemented and extended by Rodriguez et al. (Reference Rodriguez, Spirling and Stewart2023), helps solve these problems. The technique—“ALC on Text”—provides a computationally efficient way to identify semantic change over time. The advantage of this technique is that we are able to use a single pre-trained embedding layer, and accompanying transformation matrix, to induce embeddings for a given target word over time without having to retrain an embedding layer for each unit of time.
The efficiency gains of the ALC approach come from the realization that embeddings for a particular (even very rare) target word may be derived by averaging the vectors of embeddings for words within its (here: six-word) context window from a pre-trained embedding layer (Arora et al. Reference Arora, Li, Liang, Ma and Risteski2018; Khodak et al. Reference Khodak, Saunshi, Liang, Ma, Stewart and Arora2018; Rodriguez et al. Reference Rodriguez, Spirling and Stewart2023). Once we have the embeddings of context words, we can then take the average of these vectors to derive our distributional representation of the target word. A transformation matrix, required to downweight words (such as stop words) that appear with high frequency, is then computed using the term co-occurrence matrix used to generate the embedding layer as well as the embedding layer itself (Khodak et al. Reference Khodak, Saunshi, Liang, Ma, Stewart and Arora2018; Rodriguez et al. Reference Rodriguez, Spirling and Stewart2023). Unlike other methods (e.g., dictionary-based approaches), then, ALC embeddings do not rely on specific words appearing within the text corpus to derive a measurement, nor do they rely on our target word appearing with high frequency in the embedding layer.Footnote 3
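The averaging step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the conText implementation: the function name `alc_embedding`, the three-dimensional toy vectors, and the use of an identity matrix in place of an estimated transformation matrix are all hypothetical.

```python
import numpy as np


def alc_embedding(contexts, embeddings, transform):
    """Induce an embedding for a target word from its context windows.

    contexts   : list of token lists, one per occurrence of the target word
    embeddings : dict mapping tokens to vectors from a pre-trained layer
    transform  : (d, d) transformation matrix
    """
    window_means = []
    for tokens in contexts:
        # Tokens missing from the pre-trained vocabulary are skipped.
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        if vecs:
            window_means.append(np.mean(vecs, axis=0))
    if not window_means:
        return None  # no usable context for this target word
    # Average over occurrences, then apply the transformation matrix.
    return transform @ np.mean(window_means, axis=0)


# Hypothetical toy embedding layer (d = 3).
embeddings = {
    "protest": np.array([1.0, 0.0, 0.0]),
    "praise": np.array([0.0, 1.0, 0.0]),
    "policy": np.array([0.0, 0.0, 1.0]),
}
contexts = [["protest", "policy"], ["praise", "policy", "unknown"]]
# Identity matrix used purely as a placeholder for an estimated transform.
vec = alc_embedding(contexts, embeddings, np.eye(3))
```

Because the target word itself never needs a row in the embedding layer, the same function can be applied to rare names or to any subset of occurrences, such as those from a single week.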
In our application, we train an embedding layer across a combined sample of all newspaper sources in all countries that make up our sample. We refer to this as our “reference embedding.” We pre-process the text by removing numbers, stopwords, and punctuation. We then train the embedding layer with the GloVe algorithm, using the R packages quanteda (Benoit et al. Reference Benoit2018) and text2vec (Selivanov, Bickel, and Wang Reference Selivanov, Bickel and Wang2025). We set vector dimensionality to length 300, and use a window size of six. The maximum number of iterations for training the embedding layer was set to 100, and the models all converged under this threshold for each country. We pruned the vocabulary over which to train the embedding layer such that the overall dimensionality of the resulting co-occurrence matrices was $\sim$30,000 × 30,000 (i.e., 30,000 unique words). We then compute the transformation matrix required for the ALC approach using the R package conText developed by Rodriguez et al. (Reference Rodriguez, Spirling and Stewart2023). This is used to reweight words appearing with high frequency in the corpus. We also conduct experiments to determine the size of the feature space required to reliably detect signal.
3.3 Criticism Index
Unlike word frequency or topic modelling approaches, which use a bag of words as their foundation, word-embedding techniques retain the context and order of the text. One advantage of this is that the embedding layers retain information on the semantic associations between words, which means we can use matrix arithmetic to perform analogy tasks or derive index (vector) representations of concepts of interest (Bolukbasi et al. Reference Bolukbasi, Chang, Zou, Saligrama and Kalai2016). Our target leader words are detailed in Table B.2 in the Supplementary Material. We then calculate a criticism dimension by subtracting the vector for the word “opposition” from the vector for the word “support” in our reference embedding. The rationale for subtracting one word from the other is that both poles of the index are then interpretable. It also reduces the number of operations, as target words are projected onto one index rather than two. This gives us a single “criticism index,” which will be used to capture coverage of events or editorial opinions that are critical of the political executive. We infer that text with high cosine similarity to the opposition pole is likely to be critical.
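To make the construction concrete, the sketch below builds a difference vector from two pole words and projects two hypothetical leader embeddings onto it with cosine similarity. The two-dimensional vectors and variable names are invented for illustration. The sign convention follows the description above (the “opposition” vector subtracted from the “support” vector), so in this sketch a leader embedding close to the opposition pole receives a negative score; whether reported scores are subsequently flipped or rescaled is not specified in this section and is left open here.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 2-d reference embedding containing only the two pole words.
reference = {
    "support": np.array([1.0, 0.0]),
    "opposition": np.array([0.0, 1.0]),
}

# Index vector: "opposition" subtracted from "support".
criticism_index = reference["support"] - reference["opposition"]

# Hypothetical period-specific leader embeddings.
leader_week_a = np.array([0.9, 0.1])  # context dominated by supportive language
leader_week_b = np.array([0.1, 0.9])  # context dominated by oppositional language

score_a = cosine(leader_week_a, criticism_index)
score_b = cosine(leader_week_b, criticism_index)
```

Projecting onto a single difference vector means each leader embedding yields one number per period, which is what makes the later time-series analyses straightforward.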
By criticism, we mean both news of events that are critical of the leader and editorial articles that are directly critical of the executive and its policies.Footnote 4 Here, our index captures one or all of three different things, which we understand to denote criticism of the executive:
1. Reporting on events that target the figure of the leader, e.g., protests against the leader or their policies.
2. Opinion articles and editorials detailing failings and allocating blame to the leader or his/her government.
3. Second-hand criticism of the figure of the leader, e.g., reporting on public opinion and soundbites of citizens or other figures critical of the leader.
3.4 Projecting Words Over Time
We can observe temporal trends by calculating the cosine similarities between our target words of interest and our criticism index. To recover the over-time cosine similarities, we first split our observation period into year-week slices, and then extract the context words around our target leader words for each country-week. Using the ALC approach, we then estimate a time-period-specific embedding for the leader from the words appearing around their name over this time period. We do so by taking the average of the vectors of surrounding context words from our pre-trained reference embedding layer for each of the leaders in each country in our sample. We then combine these context words and apply the transformation matrix to downweight commonly appearing words. The weighting specified for the transformation matrix determines the extent to which commonly appearing words are penalized. A larger weighting means fewer words are downweighted.Footnote 5 Here, we set our transformation matrix weighting at 100, which is similar to other published work (Rodriguez et al. Reference Rodriguez, Spirling and Stewart2023). Below, we also detail experiments to determine the influence of the transformation matrix weighting on results.
From this procedure, we are able to induce a single period-specific embedding for each leader over each time period. Once we have recovered these embeddings, we can then project them onto our criticism index by calculating the (l2-normalized) cosine similarity between the vectors for each of our leaders and our criticism index over time. To help applied researchers implement our approach, we have summarized this process in Figure 1.

Figure 1 Data analysis pipeline for measurement of media criticism in Arabic-language news media. This figure illustrates the key steps in our methodology. (1) We collect news articles from publicly available sources across multiple countries and preprocess the text by removing stopwords, punctuation, and irrelevant content. (2) We identify mentions of political leaders and extract context words appearing within a six-word window around each mention. (3) Using pre-trained GloVe embeddings, we generate a reference embedding layer for all articles, which forms the basis for estimating media criticism. (4) We apply the ALC embedding method to construct time-specific word embeddings for each leader, allowing us to track changes in media discourse over time. (5) We compute a criticism index by projecting leader embeddings onto a semantic dimension spanning words associated with support and opposition. This pipeline enables us to estimate media criticism dynamically and at a granular temporal scale.
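The time-slicing step of this pipeline can be sketched as follows. The code groups dated context windows by ISO year-week, induces one embedding per week by averaging, and scores each week against an index vector. It is an assumption-laden toy: the function name `weekly_scores`, the two-dimensional vectors, the example dates, and the identity transformation matrix are hypothetical, and the authors' own implementation uses R.

```python
from collections import defaultdict
from datetime import date

import numpy as np


def weekly_scores(mentions, embeddings, transform, index):
    """Score one leader per year-week.

    mentions : iterable of (date, context_tokens) pairs for leader mentions
    Returns a dict mapping (iso_year, iso_week) to a cosine similarity.
    """
    by_week = defaultdict(list)
    for day, tokens in mentions:
        iso = day.isocalendar()
        by_week[(iso[0], iso[1])].append(tokens)

    scores = {}
    for week, contexts in sorted(by_week.items()):
        # Mean vector of each context window, skipping unknown tokens.
        means = [
            np.mean([embeddings[t] for t in c if t in embeddings], axis=0)
            for c in contexts
            if any(t in embeddings for t in c)
        ]
        if not means:
            continue  # no usable context in this week
        vec = transform @ np.mean(means, axis=0)
        scores[week] = float(
            vec @ index / (np.linalg.norm(vec) * np.linalg.norm(index))
        )
    return scores


# Hypothetical toy inputs.
embeddings = {"praise": np.array([1.0, 0.0]), "protest": np.array([0.0, 1.0])}
index = np.array([1.0, -1.0])  # "support" minus "opposition" in this toy space
mentions = [
    (date(2013, 6, 24), ["praise"]),
    (date(2013, 7, 8), ["protest"]),
]
scores = weekly_scores(mentions, embeddings, np.eye(2), index)
```

The same grouping key could be swapped for days, months, or publication identifiers to obtain scores at other scales.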
4 Observational Diagnostics and Causal Effects
To illustrate the validity and potential use cases for our approach, we focus on media criticism in our change cases: Egypt and Tunisia. We do so for two reasons: 1) to demonstrate the value of the media criticism scores in applied observational and causal settings; 2) to demonstrate how we might benchmark the substantive importance of an observed change in media criticism scores. The first, changepoint, technique provides diagnostics of what constitutes statistically significant change in observed levels of media criticism—and tells us whether such a change aligns with case knowledge. The second, synthetic difference-in-differences, technique benchmarks the size of any change to another case in order to provide counterfactual causal estimates of the effect size of an event in time.
4.1 Changepoint Analysis
To provide a diagnostic routine for detecting signal of abrupt change in the levels of observed media criticism, we use a conventional cumulative sum (CUSUM) changepoint approach to detect structural changes in the time-series data (Zeileis et al. Reference Zeileis, Leisch, Hornik and Kleiber2002). The CUSUM approach works by estimating model residuals as a function of the time parameter of interest. It does so by estimating an OLS model of the outcome of interest, calculating the cumulative sum of standardized residuals over time, and comparing these to a null hypothesis of no change. Here, our specification is:
$${cos\_sim}_t = \beta_0 + \beta_1 {week}_t + \epsilon_t$$
where ${cos\_sim}_t$ is the dependent variable (cosine similarity) at time $t$; $\beta_0$ is the intercept; $\beta_1$ is the slope coefficient for the predictor ${week}_t$; and $\epsilon_t$ is the error term at time $t$. The F-statistic in this approach provides us with an over-time estimate of model fit under two competing hypotheses: one of no structural change and another of structural change. A high F-statistic at a given point in time provides evidence of improved model fit when accounting for some structural change in over-time variation.Footnote 6
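One way to picture the F-statistic sequence is the Chow-type calculation sketched below: for each candidate week, the single-line model above is compared with a model that fits separate lines before and after that week. This is an illustrative numpy sketch, not necessarily the exact statistic produced by the software the authors cite; the function names, the 15% trimming of candidate breakpoints, and the simulated series are assumptions.

```python
import numpy as np


def rss(x, y):
    """Residual sum of squares from an OLS fit of y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def f_statistics(y, trim=0.15):
    """Chow-type F-statistic for each candidate breakpoint k.

    k is the index of the first observation in the second segment.
    """
    n = len(y)
    week = np.arange(n, dtype=float)
    lo, hi = int(np.floor(trim * n)), int(np.ceil((1 - trim) * n))
    rss_restricted = rss(week, y)  # one line for the whole series
    out = {}
    for k in range(lo, hi):
        # Separate intercept and slope on each side of the candidate break.
        rss_unrestricted = rss(week[:k], y[:k]) + rss(week[k:], y[k:])
        # Two extra parameters; four parameters in the unrestricted model.
        out[k] = ((rss_restricted - rss_unrestricted) / 2) / (
            rss_unrestricted / (n - 4)
        )
    return out


# Simulated weekly scores with a level shift after week 60 (hypothetical data).
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.2, 0.02, 60), rng.normal(-0.1, 0.02, 40)])
fstats = f_statistics(y)
best_break = max(fstats, key=fstats.get)
```

The week with the largest statistic is the natural candidate for the changepoint, which mirrors how peaks in the F-statistic series are read in the analyses that follow.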
4.2 Synthetic Difference-in-Differences
We envisage that researchers will want to use our estimation approach to ask counterfactual questions. In particular, they may look to make causal estimates of the effect size of political events on media criticism of the executive. For our cases, the most obvious is the 2013 coup in Egypt. Here, the counterfactual question is: what would media criticism in Egypt have looked like had a coup not happened? The natural comparison case is Tunisia. Both Egypt and Tunisia experienced democratic breakthroughs in early 2011. Both are Arabic-speaking, Muslim-majority republics where entrenched dictators were overthrown through street-level mobilization within a few months of each other (Brownlee, Masoud, and Reynolds Reference Brownlee, Masoud and Reynolds2015; Ketchley and Barrie Reference Ketchley and Barrie2020). In both cases, precarious and highly polarized democratic transitions unfolded, with secular political forces competing for electoral power against organized Islamist movements (Nugent Reference Nugent2020). Crucially, both cases also saw a proliferation of new independent media organizations and the lifting of long-standing restrictions on media reporting during the post-breakthrough democratic transitions (El-Issawi Reference El-Issawi2016). However, unlike in Tunisia, Egypt’s democratic transition was abruptly ended in mid-2013, when a military coup overthrew the country’s first democratically elected president, sparking thousands of anti-government street protests and a cycle of contention that targeted Egypt’s post-coup leadership (Ketchley Reference Ketchley2017, chapter 6).
For our main counterfactual analysis, we exploit the availability of news media from Tunisia to implement the synthetic difference-in-differences estimation procedure as described in Arkhangelsky et al. (Reference Arkhangelsky, Athey, Hirshberg, Imbens and Wager2021). Our analysis uses a panel of ten Egyptian and ten Tunisian newspapers (i) observed at weekly periods (t) beginning at the start of Egypt’s democratic transition in February 2011.Footnote 7 Following the case literature, we assume that Egyptian and Tunisian newspapers observed prior to the coup are operating in transitional democratizing contexts where they are more able to report on news and opinion that criticizes the executive, while newspapers in Egypt after the coup were more constrained in their reporting. Our econometric specification is thus:
$${cos\_sim}_{it} = L_{it} + \tau_{it} W_{it} + \epsilon_{it}$$
where $\tau_{it}$ is the effect of the coup on the cosine similarity score of newspaper $i$ at week $t$, $W_{it}$ is the treatment indicator, and we estimate the average of $\tau_{it}$ over the observations where $W_{it}=1$. The matrix $L_{it}$ contains simple two-way fixed effects at the unit and week level. Unit weights ($\hat{\omega}$) match the pre-trend of the treated newspapers with the untreated controls (here: Tunisian newspapers), while time weights ($\hat{\lambda}$) minimize the differences between the pre- and post-treatment periods for the controls.Footnote 8 In Section H of the Supplementary Material, we also estimate an interrupted time series model. This strategy is useful when applied researchers want to estimate the effect of an event on media criticality, but lack media articles from a comparison case.Footnote 9 We can also imagine that researchers might use newspaper media criticality scores from different contexts to estimate a comparative interrupted time series.
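For readers who want to see how the unit and time weights enter the estimation, the synthetic difference-in-differences estimator in Arkhangelsky et al. (2021) can be written as a weighted two-way fixed effects regression. The display below adapts that paper's general form to the outcome used here; the symbols $\mu$, $\alpha_i$, $\beta_t$, $N$, and $T$ are notation borrowed from that paper and do not appear in the specification above, so the mapping should be read as an illustrative assumption.

$$
(\hat{\tau}, \hat{\mu}, \hat{\alpha}, \hat{\beta}) = \arg\min_{\tau, \mu, \alpha, \beta} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( {cos\_sim}_{it} - \mu - \alpha_i - \beta_t - W_{it}\tau \right)^2 \hat{\omega}_i \hat{\lambda}_t
$$

Here $\alpha_i$ and $\beta_t$ are newspaper and week fixed effects (the role played by $L_{it}$), and $\tau$ is a single average effect over treated observations. Setting all weights equal would recover an ordinary two-way fixed effects difference-in-differences estimate, which is why the weights are what distinguish the synthetic variant.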
4.3 Synthetic Data Simulation
Applied researchers will also want to implement our proposed technique in other languages. To assess this, we innovate by generating synthetic data using two OpenAI large language models (LLMs), specifically gpt-3.5-turbo and gpt-4o. The prompts and code we used to generate these data are in code block 1 in the Supplementary Material. We used a limited prompting design, asking the model to generate 500 articles that were “critical” and 500 that were “not critical” of a political figure we refer to as POLITFIG.Footnote 10 We use this neutral denotation to guard against activating any biases baked into the training data. We iterate over seven additional languages as well as Arabic. These data are useful for two key reasons: 1) they provide evidence of the generalizability of our technique to other languages; 2) they are designed specifically to include articles that are variously “critical” or “not critical,” meaning we are able to determine whether our criticism index is actually capturing this concept. We translate words for support and opposition into each of these languages (see Figure C.1 in the Supplementary Material for the translations used). Instead of re-estimating embedding layers for these languages, we use the pre-trained embeddings for each language provided by Wirsching et al. (Reference Wirsching, Rodriguez, Spirling and Stewart2025). We select those languages that Wirsching et al. (Reference Wirsching, Rodriguez, Spirling and Stewart2025) have validated with human coders: Arabic, Chinese, English, French, Japanese, Korean, Russian, and Spanish.Footnote 11
5 Estimating Media Criticism
Using the pipeline described in Figure 1, we first generate country-level descriptive trends, which we benchmark to expert survey scores. Our word-embedding estimates of media criticism in our change cases closely track those reported in V-Dem (see Figure 2). Spearman’s $\rho$ is .85 for Egypt and .93 for Tunisia, the two countries that underwent democratic transitions during the observation window.Footnote 12 In Egypt, we see an increase in media criticism in the aftermath of Mubarak’s ousting in the 2011 uprising, followed by a sharp decrease in the aftermath of the coup of 2013. In Tunisia, we see a sharp increase in media criticism in the aftermath of the 2010–11 uprising that then stays at approximately the same level for the ensuing years to 2019. For our stability cases, we see relatively flat lines throughout the observation period. We do nonetheless observe that there is more substantial variation in media criticism scores in our stable cases than the expert survey scores would suggest.

Figure 2 A: V-Dem media criticism scores over time across all countries; B: Normalized cosine similarity criticism scores over time across all countries. Lines in panel B are smoothed (LOESS) curves with span ($\alpha$) set to 0.5. Confidence intervals for the V-Dem scores in panel A are based on cross-coder aggregations calculated according to the Bayesian item response theory measurement model described in Coppedge et al. (Reference Coppedge2019).
To aid applied researchers adopting this technique, we run several tests to explore the effects of parameter choices when: a) determining the minimum number of leader words required in each time unit; b) determining the size of the training data necessary for our reference embeddings; c) determining the vocabulary (feature) size of the reference embedding layer; and d) specifying the weighting of the transformation matrix. Full results of these experiments are detailed in the Supplementary Material. Overall, we see that the same basic trends obtain across most iterations of the number of leader words, corpus (training data) size, vocabulary size, and transformation matrix weighting. While we caution readers against placing too much trust in results when data are particularly sparse for a given time period, it appears that just two occurrences of a leader word are sufficient to extract a reasonably reliable signal. Even training data of size 10k news articles seems to recover the basic trends observed above, with the exception of Tunisia. Perhaps surprisingly, a vocabulary size of 1k also picks up a comparable signal across all countries. For the transformation matrix, only the most severe weight setting (100k) exhibits variation that departs markedly from the other parameter settings. Taken together, these checks are good news for applied researchers, as they demonstrate that our approach is able to reliably detect known signals even when the size of the data is comparatively small. It also means the computational cost of the procedure can be lower. That said, to train a word embedding layer with even the largest vocabulary size, applied researchers need no more computational power than that provided by a modern personal computer. After the initial training, the estimation of criticism scores takes seconds.
5.1 Detecting Changepoints
The above estimates demonstrate a high correlation between our text-based measures of media criticism in our change cases and those deriving from expert surveys measured at the country-year level. If we are to incorporate text-based indicators into standard metrics of media freedom, we need a method to detect changes and a measure of the uncertainty associated with these changes. These rule-of-thumb metrics are central to other contributions using large text data as the foundation for early-warning systems (Balashankar, Subramanian, and Fraiberger Reference Balashankar, Subramanian and Fraiberger2023; Stolerman et al. Reference Stolerman2023). To achieve this, we implement a changepoint procedure that estimates an F-statistic of model fit under assumptions of structural change in the level of media criticism. The point where the F-statistic peaks can be understood as the most probable changepoint.
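The logic of locating a changepoint at the peak of an F-statistic can be illustrated with a minimal sketch. It assumes a single shift in the mean of a weekly criticism series and a Chow-type F-test at each candidate week; the function names, the `trim` parameter, and the toy series are hypothetical and are not taken from our replication code.

```python
import numpy as np

def f_statistics(y, trim=0.15):
    """Chow-type F-statistic for a single level shift at each candidate break.

    The restricted model is one constant mean; the unrestricted model allows
    separate means before and after index k. `trim` (an assumed default)
    excludes candidate breaks too close to either end of the series.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    lo = max(2, int(np.floor(trim * n)))
    hi = n - lo
    rss0 = np.sum((y - y.mean()) ** 2)  # residual sum of squares, no break
    fstats = np.full(n, np.nan)
    for k in range(lo, hi + 1):
        rss1 = (np.sum((y[:k] - y[:k].mean()) ** 2)
                + np.sum((y[k:] - y[k:].mean()) ** 2))
        # One restriction tested; two parameters in the unrestricted model
        fstats[k] = (rss0 - rss1) / (rss1 / (n - 2))
    return fstats

def most_likely_break(y, trim=0.15):
    """Return the index where the F-statistic peaks, plus the full sequence."""
    fstats = f_statistics(y, trim)
    return int(np.nanargmax(fstats)), fstats

# Toy series: criticism drops from about 1 to about 0 at week 60
weeks = np.arange(100)
scores = np.where(weeks < 60, 1.0, 0.0) + 0.01 * np.sin(weeks)
break_week, fstats = most_likely_break(scores)
```

In this sketch the returned index is the first week of the post-break regime. Plotting `fstats` against time gives a curve of the kind shown in the top panel of Figure 3.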
Figure 3 displays the weeks with the highest probability of structural change across our two change cases. The most pressing concern for applied researchers will likely be the size of training data required to detect signal. As such, we estimate our changepoint models over all six versions of our training data, i.e., from 10k to 1.5m unique news articles. We keep the feature size constant at 30k.Footnote 13 For both Egypt and Tunisia, we see that all versions, with the exception of the smallest in Tunisia, recover estimates of structural changepoints that align with case knowledge. In Egypt, this is at the beginning of July 2013 following a military coup; in Tunisia, this is in late December 2010 and January 2011, when Ben Ali’s dictatorial regime was ousted from power.

Figure 3 Top panel: F-statistics over time for changepoint procedure across versions; Bottom panel: text-based criticism scores for Egypt and Tunisia and breakpoints for each version. Breakpoints displayed with slight offset for visibility.
5.2 Counterfactual Estimation
Figure 4 shows the results of our synthetic difference-in-differences analysis to estimate the effect of the military coup in July 2013 in Egypt on criticism of the executive. As noted, to generate a plausible pre-coup trend, we use media criticism scores from newspapers from nearby Tunisia, which was also undergoing a democratic transition during this period, to construct unit and time weights in a two-way fixed effects model. The outcome measure is a newspaper’s criticism score assigned to the year-week from the period of democratic breakthrough in early 2011 through to 2019. The results suggest that the treatment in Egypt led to a substantive, enduring, and statistically significant diminution in media criticism, roughly equivalent to a one standard deviation decrease relative to pre-coup media criticism scores ($p < .001$). This marked reduction in media criticism of the post-coup executive comes despite Egypt experiencing rampant inflation, currency devaluation and a foreign exchange crisis, thousands of anti-coup protests that continued for years after the military’s seizure of power, other episodes of street-level mobilization including against food prices and unpopular foreign policy decisions, and protracted insurgencies in the country’s border provinces (Grimm Reference Grimm2019; Ketchley and El-Rayyes Reference Ketchley and El-Rayyes2017; Ketchley Reference Ketchley2017; Nugent and Siegel Reference Nugent and Siegel2024).
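For readers unfamiliar with the estimator, the point estimate in a synthetic difference-in-differences design can be written as a weighted double difference once the unit and time weights are in hand. The sketch below shows only that final step. It takes the weights as given rather than estimating them, and the function name, array layout, and toy numbers are illustrative assumptions rather than our replication code.

```python
import numpy as np

def weighted_did(Y, treated, post, omega, lam):
    """Weighted double difference given unit and time weights.

    Y       : (units x periods) array of outlet-level criticism scores
    treated : boolean mask over units (e.g., Egyptian newspapers)
    post    : boolean mask over periods (weeks after the coup)
    omega   : weights over control units, summing to one
    lam     : weights over pre-treatment periods, summing to one
    The weights are assumed to have been estimated beforehand; that
    optimization step is deliberately omitted here.
    """
    Y = np.asarray(Y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    post = np.asarray(post, dtype=bool)
    omega = np.asarray(omega, dtype=float)
    lam = np.asarray(lam, dtype=float)
    # Each unit's post-period mean minus its weighted pre-period mean
    delta = Y[:, post].mean(axis=1) - Y[:, ~post] @ lam
    # Treated change minus the weighted change among control units
    return delta[treated].mean() - delta[~treated] @ omega

# Toy panel: two control outlets, one treated outlet, four periods
Y = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0],
              [1.5, 2.5, 1.5, 2.5]])
tau = weighted_did(Y,
                   treated=[False, False, True],
                   post=[False, False, True, True],
                   omega=[0.5, 0.5],
                   lam=[0.5, 0.5])
```

With uniform weights this reduces to a conventional difference-in-differences comparison; the synthetic variant differs in choosing the weights so that the weighted control trend tracks the treated units before treatment.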

Figure 4 Treatment effect of the coup (black arrow) on media criticism in Egypt using Tunisian newspapers as a counterfactual. Dashed line marks the coup.

Figure 5 Normalized cosine similarity criticism scores over time across eight languages using synthetic articles generated by gpt-3.5-turbo. Coloured lines represent the linear best fit over each period. The line at 0 is the point at which the LLM shifts to producing “not critical” articles.

Figure 6 Normalized cosine similarity criticism scores over time across eight languages using synthetic articles generated by gpt-4o. Coloured lines represent the linear best fit over each period. The line at 0 is the point at which the LLM shifts to producing “not critical” articles.
5.3 Synthetic Data
Figures 5 and 6 display the results of re-estimating our main analysis across seven additional languages as well as Arabic using synthetic data generated by gpt-3.5-turbo and gpt-4o, respectively. The text data is split into ten time units corresponding to 100 articles each. The call to the OpenAI API specified that language should become less critical after 500 of the 1000 total runs. The imagined time point at which the API shifts to producing “not critical” articles is displayed as 0 in Figures 5 and 6. Reassuringly, we observe that across all languages, there is a marked and statistically significant change at the midway point that is detected by our ALC word embedding approach for both language models. These results demonstrate that our approach is not sensitive to the input language and may be applied to other cases of authoritarianism and democratization.
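The binning and per-period trend lines described above can be sketched as follows. The sketch assumes that the 1000 article-level scores are already computed and ordered by generation run, and that time is indexed so that 0 marks the first unit after the shift; the function names and that indexing choice are illustrative assumptions.

```python
import numpy as np

def bin_scores(scores, n_units=10):
    """Average article-level scores within equal-sized consecutive time units."""
    scores = np.asarray(scores, dtype=float)
    # reshape requires the number of scores to be divisible by n_units
    return scores.reshape(n_units, -1).mean(axis=1)

def period_fits(unit_means, shift_unit=5):
    """Fit a separate straight line to the units before and after the shift.

    Time is re-indexed so that 0 is the first unit generated under the
    "not critical" instruction (an assumed convention).
    """
    unit_means = np.asarray(unit_means, dtype=float)
    t = np.arange(len(unit_means)) - shift_unit
    fits = {}
    for name, mask in (("before", t < 0), ("after", t >= 0)):
        slope, intercept = np.polyfit(t[mask], unit_means[mask], 1)
        fits[name] = (slope, intercept)
    return t, fits

# Toy input: 500 "critical" scores followed by 500 "not critical" scores
scores = np.concatenate([np.ones(500), np.zeros(500)])
unit_means = bin_scores(scores)
t, fits = period_fits(unit_means)
```

A marked gap between the fitted lines on either side of 0 is the pattern the figures are designed to reveal.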
5.4 Robustness
In the Supplementary Material, we provide a number of additional tests to ensure the robustness of our approach. These include comparing scores generated by our criticism index to human-labelled values for a sample of news articles, design-based supervised learning, estimating alternative criticism indices, alternative target words for the political executive, alternative normalization procedures when estimating cosine similarity, and additional checks that our scores are not an artefact of our use of “opposition” to connote criticism.
6 Discussion and Conclusion
In this article, we propose and empirically validate a computationally inexpensive—and completely unsupervised—approach to scoring a key indicator of media freedom: the level of media criticism directed at the political executive. To date, researchers have relied on expert surveys and composite indices to measure the health of the fourth estate across countries and across time. Drawing on news media from autocracies, as well as synthetic media reports, we build on innovations in the computational analysis of text to demonstrate a method that convincingly recovers estimates of media criticism across transitional and stable contexts. For applied researchers, we demonstrate that these scores can be used in both descriptive and causal settings. A series of experiments show that the technique recovers sensible measures even with sparse data. Using synthetic data, we demonstrate that the technique travels to multiple other languages.
Importantly, the technique we propose is not limited to the study of media criticism alone. A key benefit of our method is that we are able to recover over-time estimates of text-based trends even when the target construct or individual is rare (in the text).Footnote 14 Applied researchers might use a version of the method for, for example: estimating the over-time targets of populist speech through the generation of an index of populism and enumerating a set of targets (e.g., institutions, groups, other countries); estimating the issue positions of individual legislators by estimating an index of support versus opposition and enumerating a set of issue targets (e.g., abortion, gun ownership, and healthcare reform); or estimating changes in the targets and level of hostility online by generating a hostility index and identifying a set of targets (e.g., political groups, individuals, and ideas).
Despite its advantages, our method has several possible limitations. First, while our approach effectively measures media criticism, it does not directly capture media freedom, as criticism may also capture changes in leader popularity or economic performance rather than shifts in press autonomy. That said, even very popular leaders will still be subject to criticism from political opponents, just as all policy platforms inevitably create winners and losers. Thus, there is good reason to believe that dramatic declines in criticism are more likely to result from reduced media freedom than from increased leader popularity. Indeed, we expect that in environments with free media, criticism of popular leaders will be especially prevalent, as opponents seek to undermine their support. Future research could integrate additional indicators, such as independent assessments of media restrictions, to further disentangle these factors. Second, our method relies on publicly available news sources, which may introduce selection biases if certain types of outlets are underrepresented or disproportionately censored. The increasing availability of news aggregators should help to ameliorate this problem. Third, while the ALC embedding technique allows for efficient estimation of media criticism over time, it does not account for nuanced rhetorical strategies, such as self-censorship or coded dissent, which may be important in authoritarian contexts. Finally, our approach assumes that language associated with opposition and support is relatively stable over time, though shifts in political discourse could affect our estimates. Despite these limitations, our method provides a scalable and replicable tool for tracking media criticism across diverse settings, offering a valuable complement to expert-coded and survey-based measures.
A number of extensions naturally follow from this advance. Given the temporal granularity we can now rely on, we will be able to incorporate fine-grained measures of media criticism as variables within survey research or into commonly used indices of media freedom. Extensions of the approach might also involve the use of our granular measures as features in a supervised machine-learning context to detect widening or narrowing media freedoms worldwide (Balashankar et al. Reference Balashankar, Subramanian and Fraiberger2023; Mueller and Rauh Reference Mueller and Rauh2018).
For counterfactual designs, we demonstrate how to use outlet-level measurements of media criticism to estimate the effect of major political events on media reporting. This opens the door to estimation of the causal effects of a wide variety of events on media freedom, including new entrants in media markets, violent episodes, and the advent of alternative media (Guriev and Treisman Reference Guriev and Treisman2020; Hale Reference Hale2018; Shirky Reference Shirky2011). Because our approach can be adapted to many languages, it also facilitates comparative analysis. We hope similar methods will be applied in diverse global contexts to augment existing measures and improve our understanding of the dynamics of media freedom in democratic and authoritarian societies alike.
Acknowledgements
Versions of the article were presented at conferences and workshops for Academia Sinica, Taiwan, the Talking Methods Seminar at the University of Edinburgh, the Dealing with Messy Data workshop at the University of Edinburgh, the Social Data Science Hub workshop at the University of Edinburgh, the Washington University in St. Louis Comparative Politics Annual Conference, and the 2020 annual conference of the American Political Science Association. For their feedback on previous drafts, we are particularly grateful to: Zachary Steinert-Threlkeld, Mohammd Dhia Hammami, Aybuke Atalay, Maddi Bunker, Laurence Rowley-Abel, Ala’ Alrababa’h, and Thoraya El-Rayyes.
Funding Statement
The research benefited from an internal departmental grant at the University of Oslo.
Data Availability Statement
Data and code required to reproduce the analyses in this article are available at Barrie (Reference Barrie2025). A preservation copy of the same code and data can also be accessed via Dataverse at https://doi.org/10.7910/DVN/NXESQQ. Due to the data sharing agreement underwriting this research, we are unable to share the raw news text data.
Author Contributions
C.B., N.K., A.S. conceived of project and wrote article; C.B. developed measurement technique; N.K. and A.S. developed additional analyses; M.B. collected original data.
Competing Interest
The authors declare none.
Ethical Standards
The project received Research Ethics Approval from the University of Edinburgh Ref No: ID 286780.
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2025.10012.