While indirect evidence suggests that literary production based on records of oral teaching (so-called reportationes) was already not uncommon in the early scholastic period, very few sources comment on the practice. This article details the design of a study applying stylometric techniques of authorship attribution to a collection developed from reportationes – Stephen Langton’s Quaestiones Theologiae – aiming to uncover layers of editorial work and thus validate some hypotheses regarding the collection’s formation. Following Camps, Clérice, and Pinche (2021), I discuss the implementation of an HTR pipeline and stylometric analysis based on the most frequent words, POS tags, and pseudo-affixes. The proposed study will offer two methodological gains relevant to computational research on the scholastic tradition: it will directly compare performance on manually composed and automatically extracted data, and it will test the validity of transformer-based OCR and automated transcription alignment for workflows applied to scholastic Latin corpora. If successful, this study will provide an easily reusable template for the exploratory analysis of collaborative literary production stemming from medieval universities.
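As a rough, hypothetical illustration of the most-frequent-words representation mentioned above (not the study's actual pipeline), the following Python sketch encodes short text segments as relative frequencies of the corpus-wide most frequent words and compares them with a simple distance measure; the sample texts and the top-10 cutoff are invented for the example.

```python
# Toy sketch of an MFW (most-frequent-words) stylometric representation.
from collections import Counter
import math

segments = {
    "witness_A": "dicendum est quod deus non est causa mali sed permittit malum propter bonum",
    "witness_B": "ad hoc dicendum quod non omne quod est in deo est deus ipse",
}

def tokenize(text):
    return [t for t in text.lower().split() if t.isalpha()]

# 1. Build the corpus-wide MFW list (top-10 for this toy example).
corpus_counts = Counter()
for text in segments.values():
    corpus_counts.update(tokenize(text))
mfw = [w for w, _ in corpus_counts.most_common(10)]

# 2. Represent each segment as a vector of relative MFW frequencies.
def mfw_vector(text):
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in mfw]

vectors = {name: mfw_vector(text) for name, text in segments.items()}

# 3. Compare segments; any stylometric distance (e.g., Burrows's Delta) could replace cosine.
def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

print(cosine_distance(vectors["witness_A"], vectors["witness_B"]))
```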
This article introduces a strategy for the large-scale corpus analysis of music audio recordings, aimed at identifying long-term trends and testing hypotheses regarding the repertoire represented in a given corpus. Our approach centers on computing evolution curves (ECs), which map style-relevant features, such as musical complexity, onto historical timelines. Unlike traditional approaches that rely on sheet music, we use audio recordings, leveraging their widespread availability and the performance nuances they capture. We also emphasize the benefits of pitch-class features based on deep learning, which improve the robustness and accuracy of tonal complexity measures compared to traditional signal processing methods. Addressing the frequent lack of exact work dates (year of composition) in historical corpora, we propose a heuristic method that aligns works with timelines using composers’ life dates. This method effectively preserves historical trends with minimal deviation compared to using actual work dates, as validated against available metadata from the Carus Audio Corpus, which spans 450 years of choral and sacred music and contains 5,729 tracks with detailed metadata. We demonstrate the utility of our strategy through case studies of this corpus, showing how ECs provide insights into stylistic developments that confirm expectations from musicology, thus highlighting the potential of computational studies in this field. For example, we observe a steady increase in tonal complexity from the Renaissance through the Baroque period, stable complexity levels in the 19th and 20th centuries, and consistently higher complexity in minor-key works compared to major-key works. Our visualizations also reveal that vocal music was more complex than instrumental music in the 18th century, but less complex in the 20th century. Finally, we conduct comparative analyses of individual composers, exploring how historical and biographical contexts may have influenced their works. Our findings highlight the potential of this strategy for computational corpus studies in musicological research.
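To make the life-dates heuristic concrete, here is one hypothetical way an undated work could be placed on a timeline from its composer's life dates; the article's actual alignment rule may differ, and the assumed "productive period" starting at age 20 is an invention for illustration.

```python
# Hypothetical illustration only: placing an undated work on a timeline
# using the composer's life dates. The assumed productive-period bounds
# are not taken from the article.
def estimate_work_year(birth_year, death_year, start_age=20):
    """Assign an undated work to the midpoint of the composer's assumed
    productive period (from `start_age` until death)."""
    productive_start = birth_year + start_age
    return (productive_start + death_year) / 2

# Example: a work by a composer living 1685-1750 is placed near 1727.
print(estimate_work_year(1685, 1750))  # -> 1727.5
```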
Texts, whether literary or historical, exhibit structural and stylistic patterns shaped by their purpose, authorship and cultural context. Formulaic texts, which are characterized by repetition and constrained expression, tend to differ in their information content (as defined by Shannon) compared to more dynamic compositions. Identifying such patterns in historical documents, particularly multi-author texts like the Hebrew Bible, provides insights into their origins, purpose and transmission. This study aims to identify formulaic clusters (sections exhibiting systematic repetition and structural constraints) by analyzing recurring phrases, syntactic structures and stylistic markers. However, distinguishing formulaic from non-formulaic elements in an unsupervised manner presents a computational challenge, especially in high-dimensional and sample-poor data sets where patterns must be inferred without predefined labels.
To address this, we develop an information-theoretic algorithm leveraging weighted self-information distributions to detect structured patterns in text. Our approach directly models variations in sample-wise self-information to identify formulaicity. Because we extend classical discrete self-information measures with a continuous formulation based on differential self-information in multivariate Gaussian distributions, the method remains applicable across different types of textual representations, including neural embeddings under Gaussian priors.
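For reference, the classical discrete self-information of an outcome $x$ is $I(x) = -\log P(x)$. The continuous analogue invoked above, for a sample $x \in \mathbb{R}^d$ under a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$, is the negative log-density (the weighting scheme used in the article is not reproduced here):

$$
I(x) \;=\; -\log p(x) \;=\; \tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu) \;+\; \tfrac{1}{2}\,\log\!\bigl((2\pi)^{d}\det\Sigma\bigr).
$$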
Applied to hypothesized authorial divisions in the Hebrew Bible, our approach successfully isolates stylistic layers, providing a quantitative framework for textual stratification. This method enhances our ability to analyze compositional patterns, offering deeper insights into the literary and cultural evolution of texts shaped by complex authorship and editorial processes.
This article presents a critical method for assessing bias in large historical datasets that we term the “Environmental Scan.” The Environmental Scan sheds new light on newspaper collections by linking newly available “reference metadata” gathered from historical sources to existing full-text and catalogue metadata. The rise of computational methods in history and the social sciences, in tandem with newly “datafied” source materials, challenges researchers to adapt their existing critical practices to the increasing scale and complexity of computational research. To help address this challenge, the Environmental Scan situates big historical datasets in much greater context, including estimating what materials are missing, thereby revealing the ways digital collections can be “oligoptic” in nature. Using the British Newspaper Archive (BNA) as a case study, we diagnose the biases and imbalances in the digitised Victorian press. We determine which voices are under- or over-represented in relation to the political composition of the collection as well as its content, and we trace the origins of these biases in the digitisation process. This article informs future interdisciplinary discussions about data bias and offers a conceptual model adaptable to diverse historical datasets. The Environmental Scan provides a more nuanced and accurate understanding of how newspaper data reflects past societies, making it a valuable tool for researchers.
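The basic comparison underlying such a representation check can be sketched as follows; this is a hypothetical illustration rather than the article's implementation, and the categories and counts are invented.

```python
# Illustrative sketch: comparing category shares in a digitised collection
# against "reference metadata" describing the full historical population of
# titles. All numbers and category labels below are invented.
reference = {"liberal": 620, "conservative": 540, "neutral/unknown": 910}   # all known titles
digitised = {"liberal": 240, "conservative": 110, "neutral/unknown": 300}   # titles in the collection

ref_total = sum(reference.values())
dig_total = sum(digitised.values())

for category in reference:
    ref_share = reference[category] / ref_total
    dig_share = digitised.get(category, 0) / dig_total
    # Ratio > 1 suggests over-representation in the digitised collection; < 1, under-representation.
    print(f"{category:>16}: reference {ref_share:.1%}, digitised {dig_share:.1%}, "
          f"ratio {dig_share / ref_share:.2f}")
```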
undate is an ambitious, in-progress effort to develop a pragmatic Python package for computation and analysis of temporal information in humanistic and cultural data, with a particular emphasis on uncertain, incomplete, and imprecise dates and with support for multiple calendars. The development of undate is grounded in domain-specific work on digital and computational humanities projects from multiple institutions, including Shakespeare and Company Project, Princeton Geniza Project, and Islamic Scientific Manuscript Initiative. With increasing support for different formats and calendars, undate aims to bridge technical gaps across different communities and methodologies. In this article, we describe the undate software package and the functionality of the core Undate and UndateInterval classes to work with dates and date intervals. We discuss why this software exists, how it expands on and generalizes prior work, how it compares to other approaches and tools, and its current limitations. We describe the development methodology used to create the software, our plans for active and continuing development, and the potential undate has to impact computational humanities research.
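A minimal sketch of the kind of usage the article describes is given below; the import path and constructor signatures are assumptions based on the package's documented pattern of building partially known dates from year/month/day components, and may differ between undate versions.

```python
# Assumed usage sketch for the undate package; verify against the version in use.
from undate import Undate, UndateInterval

november_2022 = Undate(2022, 11)   # month known, day unknown
year_only = Undate(2001)           # only the year is known
decade = UndateInterval(Undate(1990), Undate(2000))  # an interval between two (un)certain dates

print(november_2022, year_only, decade)
```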