
LLM agents for interactive exploration of historical cadastre data: framework and application to Venice

Published online by Cambridge University Press:  01 October 2025

Tristan Karch* (DH-Lab, EPFL, Lausanne, Switzerland)
Jakhongir Saydaliev (DH-Lab, EPFL, Lausanne, Switzerland)
Isabella Di Lenardo (DH-Lab, EPFL, Lausanne, Switzerland)
Frederic Kaplan (DH-Lab, EPFL, Lausanne, Switzerland)

*Corresponding author: Tristan Karch; E-mail: tristan.karch@gmail.com

Abstract

Cadastral data reveal key information about the historical organization of cities but are often non-standardized due to diverse formats and human annotations, complicating large-scale analysis. As a case study, we explore Venice’s urban history during the critical period from 1740 to 1808, capturing the transition that followed the fall of the ancient Republic and the end of the Ancien Régime. This era’s cadastral data, marked by its volume and lack of uniform structure, presents unique challenges; our approach navigates them to generate spatial queries that bridge past and present urban landscapes. We present a text-to-programs framework that leverages large language models to process natural language queries as executable code for analyzing historical cadastral records. Our methodology implements two complementary techniques: a SQL agent for handling structured queries about specific cadastral information, and a coding agent for complex analytical operations requiring custom data manipulation. We propose a taxonomy that classifies historical research questions based on their complexity and analytical requirements, mapping them to the most appropriate technical approach. This framework is supported by an investigation into the execution consistency of the system, alongside a qualitative analysis of the answers it produces. By ensuring interpretability and minimizing hallucination through verifiable program outputs, we demonstrate the system’s effectiveness in reconstructing past population information, property features and spatiotemporal comparisons in Venice.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Plain Language Summary

This study applies large language models (LLMs) to historical cadastral data, focusing on Venice from 1740 to 1808, a period of major political and social change. Cadastral records, while valuable for urban history, are often unstandardized and challenging to analyze at scale. To address this, a framework is introduced that converts natural language queries into executable code, enabling structured database searches (SQL) and complex analytical operations (Python). By categorizing historical research questions based on complexity, the methodology ensures accuracy and minimizes errors. This approach reconstructs past urban landscapes, uncovering insights into property ownership, land use and socioeconomic patterns. The findings highlight LLMs’ potential for historical urban studies, offering a scalable and reliable method applicable across disciplines.

Introduction

Context

Historical cadastral records, widely distributed throughout Europe, serve as invaluable documents for reconstructing past urban and territorial information (Kain and Baigent Reference Kain and Baigent1992). These records document property ownership, usage functions and other essential elements for taxation, offering high confidence in their reliability due to their administrative purpose (Bloch et al. Reference Bloch, Aakjar, Hall, Tawney and Vogel1929). Often paired with cartographic mappings, these dual systems combine textual descriptions with geographic representations following standardized visual and ontological codes to minimize subjective interpretation and enhance utility for taxation (Bourguet and Blum Reference Bourguet and Blum1988). The evolution of cadastral sources has been extensively studied, with analyses ranging from specific case studies to comparative frameworks. Associated cartography, particularly before and after the Napoleonic introduction of geometric-parcel cadastres, reflects a shift toward standardized cartographic practices (Clergeot Reference Clergeot2007). These sources are critical for reconstructing historical population data, property functions and spatial dynamics, and offer rich opportunities for analyzing urban development and socioeconomic structures.

While digitization has increased access to cadastral records, efficiently querying and analyzing them remains a significant challenge. Traditional methods – based on close reading and manual data extraction – are labor-intensive and difficult to scale across time and space. As researchers seek to engage with these sources at scale, there is a growing need for computational approaches that preserve historical nuance while enabling forms of distant reading. This also aligns with recent calls for more design-oriented methodologies in cadastral research, which emphasize the development of flexible tools for information processing (Çağdaş and Stubkjaer Reference Çağdaş and Stubkjaer2011).

Objectives and contributions

This study introduces a generalizable framework that uses large language models (LLMs) to assist historians in querying and analyzing historical cadastral data. Our objectives are threefold:

  • To overcome the structural heterogeneity of historical cadastral sources, which include orthographic variations, transcription inconsistencies and non-standardized formats that hinder large-scale analysis (see Figure 1).

    Figure 1. The role of LLMs in processing historical cadastres. Processing historical records with orthographic variations and complex transcription details through SQL and coding agents for systematic data analysis.

  • To enable domain experts to interact with these datasets using natural language, by processing research questions as executable programs that retrieve, aggregate or analyze the relevant data.

  • To evaluate the reliability, consistency and scope of LLM-based agents when applied to diachronic and spatially grounded historical records.

Our main contributions are:

  • A typology of historical research questions tailored to cadastral analysis, which distinguishes between browsing (structured lookups) and prompting (contextual or multi-dataset queries), and maps them to the appropriate processing method.

  • A dual-agent system for historical data analysis, combining two complementary strategies:

    • A SQL agent for structured and idiographic queries.

    • A coding agent for more complex analytical operations, such as spatiotemporal comparisons and pattern detection.

  • A scalable and interpretable workflow that outputs verifiable code. This ensures that each answer is grounded in the source data and allows researchers to inspect or modify the generated analysis logic.

  • A case study on the cadastres of Venice (1740 and 1808), through which we evaluate our framework on real-world archival data. This includes an open benchmark of 240 expert-curated research questions (100 for structured browsing and 140 for complex prompting) that span spatial, functional, personal and temporal analyses. The benchmark is designed to reflect authentic urban historical research and support future comparative evaluations. Two example queries are shown in Figure 1.

Related work

Processing historical cadastral data requires bridging challenges related to unstructured formats, orthographic inconsistency and diachronic language variation. Existing computational strategies fall broadly into three categories: supervised machine learning models, rule-based methods and, more recently, LLM-based code generation.

Machine learning approaches have leveraged advances in tabular modeling to support structured data extraction and reasoning. Models such as TabularNet (Du et al. Reference Du, Gao, Xu, Jia, Wang, Zhang, Han and Zhang2021) and TableFormer (Yang et al. Reference Yang, Gupta, Upadhyay, He, Goel, Paul, Muresan, Nakov and Villavicencio2022) integrate neural architectures for improved table parsing, while STab (Hajiramezanali et al. Reference Hajiramezanali, Diamant, Scalia and Shen2022) introduces self-supervised learning for diverse tabular data. Other methods, such as those presented in Zhang et al. (Reference Zhang, Zhang, Shen, Srinivasan, Qin, Faloutsos, Rangwala and Karypis2024), aim to reconstruct incomplete historical records through data synthesis. However, these approaches generally require large labeled datasets and may struggle with the variability of historical formats.

Rule-based methods are often designed with an assumed underlying relational structure (Shigarov and Mikhailov Reference Shigarov and Mikhailov2017), and rely on predefined rules for normalizing toponyms (Garbin and Mani Reference Garbin and Mani2005), recognizing descriptions and applying historical ontologies. While interpretable and often robust in controlled settings, they tend to be brittle when faced with noisy, ambiguous or nonstandard historical data.

LLM-based code generation represents a more flexible alternative. Instead of relying on fixed schema or extensive training, these systems generate executable code – typically SQL for structured lookups or Python for custom analyses – directly from user queries. This paradigm has multiple advantages: (1) it eliminates the need for extensive labeled training data, (2) it produces verifiable outputs through executable programs and (3) it scales across a wide range of query complexities, from simple lookups to advanced spatial-temporal operations. Building on recent advances in LLM-based data agents, our framework adapts and extends developments in (1) robustness, (2) flexibility, (3) modularity and (4) adaptability to the specific challenges of historical cadastral analysis.

(1) Robustness is addressed through InfiAgent-DABench (Hu et al. Reference Hu, Zhao, Wei, Chai, Ma, Wang and Wang2024), which formalizes benchmarks for execution consistency – defined as the statistical stability of results from repeated executions of equivalent queries. We adopt this methodology to quantify run-to-run variance as a proxy for output reliability.

(2) Flexibility is tackled by OpenAgents (Xie et al. Reference Xie, Zhou, Cheng, Shi, Weng, Liu, Hua, Zhao, Liu, Liu, Liu, Xu, Su, Shin, Xiong and Yu2023), which integrates SQL and Python execution within a single agentic framework, enabling seamless transitions between structured retrieval and procedural computation. While this architecture informed our design, our experiments confirm that SQL alone is ill-suited to the irregular schemas and semantic variability characteristic of cadastral sources. Kapoor et al. (Reference Kapoor, Stroebl, Siegel, Nadgir and Narayanan2024) further propose techniques for improving cost efficiency and reducing debugging iterations, which have guided our implementation choices.

(3) Modularity is exemplified by CodeChain (Le et al. Reference Le, Chen, Saha, Gokul, Sahoo and Joty2024), which promotes reusable, composable analysis components – an essential property when applying similar transformations across heterogeneous datasets.

(4) Adaptability is central to KwaiAgents (Pan et al. Reference Pan, Zhai, Yuan, Lv, Ruiji, Liu, Wang and Qin2024), whose architecture combines planning, memory and tool use to support dynamic task execution in varied contexts. We draw on this principle to develop mechanisms that select and combine spatial, temporal and semantic analyses according to query context and dataset properties as in Majumder et al. (Reference Majumder, Surana, Agarwal, Hazra, Sabharwal and Clark2024).

Cadastral data of the city of Venice

The 1740 Catastico represents a textual survey system managed by the Collegio dei Dieci Savi in Rialto, designed to administer the Venetian tithe, a 10 percent property tax introduced in 1463. Property owners submitted self-declared “Condizioni di Decima” or “Polizze” which detailed property type, location, status and income. These submissions were organized by district and sequentially numbered based on submission order, with taxation calculated from declared rents (Chauvard Reference Chauvard and Touzery2007). The general information structure of the document is displayed in Figure 2a. Following a major archival fire in 1514, redecimation efforts occurred sporadically, with significant collections in 1514, 1537, 1566, 1581, 1661, 1711 and 1740. Unlike later geometric cadastres (e.g., the first one introduced in Venice in 1808), the Catastico did not integrate cartographic representation. Instead, records were generated through door-to-door surveys conducted by censors, who documented owners, tenants, toponyms, property functions and rents.

Figure 2. The information structure of (a) Catastici 1740 and (b) Sommarioni 1808. The structure of the two documents is as follows for (a): (1) place name, (2) urban functions, (3) tenants, (4) owners, (5) annual income; (b): (1) cadastral parcel identifier corresponding to a number on the map, (2) owners, (3) door number, (4) urban functions.

In contrast, the first geometric cadastre in Venice was established in 1808 (Pavanello Reference Pavanello1981), adhering to French administrative standards (Clergeot Reference Clergeot, Bourillon and Vivier2008). As displayed in Figure 3, it operates as a dual system, combining cartographic maps of parcels with textual records that document ownership, location, function and area. Each parcel is assigned a unique number, which is cross-referenced with the records that catalog owners, toponyms, uses and dimensions (see Figure 2b). Ownership records include individual names, family relationships and institutions, while functions are classified using a codified Italian ontology from 1808 (Di Lenardo et al. Reference Di Lenardo, Barman, Pardini and Kaplan2021). In particular, the terms reflect historical usage; for instance, a “shop” is referred to as “bottega,” rather than the modern Italian term “negozio.”

Figure 3. The dual information system of the 1808 cadaster. Each parcel mention in the textual document is geolocalized on the cadastral map through the same ID code.

These systems reflect complementary approaches to surveying. The 1808 cadastre is cartographic, parcel-based and systematic, while the Catastico is textual, household-focused and income-oriented. Despite these differences, both systems exhibit stable informational structures (Chauvard Reference Chauvard and Touzery2007). Both have been digitized and transcribed: in the 1808 cadastre, parcel identifiers, codes and toponyms were automatically transcribed and subsequently verified manually (Ares Oliveira, Kaplan, and di Lenardo Reference Ares Oliveira, Kaplan and di Lenardo2017; Ares Oliveira et al. Reference Ares Oliveira, di Lenardo, Tourenc and Kaplan2019). For the 1740 Catastico, geolocation was achieved by correlating toponyms with contemporary maps, reconstructing the censors’ survey paths and identifying shared features between parcels recorded in 1740 and 1808.

Together, these datasets encompass more than 34,000 data points, providing detailed information on owner professions, tenants, property functions, rents, areas and geolocations. In this context, a “data point” corresponds to the digitization of a single cadastral entry (row) containing all associated attributes, such as parcel identifier, owner, profession, tenant and function, with occasional missing values across fields.

A typology of historical questions related to cadastre data

In urban historical research, cadastral data are used to link people and urban functions to territories (Ares Oliveira et al. Reference Ares Oliveira, di Lenardo, Tourenc and Kaplan2019; Ekamper Reference Ekamper2010; Lelo Reference Lelo2020). Researchers consult these records with several objectives. The first is to identify the location of one or more people or urban functions in a specific place or places. The second is to investigate more general principles and test hypotheses about specific places, groups of people or types of function, or to compare periods of time, particularly the past with the present. These research questions often involve complex statistical operations to aggregate and compare information that spans multiple data points from potentially different datasets.

In our evaluation, the first type of question assesses an LLM’s ability to retrieve, combine and compute relevant details – such as owners’ names, urban functions and toponyms – thus using the model as a historical dataset browser. The second type probes the model’s capacity to answer more complex queries that require semantic understanding and spatiotemporal reasoning within the historical context.

Browsing questions

Browsing questions can be categorized into (1) simple aggregation queries and (2) more complex relational queries. Simple aggregation queries focus on straightforward retrieval of information, such as calculating the total rent revenue generated by properties of a specific type. For example, a researcher might ask, “What is the total rent revenue generated from properties of the ‘bottega da casarol’ variety?” This type of query provides quick insights into financial aspects of property types, allowing immediate analysis of income generated from specific categories.

In contrast, relational queries delve deeper into the dataset to examine the relationships and patterns among various data points, such as identifying how many families own properties across multiple categories. An example of such a question could be, “How many families own properties of more than one type category?” These relational queries are essential for revealing trends in property ownership and usage, providing insights into socioeconomic dynamics within urban settings. Table 1 displays an aggregation question as well as a relational question with their related SQL queries. The set of 100 hand-crafted browsing questions about the Catastici 1740 dataset is available in the Additional Methods (see the Appendix).

Table 1. Simple (top) and relational (bottom) browsing questions and their corresponding SQL queries

Note: Simple queries only require a single selection in the dataset while relational queries imply multiple selections.
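For illustration, the two query types could be expressed along the following lines. This is a minimal Python/SQLite sketch: the table name catastici and the database file are assumptions, the column names follow the schema given in the Section “Technical data representation,” and the actual queries in Table 1 may differ.

```python
import sqlite3

con = sqlite3.connect("catastici_1740.db")  # hypothetical database file

# Simple aggregation question: a single selection over one column.
simple_q = """
SELECT SUM(Rent_Income)
FROM catastici
WHERE Property_Type = 'bottega da casarol';
"""

# Relational question: families owning properties of more than one type
# (multiple selections combined through grouping).
relational_q = """
SELECT COUNT(*)
FROM (
    SELECT Owner_Family_Name
    FROM catastici
    GROUP BY Owner_Family_Name
    HAVING COUNT(DISTINCT Property_Type) > 1
);
"""

for query in (simple_q, relational_q):
    print(con.execute(query).fetchone())
```

The contrast is visible in the shape of the SQL: the aggregation question filters and sums in one pass, while the relational question needs a grouped subquery relating several rows per family.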

Prompting questions

Prompting questions are designed to go beyond mere data retrieval or the aggregation of information from individual datasets. Unlike browsing questions, which rely on exact matching of entities and are thus not robust to typos, synonyms or variations – and which require users to know precisely which data points exist within the dataset – prompting questions aim to leverage multiple data sources along with common-sense understanding to uncover richer, more nuanced insights. Such questions require a deep understanding of linguistic subtleties, particularly in categories such as professions, ownership and the intricate interrelations among entities within tabular data. This involves not only extracting explicit information, but also interpreting implicit connections. Furthermore, prompting questions frequently demand a conceptual grasp of spatial and temporal dynamics to effectively organize and contextualize data. In certain instances, city-specific knowledge becomes crucial for identifying diachronic language, local customs or accurately inferring distances.

After careful analysis of the datasets and their potential applications, we identified that meaningful questions about historical cadastral data could be organized into four distinct categories. The first category leverages the geocoordinates of the cadastre entries to examine spatial distributions, enabling queries that bridge past and present urban landscapes. By relating historical properties to relatively stable urban landmarks such as churches and squares (extracted from OpenStreetMap [OpenStreetMap contributors 2017]), we can investigate how individuals and properties were allocated across diverse areas. Although these landmarks may have undergone modifications or reconstructions over time, they often maintain their general location and social function, serving as semipersistent spatial anchors for historical analysis. The second category is dedicated to building functions, exploring the intended purposes or uses of various structures within the urban environment. The third category focuses on personal information, examining demographic and socioeconomic characteristics associated with individuals in the cadastral data. Finally, the fourth category targets temporal analysis, specifically comparing data over two distinct periods to reveal trends, shifts or patterns over time.

In sum, we have curated a comprehensive set of 140 questions, which we are releasing as an open-source resource along with this article. All questions have been validated by urban specialists to ensure their relevance for urban analysis applications. Table 2 presents a selection of questions from each category. The questions encompass diverse expected output formats, ranging from binary yes/no responses to numerical values or the identification of specific entities.

Table 2. Examples of prompting questions

Note: Alongside their category and expected output format.

Unlike browsing questions, designing SQL queries for prompting questions presents significant challenges due to their inherently complex nature. Prompting questions often require insights that extend beyond the information readily available in the dataset’s structure or columns. The intricate operations necessary for these questions move beyond simple data filtering and aggregation, involving advanced processes such as semantic searches, spatial computations and statistical tests or correlation evaluations. Additionally, certain prompting questions demand the incorporation of external knowledge or common-sense reasoning, which cannot be encapsulated purely through SQL.

Methods

Overview

Our framework combines two complementary approaches, each tailored to a category of research questions defined in the Section “A typology of historical questions related to cadastre data”:

  1. SQL agent – designed for browsing questions that involve structured lookups and relational queries over a single historical dataset.

    • Input: a natural language question.

    • Output: an SQL query executed on the target table to return results.

  2. Coding agent – designed for prompting questions requiring integration of multiple datasets and complex operations, such as spatial, temporal or semantic reasoning.

    • Input: a natural language question.

    • Output: an executable Python program that performs the required analysis, optionally including multi-step planning, entity extraction and consistency evaluation.

While both agents generate executable code directly from natural language, their architectures differ: the SQL agent uses a single text-to-SQL model over a defined schema, whereas the coding agent coordinates multiple specialized LLM components to extract entities, plan analyses and generate Python code.
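As a schematic illustration of this dispatch, consider the sketch below. The function names and the classifier are our own assumptions for exposition, not the paper’s implementation; only the two-way typology comes from the text.

```python
def run_sql_agent(question: str):
    ...  # text-to-SQL (CodeS-7B) over a single cadastral table

def run_coding_agent(question: str):
    ...  # entity extraction -> planning -> Python code generation

def route_question(question: str, classify):
    """Dispatch a question to the agent matching its typology class.

    `classify` is any LLM call returning 'browsing' for structured
    lookups over one table, or 'prompting' for multi-dataset questions
    requiring spatial, temporal or semantic reasoning.
    """
    if classify(question) == "browsing":
        return run_sql_agent(question)
    return run_coding_agent(question)
```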

Technical data representation

Our evaluation uses two historical cadastral datasets and a modern geographic reference dataset, represented in a simplified, analysis-ready form for the purposes of this study.

  • Catastici (1740) – a tabular dataset with seven columns: Catastici_ID [integer], Owner_ID [integer], Owner_First_Name [text], Owner_Family_Name [text], Property_Type [text], Rent_Income [integer], Property_Location [text] (see the schema sketch after this list).

  • Sommarioni (1808) – a comparable cadastral dataset with owner and property attributes, structured for cross-temporal comparison with the Catastici.

  • Landmarks dataset – a set of semi-persistent urban features (e.g., churches, squares) extracted from OpenStreetMap, represented as point geometries. These landmarks serve as spatial anchors for contextualizing historical property locations within a stable geographic framework.
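For concreteness, the Catastici table described above can be rendered as a SQLite schema. This is a sketch: the table name and the CREATE TABLE statement are assumptions; only the column names and types come from the description.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE catastici (
    Catastici_ID      INTEGER,
    Owner_ID          INTEGER,
    Owner_First_Name  TEXT,
    Owner_Family_Name TEXT,
    Property_Type     TEXT,
    Rent_Income       INTEGER,
    Property_Location TEXT
);
""")
```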

SQL agent

Questions

We curated 100 expert-designed browsing questions on the Catastici (1740) dataset, covering simple retrieval and relational queries, as detailed in the previous section.

Architecture

Natural language questions are converted into SQL queries using the open-source text-to-SQL model CodeS-7B (Li et al. Reference Li, Zhang, Liu, Fan, Zhang, Zhu, Wei, Pan, Li and Chen2024). Each prompt includes a detailed description of the table schema and its columns. We evaluate two prompting settings: in the zero-shot setting, the question is provided with table metadata only; in the three-shot setting, it is provided alongside three example question–query pairs. For each question, the model is run four times with identical inputs, and the final SQL query is selected by majority voting before being executed via SQLite to obtain the corresponding answer. Figure 4 provides an illustration of the SQL agent. For further clarity, examples of the prompts designed to interact with the CodeS-7B model are provided in the Appendix.
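A minimal sketch of this sampling-and-voting step follows; build_prompt and generate_sql are assumed names for the prompt construction (schema description plus, in the three-shot setting, three example pairs) and the CodeS-7B call.

```python
import sqlite3
from collections import Counter

def answer_browsing_question(question, build_prompt, generate_sql, db_path):
    """Run the text-to-SQL model four times and keep the majority query."""
    prompt = build_prompt(question)
    candidates = [generate_sql(prompt) for _ in range(4)]
    best_query = Counter(candidates).most_common(1)[0][0]  # majority vote
    with sqlite3.connect(db_path) as con:
        return con.execute(best_query).fetchall()
```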

Figure 4. The SQL agent. Questions are embedded into a prompt engineered to match the CodeS model’s requirements.

Coding agents

Questions

As discussed in the Section “A typology of historical questions related to cadastre data,” prompting questions fall into four categories: spatial, functional, personal and temporal. Spatial questions locate and organize records from the Catastici (1740) and Sommarioni (1808) datasets relative to landmarks from the OpenStreetMap-based Landmarks dataset (Figure 5), while temporal questions identify and compare patterns across the two periods. The coding agent integrates these three datasets, using semi-persistent features such as churches and squares as spatial anchors. Although these structures may have changed physically, their stable locations and enduring social functions provide a consistent framework for situating historical cadastral data within contemporary geography.

Figure 5. The coding agent. The agent receives a question and consults different datasets to (1) extract the entities being referred to, (2) create a plan to answer it and (3) produce and run a Python script that generates an answer.

Architecture

The coding agent operates as a dialogue among three specialized components: the entity extractor, the planner and the coder. Each is an LLM prompted for a specific role (see “Additional Methods” in the Appendix), with outputs from one component guiding the next. The overall information flow is orchestrated using LangChain (Chase Reference Chase2022) (Figure 5).

Entity extraction

We use a retrieval-augmented generation (RAG) approach (Lewis et al. Reference Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal and Küttler2020) to align broad queries with dataset content. The system selects the relevant dataset, maps question terms to columns and retrieves entries using exact matching (e.g., avocato), fuzzy matching for spelling variants (e.g., avvocato) and semantic matching for related terms (e.g., procuratore). Exact and fuzzy matching offer greater precision, while semantic search increases recall at the cost of specificity (Figure 6).
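The three matching modes can be sketched as follows. This is a simplified illustration using the standard library’s difflib for fuzzy matching and a generic embed callable for semantic search; the actual implementation, libraries and thresholds may differ.

```python
import difflib
import numpy as np

def exact_match(term, values):
    # Highest precision: only literal occurrences (e.g., 'avocato').
    return [v for v in values if v == term]

def fuzzy_match(term, values, cutoff=0.85):
    # Tolerates spelling variants such as 'avvocato' for 'avocato'.
    return difflib.get_close_matches(term, values, n=5, cutoff=cutoff)

def semantic_match(term, values, embed, k=5):
    # Highest recall: surfaces related terms such as 'procuratore'.
    # `embed` is any text-embedding call returning a vector per string.
    q = np.asarray(embed(term))
    scored = []
    for v in values:
        e = np.asarray(embed(v))
        sim = float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))
        scored.append((sim, v))
    return [v for _, v in sorted(scored, reverse=True)[:k]]
```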

Figure 6. The Entity Extractor phase. Given a question, this phase extracts the most relevant rows from the datasets.

Code generation

Following the Plan-and-Solve approach (Wang et al. Reference Wang, Xu, Lan, Hu, Lan, Lee, Lim, Rogers, Boyd-Graber and Okazaki2023), the planner produces a detailed execution strategy based on dataset metadata, extracted entities and query mappings. The coder then translates this plan into Python code, which is executed in a controlled environment. Errors trigger iterative debugging up to a fixed retry limit, after which unresolved queries are marked as unanswerable.
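Schematically, the loop can be sketched as follows. Here plan_llm and code_llm are assumed wrappers around the planner and coder components, and the retry limit of three is illustrative, as the text specifies only a fixed limit.

```python
def plan_and_solve(question, metadata, entities, plan_llm, code_llm, max_retries=3):
    """Plan, generate Python, execute, and debug on failure."""
    plan = plan_llm(question=question, metadata=metadata, entities=entities)
    code = code_llm(plan=plan)
    for _ in range(max_retries):
        try:
            namespace = {}
            exec(code, namespace)            # controlled execution environment
            return namespace.get("answer")   # convention: result stored in `answer`
        except Exception as err:             # error message fed back to the coder
            code = code_llm(plan=plan, previous_code=code, error=str(err))
    return None                              # unresolved: marked as unanswerable
```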

Consistency measures

We assess Execution Consistency (EC) as the proportion of identical outputs obtained when repeating the same question three times under different random seeds, capturing the stability of generated code and results. Following prior work (Hu et al. Reference Hu, Zhao, Wei, Chai, Ma, Wang and Wang2024; Kapoor et al. Reference Kapoor, Stroebl, Siegel, Nadgir and Narayanan2024), we use three runs as a cost–accuracy trade-off: it is sufficient to reveal stability patterns while keeping evaluation costs manageable.
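Computing this label from repeated runs is straightforward; a minimal sketch, assuming answers can be compared by string equality:

```python
from collections import Counter

def execution_consistency(answers):
    """Label three answers to the same question as EC-3, EC-2 or EC-1."""
    top_count = Counter(map(str, answers)).most_common(1)[0][1]
    return f"EC-{top_count}"

# Example: three runs of the same question with different seeds.
print(execution_consistency([42, 42, 42]))  # EC-3: identical in all runs
print(execution_consistency([42, 42, 17]))  # EC-2: identical in two of three
```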

Results

SQL agent results

We evaluated the SQL agent on 100 curated browsing questions covering retrieval and relational operations over the Catastici (1740) dataset. Performance was assessed using two complementary metrics: exact match accuracy, measuring the correspondence between generated and ground-truth SQL queries, and unigram overlap, capturing lexical similarity and accounting for cases in which different query formulations yield equivalent outputs. Ground truth queries were manually authored and executed to obtain reference results. No SQL runtime errors were observed in any setting.
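The exact formula for unigram overlap is not given in the text; one plausible reading, with lowercased whitespace tokenization assumed, is sketched below.

```python
def unigram_overlap(generated: str, gold: str) -> float:
    """Share of ground-truth unigrams also present in the generated query.

    One plausible reading of the metric; tokenization and normalization
    are assumptions, not the paper's specification.
    """
    gen, ref = set(generated.lower().split()), set(gold.lower().split())
    return len(gen & ref) / len(ref) if ref else 0.0

print(unigram_overlap(
    "SELECT SUM(Rent_Income) FROM catastici",
    "SELECT SUM(Rent_Income) FROM catastici WHERE Property_Type = 'casa'",
))  # 0.5: half of the reference unigrams are covered
```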

In the zero-shot configuration, the SQL agent achieved 52 percent exact match accuracy and 86 percent unigram overlap. Providing three example question–query pairs (three-shot prompting) substantially improved these scores to 79 percent and 92 percent, respectively, indicating strong benefits from in-context learning (Table 3).

Table 3. Performance of CodeS-7B on browsing tasks

Note: Exact match and unigram overlap scores for 0-shot and 3-shot settings, with zero SQL runtime errors.

Error analysis revealed that most failures arose from misinterpreting output specifications. For example, when asked “How many owners receive more than 100 ducati in total rent income?,” the system returned a detailed list rather than an aggregate count. Similarly, in response to “What is the total rent income of the top 5 earners?,” the system provided individual incomes rather than a summed value. More complex analytical tasks, such as computing the “average rent income variance across all locations” or the “share of income from properties labeled as ‘bottega da fabro’,” also proved challenging, reflecting limitations in multi-step aggregation and specialized filtering.

Overall, these results demonstrate that the SQL agent can reliably translate natural language queries into executable SQL for idiographic browsing of historical cadastral data, with high accuracy and zero execution errors. Remaining weaknesses are primarily associated with advanced aggregation logic and nuanced output formatting. These limitations emerge most clearly when questions require multi-step reasoning, integration of heterogeneous sources, or contextual interpretation – tasks that move beyond structured browsing and into the prompting category addressed by the coding agent.

Coding agent results

Execution consistency

We assessed the coding agent using the EC metric, defined as the proportion of identical answers obtained when executing the same query three times with different random seeds. EC-3 denotes perfect consistency (identical answers in all three runs) and EC-2 denotes partial consistency (identical answers in two out of three runs).

Figure 7 summarizes EC results by question category and by answer type. Personal and functional questions achieved the highest consistency (≈95% EC-3), followed by spatial (≈85%) and comparison (≈80%) queries. Analysis by answer type revealed that yes/no and single-numerical responses were more consistent (≈90% EC-3) than entity-name extractions (≈60% EC-3), likely due to the greater determinism of binary and numerical operations compared to multi-step entity retrieval.

Figure 7. Execution consistency (EC) of the coding agent. (a) EC grouped by question category; (b) EC grouped by answer type.

Manual inspection of EC-3 responses indicated that 12 of 79 consistently generated answers contained errors, yielding a 15.2 percent error rate. This finding suggests that while perfect consistency is not a guarantee of correctness, it is a strong indicator of answer reliability.

Model comparison

We compared GPT-4 (OpenAI et al. Reference OpenAI, Adler, Agarwal, Ahmad, Akkaya and Aleman2024) and Llama-3 70B (Grattafiori et al. Reference Grattafiori, Dubey, Jauhri, Pandey, Kadian, Al-Dahle and Letman2024) on the same prompting tasks (Figure 8). Both models achieved near-perfect execution accuracy (i.e., generated programs ran without error). However, GPT-4 produced markedly higher EC and answer correctness. The primary source of this difference was not code syntax, but rather query interpretation during the planning stage. For example, when asked “Which square has the highest density of tenants?”, both models generated valid Pandas code, but GPT-4 correctly mapped “square” to the appropriate geographic entity type and returned campo san bartolomeo, while Llama-3 misaligned the entity resolution and returned corte bollani. This illustrates that GPT-4’s advantage stems from emergent reasoning capabilities associated with larger-scale models, enabling more accurate semantic parsing of complex questions, rather than purely from code generation proficiency.

Figure 8. Model comparison on prompting tasks. Performance of GPT-4 and Llama-3 70B in terms of execution consistency and correctness.

Qualitative analysis of data operations

Table 4 illustrates how the coding agent systematically translates complex natural language questions into executable data analysis operations. Each question is decomposed into key entities and references, mapping linguistic elements (e.g., “rent price,” “square”) to corresponding dataset columns and types. Once entities are identified, the agent determines the appropriate analytical procedure – such as correlation analysis, filtering or aggregation – tailored to the query. For example, a question about the correlation between rent price and proximity to squares in 1740 is converted into a Pearson correlation calculation, while queries involving temporal comparisons or categorical relationships employ statistical methods such as chi-square tests or comparative metrics.

Table 4. Representative examples of EC-3 answers

Note: Each row links question elements to dataset columns and describes the corresponding code operation.
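As an illustration, the correlation example above (rent price versus proximity to squares in 1740) might reduce to code along these lines. This is a sketch: the haversine distance, the pandas column layout and the use of scipy.stats.pearsonr are our assumptions, not necessarily the agent’s generated code.

```python
import numpy as np
from scipy.stats import pearsonr

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between WGS84 points (vectorized)."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * 6371000.0 * np.arcsin(np.sqrt(a))

def rent_vs_square_proximity(properties, squares):
    """Pearson correlation between rent and distance to the nearest square.

    `properties` is assumed to be a DataFrame with 'lat', 'lon' and 'rent'
    columns, and `squares` a DataFrame of landmark points with 'lat'/'lon';
    both layouts are illustrative, not the paper's actual schema.
    """
    nearest = [
        haversine(p.lat, p.lon,
                  squares["lat"].to_numpy(), squares["lon"].to_numpy()).min()
        for p in properties.itertuples()
    ]
    return pearsonr(properties["rent"], nearest)
```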

Two broad error classes emerged from this qualitative inspection. First, semantic ambiguity in entity mapping: modern descriptors such as “commercial building” or “shop” were often mapped to historical terms like magazzeno or locale, which in 18th-century Venetian context usually referred to storage or ancillary spaces. This mismatch inflated counts in spatial aggregation tasks, such as identifying squares with the highest density of commercial buildings. Second, diachronic linguistic variation: terms like negozio or ufficio, introduced in later centuries, were substituted for the historically accurate bottega when analyzing functional change over time, leading to systematic undercounting.

From the EC-3 subset of outputs, three consistent trends emerged:

  • Robust mappings in well-defined domains. Queries involving unambiguous spatial anchors (e.g., square, church) or numeric attributes (e.g., rent price) were consistently mapped to correct dataset fields and processed with straightforward statistical or spatial operations, yielding high EC and correctness.

  • Greater variability in functional classifications. Queries involving building function, especially those requiring aggregation over categories (e.g., “multi-function buildings”), were more sensitive to lexical variability and synonym coverage.

  • Temporal integration challenges. Cross-dataset comparisons between 1740 and 1808 often failed to bridge lexical gaps when direct string matches were absent, highlighting the need for historical ontology alignment.

Overall, the coding agent demonstrates strong performance when entity mappings are unambiguous and align with stable schema elements. However, performance degrades when semantic drift, synonym variation or temporal differences in terminology are present, underscoring the potential value of incorporating structured historical vocabularies into the pipeline.

Discussion

Our results provide several insights into the use of LLM-based text-to-program systems for historical cadastral analysis. First, when applied to structured queries, specialized SQL and general-purpose coding agents achieve comparable accuracy: the coding agent reaches a unigram overlap of 0.85 on browsing questions, indicating that its greater versatility does not come at the expense of precision. This suggests that a single Python coding agent pipeline could address both simple and complex queries, offering a unified solution for heterogeneous analytical needs.

A central advantage of the proposed framework is its interpretability. By producing executable code rather than direct natural language answers, the system anchors every result in the source data, reducing the risk of unsupported inferences. The generated programs act as transparent reasoning traces in which analytical assumptions and methodological choices are explicitly encoded. Our comparison of GPT-4 (OpenAI et al. Reference OpenAI, Adler, Agarwal, Ahmad, Akkaya and Aleman2024) and Llama-3 70B (Grattafiori et al. Reference Grattafiori, Dubey, Jauhri, Pandey, Kadian, Al-Dahle and Letman2024) (Figure 8) shows a substantial performance gap, largely attributable to differences in query interpretation rather than code generation. While this underscores current limitations of open-source LLMs for complex historical analysis, rapid progress in the field suggests the gap may narrow. The framework is also city-agnostic: although demonstrated on Venetian cadastres, it can be adapted to other urban contexts with digitized records, requiring only minor schema and context adjustments.

Several limitations must be considered. First, the interpretability benefits of code outputs assume some programming literacy, which may limit accessibility for historians without technical backgrounds. An intermediate layer translating code into natural language explanations could broaden usability. Second, performance declines when processing diachronic linguistic variation, pointing to the need for time-aware RAG methods tailored to historical dialects and evolving terminology. Third, the use of semi-permanent urban anchors (e.g., churches, squares) is effective for spatial grounding but may introduce anachronisms or oversimplifications if not critically assessed. More broadly, results must be interpreted in light of the assumptions and biases inherent in the underlying datasets.

These findings highlight promising directions for future work: improving accessibility through natural language code explanations, enhancing historical language processing, and developing more nuanced temporal–spatial integration methods. Despite current challenges, the proposed framework demonstrates that LLM-based agents can support rigorous, verifiable, and scalable historical urban research.

Data availability statement

An extended version of the 1808 Venetian cadastre is available through a Zenodo repository (https://doi.org/10.5281/zenodo.16761169) (Di Lenardo et al. Reference Di Lenardo, Viaccoz, Guhennec, Musso and Kaplan2025). The exact dataset employed in this study, together with the full set of questions and the accompanying code, is provided in the associated GitHub repository (https://github.com/dhlab-epfl/venice-agents).

Competing interests

The authors declare no competing interests.

Ethical standards

The research meets all ethical guidelines, including adherence to the legal requirements of the study country.

Disclosure of use of AI tools

Portions of this manuscript were prepared with the assistance of AI-based writing and editing tools (OpenAI ChatGPT, GPT-5). These tools were used to improve clarity, structure, and language, while all ideas, interpretations, and conclusions are the authors’ own.

Appendix: Additional methods

Questions for browsing the Catastici 1740

SQL CodeS-7B Prompts

Coding agent implementation details

We use the OpenAI API (https://platform.openai.com/docs/overview). Generation hyperparameters (top-p, temperature, …) are left at their default values, and runs are seeded randomly. The system is implemented using the LangChain framework (https://www.langchain.com/), which provides the flexibility needed to implement the interactions between the agents.

Coding agent prompts

This section contains all the prompts used at each step of the coding agent’s process. The first two prompts serve as system prompts, provided to the agent along with a description of the datasets as part of its context. The subsequent prompts align with the agent’s workflow, as shown in Figure 5. The process begins with extracting references, followed by entity extraction. Next, the agent generates a plan and subsequently writes the necessary code. If errors occur during code execution, the agent debugs the code based on the Python console’s error messages. In the following prompts, all inputs used to build each prompt are highlighted in blue. In-context examples are given for reference and entity extraction.

Footnotes

T.K. and J.S. are joint first authors.

This article was awarded Open Data and Open Materials badges for transparent practices. See the Data availability statement for details.

References

Ares Oliveira, Sophia, Kaplan, Frédéric, and di Lenardo, Isabella. 2017. “Machine Vision Algorithms on Cadaster Plans.” In Digital Humanities Conference in Montreal. Montreal. https://infoscience.epfl.ch/record/254960?ln=en.
Ares Oliveira, Sophia, di Lenardo, Isabella, Tourenc, Bastien, and Kaplan, Frédéric. 2019. “A Deep Learning Approach to Cadastral Computing.” In Digital Humanities Conference. Utrecht: Alliance of Digital Humanities Organizations (ADHO).
Bloch, Marc, Aakjar, Svend, Hall, Hubert, Tawney, A.-H., and Vogel, Walther. 1929. “Les plans parcellaires: Allemagne, Angleterre, Danemark, France” [in fr]. Annales 1, no. 1: 60–70. https://doi.org/10.3406/ahess.1929.1039.
Bourguet, Marie-Noëlle, and Blum, Alain. 1988. Déchiffrer la France: La statistique départementale à l’époque napoléonienne. Paris: Editions des archives contemporaines.
Çağdaş, Volkan, and Stubkjaer, Erik. 2011. “Design Research for Cadastral Systems.” Computers, Environment and Urban Systems 35, no. 1: 77–87. https://doi.org/10.1016/j.compenvurbsys.2010.07.003.
Chase, Harrison. 2022. Langchain. Released on 2022-10-17. https://github.com/langchain-ai/langchain.
Chauvard, Jean-François. 2007. “Les catastici vénitiens de l’époque moderne. Pratique administrative et connaissance du territoire” [in fr]. In De l’estime au cadastre en Europe. L’époque moderne, edited by Touzery, Mireille, 419–454. Histoire économique et financière - Ancien Régime. Vincennes: Institut de la gestion publique et du développement économique. Accessed January 22, 2025. https://books.openedition.org/igpde/9768.
Clergeot, Pierre. 2007. Cent millions de parcelles en France. 1807: Un cadastre pour l’Empire. Paris: Publi-Topex.
Clergeot, Pierre. 2008. “Le recueil méthodique de 1811” [in fr]. In De l’estime au cadastre en Europe. Les systèmes cadastraux aux XIXe et XXe siècles, edited by Bourillon, Florence and Vivier, Nadine, 167–73. Histoire économique et financière - XIXe-XXe. Vincennes: Institut de la gestion publique et du développement économique. Accessed October 17, 2024. https://books.openedition.org/igpde/10963.
Di Lenardo, Isabella, Barman, Raphaël, Pardini, Federica, and Kaplan, Frédéric. 2021. “Une approche computationnelle du cadastre napoléonien de Venise.” Humanités numériques, no. 3: 133. https://doi.org/10.4000/revuehn.1786.
Di Lenardo, Isabella, Viaccoz, Cédric, Guhennec, Paul, Musso, Carlo, and Kaplan, Frédéric. 2025. “Venice 1808 Land Registers (August).” https://doi.org/10.5281/zenodo.16761169.
Du, Lun, Gao, Fei, Xu, Chen, Jia, Ran, Wang, Junshan, Zhang, Jiang, Han, Shi, and Zhang, Dongmei. 2021. “TabularNet: A Neural Network Architecture for Understanding Semantic Structures of Tabular Data.” In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 322–31. KDD ’21. Virtual Event, Singapore: Association for Computing Machinery. https://doi.org/10.1145/3447548.3467228.
Ekamper, Peter. 2010. “Using Cadastral Maps in Historical Demographic Research: Some Examples from the Netherlands.” The History of the Family 15, no. 1: 1–12. https://doi.org/10.1016/j.hisfam.2010.01.003.
Garbin, Eric, and Mani, Inderjeet. 2005. “Disambiguating Toponyms in News.” In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 363–70. Vancouver, BC: Association for Computational Linguistics (ACL).
Grattafiori, Aaron, Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, et al. 2024. “The Llama 3 Herd of Models.” https://arxiv.org/abs/2407.21783.
Hajiramezanali, Ehsan, Diamant, Nathaniel Lee, Scalia, Gabriele, and Shen, Max W. 2022. “STab: Self-Supervised Learning for Tabular Data.” In NeurIPS 2022 First Table Representation Workshop. https://openreview.net/forum?id=EfR55bFcrcI.
Hu, Xueyu, Zhao, Ziyu, Wei, Shuang, Chai, Ziwei, Ma, Qianli, Wang, Guoyin, Wang, Xuwu, et al. 2024. “InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks.” Accessed June 26, 2024. http://arxiv.org/abs/2401.05507.
Kain, Roger J.P., and Baigent, Elisabeth. 1992. The Cadastral Map in the Service of the State. Chicago, IL: University of Chicago Press.
Kapoor, Sayash, Stroebl, Benedikt, Siegel, Zachary S., Nadgir, Nitya, and Narayanan, Arvind. 2024. “AI Agents that Matter” [in en]. Accessed July 3, 2024. http://arxiv.org/abs/2407.01502.
Le, Hung, Chen, Hailin, Saha, Amrita, Gokul, Akash, Sahoo, Doyen, and Joty, Shafiq. 2024. “CodeChain: Towards Modular Code Generation Through Chain of Self-Revisions with Representative Sub-Modules” [in en]. Accessed August 26, 2024. http://arxiv.org/abs/2310.08992.
Lelo, Keti. 2020. “Analysing Spatial Relationships through the Urban Cadastre of Nineteenth-Century Rome.” Urban History 47, no. 3: 467–87. https://doi.org/10.1017/S0963926820000188.
Lewis, Patrick, Perez, Ethan, Piktus, Aleksandra, Petroni, Fabio, Karpukhin, Vladimir, Goyal, Naman, Küttler, Heinrich, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20. Vancouver, BC: Curran Associates Inc.
Li, Haoyang, Zhang, Jing, Liu, Hanbing, Fan, Ju, Zhang, Xiaokang, Zhu, Jun, Wei, Renjie, Pan, Hongyan, Li, Cuiping, and Chen, Hong. 2024. “CodeS: Towards Building Open-Source Language Models for Text-to-SQL.” Accessed March 26, 2024. http://arxiv.org/abs/2402.16347.
Majumder, Bodhisattwa Prasad, Surana, Harshit, Agarwal, Dhruv, Hazra, Sanchaita, Sabharwal, Ashish, and Clark, Peter. 2024. “Data-Driven Discovery with Large Generative Models.” Accessed March 4, 2024. http://arxiv.org/abs/2402.13610.
OpenAI, Achiam, Josh, Adler, Steven, Agarwal, Sandhini, Ahmad, Lama, Akkaya, Ilge, Aleman, Florencia Leoni, et al. 2024. “GPT-4 Technical Report.” https://arxiv.org/abs/2303.08774.
OpenStreetMap Contributors. 2017. “Planet Dump.” Retrieved from https://planet.osm.org. https://www.openstreetmap.org.
Pan, Haojie, Zhai, Zepeng, Yuan, Hao, Lv, Yaojia, Fu, Ruiji, Liu, Ming, Wang, Zhongyuan, and Qin, Bing. 2024. “KwaiAgents: Generalized Information-Seeking Agent System with Large Language Models.” Accessed August 26, 2024. http://arxiv.org/abs/2312.04889.
Pavanello, Italo. 1981. I catasti storici di Venezia, 1808–1913. Volume 2 of Materiali di storia urbana. Venezia: Officina.
Shigarov, Alexey O., and Mikhailov, Andrey A. 2017. “Rule-Based Spreadsheet Data Transformation from Arbitrary to Relational Tables.” Information Systems 71: 123–36. https://doi.org/10.1016/j.is.2017.08.004.
Wang, Lei, Xu, Wanyu, Lan, Yihuai, Hu, Zhiqiang, Lan, Yunshi, Lee, Roy Ka-Wei, and Lim, Ee-Peng. 2023. “Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models.” In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Rogers, Anna, Boyd-Graber, Jordan, and Okazaki, Naoaki, 2609–34. Toronto, ON: Association for Computational Linguistics. https://aclanthology.org/2023.acl-long.147.
Xie, Tianbao, Zhou, Fan, Cheng, Zhoujun, Shi, Peng, Weng, Luoxuan, Liu, Yitao, Hua, Toh Jing, Zhao, Junning, Liu, Qian, Liu, Che, Liu, Leo Z., Xu, Yiheng, Su, Hongjin, Shin, Dongchan, Xiong, Caiming, and Yu, Tao. 2023. “OpenAgents: An Open Platform for Language Agents in the Wild.” Accessed June 26, 2024. http://arxiv.org/abs/2310.10634.
Yang, Jingfeng, Gupta, Aditya, Upadhyay, Shyam, He, Luheng, Goel, Rahul, and Paul, Shachi. 2022. “TableFormer: Robust Transformer Modeling for Table-Text Encoding.” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Muresan, Smaranda, Nakov, Preslav, and Villavicencio, Aline, 528–37. Dublin: Association for Computational Linguistics. https://aclanthology.org/2022.acl-long.40/.
Zhang, Hengrui, Zhang, Jiani, Shen, Zhengyuan, Srinivasan, Balasubramaniam, Qin, Xiao, Faloutsos, Christos, Rangwala, Huzefa, and Karypis, George. 2024. “Mixed-Type Tabular Data Synthesis with Score-Based Diffusion in Latent Space.” In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=4Ay23yeuz0.