In the expanding landscape of political science research, the integration of advanced artificial intelligence (AI) tools has opened novel avenues for data collection, annotation, and analysis. Among these tools, large language models (LLMs), such as OpenAI’s Generative Pre-trained Transformer (GPT), have garnered attention for their potential to enhance research productivity and expand empirical research capabilities (Ziems et al. Reference Ziems, Held, Shaikh, Chen, Zhang and Yang2024).Footnote 1 This study specifically examined the use of GPT for information extraction from unstructured text—an essential task that involves retrieving explicitly stated details that may be challenging to access manually. Unlike broader applications—such as generating text labels for classification (Chiu, Collins, and Alexander Reference Chiu, Collins and Alexander2022; Wang Reference Wang2023), simulating survey responses (Argyle et al. Reference Argyle, Busby, Fulda, Gubler, Rytting and Wingate2023b), generating stimuli for survey experiments (Velez and Liu Reference Velez and Liu2024), and engaging in conversations with humans (Argyle et al. Reference Argyle, Bail, Busby and Wingate2023a)—information extraction focuses on accurately identifying and retrieving explicit content within documents. Although GPT shows promise in various tasks, this study highlights its particular effectiveness in information extraction.
Our study is divided into detailed examinations of the utility of GPT for various data-collection tasks. In these examples, GPT’s applications demonstrate its versatility in handling increasingly complex information tasks across two languages: English and Italian. In the first example, GPT is used to clean Optical Character Recognition (OCR) errors from scans of historical documents, demonstrating its basic ability to process textual data. In the more complex applications described in the second and third examples, GPT helps to extract participant information from semi-structured administrative-meeting-minutes data and detailed source information from lengthy news articles. In the fourth example, we show GPT’s ability to perform an advanced task of synthesizing data from multiple Internet sources.
Each of these four applications demonstrates how GPT performs labor-intensive tasks not only with remarkable speed but also with accuracy that either matches or exceeds human efforts. Furthermore, the use of GPT in these contexts highlights its potential to manage large volumes of data—a capability that is particularly useful in political science, where researchers often are faced with extensive but only partially structured datasets. The examples presented in this article highlight GPT’s strengths in natural-language processing while mitigating its weaknesses in complex reasoning and “hallucination” (i.e., false information) (Ji et al. Reference Ji, Lee, Frieske, Yu, Su, Xu, Ishii, Bang, Madotto and Fung2023; Wei et al. Reference Wei, Wang, Schuurmans, Bosma, Xia, Chi, Le and Zhou2022), as well as concerns about the reliability and consistency of synthetic survey data produced by LLMs (Bisbee et al. Reference Bisbee, Clinton, Dorff, Kenkel and Larson2024).
By presenting a range of unique examples, this article expands thinking in the discipline about the potential uses of LLMs rather than providing a specific how-to guide. We discuss the importance of creatively engineering prompts tailored to different tasks, illustrating that the first prompt may not always suffice and that careful refinement is crucial for optimal results. Through this approach, we hope to inspire further exploration and creative problem-solving using LLMs in political science research.
GPT’s potential to reduce the gap in unequal research resources is another significant benefit of its inclusion in the political science toolbox. Traditionally, large-scale research projects often have been the purview of well-funded researchers who can afford large teams of research assistants (RAs) and expensive data-processing tools. However, GPT’s ability to automate and streamline data extraction and analysis tasks could level the playing field, allowing researchers with limited budgets to undertake more extensive research efforts. At the same time, the use of LLMs in research raises ethical concerns, including the potential loss of jobs for student RAs, privacy risks, social bias in output, and significant environmental impacts. These concerns are discussed in detail in a later section.
APPLICATIONS
This section presents four examples in which LLMs streamline traditionally labor-intensive tasks and enable innovative approaches to data collection and analysis in political science.
Example 1: Cleaning and Analyzing Historical Data
This section explores the use of GPT in conjunction with OCR tools to clean and analyze historical documents. Although OCR technology has advanced, the quality of output nevertheless depends on the quality of the scanned image and the choice of OCR tool, which often results in errors (e.g., misspellings and odd spacing). High-quality OCR tools such as Google Cloud Vision (GCV) produce cleaner text but often are impractical due to issues such as document accessibility and other resource constraints. To address these challenges, we used the GPT-4-1106-preview model to clean text produced by the open-source OCR tool, Tesseract.
We drew on previously untapped archival materials concerning World War II–era race-related incidents and racial reform from the National Archives in College Park, Maryland. These materials, consisting of five boxes, contain the weekly intelligence reports of the Army Service Forces from August 1944 to January 1946. The reports provide a comprehensive description of race-related incidents involving military personnel, as well as the preventive or reactive measures taken to mitigate racial strife (see the sample image in figure 1). The records contain important details about these incidents, such as the date and location, the people involved, and the actions taken by key players. Unfortunately, available OCR tools show varying levels of accuracy (see online appendix A, table A1).

Figure 1 Example of a Scanned Image from a Weekly Intelligence Report
This study proposes a time-saving approach that combines open-source tools (i.e., Tesseract) with GPT. We took the noisy text generated by Tesseract and used the GPT Application Programming Interface (API) to clean the noise, a process illustrated in table 1. We then visualized the performance of this method relative to GCV-processed text for an entire box, consisting of 20 folders (997 images).Footnote 2 We used the GCV-processed text as gold-standard data because of its superior quality once images were obtained and preprocessed for accurate character recognition, and because generating human-typed gold-standard data for large archival collections is impractical. We measured the performance of the Tesseract–GPT combination using Character Error Rate (CER), a common metric used to evaluate OCR performance. CER is defined as the ratio of the number of character-level errors to the total number of characters in the reference text. We also used GPT to extract critical details from each incident, including the location, main actors involved, and targets. Finally, we extracted a 10% sample of the cases and manually verified the accuracy of the information extracted by GPT.
Table 1 OCR Results Using Tesseract and GPT

Note: Errors are highlighted in bold text.
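For readers who want a concrete starting point, the following R sketch shows how a page of noisy Tesseract output might be sent to the chat completions API for cleaning. It is a minimal illustration: the clean_ocr_page() helper and the prompt wording are simplified stand-ins, not the exact code used in this study.

library(httr)
library(jsonlite)

# Hypothetical helper: send one page of noisy Tesseract output to the
# chat completions endpoint and return the cleaned text.
clean_ocr_page <- function(noisy_text, api_key) {
  body <- list(
    model = "gpt-4-1106-preview",
    temperature = 0,  # keep the cleaning task as deterministic as possible
    messages = list(
      list(role = "system",
           content = "Correct OCR errors (misspellings, odd spacing) in the text. Do not add, remove, or summarize content."),
      list(role = "user", content = noisy_text)
    )
  )
  resp <- POST(
    "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    content_type_json(),
    body = toJSON(body, auto_unbox = TRUE)
  )
  content(resp)$choices[[1]]$message$content
}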
The results shown in figure 2 illustrate the effectiveness of GPT in cleaning and analyzing historical data. On average, about 6% of the characters in the OCR-generated text were incorrect, compared to the text generated by GCV, our reference text (figure 2a). Furthermore, the high accuracy rate for capturing relevant information—such as location, main actor, and target—illustrates the general effectiveness of GPT in information extraction, especially when it is related to objective, context-independent information (e.g., location) (figure 2b).Footnote 3

Figure 2 Performance of GPT in Cleaning and Analyzing Archival Data
(a) Average Character Error Rate; (b) Accuracy Rate of Information Summarized
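For reference, the CER reported in figure 2a can be computed in base R from the character-level edit distance; the two strings below are invented solely to illustrate the calculation.

# Character Error Rate: character-level edit distance divided by the
# number of characters in the reference (here, GCV-processed) text.
cer <- function(hypothesis, reference) {
  adist(hypothesis, reference) / nchar(reference)
}
cer("Weekly lntelligence Report", "Weekly Intelligence Report")
# one substituted character out of 26, or approximately 0.038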
Example 2: Extracting Unstructured Administrative Data
This section describes how GPT (specifically, the GPT-4-1106-preview model) can be used to collect and clean administrative data provided in a semi-structured format (often in PDFs). We focused on meeting minutes from federal advisory committees (FACs) within federal agencies in the United States. A significant number of FACs serve as independent advisors that make policy recommendations to federal agencies. These committees hold more than 5,000 public meetings annually, convening committee members, federal agency officials, and outside interest groups to discuss agency policy. As such, FAC meeting minutes provide a unique opportunity for scholars to examine the extent to which outside groups participate in bureaucratic policy making. Figure 3 presents two examples of committee-meeting minutes: from the Environmental Protection Agency (EPA) and the Centers for Disease Control and Prevention (CDC). Each example includes the name, position (e.g., chair, member, agency staff, or public attendee), and affiliation (e.g., Karmanos Cancer Institute) of each meeting participant.

Figure 3 Examples of Federal Advisory Committee Meeting Minutes
(a) EPA Meeting Minutes; (b) CDC Meeting Minutes
Our goal was to extract the participant information from 79 meeting minutes of two EPA FACs—the Clean Air Scientific Advisory Committee and the Science Advisory Boards—from 2017 to 2023. We used the GPT API and R to extract the name, affiliation, and position of each meeting participant from the FAC meeting minutes and generated structured comma-separated values (CSV) data. Table 2 lists the API prompt and R commands that we used. First, the prompt contains a phrase that asks GPT to create a delimited table of three columns. Second, the prompt contains sentences describing the information that GPT should fill in for each column based on the meeting minutes. Third, the prompt asks GPT to clean the participants’ names and remove commas that are not delimiters.
Table 2 GPT Prompt and API Command in R

Whereas GPT easily extracted individuals’ names and affiliations, it often had difficulty extracting participants’ position labels from meeting minutes because the labels were so diverse and broad. For example, “invited speaker” was not included as an example of a participant position in the prompt; as a result, GPT often would label those individuals as something else, such as “registered speaker.” This could have been problematic if accurately identifying individuals’ positions was critical to understanding their roles in FAC meetings. To address this, researchers can include in the prompt the extensive set of position labels that appear in meeting minutes. However, we also found that simply adding “etc.” at the end of a list of example positions quickly solved the problem by giving GPT the latitude to determine which information in the meeting minutes concerned the participants’ positions.
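To illustrate the workflow around a prompt of this shape (the exact prompt and R commands we used appear in table 2), the following sketch assumes the model is asked for a pipe-delimited table and parses the response into a data frame; the delimiter, column names, and sample response are illustrative assumptions.

# Hypothetical post-processing: parse a pipe-delimited table returned by GPT.
parse_participants <- function(gpt_response) {
  read.delim(text = gpt_response, sep = "|", strip.white = TRUE)
}

# Invented response; because the prompt's list of positions ends with "etc.",
# the model may supply labels (e.g., "invited speaker") that we did not list.
sample_response <- "name|affiliation|position
Jane Doe|Karmanos Cancer Institute|member
John Smith|EPA|agency staff"
participants <- parse_participants(sample_response)
write.csv(participants, "meeting_participants.csv", row.names = FALSE)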
Although researchers may be concerned about data fabrication by GPT, we found that it rarely occurs in tasks like this, in which GPT constructed datasets based on given information. After GPT created datasets from the meeting minutes, undergraduate RAs validated each dataset to ensure that all meeting-attendee information was included in terms of names, affiliations, and positions. In our example, GPT failed to extract complete information from four of the 79 meeting minutes because our prompt did not include a complete list of participant positions. In this case, RAs filled in the position information for those participants that GPT was unable to retrieve from the meeting minutes.Footnote 4
The collected data allowed us to examine who attended these FAC meetings (see the list provided in online appendix B). The data showed that a substantial number of interest groups voluntarily participated in FAC meetings and that their participation rate varied over time. This has not been documented by existing studies of FACs that focus primarily on FAC members appointed by agency heads (Feinstein and Hemel Reference Feinstein and Hemel2020).
Our example shows that the data collection and cleaning process for FAC meetings still requires human validation. However, having RAs review the GPT-generated data is much less resource-intensive and time-consuming than hiring RAs to build data based on meeting minutes. If the minutes of a meeting contain 50,000 characters (i.e., five to six pages), it would cost 30 cents to run the GPT code on the transcript.
Example 3: Extracting Primary Sources from News Articles
This section describes our approach to using GPT to extract semi-structured data from the extensive, unstructured text of news articles, focusing on identifying the diverse sources cited by journalists. Newspaper articles typically reference a wide range of sources—from politicians and bureaucrats to private citizens and business owners—which significantly influences the information conveyed to the public. Although we focused on newspapers, our approach could be applied to similar tasks, such as extracting witness information from court records and guest appearances in news transcripts.
Identifying sources was particularly challenging due to the length of the input documents and the nuanced integration of source information within the article text, including variations in name and context. In the initial phases of prompt development, we found that GPT had difficulty aggregating sources that were mentioned by multiple similar names and often failed to extract all sources, especially in longer articles. We suspected that performance degraded as input text length increased, compounding the relatively complex reasoning required to identify and aggregate sources (Wei et al. Reference Wei, Cui, Cheng, Wang, Zhang, Huang and Xie2023). Based on the common errors that we observed, we divided the source-extraction task into subtasks and used a separate prompt for each, with the output of one subtask prompt feeding directly into the next. This made the logic of each subtask explicit, which also made debugging easier.
The details of the method are shown in figure 4. First, we identified all quotes and information attributed to third parties in the news article. Second, we aggregated the quotes and information at the speaker or organization level. Third, we transformed the data into structured JSON (i.e., a format for organizing and managing data in a hierarchical structure) that can be processed with any data tool of choice. The full set of prompts and sample output are provided in online appendix C.

Figure 4 Source-Extraction Process Outline
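A minimal R sketch of this chaining logic appears below. ask_gpt() stands in for a hypothetical wrapper around the API call shown in the Example 1 sketch, article_text is the full text of one news article, and the prompt texts are abbreviated placeholders for the full prompts in online appendix C.

library(jsonlite)

# Step 1: identify all quotes and information attributed to third parties.
quotes <- ask_gpt(paste(
  "List every quote or piece of information attributed to a third party in the article below.",
  article_text))

# Step 2: aggregate the quotes and information at the speaker or organization level,
# merging mentions that refer to the same source under different names.
aggregated <- ask_gpt(paste(
  "Group these statements by the person or organization cited:", quotes))

# Step 3: transform the aggregated sources into structured JSON.
sources_json <- ask_gpt(paste(
  "Return the sources as a JSON array with fields name, title, and organization.",
  "Output only the JSON:", aggregated))
sources <- fromJSON(sources_json)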
To validate our approach, we used the described method to extract 214 sources for 50 articles and employed crowd workers to identify errors in the extracted sources. To ensure worker quality, we included results from only those workers who successfully identified intentional errors that we embedded in the worker task (see online appendix C for details about crowd-worker sourcing and screening).
We identified three types of errors: minor details (i.e., incorrect title, name, or organization); false sources (Type I), in which the extracted source was not cited in the article; and missing sources (Type II), in which a source present in the article was not extracted. We manually reviewed each error identified by the crowd workers and estimated the overall error rates. Our results show that the GPT-based system was highly accurate in extracting source details and rarely made Type I or Type II errors (i.e., all error rates were less than 5%). Figure 5 reports the error rates with 95% confidence intervals. Furthermore, a manual inspection revealed that the majority of errors were edge cases, for which it is difficult to determine with certainty the difference between a source citation and a mere mention of a particular entity (e.g., “President Xi Jinping of China has vowed repeatedly to move ahead with steps in his country to curb climate-altering pollution…”). In particular, when crowd workers noticed that source entities extracted by GPT were not cited in the article (i.e., Type I errors), those entities were nonetheless always mentioned somewhere in the text. In other words, these errors were exclusively mistakes in judging whether a mentioned entity (i.e., Xi Jinping in the example) should be considered a cited source, rather than outright hallucinations of source entities. The remaining true missing-source (i.e., Type II) errors tended to occur in longer articles with six or more sources.

Figure 5 Performance of GPT-Based Source Extraction
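For completeness, error rates and 95% confidence intervals of the kind plotted in figure 5 can be computed in R from a vector of 0/1 error indicators; the error count below is invented for illustration and does not reproduce our results.

# Hypothetical indicators: 1 if crowd workers flagged a Type I error for an
# extracted source, 0 otherwise (4 flagged errors among 214 sources, made up here).
type1_errors <- c(rep(1, 4), rep(0, 210))
ci <- prop.test(sum(type1_errors), length(type1_errors))
ci$estimate  # error rate, about 0.019
ci$conf.int  # 95% confidence interval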
We used this set of prompts to extract 31,431 sources from 5,795 New York Times articles about climate change during the period 2012–2022 using the “GPT-4 Turbo” model. Figure C1 in online appendix C shows the distribution of sources and articles per year. The total cost of the extraction and validation was $1,300.
Example 4: Extracting Elite Biographies from Online Sources
This section describes how we used GPT to extract specific information from an unstructured corpus of sources obtained through systematic Google searches. This exercise reflects a broad category of data-collection tasks for which researchers cannot rely on a specific set of source materials or a corpus of structured text. In these cases, data collection involves searching for sources as well as extracting the relevant information, drawing from various sources including websites, news articles, and academic and expert texts.
We replicated a large human-coded data-collection effort by Montano, Paci, and Superti (Reference Montano, Paci and Superti2024), which examined whether having a daughter influenced the pro-women policies of Italian mayors. The original study reflected a growing interest in political science in the role of elite biographical characteristics (Krcmaric, Nelson, and Roberts Reference Krcmaric, Nelson and Roberts2020). However, this approach faced a significant challenge because systematic biographical data rarely are readily available. As a result, researchers must resort to time-consuming and expensive data collection. The original effort by Montano, Paci, and Superti (Reference Montano, Paci and Superti2024) leveraged systematic Google searches for 1,800 mayors. It was conducted by three RAs from July 2023 to February 2024. For each mayor, the RAs reviewed up to the first 20 available search results for a total of more than 7,300 Italian webpages.Footnote 5 Each link was checked for three pieces of information: whether it contained any information about the mayor’s children, the number of children, and the number of daughters.
We automated this process by scraping the original links and feeding the text into the GPT-4 Turbo API along with a carefully engineered prompt (see table D1 in online appendix D). We developed the prompt through an iterative trial-and-error procedure that sampled random draws from the list of webpages and manually checked the model output. The final prompt included instructions to make informational extraction more efficient, especially in edge cases. For instance, we directed GPT to infer gender from names and to assume that the mayor had at least one child if it was mentioned that he had grandchildren. Furthermore, because each webpage came from search results about a specific mayor, we could develop mayor-specific prompts, specifying their name and municipality.
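The sketch below shows how mayor-specific prompts of this kind can be assembled programmatically in R. The wording and output fields are simplified stand-ins for the full prompt in table D1 of online appendix D, ask_gpt() is again a hypothetical API wrapper, and the mayor’s name is invented.

# Build a mayor-specific prompt from the scraped text of one search-result page.
build_prompt <- function(mayor_name, municipality, page_text) {
  sprintf(paste(
    "You are reading an Italian webpage returned by a Google search about %s,",
    "the mayor of %s. Report, as JSON with fields has_child_info, n_children,",
    "and n_daughters, any information about the mayor's children.",
    "Infer gender from first names; if grandchildren are mentioned, assume the",
    "mayor has at least one child. If the page contains no information about",
    "children, set has_child_info to false.",
    "\n\nPage text:\n%s"),
    mayor_name, municipality, page_text)
}

prompt <- build_prompt("Mario Rossi", "Bologna", page_text)  # page_text: scraped webpage text
result <- jsonlite::fromJSON(ask_gpt(prompt))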
This task tested GPT-4’s ability to parse through ambiguous and heterogeneous data. Most sources (about 90%) did not contain relevant information. The relevant information was encoded in myriad ways and the nuance of textual clues could be misleading. Table 3 presents illustrative examples of GPT-4 output. In three cases, GPT-4 correctly recovered the source information. The fourth case was an example of an error in which the information was encoded in a complex way. The text mentioned the mayor’s “only son” and his two daughters. GPT-4 understood this as the mayor having three children whereas, in truth, the two were the mayor’s son’s daughters and thus the mayor’s granddaughters—not to be counted as his direct offspring.
Table 3 Examples of GPT-4 Information from Google Search Results

Given the same set of search-result links, we estimated the error rate of human coders and of GPT-4. We considered as ground truth all cases in which human coders and GPT-4 agreed. For all disagreements, we adjudicated between the two sets with a third round of human coding, assisted by new RAs. For cases in which all three rounds disagreed—only seven of the total sample—the authors manually coded the ground truth.
Figure 6 illustrates the error rates against the ground truth by the original group of human coders and GPT-4. Across the three main pieces of relevant information, GPT-4 outperformed human coders. Figure D1 in online appendix D sorts the overall error rate into categories of mistakes: Type 1 (false positives), Type 2 (false negatives), and Type M (magnitude).Footnote 6 Compared to human coders, GPT-4 made fewer Type 1 errors and more Type 2 errors. On the one hand, this pattern is reassuring because GPT-4’s output may not require extensive validation given its lower rate of false positives. On the other hand, it also suggests that GPT-4 may omit some information, most likely when it is encoded in an ambiguous or complex way.

Figure 6 Human Coders and GPT-4 Coding Error Rates
We also tested GPT-4’s ability to self-assess and found mixed results. The prompt asked GPT-4 to produce confidence ratings, on a scale of 0 to 100, about the accuracy of its output. The results are shown in Figure D2 in online appendix D. Whenever GPT-4 expressed a confidence rating less than 100, the error rate increased significantly, from 2.8% to 27.3%. However, GPT-4 often expressed overconfidence, giving a rating of 100 to half of the errors found in this exercise. As such, confidence ratings can be considered only as a noisy indicator of potential error.
LIMITATIONS AND BEST PRACTICES
These four applications focus on data collection, cleaning, and extraction tasks that are tedious but common in quantitative political science research. These types of tasks allow for a straightforward application of LLMs while minimizing the potential for reasoning errors and hallucinations. However, despite their straightforward nature, our applications also have limitations. In the context of data cleaning and collection, we therefore highlight important limitations and best practices. These recommendations integrate our experience and findings from the validation exercises in this study along with advice on emerging best practices for LLM use and prompt engineering (Ekin Reference Ekin2023).
First, LLM performance is extremely sensitive to the specific prompt used. The term “prompt engineering” has emerged to describe the process of tailoring the LLM prompt to the task at hand. This task is iterative and potentially idiosyncratic to the specific application. However, general guidelines can improve the process. In our experience, the best-performing prompts include several common components. First, the prompt should describe the task context, including the main objective and the type of input data. In addition, researchers should specify the output format, providing detailed descriptions of each data field. Prompts also may include examples of common information-encoding patterns or even be constructed computationally to incorporate document-specific context. For complex tasks, we encourage researchers to explore multistep prompts, as demonstrated in Example 3, or to ask the model to explain its reasoning before providing data, as recommended by Wei et al. (Reference Wei, Wang, Schuurmans, Bosma, Xia, Chi, Le and Zhou2022).
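As an illustration of these components, the following snippet assembles a prompt from a task description, an output specification, an encoding example, and a request to reason before answering; all wording is hypothetical and intentionally abbreviated.

prompt <- paste(
  # 1. Task context: the main objective and the type of input data
  "You will read the minutes of a federal advisory committee meeting.",
  # 2. Output format, with a description of each data field
  "Return a pipe-delimited table with columns name, affiliation, and position",
  "(e.g., chair, member, agency staff, public attendee, etc.).",
  # 3. An example of a common information-encoding pattern
  "For example, 'Dr. Jane Doe, Karmanos Cancer Institute (Chair)' becomes",
  "'Jane Doe|Karmanos Cancer Institute|chair'.",
  # 4. For complex tasks, ask the model to reason before answering
  "Briefly explain your reasoning, and then provide the table.",
  sep = "\n")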
Second, the context window of LLMs limits the length of both the input text and the output generated by the model. LLM performance also degrades as text length increases, even for documents that fit comfortably within the context window; figure D3 in online appendix D shows that GPT made more errors in identifying mayors’ children as the length of the input text increased. A practical guideline is to limit texts to well under half of the advertised context window, either by selecting portions of the text that contain relevant keywords or by breaking tasks into smaller segments.
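One simple way to follow this guideline is to keep only the portions of a long document that surround relevant keywords before sending it to the model, as in the hedged sketch below; the keyword list and the page_text object are illustrative assumptions drawn from Example 4.

# Keep only paragraphs that mention at least one relevant keyword, trimming
# long documents to well below the context window.
filter_relevant <- function(text, keywords) {
  paragraphs <- unlist(strsplit(text, "\n\n", fixed = TRUE))
  hits <- grepl(paste(keywords, collapse = "|"), paragraphs, ignore.case = TRUE)
  paste(paragraphs[hits], collapse = "\n\n")
}

# "figli" (children), "figlia" (daughter), "figlio" (son)
shortened <- filter_relevant(page_text, c("figli", "figlia", "figlio"))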
Third, GPT occasionally does not follow the task instructions. This behavior can manifest as incomplete responses, incorrect column names, or incorrect data output. Although prompt engineering can mitigate these issues, we found that, in most cases, simply rerunning the same prompt until the output was well formed was sufficient. Similarly, researchers can leverage logical dependencies across data fields to check for response coherence. For instance, in Example 4, we checked that the number of children (of both genders) was greater than or equal to the number of daughters. A related concern is the production of hallucinations, or false information. In our experience with data-collection and cleaning tasks, outright hallucinations have not occurred. Researchers can experiment with the temperature parameter, which controls how much randomness the model introduces when generating output. Lower values reduce the likelihood of hallucinations but increase sensitivity to prompt wording and reduce reasoning ability. Temperature values range from zero to two; we kept it below one in all of our examples.
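The sketch below combines these practices: the same prompt is rerun until the output parses and passes a simple logical check, and the temperature is kept below one. ask_gpt() and the field names are hypothetical and mirror the Example 4 sketch above.

extract_with_retry <- function(prompt, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    raw <- ask_gpt(prompt, temperature = 0.2)  # temperature kept below one
    parsed <- tryCatch(jsonlite::fromJSON(raw), error = function(e) NULL)
    # Coherence check: the number of children must be at least the number of daughters.
    if (!is.null(parsed) &&
        all(c("n_children", "n_daughters") %in% names(parsed)) &&
        parsed$n_children >= parsed$n_daughters) {
      return(parsed)
    }
  }
  NULL  # give up after max_tries malformed or incoherent responses
}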
Fourth, we make a few recommendations to improve the ergonomics of interacting with the GPT API. We recommend allowing the model to record the portions of the texts from which it extracts information. This addition can facilitate validation and shed light on the inner workings of the LLM information processing to aid in debugging. To simplify output data management, we recommend instructing the model to limit output to JSON or CSV/Tab Separated Values format (e.g., “provide only the table and nothing else”).
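For example, the output-format portion of a prompt might read as in the snippet below (the wording is illustrative, and minutes_text stands for the text of one set of meeting minutes), with a source_passage field recording the verbatim text that supports each extracted entry.

output_spec <- paste(
  "Return a JSON array; for each participant include the fields name,",
  "affiliation, position, and source_passage, where source_passage is the",
  "verbatim sentence from the minutes that supports the entry.",
  "Provide only the JSON and nothing else.")

participants <- jsonlite::fromJSON(ask_gpt(paste(output_spec, minutes_text)))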
Fifth, we note the existence of competing LLMs in addition to OpenAI’s GPT. We focused on the GPT family of models in this study because of their ease of use, widespread adoption, and high standard of performance. However, alternative models, such as Google Gemini and Anthropic’s Claude, also may be worth considering.Footnote 7 In particular, open-source models, such as Llama and Mistral, offer significant cost and reproducibility advantages (Spirling Reference Spirling2023). However, they require more-technical setup procedures and may deliver lower overall performance.
ETHICAL CONSIDERATIONS
The use of LLMs raises ethical concerns related to professional, privacy, and environmental issues. Researchers should consider whether the potential costs of these novel tools outweigh the added efficiencies. Similarly, we encourage practitioners to consider strategies that limit or offset any negative downstream consequences of integrating LLMs into the research process.
First, the applications presented in this article outsource tasks traditionally performed by student RAs. Although this improves the cost effectiveness of data collection, it undermines student-employment opportunities. These opportunities provide students with not only financial support but also valuable research experience and insight into academic work, potentially influencing some to pursue graduate studies. The RA experience strengthens students’ résumés and also provides an important pedagogical opportunity for experiential learning. We encourage researchers to continue the practice of hiring promising students as RAs. The use of LLMs does not completely eliminate the need for RAs because validation requires thorough human coding. Outsourcing repetitive data-entry tasks to LLMs can free up time and resources to offer students more rewarding and intellectually stimulating tasks, such as exploratory literature reviews and more complex data management.
Second, LLMs raise potential privacy concerns. Given the rapid development of these models, no clear consensus has emerged on the confidentiality risks associated with input data (Wu, Duan, and Ni Reference Wu, Duan and Ni2024; Yao et al. Reference Yao, Duan, Xu, Cai, Sun and Zhang2024). Therefore, we recommend that researchers exercise caution and avoid using the proposed techniques for any sensitive data.
Third, both research and anecdotal evidence shows that LLMs may exhibit social biases embedded in their training data (Hida, Kaneko, and Okazaki Reference Hida, Kaneko and Okazaki2024). As a result, information-extraction tasks may produce output data that are consistent with the model’s underlying biases, such as relying on stereotypes to decide ambiguous cases. Researchers should evaluate whether their applications may be susceptible to this problem and focus validation efforts on detecting social biases in LLM output.
Fourth, the development and operation of LLMs requires significant energy consumption, which raises environmental concerns (Strubell, Ganesh, and McCallum Reference Strubell, Ganesh and McCallum2020). Researchers should consider limiting their use of LLMs to cases in which efficiency gains are clear and justify an increased environmental footprint. Similarly, for larger projects, researchers should evaluate the benefits of carbon-offsetting strategies.
SUPPLEMENTARY MATERIAL
To view supplementary material for this article, please visit http://doi.org/10.1017/S1049096525000046.
DATA AVAILABILITY STATEMENT
Research documentation and data that support the findings of this study are openly available at the Harvard Dataverse at https://doi.org/10.7910/DVN/7KJLH7.
CONFLICTS OF INTEREST
The authors declare that there are no ethical issues or conflicts of interest in this research.