
Complexity welcome: Pangenome graphs for comprehensive population genomics

Published online by Cambridge University Press:  27 October 2025

Zhigui Bao
Affiliation:
Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
Detlef Weigel*
Affiliation:
Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany; Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
*Corresponding author: Detlef Weigel; Email: weigel@tue.mpg.de

Abstract

Pangenome graphs are revolutionising evolutionary and population genomics by moving beyond linear reference genomes to represent the full spectrum of sequence diversity within and across species. This review traces the field’s progression from reference-augmented graphs to assembly-based, alignment-first approaches that capture complex structural variation with reduced bias. We examine key strategies for graph construction, genotyping and implementing graph-aware tools in functional genomics, including transcriptomics and epigenomics. While much of the work to date has focused on humans, diverse and structurally complex plant genomes pose unique challenges that require further methodological innovation. Key bottlenecks – including visualisation, scalability and integration with multi-omic data – persist. By outlining trade-offs among current tools and emphasising the need for rigorous evaluation frameworks, we argue that progress will depend on community-driven efforts to unify graph construction, genotyping and interpretation. Despite technical hurdles, pangenome graphs offer a powerful foundation for more inclusive evolutionary and population genomics.

Information

Type
Review
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press in association with John Innes Centre

1. Introduction

The first papers describing nearly complete genome sequences were typically entitled ‘The genome of …’. Implicit in these titles was the assumption that much of the genome is shared between different members of a species. Looking back today, it is clear that a great deal has been learned from the study of genes conserved in each species, the proteins they encode, their regulatory elements and so on. Similarly, the commonalities as well as fixed differences between both closely and more distantly related species have greatly informed biology, including our understanding of evolution over different time scales.

As interesting as all this knowledge is, we cannot fully understand biology without considering the genetic differences between individuals. These are not only at the root of adaptation to specific environments but also underlie susceptibility to disease and abiotic factors. Such seemingly deleterious variants are often linked by evolutionary trade-offs, where genes and alleles that are favourable in one environment become a liability in a different environment. The original genome papers for different species, therefore, were often quickly followed by attempts to record interindividual differences at the whole-genome level. Pioneering in this regard were scientists working with the plant Arabidopsis thaliana. The initial genome paper already contained information on shotgun sequences from a second strain in addition to the one from which the genome sequence had been primarily generated (The Arabidopsis Genome Initiative, 2000). Soon thereafter, the first genome-wide polymorphism analyses were published, at increasingly higher resolution, relying on a series of different technologies (Clark et al., 2007; Kim et al., 2007; Nordborg et al., 2002, 2005).
Some of the approaches could in principle interrogate every position in the reference genome, but were limited by the extent of sequence divergence that could be recorded, and in highly divergent or missing regions, the exact sequence remained unknown (Clark et al., 2007). This remained the case when much cheaper short-read sequencing entered the scene, although with increasing lengths of short reads, more and more of the genome became accessible to polymorphism analyses (Ossowski et al., 2008). Importantly, short-read sequencing provided access to sequences that were not present in the original reference genome, and some of these could even be anchored to reference positions (Cao et al., 2011; Gan et al., 2011; Long et al., 2013; Ossowski et al., 2008; Schneeberger et al., 2011).
The picture for other plant species, especially the crops rice and maize, was broadly similar, albeit typically with a delay of a few years (Gao et al., 2019; Li et al., 2014; Zhao et al., 2018).

The use of short reads for the identification of sequence polymorphisms that distinguish the focal individual from the reference strain begins with the mapping of reads against the reference genome sequence. A limitation, therefore, is the reference bias that is caused by the degree of mismatch between a specific short read and its target. Although mismatches may still allow confident mapping, more reads can be mapped with a more closely related genome. This insight led early on to the suggestion of producing synthetic reference genome sequences that represented all possible combinations of polymorphisms across the genome, including those not yet discovered in the genomes analysed (Schneeberger et al., 2009).

A less obvious problem has been that a specific sequence might be present only once in one genome but multiple times in another genome. And even if a sequence is present only once, it might occur in different regions of the genome across individuals. In both cases, short-read mapping can be misleading. For example, heterozygosity may be inferred when in reality there are two closely related but not identical duplicated fragments in the short-read sequenced genome.

Rather than relying on a single reference genome, the concept of the pangenome was introduced to capture all sequence variation within a species. First applied in bacterial genomes (Tettelin et al., 2005), this framework quantified core and dispensable genes across populations, highlighting extensive diversity. As sequencing costs declined, the pangenome approach was extended to eukaryotic genomes, including humans and Arabidopsis thaliana (Cao et al., 2011; Li et al., 2010; Sudmant et al., 2010). Over time, this concept evolved into a broader model encompassing the full genomic landscape of a population, species or clade. A somewhat unfortunate fact is that ‘pangenome’ these days more often refers to the collection of a specific set of genomes rather than the ideal of all reasonably common variants in a population.

2. The evolution of building pangenome graphs

One intuitive way to overcome the limitations of a linear reference genome is through genome graphs, which offer a compact data structure where sequences are represented as node-labelled graphs with edges connecting variants across multiple genomes. Unlike traditional linear representations, genome graphs preserve original coordinates by tracing paths through the sequence graph, accommodating both shared and unique genomic regions. The interpretability of pangenome graphs and their level of detail exist on a two-dimensional spectrum. At one extreme, highly abstract graphs (Figure 1a,b), such as those representing every nucleotide variant with numerous loops and alternative paths, may be difficult to understand and have limited practical use. At the other extreme, unaligned genomic sequences, while easy to interpret, can obscure meaningful genomic differences. Depending on the approach, fixed k-mer-based de Bruijn graphs (Figure 1a,b) and fully aligned multi-individual genomes (Figure 1c–h) represent different points along this continuum (Figure 1i).
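The data structure described above can be illustrated with a minimal Python sketch: a few labelled nodes, the edges between them and one path per genome. The class name SequenceGraph, its methods and the toy sequences are illustrative assumptions and do not reproduce the data model of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy node-labelled sequence graph (hypothetical, for illustration only)."""
    nodes: dict = field(default_factory=dict)   # node id -> DNA label
    edges: set = field(default_factory=set)     # directed (from_id, to_id) links
    paths: dict = field(default_factory=dict)   # genome name -> list of node ids

    def add_path(self, name, walk):
        # Every pair of consecutive nodes in the walk becomes an edge.
        self.edges.update(zip(walk, walk[1:]))
        self.paths[name] = list(walk)

    def sequence(self, name):
        # Spelling the node labels along a path recovers the input genome.
        return "".join(self.nodes[n] for n in self.paths[name])

    def offset(self, name, node_id):
        # Coordinate in the original genome at which the node first occurs.
        pos = 0
        for n in self.paths[name]:
            if n == node_id:
                return pos
            pos += len(self.nodes[n])
        raise KeyError(f"node {node_id} is not on path {name}")


graph = SequenceGraph(nodes={1: "ACG", 2: "T", 3: "C", 4: "GGA", 5: "TTT"})
graph.add_path("genomeA", [1, 2, 4])       # carries T at the variable site
graph.add_path("genomeB", [1, 3, 4])       # carries C at the variable site
graph.add_path("genomeC", [1, 2, 4, 5])    # carries an extra terminal segment

# Nodes visited by every path are shared; all others are variable.
shared = set.intersection(*(set(walk) for walk in graph.paths.values()))
```

In this toy example, nodes 1 and 4 are shared by all three genomes, nodes 2 and 3 form a small bubble, and the position of node 5 in genomeC can be looked up with graph.offset("genomeC", 5), which shows how paths retain the original coordinates.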

Figure 1. Multiple approaches to building pangenome graphs. (a) A graph that has only four nodes, corresponding to the four DNA bases, with all possible connections between the nodes. (b) A k-mer graph based on short sequences (here, triplets). (c) and (d) The same sequences are combined in different representations, which highlights the equivalency of a multiple genome alignment and a genome graph. (e) The same sequences shown in VCF format in a symbolic manner. (f), (g) and (h) represent the entire chromosome 1 from five A. thaliana individuals. (f) A Bifrost 21-mer graph. (g) A Minigraph-Cactus graph. (h) A PGGB graph. (i) Summary of the trade-offs in node size and complexity for different types of graphs. Note that even with the same method, parameter choice will result in different graphs from identical sequences.

2.1. Variation-first: Augmenting the reference

Early pangenome graph construction was constrained by practical realities: high-quality genome assemblies were expensive and rare, while catalogues of variation from resequencing projects were abundant. This imbalance led to reference-augmented approaches that embedded known variants into linear reference backbones. These methods typically integrate variant genotyping within the same workflow, though here we focus on the graph construction aspect.

Schneeberger et al. (2009) pioneered this concept with GenomeMapper, demonstrating that incorporating known polymorphisms could reduce mapping bias in Arabidopsis thaliana. The approach was later extended to the major histocompatibility complex region in humans (Dilthey et al., 2015), where high diversity makes linear references particularly inadequate. These early successes established the fundamental principle that graph representations could capture variation more faithfully than linear sequences. The approach was further generalised by several groups to the whole genomes of thousands of human individuals, with each group using distinct strategies (Eggertsson et al., 2017; Garrison et al., 2018; Rakocevic et al., 2019). GraphTyper (Eggertsson et al., 2017) iteratively realigned, clipped and unaligned short reads with an embedded genome graph for small variant calling.
The VG toolkit (Variation Graph toolkit; Garrison et al., 2018; Hickey et al., 2020) emerged as the first comprehensive open-source framework for this reference-augmented paradigm. VG constructs variation graphs by threading known variants from VCF files into reference genomes or directly from genome alignments, creating alternative paths that represent different allelic states. It supports complex structural variations, including duplications and inversions, using a bidirectional cyclic graph. The Graph Genome pipeline (Rakocevic et al., 2019) also supports structural variants (SVs) for genotyping with high speed, but it is limited to human genomes and is not openly distributed.
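The idea of threading known variants into a linear reference can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm implemented in VG: the function name build_variation_graph is hypothetical, variants are given as (0-based position, REF, ALT) tuples that are sorted and non-overlapping, and alleles are non-empty, as is the case for indels in VCF, which include an anchor base.

```python
def build_variation_graph(reference, variants):
    """Thread (pos, ref, alt) variants into a linear reference sequence.

    Hypothetical helper: pos is 0-based, variants are sorted, non-overlapping
    and have non-empty alleles. Returns (nodes, edges), where nodes maps
    node id -> sequence and edges is a set of (from_id, to_id) pairs.
    """
    nodes, edges = {}, set()
    tails = []     # nodes that must be linked to the next node created
    cursor = 0     # first reference position not yet covered by a node

    def new_node(seq, parents):
        node_id = len(nodes) + 1
        nodes[node_id] = seq
        edges.update((p, node_id) for p in parents)
        return node_id

    for pos, ref, alt in variants:
        if pos < cursor:
            raise ValueError(f"variant at {pos} overlaps the previous one")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        if pos > cursor:                         # shared flank before the variant
            tails = [new_node(reference[cursor:pos], tails)]
        ref_node = new_node(ref, tails)          # reference allele
        alt_node = new_node(alt, tails)          # alternative allele
        tails = [ref_node, alt_node]
        cursor = pos + len(ref)

    if cursor < len(reference):                  # shared flank after the last variant
        new_node(reference[cursor:], tails)
    return nodes, edges


def spell(nodes, walk):
    # Concatenate node labels along a walk through the graph.
    return "".join(nodes[n] for n in walk)


reference = "ACGTACGT"
variants = [(2, "G", "T"), (5, "CG", "C")]       # one SNP, one 1-bp deletion
nodes, edges = build_variation_graph(reference, variants)
```

Each variant site becomes a bubble with one node per allele, so the walk through nodes 1, 2, 4, 5, 7 spells the reference, while the walk through nodes 1, 3, 4, 6, 7 spells the haplotype carrying both alternative alleles. Inversions and duplications, which require bidirectional or cyclic edges, are deliberately left out of this sketch.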

Due to the flexibility of constructing graphs from precomputed VCFs, VG has become the backbone of many pipelines. By incorporating variants derived from genome assemblies or filtered long-read alignments, VG-based workflows have been successfully applied across diverse species, including humans and crops (Liu et al., 2020; Qin et al., 2021; Sirén et al., 2021; Zhou et al., 2022).

2.2. The alignment-first paradigm: Towards unbiased representation

With decreasing costs and improving quality of long-read sequencing, generating multiple high-quality genome assemblies has become increasingly feasible, shifting the bottleneck from data generation to comparative analysis. This shift enabled a new paradigm in pangenome graph construction – moving from reference-based variant threading to graph building through whole-genome alignments directly, adopting an ‘alignment-first’ approach. Theoretically, this approach can reduce reference bias and better capture complex structural variations, including inversions, duplications and rearrangements that are hard to encode using VCF-based models.

Even before genome graphs were formally introduced, a multiple genome alignment (MGA) already served as an implicit representation of shared and divergent sequence features across assemblies. Multiple sequence alignments (MSAs) naturally lend themselves to representation as partial order alignment (POA) graphs (Lee et al., 2002), which have been extended into A-Bruijn graphs (Raphael et al., 2004) and cactus graphs (Paten et al., 2011) to better accommodate genome rearrangements and duplications. Mauve (Darling et al., 2004) and TBA (Threaded Blockset Aligner) (Blanchette et al., 2004) represent some of the earliest efforts to align genome regions across multiple species. Vaughn et al. (2022) recently used progressiveMauve (Darling et al., 2010) to align melon genomes and convert them into a genome graph for genotyping.

To bridge traditional alignment and graph construction, several intermediate tools have been developed. REVEAL (Recursive Exact-Matching Aligner) (Linthorst et al., 2015) employs a recursive exact-matching strategy to construct alignments, while tools like NovoGraph (Biederstedt et al., 2018) and Seq-seq-pan (Jandrasits et al., 2018) utilise progressive or block-based alignment strategies to scale MGAs to a large number of genomes. ProgressiveCactus (Armstrong et al., 2020; Paten et al., 2011) dramatically improves scalability using a guide-tree-based alignment strategy. Its output can be used as alignment input to the VG toolkit, enabling the inclusion of large duplications and inversions in yeast (Garrison et al., 2018). This approach provided the first workflow for converting an MGA into a graph that can be used both to infer genotype information and to map short reads. SibeliaZ (Minkin & Medvedev, 2020) generalised these ideas based on information from a de Bruijn graph to construct improved MGAs.

The Human Pangenome Reference Consortium (HPRC) (Liao et al., 2023) has greatly advanced the field by releasing an initial pangenome draft from 47 humans, constructed using methods like Minigraph, Minigraph-Cactus and Pangenome Graph Builder (PGGB). Minigraph (Li et al., 2020) extended the minimap2 chaining algorithm to progressively add large SVs (>50 bp) into the graph. Minigraph-Cactus (Hickey et al., 2024) recruits the graph from Minigraph as a backbone. It then adds base-level alignments after clipping sequences that are highly divergent from a chosen reference sequence (‘clipping’ is the technical term for removing the portion of a query sequence that cannot be confidently aligned to the target genome). The details of these graphs will depend on the order in which sequences are added and on the divergence between samples in the collection of genomes (Garrison & Guarracino, 2023), but this strategy simplifies the graph structure and makes the graph suitable for downstream genotyping tasks. Similarly, ACMGA (AnchorWave-Cactus Multiple Genome Alignment) (Zhou et al., 2024) combines Cactus with AnchorWave, which improves the alignment of long repetitive sequences in plant genomes, for the detection of large SVs (Song et al., 2022). Huijse et al. (2023) found that AnchorWave outperformed Minigraph-Cactus in producing alignments in the highly divergent MHC region of human genomes.
PGGB (Garrison et al., 2024) tries to capture all variation in the input sequences by constructing an all-to-all genome alignment with wfmash, rendering it as a graph with seqwish (Garrison & Guarracino, 2023) and GFAffix, and then smoothing it by consensus with smoothxg. While this approach offers a comprehensive representation of variation, the computational demands of all-to-all alignments are substantial. Instead of building a whole-genome graph, PGR-TK (PanGenomic Research Tool Kit) (Chin et al., 2023) rapidly constructs subgraphs of specific regions using data structures designed for long-read assembly (Chin & Khalak, 2019; Li, 2016); it was shown to be very fast in rebuilding the complex variations in MHC haplotypes, though its use demands substantial expertise for parameter tuning and result interpretation.

2.3. Scalable alternatives to whole-genome alignments

Over the past decade, the complexity and scalability challenges of constructing and querying large genome graphs have become increasingly apparent. As a result, researchers have explored pangenomes using analyses based on specific sequence blocks – such as orthologous genes or k-mers – rather than base-resolution DNA sequences. Different strategies have been developed to make pangenome analyses more scalable, each with its own trade-offs. k-mer-based approaches are computationally efficient, making them attractive for large-scale comparisons. However, they sacrifice sequence context and struggle to distinguish between repeats, particularly in complex eukaryotic genomes. In contrast, gene-based methods are more interpretable and extensible across genomes but depend heavily on good gene annotation. Annotation quality in turn depends on a range of factors, such as the availability of RNA and proteomics data and whether a genome is from a taxon that contains other well-annotated genomes. The good news is that ever more comprehensive sampling, at the level of individuals (tissues and conditions), populations, species and higher-order groupings, will undoubtedly improve gene annotations.

In bacterial pangenomics, gene presence–absence matrices generated by orthogroup clustering with OrthoMCL have been the standard (Contreras-Moreira & Vinuesa, 2013; Li et al., 2003; Page et al., 2015). This strategy was subsequently extended by incorporating gene graphs in tools like PPanGGOLiN (Gautreau et al., 2020), which partitions the pangenome, and Panaroo (Tonkin-Hill et al., 2020), which corrects annotation errors. Genome Complexity Browser visualised and quantified variability with orthogroup inference (Manolov et al., 2020). PanPA constructs graphs based on protein sequence alignments (Dabbaghie et al., 2023), and Pangene leverages rapid protein alignments to build gene graphs for eukaryotic genomes, enabling analysis of gene copy number changes and orientation – remarkably, it can build a graph from 100 human haplotypes in under one minute (Li et al., 2024).
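A gene presence–absence matrix and the resulting split into core and dispensable genes can be illustrated with a short Python sketch. The function name classify_orthogroups, the strain names and the orthogroup identifiers are made up for illustration, and the sketch assumes that orthogroup clustering has already been carried out by an external tool.

```python
def classify_orthogroups(presence):
    """Split orthogroups into core and dispensable sets.

    presence: dict mapping genome name -> set of orthogroup ids found in it.
    An orthogroup is called core here only if it occurs in every genome;
    real analyses often relax this threshold.
    """
    genomes = sorted(presence)
    orthogroups = sorted(set().union(*presence.values()))
    # Presence-absence matrix: one row per orthogroup, one column per genome.
    matrix = {
        og: [int(og in presence[g]) for g in genomes]
        for og in orthogroups
    }
    core = {og for og, row in matrix.items() if all(row)}
    dispensable = set(orthogroups) - core
    return genomes, matrix, core, dispensable


presence = {
    "strain1": {"OG1", "OG2", "OG3"},
    "strain2": {"OG1", "OG3"},
    "strain3": {"OG1", "OG3", "OG4"},
}
genomes, matrix, core, dispensable = classify_orthogroups(presence)
```

Here OG1 and OG3 are present in all three strains and are therefore core, whereas OG2 and OG4 are dispensable. The outcome depends entirely on the quality of the underlying annotation and clustering, which is the dependency discussed above.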

Although implicit graphs constructed from fixed k-mers provide a valuable snapshot of genomic diversity, their resolution is inherently limited, and other tools have taken different routes. PanTools (Sheikhizadeh et al., 2016) detects homology groups with k-mers and builds a database for pan-proteome query, while PanKmer (Aylward et al., 2023) and Panagram (Benoit et al., 2025) decompose assembled genomes into a k-mer database with further ability to locate specific positions in assemblies. Furthermore, methods like Bifrost (Holley & Melsted, 2020) and mdBG (Ekim et al., 2021) efficiently construct de Bruijn graphs for storage and rapid querying; they can be applied to genotype variable tandem repeats with short reads (Lu et al., 2021), though they fall short in accurately representing complete loci for downstream analyses (Andreace et al., 2023).
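Why fixed k-mers limit resolution can be seen in a minimal Python sketch of a k-mer graph, in which nodes are k-mers and edges link k-mers that follow one another in an input sequence. The function name kmer_graph is hypothetical, and the sketch ignores reverse complements, graph compaction and the memory-efficient indexing that real tools rely on.

```python
from collections import defaultdict


def kmer_graph(sequences, k):
    """Build a toy k-mer graph: node = k-mer, edge = consecutive k-mers."""
    successors = defaultdict(set)
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            successors[kmer]              # register the node even without edges
        for left, right in zip(kmers, kmers[1:]):
            successors[left].add(right)   # the two k-mers overlap by k - 1 bases
    return dict(successors)


# Two haplotypes that differ by a single base form a bubble in the graph.
bubble = kmer_graph(["ACGTGGA", "ACGCGGA"], k=3)

# A tandem repeat longer than k collapses into a cycle of two nodes,
# so the number of repeat copies can no longer be read from the graph.
repeat = kmer_graph(["ATATAT"], k=3)
```

In the first example, the node ACG has two successors, one per allele, and the two branches rejoin at GGA. In the second, six bases collapse into the nodes ATA and TAT pointing at each other, which illustrates the loss of sequence context in repeats.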

3. Variant calling in the graph era

Once a pangenome graph is constructed, it can serve as an enhanced reference for genotyping resequenced samples – either by aligning reads or matching k-mers – capturing a broader range of sequence variation than linear references. While many current tools rely on read mapping or k-mer comparison to identify SNPs and SVs, some have advanced to support haplotype reconstruction and the detection of novel variants – capabilities that are particularly effective with long-read resequencing.

Among these, one of the most widely adopted and versatile tools is the Variation Graph Toolkit (VG), which provides a comprehensive framework for mapping, small variant calling, and SV genotyping. VG has become popular since its first open-source release (Garrison et al., 2018; Hickey et al., 2020). It also reduces reference bias in ancient samples (Martiniano et al., 2020). Another VG module, Giraffe (Sirén et al., 2021), was developed as a successor of VG map to accelerate the process for large-scale genotyping. PHG (Practical Haplotype Graph) utilises established tools for mapping against linear references (e.g., GATK) for genotyping in the offspring of crops (Bradbury et al., 2022). DRAGEN (Dynamic Read Analysis for GENomics) (Behera et al., 2024) is currently the fastest for mapping and genotyping against pangenome references, exploiting hardware acceleration and tricks from machine learning, but it requires a commercial licence.
Apart from directly mapping to a graph, mapping reads to multiple references first and then injecting them into a graph based on mapping coordinates is another direction; examples are Gfa2bin (Vorbrugg et al., 2024) and cosigt (Bolognini et al., 2024), which use node coverage across multiple references for genotyping after mapping with bwa (Li, 2013). Such approaches benefit from the maturity of linear reference mapping and the compatibility of their outputs with downstream graph-based analyses. Mapping long reads directly to genome graphs has become increasingly viable. GraphAligner (Rautiainen & Marschall, 2020) was the first tool to achieve long-read mapping to a graph with a seed-and-extend strategy, with much higher speed than VG. Minigraph (Li et al., 2020) can find approximate mapping locations without base-level alignment, while Minichain (Chandra et al., 2024) introduces a recombination penalty for long reads mapping to the graph.

To sidestep the computational cost of full mapping, many tools employ k-mer comparison strategies that match sequencing reads to known variants encoded in the graph. PanGenie (Ebler et al., 2022) and KAGE (Grytten et al., 2022, 2023) compare k-mers from reads to a pangenome graph to reduce run time and mapping bias. Ensemble Variant Genotyper (Du et al., 2024) is a framework designed to standardise the performance of various genotyping tools by accounting for the genomic features specific to plant species. Varigraph (Du et al., 2025) further optimised the k-mer-based approach with memory efficiency and extended the model for dosage estimation in autopolyploid genomes. A drawback is that these tools only genotype the known variations independently and thus cannot reconstruct the haplotypes in the population. To address this gap, Locityper (Prodanov et al., 2025) and cosigt (Bolognini et al., 2024) have been developed to utilise read alignment profiles to locate the closest haplotype in the graph.
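The principle of matching read k-mers to known alleles without mapping can be illustrated with a deliberately naive Python sketch. It is not the statistical model used by PanGenie, KAGE or Varigraph: the function names, the min_count threshold and the toy reads are assumptions, and sequencing errors, reverse complements and coverage modelling are ignored.

```python
from collections import Counter


def kmers(seq, k):
    # All overlapping k-mers of a sequence, in order.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def genotype_by_kmers(reads, alleles, k, min_count=1):
    """Count read k-mers that are unique to each known allele.

    alleles: dict mapping allele name -> allele sequence including flanks.
    Returns (support, called): summed counts of allele-specific k-mers and
    the sorted names of alleles whose support reaches min_count.
    """
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    allele_kmers = {name: set(kmers(seq, k)) for name, seq in alleles.items()}
    support = {}
    for name, own in allele_kmers.items():
        # k-mers shared with any other allele carry no genotype information.
        other = set().union(
            *(kms for n, kms in allele_kmers.items() if n != name)
        )
        support[name] = sum(read_counts[km] for km in own - other)
    called = sorted(n for n, s in support.items() if s >= min_count)
    return support, called


alleles = {"ref": "ACGTGGA", "alt": "ACGCGGA"}    # one SNP with 3-bp flanks
het_reads = ["CGTGG", "ACGCGG", "GCGGA"]          # reads from both alleles
hom_reads = ["ACGCGG"]                            # reads from alt only

het_support, het_call = genotype_by_kmers(het_reads, alleles, k=3)
hom_support, hom_call = genotype_by_kmers(hom_reads, alleles, k=3)
```

Because each site is scored on its own, this kind of sketch shares the drawback noted above: it reports which known alleles are supported at a site but says nothing about how alleles at neighbouring sites are linked into haplotypes.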

Furthermore, SV calling directly from pangenome graphs remains a critical challenge. SVarp (Soylev et al., 2024) tackles this by locally assembling potential SV alleles from long-read data, while PALSS (Denti et al., 2025) augments the graph with the consensus from sample-specific long reads without mapping.

In summary, the field of pangenome graph construction is dynamic, with no single tool dominating; the optimal tool choice depends on the specific research objectives and the desired resolution. Reference-based variation graphs, for instance, facilitate population genetics analyses across extensive cohorts but may omit certain genomic variations. Tools like PGGB offer comprehensive graph representations; however, their complexity can pose challenges for downstream applications such as VG Giraffe alignment, necessitating tailored pruning strategies for effective read mapping. Notably, development and benchmarking efforts have predominantly centred on human genomics. Given that the genomes of non-human species, including plants, are often much more diverse than human genomes, tools for building and using pangenomes need to be evaluated across a broader range of species.

4. Functional pangenomics: Linking variation to mechanism

Reference bias affects not only variant discovery. It also has knock-on effects in downstream functional analyses, including the comparison of chromatin accessibility, gene expression, or DNA methylation (Galli et al., 2025; Igolkina et al., 2025). Compared to the growing adoption of genome graphs for structural variation calling and genotyping, much more still needs to be done to take advantage of graph-based frameworks for functional genomics.

Grytten et al. (2019) implemented Graph Peak Caller to call ChIP-seq peaks on a variation graph in A. thaliana, identifying more than twice as many base pairs absent from the linear reference than had been found with previous methods. DNA methylation studies have revealed analogous benefits – and also underscore the extent of reference bias in functional assays. In cattle, using the wrong reference genome can lead to substantial errors in methylation quantification, with up to ~2% global bias and large numbers of methylated cytosines being affected by breed-specific variation (MacPhillamy et al., 2024). In Arabidopsis thaliana, methylation profiling was even more sensitive to reference choice, with only ~88% of sites being consistent between reference and focal strain. One major reason is that transposable elements, which are prime targets of DNA methylation, have been much more active in this species than, for example, in humans (Igolkina et al., 2025). To address this, methylGrapher (Zhang et al., 2025) introduced the first graph-based approach for mapping bisulfite sequencing data. Compared to traditional methods such as Bismark, it uniquely identified 2.2–2.9 million mCpGs across five human samples, many of which were absent from the reference or had previously been misclassified as unmethylated.

Reference bias also affects RNA-seq analysis. In Arabidopsis thaliana, expression estimates diverged for a subset of genes depending on whether reads were mapped to the reference genome or to the accession’s own genome; these genes were strongly enriched for transposable elements and copy number–variable loci (Igolkina et al., 2025). Similar trends, at an even higher rate, were observed in barley, where mapping transcriptomic reads to a pan-transcriptome built from 20 genotypes improved the mapping rate by around 11% compared to a single linear reference (Guo, Schreiber et al., 2025). VG rpvg (Sibbesen et al., 2023) extends genome graph approaches to RNA-seq analysis by building spliced pangenome graphs and quantifying expression along haplotype-resolved paths. These methods improve accuracy and enable haplotype-specific quantification, even without prior haplotype phasing, but they ideally rely on comprehensive pan-transcriptome annotation, which is absent in most species. Haplotype information in turn is immensely useful in outbred species and, perhaps even more so, in polyploid species with their complex allele ratios (Bao et al., 2022; Bird et al., 2025; Du et al., 2025).

Despite these advances, graph-based approaches to functional genomics remain in their infancy. Few tools have been developed, and most remain proof-of-concept applications limited to model species. Even where tools exist, broader adoption has been slow, partly due to the lack of comprehensive functional annotations and the complexity of graph-aware analytical workflows. Expanding these approaches across multiple omics layers – including methylation, expression, chromatin states and chromatin accessibility – and to diverse species with more complex genomes remains a critical challenge for future research (Figure 2).

Figure 2. Functional pangenomics. (a) A pangenome graph can integrate diverse layers of functional annotations (e.g., genes, transposons, methylation level) in its reference coordinate system and serve as a unified platform for cross-genome comparison. (b) Graph nodes can be used directly for genome-wide association analyses. The colours represent different length-based categories for the node, while node shapes indicate whether the sequence originates from the chosen reference genome. The broken line indicates the nominal significance threshold. This figure was adapted from Vorbrugg et al. (2024).

5. Navigating the tangled graph: visualisation, comparison and scalability

Although there are multiple strategies for graph construction, most approaches now adopt the Graphical Fragment Assembly (GFA) format to store graph information. Unfortunately, querying large-scale pangenomes remains challenging due to the inherent complexity and enormous size of these graphs. For instance, the VG toolkit offers a versatile suite of functions to construct, convert and manipulate genome graphs, but even with VG, extracting information from Gb-scale pangenomes can be nontrivial. To overcome scalability issues, several specialised tools have been developed (Figure 3a). ODGI (Optimised Dynamic Genome/Graph Implementation) (Guarracino et al., 2022) implements scalable algorithms to visualise graphs at multiple resolutions, to extract specific loci and to compare path similarities. Meanwhile, tools such as Gretl (Vorbrugg et al., 2024) are designed to evaluate the quality of multiple graphs by providing a range of quantitative metrics for graph description and comparison. PANCAT (Dubois et al., 2025) characterises differences among variation graphs derived from the same sequence set using edit distance metrics.
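To make the GFA representation concrete, the following minimal sketch parses the three most common GFA 1 record types (S for segments, L for links, P for paths) and derives a few descriptive numbers, including a Jaccard similarity between the node sets of two paths. It is an illustration only: the function names are hypothetical, the metrics are generic examples rather than those implemented in Gretl or ODGI, and the in-memory approach shown here would not scale to Gb-scale graphs.

```python
def parse_gfa(lines):
    """Collect segments, links and paths from tab-separated GFA 1 lines."""
    segments, links, paths = {}, [], {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":      # S <name> <sequence>
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":    # L <from> <orient> <to> <orient> <overlap>
            links.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":    # P <name> <seg+,seg-,...> <overlaps>
            # Drop the trailing orientation sign from every step.
            paths[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    return segments, links, paths


def summarise(segments, links, paths):
    """Return simple descriptive statistics for a parsed graph."""
    node_sets = [set(steps) for steps in paths.values()]
    core = set.intersection(*node_sets) if node_sets else set()
    return {
        "segments": len(segments),
        "links": len(links),
        "paths": len(paths),
        # "*" marks a segment whose sequence is not stored in the file.
        "total_bp": sum(len(seq) for seq in segments.values() if seq != "*"),
        "core_segments": len(core),  # segments visited by every path
    }


def path_jaccard(paths, a, b):
    """Jaccard similarity between the segment sets of two paths."""
    set_a, set_b = set(paths[a]), set(paths[b])
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical graph with a single bubble: two paths differ at one segment.
EXAMPLE_GFA = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tTTA",
    "S\t4\tCCA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tref\t1+,2+,4+\t*",
    "P\talt\t1+,3+,4+\t*",
]

segments, links, paths = parse_gfa(EXAMPLE_GFA)
print(summarise(segments, links, paths))
print(path_jaccard(paths, "ref", "alt"))  # 0.5
```

In this example the two paths share segments 1 and 4 and differ at the bubble, so two of the four segments are core. Dedicated tools compute comparable statistics with succinct data structures rather than Python dictionaries.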

Figure 3. Timeline of pangenome graph algorithms. (a) The differently coloured circles indicate the main functions of tools; some workflows/tools may have multiple usages. For tools described in journals, we use the date of publication, but we note that many colleagues in this area are very generous and often release their tools long before formal publication. Given the rapid development in this area, it is perhaps not surprising that some tools only have a public GitHub repository. In the text, we provide hyperlinks as references. (b) The number of graph-based tools developed for different purposes.

On the visualisation side, early GUI-based tools like Bandage (Wick et al., 2015) and GfaViz (Gonnella et al., 2019) provide whole-graph views of assembly graphs but are limited when it comes to base-level or Gb-scale pangenome graphs. VG view and VG viz can display sequences up to about 100 kb, whereas SequenceTubemap (Beyer et al., 2019) adopts an intuitive visualisation model (inspired by public transport network maps) to display variation graphs along with read mappings at the appropriate scale. MoMI-G (Yokoyama et al., 2019) extends this concept for large-scale SV inspection in human variation graphs, and ODGI viz further expands on the VG viz layout by exporting rasterised images suitable for chromosome-scale genome graphs.

Efforts to integrate graph layouts with functional annotation are also emerging. For example, VRPG, a visualisation and interpretation framework for linear reference–projected pangenome graphs (Miao & Yue, 2025), extracts subgraphs based on reference path coordinates and annotations, while PPanG (Liu, Zhang et al., 2024) adapts the SequenceTubemap framework to display multiple genome annotations through embedded JBrowse2 components in real time. Additionally, Gfaestus (https://github.com/chfi/gfaestus) leverages GPU frameworks to visualise full graphs from projects like HPRC, and waragraph (https://github.com/chfi/waragraph) can integrate annotation information into ODGI layouts interactively.

Compared to graph construction, the visualisation and comparison of pangenome graphs have lagged significantly. While multiple tools exist for assembling and processing variation graphs, there is still no comprehensive, scalable and interactive visualisation framework that can handle large-scale pangenomes efficiently and connect them to functional annotation (Figure 3b). As pangenomes rapidly expand to hundreds of individuals, and may even come to span multiple species, extracting biological knowledge from the complex tangles in graphs will require better tools than those currently available.

6. Conclusion and perspectives

The development of eukaryotic pangenomics has entered a transformative phase. Advances in sequencing technologies and assembly algorithms have made it feasible to generate high-quality genomes at population scale (Antipov et al., 2025; Cheng et al., 2024; Koren et al., 2024). As a result, pangenome references constructed from tens to hundreds of assemblies now exist for a growing number of species, including foundational species such as Arabidopsis thaliana (Kang et al., 2023; Lian et al., 2024; Wlodzimierz et al., 2023), key crops (Cheng et al., 2025; Guo et al., 2025; Hufford et al., 2021; Liu et al., 2020; Lynch et al., 2025; Zhou et al., 2022) and humans (Liao et al., 2023). Their application has shown that the additional variants capture some of the heritability previously missed (Zhou et al., 2022), reveal more associations between variants and agronomic traits (Hufford et al., 2021) and can uncover the complex evolutionary history of well-studied loci (Bolognini et al., 2024).

Nevertheless, capturing the full spectrum of variation across a species in an unbiased and comprehensive way remains a challenge. While tools such as Minigraph-Cactus use iterative construction to simplify the process of graph alignment, they are sensitive to input order and tend to discard sequences that diverge too much from the reference – an issue especially problematic in high-diversity species (Cheng et al., 2025; Garrison & Guarracino, 2023). On the other hand, all-to-all alignment approaches, such as PGGB, provide more complete graphs but require substantial computational resources, making them impractical for datasets involving hundreds of genomes (Lynch et al., 2025). Similarly, genotyping tools face scalability constraints for large graphs: VG Giraffe, for instance, typically downsamples to 64 haplotypes prior to mapping (Sirén et al., 2021).

Progress in these areas depends on the availability of high-quality benchmark datasets for validation. Yet such resources are scarce in non-human species, and even in human genomics, benchmarking is often confined to a few well-characterised individuals (Dwarshuis et al., 2024). This creates systematic bias and limits our ability to assess how well genome graphs capture rare, complex or population-specific variation. Developing robust metrics and comparative frameworks to evaluate graph quality remains a crucial direction for the field.

Moreover, the continued reliance on biallelic SNP models restricts the development of population genetic theory capable of explaining the full complexity of pangenomic variation. SVs, in contrast to SNPs, are generated by a wide range of distinct mutational mechanisms – including non-homologous end joining, non-allelic homologous recombination, template-switching, nested transposon insertions and tandem repeat expansion – that often result in multi-allelic loci rather than simple binary variants (Collins & Talkowski, 2025). In association studies, the assumption of biallelic variation also introduces confounding effects, particularly in the presence of genetic heterogeneity (Liu et al., 2025). Incorporating haplotype-aware models may reveal additional associations that are otherwise missed due to the underlying complexity of SVs and the impact of multi-allelic loci (Smith et al., 2025; Zhou et al., 2022). As such, there is a growing need to develop new population genetic models.
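A toy calculation shows how a biallelic encoding can dilute the signal at a multi-allelic locus. The sketch below uses invented data for eight inbred lines carrying one of three alleles, of which only one (hypothetically named INS_B) changes the phenotype. The effect estimate is a plain difference in group means, chosen for simplicity; it is not the model used in any of the cited studies, and all names and values are assumptions made for illustration.

```python
from statistics import mean


def mean_difference(indicator, phenotype):
    """Mean phenotype of carriers (1) minus mean phenotype of non-carriers (0)."""
    carriers = [p for flag, p in zip(indicator, phenotype) if flag]
    others = [p for flag, p in zip(indicator, phenotype) if not flag]
    return mean(carriers) - mean(others)


def biallelic_encoding(alleles, ref="REF"):
    """Collapse every non-reference allele into a single ALT state."""
    return [int(allele != ref) for allele in alleles]


def allele_encoding(alleles):
    """One indicator vector per distinct allele at the locus."""
    return {
        name: [int(allele == name) for allele in alleles]
        for name in sorted(set(alleles))
    }


# Invented data: only INS_B raises the phenotype.
alleles = ["REF", "REF", "REF", "INS_A", "INS_A", "INS_A", "INS_B", "INS_B"]
phenotype = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0]

collapsed = mean_difference(biallelic_encoding(alleles), phenotype)
per_allele = {
    name: mean_difference(indicator, phenotype)
    for name, indicator in allele_encoding(alleles).items()
}
print(round(collapsed, 2))
print({name: round(effect, 2) for name, effect in per_allele.items()})
```

With these values, the collapsed REF versus non-REF contrast yields a difference of about 0.8, whereas testing INS_B on its own recovers the full difference of 2.0, because the neutral INS_A carriers no longer dilute the causal allele.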

Looking ahead, integrating pangenome graphs with evolutionary models remains a wide-open frontier. Ancient hybridisation, incomplete lineage sorting, and structural rearrangements complicate cross-species alignment, increasing the difficulty of graph construction. Yet global biodiversity sequencing projects (Lewin et al., 2018) are beginning to fill the tree of life with genome assemblies. Embedding phylogenetic history directly into graph construction – rather than treating it as a downstream layer – may help generate more meaningful, interpretable graphs.

In contrast to advances in variant calling, the interpretation, visualisation and benchmarking of pangenome graphs are still in early stages. There is still no equivalent of IGV for intuitive graph exploration, and widely adopted formats for complex variant representation are lacking. While tools like ODGI offer useful summaries and visualisations, they lack interactivity and scalability for Gb-scale graphs. Even more critically, the integration of functional genomics with graph frameworks remains far behind. At present, RNA-seq, methylation, and chromatin accessibility data cannot be seamlessly analysed in graph-aware contexts. Bridging this gap will require unified, scalable methods for aligning and interpreting multi-omic data within graph-based references for genotype-phenotype association.

As we enter the next phase of pangenomic research, the field faces substantial computational and modelling hurdles. Yet the growing ecosystem of graph-based methods offers more than just an ever-expanding toolkit – it provides the foundation for a new paradigm in genomics. By embracing the full complexity of genomic variation, pangenome graphs have the potential to reshape how we conduct association studies, trace evolutionary history and interpret regulatory landscapes. Moving beyond reference bias and linear constraints, these graphs can unify population-scale diversity, functional readouts and comparative signals across the tree of life. Realising this vision will require not only scalable tools and new theoretical frameworks but also sustained community efforts in benchmarking, visualisation and data integration. Still, the promise is profound: to build genomic models that reflect biological reality more accurately, and in doing so, to understand evolution in unprecedented detail.

Open peer review

To view the open peer review materials for this article, please visit http://doi.org/10.1017/qpb.2025.10028.

Acknowledgements

We thank both our colleagues from the 1001 Genomes Project in the Weigel and Nordborg labs and the participants of the International Genome Graph Symposium (IGGSy'24) in Ascona 2024 for inspiration, and especially Andrea Guarracino and Erik Garrison from the University of Tennessee Health Science Center for extensive discussions.

Competing interest

D.W. holds equity in Computomics, which advises plant breeders. D.W. also consults for KWS SE, a globally active plant breeder and seed producer.

Author contributions

Z.B. wrote the draft of the manuscript. Z.B. and D.W. revised the manuscript.

Funding statement

Our work is supported by the International Max Planck Research School (IMPRS) ‘From Molecules to Organisms’ (Z.B.), the Novozymes Prize of the Novo Nordisk Foundation (D.W.), and the Max Planck Society.

Footnotes

Associate Editor: Dr. Luke Dunning

References

Andreace, F., Lechat, P., Dufresne, Y., & Chikhi, R. (2023). Comparing methods for constructing and representing human pangenome graphs. Genome Biology, 24(1), 274.
Antipov, D., Rautiainen, M., Nurk, S., Walenz, B. P., Solar, S. J., Phillippy, A. M., & Koren, S. (2025). Verkko2 integrates proximity-ligation data with long-read De Bruijn graphs for efficient telomere-to-telomere genome assembly, phasing, and scaffolding. Genome Research, 35(7), 1583–1594.
Armstrong, J., Hickey, G., Diekhans, M., Fiddes, I. T., Novak, A. M., Deran, A., Fang, Q., Xie, D., Feng, S., Stiller, J., Genereux, D., Johnson, J., Marinescu, V. D., Alföldi, J., Harris, R. S., Lindblad-Toh, K., Haussler, D., Karlsson, E., Jarvis, E., & Paten, B. (2020). Progressive cactus is a multiple-genome aligner for the thousand-genome era. Nature, 587, 246–251.
Aylward, A. J., Petrus, S., Mamerto, A., Hartwick, N. T., & Michael, T. P. (2023). PanKmer: K-mer-based and reference-free pangenome analysis. Bioinformatics (Oxford, England), 39(10), btad621.
Bao, Z., Li, C., Li, G., Wang, P., Peng, Z., Cheng, L., Li, H., Zhang, Z., Li, Y., Huang, W., Ye, M., Dong, D., Cheng, Z., VanderZaag, P., Jacobsen, E., Bachem, C. W. B., Dong, S., Zhang, C., Huang, S., & Zhou, Q. (2022). Genome architecture and tetrasomic inheritance of autotetraploid potato. Molecular Plant, 15(7), 1211–1226.
Behera, S., Catreux, S., Rossi, M., Truong, S., Huang, Z., Ruehle, M., Visvanath, A., Parnaby, G., Roddey, C., Onuchic, V., Finocchio, A., Cameron, D. L., English, A., Mehtalia, S., Han, J., Mehio, R., & Sedlazeck, F. J. (2024). Comprehensive genome analysis and variant detection at scale using DRAGEN. Nature Biotechnology.
Benoit, M., Jenike, K. M., Satterlee, J. W., Ramakrishnan, S., Gentile, I., Hendelman, A., Passalacqua, M. J., Suresh, H., Shohat, H., Robitaille, G. M., Fitzgerald, B., Alonge, M., Wang, X., Santos, R., He, J., Ou, S., Golan, H., Green, Y., Swartwood, K., & Lippman, Z. B. (2025). Solanum pan-genetics reveals paralogues as contingencies in crop engineering. Nature, 640, 135–145.
Beyer, W., Novak, A. M., Hickey, G., Chan, J., Tan, V., Paten, B., & Zerbino, D. R. (2019). Sequence tube maps: Making graph genomes intuitive to commuters. Bioinformatics (Oxford, England), 35(24), 5318–5320.
Biederstedt, E., Oliver, J. C., Hansen, N. F., Jajoo, A., Dunn, N., Olson, A., Busby, B., & Dilthey, A. T. (2018). NovoGraph: Human genome graph construction from multiple long-read de novo assemblies. F1000Research, 7, 1391.
Bird, K. A., Brock, J. R., Grabowski, P. P., Harder, A. M., Healy, A. L., Shu, S., Barry, K., Boston, L., Daum, C., Guo, J., Lipzen, A., Walstead, R., Grimwood, J., Schmutz, J., Lu, C., Comai, L., McKay, J. K., Pires, J. C., Edger, P. P., & Kliebenstein, D. J. (2025). Allopolyploidy expanded gene content but not pangenomic variation in the hexaploid oilseed Camelina sativa. Genetics, 229(1), 144.
Blanchette, M., Kent, W. J., Riemer, C., Elnitski, L., Smit, A. F. A., Roskin, K. M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E. D., Haussler, D., & Miller, W. (2004). Aligning multiple genomic sequences with the threaded blockset aligner. Genome Research, 14(4), 708–715.
Bolognini, D., Halgren, A., Lou, R. N., Raveane, A., Rocha, J. L., Guarracino, A., Soranzo, N., Chin, C.-S., Garrison, E., & Sudmant, P. H. (2024). Recurrent evolution and selection shape structural diversity at the amylase locus. Nature, 634, 617–625.
Bradbury, P. J., Casstevens, T., Jensen, S. E., Johnson, L. C., Miller, Z. R., Monier, B., Romay, M. C., Song, B., & Buckler, E. S. (2022). The practical haplotype graph, a platform for storing and using pangenomes for imputation. Bioinformatics (Oxford, England), 38(15), 3698–3702.
Cao, J., Schneeberger, K., Ossowski, S., Günther, T., Bender, S., Fitz, J., Koenig, D., Lanz, C., Stegle, O., Lippert, C., Wang, X., Ott, F., Müller, J., Alonso-Blanco, C., Borgwardt, K., Schmid, K. J., & Weigel, D. (2011). Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nature Genetics, 43(10), 956–963.
Chandra, G., Gibney, D., & Jain, C. (2024). Haplotype-aware sequence alignment to pangenome graphs. Genome Research, 34, 1265–1275.
Cheng, H., Asri, M., Lucas, J., Koren, S., & Li, H. (2024). Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph. Nature Methods, 21(6), 967–970.
Cheng, L., Wang, N., Bao, Z., Zhou, Q., Guarracino, A., Yang, Y., Wang, P., Zhang, Z., Tang, D., Zhang, P., Wu, Y., Zhou, Y., Zheng, Y., Hu, Y., Lian, Q., Ma, Z., Lassois, L., Zhang, C., Lucas, W. J., & Huang, S. (2025). Leveraging a phased pangenome for haplotype design of hybrid potato. Nature, 640(8058), 408–417.
Chin, C.-S., Behera, S., Khalak, A., Sedlazeck, F. J., Sudmant, P. H., Wagner, J., & Zook, J. M. (2023). Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nature Methods. https://doi.org/10.1038/s41592-023-01914-y.
Chin, C.-S., & Khalak, A. (2019). Human genome assembly in 100 minutes. bioRxiv. https://doi.org/10.1101/705616.
Clark, R. M., Schweikert, G., Toomajian, C., Ossowski, S., Zeller, G., Shinn, P., Warthmann, N., Hu, T. T., Fu, G., Hinds, D. A., Chen, H., Frazer, K. A., Huson, D. H., Schölkopf, B., Nordborg, M., Rätsch, G., Ecker, J. R., & Weigel, D. (2007). Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836), 338–342.
Collins, R. L., & Talkowski, M. E. (2025). Diversity and consequences of structural variation in the human genome. Nature Reviews. Genetics, 26(7), 443–462.
Contreras-Moreira, B., & Vinuesa, P. (2013). GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Applied and Environmental Microbiology, 79(24), 7696–7701.
Dabbaghie, F., Srikakulam, S. K., Marschall, T., & Kalinina, O. V. (2023). PanPA: Generation and alignment of panproteome graphs. Bioinformatics Advances, 3(1), vbad167.
Darling, A. C. E., Mau, B., Blattner, F. R., & Perna, N. T. (2004). Mauve: Multiple alignment of conserved genomic sequence with rearrangements. Genome Research, 14(7), 1394–1403.
Darling, A. E., Mau, B., & Perna, N. T. (2010). progressiveMauve: Multiple genome alignment with gene gain, loss and rearrangement. PLoS One, 5(6), e11147.
Denti, L., Bonizzoni, P., Brejova, B., Chikhi, R., Krannich, T., Vinar, T., & Hormozdiari, F. (2025). Pangenome graph augmentation from unassembled long reads. bioRxiv. https://doi.org/10.1101/2025.02.07.637057.
Dilthey, A., Cox, C., Iqbal, Z., Nelson, M. R., & McVean, G. (2015). Improved genome inference in the MHC using a population reference graph. Nature Genetics, 47(6), 682–688.
Du, Z.-Z., He, J.-B., & Jiao, W.-B. (2024). A comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes for creating an accurate ensemble pipeline. Genome Biology, 25(1), 91.
Du, Z.-Z., He, J.-B., Xiao, P.-X., Hu, J., Yang, N., & Jiao, W.-B. (2025). Varigraph: An accurate and widely applicable pangenome graph-based variant genotyper for diploid and polyploid genomes. Molecular Plant. https://doi.org/10.1016/j.molp.2025.08.001.
Dubois, S., Zytnicki, M., Lemaitre, C., & Faraut, T. (2025). Pairwise graph edit distance characterizes the impact of the construction method on pangenome graphs. Bioinformatics (Oxford, England), 41(6), btaf291.
Dwarshuis, N., Kalra, D., McDaniel, J., Sanio, P., Alvarez Jerez, P., Jadhav, B., Huang, W. E., Mondal, R., Busby, B., Olson, N. D., Sedlazeck, F. J., Wagner, J., Majidian, S., & Zook, J. M. (2024). The GIAB genomic stratifications resource for human reference genomes. Nature Communications, 15(1), 9029.
Ebler, J., Ebert, P., Clarke, W. E., Rausch, T., Audano, P. A., Houwaart, T., Mao, Y., Korbel, J. O., Eichler, E. E., Zody, M. C., Dilthey, A. T., & Marschall, T. (2022). Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nature Genetics, 54(4), 518–525.
Eggertsson, H. P., Jonsson, H., Kristmundsdottir, S., Hjartarson, E., Kehr, B., Masson, G., Zink, F., Hjorleifsson, K. E., Jonasdottir, A., Jonasdottir, A., Jonsdottir, I., Gudbjartsson, D. F., Melsted, P., Stefansson, K., & Halldorsson, B. V. (2017). Graphtyper enables population-scale genotyping using pangenome graphs. Nature Genetics, 49(11), 1654–1660.
Ekim, B., Berger, B., & Chikhi, R. (2021). Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems, 12(10), 958–968.e6.
Galli, M., Chen, Z., Ghandour, T., Chaudhry, A., Gregory, J., Feng, F., Li, M., Schleif, N., Zhang, X., Dong, Y., Song, G., Walley, J. W., Chuck, G., Whipple, C., Kaeppler, H. F., Huang, S.-S. C., & Gallavotti, A. (2025). Transcription factor binding divergence drives transcriptional and phenotypic variation in maize. Nature Plants, 11(6), 1205–1219.
Gan, X., Stegle, O., Behr, J., Steffen, J. G., Drewe, P., Hildebrand, K. L., Lyngsoe, R., Schultheiss, S. J., Osborne, E. J., Sreedharan, V. T., Kahles, A., Bohnert, R., Jean, G., Derwent, P., Kersey, P., Belfield, E. J., Harberd, N. P., Kemen, E., Toomajian, C., & Mott, R. (2011). Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature, 477(7365), 419–423.
Gao, L., Gonda, I., Sun, H., Ma, Q., Bao, K., Tieman, D. M., Burzynski-Chang, E. A., Fish, T. L., Stromberg, K. A., Sacks, G. L., Thannhauser, T. W., Foolad, M. R., Diez, M. J., Blanca, J., Canizares, J., Xu, Y., van der Knaap, E., Huang, S., Klee, H. J., & Fei, Z. (2019). The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor. Nature Genetics, 51(6), 1044–1051.
Garrison, E., & Guarracino, A. (2023). Unbiased pangenome graphs. Bioinformatics, 39(1), 1419.
Garrison, E., Guarracino, A., Heumos, S., Villani, F., Bao, Z., Tattini, L., Hagmann, J., Vorbrugg, S., Marco-Sola, S., Kubica, C., Ashbrook, D. G., Thorell, K., Rusholme-Pilcher, R. L., Liti, G., Rudbeck, E., Golicz, A. A., Nahnsen, S., Yang, Z., Mwaniki, M. N., & Prins, P. (2024). Building pangenome graphs. Nature Methods, 21(11), 2008–2012.
Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., Jones, W., Garg, S., Markello, C., Lin, M. F., Paten, B., & Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology, December 2017. https://doi.org/10.1038/nbt.4227
Gautreau, G., Bazin, A., Gachet, M., Planel, R., Burlot, L., Dubois, M., Perrin, A., Médigue, C., Calteau, A., Cruveiller, S., Matias, C., Ambroise, C., Rocha, E. P. C., & Vallenet, D. (2020). PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS Computational Biology, 16(3), e1007732.
Gonnella, G., Niehus, N., & Kurtz, S. (2019). GfaViz: Flexible and interactive visualization of GFA sequence graphs. Bioinformatics (Oxford, England), 35(16), 2853–2855.
Grytten, I., Dagestad Rand, K., & Sandve, G. K. (2022). KAGE: Fast alignment-free graph-based genotyping of SNPs and short indels. Genome Biology, 23(1), 209.
Grytten, I., Rand, K. D., Nederbragt, A. J., Storvik, G. O., Glad, I. K., & Sandve, G. K. (2019). Graph peak caller: Calling ChIP-seq peaks on graph-based reference genomes. PLoS Computational Biology, 15(2), e1006731.
Grytten, I., Rand, K. D., & Sandve, G. K. (2023). KAGE 2: Fast and accurate genotyping of structural variation using pangenomes. bioRxiv, 2023.12.23.572333. https://doi.org/10.1101/2023.12.23.572333
Guarracino, A., Heumos, S., Nahnsen, S., Prins, P., & Garrison, E. (2022). ODGI: Understanding pangenome graphs. Bioinformatics (Oxford, England), 38(13), 3319–3326.
Guo, D., Li, Y., Lu, H., Zhao, Y., Kurata, N., Wei, X., Wang, A., Wang, Y., Zhan, Q., Fan, D., Zhou, C., Lu, Y., Tian, Q., Weng, Q., Feng, Q., Huang, T., Zhang, L., Gu, Z., Wang, C., & Han, B. (2025). A pangenome reference of wild and cultivated rice. Nature, 642(8068), 662–671.
Guo, W., Schreiber, M., Marosi, V. B., Bagnaresi, P., Jørgensen, M. E., Braune, K. B., Chalmers, K., Chapman, B., Dang, V., Dockter, C., Fiebig, A., Fincher, G. B., Fricano, A., Fuller, J., Haaning, A., Haberer, G., Himmelbach, A., Jayakodi, M., Jia, Y., & Waugh, R. (2025). A barley pan-transcriptome reveals layers of genotype-dependent transcriptional complexity. Nature Genetics, 57(2), 441–450.
Hickey, G., Heller, D., Monlong, J., Sibbesen, J. A., Sirén, J., Eizenga, J., Dawson, E. T., Garrison, E., Novak, A. M., & Paten, B. (2020). Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biology, 21(1), 35.
Hickey, G., Monlong, J., Ebler, J., Novak, A. M., Eizenga, J. M., Gao, Y., Human Pangenome Reference Consortium, Marschall, T., Li, H., & Paten, B. (2024). Pangenome graph construction from genome alignments with Minigraph-cactus. Nature Biotechnology, 42(4), 663–673.
Holley, G., & Melsted, P. (2020). Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biology, 21(1), 249.
Hufford, M. B., Seetharam, A. S., Woodhouse, M. R., Chougule, K. M., Ou, S., Liu, J., Ricci, W. A., Guo, T., Olson, A., Qiu, Y., Della Coletta, R., Tittes, S., Hudson, A. I., Marand, A. P., Wei, S., Lu, Z., Wang, B., Tello-Ruiz, M. K., Piri, R. D., … Dawe, R. K. (2021). De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science (New York, N.Y.), 373(6555), 655–662.
Huijse, L., Adams, S. M., Burton, J. N., David, J. K., Julian, R. S., Meshulam-Simon, G., Mickalide, H., Tafesse, B. D., Calonga-Solís, V., Wolf, I. R., Morrison, A. J., Augusto, D. G., & Endlich, S. (2023). A pan-MHC reference graph with 246 fully contiguous phased sequences. bioRxiv, 2023.09.01.555813. https://doi.org/10.1101/2023.09.01.555813 Google Scholar
Igolkina, A. A., Vorbrugg, S., Rabanal, F. A., Liu, H.-J., Ashkenazy, H., Kornienko, A. E., Fitz, J., Collenberg, M., Kubica, C., Mollá Morales, A., Jaegle, B., Wrightsman, T., Voloshin, V., Bezlepsky, A. D., Llaca, V., Nizhynska, V., Reichardt, I., Bezrukov, I., Lanz, C., & Nordborg, M. (2025). A comparison of 27 Arabidopsis thaliana genomes and the path toward an unbiased characterization of genetic polymorphism. Nature Genetics, 57, 2289–2301.
Jandrasits, C., Dabrowski, P. W., Fuchs, S., & Renard, B. Y. (2018). Seq-seq-pan: Building a computational pan-genome data structure on whole genome alignment. BMC Genomics, 19(1), 47.
Kang, M., Wu, H., Liu, H., Liu, W., Zhu, M., Han, Y., Liu, W., Chen, C., Song, Y., Tan, L., Yin, K., Zhao, Y., Yan, Z., Lou, S., Zan, Y., & Liu, J. (2023). The pan-genome and local adaptation of Arabidopsis thaliana. Nature Communications, 14(1), 6259.
Kim, S., Plagnol, V., Hu, T. T., Toomajian, C., Clark, R. M., Ossowski, S., Ecker, J. R., Weigel, D., & Nordborg, M. (2007). Recombination and linkage disequilibrium in Arabidopsis thaliana. Nature Genetics, 39(9), 1151–1155.
Koren, S., Bao, Z., Guarracino, A., Ou, S., Goodwin, S., Jenike, K. M., Lucas, J., McNulty, B., Park, J., Rautiainen, M., Rhie, A., Roelofs, D., Schneiders, H., Vrijenhoek, I., Nijbroek, K., Nordesjo, O., Nurk, S., Vella, M., Lawrence, K. R., & Phillippy, A. M. (2024). Gapless assembly of complete human and plant chromosomes using only nanopore sequencing. Genome Research, 34(11), 1919–1930.
Lee, C., Grasso, C., & Sharlow, M. F. (2002). Multiple sequence alignment using partial order graphs. Bioinformatics (Oxford, England), 18(3), 452–464.
Lewin, H. A., Robinson, G. E., Kress, W. J., Baker, W. J., Coddington, J., Crandall, K. A., Durbin, R., Edwards, S. V., Forest, F., Gilbert, M. T. P., Goldstein, M. M., Grigoriev, I. V., Hackett, K. J., Haussler, D., Jarvis, E. D., Johnson, W. E., Patrinos, A., Richards, S., Castilla-Rubio, J. C., & Zhang, G. (2018). Earth BioGenome project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences of the United States of America, 115(17), 4325–4333.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [q-bio.GN]. http://arxiv.org/abs/1303.3997
Li, H. (2016). Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14), 2103–2110.
Li, H., Feng, X., & Chu, C. (2020). The design and construction of reference pangenome graphs with minigraph. Genome Biology, 21(1), 265.
Li, H., Marin, M., & Farhat, M. R. (2024). Exploring gene content with pangene graphs. Bioinformatics (Oxford, England), 40(7), btae456.
Li, L., Stoeckert, C. J., & Roos, D. S. (2003). OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Research. https://doi.org/10.1101/gr.1224503
Li, R., Li, Y., Zheng, H., Luo, R., Zhu, H., Li, Q., Qian, W., Ren, Y., Tian, G., Li, J., Zhou, G., Zhu, X., Wu, H., Qin, J., Jin, X., Li, D., Cao, H., Hu, X., Blanche, H., & Wang, J. (2010). Building the sequence map of the human pan-genome. Nature Biotechnology, 28(1), 57–63.
Li, Y.-H., Zhou, G., Ma, J., Jiang, W., Jin, L.-G., Zhang, Z., Guo, Y., Zhang, J., Sui, Y., Zheng, L., Zhang, S.-S., Zuo, Q., Shi, X.-H., Li, Y.-F., Zhang, W.-K., Hu, Y., Kong, G., Hong, H.-L., Tan, B., & Qiu, L.-J. (2014). De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nature Biotechnology, 32(10), 1045–1052.
Lian, Q., Huettel, B., Walkemeier, B., Mayjonade, B., Lopez-Roques, C., Gil, L., Roux, F., Schneeberger, K., & Mercier, R. (2024). A pan-genome of 69 Arabidopsis thaliana accessions reveals a conserved genome structure throughout the global species range. Nature Genetics, 56(5), 982–991.
Liao, W.-W., Asri, M., Ebler, J., Doerr, D., Haukness, M., Hickey, G., Lu, S., Lucas, J. K., Monlong, J., Abel, H. J., Buonaiuto, S., Chang, X. H., Cheng, H., Chu, J., Colonna, V., Eizenga, J. M., Feng, X., Fischer, C., Fulton, R. S., & Paten, B. (2023). A draft human pangenome reference. Nature, 617(7960), 312–324.
Linthorst, J., Hulsman, M., Holstege, H., & Reinders, M. (2015). Scalable multi whole-genome alignment using recursive exact matching. bioRxiv. https://doi.org/10.1101/022715
Liu, H.-J., Fu, J., Xu, S., & Nordborg, M. (2025). Potential synthetic associations created by epistasis. Genome Biology, 26, 336. https://doi.org/10.1186/s13059-025-03807-z
Liu, M., Zhang, F., Lu, H., Xue, H., Dong, X., Li, Z., Xu, J., Wang, W., & Wei, C. (2024). PPanG: A precision pangenome browser enabling nucleotide-level analysis of genomic variations in individual genomes and their graph-based pangenome. BMC Genomics, 25(1), 405.
Liu, Y., Du, H., Li, P., Shen, Y., Peng, H., Liu, S., Zhou, G.-A., Zhang, H., Liu, Z., Shi, M., Huang, X., Li, Y., Zhang, M., Wang, Z., Zhu, B., Han, B., Liang, C., & Tian, Z. (2020). Pan-genome of wild and cultivated soybeans. Cell, 182(1), 162–176.e13.
Long, Q., Rabanal, F. A., Meng, D., Huber, C. D., Farlow, A., Platzer, A., Zhang, Q., Vilhjálmsson, B. J., Korte, A., Nizhynska, V., Voronin, V., Korte, P., Sedman, L., Mandáková, T., Lysak, M. A., Seren, U., Hellmann, I., & Nordborg, M. (2013). Massive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden. Nature Genetics, 45(8), 884–890.
Lu, T.-Y., Human Genome Structural Variation Consortium, & Chaisson, M. J. P. (2021). Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs. Nature Communications, 12(1), 4250.
Lynch, R. C., Padgitt-Cobb, L. K., Garfinkel, A. R., Knaus, B. J., Hartwick, N. T., Allsing, N., Aylward, A., Bentz, P. C., Carey, S. B., Mamerto, A., Kitony, J. K., Colt, K., Murray, E. R., Duong, T., Chen, H. I., Trippe, A., Harkess, A., Crawford, S., Vining, K., & Michael, T. P. (2025). Domesticated cannabinoid synthases amid a wild mosaic cannabis pangenome. Nature, 1–10.
MacPhillamy, C., Chen, T., Hiendleder, S., Williams, J. L., Alinejad-Rokny, H., & Low, W. Y. (2024). DNA methylation analysis to differentiate reference, breed, and parent-of-origin effects in the bovine pangenome era. GigaScience, 13. https://doi.org/10.1093/gigascience/giae061
Manolov, A., Konanov, D., Fedorov, D., Osmolovsky, I., Vereshchagin, R., & Ilina, E. (2020). Genome complexity browser: Visualization and quantification of genome variability. PLoS Computational Biology, 16(10), e1008222.
Martiniano, R., Garrison, E., Jones, E. R., Manica, A., & Durbin, R. (2020). Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph. Genome Biology, 21(1), 250.
Miao, Z., & Yue, J.-X. (2025). Interactive visualization and interpretation of pangenome graphs by linear reference-based coordinate projection and annotation integration. Genome Research, 35(2), 296–310.
Minkin, I., & Medvedev, P. (2020). Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ. Nature Communications, 11(1), 6327.
Nordborg, M., Borevitz, J. O., Bergelson, J., Berry, C. C., Chory, J., Hagenblad, J., Kreitman, M., Maloof, J. N., Noyes, T., Oefner, P. J., Stahl, E. A., & Weigel, D. (2002). The extent of linkage disequilibrium in Arabidopsis thaliana. Nature Genetics, 30(2), 190–193.
Nordborg, M., Hu, T. T., Ishino, Y., Jhaveri, J., Toomajian, C., Zheng, H., Bakker, E., Calabrese, P., Gladstone, J., Goyal, R., Jakobsson, M., Kim, S., Morozov, Y., Padhukasahasram, B., Plagnol, V., Rosenberg, N. A., Shah, C., Wall, J. D., Wang, J., & Bergelson, J. (2005). The pattern of polymorphism in Arabidopsis thaliana. PLoS Biology, 3(7), e196.
Ossowski, S., Schneeberger, K., Clark, R. M., Lanz, C., Warthmann, N., & Weigel, D. (2008). Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Research, 18(12), 2024–2033.
Page, A. J., Cummins, C. A., Hunt, M., Wong, V. K., Reuter, S., Holden, M. T. G., Fookes, M., Falush, D., Keane, J. A., & Parkhill, J. (2015). Roary: Rapid large-scale prokaryote pan genome analysis. Bioinformatics (Oxford, England), 31(22), 3691–3693.
Paten, B., Earl, D., Nguyen, N., Diekhans, M., Zerbino, D., & Haussler, D. (2011). Cactus: Algorithms for genome multiple sequence alignment. Genome Research, 21(9), 1512–1528.
Prodanov, T., Plender, E. G., Seebohm, G., Meuth, S. G., Eichler, E. E., & Marschall, T. (2025). Locityper enables targeted genotyping of complex polymorphic genes. Nature Genetics.
Qin, P., Lu, H., Du, H., Wang, H., Chen, W., Chen, Z., He, Q., Ou, S., Zhang, H., Li, X., Li, X., Li, Y., Liao, Y., Gao, Q., Tu, B., Yuan, H., Ma, B., Wang, Y., Qian, Y., & Li, S. (2021). Pan-genome analysis of 33 genetically diverse rice accessions reveals hidden genomic variations. Cell, 184(13), 3542–3558.e16.
Rakocevic, G., Semenyuk, V., Lee, W. P., Spencer, J., Browning, J., Johnson, I. J., Arsenijevic, V., Nadj, J., Ghose, K., Suciu, M. C., Ji, S. G., Demir, G., Li, L., Toptaş, B., Dolgoborodov, A., Pollex, B., Spulber, I., Glotova, I., Kómár, P., & Kural, D. (2019). Fast and accurate genomic analyses using genome graphs. Nature Genetics, 51(2), 354–362.
Raphael, B., Zhi, D., Tang, H., & Pevzner, P. (2004). A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Research, 14(11), 2336–2346.
Rautiainen, M., & Marschall, T. (2020). GraphAligner: Rapid and versatile sequence-to-graph alignment. Genome Biology, 21(1), 253.
Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., & Weigel, D. (2009). Simultaneous alignment of short reads against multiple genomes. Genome Biology, 10(9), R98.
Schneeberger, K., Ossowski, S., Ott, F., Klein, J. D., Wang, X., Lanz, C., Smith, L. M., Cao, J., Fitz, J., Warthmann, N., Henz, S. R., Huson, D. H., & Weigel, D. (2011). Reference-guided assembly of four diverse Arabidopsis thaliana genomes. Proceedings of the National Academy of Sciences of the United States of America, 108(25), 10249–10254.
Sheikhizadeh, S., Schranz, M. E., Akdel, M., de Ridder, D., & Smit, S. (2016). PanTools: Representation, storage and exploration of pan-genomic data. Bioinformatics, 32(17), i487–i493.
Sibbesen, J. A., Eizenga, J. M., Novak, A. M., Sirén, J., Chang, X., Garrison, E., & Paten, B. (2023). Haplotype-aware pantranscriptome analyses using spliced pangenome graphs. Nature Methods, 20(2), 239–247.
Sirén, J., Monlong, J., Chang, X., Novak, A. M., Eizenga, J. M., Markello, C., Sibbesen, J. A., Hickey, G., Chang, P.-C., Carroll, A., Gupta, N., Gabriel, S., Blackwell, T. W., Ratan, A., Taylor, K. D., Rich, S. S., Rotter, J. I., Haussler, D., Garrison, E., & Paten, B. (2021). Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science, 374(6574), abg8871.
Smith, C. J., Strausz, S., FinnGen, Spence, J. P., Ollila, H. M., & Pritchard, J. K. (2025). Haplotype analysis reveals pleiotropic disease associations in the HLA region. The American Journal of Human Genetics, 112(8), 1833–1851.
Song, B., Marco-Sola, S., Moreto, M., Johnson, L., Buckler, E. S., & Stitzer, M. C. (2022). AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication. Proceedings of the National Academy of Sciences of the United States of America, 119(1). https://doi.org/10.1073/pnas.2113075119
Soylev, A., Ebler, J., Pani, S., Rausch, T., Korbel, J., & Marschall, T. (2024). SVarp: pangenome-based structural variant discovery. bioRxiv, 2024.02.18.580171. https://doi.org/10.1101/2024.02.18.580171
Sudmant, P. H., Kitzman, J. O., Antonacci, F., Alkan, C., Malig, M., Tsalenko, A., Sampas, N., Bruhn, L., Shendure, J., 1000 Genomes Project, & Eichler, E. E. (2010). Diversity of human copy number variation and multicopy genes. Science (New York, N.Y.), 330(6004), 641–646.
Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., Angiuoli, S. V., Crabtree, J., Jones, A. L., Durkin, A. S., DeBoy, R. T., Davidsen, T. M., Mora, M., Scarselli, M., Margarit y Ros, I., Peterson, J. D., Hauser, C. R., Sundaram, J. P., Nelson, W. C., & Fraser, C. M. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial “pan-genome”. Proceedings of the National Academy of Sciences, 102(39), 13950–13955.
The Arabidopsis Genome Initiative. (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408(6814), 796–815.
Tonkin-Hill, G., MacAlasdair, N., Ruis, C., Weimann, A., Horesh, G., Lees, J. A., Gladstone, R. A., Lo, S., Beaudoin, C., Floto, R. A., Frost, S. D. W., Corander, J., Bentley, S. D., & Parkhill, J. (2020). Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biology, 21(1), 180.
Vaughn, J. N., Branham, S. E., Abernathy, B., Hulse-Kemp, A. M., Rivers, A. R., Levi, A., & Wechter, W. P. (2022). Graph-based pangenomics maximizes genotyping density and reveals structural impacts on fungal resistance in melon. Nature Communications, 13(1), 7897.
Vorbrugg, S., Bezrukov, I., Bao, Z., & Weigel, D. (2024). Gretl-variation GRaph evaluation TooLkit. Bioinformatics (Oxford, England), 41(1), btae755.
Vorbrugg, S., Bezrukov, I., Bao, Z., Xian, W., & Weigel, D. (2024). Gfa2bin enables graph-based GWAS by converting genome graphs to pan-genomic genotypes. bioRxiv. https://doi.org/10.1101/2024.12.05.626966
Wick, R. R., Schultz, M. B., Zobel, J., & Holt, K. E. (2015). Bandage: Interactive visualization of de novo genome assemblies. Bioinformatics (Oxford, England), 31(20), 3350–3352.
Wlodzimierz, P., Rabanal, F. A., Burns, R., Naish, M., Primetis, E., Scott, A., Mandáková, T., Gorringe, N., Tock, A. J., Holland, D., Fritschi, K., Habring, A., Lanz, C., Patel, C., Schlegel, T., Collenberg, M., Mielke, M., Nordborg, M., Roux, F., & Henderson, I. R. (2023). Cycles of satellite and transposon evolution in Arabidopsis centromeres. Nature, 618(7965), 557–565.
Yokoyama, T. T., Sakamoto, Y., Seki, M., Suzuki, Y., & Kasahara, M. (2019). MoMI-G: Modular multi-scale integrated genome graph browser. BMC Bioinformatics, 20(1), 548.
Zhang, W., Macias-Velasco, J. F., Zhuo, X., Belter, E. A. Jr., Tomlinson, C., Garza, J., Tekkey, N., Li, D., & Wang, T. (2025). methylGrapher: Genome-graph-based processing of DNA methylation data from whole genome bisulfite sequencing. Nucleic Acids Research, 53(3), gkaf028.
Zhao, Q., Feng, Q., Lu, H., Li, Y., Wang, A., Tian, Q., Zhan, Q., Lu, Y., Zhang, L., Huang, T., Wang, Y., Fan, D., Zhao, Y., Wang, Z., Zhou, C., Chen, J., Zhu, C., Li, W., Weng, Q., & Huang, X. (2018). Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice. Nature Genetics, 50(2), 278–284.
Zhou, H., Su, X., & Song, B. (2024). ACMGA: A reference-free multiple-genome alignment pipeline for plant species. BMC Genomics, 25(1), 515.
Zhou, Y., Zhang, Z., Bao, Z., Li, H., Lyu, Y., Zan, Y., Wu, Y., Cheng, L., Fang, Y., Wu, K., Zhang, J., Lyu, H., Lin, T., Gao, Q., Saha, S., Mueller, L., Fei, Z., Städler, T., Xu, S., & Huang, S. (2022). Graph pangenome captures missing heritability and empowers tomato breeding. Nature, 606(7914), 527–534.

Figure 1. Multiple approaches to building pangenome graphs. (a) A graph that has only four nodes, corresponding to the four DNA bases, with all possible connections between the nodes. (b) A k-mer graph based on short sequences (here, triplets). (c) and (d) The same sequences are combined in different representations, which highlights the equivalence of a multiple genome alignment and a genome graph. (e) The same sequences shown symbolically in VCF format. (f), (g) and (h) represent the entire chromosome 1 from five A. thaliana individuals. (f) A Bifrost 21-mer graph. (g) A Minigraph-Cactus graph. (h) A PGGB graph. (i) Summary of the trade-offs in node size and complexity for different types of graphs. Note that even with the same method, different parameter choices will produce different graphs from identical sequences.
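
To make the k-mer graph of panel (b) concrete, the following is a minimal sketch, not taken from the article, that builds a triplet graph from two toy sequences. The function name kmer_graph and the sequences are hypothetical, and production tools such as Bifrost additionally compact and colour the graph, which is not attempted here.

```python
from collections import defaultdict


def kmer_graph(sequences, k=3):
    """Map each k-mer to the sorted list of k-mers that directly follow it."""
    edges = defaultdict(set)
    for seq in sequences:
        # All overlapping windows of length k, in order of appearance.
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            edges[kmer]  # register the node even if it has no successor
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return {node: sorted(successors) for node, successors in edges.items()}


# Two hypothetical haplotypes that differ by a single substitution (T vs. C).
toy_sequences = ["GATTACA", "GATCACA"]
graph = kmer_graph(toy_sequences, k=3)

for node, successors in graph.items():
    print(node, "->", successors)
```

In this toy case the single substitution opens a bubble: the shared node GAT has two successors, and both branches rejoin at ACA.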


Figure 2. Functional pangenomics. (a) A pangenome graph can integrate diverse layers of functional annotations (e.g., genes, transposons, methylation level) in its reference coordinate system and serve as a unified platform for cross-genome comparison. (b) Graph nodes can be used directly for genome-wide association analyses. The colors represent different length-based categories for the nodes, while node shapes indicate whether the sequence originates from the chosen reference genome. The broken line indicates the nominal significance threshold. This figure was adapted from Vorbrugg, Bezrukov, Bao, Xian, et al. (2024).
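
As an illustration of how graph nodes might serve as genotypes in panel (b), the sketch below turns per-sample walks through a graph into a node-by-sample presence/absence matrix, which could then be handed to standard association software. It is an assumption-laden toy: the function node_presence_matrix, the sample names and the node identifiers are hypothetical, and it is not the implementation used by gfa2bin.

```python
def node_presence_matrix(paths):
    """Build a node-by-sample 0/1 matrix from per-sample node walks."""
    samples = sorted(paths)
    # Set of nodes visited by each sample, for fast membership tests.
    visited = {sample: set(paths[sample]) for sample in samples}
    nodes = sorted(set().union(*visited.values()))
    matrix = [[int(node in visited[sample]) for sample in samples] for node in nodes]
    return nodes, samples, matrix


# Hypothetical walks: node 2 and node 3 are the two alleles of one bubble.
toy_paths = {
    "ref": [1, 2, 4],
    "acc1": [1, 3, 4],
    "acc2": [1, 2, 4],
}
nodes, samples, matrix = node_presence_matrix(toy_paths)

for node, row in zip(nodes, matrix):
    print(node, row)
```

Nodes present in every sample (here 1 and 4) carry no association signal and would normally be filtered out before testing.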


Figure 3. Timeline of pangenome graph algorithms. (a) The differently coloured circles indicate the main functions of tools; some workflows/tools may have multiple usages. For tools described in journals, we use the date of publication, but we note that many colleagues in this area are very generous and often release their tools long before formal publication. Given the rapid development in this area, it is perhaps not surprising that some tools only have a public GitHub repository. In the text, we provide hyperlinks as references. (b) The number of graph-based tools developed for different purposes.

Author comment: Complexity welcome: Pangenome graphs for comprehensive population genomics — R0/PR1

Comments

Dear QPB Team,

Apologies for being so late in submitting our review. The good news is that we are really happy with the outcome (and that there are by now enough recent papers for this to be meaningful).

Best,

Detlef Weigel

Review: Complexity welcome: Pangenome graphs for comprehensive population genomics — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

Bao & Weigel present a well-written review of the development of analytical tools for pangenomes. It is a welcome overview of the different analytical tools, focusing on the rather technical, and complicated, aspects of analyses and interpretations of pangenomes.

1. The last sentence in the abstract refers to “reference-free evolutionary and population genomics”. I question the use of “reference-free”. What do the authors mean by this term? Will we ever be able to not rely on a reference in genomic analyses?

2. Introduction: In many parts of the introduction the authors use very long and complex sentences that can be hard to follow completely. I suggest that the authors simplify the language of the introduction by converting the very long sentences into multiple shorter ones. For example, the first sentence in the second paragraph of the introduction spans 5 lines and contains two “which” sub-clauses. If converted to 2-3 shorter sentences, the line of thought in this sentence would be much easier to follow and would also be clearer.

3. First paragraph of the Introduction: What is “modern biology”, and is evolutionary research “modern”? Do the authors refer to the use of genomics in biology being “modern”?

4. Last sentence of the Introduction: Will you ever be able to get “all variants present in a species”? Is that even desired? When will a pangenome refer to something other than a “collection of a specific set of genomes”? How would you overcome this?

5. On page 7 you refer to “annotation quality”. I do agree with the authors that this is a problem, but the authors fail to mention that annotations in themselves are very problematic for a number of reasons and that many genome annotations rely on “well-annotated” model organisms, e.g. Arabidopsis. Thus, to overcome the problem of annotation quality, a whole new set of annotation tools and more comprehensive biological sampling would be needed. I think the authors can include a few sentences about this for clarity.

6. I miss a discussion (can be short) on the specific complexity of pangenomics in plants, such as how to deal with hybridization and polyploidy. It might also be worth commenting briefly on the uneven availability of genomic resources for plants, e.g. the overrepresentation of crops.

7. Fig 1: I suggest that the authors change the letters of the different panels such that the panels are described in alphabetical order in the figure legend. Also, panel e lacks a description. Panel e is referred to last in the text so then why is it not just the last panel? The vcf format in panel d is not completely clear to me.

8. Fig 2 and 3: Please make sure that they are referred to in the correct order. It now looks like Fig 3 is referred to before Fig 2. I also believe the figure legends have been swapped but that the figures are uploaded in the reverse order. Since I cannot see which figure is which in the proof, I’m not 100% sure.

9. Fig 2 (which I think is Fig 3 in the text): The colors and shapes in panel 2 are difficult to distinguish. It is almost impossible to see the difference between the two blues, and the filled shapes also cover the unfilled shapes. Please consider revising this figure.

10. Fig 3 (which I think is Fig 2 in the text): I cannot see any a or b panels. Why are some programs in bold? The lighter green and blue are difficult to distinguish in the timeline.

Review: Complexity welcome: Pangenome graphs for comprehensive population genomics — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

This is an excellent and timely article. I thoroughly enjoyed reading it. The manuscript provides a clear and engaging progression through the development of pangenome graph methods, their applications, and associated challenges. It is very well written and well structured.

I do not have any major concerns. Below, I provide a set of minor editorial suggestions that could further improve clarity, readability, and consistency.

Page 3

• Sentence: “This insight led early on to the suggestion of producing synthetic reference genome sequences that represented all possible combinations of polymorphisms across the genome.” → Please provide a reference or an example. What does synthetic reference genome sequences refer to here?

Page 5

• Last paragraph, second sentence: “MSAs” should likely be “MGAs.”

• Please spell out “TBA” at first mention.

Page 6

• Please spell out “PGGB” and “ACMGA” at first use.

Page 7 and subsequent pages

• Please spell out “PGR-TK,” “DRAGEN,” “EVG,” “KAGE,” and “PALSS” at first mention.

References / Bibliography

• The following reference is duplicated:

Paten, B., Earl, D., Nguyen, N., Diekhans, M., Zerbino, D., & Haussler, D. (2011). Cactus: Algorithms for genome multiple sequence alignment. Genome Research, 21(9), 1512–1528.

Page 9

• First paragraph, third last sentence: “in humans ((Igolkina et al., 2024).” → remove extra parenthesis.

Figures

• Figure legends appear to be mismatched: the legend for Figure 2 corresponds to Figure 3, and vice versa.

Recommendation: Complexity welcome: Pangenome graphs for comprehensive population genomics — R0/PR4

Comments

Thank you for submitting your manuscript to Quantitative Plant Biology. Your review was positively received by both reviewers, who both detail minor suggestions for improvement.

Decision: Complexity welcome: Pangenome graphs for comprehensive population genomics — R0/PR5

Comments

No accompanying comment.

Author comment: Complexity welcome: Pangenome graphs for comprehensive population genomics — R1/PR6

Comments

Thank you for the nice and useful reviews!

Review: Complexity welcome: Pangenome graphs for comprehensive population genomics — R1/PR7

Conflict of interest statement

Reviewer declares none.

Comments

Two very small things:

Fig. 1: Page 4, second to last line in the section “The evolution of building pangenome graphs”: I think it should be Fig. 1c-h. Also, make sure that the letters in panel i are updated to reflect the new order of the panels.

Fig. 2: Double-check that panel b is in the format you want. I do understand your thoughts behind the original coloring and shapes in the figure. Thank you for clarifying that in your responses. I think the legend of the shapes and colors has not made it into the new version.

Review: Complexity welcome: Pangenome graphs for comprehensive population genomics — R1/PR8

Conflict of interest statement

Reviewer declares none.

Comments

All my concerns have been addressed. Thank you again for such an excellent read!

Recommendation: Complexity welcome: Pangenome graphs for comprehensive population genomics — R1/PR9

Comments

Thanks for submitting this revised version to QPB. It is an interesting review which was well received.

Reviewer 1 has some very minor comments:

Two very small things:

Fig. 1: Page 4, second to last line in the section “The evolution of building pangenome graphs”: I think it should be Fig. 1c-h. Also, make sure that the letters in panel i are updated to reflect the new order of the panels.

Fig. 2: Double-check that panel b is in the format you want. I do understand your thoughts behind the original coloring and shapes in the figure. Thank you for clarifying that in your responses. I think the legend of the shapes and colors has not made it into the new version.

Decision: Complexity welcome: Pangenome graphs for comprehensive population genomics — R1/PR10

Comments

No accompanying comment.