Introduction
Copy number variations (CNVs) allude to genetic modifications that include deletions, duplications and insertions surpassing a size threshold of 50 base pairs (bp). These modifications intricately reshape the Deoxyribonucleic acid (DNA) architecture, exerting a profound influence on genomic diversity, a phenomenon readily apparent both within specific breeds and across diverse populations (Letaief et al., Reference Letaief, Rebours, Grohs, Meersseman, Fritz, Trouilh, Esquerré, Barbieri, Klopp and Philippe2017). CNVs have been observed to influence a greater proportion of genomic sequences compared to other types of genomic variations, such as single-nucleotide polymorphisms (SNPs) (Geistlinger et al., Reference Geistlinger, Da Silva, Cesar, Tizioto, Waldron, Zimmer, Regitano and Coutinho2018; Zhou et al., Reference Zhou, Utsunomiya, Xu, Hay, Bickhart, Sonstegard, Van Tassell, Garcia and Liu2016; Zhao et al., Reference Zhao, Wang, Wang, Jia and Zhao2013; Liu and Bickhart, Reference Liu and Bickhart2012; Hou et al., Reference Hou, Liu, Bickhart, Cardone, Wang, Kim, Matukumalli, Ventura, Song and VanRaden2011; Zhang et al., Reference Zhang, Gu, Hurles and Lupski2009). They can also affect the expression of adjacent genes, even when they may not be inherently connected through linkage disequilibrium (LD). CNVs and CNV regions (CNVRs) have been linked to both qualitative and quantitative traits across various animal species (da Silva et al., Reference Da Silva, Giachetto, Da Silva, Cintra, Paiva, Yamagishi and Caetano2016). Changes within CNVRs can appear as either copy number gains, copy number losses, or mixed types, involving both gain and loss simultaneously (Butty et al., Reference Butty, Chud, Cardoso, Lopes, Miglior, Schenkel, Cánovas, Häfliger, Drögemüller and Stothard2021).
CNVs have the potential to induce significant phenotypic variations through a diverse array of mechanisms, that is, through gene dosage effect, alterations in gene expression levels, gene blocking effects, gene fusion events, positional effects, the activation of previously dormant alleles, functional polymorphisms and the possibility of compounded effects (Zhang et al., Reference Zhang, Calus, Bosse, Sahana, Lund and Guldbrandtsen2018b). They might represent the basis upon which evolutionary mechanisms can exert their influence (Emerson et al., Reference Emerson, Cardoso-Moreira, Borevitz and Long2008). About 50% of recognized CNVs from humans involve protein coding regions, acknowledged for their roles in fundamental cellular processes, overall metabolism and the initiation of various diseases and disease susceptibility (Sebat et al., Reference Sebat, Lakshmi, Troge, Alexander, Young, Lundin, Manér, Massa, Walker and Chi2004; Gupta et al., Reference Gupta, Place, Goldstein, Sarkar, Zhou, Potamousis, Kim, Flanagan, Li and Newton2015; Cooper et al., Reference Cooper, Coe, Girirajan, Rosenfeld, Vu, Baker, Williams, Stalker, Hamid and Hannig2011; Casey et al., Reference Casey, Magalhaes, Conroy, Regan, Shah, Anney, Shields, Abrahams, Almeida and Bacchelli2012; Almal and Padh, Reference Almal and Padh2012; El-Sayed Moustafa et al., Reference El-Sayed Moustafa, Eleftherohorinou, De Smith, Andersson-Assarsson, Couto Alves, Hadjigeorgiou, Walters, Asher, Bottolo and Buxton2012; Conrad et al., Reference Conrad, Pinto, Redon, Feuk, Gokcumen, Zhang, Aerts, Andrews, Barnes and Campbell2010; Park et al., Reference Park, Kim, Kasif and Park2015; Pinto et al., Reference Pinto, Pagnamenta, Klei, Anney, Merico, Regan, Conroy, Magalhaes, Correia and Abrahams2010). Modifications in CNVs are identified in cancerous tissues (Gupta et al., Reference Gupta, Place, Goldstein, Sarkar, Zhou, Potamousis, Kim, Flanagan, Li and Newton2015; Park et al., Reference Park, Kim, Kasif and Park2015; Ouyang et al., Reference Ouyang, Lee, Park, Mao, Shi, Gong, Zheng, Li, Zhao and Wang2014; Malek, Reference Malek2013; Verma et al., Reference Verma, Khoury and Ioannidis2013) and have been linked to different other traits (Park et al., Reference Park, Kim, Kasif and Park2015; Zarrei et al., Reference Zarrei, MacDonald, Merico and Scherer2015; Butty et al., Reference Butty, Chud, Cardoso, Lopes, Miglior, Schenkel, Cánovas, Häfliger, Drögemüller and Stothard2021; Kang et al., Reference Kang, Li, Liu, Liu, Pan, Wiggans, Rosen and Liu2020; Di Gerlando et al., Reference Di Gerlando, Sutera, Mastrangelo, Tolone, Portolano, Sottile, Bagnato, Strillacci and Sardina2019; Zhou et al., Reference Zhou, Connor, Wiggans, Lu, Tempelman, Schroeder, Chen and Liu2018; da Silva et al., Reference Da Silva, Giachetto, Da Silva, Cintra, Paiva, Yamagishi and Caetano2016; Jakobsson et al., Reference Jakobsson, Scholz, Scheet, Gibbs, VanLiere, Fung, Szpiech, Degnan, Wang and Guerreiro2008).
Historically, the identification of CNVs at the cytogenetic level utilized techniques such as Fluorescent In Situ Hybridization (FISH) and chromosomal karyotyping (Zhao et al., Reference Zhao, Wang, Wang, Jia and Zhao2013). The majority of extensive population-based studies for CNV detection primarily utilize two approaches: SNP genotyping panels and comparative genomic hybridization arrays (CGH) (Zhang et al., Reference Zhang, Jia, Yang, Xu, Li, Sun, Huang, Lan, Lei and Zhou2014; Cicconardi et al., Reference Cicconardi, Chillemi, Tramontano, Marchitelli, Valentini, Ajmone-Marsan and Nardone2013; Xu et al., Reference Xu, Hou, Bickhart, Zhou, Hay, Song, Sonstegard, Van Tassell and Liu2016). The merits and drawbacks linked to them are thoroughly discussed in the literature (Pinto et al., Reference Pinto, Darvishi, Shi, Rajan, Rigler, Fitzgerald, Lionel, Thiruvahindrapuram, MacDonald and Mills2011; Ionita-Laza et al., Reference Ionita-Laza, Rogers, Lange, Raby and Lee2009; Curtis et al., Reference Curtis, Lynch, Dunning, Spiteri, Marioni, Hadfield, Chin, Brenton, Tavaré and Caldas2009). Among these methods, the use of SNP arrays with varying densities is advantageous in diverse livestock species because of their high-throughput nature and cost-effectiveness. Additionally, the incorporation of the B-allele frequency (BAF) and Log R Ratio (LRR) parameter further aids in result interpretation (Fadista et al., Reference Fadista, Nygaard, Holm, Thomsen and Bendixen2008). Numerous computational tools and methods have been developed for the exhaustive analysis of CNVs on a genome-wide scale. Notably, the Hidden Markov Model (HMM) utilized in PennCNV stands out as a highly accurate approach for CNV detection, recognized for its heightened specificity and sensitivity (Pierce et al., Reference Pierce, Dzama and Muchadeyi2018; Geistlinger et al., Reference Geistlinger, Da Silva, Cesar, Tizioto, Waldron, Zimmer, Regitano and Coutinho2018; Zhang et al., Reference Zhang, Jia, Yang, Xu, Li, Sun, Huang, Lan, Lei and Zhou2014).
Studies focused on identifying CNVs have been successfully conducted in economically significant animal species. These species encompass cattle (Jiang et al., Reference Jiang, Jiang, Yang, Liu, Wang, Wang, Ding, Liu and Zhang2013; Xu et al., Reference Xu, Cole, Bickhart, Hou, Song, VanRaden, Sonstegard, Van Tassell and Liu2014a; Bickhart et al., Reference Bickhart, Hou, Schroeder, Alkan, Cardone, Matukumalli, Song, Schnabel, Ventura and Taylor2012; Hou et al., Reference Hou, Bickhart, Chung, Hutchison, Norman, Connor and Liu2012a; Hou et al., Reference Hou, Bickhart, Hvinden, Li, Song, Boichard, Fritz, Eggen, Denise, Wiggans and Liu2012b; Jiang et al., Reference Jiang, Jiang, Wang, Ding, Liu and Zhang2012; Liu and Bickhart, Reference Liu and Bickhart2012; Liu et al., Reference Liu, Brown, Hebert, Cardone, Hou, Choudhary, Shaffer, Amazu, Connor, Ventura, Gasbarre, Van Tassell and Sonstegard2011; Liu et al., Reference Liu, Hou, Zhu, Cardone, Jiang, Cellamare, Mitra, Alexander, Coutinho, Dell’Aquila, Gasbarre, Lacalandra, Li, Matukumalli, Nonneman, Regitano, Smith, Song, Sonstegard, Van Tassell, Ventura, Eichler and Bickhart2010; Liu et al., Reference Liu, Ventura, Cellamare, Chen, Cheng, Zhu, Li, Song and Eichler2009), goat (Fontanesi et al., Reference Fontanesi, Martelli, Beretti, Riggio, Dall’Olio, Colombo, Casadio, Russo and Portolano2010), sheep (Fontanesi et al., Reference Fontanesi, Beretti, Martelli, Colombo, Dall’Olio, Occidente, Portolano, Casadio, Matassino and Russo2011; Liu et al., Reference Liu, Zhang, Xu, Ren, Lu, Zhang, Zhang, Zhou, Wei and Zhao2013) and buffalo (Ahmad et al., Reference Ahmad, Chandrababu Shailaja, Vaishnav, Kumar, Gaur, Janga, Ahmad, Malla and Dutt2023; Dash et al., Reference Dash, Sivalingam, Bidyalaxmi, Sukhija, Kumar, Niranjan, Tantia and Gupta2023; Kumar et al., Reference Kumar, Panigrahi, Strillacci, Sonejita Nayak, Rajawat, Ghildiyal, Bhushan and Dutt2023; Liu et al., Reference Liu, Kang, Catacchio, Liu, Fang, Schroeder, Li, Rosen, Iamartino and Iannuzzi2019; Strillacci et al., Reference Strillacci, Moradi-Shahrbabak, Davoudi, Ghoreishifar, Mokhber, Masroure and Bagnato2021; Yang et al., Reference Yang, Han, Deng, Li, Han, Xia, Quan, Hua, Yang and Zhou2023; Zhang et al., Reference Zhang, Chen, Chen, Lei and Sun2022). A substantial quantity of CNVs have been observed in both indicine and taurine cattle breeds, particularly within genes and genomic regions influencing complex and quantitative traits (Zhang et al., Reference Zhang, Jia, Yang, Xu, Li, Sun, Huang, Lan, Lei and Zhou2014; Cicconardi et al., Reference Cicconardi, Chillemi, Tramontano, Marchitelli, Valentini, Ajmone-Marsan and Nardone2013; Bickhart et al., Reference Bickhart, Hou, Schroeder, Alkan, Cardone, Matukumalli, Song, Schnabel, Ventura and Taylor2012; Hou et al., Reference Hou, Liu, Bickhart, Matukumalli, Li, Song, Gasbarre, Van Tassell and Sonstegard2012c; Jiang et al., Reference Jiang, Jiang, Wang, Ding, Liu and Zhang2012; Jiang et al., Reference Jiang, Jiang, Yang, Liu, Wang, Wang, Ding, Liu and Zhang2013). Notably, there is a stronger overlap of CNVs reported among different taurine breeds compared to the overlap seen when indicine and taurine cattle are compared. Interestingly, indicine breeds displayed the greatest CNV diversity among all (Bickhart et al., Reference Bickhart, Hou, Schroeder, Alkan, Cardone, Matukumalli, Song, Schnabel, Ventura and Taylor2012).
In the context of cattle, CNVs have been associated with diverse traits, including parasite resistance (Hou et al., Reference Hou, Liu, Bickhart, Matukumalli, Li, Song, Gasbarre, Van Tassell and Sonstegard2012c), growth characteristics (Zhang et al., Reference Zhang, Calus, Bosse, Sahana, Lund and Guldbrandtsen2018b; Xu et al., Reference Xu, Shi, Cai, Zhou, Lan, Zhang, Lei, Qi and Chen2014b), reproduction (Yue et al., Reference Yue, Dechow, Chang, DeJarnette, Marshall, Lei and Liu2014; Sasaki et al., Reference Sasaki, Ibi, Akiyama, Fukushima and Sugimoto2016), milk production and composition (da Silva et al., Reference Da Silva, Giachetto, Da Silva, Cintra, Paiva, Yamagishi and Caetano2016; Gao et al., Reference Gao, Jiang, Yang, Hou, Liu, Zhang, Zhang and Sun2017), milk somatic cell scores (Durán Aguilar et al., Reference Durán Aguilar, Román Ponce, Ruiz López, González Padilla, Vásquez Peláez, Bagnato and Strillacci2017), meat quality (da Silva et al., Reference Da Silva, Giachetto, Da Silva, Cintra, Paiva, Yamagishi and Caetano2016; de Lemos et al., Reference De Lemos, Peripolli, Berton, Feitosa, Olivieri, Stafuzza, Tonussi, Kluska, Chiaia and Mueller2018) and feed conversion ratios (de Almeida Santana et al., Reference De Almeida Santana, Junior, Cesar, Freua, Da Costa Gomes, Da Luz e Silva, Leme, Fukumasu, Carvalho and Ventura2016).
The identification of CNVs and CNVRs within crossbred cattle populations holds the potential to unveil specific genetic segments responsible for variations in critical economic traits (Liu et al., Reference Liu, Chen, Huang, Wang, Peng, Zhang, Chai, Khan and Wang2024). A lot of CNV-related literature is available; however, only a few studies have been conducted to explore the crossbred cattle genomics of tropical regions like Pakistan (Chen et al., Reference Chen, Khan, Wang, Liang, Ren, Kou, Liu, Chen, Peng and Wang2024). The primary focus of the current study was to construct a comprehensive genome-wide CNV map for crossbred cattle employing SNP genotyping techniques, aiming to facilitate genetic enhancements and delve into the genetic foundations of improved production and environmental adaptability.
Previously, the signatures of selection and LD parameters were done using the same dataset (Nisa et al., Reference Nisa, Kaul, Asif, Amin, Mrode, Mansoor and Mukhtar2023; Nisa et al., Reference Nisa, Naqvi, Arshad, Ilyas, Asif, Amin, Mrode, Mansoor and Mukhtar2024). Sahiwal, a tropical dairy cattle, is well recognized for disease resistance and heat tolerance (Iqbal et al., Reference Iqbal, Liu, Yang, Huang, Hanif, Asif, Khan and Mansoor2019), but the lower production is of great concern. The import of high-yielding dairy animals, like HF, is rising in Pakistan to mitigate the production-related issues. Crossbreeding Sahiwal and HF is a highly efficient method to bolster livestock productivity with improved sustainability and reproductive ability (Leroy et al., Reference Leroy, Baumung, Boettcher, Scherf and Hoffmann2016; Mbole-Kariuki et al., Reference Mbole-Kariuki, Sonstegard, Orth, Thumbi, Bronsvoort, Kiara, Toye, Conradie, Jennings and Coetzer2014; Bebe et al., Reference Bebe, Udo, Rowlands and Thorpe2003). This crossbreeding yields progeny that harness the benefits of hybrid vigour (Kumar et al., Reference Kumar, Alex, Gaur, Mukherjee, Mandal, Singh, Tyagi, Kumar, Das and Deb2018). It may combine the high production yield of HF and the adaptability and heat tolerance of Sahiwal into a single individual, with improved production and adaptability.
Material and methods
Ethics statement
To guarantee the ethical and compassionate treatment of animals, the investigation outlined here received approval from the research ethics committee of the National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Pakistan, on 10 June 2020. A qualified veterinarian supervised the blood collection process to minimize discomfort to the animals. Before sample collection, the researchers conducted a meeting with the farmers to elaborate on the objective of the investigation and secured verbal acknowledgment of consent.
Sample collection and data generation
The study sample comprised 81 crossbred cattle. The animals were selected based on varying proportions of HF and Sahiwal genetics across different lactations. Due to the inherent variability in crossbreeding, breed composition differed among individuals, with some possessing approximately 50% HF and 50% Sahiwal inheritance, while others had up to 31/32 HF ancestry, with the remainder from Sahiwal. Genotyping was done using the GGP_HDv3_C chip (GeneSeek® Genomic Profiler™) and commercially available services at GeneSeek (Neogen Corporation, Lincoln, NE, United States). The details about the blood sample collection, crossbred composition, DNA extraction, its qualitative and quantitative assessment and genotyping detail is mentioned in previous studies on the same dataset (Nisa et al., Reference Nisa, Naqvi, Arshad, Ilyas, Asif, Amin, Mrode, Mansoor and Mukhtar2024; Nisa et al., Reference Nisa, Kaul, Asif, Amin, Mrode, Mansoor and Mukhtar2023). The genotypes were originally discerned utilizing Illumina, Inc.’s Genome Studio. The examination was conducted based on the ARS-UCD1.2 bovine genome assembly.
Quality control (QC)
After genotyping, raw data consisted of 139,376 SNPs. QC was performed utilizing the PLINK v1.9 software as outlined by Purcell et al. (Reference Purcell, Neale, Todd-Brown, Thomas, Ferreira, Bender, Maller, Sklar, De Bakker and Daly2007). This involved eliminating SNPs with a call rate of < 95%, a minor allele frequency (MAF) of < 0.02 and a Hardy-Weinberg equilibrium (HWE) of < 10E−05. Downstream analysis considered autosomal SNPs only.
Calling copy number variations (CNVs) and copy number variation regions (CNVRs)
In this study, the Genome Studio v2.0.5 software developed by Illumina was utilized to extract pertinent information, including BAF and LRR, from the signal intensity data of the genotyped samples. Notably, the genotyping data exhibited a minimal rate of missing values, boasting an impressive genotyping rate of 99.7%.
For the crucial task of CNV detection, the PennCNV programme was used (Wang et al., Reference Wang, Li, Hadley, Liu, Glessner, Grant, Hakonarson and Bucan2007). This programme leverages the power of HMM for the accurate identification of CNVs. To facilitate the analysis, the Compile_pfb script within PennCNV was utilized. This script allowed the generation of a comprehensive genome-wide Population Frequency of B Allele (PFB) file, primarily derived from the BAF associated with each SNP.
To further refine the data analysis, Kcolumn, a Perl script within PennCNV, was employed. This script was instrumental in the segmentation and organization of the information pertaining to LRR, BAF and PFB. It is noteworthy that, due to the unavailability or incompleteness of complete pedigree information, the ’--test’ option was chosen for the CNV calling process, ensuring the robustness of the analysis in situations where pedigree information was lacking or not fully utilized in the study.
CNVs were identified using the intensity files in the Perl script detect_cnv supplied by the PennCNV. QC for CNVs adhered to stringent criteria, necessitating a low LRR standard deviation (SD) of less than 0.3, with a minimal BAF drift of less than < 0.01 and a GC wave factor of less than 0.05.
It is important to highlight that although the PennCNV was basically designed for humans, essential modifications were incorporated while analysing to accommodate the extra chromosomes of the bovine genome. All other parameters and settings of PennCNV were retained at their default values during CNV calling.
The identified CNVs were subsequently categorized into discrete intervals, referred to as CNVRs. This choice was made to define CNVRs more naturally, encompassing intervals with overlapping CNVs that did not surpass the average size of the CNV+1SD. CNVRs were constructed using the CNVRuler programme with default parameters, as outlined by Kim et al. (Reference Kim, Kim, Raney and Ernst2012b).
It is important to note that CNVRuler offers three distinct methods for defining CNVRs: CNVR, Reciprocal Overlap (RO) and Fragment. In this study, the CNVR method was selected. To ensure accuracy, a recurrence value of 0.3 was set to trim sparse regions of overlap, preventing the overestimation of CNVR size and frequency.
For validation purposes, the same CNVRs were also obtained using the HandyCNV package within R (Zhou et al., Reference Zhou, Liu, Lopdell, Garrick and Shi2021). Three categories of CNVRs were identified specifically: loss, gain and mixed.
Functional annotation
The automated annotation of genes located in the identified CNVs and CNVRs was conducted using the handyCNV package in R, specifically employing the call_gene function. However, before utilizing this function, the preparation of gene lists against the correct reference genome, namely ARS-UCD1.2, was imperative. This preparatory step was accomplished through the get_refgene function.
Two distinct gene lists were compiled. The first one drew upon data from UCSC, while the second was created using information sourced from the Ensembl Genome browser. After individually extracting information from both browsers, a comparative analysis was undertaken between the results to identify a set of consensus genes. Additionally, the Ensemble database, specifically Ensembl gene 110, was accessed via Biomart for the same annotation purpose.
Quantitative trait loci (QTL) detection within CNVRs
CNVRs were additionally scrutinized for their potential association with significant QTLs affecting various economically important traits. This evaluation utilized the CattleQTLdb (https://www.animalgenome.org/cgi-bin/QTLdb/BT/index). Genomic coordinates were employed to identify QTLs and genes that exhibited spatial overlap within CNVRs.
Additionally, annotation using Gene Ontology (GO) was performed using the DAVID platform (https://david.ncifcrf.gov/tools.jsp) (Huang et al., Reference Huang, Li, Wang, Yu, Cai, Zheng, Li, Zhang, Chen and Asadollahpour Nanaei2021). This approach offered insights into the biological functions and pathways associated with the genes located within CNVs and CNVRs.
Comparison of CNVR with previous studies
To compare CNVRs found in this study with previously reported studies, autosomal CNVs from eight studies were retrieved from the Database of Genomic Variants Archive (DGVa) at EMBL-EBI (accessed on 30 September 2023). They were juxtaposed with CNVRs identified in this study. The study populations in these studies are mainly of taurine breeds; however, in 3 datasets, we got some samples from indicine breeds as well (Liu et al., Reference Liu, Hou, Zhu, Cardone, Jiang, Cellamare, Mitra, Alexander, Coutinho, Dell’Aquila, Gasbarre, Lacalandra, Li, Matukumalli, Nonneman, Regitano, Smith, Song, Sonstegard, Van Tassell, Ventura, Eichler and Bickhart2010; Hou et al., Reference Hou, Liu, Bickhart, Cardone, Wang, Kim, Matukumalli, Ventura, Song and VanRaden2011; Karimi et al., Reference Karimi, Esmailizadeh, Wu and Gondro2017). One study detected CNV using CGH (Liu et al., Reference Liu, Hou, Zhu, Cardone, Jiang, Cellamare, Mitra, Alexander, Coutinho, Dell’Aquila, Gasbarre, Lacalandra, Li, Matukumalli, Nonneman, Regitano, Smith, Song, Sonstegard, Van Tassell, Ventura, Eichler and Bickhart2010), three studies used SNP Chip data (Hou et al., Reference Hou, Liu, Bickhart, Cardone, Wang, Kim, Matukumalli, Ventura, Song and VanRaden2011; Karimi et al., Reference Karimi, Esmailizadeh, Wu and Gondro2017) and four studies used WGS data (Bickhart et al., Reference Bickhart, Hou, Schroeder, Alkan, Cardone, Matukumalli, Song, Schnabel, Ventura and Taylor2012; Boussaha et al., Reference Boussaha, Esquerré, Barbieri, Djari, Pinton, Letaief, Salin, Escudié, Roulet and Fritz2015; Keel et al., Reference Keel, Keele and Snelling2017; Mesbah-Uddin et al., Reference Mesbah-Uddin, Guldbrandtsen, Iso-Touru, Vilkki, De Koning, Boichard, Lund and Sahana2018). The studies encompassed a variable number of breeds, ranging from 1 to 21, with sample sizes ranging from 6 to 539. To compile the DGVa CNVR set, information including study details, type of CNV, chromosome, start and end position was extracted.
CNVs in these studies were identified using UMD3.1 and UMD3.1.1 assemblies of bovines. The coordinates from different assemblies were first converted to ARS-UCD1.2 using the LiftOver tool of UCSC Genome Browser (Navarro Gonzalez et al., Reference Navarro Gonzalez, Zweig, Speir, Schmelter, Rosenbloom, Raney, Powell, Nassar, Maulding and Lee2021). The minimum threshold for the ratio of bases requiring remapping was established at 0.4 (Butty et al., Reference Butty, Chud, Miglior, Schenkel, Kommadath, Krivushin, Grant, Häfliger, Drögemüller and Cánovas2020) and for all other LiftOver parameters, default values were applied.
After translation to ARS-UCD1.2 positions, CNVs that shared a minimum overlap of 1bp were merged. The DGVa CNVR set resulted in a total of 9243 CNVRs. CNVRs from our dataset were considered equivalent to those from the DGVa if the RO between them was at least 50%.
Results
In the current study, GGP_HDv3_C array data from 81 crossbred animals were employed to detect CNVs and CNVRs. The HMM within the PennCNV program was applied for this purpose. Initially, a total of 1206 CNVs were identified within the crossbred dataset. After a rigorous filtering process, 1055 CNVs were retained, distributed across animals and encompassing all autosomes (Supplementary File 1).
The observed CNV count per animal ranged from a minimum of 3 to a maximum of 37, with an average of 13.88 CNVs per animal. Btau11 displayed the highest number of CNVs, occurring at 97 distinct genomic locations. Conversely, Btau23 and Btau27 in the bovine genome exhibited the lowest number of CNVs, each containing only three CNVs.
The total regions displaying losses and gains in relation to the normal copy number (CNV = 2) amounted to 129 and 926, respectively. The size of the filtered CNVs displayed considerable variation, ranging from 2.9 kilobases (kb) to 1108.7 kb. The average CNV length was approximately 184.502 kb, with a median length of 133.472 kb. The relationship between CNV types and their respective lengths in kb was estimated (Figure 1), providing a visual representation of CNV distribution across the genome (Supplementary File 2). The box plot illustrates the distribution of CNV lengths across four CNV types (0, 1, 3 and 4). Notably, types 3 and 4 exhibit a broader range of CNV lengths and a higher number of outliers compared to types 0 and 1, indicating greater variability. The median CNV length is higher for type 3, while type 0 shows the least variability and few outliers, suggesting a more consistent CNV length distribution.

Figure 1. Box plot showing CNV lengths (kb) across four categories (0, 1, 3, 4). The boxes represent the interquartile range (IQR), with medians marked as horizontal lines; whiskers extend to 1.5×IQR and dots represent outliers. CNV categories vary in data distribution, with category 0 showing fewer data points and higher outliers compared to categories 3 and 4.
The distribution of CNV sizes is summarized (Table 1). Notably, nearly half (44%) of the CNVs fell within the size range of 0 to 100 kb. CNVs in the 300–400 kb range were less common, while those exceeding 400 kb in size were relatively rare.
Table 1. Summary of CNV based on size in kilobases (kb)

CNVs, copy number variations; CNVRs, copy number variation regions.
A summary plot of CNVs, displaying results categorized by length group, CNV type and chromosome, was generated using the HandyCNV tool (Figure 2). It depicts the distribution of CNVs across different chromosomes, categorized by CNV types (0, 1, 3, 4). Each line represents the number of CNVs for each chromosome, with different colours corresponding to distinct CNV values. The data show notable peaks in CNV counts for chromosomes 5, 14, 24 and 26, particularly for CNV types 3 (purple) and 4 (blue), suggesting a higher prevalence of these CNVs on these chromosomes. Meanwhile, CNV types 0 (red) and 1 (black) are less frequent and display lower variation across chromosomes (Supplementary File 3).

Figure 2. Line plot showing the distribution of CNVs across chromosomes, categorized by CNV types (0, 1, 3, 4). Each line represents the number of CNVs per chromosome, with distinct colours indicating different CNV values. The plot highlights variation in CNV counts across chromosomes and types.
Likewise, each copy plot is differentiated based on its specific copy number (Figure 3). In it, the frequency and length distribution of CNVs across chromosomes for each CNV type (0, 1, 3, 4) using box plots is mentioned. The top panel (CNV type 0, red) shows a significant peak in frequency on chromosome 12. For CNV type 1 (second panel, black), chromosomes 12 and 25 exhibit elevated CNV frequencies and lengths. CNV type 3 (third panel, purple) displays a widespread distribution, with higher frequencies on chromosomes 5, 10 and 24, while CNV type 4 (bottom panel, blue) highlights chromosomes 5, 14, 24 and 26 as hotspots for CNV occurrence. The range of CNV lengths is greater for types 3 and 4, as indicated by the larger spread in the box plots (Summary plots are indicated in Supplementary File 4).

Figure 3. Prevalence and type of CNVs across autosomes for four types (0, 1, 3, 4) represented as boxplots (lengths in kb) and lines with points (frequency per chromosome). Boxplots show medians, interquartile ranges and whiskers, while numbers on points indicate CNV counts.
CNVRs are defined as genomic segments containing one or more CNVs that exhibit at least a single base pair of overlap. Consequently, CNVRs do not overlap with one another. We performed the merging of overlapping CNVs using two distinct approaches: CNVRuler and the HandyCNV package in R (Supplementary Files 5 and 6). As minor modifications were observed therefore in this study the CNVRs obtained using CNVRuler were mainly under consideration.
When employing the CNVRuler software (Kim et al., Reference Kim, Hu, Yim, Bae, Kim and Chung2012a), a total of 268 CNVRs were identified (Supplementary File 5). The majority of these CNVRs (65.67%) fell within the size range of 0 to 100 kb, with 16.79% ranging from 100 to 200 kb. Overall, the CNVRs ranged in size from 3.801 kb to 915.979 kb, with an average size of approximately 115.7949 kb. Among the 268 identified CNVRs, 212 represented gain events, 44 were indicative of loss events and 14 CNVRs comprised a combination of both gain and loss events. Detailed distributions of autosomal CNVRs are presented in Table 2 and Figure 4.
Table 2. No. of CNVRs, proportional length of CNVRs on each autosome using HandyCNV and CNVRuler

CNVs, copy number variations; CNVRs, copy number variation regions.

Figure 4. Distribution of CNVs (a) and CNVRs (b) in different distance categories (0–100 kb, 101–200 kb, 201–300 kb, 301–400 kb, 401–500 kb and >500 kb) across autosomes. (a) The majority of CNVs (44%) fall within the 0–100 kb range. (b) Detailed breakdown of CNVR size ranges, highlighting that 65.67% of the CNVRs are within the 0–100 kb category, followed by 16.79% in the 101–200 kb range.
The cumulative length of the identified CNVRs amounted to 31.03 megabases (Mb), representing approximately 1.24% of the entire genome. It is important to note that the chromosome sizes were sourced from the most recent cattle assembly, ARS-UCD1.3.
The distribution of CNVRs across chromosomes exhibited variability, with the number of CNVRs per chromosome ranging from 0 on BTA23 and BTA24 to 20 on BTA7 and BTA19. The proportion of CNVRs as a fraction of the total chromosome length displayed a spectrum, ranging from 4.72% on BTA25 to 0% on BTA23 and BTA24.
Figure 5 is showing the CNVR map showing the distribution of CNVR across chromosomes (Supplementary File 7).

Figure 5. Genome-wide CNVR map illustrating the distribution of CNVRs across autosomes. Each horizontal bar represents a chromosome, with CNVRs categorized as gains (red), losses (green) and mixed events (black) based on their type. The percentages indicate the proportion of the chromosomes covered by CNVRs. The physical positions of CNVRs are displayed along the x-axis in megabase pairs (Mbp).
CNVR annotation
Annotations were performed separately using information sourced from both the Ensembl genome browser (Supplementary Files 8 and 9) and UCSC (Supplementary Files 10 and 11). Upon comparing the results, it became apparent that the outcomes from the two reference gene lists exhibited slight disparities. This divergence can likely be attributed to variations in methodologies and potential data sources employed by the custodians of each database when generating their gene annotations. Consequently, the position and quantity of genes may vary between different gene builds, even when referencing the same reference genome. Therefore, it is advisable to validate genes of interest by cross-referencing them in more than one database to ensure their reliability and robustness.
After comparing the annotation results from two different approaches, the list of consensus genes was obtained (Supplementary Files 12 and 13). The set of consensus genes common to both CNV results based on the common-gene-threshold criterion is 5%. A total of 97 genes were considered as the ‘common high’ that are present in both approaches more than 5% while the 637 genes were fell among the ‘common low’ that were present in both approaches but not crossing the threshold. The top 10 genes were searched in the cattle literature and found to be associated with relevant traits (Table 3). It can be observed here that the CNVs in our study are highly enriched with immune and defense genes, and the same findings can also be observed in other CNV studies in cattle populations (Liu and Bickhart, Reference Liu and Bickhart2012; Goyache et al., Reference Goyache, Pérez-Pardal, Fernández, Traoré, Menéndez-Arias, Arias and Álvarez2022; Jang et al., Reference Jang, Terefe, Kim, Lee, Belay, Tijjani, Han, Hanotte and Kim2021; Braga et al., Reference Braga, Chud, Watanabe, Savegnago, Sena, Do Carmo, Machado, Panetto, Da Silva and Munari2023).
Table 3. Top 10 highly common genes and their association with economic traits

For confirmation and validation, Ensemble Biomart (Ensemble Genes 110) was also used for the gene annotation. A total of 268 CNVRs containing 249 annotated genes, which can be classified further. Among annotated genes, 233 were protein coding, 2 as pseudogenes, 6 as microRNA, 3 as snRNA and rRNA (n = 1). While annotating these genes with GO terms, biological process components revealed that genes under CNVRs have reported functions related to immune response, production, reproduction, growth, heat stress and more. Many well-defined contrasting traits between indicine and taurine cattle, subject to natural and artificial selection for production, are governed by genes participating in diverse biological processes. These processes encompass reproduction, such as fertility, age of first oestrous, calving interval, (Sartori et al., Reference Sartori, Bastos, Baruselli, Gimenes, Ereno and Barros2011), resilience against ecto- and endo-parasites (Piper et al., Reference Piper, Jonsson, Gondro, Lew-Tabor, Moolhuijzen, Vance and Jackson2009), adaptation to high temperatures (Beatty et al., Reference Beatty, Barnes, Taylor, Pethick, McCarthy and Maloney2006), immunity to diseases (Brunelle et al., Reference Brunelle, Greenlee, Seabury, Brown and Nicholson2008), as well as traits related to growth, carcass and meat quality (Bolormaa et al., Reference Bolormaa, Pryce, Kemper, Hayes, Zhang, Tier, Barendse, Reverter and Goddard2013).
Comparison of CNVR with previous studies
Upon comparison, only a limited number of overlapping CNVRs were observed between the CNVRs identified in this study and the DGVa CNVR set. Eleven overlapping CNVRs are identified. Five CNVRs from this study were overlapped with the study conducted by Hou et al. (Reference Hou, Liu, Bickhart, Cardone, Wang, Kim, Matukumalli, Ventura, Song and VanRaden2011), followed by three CNVRs overlapped from the studies of Mesbah-Uddin et al. (Reference Mesbah-Uddin, Guldbrandtsen, Iso-Touru, Vilkki, De Koning, Boichard, Lund and Sahana2018), Keel et al. (Reference Keel, Keele and Snelling2017), Bickhart et al. (Reference Bickhart, Hou, Schroeder, Alkan, Cardone, Matukumalli, Song, Schnabel, Ventura and Taylor2012) and Karimi et al. (Reference Karimi, Esmailizadeh, Wu and Gondro2017). Two overlapping CNVRs were observed with Liu et al. (Reference Liu, Hou, Zhu, Cardone, Jiang, Cellamare, Mitra, Alexander, Coutinho, Dell’Aquila, Gasbarre, Lacalandra, Li, Matukumalli, Nonneman, Regitano, Smith, Song, Sonstegard, Van Tassell, Ventura, Eichler and Bickhart2010), and only one overlapped CNVR with Boussaha et al. (Reference Boussaha, Esquerré, Barbieri, Djari, Pinton, Letaief, Salin, Escudié, Roulet and Fritz2015) was observed.
Discussion
CNVs are key contributors to genomic structural variation, affecting gene function through changes in gene structure, dosage and regulation, with a larger impact on phenotypes than SNPs (Dang et al., Reference Dang, Zhang, Gao, Peng, Chen and Yang2024; Xu et al., Reference Xu, Shi, Cai, Zhou, Lan, Zhang, Lei, Qi and Chen2014b; Zhang et al., Reference Zhang, Gu, Hurles and Lupski2009). In livestock, CNVs influence economically important traits and disease conditions, making them valuable molecular markers (Cheng et al., Reference Cheng, Jiang, Yang, Cao, Huang, Lan, Lei, Hu and Chen2020; Liu et al., Reference Liu, Brown, Hebert, Cardone, Hou, Choudhary, Shaffer, Amazu, Connor, Ventura, Gasbarre, Van Tassell and Sonstegard2011). Recent studies have extensively explored CNV diversity in both Bos taurus, Bos indicus and their hybrids (Benfica et al., Reference Benfica, Brito, Do Bem, De Oliveira, Mulim, Braga, Cyrillo, Bonilha and Mercadante2024a; Benfica et al., Reference Benfica, Brito, Do Bem, Mulim, Glessner, Braga, Gloria, Cyrillo, Bonilha and Mercadante2024b; Cai et al., Reference Cai, Li, Niu, Li, Lan, Lei, Huang, Xu, Li and Chen2024; Dang et al., Reference Dang, Zhang, Gao, Peng, Chen and Yang2024; Delledonne et al., Reference Delledonne, Punturiero, Ferrari, Bernini, Milanesi, Bagnato and Strillacci2024; Du et al., Reference Du, Ma, Peng, Zhao, Zhao, Wang, Wang, Lyu, Zhang and Qi2024; Liu et al., Reference Liu, Chen, Huang, Wang, Peng, Zhang, Chai, Khan and Wang2024; Maezawa et al., Reference Maezawa, Watanabe, Kobayashi, Yoshida, Chambers, Uchida, Maruyama and Inokuma2024; Oliveira et al., Reference Oliveira, Chud, Oliveira, Hermisdorff, Narayana, Rochus, Butty, Malchiodi, Stothard and Miglior2024; Wang et al., Reference Wang, Ma, Wang, Zhang, Xu, Chen, Zhu, Wang, Gao and Li2024).
The present study aimed to generate a genome-wide CNV map of crossbred dairy cattle in Pakistan. Our results revealed widespread CNVRs, with 1055 CNVs and 268 CNVRs detected using the PennCNV software (Wang et al., Reference Wang, Li, Hadley, Liu, Glessner, Grant, Hakonarson and Bucan2007). PennCNV was chosen for its ability to utilize all available information for each SNP, including the LRR, BAF, PFB and the distance between neighbouring SNPs. Dang et al. (Reference Dang, Zhang, Gao, Peng, Chen and Yang2024) detected 16,507 CNVs and 3,728 CNVRs, accounting for 0.61% of the reference genome in Yunling cattle and Benfica et al. (Reference Benfica, Brito, Do Bem, Mulim, Glessner, Braga, Gloria, Cyrillo, Bonilha and Mercadante2024b) found 3,161 CNVs and 561 CNVRs covering 3.99% of the Nellore autosomal genome. A similar study on Nellore cattle also indicated 14,914 CNVs and 1,884 CNVRs (Benfica et al., Reference Benfica, Brito, Do Bem, De Oliveira, Mulim, Braga, Cyrillo, Bonilha and Mercadante2024a). 870 CNVRs were reported in Holstein cattle (Oliveira et al., Reference Oliveira, Chud, Oliveira, Hermisdorff, Narayana, Rochus, Butty, Malchiodi, Stothard and Miglior2024), 755 CNVRs, accounting for approximately 3.24% of the genome in Pingliang Red Cattle (Wang et al., Reference Wang, Ma, Wang, Zhang, Xu, Chen, Zhu, Wang, Gao and Li2024). Similarly the Delledonne et al. (Reference Delledonne, Punturiero, Ferrari, Bernini, Milanesi, Bagnato and Strillacci2024) reported 123,814 CNVs and 1,397 CNVRs in Holstein cattle.
The observed CNV count per animal is 3–37, with an average of 13.88 CNVs. There are variations in these values in the literature, ranging from 13 to 51 with an average of 32.5 (Delledonne et al., Reference Delledonne, Punturiero, Ferrari, Bernini, Milanesi, Bagnato and Strillacci2024). The crossbred cattle exhibited a relatively high number of CNVs per individual compared to breed groups from other regions. Several factors may contribute to this observation, such as the inadequate representation of Sahiwal or indigenous cattle in the utilized SNP chips for bovines. This lower resolution could potentially introduce bias into the results, especially when compared to studies that did not include indicine cattle in their analysis. Differences in the abundance of CNVs across diverse cattle populations have been previously noted. Specifically, indicine and African taurine breeds exhibit a higher CNV abundance compared to European taurine breeds, a characteristic attributed to their breed divergence and population history (Liu et al., Reference Liu, Brown, Hebert, Cardone, Hou, Choudhary, Shaffer, Amazu, Connor, Ventura, Gasbarre, Van Tassell and Sonstegard2011). These findings highlight the impact of factors such as changes in historical effective population size, gene flow and selection processes on the varying CNV abundance observed in distinct populations.
Thus, it is reasonable to posit that the sustained small effective population size over numerous generations in this group may have prompted a relaxation of purifying selection against mildly deleterious CNVs. Consequently, such relaxation could contribute to the accumulation of a substantial number of CNV events. This aligns with findings from Upadhyay et al. (Reference Upadhyay, Da Silva, Megens, Visker, Ajmone-Marsan, Bălteanu, Dunner, Garcia, Ginja and Kantanen2017) suggesting that genetically isolated small populations may accumulate an abundance of CNVs. However, it is noteworthy that in different studies, deletions were primarily observed (Upadhyay et al., Reference Upadhyay, Da Silva, Megens, Visker, Ajmone-Marsan, Bălteanu, Dunner, Garcia, Ginja and Kantanen2017, Oliveira et al., Reference Oliveira, Chud, Oliveira, Hermisdorff, Narayana, Rochus, Butty, Malchiodi, Stothard and Miglior2024, Tao et al., Reference Tao, Fan, Wang, Ma, Liang and Shi2007), whereas the current study predominantly identified gain events. Nevertheless, it is essential to acknowledge that the present study is limited by a low sample size, and larger samples from other indigenous breeds are necessary to further explore this hypothesis.
CNVRs were generated using the two available in silico molecular techniques, that is, CNVRuler and HandyCNV package of R. Some differentiating points were observed within the two (Table 2). The contributing reason to this may be the type of algorithms used for the detection, as well as the technology. These methodologies vary in coverage range and their capabilities to identify and pinpoint CNV breakpoints (Zhan et al., Reference Zhan, Fadista, Thomsen, Hedegaard, Panitz and Bendixen2011). The functional analysis of the regions encompassed by CNVRs unveiled genes linked to complex traits.
In our analysis of CNVs within the dataset, we observed a notable prevalence of gain events. Here, it is essential to consider the influence of biological variation. Gain events can naturally occur more frequently than loss events in certain genomic regions or within specific populations due to inherent biological diversity. In some instances, these gain events might offer a selective advantage, thus driving their increased occurrence.
Furthermore, genomic regions that undergo duplication or gain events may contain genes or sequences that confer advantageous traits, such as enhanced disease resistance or improved adaptability. This phenomenon could be attributed to positive selective pressure acting on these regions, thereby leading to a higher frequency of gain events.
Annotations were performed separately using information sourced from both Ensembl genome browser (Birney et al., Reference Birney, Andrews, Bevan, Caccamo, Chen, Clarke, Coates, Cuff, Curwen and Cutts2004) and UCSC (Karolchik et al., Reference Karolchik, Baertsch, Diekhans, Furey, Hinrichs, Lu, Roskin, Schwartz, Sugnet, Thomas, Weber, Haussler and Kent2003) using the HandyCNV package of R. Upon comparison of the gene list from both genome browsers, 97 common-high and 637 common-low genes were obtained. The top 10 major genes are found to be involved in different economically important traits like milk production, growth traits, adaptation, disease resistance and immunity (Table 3). This may explain the increased production, heat tolerance and disease resilience abilities of crossbreds, which is the underlying reason for their production. Gene enrichment and QTLs play crucial roles in major functional regions of the genome. Gene Ontology analyses for the detected CNVRs revealed enrichment in important GO terms, highlighting some relevant traits. For instance, GO:0030879 is involved in mammary gland development, directly influencing the milk production of the animals. Similarly, GO:0051879 is mainly linked with heat shock proteins, playing a crucial role in multiple types of stresses, including heat stress. Another intriguing term is GO:0071456, related to hypoxic adaptation, suggesting that these animals can effectively adapt to environments with limited oxygen supply, such as high elevations or hypoxic conditions (Table 3).
FISH and quantitative Polymerase Chain Reaction are well-acknowledged methods for the confirmation and validation of CNVs, offering high specificity and accuracy (Bickhart et al., Reference Bickhart, Hou, Schroeder, Alkan, Cardone, Matukumalli, Song, Schnabel, Ventura and Taylor2012). However, these analyses are recognized for being expensive, time-demanding and consuming a substantial amount of biological material. Therefore, this study opted for an in silico method to identify CNVRs while minimizing the reliance on extensive laboratory resources (Bickhart et al., Reference Bickhart, Hou, Schroeder, Alkan, Cardone, Matukumalli, Song, Schnabel, Ventura and Taylor2012).
The CNVs and CNVRs discovered in this study lay the groundwork for future research on CNVs in other Pakistani cattle breeds and Zebu cattle worldwide. Subsequent investigations should explore the impact of incorporating CNV information in genomic selection for crossbred dairy cattle in Pakistan. Furthermore, it is highly recommended to conduct CNV-based Genome-Wide Association Studies focusing on critical traits in these cattle. This holistic approach will contribute valuable insights to the field of cattle genomics and enhance our understanding of the genetic basis of important traits in diverse cattle populations.
Conclusion
This study marks the inaugural genome-wide detection of CNVs in crossbred dairy cattle in Pakistan. The genes identified within these CNV regions illuminate potential biological processes that may underlie indigenous cattle’s adaptability and disease resistance. QTL analyses revealed significant overlaps between many CNVRs and QTLs associated with economically important traits in cattle, including lactation, fertility, stimulus recognition and health. These findings present viable candidates for further validation in the population. Given the preliminary nature of this report, it is strongly recommended that high-density SNP arrays, whole-genome sequencing, or resequencing data from key indigenous cattle breeds with larger sample sizes be collected and utilized to construct a comprehensive genome-wide map of CNVs in indigenous cattle.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0021859625100208
Acknowledgements
We acknowledge the Higher Education Commission, Pakistan for providing funding to Fakhar un Nisa for her PhD Studies.
Author contributions
Conceptualization: Fakhar un Nisa, Qamar un Nisa. Data curation: Fakhar un Nisa, Muhammad Asif. Formal analysis: Fakhar un Nisa, Rubab Zahra Naqvi. Investigation: Fakhar un Nisa, Muhammad Saif Ur Rehman. Methodology: Fakhar un Nisa. Project administration: Muhammad Asif, Zahid Mukhtar. Resources: Muhammad Asif. Software: Fakhar un Nisa. Supervision: Zahid Mukhtar. Visualization: Fakhar un Nisa. Writing – original draft: Fakhar un Nisa. Writing – review and editing: Zahid Mukhtar, Muhammad Saif Ur Rehman.
Funding statement
This study is funded by the PAKISTAN AGRICULTURAL RESEARCH COUNCIL, AGRICULTURAL LINKAGES PROGRAMME (ALP) with Project Identification No. AS 016 titled ‘Development and application of genomic selection in foreign and local cattle breeds for improvement in dairy-related traits’.
Competing interests
The authors declare that they have no competing interests.
Ethical standards
To guarantee the ethical and humane treatment of animals, the investigation outlined in this research paper received approval from the Research Ethics Committee of the National Institute for Biotechnology and Genetic Engineering (NIBGE), Faisalabad, Pakistan on 10 June 2020. A qualified veterinarian supervised the blood collection process to minimize distress and harm to the animals. Prior to collecting any samples, the researchers conducted a meeting with the farm owners where the animals were housed. During this meeting, they explained the study’s purpose and obtained verbal informed consent.