Impact statement
This article explores the impact of artificial intelligence (AI) on breast cancer (BC) diagnosis within the field of pathology. It examines several applications of AI in BC pathology and provides a succinct summary of the principal findings from multiple investigations. Incorporating AI into conventional pathology workflows may enhance diagnostic accuracy and reduce preventable errors. Studies have demonstrated the efficacy of AI, underpinned by advanced convolutional neural networks, in detecting invasive breast tumors through rapid analysis of extensive whole-slide images. AI-driven quantitative analysis has facilitated the assessment of an individual’s hormone receptor status, which is crucial for determining the appropriate BC treatment, largely because it promotes consensus among multiple observers. AI also has the potential to become essential for assessing BC and quantifying mitotic cells, as it can accurately classify moderate-grade breast carcinomas. Furthermore, the use of AI for counting mitoses has proven more precise and sensitive than manual methods, resulting in improved prognostication. To maximize the benefits of AI in BC pathology, including in contexts such as triple-negative BC, it is essential to address issues such as the need for comprehensive annotations and the challenges of differentiating difficult cases. Despite these challenges, AI’s numerous contributions to BC pathology indicate a promising future characterized by enhanced accuracy, efficiency and uniformity. Continued research and development of novel approaches are needed to address these challenges and fully harness AI’s promise to revolutionize BC detection and assessment.
Introduction
History of artificial intelligence
Artificial intelligence (AI) refers to the utilization of technology and computers to imitate human-like cognitive processes and intelligent actions (Försch et al., Reference Försch, Klauschen, Hufnagl and Roth2021; Briganti, Reference Briganti2023). The history of computers can be traced back over 200 years, marked by several significant advancements. The exact year of the first computer’s invention is uncertain, but it is commonly attributed to 1822, when Charles Babbage introduced a design for a functional computer on paper (Grzybowski et al., Reference Grzybowski, Pawlikowska–Łagód and Lambert2024). The history of AI spans several decades (Muthukrishnan et al., Reference Muthukrishnan, Maleki, Ovens, Reinhold, Forghani and Forghani2020), beginning in the 1950s with Alan Turing’s research on the feasibility of intelligent machines, culminating in his landmark 1950 article (Turing, Reference Turing1950; Kaul et al., Reference Kaul, Enslin and Gross2020). In 1956, the term “AI” was coined by John McCarthy (Anyoha, Reference Anyoha2017) during a conference at Dartmouth College, where the first AI program, “Logic Theorist,” was introduced (Moor, Reference Moor2006).
Concepts in AI
AI applications in medicine have evolved significantly due to advancements in machine learning (ML) and deep learning (DL) (Lanzagorta-Ortega et al., Reference Lanzagorta-Ortega, Carrillo-Pérez and Carrillo-Esper2022). AI-based models can assist in diagnosing diseases, forecasting therapy responses and promoting preventive medicine (Kaul et al., Reference Kaul, Enslin and Gross2020; Pettit et al., Reference Pettit, Fullem, Cheng and Amos2021).
Machine learning
The term “ML” was first introduced by Arthur Samuel in 1959, with applications in medicine emerging in the 1980s and 1990s (Brown, Reference Brown2021). As a subset of AI, ML uses algorithms to build models that can learn from data and assist with segmentation, classification or prediction (Jiang et al., Reference Jiang, Luo, Huang, Liu and Li2022). It is classified into three categories: supervised, unsupervised and reinforcement learning (Hosny et al., Reference Hosny, Parmar, Quackenbush, Schwartz and Aerts2018; Ono and Goto, Reference Ono and Goto2022). Supervised learning trains models on paired input and output data to predict outcomes; unsupervised learning analyzes unannotated data to discover patterns without predefined results; and reinforcement learning involves learning through interactions with an environment, receiving rewards or penalties based on the actions taken (Jovel and Greiner, Reference Jovel and Greiner2021; Lee et al., Reference Lee, Warner, Shaikhouni, Bitzer, Kretzler, Gipson, Pennathur, Bellovich, Bhat, Gadegbeku, Massengill, Perumal, Saha, Yang, Luo, Zhang, Mariani, Hodgin and Rao2022; Jiang et al., Reference Jiang, Luo, Huang, Liu and Li2022; Al-Hamadani et al., Reference Al-Hamadani, Fadhel, Alzubaidi and Harangi2024).
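To make these paradigms concrete, the brief sketch below (an illustrative example assuming the scikit-learn library and synthetic data, not data from any cited study) contrasts supervised and unsupervised learning on the same feature matrix; reinforcement learning, which requires an interactive environment, is omitted for brevity.

```python
# Minimal contrast of supervised vs. unsupervised learning (illustrative only).
# Assumes scikit-learn; the synthetic features and labels are hypothetical.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic feature matrix X (e.g., image-derived features) and labels y (0 = benign, 1 = malignant)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised learning: the model is trained on paired inputs and outputs
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised learning: the model sees only inputs and looks for structure (here, 2 clusters)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [(clusters == k).sum() for k in (0, 1)])
```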
Deep learning
DL is a subset of ML that uses artificial neural networks (ANNs) with multiple layers (Sarker, Reference Sarker2021) and is effective for complex tasks and large datasets (Sidey-Gibbons and Sidey-Gibbons, Reference Sidey-Gibbons and Sidey-Gibbons2019; Birhane et al., Reference Birhane, Kasirzadeh, Leslie and Wachter2023). Neural networks are modeled on biological nervous systems, are built from a fundamental unit called the perceptron (or neuron) and usually comprise input, hidden and output layers (Kriegeskorte and Golan, Reference Kriegeskorte and Golan2019). Deep neural networks (DNNs) are advanced models with multiple hidden layers used in healthcare for medical imaging and diagnostics (Bajić et al., Reference Bajić, Orel and Habijan2022; Egger et al., Reference Egger, Gsaxner, Pepe, Pomykala, Jonske, Kurz, Li and Kleesiek2022). Some ANNs have no hidden layer, whereas DNNs have several, enabling them to model complex behaviors (Kufel et al., Reference Kufel, Bargieł-Łączek, Kocot, Koźlik, Bartnikowska, Janik, Czogalik, Dudek, Magiera, Lis, Paszkiewicz, Nawrat, Cebula and Gruszczyńska2023). Convolutional neural networks (CNNs) are specifically designed for image recognition and classification tasks (Alajanbi et al., Reference Alajanbi, Malerba and Liu2021). Recently, these models have shown great potential for accurate diagnosis of conditions such as diabetic retinopathy from retinal images (Ragab et al., Reference Ragab, Al-Ghamdi, Fakieh, Choudhry, Mansour and Koundal2022).
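As a simplified illustration of this layered architecture, the following toy example (assuming the PyTorch library; the layer sizes and 64 × 64 input are arbitrary choices, not taken from any cited model) stacks convolutional, pooling and fully connected layers into a small image classifier.

```python
# Toy convolutional neural network (CNN) for binary image classification.
# Illustrative only; assumes PyTorch and 3-channel 64x64 input patches.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(              # hidden convolutional layers
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)  # output layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = TinyCNN()
logits = model(torch.randn(4, 3, 64, 64))  # a batch of 4 random "image patches"
print(logits.shape)                        # torch.Size([4, 2])
```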
AI in medicine
ML, a key technology in AI, is used across various medical specialties, including oncology, cardiology and neurology (Bitkina et al., Reference Bitkina, Park and Kim2023). AI applications include screening, diagnosis, treatment, drug development (Xu et al., Reference Xu, Yang, Arikawa and Bai2023), genomic analysis, patient monitoring and wearable health technology (Shajari et al., Reference Shajari, Kuruvinashetti, Komeili and Sundararaj2023). Additionally, AI enhances doctor–patient interactions, enables remote therapy and manages large datasets (Shajari et al., Reference Shajari, Kuruvinashetti, Komeili and Sundararaj2023; Chen and Decary, Reference Chen and Decary2020; Basu et al., Reference Basu, Sinha, Ong and Basu2020). Integrating AI into healthcare can significantly improve the effectiveness, accuracy and personalization of medical diagnoses and treatments (Alowais et al., Reference Alowais, Alghamdi, Alsuhebany, Alqahtani, Alshaya, Almohareb, Aldairem, Alrashed, Bin Saleh, Badreldin and Al Yami2023).
AI in cancer diagnosis
AI has the potential to significantly advance cancer diagnosis by using annotated medical data, advanced ML techniques and enhanced processing power (Sufyan et al., Reference Sufyan, Shokat and Ashfaq2023; Alshuhri et al., Reference Alshuhri, Al-Musawi, Al-Alwany, Uinarni, Rasulova, Rodrigues, Alkhafaji, Alshanberi, Alawadi and Abbas2024). These developments are expected to transform patient care by improving efficiency, accuracy and customization in diagnoses and treatments (Chen and Decary, Reference Chen and Decary2020). Over the past decade, DL architectures have outperformed traditional ML methods in cancer diagnosis, effectively utilizing genomic and phenotype data for cancer classification and treatment (Miotto et al., Reference Miotto, Wang, Wang, Jiang and Dudley2018). Computer-aided detection and diagnosis (CADx) are playing important roles in clinical imaging and are expected to improve further (He et al., Reference He, Chen, Tan, Elingarami, Liu, Li, Deng, He, Li, Fu and Li2020; Jairam and Ha, Reference Jairam and Ha2022). While medical imaging remains essential for the early identification and monitoring of cancer, AI technology has the potential to improve the accuracy of clinical image analysis for identifying cancer progression, thereby aiding early detection and diagnosis (Suberi et al., Reference Suberi, Zakaria and Tomari2017; Liu et al., Reference Liu, Chi, Li, Li, Liang, Liu, Wang and He2020; Lathwal et al., Reference Lathwal, Kumar, Arora and Raghava2020).
AI in breast cancer pathology
Breast cancer (BC) is the most diagnosed cancer among women and the second leading cause of cancer-related deaths worldwide (Watkins, Reference Watkins2019), with approximately 2.3 million new cases and 685,000 fatalities reported in 2020 (Hanna and Pantanowitz, Reference Hanna and Pantanowitz2017; Nardin et al., Reference Nardin, Mora, Varughese, D’Avanzo, Vachanaram, Rossi, Saggia, Rubinelli and Gennari2020). It represents 25% of all newly diagnosed cancer cases, and projections suggest an increase to nearly 2.96 million cases by 2040 (Sedeta et al., Reference Sedeta, Jobre and Avezbakiyev2023). Accurate histopathological diagnosis is crucial, as it confirms the presence of tumor cells and helps classify the type and grade of cancer (Cardoso et al., Reference Cardoso, Kyriakides, Ohno, Penault-Llorca, Poortmans, Rubio, Zackrisson and Senkus2019). Discrepancies in diagnoses can significantly affect treatment decisions, highlighting the need for precise pathologic assessments (Soliman et al., Reference Soliman, Li and Parwani2024).
Despite advancements in imaging-based diagnostics and therapies, the field of histopathology has been slow to digitize, beginning this process only about two decades ago (Hanna et al., Reference Hanna, Reuter, Hameed, Tan, Chiang, Sigel, Hollmann, Giri, Samboy, Moradel, Rosado, Otilano, England, Corsale, Stamelos, Yagi, Schüffler, Fuchs, Klimstra and Sirintrapun2019; Försch et al., Reference Försch, Klauschen, Hufnagl and Roth2021). Histopathological diagnosis has remained largely unchanged, still relying on microscopic evaluations by pathologists. This reliance can lead to errors, such as false positives or negatives, especially under stress (Morelli et al., Reference Morelli, Porazzi, Ruspini, Restelli and Banfi2013; Cohen et al., Reference Cohen, Martin, Gross, Johnson, Robboy, Wheeler, Johnson and Black-Schaffer2022). Studies have shown significant variability in pathologists’ assessments, with a concordance rate of only 75.3% overall and a particularly low rate of 48% for ductal carcinoma in situ (DCIS) and atypical hyperplasia, indicating ambiguity in pathology interpretations (Elmore et al., Reference Elmore, Longton, Carney, Geller, Onega, Tosteson, Nelson, Pepe, Allison, Schnitt, O’Malley and Weaver2015).
The use of AI in cancer diagnosis is essential for improving diagnostic processes and addressing the shortage of pathologists alongside the increasing number of cancer cases (Robboy et al., Reference Robboy, Gross, Park, Kittrie, Crawford, Johnson, Cohen, Karcher, Hoffman, Smith and Black-Schaffer2020). AI in pathology relies on whole-slide imaging (WSI) technology (Niazi et al., Reference Niazi, Parwani and Gurcan2019; Ahn et al., Reference Ahn, Shin, Yang, Park, Kim, Cho, Ock and Kim2023), which converts physical pathological slides into high-resolution digital images. These images are an abundant source of information, with sizes up to 100,000 × 100,000 pixels, and are the first step in creating AI-assisted models (Mukhopadhyay et al., Reference Mukhopadhyay, Feldman, Abels, Ashfaq, Beltaifa, Cacciabeve, Cathro, Cheng, Cooper, Dickey, Gill, Heaton, Kerstens, Lindberg, Malhotra, Mandell, Manlucu, Mills, Mills, Moskaluk and Taylor2018; Niazi et al., Reference Niazi, Parwani and Gurcan2019). WSI aids in the easy sharing and consultation of images, reducing interpretation errors and enhancing the analysis of complex cases (Hanna et al., Reference Hanna, Reuter, Hameed, Tan, Chiang, Sigel, Hollmann, Giri, Samboy, Moradel, Rosado, Otilano, England, Corsale, Stamelos, Yagi, Schüffler, Fuchs, Klimstra and Sirintrapun2019; Jones et al., Reference Jones, Nazarian, Duncan, Kamionek, Lauwers, Tambouret, Wu, Nielsen, Brachtel, Mark, Sadow, Grabbe and Wilbur2015). Overall, WSI presents a promising alternative to conventional microscopic examination, which has limitations due to its ephemeral nature (Mukhopadhyay et al., Reference Mukhopadhyay, Feldman, Abels, Ashfaq, Beltaifa, Cacciabeve, Cathro, Cheng, Cooper, Dickey, Gill, Heaton, Kerstens, Lindberg, Malhotra, Mandell, Manlucu, Mills, Mills, Moskaluk and Taylor2018; Tizhoosh and Pantanowitz, Reference Tizhoosh and Pantanowitz2018; Hanna et al., Reference Hanna, Reuter, Hameed, Tan, Chiang, Sigel, Hollmann, Giri, Samboy, Moradel, Rosado, Otilano, England, Corsale, Stamelos, Yagi, Schüffler, Fuchs, Klimstra and Sirintrapun2019).
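Because a single WSI is far too large to process at once, AI pipelines typically begin by tiling the image into smaller patches. The sketch below illustrates this preprocessing step in a minimal form, assuming the openslide-python library and a hypothetical slide file; it is not the workflow of any specific study reviewed here.

```python
# Minimal WSI tiling sketch: stream fixed-size patches from a whole-slide image.
# Assumes the openslide-python library; "slide.svs" is a hypothetical file path.
import openslide

PATCH = 512                        # patch edge length in pixels at level 0
slide = openslide.OpenSlide("slide.svs")
width, height = slide.dimensions   # full resolution, can approach 100,000 x 100,000 pixels

n_patches = 0
for y in range(0, height - PATCH + 1, PATCH):
    for x in range(0, width - PATCH + 1, PATCH):
        patch = slide.read_region((x, y), 0, (PATCH, PATCH)).convert("RGB")
        # A real pipeline would filter out background here (e.g., by tissue fraction)
        # and pass tissue-bearing patches to a trained classifier or detector.
        n_patches += 1

print(f"{width} x {height} slide yields {n_patches} non-overlapping {PATCH}px patches")
```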
Digitalization in pathology has been slower compared to other medical specialties, largely due to pathologists’ reluctance to abandon traditional methods and the presence of barriers such as regulatory and cost issues as well as “pathologist technophobia” (Hanna et al., Reference Hanna, Reuter, Hameed, Tan, Chiang, Sigel, Hollmann, Giri, Samboy, Moradel, Rosado, Otilano, England, Corsale, Stamelos, Yagi, Schüffler, Fuchs, Klimstra and Sirintrapun2019; Hanna and Pantanowitz, Reference Hanna and Pantanowitz2019; Moxley-Wyles et al., Reference Moxley-Wyles, Colling and Verrill2020; Försch et al., Reference Försch, Klauschen, Hufnagl and Roth2021). Despite these challenges, AI has shown promise in enhancing diagnostic capabilities. Studies indicate that digital slide review is as effective as manual methods (Loughrey et al., Reference Loughrey, Kelly, Houghton, Coleman, Carson, Salto-Tellez and Hamilton2015; Elmore et al., Reference Elmore, Barnhill, Elder, Longton, Pepe, Reisch, Carney, Titus, Nelson, Onega, Tosteson, Weinstock, Knezevich and Piepkorn2017; Tabata et al., Reference Tabata, Mori, Sasaki, Itoh, Shiraishi, Yoshimi, Maeda, Harada, Taniyama, Taniyama, Watanabe, Mikami, Sato, Kashima, Fujimura and Fukuoka2017). AI algorithms have been developed for detecting and classifying BC, achieving high accuracy in differentiating between benign and malignant tumors (Cruz-Roa et al., Reference Cruz-Roa, Gilmore, Basavanhally, Feldman, Ganesan, Shih, Tomaszewski, González and Madabhushi2017). A DL model differentiated benign from malignant tumors when tested on eight categories of images, four benign and four malignant, with an accuracy of 93.2% (Han et al., Reference Han, Wei, Zheng, Yin, Li and Li2017). The breast cancer histology (BACH) challenge demonstrated that AI could achieve accuracy levels comparable to pathologists, improving overall diagnostic performance and interobserver concordance (Polónia et al., Reference Polónia, Campelos, Ribeiro, Aymore, Pinto, Biskup-Fruzynska, Veiga, Canas-Marques, Aresta, Araújo, Campilho, Kwok, Aguiar and Eloy2021). Additionally, DL models have successfully identified markers in BC and utilized nuclear characteristics to predict risk categories for patients (Romo-Bucheli et al., Reference Romo-Bucheli, Janowczyk, Gilmore, Romero and Madabhushi2016; Lu et al., Reference Lu, Romo-Bucheli, Wang, Janowczyk, Ganesan, Gilmore, Rimm and Madabhushi2018; Whitney et al., Reference Whitney, Corredor, Janowczyk, Ganesan, Doyle, Tomaszewski, Feldman, Gilmore and Madabhushi2018).
For example, the visual assessment of mitotic figures in BC histological sections stained with hematoxylin and eosin (H&E), referred to as the mitotic score, serves as the gold-standard method for evaluating the proliferative activity of BC (Aleskandarany et al., Reference Aleskandarany, Green, Benhasouna, Barros, Neal, Reis-Filho, Ellis and Rakha2012; van Dooijeweert et al., Reference van Dooijeweert, van Diest and Ellis2021). However, pathologists face challenges in manually counting mitoses in histopathology slides, a process that is time-consuming (Cree et al., Reference Cree, Tan, Travis, Wesseling, Yagi, White, Lokuhetty and Scolyer2021). To address this, various contests, such as the MITOSIS detection contest, have facilitated advancements in automated counting methods (Aubreville et al., Reference Aubreville, Stathonikos, Donovan, Klopfleisch, Ammeling, Ganz, Wilm, Veta, Jabari, Eckstein, Annuscheit, Krumnow, Bozaba, Çayır, Gu, Chen, Jahanifar, Shephard, Kondo, Kasai, Kotte, Saipradeep, Lafarge, Koelzer, Wang, Zhang, Yang, Wang, Breininger and Bertram2024). Recent studies have demonstrated that DL models can accurately count mitotic figures from H&E-stained slides of early-stage ER-positive breast tumors, significantly reducing the time required for pathologists to read slides (Roux et al., Reference Roux, Racoceanu, Loménie, Kulikova, Irshad, Klossa, Capron, Genestie, Le Naour and Gurcan2013; ICPR, 2014; Romo-Bucheli et al., Reference Romo-Bucheli, Janowczyk, Gilmore, Romero and Madabhushi2017; Veta et al., Reference Veta, Heng, Stathonikos, Bejnordi, Beca, Wollmann, Rohr, Shah, Wang, Rousson, Hedlund, Tellez, Ciompi, Zerhouni, Lanyi, Viana, Kovalev, Liauchuk, Phoulady, Qaiser and Pluim2019). Notably, algorithms utilizing advanced architectures such as ResNet-101 and Faster R-CNN have shown high accuracy, with one approach reducing reading time by 27.8% (Pantanowitz et al., Reference Pantanowitz, Hartman, Qi, Cho, Suh, Paeng, Dhir, Michelow, Hazelhurst, Song and Cho2020). Furthermore, a comprehensive automated system developed by Nateghi et al. (Reference Nateghi, Danyali and Helfroush2021) can identify regions of interest for high mitotic activity, count mitoses from WSIs and predict tumor proliferation scores, outperforming previous methods. Although these AI models have yet to be deployed in formal clinical practice, the integration of digitalization and AI in pathology has the potential to enhance accuracy, reduce human errors and optimize the time needed for pathologists to review slides, ultimately benefiting both pathologists and patients (Aeffner et al., Reference Aeffner, Zarella, Buchbinder, Bui, Goodman, Hartman, Lujan, Molani, Parwani, Lillard, Turner, Vemuri, Yuil-Valdes and Bowman2019; Kim et al., Reference Kim, Kang, Song and Kim2022).
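Detection-style approaches of this kind treat each mitotic figure as an object to be localized and then counted. The sketch below conveys that framing using torchvision’s generic Faster R-CNN (with a ResNet-50 backbone as a stand-in for the ResNet-101-based systems cited above); the patch, score threshold and untuned weights are illustrative assumptions only.

```python
# Sketch: counting candidate mitotic figures with an object-detection model.
# Illustrative only; uses torchvision's generic Faster R-CNN, not a model trained on mitoses.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()
# In a real system the detection head would be fine-tuned on annotated mitotic figures.

patch = torch.rand(3, 512, 512)           # hypothetical H&E image patch, values in [0, 1]
with torch.no_grad():
    prediction = model([patch])[0]        # dict with "boxes", "labels", "scores"

score_threshold = 0.5                     # hypothetical operating point
mitosis_count = int((prediction["scores"] > score_threshold).sum())
print("candidate mitotic figures in patch:", mitosis_count)
```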
This systematic review aims to examine AI models and their effectiveness in diagnosing BC, taking into account existing problems in the field of pathological diagnosis as well as the potential advantages of incorporating AI. Additionally, it investigates the potential of AI models to offer second opinions and their integration into the pathology workflow.
Methodology
Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement
This review process follows the guidelines set out in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement, which was first developed in 2009 and updated in 2020. PRISMA functions as a framework specifically created to standardize the process of conducting systematic reviews and improve the thoroughness of their reporting (Page et al., Reference Page, McKenzie, Bossuyt, Boutron, Hoffmann, Mulrow, Shamseer, Tetzlaff, Akl, Brennan, Chou, Glanville, Grimshaw, Hróbjartsson, Lalu, Li, Loder, Mayo-Wilson, McDonald, McGuinness, Stewart, Thomas, Tricco, Welch, Whiting and Moher2021).
Search strategy
A comprehensive literature search was carried out across three electronic databases, PubMed, EMBASE and Cochrane Library, to identify original articles that met the specified inclusion and exclusion criteria and were published up to April 2024. The search involved the use of keywords, their synonyms and Boolean operators, as illustrated in Figure 1. Additionally, the bibliographies of all relevant articles were reviewed to identify additional studies that could potentially be included in the analysis. Titles were then screened against the predetermined inclusion criteria, focusing on studies that assessed the application of AI in BC diagnosis. Notably, no restrictions were placed on publication year or country of origin of the studies. Subsequently, a full-text screening was carried out to include the most pertinent research papers for subsequent data extraction and analysis.

Figure 1. Boolean search with keywords and their synonyms.
Selection process
The literature search results were screened in a two-step process. Initially, the titles and abstracts of all articles were assessed for eligibility. After identifying relevant articles, full-text screening was conducted for the studies that met the eligibility criteria. The screening was done by applying the predefined inclusion criteria.
Eligibility criteria
All research papers using AI in BC diagnosis or staging, with pathologists’ reports or known datasets as the reference test, were included. All types of original studies (either prospective or retrospective) containing their own data on AI validation, or on development and validation, were included. Only studies published in English were included. We further excluded other types of publications (reviews, single case reports, editorial material, books and comments), papers in languages other than English, articles dealing with other malignancies, articles that used AI to analyze data other than histological or cytological images (e.g., MRI, mammography) and articles with non-AI approaches for diagnosis (i.e., slide flow).
Inclusion criteria
The included studies addressed the following:
▪ BC diagnosis based on histopathology.
▪ AI models and their diagnostic accuracy, sensitivity, specificity and area under the curve (AUC) in BC diagnosis.
▪ The potential of AI models to be integrated into routine practice and to provide a second opinion in BC diagnosis.
Exclusion criteria
The following studies were excluded:
▪ Articles that used AI to analyze data other than histological or cytological images (e.g., MRI, mammography).
▪ Articles with non-AI approaches for diagnosis (i.e., slide flow).
▪ Review papers, case studies, editorials, book chapters and commentaries.
▪ Papers in languages other than English.
▪ Articles dealing with other malignancies.
Data from the included studies was extracted and recorded in a standardized data extraction sheet. The extracted data encompassed two main categories: (1) characteristics of the included studies and (2) outcome measures.
Results
Literature search and screening
A thorough search of the three main databases, namely PubMed, Cochrane CENTRAL and EMBASE, resulted in a total of 3113 records. After removing duplicates and irrelevant records, 1849 unique studies were assessed for eligibility based on their titles and abstracts. Of these, 1517 studies that did not meet our inclusion criteria were excluded. The abstracts of the remaining 332 articles were obtained for further evaluation. Application of the predefined criteria led to the exclusion of 258 studies. The full text of the remaining 74 articles was reviewed, of which 47 were excluded due to a lack of relevant outcomes, inappropriate study design or poor quality. Finally, 27 studies were included in the systematic review. The flow diagram of the literature search and screening process is shown in Figure 2.

Figure 2. Flowchart describing the literature inclusion process.
Characteristics of included studies
Twenty-seven research papers were analyzed in a thorough literature analysis. These studies specifically investigated the development of AI algorithms for diagnosing BC using histopathology images. The training data used in these studies showed substantial variety, with histopathology images obtained from 21 to more than 400 patients. The literature evaluation includes papers published from 2017 to 2024, originating from various geographical regions including Korea, India, Egypt, the Netherlands, Brazil, Spain, Australia, Turkey, Poland, Japan, China and Malaysia. The research employed a range of AI techniques, with CNNs being particularly prominent; these were typically used in combination with other methods such as support vector machines (SVMs) or ensemble learning approaches. The datasets used for training and validation included the well-known BreakHis (Spanhol et al., Reference Spanhol, Oliveira, Petitjean and Heutte2016), BACH (Aresta et al., Reference Aresta, Araújo, Kwok, Chennamsetty, Safwan, Alex, Marami, Prastawa, Chan, Donovan, Fernandez, Zeineh, Kohl, Walz, Ludwig, Braunewell, Baust, Vu, To, Kim, Kwak, Galal, Sanchez-Freire, Brancati, Frucci, Riccio, Wang, Sun, Ma, Fang, Kone, Boulmane, Campilho, Eloy, Polónia and Aguiar2019) and ICIAR 2018 Grand Challenge (ICIAR, 2018) datasets. The pixel resolutions of the images varied across the investigations, which affected the level of detail and the potential effectiveness of the models. In addition, the performance metrics in the included studies showed variation, with accuracy, F1-score and AUC being the most frequently reported. Table 1 provides a concise summary of the fundamental features of the research.
Table 1. Summary of studies categorizing breast lesions (i.e., benign vs. malignant)

AUC, area under the curve; BACH, breast cancer histology challenge; BreakHis, breast cancer histopathological database; CCM, correctly classified malignant; CCR, correct classification rate; CKD, covariance-kernel descriptor; CNNs, convolutional neural networks; DCNNs, deep CNNs; DenseNet, dense neural network; FABCD, fully annotated breast cancer database; LERM, log-Euclidean Riemannian metric; MFSCNet, classification of mammary cancer fusing spatial and channel features network; MLP, multilayer perceptron; QDA, quadratic discriminant analysis; ResNet-152, deep residual network with 152 layers; RF, random forest; SVM, support vector machine; WAID, weakly annotated image descriptor; WSIs, whole-slide images.
Summary of findings
Breast lesion categorization
Researchers have developed innovative AI models for classifying breast lesions as benign or malignant, with the potential to identify histological subtypes (Singh et al., Reference Singh, Kumar, Gupta and Al-Turjman2024). The SegEIR-Net model combines segmentation and classification techniques using EfficientNet, InceptionNet and ResNet, achieving high accuracies on the BreakHis dataset (up to 98.66%) and strong results on the BACH and UCSB datasets (93.33% and 96.55%, respectively) (Singh et al., Reference Singh, Kumar, Gupta and Al-Turjman2024). Additionally, the Multilevel Context and Uncertainty aware (MCUa) model categorizes breast histology images into four types: normal tissue, benign lesion, in situ carcinoma and invasive carcinoma. The MCUa model demonstrated impressive performance, with static ensemble accuracy reaching 95.75% and dynamic accuracy reaching 98.11% on the BACH dataset, and outstanding results on the BreakHis dataset (up to 100% accuracy) (Senousy et al., Reference Senousy, Abdelsamea, Gaber, Abdar, Acharya, Khosravi and Nahavandi2022).
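Many of these high-performing systems combine several CNN backbones, as in the static ensembles described above. The sketch below shows the general ensembling idea, averaging softmax outputs from multiple ImageNet-pretrained backbones; it is a simplified illustration and not the published SegEIR-Net or MCUa architecture.

```python
# Generic "static ensemble" sketch: average softmax outputs of several CNN backbones.
# Illustrative only; in practice each backbone would first be fine-tuned on histology patches.
import torch
import torchvision

backbones = [
    torchvision.models.resnet18(weights="DEFAULT"),
    torchvision.models.densenet121(weights="DEFAULT"),
    torchvision.models.efficientnet_b0(weights="DEFAULT"),
]
for m in backbones:
    m.eval()

batch = torch.rand(2, 3, 224, 224)   # hypothetical preprocessed image patches
with torch.no_grad():
    probs = torch.stack([torch.softmax(m(batch), dim=1) for m in backbones])

ensemble_probs = probs.mean(dim=0)   # simple unweighted average across models
predicted_class = ensemble_probs.argmax(dim=1)
print(predicted_class)
```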
The context-aware stacked CNN (CAS-CNN) has demonstrated strong performance in classifying breast WSIs, achieving an AUC of 0.962 in distinguishing normal or benign slides from malignant ones (DCIS and IDC). The system exhibited a precision of 89.1% for categorizing WSIs and an overall accuracy of 81.3% in a three-class classification involving normal/benign, DCIS and IDC categories. While it effectively differentiated between normal/benign and IDC slides, it faced challenges in distinguishing between normal/benign and DCIS slides, as well as between DCIS and IDC slides (Bejnordi et al., Reference Bejnordi, Veta, Van Diest, Van Ginneken, Karssemeijer, Litjens, Van Der Laak, Hermsen, Manson, Balkenhol and Geessink2017). Additional details on studies classifying breast lesions are provided in Tables 1–3.
Table 2. Summary of studies categorizing breast lesions (i.e., normal/benign/in situ/invasive)

AUC, area under the curve; BACH, breast cancer histology challenge; BCBH, bioimaging challenge 2015 breast histology; BreakHis, breast cancer histopathological database; CCM, correctly classified malignant; CCR, correct classification rate; CKD, covariance-kernel descriptor; CNNs, convolutional neural networks; DCNNs, deep CNNs; DenseNet, dense neural network; FABCD, fully annotated breast cancer database; FCNs, fully convolutional networks; LERM, log-Euclidean Riemannian Metric; MCUa, multilevel context and uncertainty aware; MFSCNet, classification of mammary cancer fusing spatial and channel features network; MLP, multilayer perceptron; PDIs, phylogenetic diversity indices; QDA, quadratic discriminant analysis; ResNet-152, deep residual network with 152 layers; RF, random forest; SimCLR, simple framework for contrastive learning of visual representations; SOA, state-of-the-art; SSL, self-supervised learning; SVM, support vector machine; UCSB, University of California Santa Barbara; WAID, weakly annotated image descriptor; WSIs, whole-slide images; Xgboost, eXtreme Gradient Boost.
Table 3. Summary of studies assessing different histopathological subtypes of both benign and malignant breast lesions

AUC, area under the curve; BACH, breast cancer histology challenge; BCBH, bioimaging challenge 2015 breast histology; BreakHis, breast cancer histopathological database; CCM, correctly classified malignant; CCR, correct classification rate; CKD, covariance-kernel descriptor; CNNs, convolutional neural networks; DCNNs, deep CNNs; DenseNet, dense neural network; FABCD, fully annotated breast cancer database; FCNs, fully convolutional networks; LERM, log-Euclidean Riemannian Metric; MCUa, multilevel context and uncertainty aware; MFSCNet, classification of mammary cancer fusing spatial and channel features network; MLP, multilayer perceptron; PDIs, phylogenetic diversity indices; QDA, quadratic discriminant analysis; ResNet-152, deep residual network with 152 layers; RF, random forest; SimCLR, simple framework for contrastive learning of visual representations; SOA, state-of-the-art; SSL, self-supervised learning; SVM, support vector machine; UCSB, University of California Santa Barbara; WAID, weakly annotated image descriptor; WSIs, whole-slide images; Xgboost, eXtreme Gradient Boost.
Molecular subtyping
After the identification of a malignant breast lesion, immunohistochemistry is commonly performed to ascertain the molecular subtype. This procedure involves analyzing the levels of ER, PR, Her2 and the Ki67 mitotic index to determine the subtype, which can include luminal A, luminal B, Her2-enriched or triple negative, among other possibilities. The results of the studies regarding the use of AI models for molecular subtyping, with or without Ki67 measurement, are summarized in Table 4. The AI models exhibited impressive performance, with AUCs of 0.75–0.91 versus 0.67–0.80 for conventional multiple-instance learning models (Bae et al., Reference Bae, Jeon, Hwangbo, Yoo, Han and Feng2023) and an accuracy of approximately 90% for an automated BC classification system utilizing SVM (Aswathy and Jagannath, Reference Aswathy and Jagannath2021).
Table 4. Summary of studies assessing breast cancer molecular subtyping (i.e., according to estrogen receptors (ER), progesterone receptors (PR) and Her2 – with or without ki67 mitotic index analysis)

AUC, area under the curve; BCBH, bioimaging challenge 2015 breast histology; BreakHis, breast cancer histopathological database; MLP, multilayer perceptron; SVM, support vector machine; UCSB, University of California Santa Barbara; WSIs, whole-slide images; Xgboost, eXtreme Gradient Boost.
Mitotic index assessment and quantification
Several studies have aimed to create models for identifying the mitotic proliferation index (Ki67) in BC. A notable approach is the FMDet method, designed to detect mitotic rates in breast histopathology images while addressing the domain shift problem caused by variability across different scanners (Wang et al., Reference Wang, Zhang, Yang, Xiang, Luo, Wang, Zhang, Yang, Huang and Han2023). To enhance model applicability, two key strategies were implemented:
1. A novel data augmentation technique using Fourier analysis was introduced, which alters the frequency characteristics of training images to generate diverse samples that reflect real-world datasets. This involves replacing the low-frequency components of a source image with those from a reference image from a different domain (a minimal sketch of this swap follows the list below).
2. Pixel-level annotations, specifically “instance masks,” were utilized for identifying mitotic figures. These masks, derived from the Mitosis Domain Generalization challenge (MIDOG) 2021 bounding box data and a pretrained nucleus segmentation model (HoVer-Net), improved detection accuracy by allowing the network to capture subtle morphological differences in mitotic cells, surpassing traditional bounding box annotations (Wang et al., Reference Wang, Zhang, Yang, Xiang, Luo, Wang, Zhang, Yang, Huang and Han2023). The model outperformed all other submissions in the challenge, with an F1 score of 0.77 (Wang et al., Reference Wang, Zhang, Yang, Xiang, Luo, Wang, Zhang, Yang, Huang and Han2023).
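The first strategy above, swapping low-frequency Fourier components between domains, can be expressed compactly in NumPy. The sketch below is a minimal re-implementation of the general idea (replacing the low-frequency amplitude of a source image with that of a reference image while keeping the source phase); it is not the FMDet authors’ code, and the band-size parameter beta is a hypothetical choice.

```python
# Sketch of Fourier-based augmentation: replace the low-frequency amplitude of a source
# image with that of a reference image from another domain (e.g., another scanner).
# Minimal re-implementation of the general idea, not the FMDet authors' code.
import numpy as np

def low_freq_swap(source: np.ndarray, reference: np.ndarray, beta: float = 0.1) -> np.ndarray:
    """source, reference: float grayscale images of equal shape; beta sets the swapped band."""
    src_fft = np.fft.fftshift(np.fft.fft2(source))
    ref_fft = np.fft.fftshift(np.fft.fft2(reference))

    h, w = source.shape
    bh, bw = int(h * beta), int(w * beta)
    cy, cx = h // 2, w // 2

    # Swap amplitudes in the central (low-frequency) band, keep the source phase.
    src_amp, src_phase = np.abs(src_fft), np.angle(src_fft)
    ref_amp = np.abs(ref_fft)
    src_amp[cy - bh:cy + bh, cx - bw:cx + bw] = ref_amp[cy - bh:cy + bh, cx - bw:cx + bw]

    mixed = src_amp * np.exp(1j * src_phase)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))

augmented = low_freq_swap(np.random.rand(256, 256), np.random.rand(256, 256))
print(augmented.shape)
```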
A DL model specifically targeted at identifying mitotic proliferation in breast histopathology images has shown considerable potential (Saha et al., Reference Saha, Chakraborty and Racoceanu2018). The precision, recall and F-score of this model were 92%, 88% and 90%, respectively. Significantly, the model’s performance improved when handcrafted features were integrated into its DL architecture. The model was trained and tested using datasets from MITOS-ATYPIA-14, ICPR-2012 and AMIDA-13. The study’s findings highlight the model’s high true positive rate, which indicates its precise ability to identify mitotic cells. Details of the included studies focusing on training models to detect and quantify the mitotic proliferation index are tabulated in Table 5.
Table 5. Summary of studies assessing the Ki67 mitotic index

AUC, area under the curve; SimCLR, simple framework for contrastive learning of visual representations; SSL, self-supervised learning; WSIs, whole-slide images.
Obstacles to widespread adoption of AI
Despite the remarkable outcomes demonstrated by the examined AI models in research contexts, substantial challenges and constraints persist that hinder the extensive use of AI on a broader scale (Soliman et al., Reference Soliman, Li and Parwani2024). A primary procedural drawback of supervised AI models is their need for extensive, annotated datasets for training. Manual annotation is labor-intensive and exhibits variability both within and across pathologists, undermining the fundamental objective of AI models. Similarly, the intraclass heterogeneity and dependence on binary categorization during training, as previously mentioned, constitute additional barriers that may impact the efficacy of AI models. Furthermore, the lack of a defined area size for assessing Ki67 may lead to either an overestimation or underestimation of Ki67 activity. Moreover, certain preanalytic variables, including suboptimal sample quality, air bubbles, staining artifacts, unexpected staining patterns and discrepancies in interlaboratory sample preparation and staining techniques, remain unavoidable to this day. Consequently, each AI tool necessitates validation and verification under these specific conditions. The identified variables may influence the accuracy of specimen analysis conducted by the AI model, resulting in incorrect outcomes. Additionally, significant economic and regulatory challenges persist regarding the implementation of AI technology in pathology laboratories on a global or national scale (van Diest et al., Reference van Diest, Flach, van Dooijeweert, Makineli, Breimer, Stathonikos, Pham, Nguyen and Veta2024). The concern regarding AI’s potential to replace pathologists raises ethical issues, as noted by Moxley-Wyles et al. (Reference Moxley-Wyles, Colling and Verrill2020) and van Diest et al. (Reference van Diest, Flach, van Dooijeweert, Makineli, Breimer, Stathonikos, Pham, Nguyen and Veta2024). Critics, comprising both individuals and governmental bodies, have articulated concerns regarding the notion that a patient’s treatment may be dictated by an AI analysis. This is particularly concerning as most existing algorithms remain in the early stages of development, having been tested on limited populations and lacking adequate safety data. The broad adoption of this technology faces obstacles stemming from the diverse staining techniques employed across laboratories, the necessity for full digitalization prior to implementation, and the partial integration of Picture Archiving and Communication Systems with AI algorithms (van Diest et al., Reference van Diest, Flach, van Dooijeweert, Makineli, Breimer, Stathonikos, Pham, Nguyen and Veta2024). Addressing these concerns is crucial prior to the deployment of AI models in clinical practice.
Addressing limited sample size
A study aimed at improving BC histopathology image classification addressed the challenge of limited slide image datasets through data augmentation and transfer learning (Zhu et al., Reference Zhu, Song, Wang, Dong, Guo and Liu2019). The researchers developed a hybrid CNN that combines local and global visual inputs to capture detailed features and structure. They introduced a Squeeze-Excitation-Pruning block to reduce the model size without sacrificing accuracy. To enhance generalization, they used a multimodel assembly technique, training various models on different data subsets and merging their predictions. This approach proved more effective than single-model systems and outperformed existing methods on the BACH dataset (87.5% patient-level and 84.4% image-level accuracy), indicating its potential for real-world clinical use (Zhu et al., Reference Zhu, Song, Wang, Dong, Guo and Liu2019).
Recent studies have explored the development of advanced models for classifying BC subtypes using histopathology images. A hybrid CNN-LSTM model achieved high accuracy in binary and multiclass classifications, with a binary accuracy ranging from 98.07% to 99.75% and multiclass accuracy from 88.04% to 96.5% (Srikantamurthy et al., Reference Srikantamurthy, Rallabandi, Dudekula, Natarajan and Park2023). Another study utilized a pretrained DenseNet-169 model on the BreakHis dataset, achieving 98.73% accuracy on the validation set and 94.55% on the test set, highlighting the importance of compatible-domain transfer learning – a method through which histological images are used in the pretraining of the model and then fine-tuned on a finite cytological target dataset (Shamshiri et al., Reference Shamshiri, Krzyżak, Kowal and Korbicz2023). Additionally, various studies emphasize the importance of feature selection and fusion to improve classification accuracy, noting that integrating DL features with handcrafted attributes can lead to better outcomes. Multiscale analysis, incorporating different image patch scales, also contributes to enhanced accuracy in classification tasks (Attallah et al., Reference Attallah, Anwar, Ghanem and Ismail2021; Liu et al., Reference Liu, Feng, Chen, Liu, Qu and Yang2022).
Training datasets
Another important consideration in the included studies is the choice of dataset and the use of transfer learning. Multiple studies have utilized the BreakHis dataset, which contains images of varying magnification levels, to advance BC diagnosis, despite challenges like small sample sizes and data variability (Gandomkar et al., Reference Gandomkar, Brennan and Mello-Thoms2018; Nahid et al., Reference Nahid, Mehrabi and Kong2018; Carvalho et al., Reference Carvalho, Antonio Filho, Silva, Araujo, Diniz, Silva, Paiva and Gattass2020). To address these issues, researchers have applied transfer learning techniques, initially training models on larger datasets like ImageNet before fine-tuning them on BreakHis (Kanavati et al. Reference Kanavati, Ichihara and Tsuneki2022; Laxmisagar and Hanumantharaju, Reference Laxmisagar and Hanumantharaju2022; Liu et al., Reference Liu, Hu, Tang, Wang, He, Zeng, Lin, He and Huo2022; Xu et al., Reference Xu, An, Zhang, Liu and Lu2022). Additionally, compatible-domain transfer learning has been shown to boost model performance (Shamshiri et al., Reference Shamshiri, Krzyżak, Kowal and Korbicz2023). Various approaches have been explored to make up for insufficient training datasets, including the combination of multiple classifiers and a self-trained AI algorithm utilizing a three-stage analysis technique with a WSI stacking system (Bae et al., Reference Bae, Jeon, Hwangbo, Yoo, Han and Feng2023). Overall, these efforts highlight the potential of DL and image analysis in enhancing BC diagnosis and prognosis.
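In practice, transfer learning of the kind described above usually means loading ImageNet-pretrained weights and retraining only a new classification head on the smaller histology dataset. The sketch below illustrates this setup with torchvision’s DenseNet-169, mirroring the backbone mentioned earlier; the two-class head, frozen features and hyperparameters are illustrative assumptions rather than settings from the cited studies.

```python
# Transfer-learning sketch: start from ImageNet-pretrained DenseNet-169 and fine-tune
# a new two-class head (e.g., benign vs. malignant). Hyperparameters are illustrative.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.densenet169(weights="DEFAULT")     # ImageNet-pretrained backbone

for param in model.parameters():                               # optionally freeze pretrained features
    param.requires_grad = False

model.classifier = nn.Linear(model.classifier.in_features, 2)  # new trainable output head

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch standing in for histology patches.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print("loss:", float(loss))
```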
Discussion
Commercially available AI models specifically designed for BC detection already exist (Soliman et al., Reference Soliman, Li and Parwani2024). These include Mindpeak, Owkin, Visiopharm, Paige Breast Suite and IBEX Galen Breast. Mindpeak is located in Hamburg, Germany (Abele et al., Reference Abele, Tiemann, Krech, Wellmann, Schaaf, Länger, Peters, Donner, Keil, Daifalla, Mackens, Mamilos, Minin, Krümmelbein, Krause, Stark, Zapf, Päpper, Hartmann and Lang2023); Owkin is based in Paris, France; Visiopharm is situated in Hovedstaden, Denmark (Shafi et al., Reference Shafi, Kellough, Lujan, Satturwar, Parwani and Li2022); Paige Breast Suite is established in New York, United States and IBEX Galen Breast is positioned in Tel Aviv, Israel. These algorithms improve pathologists’ consistency, precision and sensitivity while decreasing time demands (Soliman et al., Reference Soliman, Li and Parwani2024). Nonetheless, various limitations persist that obstruct their implementation on a larger scale. AI-driven specimen analysis still presents specific procedural drawbacks, as observed by Soliman et al. (Reference Soliman, Li and Parwani2024), who also evaluated the contribution of AI to enhancing histological analysis of BC, collectively assessed the efficacy of each AI model and emphasized the potential limitations and downsides associated with each model.
Tissue classification
Most studies in this review evaluated the efficacy of AI in accurately classifying breast tissue specimens, demonstrating significant precision in distinguishing normal (or benign) tissues from malignant ones (Amin and Ahn, Reference Amin and Ahn2023; Attallah et al., Reference Attallah, Anwar, Ghanem and Ismail2021; Gandomkar et al., Reference Gandomkar, Brennan and Mello-Thoms2018; Singh et al., Reference Singh, Kumar, Gupta and Al-Turjman2024). In the research conducted by Singh et al. (Reference Singh, Kumar, Gupta and Al-Turjman2024), the SegEIR-Net model achieved an accuracy surpassing 98% on the BreakHis dataset. Likewise, the Histo-CADx model introduced by Attallah et al. (Reference Attallah, Anwar, Ghanem and Ismail2021) attained an accuracy of 99.54%. The initial study presenting the BreakHis dataset (Spanhol et al., Reference Spanhol, Oliveira, Petitjean and Heutte2016), aimed at evaluating AI model efficacy in tissue classification, attained an accuracy of approximately 97%. Han et al. (Reference Han, Wei, Zheng, Yin, Li and Li2017) conducted a study that attained an average accuracy of approximately 93% for eight-class classification (comprising four benign and four malignant categories) of the BreakHis dataset, in contrast to the traditional binary or ternary classification systems evaluated in most studies. The findings from these studies demonstrate that AI can consistently serve as a dependable instrument for breast specimen classification.
Nonetheless, despite these encouraging results, many issues persist regarding tissue classification. For instance, there remains notable heterogeneity in the datasets employed and the outcomes evaluated. Should research evaluate the efficacy of AI in distinguishing normal from malignant tissue? Benign from malignant? Or perhaps DCIS from invasive carcinoma (Amin and Ahn, Reference Amin and Ahn2023)? In-class heterogeneity exists, and binary classification tasks for AI models (e.g., normal vs. abnormal) are frequently insufficient due to the presence of gray zones and unusual findings in clinical practice (Amin and Ahn, Reference Amin and Ahn2023; Bejnordi et al., Reference Bejnordi, Veta, Van Diest, Van Ginneken, Karssemeijer, Litjens, Van Der Laak, Hermsen, Manson, Balkenhol and Geessink2017). Furthermore, certain unusual subtypes exist but may not be included in training and/or testing datasets (Hatta et al., Reference Hatta, Ichiuji, Mabu, Kugler, Hontani, Okoshi, Fuse, Kawada, Kido, Imamura, Naiki and Inai2023). The presence of in-class heterogeneity and imbalanced datasets complicates the assessment of AI’s accuracy in classifying each category, particularly the rare ones (Amin and Ahn, Reference Amin and Ahn2023; Hatta et al., Reference Hatta, Ichiuji, Mabu, Kugler, Hontani, Okoshi, Fuse, Kawada, Kido, Imamura, Naiki and Inai2023). Similarly, Bejnordi et al. (Reference Bejnordi, Veta, Van Diest, Van Ginneken, Karssemeijer, Litjens, Van Der Laak, Hermsen, Manson, Balkenhol and Geessink2017) reported that although their model excelled in distinguishing benign from malignant tumors, it encountered difficulties with borderline categories such as DCIS, and Amin and Ahn (Reference Amin and Ahn2023) documented similar findings.
Molecular subtyping
Current issues associated with manual molecular subtyping include frequent interpathologist discrepancies, particularly with Her2 status (Robbins et al., Reference Robbins, Fernandez, Han, Wong, Harigopal, Podoll, Singh, Ly, Kuba, Wen, Sanders, Brock, Wei, Fadare, Hanley, Jorns, Snir, Yoon, Rabe, Soong, Reisenbichler and Rimm2023). In contrast to specimen categorization, which relies mostly on visible microscopic features, subtyping is contingent upon supplementary technical factors, such as the extent of staining and/or manual enumeration in the computation of Ki67. Therefore, several studies are being conducted using different AI models to address the challenge of inconsistency (Abele et al., Reference Abele, Tiemann, Krech, Wellmann, Schaaf, Länger, Peters, Donner, Keil, Daifalla, Mackens, Mamilos, Minin, Krümmelbein, Krause, Stark, Zapf, Päpper, Hartmann and Lang2023).
Similar to the reported findings about AI’s performance in tissue classification, the studies assessing molecular subtyping in this review have likewise yielded promising outcomes. The data from the studies reported in this review indicate that AI excelled in both molecular subtyping and Ki67 computation. The research conducted by Bae et al. (Reference Bae, Jeon, Hwangbo, Yoo, Han and Feng2023) attained a molecular subtyping accuracy of 91%. Aswathy and Jagannath (Reference Aswathy and Jagannath2021) found a balanced accuracy of 91.2% for their SVM model in predicting ER, PR and Her2 status. Consistent with findings in this review, several other studies have documented significantly enhanced interpathologist agreement rates following the implementation of AI assistance. AI enhanced interpathologist agreement rates from approximately 88% to 96% for the Ki67 score and from roughly 89% to 93% for ER/PR status (Abele et al., Reference Abele, Tiemann, Krech, Wellmann, Schaaf, Länger, Peters, Donner, Keil, Daifalla, Mackens, Mamilos, Minin, Krümmelbein, Krause, Stark, Zapf, Päpper, Hartmann and Lang2023). Similarly, concerning Her2 status, a study by Jung et al. indicated a significant rise in interpathologist agreement rates, from approximately 49.3% to 74.1% (p < 0.001), with the use of AI-driven ER/PR and Her2 analyzers. Additionally, the concordance rates for ER and PR status improved, albeit to a lesser degree (93.0–96.5%, p = 0.096 for ER; 84.6–91.5%, p = 0.006 for PR) (Jung et al., Reference Jung, Song, Cho, Shin, Lee, Jung, Lee, Park, Song, Park, Song, Park, Lee, Kang, Park, Pereira, Yoo, Chung, Ali and Kim2024).
Conclusion and future directions
Various AI tools may rapidly become an effective aid to histopathologists facing increasing demands for precise and speedy BC diagnosis. The identified drawbacks are significant, however, and need to be effectively addressed before the true benefits of this technology can be realized. It is recommended that future research endeavors focus on the following key areas to improve the performance and validity of existing AI models in the context of histopathological evaluation:
1. Establishment of Standardized Datasets: The creation of standardized, multi-institutional datasets that adhere to consistent preanalytic methodologies – such as sample preparation and tissue staining – should be prioritized to improve the generalizability of AI models across diverse clinical settings.
2. Integration of Multimodal Data: To enhance the predictive performance of AI systems, it is imperative to incorporate additional diagnostic modalities, including but not limited to imaging techniques and molecular profiling, into the analysis of histopathological specimens. This multimodal approach can offer a more comprehensive understanding of disease characteristics.
3. Development of Ethical Protocols: The formulation of robust ethical guidelines is essential to ensure the responsible application of AI technologies. This includes strategies for mitigating biases inherent in data and algorithms and enhancing transparency in the decision-making processes of AI systems.
4. Improvement in Economic Viability: It is crucial to explore cost-effective strategies for the implementation of AI solutions within clinical practice. An analysis of economic sustainability will ultimately support the broader adoption and integration of AI technologies in the healthcare sector.
By addressing these areas, future research can significantly contribute to the advancement of AI methodologies, ensuring they are both practical and ethically sound in clinical applications.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/pcm.2025.10006.
Data availability statement
Data sharing not applicable – no new data generated.
Author contribution
AJ and AS made substantial contributions to the conception and design of the work. A.J. drafted the work and revised it critically for important intellectual content. A.S. was responsible for final revision and approval of the version to be published.
Financial support
This research received no specific grant from any funding agency, commercial or not-for-profit sectors.
Competing interests
The authors declare none.






