A geometric and dosimetric comparison of three AI-based autocontouring packages in the head and neck region

Jussi Sillanpaa; Amit Sood; Margaget Reynolds

doi:10.1017/S1460396925100204

Published online by Cambridge University Press: 06 August 2025

Jussi Sillanpaa*: Affiliation:
Dept. of Radiation Oncology, University of Minnesota, 420 Delaware St SE, Minneapolis, Minnesota, MN USA
Amit Sood: Affiliation:
Dept. of Radiation Oncology, University of Minnesota, 420 Delaware St SE, Minneapolis, Minnesota, MN USA
Margaget Reynolds: Affiliation:
Dept. of Radiation Oncology, University of Minnesota, 420 Delaware St SE, Minneapolis, Minnesota, MN USA
*: Corresponding author: Jussi Sillanpaa; Email: silla032@umn.edu

Abstract

Introduction:

AI-based autocontouring products claim to be able to segment organs with accuracy comparable to humans. We compare the geometric and dosimetric performance of three AI-based autocontouring packages (Autocontour 2.5.6 (“RF”), Annotate 2.3.1 (“TP”) and RT-Mind_AI 1.0 (“MM”)) in the head and neck region.

Methods:

We generated 14 organ at risk (OAR) autocontours on 13 computed tomography (CT) image sets. They were compared with clinical (human-generated) contours. The geometric differences were quantified by calculating Dice coefficients and Hausdorff distances. The autocontours were compared visually with the clinical contours by an expert physician. The autocontour sets were also ranked for accuracy by two physicians. The dosimetric effects were evaluated by recalculating treatment plans on the autocontoured CT sets.

Results:

RF and TP slightly outperformed MM in geometric metrics (the percentage of OARs having mean Dice coefficients > 0.7 was RF 57.1%, TP 64.3% and MM 50.0%). The physician judged RF and TP contours to be more anatomically accurate, on average, than the manual contours (manual contour mean accuracy score 2.49, RF 2.28, MM 3.24, TP 1.93). The mean scores given to the autocontours by the two physicians were better for RF and TP, compared to MM (RF 1.86, MM 2.36, TP 1.77). The dosimetric differences were similar for all three programs and were not strongly correlated with the geometric differences.

Conclusions:

The performance of the three autocontouring packages in the head and neck region is similar, with TP and RF slightly outperforming MM. The correlation between geometric and dosimetric metrics is not strong, and dosimetric evaluation is therefore recommended before clinical use of autocontouring software.

Keywords

AI autocontouring; artificial intelligence; efficiency and quality; head and neck cancer

Information

Type: Original Article
Information: Journal of Radiotherapy in Practice , Volume 24 , 2025 , e33

DOI: https://doi.org/10.1017/S1460396925100204
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: © The Author(s), 2025. Published by Cambridge University Press

Introduction

Contouring organs at risk (OARs) is a crucial task in radiation therapy – the treatment plan and the dose–volume histogram (DVH) are only as good as the contours and generating the contours is a major time commitment.^{Reference Jameson, Holloway, Vial, Vinod and Metcalfe1,Reference Segedin and Petric2} Atlas-based autocontouring has been available for more than two decades, but its usefulness outside the brain is limited.^{Reference Greenham, Dean and Fu3,Reference Vrtovec, Mocnik, Strojan, Pernus and Ibragimov4} Recently, artificial intelligence (AI)-based autocontouring tools have become commercially available.^{Reference Cardenas, Yang, Anderson, Court and Brock5–Reference Fu, Lei, Wang, Curran, Liu and Yang7} They promise improved accuracy, greatly reduced variation and significant efficiency gains.^{Reference Valentini, Boldrini, Damiani and Muren8–Reference Cardenas, Yang, Anderson, Court and Brock10} Although AI-generated contours will always have to be reviewed and, if necessary, adjusted by humans,^{Reference Claessens, Oria and Brouwer11} they may enable considerable time savings.^{Reference Young, Wortham, Wernick, Evans and Ennis12} This is especially true for the head and neck region, for which a large number of OARs are often contoured^{Reference van der Veen, Willems and Dechuyner13} and a delay in the start of radiotherapy is associated with an increased risk of local recurrence.^{Reference Huang, Barbera, Brouwers, Browman and Mackillop14,Reference Chen, King, Pearcey, Kerba and Mackillop15} Adaptive radiotherapy, in particular, would benefit greatly from fast OAR contour generation.^{Reference Glide-Hurst, Lee and Yock16,Reference Lim-Reinders, Keller, Al-Ward, Sahgal and Kim17}

We study the performance of three commercially available AI-based autocontouring packages (Autocontour 2.5.6, RADformation Inc. (“RF”), New York, NY, USA; Annotate 2.3.1, Therapanacea (“TP”), Paris, France; RT-Mind_AI 1.0, MedMind Inc. (“MM”), Delaware, USA) in the head and neck region. Head and neck contouring is a useful test case for several reasons: the large number of OARs translates to a high potential for time savings; the patients have often had prior surgery, meaning their anatomy may be distorted and organs may have been fully or partially removed; and the presence of metallic dental work results in artifacts that make contouring challenging. We compare the autocontours with those generated manually by experienced dosimetrists, quantify the geometric and dosimetric effects of using autocontouring and rank the autocontours for anatomical accuracy.

Methods

The computed tomography (CT) treatment planning image sets of 13 head and neck cancer patients (5 oropharynx, 2 sinonasal, 2 oral cavity, 2 orbit, 1 buccal and 1 frontal face) who had consented to the use of their patient information in research were randomly selected under a retrospective research protocol approved by the institutional review board. The slice thickness of the CT scans was 2 mm; the dose calculation grid resolution was 2–3 mm, depending on the patient. The contours for 14 OARs (brainstem, L parotid, R parotid, chiasm, L optic nerve, R optic nerve, esophagus, mandible, oral cavity, L cochlea, R cochlea, L submandibular gland, R submandibular gland, spinal cord) were generated by the three autocontouring packages and compared with the clinical (human-generated) contours, which had been generated by experienced dosimetrists and reviewed by physicians. The clinical set did not include every OAR for every patient, and the autocontours were only evaluated if a corresponding clinical contour existed.

The RF and TP autocontouring packages are not trainable, while MM can be adjusted to mimic a particular physician’s contouring. We used MM with the default settings, to keep the comparison fair and because one of the common aims of using autocontouring is to enforce uniformity in OAR structures across an institution.

Geometric performance was quantified by calculating Dice similarity coefficients (DSCs) and Hausdorff distances (HDs) between the clinical contours and the autocontours, using the 3D Slicer software. The DSC between structures A and B is defined as^{Reference Dice18}

$${\rm DSC = 2 \times volume \; (A \cap B) / (volume \; (A) + volume \; (B))}$$

DSC has a value between 0 and 1, with 1 indicating perfect overlap and 0 no overlap. Values of approximately 0.7 or higher are generally considered to indicate good overlap, but this will depend on the size of the structure – the overlapping interiors of large structures will result in a high DSC, even if the boundaries do not match well.
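The size dependence described above can be illustrated with a minimal sketch. It is not the 3D Slicer implementation used in the study; it assumes each structure is given as a set of voxel indices with equal voxel volume, so that voxel counts stand in for volumes. The helper `box` and the two test structures are hypothetical.

```python
def dice_coefficient(mask_a, mask_b):
    """DSC = 2 * |A ∩ B| / (|A| + |B|) for structures given as voxel index sets."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        raise ValueError("both structures are empty; DSC is undefined")
    return 2.0 * len(a & b) / (len(a) + len(b))


def box(lo, hi):
    """Hypothetical cubic structure covering voxel indices lo..hi-1 on each axis."""
    return {(i, j, k)
            for i in range(lo, hi)
            for j in range(lo, hi)
            for k in range(lo, hi)}


# The same one-voxel shift along every axis, applied to a large and a small cube.
dsc_large = dice_coefficient(box(0, 20), box(1, 21))  # 20 x 20 x 20 voxels
dsc_small = dice_coefficient(box(0, 3), box(1, 4))    # 3 x 3 x 3 voxels

print(f"large structure DSC: {dsc_large:.3f}")  # about 0.857
print(f"small structure DSC: {dsc_small:.3f}")  # about 0.296
```

An identical boundary error leaves the large structure well above the 0.7 level while the small structure falls far below it, which is why low DSC values for structures such as the chiasm or cochlea need to be interpreted with care.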

The two-directional, three-dimensional HD between structures A and B is defined as the maximum of the minimum distances of points a on structure A and b on structure B,^{Reference Vaassen, Hazelaar and Vaniqui19}

$${\rm HD}(A, B) = \max \left\{ \sup_{a \in A} d(a, B), \; \sup_{b \in B} d(A, b) \right\}$$

HD has a unit of length and a non-negative value, with 0 mm indicating perfect agreement. We calculated HD95 (95% of the points on the boundaries of the structures are within HD95 of each other). In contrast to DSC, it is easier for small structures to get good (small) HD values, even if they do not overlap at all (a longer contour makes finding a really bad point more likely). HD is determined by the part of the contour with the worst agreement, whereas DSC is affected by all areas that are non-overlapping.
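The difference between HD and HD95 can be sketched as follows. This is an illustration only, not the 3D Slicer routine used in the study: it assumes each structure is given as a list of boundary points in mm, uses a brute-force nearest-point search, and takes the percentile with the nearest-rank rule, which is one of several conventions in use. The point sets are hypothetical.

```python
import math


def directed_distances(points_a, points_b):
    """For every point on A, the distance to the nearest point on B."""
    return [min(math.dist(a, b) for b in points_b) for a in points_a]


def percentile(values, q):
    """Nearest-rank percentile (assumed convention); q = 100 gives the maximum."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def hausdorff(points_a, points_b, q=100.0):
    """Two-directional Hausdorff distance; q = 95 gives HD95."""
    d_ab = percentile(directed_distances(points_a, points_b), q)
    d_ba = percentile(directed_distances(points_b, points_a), q)
    return max(d_ab, d_ba)


# Two parallel rows of points 1 mm apart, with a single outlier on B.
contour_a = [(float(x), 0.0, 0.0) for x in range(20)]
contour_b = [(float(x), 1.0, 0.0) for x in range(19)] + [(19.0, 10.0, 0.0)]

hd_full = hausdorff(contour_a, contour_b)          # driven by the outlier
hd_95 = hausdorff(contour_a, contour_b, q=95.0)    # outlier is discarded

print(f"HD   = {hd_full:.1f} mm")
print(f"HD95 = {hd_95:.1f} mm")
```

In this toy case a single stray point raises the full HD to 10 mm, while HD95 stays at the 1 mm separation of the remaining points, which is the motivation for reporting HD95 rather than the maximum.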

A good DSC or HD indicates good agreement between the autocontours and clinical contours, but does not by itself guarantee anatomical accuracy – there is considerable interobserver variation in clinical contours. This is especially true if the OAR is not expected to get a significant dose that would justify spending a lot of time contouring it manually. Therefore, the anatomical accuracy of the OAR contours was also ranked subjectively by physicians experienced in treating head and neck cancer. A physician not involved in the creation of the manual contours compared their accuracy with the autocontours. The four contour sets were ranked on a four point Likert scale from most (1) to least accurate (4) for each patient and organ, and the scores averaged. Two physicians did a similar ranking for the autocontours only, which were ranked most (1) to least (3) accurate and the scores averaged.

We also quantified the dosimetric performance of the autocontouring packages. Intensity Modulated Radiotherapy (IMRT) treatment plans generated on the clinical contours were recalculated (without reoptimising) on the autocontoured structures and the change in the DVH quantified. The treatment planning system employed was Philips Pinnacle 16.2.1 (Philips Medical Systems, Gainesville, FL, USA). We did not generate new treatment plans based on the autocontoured structures, as this would have added an uncontrolled variable (whether the change in the DVH is due to a change in the OAR contour or the quality of optimisation in the new plan).
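The recalculation step can be pictured with a small sketch. It assumes the dose to each voxel inside a structure is available as a plain list of values in Gy, and it defines D_x as the minimum dose received by the hottest x% of voxels; the function names and dose values are hypothetical and are not taken from the treatment planning system or from the study data.

```python
import math


def d_max(doses):
    """Maximum point dose in the structure."""
    return max(doses)


def d_x(doses, x):
    """Minimum dose received by the hottest x % of the structure's voxels."""
    ordered = sorted(doses, reverse=True)
    n_hot = max(1, math.ceil(x * len(ordered) / 100.0))
    return ordered[n_hot - 1]


def percent_change(auto_value, clinical_value):
    """Relative change of a DVH metric when the plan is evaluated on the autocontour."""
    return 100.0 * (auto_value - clinical_value) / clinical_value


# Hypothetical voxel doses (Gy): the autocontour is slightly larger and
# picks up one extra voxel lying in a steeper part of the dose distribution.
clinical_doses = [10.0, 20.0, 30.0, 40.0]
auto_doses = [10.0, 20.0, 30.0, 40.0, 44.0]

change_dmax = percent_change(d_max(auto_doses), d_max(clinical_doses))
change_d50 = percent_change(d_x(auto_doses, 50), d_x(clinical_doses, 50))

print(f"change in D_MAX: {change_dmax:+.1f} %")
print(f"change in D50:   {change_d50:+.1f} %")
```

Because the dose distribution is held fixed and only the structure changes, any difference in the metric is attributable to the contour alone, which is the rationale given above for not reoptimising the plans.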

Results

The DSCs are presented in Table 1 and HD95 distances in Table 2. The physician-generated anatomical accuracy scores are listed in Table 3A (manual contours compared with the autocontours) and Table 3B (mean of autocontour scores from two physicians), and the DVH metrics in Table 4. The DVH metrics selected correspond to those used at our institution for evaluating clinical plans (D_MAX for the brainstem, spinal cord, optic nerves and chiasm, D50 for the parotid and submandibular glands). Figure 1 shows a comparison between clinical and automatically generated parotid and spine contours. Figure 2 shows a comparison of HD95 and DSC for selected OARs for RF, and a comparison of the DSCs of all the autocontouring packages for the same OARs. Figure 3 shows the dosimetric change for the spinal cord and left parotid as a function of HD95 and DSC.

Table 1. The mean dice similarity coefficient and standard deviation of the autocontours (N = number of patients with the OAR contoured), the best value for each OAR in bold

Table 2. The mean HD95 and standard deviation [mm] of the autocontours (N = number of patients with the OAR contoured), the best value for each organ in bold

Table 3A. Anatomical accuracy of the contours, compared to clinical contours (a lower number denotes higher accuracy, the best value for each organ in bold)

Table 3B. Anatomical accuracy of the autocontours (mean of scores from two physicians). The best value for each organ in bold

Table 4. Mean changes in DVH metrics for clinical treatment plans recalculated on autocontour sets. The smallest absolute mean change is printed in bold

Figure 1. Parotid, oral cavity and spinal cord contours for a sample patient. Clinical contours: green, RF: blue, TP: yellow, MM: orange.

Figure 2. Comparison of DSC and HD95. Top: HD95 and DSC for RF; Middle: DSCs of MM and TP, compared to RF; Bottom: HD95s of MM and TP, compared to RF.

Figure 3. Comparison of absolute relative changes in dosimetry with DSC and HD95. Top: Change in spinal cord D_MAX, compared to DSC; second from Top: Change in spinal cord D_MAX, compared to HD95; Third from Top: Change in left parotid D50, compared to DSC; Bottom: Change in left parotid D50, compared to HD95.

All three autocontouring packages posted similar DSC results (the mean DSC of all OARs is RF 0.668, TP 0.704, MM 0.662). 23/40 (57.5%) of the DSCs were above 0.7 and 35/40 (87.5%) above 0.5, the exceptions being the chiasm and the cochlea. These are small structures, so a low DSC is not surprising.

RF had the best DSC for 2 OARs and TP for 11; for 1 OAR, RF and TP were equally good (MM does not generate cochlear contours on a CT scan). The percentage of OARs having mean Dice coefficients > 0.7 was RF 57.1%, TP 64.3% and MM 50.0%. With the exception of the chiasm, the DSC values for RF and TP were similar to those reported in the literature for earlier versions of these programs.^{Reference Goddard, Velten and Tang20–Reference Doolan, Charalambous and Roussakis22}
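The reported percentages can be cross-checked with simple arithmetic. The per-package counts below are not stated in the article; they are one set of integers that is consistent with the quoted figures, under the assumption that MM is scored on 12 OARs because it does not generate the two cochlear contours.

```python
# (OARs with mean DSC > 0.7, OARs evaluated) per package.
# These counts are inferred for illustration, not taken from Table 1.
counts = {"RF": (8, 14), "TP": (9, 14), "MM": (6, 12)}

percentages = {
    name: round(100.0 * above / total, 1)
    for name, (above, total) in counts.items()
}

total_above = sum(above for above, _ in counts.values())
total_evaluated = sum(total for _, total in counts.values())
overall = round(100.0 * total_above / total_evaluated, 1)

print(percentages)
print(f"{total_above}/{total_evaluated} = {overall} %")
```

Under this reading the per-package values reproduce 57.1%, 64.3% and 50.0%, and the totals reproduce the 23/40 (57.5%) quoted earlier for all mean DSCs above 0.7.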

RF and TP had slightly lower HD95 values than MM (mean HD95 of all OARs is RF 4.6 mm, TP 4.6 mm, MM 6.7 mm). RF had the best HD95 for 7 OARs, TP for 5 and MM for 1; for 1 OAR, TP and MM were equally good (MM does not generate cochlear contours on a CT scan). In particular, the oral cavity and the optic structures generated by RF and TP matched the clinical contours better than those generated by MM. The structures with the highest mean HD95 were the oral cavity and esophagus, with the parotids and chiasm also having relatively large HDs, despite their smaller size.

The physician-generated anatomical accuracy scores, averaged over all OARs, were clinical contours 2.49, RF 2.28, TP 1.93 and MM 3.24. Similarly to HD95, RF and TP outperformed MM and were in fact judged slightly more anatomically accurate than the clinical contours. Similar results of AI-based autocontouring being more accurate than clinical contours have been reported by other authors for various body sites.^{Reference Goddard, Velten and Tang20} The clinical contours were judged most accurate for 3 OARs, RF for 4, TP for 6 and for one OAR, RF and TP were equally accurate. When the two physicians compared the autocontouring sets only, the results were very similar (RF 1.86, TP 1.77, MM 2.36), with RF most accurate for 7 OARs, TP for 6 and MM for 1.

Discussion

Generally, a high DSC corresponds to a low HD95, but the relationship between the two metrics is not very strong, for reasons noted in the Methods section. This is illustrated for RF in Figure 2A. Figures 2B and 2C compare the DSC and HD95 values of the three autocontouring packages; here the correlations are evident, as all three packages have similar DSC and HD95 values for the same patient.

The autocontouring packages will always attempt to generate a contour, even for OARs that have been surgically removed (e.g., some patients in our set had their submandibular glands removed prior to the simulation CT scan). The person reviewing the autocontours should be cognizant of this and remove the contours for OARs that are not present, lest target coverage be compromised in an attempt to spare a non-existent OAR.

A possible reason for RF and TP outperforming the clinical contours is that if the OAR is far from the target and not expected to get a significant dose, a person manually contouring it may not want to spend a lot of time maximising the anatomical accuracy. The accuracy of OAR contours is important even in these cases, for example if the patient requires reirradiation at a location closer to the OAR or a retrospective dose response study is carried out at a later date.

It is known that contouring is subject to a degree of interobserver variability; separate physicians may draw the same OAR quite differently. This is particularly relevant to the physician-generated anatomical accuracy scores. A detailed study of the effects of interobserver variability would require blinded comparisons of clinical and AI generated contours by a large number of physicians and is unfortunately outside the scope of this article.

Although several authors have studied the performance of AI-based autocontouring algorithms, they have usually quantified only the geometric differences, not the dosimetric ones.^{Reference van der Veen, Willems and Dechuyner13,Reference Goddard, Velten and Tang20,Reference Kim, Chun, Chang, Lee, Keum and Kim23–Reference Marschner, Datar and Gaasch27} The relationship between the two, however, is not as straightforward as one might suppose. As shown in Figure 3, a high DSC or a low HD95 does not always correspond to a small dosimetric effect – if the OAR is in a high dose gradient, even good geometric agreement can result in a large dosimetric effect and vice versa. If the user is interested in the magnitude of the dosimetric effects, a dose calculation with the autocontours should always be performed when testing autocontouring packages.
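One way to put a number on the strength of such a relationship is a correlation coefficient between a geometric metric and the absolute dosimetric change. The sketch below uses the Pearson coefficient, which is an assumption for illustration; the article does not state which statistic, if any, was computed. The four data points are hypothetical and are not values from Figure 3.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-patient values: DSC of one OAR and the absolute
# relative change (%) in its DVH metric when evaluated on the autocontour.
dsc_values = [0.60, 0.70, 0.80, 0.90]
abs_dose_change = [5.0, 1.0, 6.0, 2.0]

r = pearson_r(dsc_values, abs_dose_change)
print(f"Pearson r = {r:.2f}")  # roughly -0.22 for these made-up numbers
```

A coefficient close to zero, as in this made-up example, would mean that the geometric score says little about the size of the dosimetric effect, which is the situation the paragraph above describes.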

The mean differences in the D50 values do not depend strongly on the exact shape of the contour and are typically on the order of a few percent. For individual patients, however, the change in parotid D50 can be large, as indicated by the large standard deviation listed in Table 4. The changes in mean D_MAX values are typically larger, since they are determined by the voxel receiving the largest dose.

The MM autocontouring package is trainable, whereas the other two are not. Had we trained MM on our institution’s prior patients, its geometric agreement with the clinical contours most likely would have improved.

Chiasm and optic nerves

While the clinical chiasm contours were X-shaped in the axial plane, the autocontours were more elliptical. The chiasm and optic nerves in the clinical contours always overlapped, whereas the autocontoured ones did not always do so, and the transition point from chiasm to optic nerve varied. This resulted in worse geometric metrics, even if the optic pathway as a whole was well contoured. The DSC values of the chiasm were the lowest of any OAR, due partly to its small size but also to the fact that all the autocontouring packages posted better anatomical accuracy scores than the clinical contours for this structure. Other investigators have also reported low DSCs for autocontoured chiasms, possibly due to its poor visualisation on a CT scan.^{Reference Kim, Chun, Chang, Lee, Keum and Kim23,Reference Ibragimov and Xing24} We had the clinical contours for the chiasm retrospectively reviewed by a group of physicians; they agreed that the anatomical accuracy of the contours was not as good as for the other OARs. The HD95 values for the optic nerves were low for RF and TP, but over 5 mm for MM.

The D_MAX values for the autocontoured optic nerves and chiasms are, on average, smaller than for the clinical contours. The location of the hotspot in the nerve was usually at the chiasm end; since the clinical nerve contours always overlapped with the chiasm, the D_MAX for the clinical contours was, on average, higher.

Brainstem and spinal cord

The brainstem DSC and HD95 values were very good for all the autocontouring packages. For the spine, the RF contours are bigger than the others and often approximate the spinal canal, rather than the true spinal cord. This results in the D_MAX values being systematically higher for the RF contours. It should be noted that clinical contouring of the spinal cord varies from clinic to clinic; many institutions will intentionally contour the whole canal or add a margin, while still calling the structure spinal cord.

Parotids and submandibular glands

The parotid and submandibular gland DSC values were very good. The HD95 values of the submandibular glands were lower than for the parotids, partly due to their smaller size. These structures are not very easy to delineate on a CT set and are often affected by metal artifacts due to dental work. RF and TP were judged to be more anatomically accurate than the clinical contours.

Oral cavity, mandible and esophagus

The DSC values for these structures were very good. The HD95 values for the oral cavity and esophagus were relatively high, partly due to their large size. RF and TP outperformed MM for these OARs. The MM oral cavity contours had larger volumes and extended further in the posterior and inferior directions. The caveats that were noted for clinical spinal cord contours also apply to the oral cavity (many institutions intentionally err on the side of a generous contour).

Conclusions

This study evaluated the geometric and dosimetric performance of three AI-based autocontouring packages in the head and neck region. The geometric agreement between the clinical contours and RF and TP was slightly better than with MM. The mean anatomical accuracy of two (RF and TP) of the three autocontouring packages was judged to be better than that of the original clinical contours; the dosimetric performance of all three was very similar. Had MM been trained on previous contours of the physicians at our institution, its performance would most likely have improved.

The dosimetric effects depend on both the quality of the autocontours and the dose gradients in the plan; thus, the correlation between geometric and dosimetric metrics was not strong. All three autocontouring packages can be used to generate OAR contours that can be used clinically with a modest amount of human editing, resulting in treatment planning time savings and uniformity of contouring. All autocontouring packages should be evaluated against the current contouring practice of the institution and checked for systematic differences (e.g., spinal canal vs. spinal cord, the superior/inferior level at which the cord contour ends) before being put into clinical use; dosimetric comparisons are also recommended.

Acknowledgements

None.

Financial support

None.

Competing interests

None.

References

Jameson, MG, Holloway, LC, Vial, PJ, Vinod, SK, Metcalfe, PE. A review of methods of analysis in contouring studies for radiation oncology. J Med Imaging Radiat Oncol 2010; 54: 401–410. doi:10.1111/j.1754-9485.2010.02192.x

Segedin, B, Petric, P. Uncertainties in target volume delineation in radiotherapy - are they relevant and what can we do about them? Radiol Oncol 2016; 50: 254–625. doi:10.1515/raon-2016-0023

Greenham, S, Dean, J, Fu, CK, et al. Evaluation of atlas-based auto-segmentation software in prostate cancer patients. J Med Radiat Sci 2014; 61 (3): 151–158.

Vrtovec, T, Mocnik, D, Strojan, P, Pernus, F, Ibragimov, B. Auto-segmentation of organs at risk for head and neck radiotherapy planning: From atlas-based to deep learning methods. Med Phys 2020; 47: E929–E950.

Cardenas, CE, Yang, J, Anderson, BM, Court, LE, Brock, KB. Advances in autosegmentation. Semin Radiat Oncol 2019; 29: 185–975. doi:10.1016/j.semradonc.2019.02.001

Lei, Y, Fu, Y, Wang, T, et al. Deep Learning Architecture Design for Multi-Organ Segmentation. Auto-Segmentation for Radiation Oncology. CRC Press; 2021.

Fu, Y, Lei, Y, Wang, T, Curran, WJ, Liu, T, Yang, X. A review of deep learning based methods for medical image multi-organ segmentation. Phys Med 2021; 85: 107–122. doi:10.1016/j.ejmp.2021.05.003

Valentini, V, Boldrini, L, Damiani, A, Muren, LP. Recommendations on how to establish evidence from auto-segmentation software in radiotherapy. Radiother Oncol 2014; 112 (3): 317–320. doi:10.1016/j.radonc.2014.09.014

Tao, CJ, Yi, JL, Chen, NY, et al. Multi-subject atlas-based autosegmentation reduces interobserver variation and improves dosimetric parameter consistency for organs at risk in nasopharyngeal carcinoma: a multi-institution clinical study. Radiother Oncol 2015; 115: 407–411. doi:10.1016/j.radonc.2015.05.012

Cardenas, C, Yang, J, Anderson, BM, Court, LE, Brock, KB. Advances in auto-segmentation. Semin Radiat Oncol 2019; 29: 185–197. doi:10.1016/j.semradonc.2019.02.001

Claessens, M, Oria, CS, Brouwer, C, et al. Quality assurance for AI-based applications in radiation therapy. Semin Radiat Oncol 2022; 32: 421–431. doi:10.1016/j.semradonc.2022.06.011

Young, AV, Wortham, A, Wernick, I, Evans, A, Ennis, RD. Atlas-based segmentation improves consistency and decreases time required for contouring postoperative endometrial cancer nodal volumes. Int J Radiat Oncol Biol Phys 2011; 79: 943–947. doi:10.1016/j.ijrobp.2010.04.063

van der Veen, J, Willems, S, Dechuyner, S, et al. Benefits of deep learning for delineation of organs at risk in head and neck cancer. Radiother Oncol 2019; 138: 68–74. doi:10.1016/j.radonc.2019.05.010

Huang, J, Barbera, L, Brouwers, M, Browman, G, Mackillop, J. Does delay in starting treatment affect the outcomes of radiotherapy? A systematic review. J Clin Oncol 2003; 21: 555–563. doi:10.1200/JCO.2003.04.171

Chen, Z, King, W, Pearcey, R, Kerba, M, Mackillop, W. The relationship between waiting time for radiotherapy and clinical outcomes: a systematic review of the literature. Radiother Oncol 2008; 87: 3–16. doi:10.1016/j.radonc.2007.11.016

Glide-Hurst, CK, Lee, P, Yock, AD, et al. Adaptive radiation therapy (ART) strategies and technical considerations: a state of the ART review from NRG oncology. Int J Radiat Oncol Biol Phys 2021; 109: 1054–1075.

Lim-Reinders, S, Keller, BM, Al-Ward, S, Sahgal, A, Kim, A. Online adaptive radiation therapy. Int J Radiat Oncol Biol Phys 2017; 15 (4): 994–1003. doi:10.1016/j.ijrobp.2017.04.023

Dice, LR. Measures of the amount of ecologic association between species. Ecology 1945; 26 (3): 297–302. doi:10.2307/1932409

Vaassen, F, Hazelaar, C, Vaniqui, A, et al. Evaluation of measures for assessing time-saving of automatic organ-at-risk segmentation in radiotherapy. Phys Imaging Radiat Oncol 2020; 13: 1–6.

Goddard, L, Velten, C, Tang, J, et al. Evaluation of multiple-vendor AI autocontouring solutions. Radiat Oncol 2024; 19 (1): 69. doi:10.1186/s13014-024-02451-4

Kim, Y, Biggs, S, Mackonis, E. Investigation on performance of multiple AI-based auto-contouring systems in organs at risks (OARs) delineation. Phys Eng Sci Med 2024; 47: 1123–1140. doi:10.1007/s13246-024-01434-9

Doolan, P, Charalambous, S, Roussakis, Y, et al. A clinical evaluation of the performance of five commercial artificial intelligence contouring systems for radiotherapy. Front Oncol 2023; 13: 1213068. doi:10.3389/fonc.2023.1213068

Kim, N, Chun, J, Chang, JS, Lee, CG, Keum, KiC, Kim, JS. Feasibility of continual deep learning-based segmentation for personalized adaptive radiation therapy in head and neck area. Cancers 2021; 13: 1–195.

Ibragimov, B, Xing, L. Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks. Med Phys 2017; 44: 547–554. doi:10.1002/mp.12045

Hu, Y, Nguyen, H, Smith, C, et al. Clinical assessment of a novel machine-learning automated contouring tool for radiotherapy planning. J Appl Clin Med Phys 2023; 24: e13949. doi:10.1002/acm2.13949

Bustos, L, Sarkar, A, Doyle, L, et al. Feasibility evaluation of novel AI-based deep-learning contouring algorithm for radiotherapy. J Appl Clin Med Phys 2023; 24: e14090.

Marschner, S, Datar, M, Gaasch, A, et al. A deep image-to-image network organ segmentation algorithm for radiation treatment planning: principles and evaluation. Radiat Oncol 2022; 17: 129. doi:10.1186/s13014-022-02102-6
