Introduction
Contouring organs at risk (OARs) is a crucial task in radiation therapy: the treatment plan and the dose–volume histogram (DVH) are only as good as the contours, and generating the contours is a major time commitment. Reference Jameson, Holloway, Vial, Vinod and Metcalfe1,Reference Segedin and Petric2 Atlas-based autocontouring has been available for more than two decades, but its usefulness outside the brain is limited. Reference Greenham, Dean and Fu3,Reference Vrtovec, Mocnik, Strojan, Pernus and Ibragimov4 Recently, artificial intelligence (AI)-based autocontouring tools have become commercially available. Reference Cardenas, Yang, Anderson, Court and Brock5–Reference Fu, Lei, Wang, Curran, Liu and Yang7 They promise improved accuracy, greatly reduced variation and significant efficiency gains. Reference Valentini, Boldrini, Damiani and Muren8–Reference Cardenas, Yang, Anderson, Court and Brock10 Although AI-generated contours will always have to be reviewed and, if necessary, adjusted by humans, Reference Claessens, Oria and Brouwer11 they may enable considerable time savings. Reference Young, Wortham, Wernick, Evans and Ennis12 This is especially true for the head and neck region, where a large number of OARs are often contoured Reference van der Veen, Willems and Dechuyner13 and a delay in the start of radiotherapy is associated with an increased risk of local recurrence. Reference Huang, Barbera, Brouwers, Browman and Mackillop14,Reference Chen, King, Pearcey, Kerba and Mackillop15 Adaptive radiotherapy, in particular, would benefit greatly from fast OAR contour generation. Reference Glide-Hurst, Lee and Yock16,Reference Lim-Reinders, Keller, Al-Ward, Sahgal and Kim17
We study the performance of three commercially available AI-based autocontouring packages (Autocontour 2.5.6, RADformation Inc. (“RF”), New York, NY, USA; Annotate 2.3.1, Therapanacea (“TP”), Paris, France; RT-Mind_AI 1.0, MedMind Inc. (“MM”), Delaware, USA) in the head and neck region. Head and neck contouring is a useful test case for several reasons: the large number of OARs translates to a high potential for time savings; the patients have often had prior surgery, so their anatomy may be distorted and organs may have been fully or partially removed; and metallic dental work produces artifacts that make contouring challenging. We compare the autocontours with those generated manually by experienced dosimetrists, quantify the geometric and dosimetric effects of using autocontouring, and rank the autocontours for anatomical accuracy.
Methods
The computed tomography (CT) treatment planning image sets of 13 head and neck cancer patients (5 oropharynx, 2 sinonasal, 2 oral cavity, 2 orbit, 1 buccal and 1 frontal face) were randomly selected from patients who had consented to the use of their information in research, under a retrospective research protocol approved by the institutional review board. The slice thickness of the CT scans was 2 mm; the dose calculation grid resolution was 2–3 mm, depending on the patient. Contours for 14 OARs (brainstem, L parotid, R parotid, chiasm, L optic nerve, R optic nerve, esophagus, mandible, oral cavity, L cochlea, R cochlea, L submandibular gland, R submandibular gland, spinal cord) were generated by the three autocontouring packages and compared with the clinical (human-generated) contours, which were drawn by experienced dosimetrists and reviewed by physicians. The clinical set did not include every OAR for every patient, and an autocontour was evaluated only if a corresponding clinical contour existed.
The RF and TP autocontouring packages are not trainable, while MM can be adjusted to mimic a particular physician’s contouring. We used MM with the default settings, to keep the comparison fair and because one of the common aims of using autocontouring is to enforce uniformity in OAR structures across an institution.
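Geometric performance was quantified by calculating Dice similarity coefficients (DSCs) and Hausdorff distances (HDs) between the clinical contours and the autocontours, using the 3D Slicer software. The DSC between structures A and B is defined as Reference Dice18
$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$
where |A| and |B| are the volumes of the two structures and |A ∩ B| is the volume of their overlap.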
DSC has a value between 0 and 1, with 1 indicating perfect overlap and 0 no overlap. Values of approximately 0.7 or higher are generally considered to indicate good overlap, but this depends on the size of the structure: the overlapping interiors of large structures result in a high DSC, even if the boundaries do not match well.
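The size dependence of the DSC can be illustrated with a minimal sketch. It assumes the standard Dice definition, DSC = 2|A ∩ B| / (|A| + |B|), evaluated on boolean voxel masks; the function names, the grid and the cubic test structures are hypothetical and are not part of the study, which used 3D Slicer.

```python
import numpy as np


def dice_coefficient(mask_a, mask_b):
    """Return DSC = 2|A ∩ B| / (|A| + |B|) for two boolean voxel masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return float("nan")  # DSC is undefined when both structures are empty
    return float(2.0 * np.logical_and(a, b).sum() / total)


def cube_mask(grid_shape, corner, edge):
    """Hypothetical helper: a cube of `edge` voxels with its low corner at `corner`."""
    mask = np.zeros(grid_shape, dtype=bool)
    z, y, x = corner
    mask[z:z + edge, y:y + edge, x:x + edge] = True
    return mask


grid = (40, 40, 40)
# The same 2-voxel shift along one axis, applied to a large and a small cube.
dsc_large = dice_coefficient(cube_mask(grid, (5, 5, 5), 20), cube_mask(grid, (7, 5, 5), 20))
dsc_small = dice_coefficient(cube_mask(grid, (5, 5, 5), 4), cube_mask(grid, (7, 5, 5), 4))
print(f"large cube DSC = {dsc_large:.2f}, small cube DSC = {dsc_small:.2f}")
```

In this synthetic case the identical 2-voxel boundary error gives a DSC of 0.90 for the 20-voxel cube but only 0.50 for the 4-voxel cube, which is why a single DSC threshold should not be applied uniformly to large and small OARs.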
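The bidirectional, three-dimensional HD between structures A and B is defined as the maximum of the minimum distances between points a on structure A and b on structure B, Reference Vaassen, Hazelaar and Vaniqui19
$$\mathrm{HD}(A, B) = \max\left\{ \max_{a \in A} \min_{b \in B} d(a, b),\ \max_{b \in B} \min_{a \in A} d(a, b) \right\}$$
where d(a, b) is the Euclidean distance between points a and b.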
HD has units of length and is non-negative, with 0 mm indicating perfect agreement. We calculated HD95 (95% of the points on the boundaries of the structures are within HD95 of each other). In contrast to DSC, small structures can more easily achieve good (small) HD values, even if they do not overlap at all, because a longer contour makes finding a really bad point more likely. HD is determined by the part of the contour with the worst agreement, whereas DSC is affected by all non-overlapping regions.
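The difference between the maximum HD and HD95 can be sketched as follows. This is an illustration under stated assumptions, not the 3D Slicer implementation: the structures are represented as arrays of boundary points in mm, HD95 is taken as the larger of the two directed 95th-percentile nearest-neighbour distances (one common convention; others exist), and the point sets are synthetic.

```python
import numpy as np


def directed_min_distances(points_a, points_b):
    """For every point in A, the Euclidean distance to its nearest point in B."""
    diff = points_a[:, None, :] - points_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)


def hausdorff(points_a, points_b, percentile=100.0):
    """Bidirectional Hausdorff distance; percentile=95 gives an HD95-style metric."""
    d_ab = directed_min_distances(points_a, points_b)
    d_ba = directed_min_distances(points_b, points_a)
    return float(max(np.percentile(d_ab, percentile), np.percentile(d_ba, percentile)))


# Synthetic boundary A: 100 points on a line, 1 mm apart.
a = np.array([[float(i), 0.0, 0.0] for i in range(100)])
# Synthetic boundary B: the same line offset by 1 mm, plus one stray point 30 mm away.
b = np.vstack([a + [0.0, 1.0, 0.0], [[50.0, 30.0, 0.0]]])

hd_max = hausdorff(a, b)          # dominated by the single stray point
hd_95 = hausdorff(a, b, 95.0)     # ignores the worst 5% of points
print(f"HD = {hd_max:.1f} mm, HD95 = {hd_95:.1f} mm")
```

Here the single outlier drives the maximum HD to 30 mm while HD95 stays at 1 mm. The pairwise distance matrix used above grows with the product of the two point counts, so for real contours a nearest-neighbour structure such as `scipy.spatial.cKDTree` would be a more practical choice.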
A good DSC or HD indicates good agreement between the autocontours and clinical contours, but does not by itself guarantee anatomical accuracy, because there is considerable interobserver variation in clinical contours. This is especially true if the OAR is not expected to get a significant dose that would justify spending a lot of time contouring it manually. Therefore, the anatomical accuracy of the OAR contours was also ranked subjectively by physicians experienced in treating head and neck cancer. A physician not involved in the creation of the manual contours compared their accuracy with that of the autocontours. The four contour sets were ranked on a four-point Likert scale from most (1) to least accurate (4) for each patient and organ, and the scores were averaged. Two physicians performed a similar ranking of the autocontours only, from most (1) to least (3) accurate, and the scores were averaged.
We also quantified the dosimetric performance of the autocontouring packages. Intensity-modulated radiotherapy (IMRT) treatment plans generated on the clinical contours were recalculated (without reoptimising) on the autocontoured structures, and the change in the DVH was quantified. The treatment planning system employed was Philips Pinnacle 16.2.1 (Philips Medical Systems, Gainesville, FL, USA). We did not generate new treatment plans based on the autocontoured structures, as this would have added an uncontrolled variable (whether the change in the DVH is due to a change in the OAR contour or the quality of optimisation in the new plan).
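The idea of evaluating a fixed dose distribution on two different contours can be sketched in a few lines. This is a simplified illustration, not the Pinnacle workflow: it assumes the dose and both structure masks are sampled on the same voxel grid, treats D50 as the median voxel dose of the structure, and uses a synthetic dose distribution and hypothetical mask positions.

```python
import numpy as np


def dvh_metrics(dose, mask):
    """Maximum dose and D50 (dose received by at least 50% of the volume)."""
    voxels = dose[mask]
    return {"D_MAX": float(voxels.max()), "D50": float(np.percentile(voxels, 50))}


def relative_change(new, old):
    """Percentage change of a DVH metric relative to the clinical contour."""
    return 100.0 * (new - old) / old


# Synthetic dose in Gy: falls linearly by 2 Gy per voxel along x.
x = np.arange(30)
dose = np.broadcast_to(70.0 - 2.0 * x, (10, 10, 30)).copy()

clinical = np.zeros(dose.shape, dtype=bool)
clinical[:, :, 10:15] = True      # hypothetical clinical contour
auto = np.zeros(dose.shape, dtype=bool)
auto[:, :, 9:15] = True           # hypothetical autocontour, one voxel larger

m_clin = dvh_metrics(dose, clinical)
m_auto = dvh_metrics(dose, auto)
for name in ("D_MAX", "D50"):
    change = relative_change(m_auto[name], m_clin[name])
    print(f"{name}: {m_clin[name]:.1f} -> {m_auto[name]:.1f} Gy ({change:+.1f}%)")
```

Because the dose array is never modified, any difference between the two sets of metrics is attributable to the contours alone, which mirrors the reason given above for recalculating rather than reoptimising.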
Results
The DSCs are presented in Table 1 and HD95 distances in Table 2. The physician-generated anatomical accuracy scores are listed in Table 3A (manual contours compared with the autocontours) and Table 3B (mean of autocontour scores from two physicians), and the DVH metrics in Table 4. The DVH metrics selected correspond to those used at our institution for evaluating clinical plans (D_MAX for the brainstem, spinal cord, optic nerves and chiasm, D50 for the parotid and submandibular glands). Figure 1 shows a comparison between clinical and automatically generated parotid, oral cavity and spinal cord contours. Figure 2 shows a comparison of HD95 and DSC for selected OARs for RF, and a comparison of the DSCs of all the autocontouring packages for the same OARs. Figure 3 shows the dosimetric change for the spinal cord and left parotid as a function of HD95 and DSC.
Table 1. The mean Dice similarity coefficient and standard deviation of the autocontours (N = number of patients with the OAR contoured), the best value for each OAR in bold

Table 2. The mean HD95 and standard deviation [mm] of the autocontours (N = number of patients with the OAR contoured), the best value for each organ in bold

Table 3A. Anatomical accuracy of the contours, compared to clinical contours (a lower number denotes higher accuracy, the best value for each organ in bold)

Table 3B. Anatomical accuracy of the autocontours (mean of scores from two physicians). The best value for each organ in bold

Table 4. Mean changes in DVH metrics for clinical treatment plans recalculated on autocontour sets. The smallest absolute mean change is printed in bold


Figure 1. Parotid, oral cavity and spinal cord contours for a sample patient. Clinical contours: green, RF: blue, TP: yellow, MM: orange.

Figure 2. Comparison of DSC and HD95. Top: HD95 and DSC for RF; Middle: DSCs of MM and TP, compared to RF; Bottom: HD95s of MM and TP, compared to RF.

Figure 3. Comparison of absolute relative changes in dosimetry with DSC and HD95. Top: Change in spinal cord D_MAX, compared to DSC; second from Top: Change in spinal cord D_MAX, compared to HD95; Third from Top: Change in left parotid D50, compared to DSC; Bottom: Change in left parotid D50, compared to HD95.
All three autocontouring packages posted similar DSC results (the mean DSC over all OARs was RF 0.668, TP 0.704, MM 0.662). Of the 40 package/OAR combinations (14 OARs each for RF and TP, 12 for MM), 23 (57.5%) had a mean DSC above 0.7 and 35 (87.5%) above 0.5, the exceptions being the chiasm and the cochlea. These are small structures, so a low DSC is not surprising.
RF had the best DSC for 2 OARs and TP for 11; for 1 OAR, RF and TP were equally good (MM does not generate cochlear contours on a CT scan). The percentage of OARs with a mean DSC > 0.7 was 57.1% for RF, 64.3% for TP and 50.0% for MM. With the exception of the chiasm, the DSC values for RF and TP were similar to those reported in the literature for earlier versions of these programs. Reference Goddard, Velten and Tang20–Reference Doolan, Charalambous and Roussakis22
RF and TP had slightly lower HD95 values than MM (the mean HD95 over all OARs was RF 4.6 mm, TP 4.6 mm, MM 6.7 mm). RF had the best HD95 for 7 OARs, TP for 5 and MM for 1; for 1 OAR, TP and MM were equally good (MM does not generate cochlear contours on a CT scan). In particular, the oral cavity and the optic structures generated by RF and TP matched the clinical contours better than those generated by MM. The structures with the highest mean HD95 were the oral cavity and esophagus, with the parotids and chiasm also having relatively large HD95 values, despite their smaller size.
The physician-generated anatomical accuracy scores, averaged over all OARs, were 2.49 for the clinical contours, 2.28 for RF, 1.93 for TP and 3.24 for MM. As with HD95, RF and TP outperformed MM and were in fact judged slightly more anatomically accurate than the clinical contours. Similar findings of AI-based autocontours being more accurate than clinical contours have been reported by other authors for various body sites. Reference Goddard, Velten and Tang20 The clinical contours were judged most accurate for 3 OARs, RF for 4 and TP for 6; for 1 OAR, RF and TP were equally accurate. When the two physicians compared the autocontouring sets only, the results were very similar (RF 1.86, TP 1.77, MM 2.36), with RF most accurate for 7 OARs, TP for 6 and MM for 1.
Discussion
Generally, a high DSC corresponds to a low HD95, but the relationship between the two metrics is not very strong, for reasons noted in the Methods section. This is illustrated for RF in Figure 2 (top). The middle and bottom panels of Figure 2 compare the DSC and HD95 values of the three autocontouring packages; here the correlations are evident: all three packages have similar DSC and HD95 values for the same patient.
The autocontouring packages will always attempt to generate a contour, even for OARs that have been surgically removed (e.g., some patients in our set had their submandibular glands removed prior to the simulation CT scan). The person reviewing the autocontours should be cognizant of this and remove the contours for OARs that are not present, lest target coverage be compromised in an attempt to spare a non-existent OAR.
A possible reason for RF and TP outperforming the clinical contours is that if the OAR is far from the target and not expected to get a significant dose, a person manually contouring it may not want to spend a lot of time maximising the anatomical accuracy. The accuracy of OAR contours is important even in these cases, for example if the patient requires reirradiation at a location closer to the OAR or a retrospective dose response study is carried out at a later date.
It is known that contouring is subject to a degree of interobserver variability; separate physicians may draw the same OAR quite differently. This is particularly relevant to the physician-generated anatomical accuracy scores. A detailed study of the effects of interobserver variability would require blinded comparisons of clinical and AI generated contours by a large number of physicians and is unfortunately outside the scope of this article.
Although several authors have studied the performance of AI-based autocontouring algorithms, they have usually quantified only the geometric differences, not the dosimetric ones. Reference van der Veen, Willems and Dechuyner13,Reference Goddard, Velten and Tang20,Reference Kim, Chun, Chang, Lee, Keum and Kim23–Reference Marschner, Datar and Gaasch27 The relationship between the two, however, is not as straightforward as one might suppose. As shown in Figure 3, a high DSC or a low HD95 does not always correspond to a small dosimetric effect: if the OAR is in a high dose gradient, even good geometric agreement can result in a large dosimetric effect, and vice versa. If the user is interested in the magnitude of the dosimetric effects, a dose calculation with the autocontours should always be performed when testing autocontouring packages.
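The decoupling of geometric and dosimetric agreement can be made concrete with a one-dimensional toy example. Everything in it is synthetic and hypothetical (the dose profiles, the contour positions and the one-voxel shift); it only illustrates how the same geometric error can have very different dosimetric consequences depending on the local dose gradient.

```python
import numpy as np


def dice(a, b):
    """Dice similarity coefficient of two boolean masks."""
    return float(2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum()))


def dmax_change_percent(dose, clinical, auto):
    """Relative change in maximum dose when the autocontour replaces the clinical one."""
    old, new = dose[clinical].max(), dose[auto].max()
    return float(100.0 * (new - old) / old)


x = np.arange(100)
clinical = (x >= 40) & (x < 60)   # hypothetical clinical contour, 20 voxels long
auto = (x >= 39) & (x < 59)       # hypothetical autocontour, shifted by one voxel

flat_dose = np.full(100, 20.0)                               # uniform 20 Gy
steep_dose = np.clip(60.0 - 5.0 * (x - 30), 0.0, 60.0)       # 5 Gy per voxel fall-off

dsc = dice(clinical, auto)
flat_change = dmax_change_percent(flat_dose, clinical, auto)
steep_change = dmax_change_percent(steep_dose, clinical, auto)
print(f"DSC = {dsc:.2f}; D_MAX change: flat {flat_change:+.0f}%, steep {steep_change:+.0f}%")
```

With a DSC of 0.95 in both cases, the maximum dose is unchanged in the uniform dose region but rises by 50% (from 10 Gy to 15 Gy) in the steep gradient, so a geometric metric alone cannot predict the dosimetric effect.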
The mean differences in the D50 values do not depend strongly on the exact shape of the contour and are typically on the order of a few percent. For individual patients, however, the change in parotid D50 can be large, as indicated by the large standard deviation listed in Table 4. The changes in mean DMAX values are typically larger, since DMAX is determined by the single voxel receiving the largest dose.
The MM autocontouring package is trainable, whereas the other two are not. Had we trained MM on our institution’s prior patients, its geometric agreement with the clinical contours most likely would have improved.
Chiasm and optic nerves
While the clinical chiasm contours were X-shaped in the axial plane, the autocontours were more elliptical. The chiasm and optic nerves in the clinical contours always overlapped, whereas the autocontoured ones did not always do so, and the transition point from chiasm to optic nerve varied. This resulted in worse geometric metrics, even if the optic pathway as a whole was well contoured. The DSC values of the chiasm were the lowest of any OAR, due partly to its small size but also to the fact that all the autocontouring packages posted better anatomical accuracy scores than the clinical contours for this structure. Other investigators have also reported low DSCs for autocontoured chiasms, possibly due to the poor visualisation of the chiasm on a CT scan. Reference Kim, Chun, Chang, Lee, Keum and Kim23,Reference Ibragimov and Xing24 We had the clinical contours for the chiasm retrospectively reviewed by a group of physicians; they agreed that the anatomical accuracy of the contours was not as good as for the other OARs. The HD95 values for the optic nerves were low for RF and TP, but over 5 mm for MM.
The DMAX values for the autocontoured optic nerves and chiasms are, on average, smaller than for the clinical contours. The hotspot in the nerve was usually located at the chiasm end; since the clinical nerve contours always overlapped with the chiasm, the DMAX for the clinical contours was, on average, higher.
Brainstem and spinal cord
The brainstem DSC and HD95 values were very good for all the autocontouring packages. For the spinal cord, the RF contours are larger than the others and often approximate the spinal canal, rather than the true spinal cord. This results in the DMAX values being systematically higher for the RF contours. It should be noted that clinical contouring of the spinal cord varies from clinic to clinic; many institutions intentionally contour the whole canal or add a margin, while still calling the structure spinal cord.
Parotids and submandibular glands
The parotid and submandibular gland DSC values were very good. The HD95 values of the submandibular glands were lower than for the parotids, partly due to their smaller size. These structures are not very easy to delineate on a CT set and are often affected by metal artifacts due to dental work. RF and TP were judged to be more anatomically accurate than the clinical contours.
Oral cavity, mandible and esophagus
The DSC values for these structures were very good. The HD95 values for the oral cavity and esophagus were relatively high, partly due to their large size. RF and TP outperformed MM for these OARs. The MM oral cavity contours had larger volumes and extended further in the posterior and inferior directions. The caveats noted for clinical spinal cord contours also apply to the oral cavity (many institutions intentionally err on the side of a generous contour).
Conclusions
This study evaluated the geometric and dosimetric performance of three AI-based autocontouring packages in the head and neck region. The geometric agreement between the clinical contours and RF and TP was slightly better than with MM. The mean anatomical accuracy of two of the three autocontouring packages (RF and TP) was judged to be better than that of the original clinical contours; the dosimetric performance of all three was very similar. Had MM been trained on previous contours of the physicians at our institution, its performance would most likely have improved.
The dosimetric effects depend on both the quality of the autocontours and the dose gradients in the plan; thus the correlation between geometric and dosimetric metrics was not strong. All three autocontouring packages can generate OAR contours that are clinically usable with a modest amount of human editing, resulting in treatment planning time savings and uniformity of contouring. All autocontouring packages should be evaluated against the current contouring practice of the institution and checked for systematic differences (e.g., spinal canal vs. spinal cord, the superior/inferior level at which the cord contour ends) before being put into clinical use; dosimetric comparisons are also recommended.
Acknowledgements
None.
Financial support
None.
Competing interests
None.