Introduction

In multiple sclerosis (MS), atrophy of deep grey matter (DGM) structures like the caudate nucleus (caudate), putamen and thalamus is associated with cognitive and clinical impairment [1,2,3,4]. Accurate segmentations of these structures from structural MRI are key to understanding these atrophic processes and their role in MS.

However, it is unclear whether DGM segmentation using state-of-the-art automated methods is as accurate in MS cases as in healthy controls. Since studies have shown that white matter (WM) lesions and atrophy could affect measures such as whole-brain grey matter (GM) volume, it could be expected that such pathology also affects DGM segmentation [1, 5,6,7,8,9].

A direct comparison of automated methods to expert manual (reference) segmentation was performed by Derakhshan et al. (2010) in a small dataset containing 3 slices each of 3 MS patients [1]. Although that paper provided insights into the spatial overlap between automated and manual segmentations, with 3 slices per subject no volumetric analysis was possible. Moreover, the small number of subjects did not allow any analysis of relations between segmentation performance and MS-related pathological changes.

Therefore, this study quantitatively investigated automated segmentation performance in a whole-brain dataset of 32 subjects including MS patients and healthy controls. Four publicly available segmentation method packages (FSL-FIRST [10], FreeSurfer [11], Geodesic Information Flow (GIF) [12] and volBrain [13]) were evaluated in terms of volumetric and spatial agreement with manual segmentations created by combining manual outlines of three trained raters by majority voting. Moreover, the relation of segmentation accuracy with total and regional lesion load, whole-brain volume, and volume of the structure of interest was assessed to determine possible confounding disease relations factors.

Methods

Subjects

In total 21 MS patients and 11 healthy controls subjects were retrospectively selected from two of the multi-center studies of the MAGNIMS Study Group (www.magnims.eu) [14, 15]. Demographic details of the subjects are listed in Table 1. Subjects had been recruited at nine European centers, see Supplementary data A for the centers.

Table 1 Demographics of the subjects

The selection of the cases was based on maximizing the number of scanners and the number of secondary progressive MS (SPMS) and primary progressive MS (PPMS) cases while considering the workload for the three raters. All patients and controls had given informed consent for the use of their brain MRI-scans for research within the original study.

Acquisition

An overview of acquisition parameters for each site is given in Supplementary Tables 1 and 2. Briefly, MRI data were obtained using magnets operating at 3 T for all cases with three vendors (Siemens, Philips and GE). One of the two following imaging protocols was used: (1) 3D T1‐weighted scan (different pulse sequences for different venders) and a dual-echo spin echo scan with both 2D T2-weighted and 2D proton density (PD) weighted; or (2) 3D T1-weighted magnetization prepared rapid gradient echo (MPRAGE) scan and 2D fluid-attenuated inversion recovery (FLAIR) T2-weighted fast spin-echo sequence.

Manual segmentation of DGM structures

Manual segmentation of three DGM structures was performed using the SPINE online environment for collaborative research (https://spinevirtuallab.org). The caudate nucleus, putamen and thalamus were all manually segmented on the full 3D T1-weighted images in each subject by each rater. Four scans were outlined a second time by each rater to examine intra-rater variability. A summary of the segmentation protocol is added to the supplementary data (Supplementary Protocol 1).

The manual segmentations of the three raters were combined into a reference using majority voting: i.e., a voxel was classified as part of a structure if at least 2 of the 3 raters assigned it to that structure.

Lesion segmentation

Lesion segmentation was also performed manually by one expert rater, on the FLAIR scan or on the PD scan. The lesion segmentation was performed using the Medical Image Processing, Analysis, and Visualization (MIPAV) software environment whereby only lesions of at least three voxels were included.

Lesion filling

Lesion-filling is a common pre-processing step in patient scans, in which the intensities of voxels identified as being part of WM lesions are replaced by intensities similar to normal-appearing white matter. In this study, lesion-filling was applied using two algorithms: lesion segmentation toolbox (LST-LF) [16], and LEAP [8], and both versions of lesion-filled images as well as native images were analyzed. The lesion masks were first co-registered from their original PD or FLAIR space to 3D-T1 space using FSL-FLIRT with tri-linear interpolation and a threshold of 0.5 because this was previously found to provide good results for whole-brain GM volume measurements [6]. The Supplementary data B provides a description of LST-LF and LEAP. In this study, lesion-filling was applied using two algorithms: lesion segmentation toolbox (LST-LF) [16], and LEAP [8], and both versions of lesion-filled images as well as native images were analyzed. The lesion masks were first co-registered from their original PD or FLAIR space to 3D-T1 space using FSL-FLIRT with tri-linear interpolation and a threshold of 0.5, based on literature [6]. Second, the lesion was filled on the 3D-T1 weighted image with the use of the lesion mask in 3D-T1 space. The Supplementary data B provides a description of the two used lesion filling methods (LST-LF and LEAP).

Automatic DGM segmentation method

Within this study four automatic DGM segmentation methods were assessed; FSL-FIRST (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FIRST), FreeSurfer (https://surfer.nmr.mgh.harvard.edu), volBrain (https://volBrain.upv.es) and GIF (https://niftyweb.cs.ucl.ac.uk).

FSL-FIRST, version 6.0.1, has previously been described by Patenaude et al. (2011). In short, FSL-FIRST finds the most plausible outline based on the observed intensities from the T1-weighted input image using shape and appearance models derived from a large training dataset. Surface meshes of the subcortical structures were converted to boundary corrected voxelwise segmentations [10].

FreeSurfer, version 6.0.0. is described on the FreeSurferWiki page (https://surfer.nmr.mgh.harvard.edu/fswiki/). In short, labels are assigned to each voxel in the subcortical region (WM + subcortical GM). From these segmentations, the binary segmentations for the individual structure were extracted [11].

Geodesic Information Flow (GIF), versions V2.0, uses manually created atlases for segmentation of the input images. GIF captures the local variation in morphology and in standard space locations. With the use of an iterative geodesic minimization algorithm and the manual labels, more accurate segmentations are expected [12].

VolBrain, version 1.0 is an online pipeline for volumetric brain analysis The proposed pipeline is based on a library of manually labeled atlas cases to perform the segmentation process, including subcortical structure segmentation as proposed by Coupé et al. 2011 [13, 17].

Brain volume

The normalized brain volume (NBV) and brain volume (BV) were measured with SIENAX (part of FSL version 5.0) [18] on the lesion filled data. SIENAX is the cross-sectional pipeline of the SIENA method [19]. Based on voxel intensities it estimates partial volume fractions of GM, WM and cerebrospinal fluid (CSF) for each voxel. Volumes of GM and WM were added to obtain BV. SIENAX performs normalization of skull size to MNI space to obtain NBV.

Relation with MS pathology

The association of automatic segmentation performance, as measured by Dice similarity coefficient (DSC, see statistical analyses section), with multiple MS-related disease parameters, i.e. WM lesion load, regional lesion load, normalized brain volume (NBV) and DGM structure volume, was investigated. WM lesion load was determined from the manual lesion outlines. Regional lesion load was evaluated by measuring the lesion load within a pre-defined distance from the DGM structure (see Fig. 1). Using the distance transform, the distance of each voxel to the reference of the structure in the specific subject under investigation was calculated. By thresholding of the subject- and structure-specific distance map and masking with the subject-specific WM mask obtained with FreeSurfer, for each case, a “surrounding WM border” region at distances of 0–10 mm was defined. Region-specific lesion volumes were obtained by masking the WM lesion mask in 3D-T1 space with these WM surrounding WM border.

Fig. 1
figure 1

Method for lesion load calculation within a set border. a a distance field is created around DGM structure, in this case the caudate nucleus. b Distance is set around the DGM structure, seen in grey. c An overlay is created between the DGM border and the lesion mask in T1 space. In grey the lesion border is shown, in red the lesions without overlap with the border and in yellow the lesions with an overlap in the border

Statistical analyses

Intra-rater agreement was measured between the first and second manual segmentation of the structures with the Dice similarity coefficient (DSC) [20]:

$$DSC(A,B) = \frac{2(A \cap B)}{{A + B}}$$

with A and B defined as the segmentations and where ∩ is the intersection of the two segmentations.

Volumetric analysis was done by comparing the volume of the reference with volumes of the automated segmentation. Moreover, the intra-class correlation coefficient (ICC) for the absolute agreement was calculated [21].

Spatial overlap between reference and the automated method was measured with DSC. Student’s t test was used to compare DSC with the reference between controls and patients. To assess the effect of lesion-filling, two-way ANOVA analysis was performed comparing DSC of native images separately with those from each of the two lesion-filling methods (LEAP and LST).

The relation of disease pathology with spatial performance was examined using Spearman's correlation coefficient, in which 0.0 <|r|< 0.2 was considered a weak correlation, 0.2 ≤|r|< 0.5 a moderate correlation and |r|≥ 0.5 a strong correlation [22].

For all statistical analysis P-values < 0.05 were considered statistically significant.

Results

Manual segmentation

Volumes of the reference DGM structures are listed in Table 2. Intra-rater agreement was assessed through DSC means per structure for three cases (Table 3). Intra-rater DSC was consistently high, with DSC ≥ 0.85 for all the experts across all six DGM structures. No difference in inter-rater DSC was observed between raters or between DGM structures. The manual labels were combined by majority voting to create the reference segmentation. An evaluation at the voxel level showed similar numbers of voxels segmented by only a single rater in both MS and controls, indicating that there was no greater disagreement between the raters for MS patients compared to controls, see Table 4.

Table 2 For all structures and hemispheres, first the mean volume ± standard deviation (std) in millimeter of reference and the four automated segmentation software
Table 3 For all structures and hemispheres the spatial overlap of intra rater agreement. Spatial overlap is shown with the mean dice similarity coefficient ± standard deviation and is calculated over four subjects
Table 4 The average (± standard deviation) amount of voxels that were selected by one rater for both healthy control groups (HC) as patients (MS) group

Performance of automated methods

Volumetric agreement

DGM volumes of the reference and automatic segmentations were compared. In Fig. 2, an example T1 image is shown along with the corresponding segmentation of reference and automated method. Figure 3 and Table 2 show the volumetric and spatial agreement between reference and automated method. Over the total dataset (n = 32) automated average volumes all differed from reference segmentations: caudate and putamen volumes were on average underestimated by all automated method, while thalamus volumes were overestimated by FSL-FIRST and FreeSurfer and underestimated by GIF and volBrain (all p < 0.01). Despite these systematic differences, ICC for FSL-FIRST, FreeSurfer and volBrain varied between good (0.60 ≤ ICC < 0.75) and excellent (ICC ≥ 0.75) and for GIF from fair (0.40 ≤ ICC < 0.60) to excellent (Table 3).

Fig. 2
figure 2

T1 weighted images and segmentation of majority voting, FSL-FIRST, Freesurfer, GIF and volBrain. Segmentations of both left and right hemisphere and for all three structures; caudate, putamen and thalamus

Fig. 3
figure 3

Majority voting segmentation volume and volume by automatic segmentation are given for each deep gray matter structure and segmentation method. Volumes are given in milliliters

Spatial agreement

The DSC between reference and automatic segmentations were assessed. Figure 4, Table 5 show the DSC for both controls and patients. For thalamus, all the DSC were significantly lower for patients compared to controls (p < 0.05). For putamen, this was the case for FreeSurfer and GIF, for both left and right hemisphere. The volumes were, however, different in all cases, mostly lower in MS. For the caudate, only a significant difference in DSC between controls and patients was found in the right hemisphere for FreeSurfer and Gif. In all cases, a large variation was observed for patients compared to controls, see Fig. 4.

Fig. 4
figure 4

Dice similarity coefficients between segmentations from majority voting and each automated method per DGM structure for both healthy controls (HC) and patients (green)

Table 5 For all structures and hemispheres the spatial overlap between the “Gold standard” and the automated segmentation methods for both control and patients group

Relation with pathology

Higher WM lesion load was associated with a lower performance of the automated method: total WM lesion load was negatively correlated with DSC for all method (Table 6 and Fig. 5). The correlation was moderate to strong for all structures and for all methods and both sides (|r|> 0.2), however, not all were significant, see Table 6. The regional WM lesion load, i.e., that located within 10 mm of the structure, was also negatively correlated with DSC (Table 6), however, these correlations ranged from weak to moderate for putamen and thalamus and for caudate from moderate too strong.

Table 6 Spearman correlation between the dice similarity index and lesion load (LL), regional lesion load (RLL), normalized brain volume (NBV) and volume of region of interest (ROIV)
Fig. 5
figure 5

Dice similarity coefficients versus lesion load, represented per DGM structure and segmentation method and left (blue) and right (green) hemisphere

Both NBV and the volume of the structure of interest itself were positively correlated with DSC (Table 6). The correlations between NBV and DSC were often not significant and ranged between weak and strong for the different structures, methods and sides. The correlations between the volume of the structure and DSC were often significant. Only the correlation of DSC and volume measured on left caudate with GIF and on left thalamus with FreeSurfer were not significant. The correlations ranged for the putamen, thalamus and right caudate from moderate too strong and for left thalamus from weak too strong.

To overcome the effect of lesions, two lesion-filling methods (LST-LF and LEAP) were used. Two-way ANOVA analysis per hemisphere, per method and DGM structure showed no significant difference in segmentation volume and DSC for both lesion filling methods compared to native (non-filled) patients images (see Supplementary Fig. 1 and Supplementary Table 5). Moreover, Student’s t test showed no significant difference in segmentation volume or DSC for either of the filling methods compared to native patient images (Table 7).

Table 7 For all structures, hemispheres and automated segmentation method the p-value of students t test between dice similarity index of non-filled and filled T1 images before applying automated segmentation method

Discussion

Using a systematic and objective evaluation against a consensus of manual segmentations in a multi-center dataset, this study provides evidence that automated DGM segmentation methods performed worse on brain scans of MS patients than on those of healthy controls. Higher lesion volumes were associated with poorer DGM segmentation performance.

The accuracy of DGM segmentations is not an academic question but also has great clinical importance. Clinical and cognitive deterioration in MS have been linked to brain and GM atrophy [5, 23, 24], and several treatments are now available that are able to reduce brain atrophy rates in MS [25,26,27,28]. Accurate measurement of the volumes of DGM structures in MS is becoming especially important, because of the strong relation of DGM atrophy with cognitive impairment (1,5). This study reveals that existing DGM segmentation methods perform not as accurate in MS patients as in controls, as reflected by the lower DSC (overlap) scores. This implies that the results may incorporate increased random variability and bias when applied to MS cases and should be interpreted with great caution. In the future, methodological improvements are required to achieve better performance in MS.

Only a limited number of studies directly investigated the performance of DGM segmentation methods when applied to MS. Derakhshan et al. (2010) evaluated six automated segmentation method for GM atrophy on T1 MR images of three MS patients. They concluded that severe shortcomings are present in the segmentation of DGM structures [1]. The current study extends those findings substantially by investigating a multi-center dataset comprising 21 MS subjects and 11 controls using full three-dimensional manual segmentations of three DGM structures bilaterally. Importantly, using this dataset we were able to objectively compare several widely applied automated segmentation techniques in a multi-center setting. By selecting from previously acquired data a subset that maximized the number of scanners and the number of progressive patients, we were able to demonstrate quantitatively that this performance impairment exists in MS patients with a relatively long disease duration and/or progressive course. It would next be important to confirm this independently, as well as to investigate if the effect already occurs in early MS or CIS, given that DGM atrophy already occurs at those early stages [29].

To obtain insights that could aid in amending the impairment of segmentation performance, we investigated several possible causes. One important candidate reason for the reduced accuracy of DGM segmentation in MS is formed by the focal WM lesions. Previous work on whole-brain total GM volume measurement has shown that MS WM lesions affect the GM volume measurement for a number of different packages [6,7,8,9, 30]. Similarly, the presence of local or overall brain atrophy or diffusions damage could affect the performance of segmentation methods [31]. The precise mechanism behind these deteriorating effects may differ between packages but could include effects on image intensity histograms, image registration and non-brain tissue removal [5]. Therefore, we investigated whether total lesion load, regional lesion load, NBV and the volume of the structure itself were related to the performance of the automated DGM segmentation method. The strongest association with poorer accuracy in MS cases compared to healthy controls was observed for total WM lesion load. Higher regional lesions load and lower total and local brain volumes were also associated with poorer performance, but less strongly. It should be mentioned that the small size of regional lesion load—which is confined to a narrow region around the structure of interest—and the relatively small number of MS patients may have hampered our ability to detect this association. Moreover, we should mention that the relation between the volume of the structure and the performance of the automated DGM segmentation methods could also result from the artifact that the performance is dependent on the volume (greater volume could result in higher DSC). Therefore, the positive relationship should be studied in more detail for a better understanding of this effect.

While the accuracy of the segmentation was the most important focus of the present work, the accuracy of the resulting volumes may be considered at least equally important from a clinical viewpoint. Here to, systematic differences were observed between the automated methods and the reference measurements. We also saw a difference between the volumes obtained from different methods for each structure separately. The automated methods underestimated volumes of caudate and putamen while the volume of the thalamus was generally overestimated. This difference could be related to the different anatomical definitions used in the manual standard and the automated methods. One specific example is the question of whether the lateral geniculate nuclei bodies should be included or excluded when segmenting the thalamus [32]. The differences for the structures could also be indirectly related to disease effects: due to the anatomical location of the structures, some brain regions are more prone to contain lesions than others or could be more affected by regional atrophy (both of which could impair DGM segmentation). A study with more patients and a more diverse lesion load could give more insight if the automated method performs differently on DGM structures. Moreover, a study on spatial patterns on the DGM structures could also give more insight into the performance of the methods.

It has been suggested that filling lesions increases the accuracy of total GM segmentation, and we also expected an improvement of DGM segmentation after lesions filling [6, 30, 33]. However, we measured no difference in the performance of the automated method compared to manual segmentation after filling lesions. This is similar to the results reported in 2014 by Popescu et al. for filling with FLS-lesion filling and LEAP and segmentation with FSL-First for multiple DGM structures (e.g. thalamus, putamen, caudate nucleus, brainstem) (7). Therefore, it seems that lesion filling increases the accuracy of total GM segmentation, however, it does not increase the accuracy of DGM segmentation. Our hypothesis on this is that an underlying factor such as regional atrophy or GM lesion load or a combination of the pathology aspects (e.g. lesion load, atrophy, NAWM, diffusion damage) could be a cause. It should be mentioned that the lesions were manually outlined on either T2/PD images or FLAIR images, potentially leading to differences in the lesion segmentations that could have affected our results. However, after dividing the group into two different sets the same effects were visible as in the complete group, though less significant, as expected for the smaller group sizes.

Moreover, we suggest a study on the effect of WM-GM contrast-to-noise ratio which might cause this effect. As MS pathology affects both the WM and GM, resulting in more variation in the WM and GM signal, it is possible that the WM-GM contrast ratio is changed. Ratio could be changed due to iron change or damage of WM and/or GM [34]. Westlye et al. (2009) showed in Alzheimer’s Disease (AD) that cortical thickness in subjects with regionally reduced tissue contrast was overestimated compared to subjects without reduces tissue contrast. They indicate that the overestimation is related to alterations in myelin density and water compartment close to the WM. Moreover, adjusting for local variability in tissue contrast could correct the overestimation [35]. Therefore, further studies should investigate this in MS and, moreover, other possible causes (e.g. diffuse signal changes, the effect of image processing) should be investigated as well.

Furthermore, as it is important to have segmentation with accurate spatial level for correct localization and shape, future research could take are more in-depth approach regarding shape analysis e.g. quantitative vertex displacement analysis [36]. This analysis enables the finding of vertices which have a significantly different shape from the reference and would be of added value in unraveling why some packages are outperforming others.

In conclusion, the performance of four state-of-the-art automated DGM segmentation method is impaired in MS, which warrants caution in interpreting DGM volumes both in group studies and in individual patients. Poorer accuracy was associated with higher WM lesion load and smaller global-local brain volumes, but the mechanism is not yet understood. Remarkably, the impaired performance was not improved by lesion-filling. More research is needed to understand the underlying causes of reduced accuracy and then eliminate their effects.