Room: Track 2
Purpose: Deep learning segmentation models implicitly learn to predict the presence of a structure based on its overall prominence in the training dataset. This phenomenon is observed and accounted for in deep learning applications such as natural language processing (NLP) but is often neglected in the segmentation literature. The purpose of this work is to call attention to a detail of deep-learning-based segmentation that is often overlooked: post-processing with intensity-based thresholding.
Methods: We first chose an architecture and training procedure representative of standard deep-learning-based segmentation in medical physics: a 2-D U-Net with five down-sampling blocks, 20% dropout, and batch normalization. We trained the model independently on 10 structures from the Cancer Imaging Archive's Head-Neck-Radiomics-HN1 dataset. Each model's sensitivity to post-processing with intensity-based thresholding was analyzed by varying the threshold hyperparameter, and model performance was assessed with the Dice coefficient. We quantified the prominence of each structure in the training data as the proportion of ones in its Boolean target masks, and investigated the dependence of the optimum threshold on structure prominence.
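The threshold sweep described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names (`dice`, `sweep_thresholds`) and the toy probability map are assumptions introduced here, standing in for the model's sigmoid output and ground-truth mask.

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice coefficient between two Boolean masks."""
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def sweep_thresholds(prob_map, target, thresholds):
    """Binarize a probability map at each threshold and score it against the target."""
    return {t: dice(prob_map >= t, target) for t in thresholds}

# Toy stand-in for a model's sigmoid output: a disc target with a soft,
# radially decaying "probability map" around it.
yy, xx = np.mgrid[:64, :64]
r2 = (xx - 32) ** 2 + (yy - 32) ** 2
target = r2 < 10 ** 2
prob = np.clip(1.0 - np.sqrt(r2) / 20.0, 0.0, 1.0)

# Dense sampling of thresholds, as recommended in the Conclusion.
thresholds = np.round(np.arange(0.05, 1.0, 0.05), 2)
scores = sweep_thresholds(prob, target, thresholds)
best_t = max(scores, key=scores.get)
```

In practice the sweep would be run on a held-out validation set, and `best_t` carried forward as the post-processing threshold at test time.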
Results: We observe significant decreases in apparent model performance under the conventional 0.5 threshold. Deviations of as little as ±0.02 from the optimum threshold induce a median reduction in Dice score of 3.1% for our models. There is statistical evidence to support a weak correlation between the optimum threshold and the "true" fraction in binary mask targets.
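The prominence measure from the Methods, and the kind of correlation check reported above, can be sketched as below. The per-structure numbers here are synthetic, seeded values for illustration only; they are not the study's data, and the linear relationship is an assumption of the toy example.

```python
import numpy as np

def prominence(masks: np.ndarray) -> float:
    """Proportion of ones across a stack of Boolean target masks."""
    return float(masks.mean())

# Toy mask stack: 4 slices of 32x32 with a small rectangular foreground.
masks = np.zeros((4, 32, 32), dtype=bool)
masks[:, 14:18, 12:25] = True  # 52 foreground voxels per slice

# Hypothetical per-structure values for 10 structures (synthetic, seeded):
# prominences mimic small foreground fractions; optimum thresholds are
# generated with a weak linear dependence on prominence plus noise.
rng = np.random.default_rng(0)
prominences = rng.uniform(1e-4, 5e-2, size=10)
optimum_thresholds = 0.4 + 2.0 * prominences + rng.normal(0.0, 0.05, size=10)

# Pearson correlation between prominence and optimum threshold.
r = np.corrcoef(prominences, optimum_thresholds)[0, 1]
```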
Conclusion: Our results suggest that those practicing deep-learning-based contouring should consider their post-processing procedures as a potential avenue for improved performance. For intensity-based post-processing, we recommend a dense sampling of thresholds with performance assessment on a validation dataset.