Deep Learning in Medical Physics: Reality Or Noise?

G Valdes*, M Romero-Calvo, T Solberg, Y Interian, UCSF Comprehensive Cancer Center, San Francisco, CA

Presentations

(Monday, 7/15/2019) 4:30 PM - 5:30 PM

Room: Exhibit Hall | Forum 2

Purpose: The application of deep learning (DL) algorithms to small datasets (hundreds of data points) has become common in medical physics. Transfer learning is often cited to justify the small number of data points. In this work, we perform an in-depth study of the performance of DL as a function of dataset size.

Methods: 112,120 frontal-view x-ray images from the NIH ChestXray14 dataset were used in our analysis. We first studied two real tasks: unbalanced multi-label classification of 14 diseases, and binary classification of pneumonia vs. non-pneumonia. The dataset was randomly split into training, validation, and testing sets (69%, 8%, 23%). Using PyTorch, a popular convolutional neural network (CNN), DenseNet121, was trained (with and without transfer learning) on different numbers of data points for both tasks (N=50 to N=77,880). Additionally, to study the behavior of CNNs under known and controlled conditions, we generated simple functions that mapped the images to different outcomes and attempted to recover them. Linear functions of radiomics features (79 in total) and small known CNNs were chosen as the generators. The area under the curve (AUC) and balanced accuracy on the test set were reported, with confidence intervals calculated using the bootstrap.
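The abstract does not say which bootstrap variant was used for the confidence intervals. As a minimal sketch, assuming a percentile bootstrap that resamples test cases with replacement, the two reported metrics could be computed as follows. The function names (`auc`, `balanced_accuracy`, `bootstrap_ci`), the 0.5 decision threshold, and the defaults for `n_boot`, `alpha`, and `seed` are illustrative assumptions, not details from the study.

```python
import random


def auc(labels, scores):
    """AUC via pairwise comparison (Mann-Whitney U); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels, scores, threshold=0.5):
    """Mean of sensitivity and specificity at a fixed threshold (assumed 0.5)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("balanced accuracy needs both classes present")
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return 0.5 * (tp / n_pos + tn / n_neg)


def bootstrap_ci(labels, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` on a binary test set."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        # Resample test cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if sum(ys) in (0, n):
            continue  # skip resamples that contain only one class
        ss = [scores[i] for i in idx]
        stats.append(metric(ys, ss))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

For example, `bootstrap_ci(y_test, y_score, auc)` would return an approximate 95% interval for the test-set AUC. With very small test sets the percentile interval can be wide and unstable, which is consistent with the abstract's concern about small samples.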

Results: In the multi-label problem, DenseNet121 needed at least 1,600 patients to be comparable to, and 10,000 to outperform, radiomics-based logistic regression. In classifying pneumonia vs. non-pneumonia, both CNN and radiomics-based methods performed poorly when N < 2,000. This was true regardless of whether transfer learning was used. The same behavior was observed for the simpler problems in which the generator functions were known.

Conclusion: Our experiments on both simple controlled tasks and real tasks suggest that DL performs poorly on small datasets, even when transfer learning is used. Therefore, it is unlikely that results reported on datasets of similar size will generalize well, and such attempts should be met with skepticism.

Keywords

Maximum Likelihood Estimation

Taxonomy

IM- Dataset analysis/biomathematics: Machine learning
