
Effectiveness of Simple Data Imputation for Missing Feature Values in Binary Classification

A Chatterjee1*, H Woodruff2, M Lobbes2, Y van Wijk2, M Beuque2, J Seuntjens1, P Lambin2, (1) McGill University, Montreal, QC, CA, (2) Maastricht University Medical Centre, Maastricht, NL


(Sunday, 7/12/2020)   [Eastern Time (GMT-4)]

Room: AAPM ePoster Library

Purpose: Outcome prediction is affected by incomplete predictor data. We hypothesized that simple data imputation methods can preserve performance for a variety of machine learning (ML) classifiers.

Methods: The dataset comprised 525 contrast-enhanced mammograms (250 benign, 275 malignant) and 3259 derived radiomic features; 500 cases were randomly selected to create a balanced set, then randomly split 1:1 into training and testing sets; this was repeated ten times. Six classifiers were used: Naïve Bayes, Linear Discriminant (LD), KNN, SVM (linear kernel), Decision Tree, and Random Forest (RF). Three feature selection methods were investigated; imputation experiments were performed with the method most resistant to overtraining. In each experiment, the missing fraction (MF) of training-set values was set to a value common to all model features (MF = 0.1, 0.2, …, 0.5). Testing sets did not have missing data. For each MF, the incomplete dataset was created 100 times to obtain mean performance metrics, which were then further averaged over the ten training-testing partitions. Two imputation methods were used: replacing a missing value with the feature median, and randomly drawing from the feature's existing values.
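The missingness-injection and imputation procedure described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the NaN encoding of missing values, and the toy data are assumptions for the sketch.

```python
import numpy as np

def inject_missing(X, mf, rng):
    """Set a fraction `mf` of each feature's (column's) training values to NaN,
    so that the missing fraction is common to all features, as in the abstract."""
    X = X.astype(float).copy()
    n, p = X.shape
    k = int(round(mf * n))  # number of missing entries per feature
    for j in range(p):
        rows = rng.choice(n, size=k, replace=False)
        X[rows, j] = np.nan
    return X

def impute_median(X):
    """Imputation method 1: replace each NaN with its feature's median."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmedian(col)
    return X

def impute_random(X, rng):
    """Imputation method 2: replace each NaN with a value drawn at random
    from that feature's observed (non-missing) values."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        miss = np.isnan(col)
        col[miss] = rng.choice(col[~miss], size=miss.sum(), replace=True)
    return X

# Toy example: 8 cases x 3 features, MF = 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
X_miss = inject_missing(X, 0.5, rng)      # 4 NaNs per feature
X_med = impute_median(X_miss)             # complete again
X_rand = impute_random(X_miss, rng)       # complete again
```

In the study this injection-plus-imputation step would be repeated 100 times per MF value and averaged, then averaged again over the ten training-testing partitions.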

Results: On average, default LASSO chose 24 features, modified LASSO chose 11, and correlation-based feature selection, the method most resistant to overtraining, chose 7. For all ML tools, test set performance up to MF=0.5 was virtually unaffected relative to MF=0. In contrast, training set performance on imputed data suffered relative to MF=0, but was better for median imputation than for random selection. For LD, KNN, and RF at MF=0.5 with median imputation, training accuracy dropped by <5% relative to MF=0. For RF, unlike the other tools, the choice of imputation method mattered less for training set performance.

Conclusion: Median imputation for incomplete training data ensured that testing set performance was unaffected. RF is likely to be the ideal classifier for imputed training data. Findings need to be verified in other clinical datasets.
