Building Robust Machine Learning Models for Colorectal Cancer Risk Prediction

B Nartowt1*, G Hart2, W Muhammad3, Y Liang4, J Deng5, (1) Yale/New Haven Hospital, New Haven, CT, (2) Yale University, New Haven, CT, (3) Yale School of Medicine, Yale University, New Haven, CT, (4) Medical College of Wisconsin, Milwaukee, WI, (5) Yale Univ. School of Medicine, New Haven, CT


(Tuesday, 7/16/2019) 10:00 AM - 10:30 AM

Room: Exhibit Hall | Forum 6

Purpose: Colorectal cancer (CRC) ranks third in both prevalence and mortality among cancers in the US. The United States Preventive Services Task Force (USPSTF) recommends screening for ages 50-75 and for anyone with a family history (FH) of CRC. However, CRC has grown more prevalent in ages 18-49 since 1974, and ages 50-75 currently make up a large demographic, so screening by age alone can miss younger cases while covering a very large population. The aim of this study is therefore to build robust machine learning models for more efficient risk stratification.

Methods: The National Health Interview Survey (NHIS) and the Prostate, Lung, Colorectal, and Ovarian (PLCO) Screening Trial datasets contain 2,379 respondents whose first cancer was CRC and 280,669 never-cancer respondents. These data were used to train and cross-test five machine learning models: artificial neural network (ANN), naive Bayes (NB), linear discriminant analysis (LDA), support vector machine (SVM), and decision tree (DT). The models' predictors were age, body-mass index, smoking habits, Hispanic ethnicity, sex, race, and incidence of joint-aching/arthritis, emphysema, strokes, hypertension, coronary heart disease, myocardial infarction, liver comorbidity, diabetes, ulcers, and bronchitis. After training and cross-testing, the model with the highest concordance was used to stratify CRC risk into low-, medium-, and high-risk groups.
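As an illustration of the two quantities named above, the following is a minimal sketch of how a concordance (the fraction of case/control pairs in which the case receives the higher score) and a three-tier risk assignment could be computed. It is not the authors' implementation: the function names, the toy scores, and the cut-offs of 0.3 and 0.7 are assumptions chosen only for the example, since the abstract does not state the thresholds used.

```python
from itertools import product


def concordance(scores, labels):
    """Fraction of (case, control) pairs ranked correctly; ties count as 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    total = 0.0
    for case_score, control_score in product(cases, controls):
        if case_score > control_score:
            total += 1.0
        elif case_score == control_score:
            total += 0.5
    return total / (len(cases) * len(controls))


def stratify(score, low_cut=0.3, high_cut=0.7):
    """Map a model score in [0, 1] to a risk group.

    The cut-offs are hypothetical placeholders, not values from the study.
    """
    if score < low_cut:
        return "low"
    if score < high_cut:
        return "medium"
    return "high"


# Toy data for illustration only (1 = CRC, 0 = never-cancer).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_concordance = concordance(toy_scores, toy_labels)
toy_groups = [stratify(s) for s in toy_scores]
```

In a comparison like the one described, the same concordance function would be applied to each model's held-out scores, and the stratification would use only the scores of the best-ranked model.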

Results: Among the five machine learning models, the ANN had the highest concordance, 0.82 ± 0.10, with a sensitivity of 0.73 ± 0.03, specificity of 0.78 ± 0.04, positive predictive value of 0.17 ± 0.04, and negative predictive value of 0.72 ± 0.04. The quoted uncertainties combine the standard error with the error from cross-testing. Compared to USPSTF guidelines, the ANN misclassifies a lower percentage of the CRC population as low-risk and a lower percentage of the never-cancer population as high-risk.
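For readers less familiar with these four metrics, the sketch below shows how each follows from the counts of a binary confusion matrix. The function name and the counts are invented for the example and are not taken from the study's data.

```python
def screening_metrics(tp, fp, tn, fn):
    """Return the four standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases flagged
        "specificity": tn / (tn + fp),  # share of non-cases cleared
        "ppv": tp / (tp + fp),          # share of flagged who are cases
        "npv": tn / (tn + fn),          # share of cleared who are non-cases
    }


# Toy counts for illustration only.
toy_metrics = screening_metrics(tp=8, fp=20, tn=80, fn=2)
```

Sensitivity and specificity depend only on how the model treats cases and non-cases separately, while the predictive values also depend on the proportion of cases in the evaluated population.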

Conclusion: A multi-parameterized ANN was the top performer in scoring CRC risk based solely on personal health data. The trained ANN can be used to stratify individuals' CRC risk for more effective screening and intervention.


Keywords: Risk, ROC Analysis, Statistical Analysis


TH- Dataset analysis/biomathematics: Machine learning techniques
