Room: Exhibit Hall | Forum 4
Purpose: Natural language processing has proven useful for extracting heterogeneous, unorganized information from clinical notes into a structured radiotherapy registry to support further analysis. Unfortunately, notes often lack readily accessible labels for treatment site and note type. This work introduces a multi-class text classification method to facilitate information aggregation for this process.
Methods: We formulated note categorization as a supervised multi-class text classification problem. Text from each note was first converted into a numerical representation using a bag-of-words model with term frequency-inverse document frequency (tf-idf) weighting, yielding a 15000-D feature vector per note. The most relevant unigrams and bigrams were identified using the chi-squared test. To predict note type from this feature vector, logistic regression (LR), a multinomial naïve Bayes classifier (MNBC), a linear support vector classifier (LinearSVC), and random forests (RF) were investigated as candidate classification models. Cross-validation was used to select the best classifier, whose performance was then further tested and assessed.
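The feature-extraction step can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed idf formula, the L2 normalization, the function names, and the three toy notes are all assumptions made for illustration, and the vocabulary is left uncapped rather than limited to 15000 terms.

```python
import math
from collections import Counter


def ngrams(text):
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_vectors(notes):
    """Return (vocabulary, list of L2-normalized tf-idf vectors)."""
    # Bag of words: per-note counts of each unigram and bigram.
    counts = [Counter(ngrams(note)) for note in notes]
    vocabulary = sorted(set().union(*counts))
    n_notes = len(notes)
    # Document frequency: number of notes that contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed idf (an assumed variant; the abstract does not specify one).
    idf = {
        t: math.log((1 + n_notes) / (1 + doc_freq[t])) + 1.0
        for t in vocabulary
    }
    vectors = []
    for c in counts:
        raw = [c[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocabulary, vectors


# Hypothetical toy notes, invented for this example only.
notes = [
    "prostate imrt treatment summary",
    "prostate weekly on treatment visit",
    "lung sbrt treatment summary",
]
vocabulary, vectors = tfidf_vectors(notes)
```

In this toy corpus, "treatment" appears in every note and so receives a lower weight than "imrt", which appears in only one. Vectors of this kind would then be passed to the candidate classifiers for comparison.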
Results: LinearSVC was identified as the best classifier, yielding an estimated mean accuracy of 0.87. Close inspection of misclassified notes revealed content spanning more than one category, which warrants human intervention.
Conclusion: We have demonstrated the efficacy of our approach for automatically associating site/type categories with clinical notes. The same approach can be used to categorize text by treatment modality, technique, etc., both as an information aggregation tool and as a quality-assurance measure for note content consistency.
Classifier Design, Feature Extraction
TH- Dataset analysis/biomathematics: Machine learning techniques