Room: AAPM ePoster Library
Purpose: Incident learning systems are amassing reports both within institutions and nationally/internationally. As these data repositories grow, the unstructured data must be analyzed in a way that is scalable. Here we develop a novel method for categorizing reports that applies an unsupervised clustering algorithm to the narrative text.
Methods: We analyzed 6,430 near-miss incident reports from a single institution. The text of each report was tokenized using the tidytext package in R. For each report and each token, we calculated the term frequency-inverse document frequency (TFIDF), a standard metric used in natural language processing that quantifies how frequent a token is within a report and how unique it is among reports. After generating a numeric matrix of TFIDF values, we applied K-means clustering to group reports, using K = 30 clusters, 5 starting values, and a maximum of 100 iterations per starting value.
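The pipeline described in the Methods can be illustrated with a minimal sketch. It is written in Python with only the standard library, not in R with tidytext as in the study, and it is not the authors' code. The function names (`tokenize`, `tfidf_matrix`, `kmeans`), the tokenization rule, and the four toy reports are assumptions made for illustration. TFIDF is computed with one common definition, (token count / tokens in report) × ln(number of reports / number of reports containing the token), which may differ in detail from what the R workflow computed.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_matrix(reports):
    """Return (vocab, rows); rows[i][j] is the TFIDF of vocab[j] in report i."""
    counts = [Counter(tokenize(r)) for r in reports]
    vocab = sorted(set().union(*counts))
    n_docs = len(reports)
    # Number of reports in which each token appears at least once.
    doc_freq = Counter(tok for c in counts for tok in c)
    rows = []
    for c in counts:
        total = sum(c.values())
        rows.append([
            (c[tok] / total) * math.log(n_docs / doc_freq[tok]) if total else 0.0
            for tok in vocab
        ])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_starts=5, max_iter=100, seed=0):
    """K-means with several random starts; keep the lowest-cost run."""
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _ in range(n_starts):
        # Each start draws k distinct reports as the initial centers.
        centers = [list(r) for r in rng.sample(rows, k)]
        labels = None
        for _ in range(max_iter):
            new_labels = [
                min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows
            ]
            if new_labels == labels:
                break  # assignments stopped changing
            labels = new_labels
            for j in range(k):
                members = [r for r, lab in zip(rows, labels) if lab == j]
                if members:  # an empty cluster keeps its previous center
                    centers[j] = [sum(col) / len(members) for col in zip(*members)]
        cost = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost


# Hypothetical toy reports; the study used 6,430 reports and K = 30.
reports = [
    "wrong patient chart opened at console",
    "bolus missing at setup",
    "chart opened for wrong patient",
    "setup done with bolus missing",
]
vocab, rows = tfidf_matrix(reports)
labels, cost = kmeans(rows, k=2, n_starts=5, max_iter=100)
print(labels, cost)
```

The defaults `n_starts=5` and `max_iter=100` mirror the settings stated in the Methods. The lowest total within-cluster squared distance is used here to choose among starts, which is an assumption about how the best run is selected.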
Results: The K-means clustering algorithm produced meaningful clusters of reports, though this initial analysis also yielded two large clusters that did not define narrow topics. Identification of tokens with high TFIDF values within these clusters suggests that removing certain low-value tokens (e.g., numerical quantities) before clustering could improve the results. Further development is focusing on token refinement and on optimizing the starting values and the number of iterations per starting value.
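The two refinement steps mentioned in the Results, inspecting high-TFIDF tokens within a cluster and dropping numerical tokens before clustering, might look roughly like the following Python sketch. The helper names (`drop_numeric_tokens`, `top_tokens_by_cluster`), the regular expression for what counts as numeric, and the choice to rank tokens by mean TFIDF within a cluster are all assumptions for illustration, not the authors' procedure.

```python
import re
from collections import defaultdict

# Assumed definition of a "numerical quantity": digits with optional
# decimal or thousands separators, e.g. "180", "2.5", "6,430".
NUMERIC = re.compile(r"^\d+([.,]\d+)*$")


def drop_numeric_tokens(tokens):
    """Remove purely numeric tokens; mixed tokens such as '6mv' are kept."""
    return [t for t in tokens if not NUMERIC.match(t)]


def top_tokens_by_cluster(vocab, rows, labels, n=5):
    """For each cluster, list up to n tokens ranked by mean TFIDF."""
    members = defaultdict(list)
    for row, lab in zip(rows, labels):
        members[lab].append(row)
    top = {}
    for lab, group in members.items():
        # Column-wise mean of the TFIDF rows assigned to this cluster.
        means = [sum(col) / len(group) for col in zip(*group)]
        ranked = sorted(zip(vocab, means), key=lambda p: p[1], reverse=True)
        top[lab] = [tok for tok, score in ranked[:n] if score > 0]
    return top


print(drop_numeric_tokens(["gantry", "180", "2.5", "cgy", "6mv"]))
# -> ['gantry', 'cgy', '6mv']

# Hypothetical TFIDF rows and cluster labels for three reports.
vocab = ["a", "b", "c"]
rows = [[0.0, 0.4, 0.0], [0.0, 0.2, 0.1], [0.3, 0.0, 0.0]]
labels = [0, 0, 1]
print(top_tokens_by_cluster(vocab, rows, labels, n=2))
```

Filtering would be applied to each report's token list before the TFIDF matrix is built, so that numeric tokens cannot dominate a cluster.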
Conclusion: The method presented here offers a novel means of categorizing and analyzing near-miss incident reports. Because it requires minimal human intervention, it is potentially more reproducible and scalable to large data sets and across systems. The ability to categorize safety reports has many applications, including identifying high-priority quality gaps, quantifying trends over time, and comparing across systems.