Extracting Heterogeneously Formatted Clinical Data From DICOM Secondary Capture Using OCR

E Somasundaram*, S Brady, H Li, L He, T Maloney, J Dudley, J Dillman, Cincinnati Childrens Hospital Med Ctr, Cincinnati, OH

E Somasundaram

Presentations

(Sunday, 7/14/2019)

Room: ePoster Forums

Purpose: To automatically extract heterogeneously formatted clinical data from DICOM secondary capture images stored in PACS, using optical character recognition (OCR), for model training in a machine learning environment.

Methods: An IRB-approved waiver of consent was obtained for this retrospective study. Patient DICOM data containing secondary capture images of slice-wise liver stiffness values from MR elastography studies were processed for extraction. All images within a patient MR study were processed sequentially until the secondary capture images with the desired tabulated data were detected. The images with tabulated data were processed using various noise removal, thresholding, and convolution operations to detect contours of objects that resemble the shape of a table. From each contour, image patches were created, and depending on the size of the image patch, further processing steps such as grid line removal, image resizing, and text sharpening were performed. The processed image was then sent through an OCR engine, and the output string values were scanned for certain text landmarks using regular expression matching to identify the desired table, after which the required data were extracted. The mean of the extracted elastography stiffness values was compared to reported values data-mined from radiologist reports. The entire application was built using Python libraries including OpenCV, pytesseract, and re.
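As a rough illustration of the landmark-matching step described in Methods, the Python sketch below scans an OCR output string for a table header and then reads the rows beneath it. The header pattern, the two-column row layout, the function name, and the sample text are all assumptions made for illustration; the abstract does not specify which landmarks or table formats were used.

```python
import re
from statistics import mean

# Hypothetical landmark: a header line containing "Slice" followed by "Stiffness".
HEADER = re.compile(r"slice.*stiffness", re.IGNORECASE)
# Hypothetical row layout: an integer slice index followed by one numeric value.
ROW = re.compile(r"^\s*(\d+)\s+(\d+(?:\.\d+)?)\s*$")


def extract_stiffness(ocr_text):
    """Return the per-slice values found directly after the table header."""
    values = []
    in_table = False
    for line in ocr_text.splitlines():
        if not in_table:
            # Skip everything until the landmark header is seen.
            in_table = bool(HEADER.search(line))
            continue
        match = ROW.match(line)
        if match:
            values.append(float(match.group(2)))
        elif values:
            break  # the first non-row line after the data ends the table
    return values


# Made-up text standing in for the string an OCR engine might return.
sample = """Patient summary
Slice  Stiffness (kPa)
1  2.31
2  2.45
3  2.40
Mean of above reported separately
"""
values = extract_stiffness(sample)
print(values, round(mean(values), 2))
```

Anchoring on a header landmark, rather than on fixed line positions, is one way to tolerate tables that appear at different places in the OCR output; real secondary captures would likely need several patterns to cover the different layouts.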

Results: The algorithm was used to extract elastography liver stiffness values from 834 patient studies acquired over a span of 3 years, the results of which were saved as image captures in the DICOM series of the patient folder in PACS. The mean of the extracted stiffness values, compared with the values in the radiologist reports, was 100% accurate in 60% of the cases, 98% accurate in 20% of the cases, and 95% accurate in the remaining 20% of the cases.
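The abstract does not state how per-case accuracy was computed. One plausible reading, shown here purely as an assumption, is percent agreement between the extracted mean and the reported mean, defined as 100 × (1 − relative error). The function name and the example values are hypothetical.

```python
def percent_accuracy(extracted_mean, reported_mean):
    """Percent agreement, assumed here to be 100 * (1 - relative error)."""
    if reported_mean <= 0:
        raise ValueError("reported_mean must be positive")
    relative_error = abs(extracted_mean - reported_mean) / reported_mean
    return 100.0 * (1.0 - relative_error)


# Made-up example: extracted mean 2.45 against a reported mean of 2.50.
print(round(percent_accuracy(2.45, 2.50), 1))
```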

Conclusion: An automatic algorithm to extract tabulated clinical data embedded in DICOM images with highly heterogeneous formatting has been developed, and the resulting accuracy is ≥ 95% for all images saved to PACS.

Keywords

PACS, Computer Software, Computer Vision

Taxonomy

IM- Dataset analysis/biomathematics: PACS

