FuzzyPACS: Linking Large Unorganised Image and Report Databases for Development and Validation of Deep Learning Algorithms

Oral Presentation at the European Congress of Radiology, Vienna, 2019


Development and validation of Deep Learning (DL) algorithms for medical imaging requires access to large, organised datasets of images and their corresponding reports. Currently, most medical imaging data in the world is unorganised, and images and text reports must be linked manually. An approach is presented for linking patients' medical images and reports where no unique identifier for linking them exists.

Methods and Materials

A DICOM image database of 311,694 studies and a separate MySQL database of 296,938 reports needed to be matched at study level. No unique identifier existed to link the two databases, and they overlapped only partially, so not all reports had matching images. Additionally, patient names were entered inexactly and in varied formats in the two databases, making direct matching impossible. The FuzzyWuzzy Python library, which implements fuzzy string matching, a technique that estimates text similarity from the Levenshtein distance between strings, was used to match patient names across the two databases after filtering by date and modality. Four fuzzy matching techniques (simple, partial, token-set and token-sort ratios) were evaluated.
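The filter-then-match procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the four scorers are simplified re-implementations built on Python's standard-library difflib (a stand-in for Levenshtein-based scoring, so scores may differ from FuzzyWuzzy's), and the record fields (report_id, study_id, name, date, modality), the sample records and the reading of "95% match confidence" as a score cutoff of 95 are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of the two full strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a, b):
    """Best score of the shorter string against equal-length windows of the longer."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 0
    n = len(short)
    return max(simple_ratio(short, long_[i:i + n])
               for i in range(len(long_) - n + 1))


def _tokens(s):
    """Lower-case alphanumeric tokens, sorted alphabetically."""
    return sorted(re.findall(r"[a-z0-9]+", s.lower()))


def token_sort_ratio(a, b):
    """Compare the strings after sorting their tokens, so word order is ignored."""
    return simple_ratio(" ".join(_tokens(a)), " ".join(_tokens(b)))


def token_set_ratio(a, b):
    """Compare shared tokens against each full token set, so extra tokens cost little."""
    ta, tb = set(_tokens(a)), set(_tokens(b))
    if not ta or not tb:
        return 0
    common = " ".join(sorted(ta & tb))
    full_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    full_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(simple_ratio(common, full_a),
               simple_ratio(common, full_b),
               simple_ratio(full_a, full_b))


def match_reports(reports, studies, scorer=token_set_ratio, threshold=95):
    """Link each report to its best-scoring study with the same date and modality."""
    links = {}
    for report in reports:
        # Date and modality filters narrow the candidates before name matching.
        candidates = [s for s in studies
                      if s["date"] == report["date"]
                      and s["modality"] == report["modality"]]
        if not candidates:
            continue
        best = max(candidates, key=lambda s: scorer(report["name"], s["name"]))
        if scorer(report["name"], best["name"]) >= threshold:
            links[report["report_id"]] = best["study_id"]
    return links


# Hypothetical records; real data would come from the DICOM headers and MySQL.
studies = [
    {"study_id": "S1", "name": "JOHN DOE", "date": "2018-03-01", "modality": "CT"},
    {"study_id": "S2", "name": "JANE ROE", "date": "2018-03-01", "modality": "CT"},
]
reports = [
    {"report_id": "R1", "name": "Doe, John A", "date": "2018-03-01", "modality": "CT"},
    {"report_id": "R2", "name": "Doe, John A", "date": "2018-03-02", "modality": "CT"},
]

for scorer in (simple_ratio, partial_ratio, token_sort_ratio, token_set_ratio):
    print(scorer.__name__, scorer("Doe, John A", "JOHN DOE"))

links = match_reports(reports, studies)
print(links)
```

In this toy example the name differs in case, word order and an extra initial. Only the token-set scorer reaches the cutoff for report R1, and report R2 is left unlinked because no study shares its date.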


The simple, partial, token-set and token-sort ratios matched 4.56%, 46.45%, 57.37% and 7.97% of reports, respectively, at 95% match confidence. The token-set ratio, which had the highest match percentage, matched 170,336 reports to their corresponding studies.


Fuzzy matching is a promising technique for merging independent datasets without unique identifiers. It can save thousands of man-hours in building the linked datasets that are critical for development and validation of DL algorithms.