Are radiologists bad teachers for AI algorithms? – Differences in interobserver variability between consensus-defined labelling and free labelling of the NIH Chestxray14 dataset

Oral Presentation at the European Congress of Radiology, Vienna, 2019


To assess differences in interobserver variability before and after a consensus-based definition of the NIH Chestxray14 dataset labels.

Methods and Materials

We randomly extracted 800 x-rays from the NIH Chestxray14 dataset. They were read by three radiologists with more than ten years’ experience. Of the 14 NIH labels, atelectasis, consolidation and pneumonia were grouped under ‘opacity’. The other labels were used ‘as is’. The study was divided into two parts. In the first part, the radiologists assigned labels to 400 x-rays based on their prior domain knowledge. In the second part, the remaining 400 cases were labelled after the label definitions had been agreed by consensus. Interobserver variability was assessed with Krippendorff’s alpha coefficient, which corrects for chance agreement, and interpreted using the Fleiss bounds.
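The label grouping and agreement measure described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each label is scored as a present/absent (nominal) rating per reader per image, and the function names and the toy readings are hypothetical.

```python
from collections import Counter
from itertools import permutations

# Labels the abstract groups under 'opacity'.
OPACITY_GROUP = {"atelectasis", "consolidation", "pneumonia"}


def merge_opacity(labels):
    """Map one reader's set of NIH labels onto the study's label set."""
    return {"opacity" if lab in OPACITY_GROUP else lab for lab in labels}


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list of ratings per image, one rating per reader.
    Images with fewer than two ratings are skipped.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Every ordered pair of ratings from different readers, weighted 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one category was ever used
    return 1.0 - observed / expected


def per_label_alpha(readings, label):
    """readings: one list per image, holding one label set per reader."""
    units = [[label in merge_opacity(r) for r in image] for image in readings]
    return krippendorff_alpha_nominal(units)


# Hypothetical toy readings: 3 images, 3 readers each (not study data).
toy_readings = [
    [{"pneumonia"}, {"atelectasis"}, {"consolidation"}],
    [{"effusion"}, {"effusion"}, set()],
    [{"no finding"}, {"no finding"}, {"no finding"}],
]
print(per_label_alpha(toy_readings, "opacity"))
```

In the toy data the three readers pick different original labels for the first image, but after grouping they all agree on ‘opacity’, which is the effect the grouping is meant to have.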


In general, interobserver variability did not differ between free and consensus labelling. Opacity, pneumothorax, effusion, nodule mass and ‘no finding’ were in the ‘fair to good’ (0.41-0.75) range, while infiltration, emphysema, fibrosis and pleural thickening were in the ‘poor’ (<0.40) bound in both tests. Interestingly, effusion and cardiomegaly labelling worsened to ‘poor’ post-consensus. Notably, no label was in the ‘very good’ (>0.75) category.
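The bounds used to interpret alpha can be written as a small lookup. This is a sketch under one assumption: the abstract gives ‘poor’ as <0.40 and ‘fair to good’ as 0.41-0.75, so a value of exactly 0.40 is assigned here to ‘fair to good’. The function name is hypothetical.

```python
def fleiss_bound(alpha):
    """Map an agreement coefficient to the bounds quoted in the abstract."""
    if alpha > 0.75:
        return "very good"
    if alpha >= 0.40:  # assumption: the 0.40 boundary counts as 'fair to good'
        return "fair to good"
    return "poor"


# Illustrative values only, not results from the study.
for value in (0.82, 0.55, 0.31):
    print(value, fleiss_bound(value))
```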

Conclusion: Our assessment of the Chestxray14 dataset suggests that no label has ‘very good’ agreement in either free or consensus-based labelling. There is evidence that specific labels might require stricter expert definitions. AI training is therefore advised only on labels with alpha close to 0.75 (for example, ‘Normal’ vs ‘Abnormal’, or ‘pneumothorax’) when a purely image-based interobserver metric is used as ground truth.