Pneumonia Detection from Chest X-Rays

Project Overview

In this project, you will apply the skills that you have acquired in this 2D medical imaging course to analyze data from the NIH Chest X-ray Dataset and train a CNN to classify a given chest x-ray for the presence or absence of pneumonia. This project will culminate in a model that can predict the presence of pneumonia with human radiologist-level accuracy that can be prepared for submission to the FDA for 510(k) clearance as software as a medical device. As part of the submission preparation, you will formally describe your model, the data that it was trained on, and a validation plan that meets FDA criteria.

You will be provided with the medical images with clinical labels for each image that were extracted from their accompanying radiology reports.

The project will include access to a GPU for fast training of deep learning architecture, as well as access to 112,000 chest x-rays with disease labels acquired from 30,000 patients.

Pneumonia and X-Rays in the Wild

Chest X-ray exams are one of the most frequent and cost-effective types of medical imaging examinations. Deriving clinical diagnoses from chest X-rays can be challenging, however, even by skilled radiologists.

When it comes to pneumonia, chest X-rays are the best available method for diagnosis. More than 1 million adults are hospitalized with pneumonia and around 50,000 die from the disease every year in the US alone. The high prevalence of pneumonia makes it a good candidate for the development of a deep learning application for two reasons: 1) Data availability in a high enough quantity for training deep learning models for image classification 2) Opportunity for clinical aid by providing higher accuracy image reads of a difficult-to-diagnose disease and/or reduce clinical burnout by performing automated reads of very common scans.

The diagnosis of pneumonia from chest X-rays is difficult for several reasons:

The appearance of pneumonia in a chest X-ray can be very vague depending on the stage of the infection
Pneumonia often overlaps with other diagnoses
Pneumonia can mimic benign abnormalities

For these reasons, common methods of diagnostic validation performed in the clinical setting are to obtain sputum cultures to test for the presence of bacteria or viral bodies that cause pneumonia, reading the patient's clinical history and taking their demographic profile into account, and comparing a current image to prior chest X-rays for the same patient if they are available.

About the Dataset

The dataset provided to you for this project was curated by the NIH specifically to address the problem of a lack of large x-ray datasets with ground truth labels to be used in the creation of disease detection algorithms.

The data is mounted in the Udacity Jupyter GPU workspace provided to you, along with code to load the data. Alternatively, you can download the data from the kaggle website or official NIH website and run it locally. You are STRONGLY recommended to complete the project using the Udacity workspace since the data is huge, and you will need GPU to accelerate the training process.

There are 112,120 X-ray images with disease labels from 30,805 unique patients in this dataset. The disease labels were created using Natural Language Processing (NLP) to mine the associated radiological reports. The labels include 14 common thoracic pathologies:

Atelectasis
Consolidation
Infiltration
Pneumothorax
Edema
Emphysema
Fibrosis
Effusion
Pneumonia
Pleural thickening
Cardiomegaly
Nodule
Mass
Hernia

The biggest limitation of this dataset is that image labels were NLP-extracted so there could be some erroneous labels but the NLP labeling accuracy is estimated to be >90%.

The original radiology reports are not publicly available but you can find more details on the labeling process here.

Dataset Contents:

112,120 frontal-view chest X-ray PNG images in 1024*1024 resolution (under images folder)
Meta data for all images (Data_Entry_2017.csv): Image Index, Finding Labels, Follow-up #, Patient ID, Patient Age, Patient Gender, View Position, Original Image Size and Original Image Pixel Spacing.

Project Steps

1. Exploratory Data Analysis

This phase included analysis of patient demographics, disease distribution, comorbidities, and pixel-level intensity histograms. Key findings:

Age and gender distribution
X-ray view positions (PA/AP)
Pneumonia and non-pneumonia case counts
Disease co-occurrence and frequency per patient
Pixel intensity histograms for healthy and disease states

EDA Visualizations:

Disease distribution across the NIH dataset showing prevalence of all 14 disease categories. Note the severe class imbalance with Infiltration being most common (17.7%) and Hernia least common (0.2%). Pneumonia represents 1.26% of cases.

Top 10 diseases that co-occur with pneumonia. Infiltration (42.3%), Edema (23.8%), and Effusion (18.8%) are the most common co-occurring conditions. This multi-label nature explains why the model may produce false positives for other infiltrative processes.

Patient demographics and pneumonia prevalence across age groups. Top left: Age distribution by gender showing predominance of adult patients. Top right: Gender distribution (56% Male, 44% Female). Bottom left: Pneumonia cases by gender showing similar patterns. Bottom right: Pneumonia prevalence rate by age group, with highest rates in young populations.

Further EDA details and visualizations can be found in the Exploratory Data Analysis notebook.

Ground Truth

Labels extracted from radiology reports using NLP
Reports authored by board-certified radiologists
Binary and multi-label annotations
Limitations: label noise, inter-observer variability, not pathological confirmation

2. Building and Training Your Model

Data Split:

Patient-level stratified splitting (prevents data leakage)
No patient overlap between train/validation/test sets
Stratified by pneumonia status to maintain class balance

Training Dataset: NIH Chest X-ray Dataset (Clinical Center)

79,113 images, 21,563 patients
Pneumonia prevalence: 1.26%
Age: 1-120 years (mean: 46.9)
Gender: 56% Male, 44% Female
View: 60% PA, 40% AP/Lateral
Multi-label: 18.4% images have >1 disease

Validation Dataset:

16,370 images, 4,621 patients
Pneumonia prevalence: 1.36%
Age: mean 46.2 years
View: 60.2% PA

Model Architecture:

DenseNet121 base (pre-trained on ImageNet)
Two-stage transfer learning: Stage 1 (feature extraction, frozen base), Stage 2 (top 20% layers unfrozen for fine-tuning)
Custom head: 3-layer fully connected network with dropout
Output: Single neuron with sigmoid activation (binary classification)

graph TB
    Input["Input Image<br/>224×224×3"] --> DenseNet["DenseNet121 Base<br/>(ImageNet Pre-trained)"]
    
    subgraph DenseNet121["DenseNet121 Architecture (~7M params)"]
        Conv["Initial Conv(64) + MaxPool"]
        DB1["Dense Block 1<br/>6 layers, 256 features"]
        T1["Transition 1<br/>Conv + AvgPool"]
        DB2["Dense Block 2<br/>12 layers, 512 features"]
        T2["Transition 2<br/>Conv + AvgPool"]
        DB3["Dense Block 3<br/>24 layers, 1024 features"]
        T3["Transition 3<br/>Conv + AvgPool"]
        DB4["Dense Block 4<br/>16 layers, 1024 features"]
        
        Conv --> DB1 --> T1 --> DB2 --> T2 --> DB3 --> T3 --> DB4
    end
    
    DenseNet --> |7×7×1024<br/>Feature Maps| GAP["GlobalAveragePooling2D<br/>Output: 1024 features"]
    
    subgraph CustomHead["Custom Classification Head (~2.1M params)"]
        GAP --> FC1["Dense(1024, ReLU)"]
        FC1 --> Drop1["Dropout(0.5)"]
        Drop1 --> FC2["Dense(512, ReLU)"]
        FC2 --> Drop2["Dropout(0.3)"]
        Drop2 --> Output["Dense(1, Sigmoid)<br/>dtype=float32"]
    end
    
    Output --> Prob["Pneumonia Probability<br/>Range: 0-1"]
    
    style Input fill:#e1f5ff
    style DenseNet121 fill:#ffe1e1
    style CustomHead fill:#e1ffe1
    style Prob fill:#f5e1ff
    
    Note1["STAGE 1 (Epochs 1-17):<br/>All DenseNet121 frozen<br/>Train custom head only"]
    Note2["STAGE 2 (Epochs 18-32):<br/>Top 20% unfrozen<br/>(Dense Blocks 3-4)<br/>Fine-tune with lr=1e-5"]
    
    style Note1 fill:#fff4e1,stroke:#ff9800
    style Note2 fill:#fff4e1,stroke:#ff9800

DenseNet121-based architecture for pneumonia detection.

Training Details:

Augmentation: Horizontal flip, normalization
Batch size: 64 (initial), 32 (after OOM)
Optimizer: Adam
Class weighting to address severe imbalance (1.2% pneumonia prevalence)

Performance Assessment:

Metrics: AUC, F1, sensitivity, specificity
Best validation AUC: 0.6345
Threshold selection: Sensitivity-optimized (0.40)

2-Stage Transfer Learning Baseline Results:

Training and validation loss, accuracy, and AUC across 32 epochs (Stage 1: epochs 1-17 frozen, Stage 2: epochs 18-32 fine-tuned). Best validation AUC: 0.6345 at epoch 17. Receiver Operating Characteristic curve showing model discrimination ability. AUC: 0.6345 (validation), 0.6213 (test).

Experiments Overview:

Experiment 1A: Baseline DenseNet121 model, standard preprocessing and training parameters.
Experiment 1B: DenseNet121 with advanced data augmentation and modified training parameters.
Experiment 2A: Fine-tuned DenseNet121, selective layer freezing, best overall performance.
Experiment 2B: Further fine-tuning and hyperparameter optimization for improved robustness.

Experiment 1A, 1B, 2A, and 2B Results Comparison

MODEL PERFORMANCE COMPARISON (Validation Set, Threshold=0.5)

Metric	Exp 1A(ImageNet)	Exp 1B(Aggressive Aug)	Exp 2A(20% Fine-tune)	Exp 2B(35% Fine-tune)
AUC	0.6687	0.6641	0.6736	0.6715
Accuracy	0.3619	0.9103	0.8375	0.6980
Precision	0.0177	0.0337	0.0340	0.0246
Recall	0.8468	0.2027	0.4009	0.5495
Specificity	0.3552	0.9201	0.8435	0.7001
F1-Score	0.0347	0.0578	0.0627	0.0470

See key Findings from Comparative Analysis in FDA Submission or code and results in Build and train model.ipynb

Experiment Visualizations: Confusion matrices for all experiments at threshold=0.5 showing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Experiment 2A demonstrates superior classification performance with optimal balance between sensitivity and specificity. Matrix colors indicate prediction accuracy, with darker blue showing higher counts.

Quantitative comparison of key performance metrics (AUC-ROC, Accuracy, Precision, Recall, F1-Score) across all experiments at threshold=0.5. Experiment 2A (orange bars) achieves the best overall performance with 67.38% AUC and balanced classification metrics. Chart demonstrates consistent superiority of 2A across all evaluation criteria.

Precision-Recall curves showing performance trade-offs across experiments at different classification thresholds. Average Precision (AP) scores indicate Experiment 2A provides the best precision-recall balance for the severely imbalanced pneumonia dataset (1.2% prevalence). Higher curves indicate better ability to maintain high precision while achieving high recall.

Receiver Operating Characteristic curves comparing all four experiments. Experiment 2A (20% fine-tuning, orange line) achieves the highest AUC of 67.38%, demonstrating optimal balance between feature extraction and fine-tuning depth. The curves show progressive improvement from baseline (1A) to optimized fine-tuning (2A), with diminishing returns at excessive fine-tuning depth (2B).

Sensitivity (True Positive Rate) vs Specificity (True Negative Rate) trade-off analysis across different operating thresholds for all experiments. Each curve shows how the classification threshold affects the balance between detecting pneumonia cases (sensitivity) and correctly identifying non-pneumonia cases (specificity). Critical for determining optimal clinical operating point - higher curves indicate better overall discrimination ability.

Training and validation loss and AUC progression over epochs for all experiments. Experiment 2A shows optimal convergence pattern with minimal overfitting - validation metrics closely track training metrics without significant divergence. Experiments 1B shows early plateauing, while 2B exhibits slight overfitting in later epochs.

Note: Ensemble Models (Weighted Average) did not outperform the best single model (Experiment 2A).

3. Clinical Workflow Integration

DICOM wrapper for clinical deployment: checks modality, body part, patient position, age, and image dimensions
Preprocessing: pixel extraction, normalization, resizing, channel conversion, standardization

Inference Visualizations:

Clinical DICOM inference results on test cases. Baseline Model successfully processed 4 valid chest X-ray DICOM files, demonstrating deployment readiness.

Clinical DICOM inference results on test cases. Exp2A Model successfully processed 4 valid chest X-ray DICOM files, demonstrating deployment readiness. (Exp2A Best).

Inference code and results can be found in the DICOM Inference notebook.

6. FDA Validation Plan

Inclusion: Adult patients 18-80, PA/AP view
Exclusion: Pediatric (<18), prior pneumonia, non-standard views
Target: 1,000 cases, balanced pneumonia/non-pneumonia
Ground truth: Consensus reading by 3 board-certified radiologists
Multi-center acquisition for generalizability

For further details, see FDA_Submission.md.

Based on performance analysis across multiple evaluation metrics, Experiment 2A is recommended for clinical deployment due to:

Superior discrimination ability (highest AUC: 67.38%)
Optimal generalization (minimal overfitting in training curves)
Balanced sensitivity-specificity profile appropriate for screening
Robust performance validated through cross-experiment comparison
Appropriate fine-tuning depth (20% of layers) prevents overfitting while improving features

All inference results, clinical validations, and FDA submission documentation utilize the Experiment 2A model as the primary algorithm.

7. Future Improvement

Potential directions for future improvement include:

Trying more advanced architectures such as EfficientNet-B4 to potentially boost classification performance.
Experimenting with different class weighting strategies to better address class imbalance in the dataset.
Executing focal loss cell to further mitigate the impact of class imbalance and improve model robustness, especially for rare classes.

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.github/workflows		.github/workflows
metrice		metrice
Build and train model.ipynb		Build and train model.ipynb
CODEOWNERS		CODEOWNERS
EDA.ipynb		EDA.ipynb
FDA_Submission.md		FDA_Submission.md
FDA_Submission_Template.md		FDA_Submission_Template.md
Inference.ipynb		Inference.ipynb
README.md		README.md
my_model.json		my_model.json
sample_labels.csv		sample_labels.csv
test1.dcm		test1.dcm
test2.dcm		test2.dcm
test3.dcm		test3.dcm
test4.dcm		test4.dcm
test5.dcm		test5.dcm
test6.dcm		test6.dcm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pneumonia Detection from Chest X-Rays

Project Overview

Pneumonia and X-Rays in the Wild