In this project, you will apply the skills that you have acquired in this 2D medical imaging course to analyze data from the NIH Chest X-ray Dataset and train a CNN to classify a given chest x-ray for the presence or absence of pneumonia. This project will culminate in a model that can predict the presence of pneumonia with human radiologist-level accuracy that can be prepared for submission to the FDA for 510(k) clearance as software as a medical device. As part of the submission preparation, you will formally describe your model, the data that it was trained on, and a validation plan that meets FDA criteria.
You will be provided with the medical images with clinical labels for each image that were extracted from their accompanying radiology reports.
The project will include access to a GPU for fast training of deep learning architecture, as well as access to 112,000 chest x-rays with disease labels acquired from 30,000 patients.
Chest X-ray exams are one of the most frequent and cost-effective types of medical imaging examinations. Deriving clinical diagnoses from chest X-rays can be challenging, however, even by skilled radiologists.
When it comes to pneumonia, chest X-rays are the best available method for diagnosis. More than 1 million adults are hospitalized with pneumonia and around 50,000 die from the disease every year in the US alone. The high prevalence of pneumonia makes it a good candidate for the development of a deep learning application for two reasons: 1) Data availability in a high enough quantity for training deep learning models for image classification 2) Opportunity for clinical aid by providing higher accuracy image reads of a difficult-to-diagnose disease and/or reduce clinical burnout by performing automated reads of very common scans.
The diagnosis of pneumonia from chest X-rays is difficult for several reasons:
- The appearance of pneumonia in a chest X-ray can be very vague depending on the stage of the infection
- Pneumonia often overlaps with other diagnoses
- Pneumonia can mimic benign abnormalities
For these reasons, common methods of diagnostic validation performed in the clinical setting are to obtain sputum cultures to test for the presence of bacteria or viral bodies that cause pneumonia, reading the patient's clinical history and taking their demographic profile into account, and comparing a current image to prior chest X-rays for the same patient if they are available.
The dataset provided to you for this project was curated by the NIH specifically to address the problem of a lack of large x-ray datasets with ground truth labels to be used in the creation of disease detection algorithms.
The data is mounted in the Udacity Jupyter GPU workspace provided to you, along with code to load the data. Alternatively, you can download the data from the kaggle website or official NIH website and run it locally. You are STRONGLY recommended to complete the project using the Udacity workspace since the data is huge, and you will need GPU to accelerate the training process.
There are 112,120 X-ray images with disease labels from 30,805 unique patients in this dataset. The disease labels were created using Natural Language Processing (NLP) to mine the associated radiological reports. The labels include 14 common thoracic pathologies:
- Atelectasis
- Consolidation
- Infiltration
- Pneumothorax
- Edema
- Emphysema
- Fibrosis
- Effusion
- Pneumonia
- Pleural thickening
- Cardiomegaly
- Nodule
- Mass
- Hernia
The biggest limitation of this dataset is that image labels were NLP-extracted so there could be some erroneous labels but the NLP labeling accuracy is estimated to be >90%.
The original radiology reports are not publicly available but you can find more details on the labeling process here.
- 112,120 frontal-view chest X-ray PNG images in 1024*1024 resolution (under images folder)
- Meta data for all images (Data_Entry_2017.csv): Image Index, Finding Labels, Follow-up #, Patient ID, Patient Age, Patient Gender, View Position, Original Image Size and Original Image Pixel Spacing.
This phase included analysis of patient demographics, disease distribution, comorbidities, and pixel-level intensity histograms. Key findings:
- Age and gender distribution
- X-ray view positions (PA/AP)
- Pneumonia and non-pneumonia case counts
- Disease co-occurrence and frequency per patient
- Pixel intensity histograms for healthy and disease states
EDA Visualizations:
Disease distribution across the NIH dataset showing prevalence of all 14 disease categories. Note the severe class imbalance with Infiltration being most common (17.7%) and Hernia least common (0.2%). Pneumonia represents 1.26% of cases.
Top 10 diseases that co-occur with pneumonia. Infiltration (42.3%), Edema (23.8%), and Effusion (18.8%) are the most common co-occurring conditions. This multi-label nature explains why the model may produce false positives for other infiltrative processes.
Patient demographics and pneumonia prevalence across age groups. Top left: Age distribution by gender showing predominance of adult patients. Top right: Gender distribution (56% Male, 44% Female). Bottom left: Pneumonia cases by gender showing similar patterns. Bottom right: Pneumonia prevalence rate by age group, with highest rates in young populations.
Further EDA details and visualizations can be found in the Exploratory Data Analysis notebook.
- Labels extracted from radiology reports using NLP
- Reports authored by board-certified radiologists
- Binary and multi-label annotations
- Limitations: label noise, inter-observer variability, not pathological confirmation
Data Split:
- Patient-level stratified splitting (prevents data leakage)
- No patient overlap between train/validation/test sets
- Stratified by pneumonia status to maintain class balance
Training Dataset: NIH Chest X-ray Dataset (Clinical Center)
- 79,113 images, 21,563 patients
- Pneumonia prevalence: 1.26%
- Age: 1-120 years (mean: 46.9)
- Gender: 56% Male, 44% Female
- View: 60% PA, 40% AP/Lateral
- Multi-label: 18.4% images have >1 disease
Validation Dataset:
- 16,370 images, 4,621 patients
- Pneumonia prevalence: 1.36%
- Age: mean 46.2 years
- View: 60.2% PA
Model Architecture:
- DenseNet121 base (pre-trained on ImageNet)
- Two-stage transfer learning: Stage 1 (feature extraction, frozen base), Stage 2 (top 20% layers unfrozen for fine-tuning)
- Custom head: 3-layer fully connected network with dropout
- Output: Single neuron with sigmoid activation (binary classification)
graph TB
Input["Input Image<br/>224×224×3"] --> DenseNet["DenseNet121 Base<br/>(ImageNet Pre-trained)"]
subgraph DenseNet121["DenseNet121 Architecture (~7M params)"]
Conv["Initial Conv(64) + MaxPool"]
DB1["Dense Block 1<br/>6 layers, 256 features"]
T1["Transition 1<br/>Conv + AvgPool"]
DB2["Dense Block 2<br/>12 layers, 512 features"]
T2["Transition 2<br/>Conv + AvgPool"]
DB3["Dense Block 3<br/>24 layers, 1024 features"]
T3["Transition 3<br/>Conv + AvgPool"]
DB4["Dense Block 4<br/>16 layers, 1024 features"]
Conv --> DB1 --> T1 --> DB2 --> T2 --> DB3 --> T3 --> DB4
end
DenseNet --> |7×7×1024<br/>Feature Maps| GAP["GlobalAveragePooling2D<br/>Output: 1024 features"]
subgraph CustomHead["Custom Classification Head (~2.1M params)"]
GAP --> FC1["Dense(1024, ReLU)"]
FC1 --> Drop1["Dropout(0.5)"]
Drop1 --> FC2["Dense(512, ReLU)"]
FC2 --> Drop2["Dropout(0.3)"]
Drop2 --> Output["Dense(1, Sigmoid)<br/>dtype=float32"]
end
Output --> Prob["Pneumonia Probability<br/>Range: 0-1"]
style Input fill:#e1f5ff
style DenseNet121 fill:#ffe1e1
style CustomHead fill:#e1ffe1
style Prob fill:#f5e1ff
Note1["STAGE 1 (Epochs 1-17):<br/>All DenseNet121 frozen<br/>Train custom head only"]
Note2["STAGE 2 (Epochs 18-32):<br/>Top 20% unfrozen<br/>(Dense Blocks 3-4)<br/>Fine-tune with lr=1e-5"]
style Note1 fill:#fff4e1,stroke:#ff9800
style Note2 fill:#fff4e1,stroke:#ff9800
DenseNet121-based architecture for pneumonia detection.
Training Details:
- Augmentation: Horizontal flip, normalization
- Batch size: 64 (initial), 32 (after OOM)
- Optimizer: Adam
- Class weighting to address severe imbalance (1.2% pneumonia prevalence)
Performance Assessment:
- Metrics: AUC, F1, sensitivity, specificity
- Best validation AUC: 0.6345
- Threshold selection: Sensitivity-optimized (0.40)
![]() |
![]() |
|---|---|
![]() |
![]() |
Training and validation loss, accuracy, and AUC across 32 epochs (Stage 1: epochs 1-17 frozen, Stage 2: epochs 18-32 fine-tuned). Best validation AUC: 0.6345 at epoch 17. Receiver Operating Characteristic curve showing model discrimination ability. AUC: 0.6345 (validation), 0.6213 (test).
Experiments Overview:
- Experiment 1A: Baseline DenseNet121 model, standard preprocessing and training parameters.
- Experiment 1B: DenseNet121 with advanced data augmentation and modified training parameters.
- Experiment 2A: Fine-tuned DenseNet121, selective layer freezing, best overall performance.
- Experiment 2B: Further fine-tuning and hyperparameter optimization for improved robustness.
MODEL PERFORMANCE COMPARISON (Validation Set, Threshold=0.5)
| Metric | Exp 1A(ImageNet) | Exp 1B(Aggressive Aug) | Exp 2A(20% Fine-tune) | Exp 2B(35% Fine-tune) |
|---|---|---|---|---|
| AUC | 0.6687 | 0.6641 | 0.6736 | 0.6715 |
| Accuracy | 0.3619 | 0.9103 | 0.8375 | 0.6980 |
| Precision | 0.0177 | 0.0337 | 0.0340 | 0.0246 |
| Recall | 0.8468 | 0.2027 | 0.4009 | 0.5495 |
| Specificity | 0.3552 | 0.9201 | 0.8435 | 0.7001 |
| F1-Score | 0.0347 | 0.0578 | 0.0627 | 0.0470 |
See key Findings from Comparative Analysis in FDA Submission or code and results in Build and train model.ipynb
Experiment Visualizations:
Confusion matrices for all experiments at threshold=0.5 showing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Experiment 2A demonstrates superior classification performance with optimal balance between sensitivity and specificity. Matrix colors indicate prediction accuracy, with darker blue showing higher counts.
Quantitative comparison of key performance metrics (AUC-ROC, Accuracy, Precision, Recall, F1-Score) across all experiments at threshold=0.5. Experiment 2A (orange bars) achieves the best overall performance with 67.38% AUC and balanced classification metrics. Chart demonstrates consistent superiority of 2A across all evaluation criteria.
Precision-Recall curves showing performance trade-offs across experiments at different classification thresholds. Average Precision (AP) scores indicate Experiment 2A provides the best precision-recall balance for the severely imbalanced pneumonia dataset (1.2% prevalence). Higher curves indicate better ability to maintain high precision while achieving high recall.
Receiver Operating Characteristic curves comparing all four experiments. Experiment 2A (20% fine-tuning, orange line) achieves the highest AUC of 67.38%, demonstrating optimal balance between feature extraction and fine-tuning depth. The curves show progressive improvement from baseline (1A) to optimized fine-tuning (2A), with diminishing returns at excessive fine-tuning depth (2B).
Sensitivity (True Positive Rate) vs Specificity (True Negative Rate) trade-off analysis across different operating thresholds for all experiments. Each curve shows how the classification threshold affects the balance between detecting pneumonia cases (sensitivity) and correctly identifying non-pneumonia cases (specificity). Critical for determining optimal clinical operating point - higher curves indicate better overall discrimination ability.
Training and validation loss and AUC progression over epochs for all experiments. Experiment 2A shows optimal convergence pattern with minimal overfitting - validation metrics closely track training metrics without significant divergence. Experiments 1B shows early plateauing, while 2B exhibits slight overfitting in later epochs.
Note: Ensemble Models (Weighted Average) did not outperform the best single model (Experiment 2A).
- DICOM wrapper for clinical deployment: checks modality, body part, patient position, age, and image dimensions
- Preprocessing: pixel extraction, normalization, resizing, channel conversion, standardization
Inference Visualizations:
Clinical DICOM inference results on test cases. Baseline Model successfully processed 4 valid chest X-ray DICOM files, demonstrating deployment readiness.
Clinical DICOM inference results on test cases. Exp2A Model successfully processed 4 valid chest X-ray DICOM files, demonstrating deployment readiness. (Exp2A Best).
Inference code and results can be found in the DICOM Inference notebook.
- Inclusion: Adult patients 18-80, PA/AP view
- Exclusion: Pediatric (<18), prior pneumonia, non-standard views
- Target: 1,000 cases, balanced pneumonia/non-pneumonia
- Ground truth: Consensus reading by 3 board-certified radiologists
- Multi-center acquisition for generalizability
For further details, see FDA_Submission.md.
Based on performance analysis across multiple evaluation metrics, Experiment 2A is recommended for clinical deployment due to:
- Superior discrimination ability (highest AUC: 67.38%)
- Optimal generalization (minimal overfitting in training curves)
- Balanced sensitivity-specificity profile appropriate for screening
- Robust performance validated through cross-experiment comparison
- Appropriate fine-tuning depth (20% of layers) prevents overfitting while improving features
All inference results, clinical validations, and FDA submission documentation utilize the Experiment 2A model as the primary algorithm.
Potential directions for future improvement include:
- Trying more advanced architectures such as EfficientNet-B4 to potentially boost classification performance.
- Experimenting with different class weighting strategies to better address class imbalance in the dataset.
- Executing focal loss cell to further mitigate the impact of class imbalance and improve model robustness, especially for rare classes.







