This document describes the processing pipeline for the UniToChest dataset, a publicly available chest CT resource comprising 715 scans from 623 patients with expert-annotated lung nodule segmentation masks.
Dataset Name: UniToChest
Source: Zenodo - Publicly available chest CT dataset
Reference: Chaudhry et al. (2022) - ICIAP 2022
Total Scans: 715 CT scans
Total Patients: 623 unique patients
Original Data Source: https://zenodo.org/records/5797912
Pre-processed Data Repository: Available on Zenodo at https://zenodo.org/uploads/18285682
Metadata & Annotations: All generated CSV files saved in local directory
- Training Set: 579 scans from 501 patients (80.0%)
- Validation Set: 66 scans from 62 patients (9.9%)
- Test Set: 70 scans from 63 patients (10.1%)
Purpose: Convert raw DICOM CT scans and PNG segmentation masks to unified NIfTI format
Process:
- Reads patient metadata from dataset CSV files (train_dataset.csv, val_dataset.csv, test_dataset.csv)
- Groups DICOM slices by patient ID and exam number
- Sorts slices by slice index for proper 3D volume reconstruction
- Uses pydicom to read DICOM CT slices
- Converts PNG masks to binary segmentation volumes
- Uses SimpleITK to create 3D volumes with proper spacing information
- Extracts spacing metadata from DICOM headers:
  - In-plane spacing from PixelSpacing
  - Through-plane spacing from SliceThickness
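The grouping and sorting steps above can be sketched in plain Python. This is a minimal illustration, not the code in the conversion scripts: the record keys (`patient_id`, `exam_id`, `slice_index`, `path`) and the sample rows are hypothetical stand-ins for whatever columns the dataset CSV files actually provide.

```python
from collections import defaultdict


def group_and_sort_slices(records):
    """Group slice records by (patient_id, exam_id), then sort each group by slice index."""
    volumes = defaultdict(list)
    for rec in records:
        volumes[(rec["patient_id"], rec["exam_id"])].append(rec)
    # Sorting by slice index gives the order in which slices are stacked into a 3D volume.
    return {
        key: sorted(slices, key=lambda r: r["slice_index"])
        for key, slices in volumes.items()
    }


# Hypothetical rows, standing in for entries read from a dataset CSV file.
records = [
    {"patient_id": "12", "exam_id": "1", "slice_index": 2, "path": "b.dcm"},
    {"patient_id": "12", "exam_id": "1", "slice_index": 0, "path": "a.dcm"},
    {"patient_id": "12", "exam_id": "2", "slice_index": 0, "path": "c.dcm"},
]
grouped = group_and_sort_slices(records)
ordered_paths = [r["path"] for r in grouped[("12", "1")]]
```

Each key of `grouped` corresponds to one output volume, so the number of keys should match the number of NIfTI files written for that split.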
Key Scripts:
- trainDataset_nifti.py - Processes training set
- valDataset_nifti.py - Processes validation set
- testDataset_nifti.py - Processes test set
- Demo_Dicom_to_CT-HAID.ipynb - Demonstration notebook
Outputs:
- CT images: unitochest_nifti_ct/unitochestPT{PatientID}_exam{ExamID}_0000.nii.gz
- Segmentation masks: unitochest_nifti_mask/unitochestPT{PatientID}_exam{ExamID}_mask.nii.gz
File Naming Convention:
unitochestPT{patient_id}_exam{exam_id}_0000.nii.gz # CT scan
unitochestPT{patient_id}_exam{exam_id}_mask.nii.gz # Segmentation mask
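As an illustration of the naming convention, the helpers below build and parse file names. The function names and the regular expression are assumptions made for this sketch; they are not taken from the conversion scripts.

```python
import re

# Pattern for the convention above; "kind" is "0000" for a CT scan and "mask" for a mask.
_NAME_RE = re.compile(
    r"^unitochestPT(?P<patient_id>.+?)_exam(?P<exam_id>.+)_(?P<kind>0000|mask)\.nii\.gz$"
)


def build_names(patient_id, exam_id):
    """Return the (CT file name, mask file name) pair for one exam."""
    stem = f"unitochestPT{patient_id}_exam{exam_id}"
    return f"{stem}_0000.nii.gz", f"{stem}_mask.nii.gz"


def parse_name(filename):
    """Split a file name into (patient_id, exam_id, kind); raise ValueError if it does not match."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a UniToChest NIfTI file name: {filename!r}")
    return match.group("patient_id"), match.group("exam_id"), match.group("kind")


ct_name, mask_name = build_names("12", "1")
parsed = parse_name(mask_name)
```

Parsing the name back into its parts is a convenient way to pair each CT scan with its mask when the two live in separate directories.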
Purpose: Standardize image spacing across the entire dataset for consistent downstream processing
Target Spacing: [0.703125, 0.703125, 1.25] mm (X, Y, Z)
Resampling Implementation:
#!/bin/bash
docker exec -it ft42_g6xai bash -c "python resample_images.py \
/path/to/unitochest_nifti_ct \
/path/to/unitochest_nifti_ct_resampled \
--spacing 0.703125 0.703125 1.25 \
--start 0 --end 715 \
--extension .nii.gz"
Interpolation: B-spline (preserves intensity gradients and Hounsfield Units)
#!/bin/bash
docker exec -it ft42_g6xai bash -c "python resample_images.py \
/path/to/unitochest_nifti_mask \
/path/to/unitochest_nifti_mask_resampled \
--spacing 0.703125 0.703125 1.25 \
--start 0 --end 715 \
--is_label \
--extension .nii.gz"
Interpolation: Nearest-neighbor (preserves discrete label values)
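The reason masks get a different interpolator can be seen in a one-dimensional toy example. The sketch below is not the resampling script: it uses linear interpolation as a simpler stand-in for B-spline, and the function `resample_1d` is hypothetical. The point it illustrates is that any intensity-blending interpolator can produce values between labels, while nearest-neighbor only ever copies existing labels.

```python
def resample_1d(values, in_spacing, out_spacing, mode):
    """Resample a 1D signal to a new spacing using 'nearest' or 'linear' interpolation."""
    n_out = int(round(len(values) * in_spacing / out_spacing))
    last = len(values) - 1
    out = []
    for i in range(n_out):
        # Position of output sample i, expressed in input index units (clamped to the signal).
        pos = min(i * out_spacing / in_spacing, last)
        if mode == "nearest":
            out.append(values[int(round(pos))])
        else:
            lo = int(pos)
            hi = min(lo + 1, last)
            frac = pos - lo
            out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


labels = [0, 0, 1, 1, 0]  # a binary mask profile along one axis
nearest = resample_1d(labels, 1.0, 0.5, "nearest")
linear = resample_1d(labels, 1.0, 0.5, "linear")
has_fractional = any(v not in (0, 1) for v in linear)
```

With the blended result, a mask would need re-thresholding after resampling; nearest-neighbor avoids that step at the cost of blockier boundaries.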
Resampling Algorithm (resample_images.py):
- Uses SimpleITK ResampleImageFilter
- Calculates output size based on spacing ratio:
out_size = [ int(round(original_size[i] * (original_spacing[i] / out_spacing[i]))) for i in range(3) ]
- Preserves spatial metadata (origin, direction cosines)
- Supports batch processing with start/end indices
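The output-size rule quoted above can be checked in isolation. The function below wraps the same expression; the input size and spacing in the example are made-up values chosen only to show the arithmetic.

```python
TARGET_SPACING = [0.703125, 0.703125, 1.25]  # mm, (X, Y, Z)


def compute_output_size(original_size, original_spacing, out_spacing):
    """Number of voxels per axis needed to cover the same physical extent at the new spacing."""
    return [
        int(round(original_size[i] * (original_spacing[i] / out_spacing[i])))
        for i in range(3)
    ]


# Hypothetical scan: 512 x 512 x 300 voxels at 0.8 x 0.8 x 1.0 mm.
size = compute_output_size([512, 512, 300], [0.8, 0.8, 1.0], TARGET_SPACING)

# A scan that is already at the target spacing keeps its size.
unchanged = compute_output_size([512, 512, 100], TARGET_SPACING, TARGET_SPACING)
```

Because the physical extent (size × spacing) is held roughly constant, finer target spacing yields more voxels along that axis and coarser spacing yields fewer.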
Outputs:
- Resampled CT scans: unitochest_nifti_ct_resampled/
- Resampled masks: unitochest_nifti_mask_resampled/
Purpose: Extract 3D bounding boxes from segmentation masks for object detection tasks
Process:
- Identifies all non-zero voxels in segmentation masks
- Computes min/max coordinates in voxel space (x, y, z)
- Converts center coordinates to world coordinate system (mm) using SimpleITK:
center_world = image.TransformIndexToPhysicalPoint(center_voxel)
- Calculates bounding box dimensions in mm:
- Width (w), Height (h), Depth (d)
- Stores patient ID and file paths for reference
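The extraction steps above can be sketched without any imaging library. This is an illustration under stated assumptions, not the pipeline code: the voxel-to-world mapping is written out as origin + direction × (index × spacing), which is the standard image geometry that `TransformIndexToPhysicalPoint` applies, and the box extent is assumed to count both end voxels (max − min + 1), which the source does not specify. Note that `TransformIndexToPhysicalPoint` takes integer indices; a fractional center would need `TransformContinuousIndexToPhysicalPoint` or rounding.

```python
def bounding_box_from_voxels(voxels, spacing, origin, direction):
    """Return (center in world mm, [w, h, d] in mm) for a list of (x, y, z) mask voxel indices."""
    mins = [min(v[a] for v in voxels) for a in range(3)]
    maxs = [max(v[a] for v in voxels) for a in range(3)]
    center_voxel = [(mins[a] + maxs[a]) / 2.0 for a in range(3)]

    # World point = origin + direction @ (index * spacing).
    scaled = [center_voxel[a] * spacing[a] for a in range(3)]
    center_world = [
        origin[r] + sum(direction[r][c] * scaled[c] for c in range(3))
        for r in range(3)
    ]

    # Assumption: the extent includes both the first and the last voxel.
    dims_mm = [(maxs[a] - mins[a] + 1) * spacing[a] for a in range(3)]
    return center_world, dims_mm


# Hypothetical non-zero voxels and image geometry (identity direction cosines).
voxels = [(10, 20, 5), (12, 24, 5), (11, 22, 7)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
center_world, dims_mm = bounding_box_from_voxels(
    voxels, [0.703125, 0.703125, 1.25], [-100.0, -50.0, 0.0], identity
)
```

A mask containing several separate nodules would need connected-component labeling first so that each nodule gets its own box; that step is outside this sketch.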
Nodule Diameter Calculation: The diameter is computed as the 3D Euclidean diagonal of the bounding box:
diameter_mm = sqrt(w² + h² + d²)
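A direct transcription of this formula, with a hypothetical function name, looks like this:

```python
import math


def diagonal_diameter_mm(w, h, d):
    """3D Euclidean diagonal of a bounding box with side lengths w, h, d (mm)."""
    return math.sqrt(w ** 2 + h ** 2 + d ** 2)


example = diagonal_diameter_mm(3.0, 4.0, 12.0)
```

The diagonal is never smaller than the longest side of the box, so this measure is an upper bound on the extent along any single axis.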
Outputs:
- File: filtered_unitochest_bounding_boxes_annotations.csv
- Columns:
  - Patient: Patient ID with exam number
  - ct_path: Path to CT image
  - mask_path: Path to segmentation mask
  - coordX, coordY, coordZ: Center coordinates in world space (mm)
  - w, h, d: Bounding box dimensions (mm)
  - diameter_mm: 3D nodule diameter (mm)
Purpose: Merge imaging metadata with bounding box annotations for comprehensive dataset characterization
DICOM Metadata Extracted:
- Patient Demographics:
  - PatientID: Unique patient identifier
  - PatientAge: Age in years
  - PatientSex: Gender (F/M)
- Acquisition Parameters:
  - Manufacturer: Scanner vendor (GE MEDICAL SYSTEMS, Philips)
  - ConvolutionKernel: Reconstruction kernel (STANDARD, LUNG, etc.)
  - SliceThickness: Through-plane resolution (mm)
  - PixelSpacing: In-plane resolution (mm)
  - KVP: Tube voltage (kV)
  - PatientPosition: Patient positioning (HFS, FFS)
Processing (see DataAnalysis.ipynb):
- Loads train/val/test split CSV files with DICOM metadata
- Assigns split labels (Train, Validation, Test)
- Combines all splits into unified dataframe
- Extracts numeric age values from DICOM age strings
- Merges with bounding box annotations
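The age-extraction step can be sketched as follows. DICOM Age String values have the form of three digits followed by a unit letter (D, W, M, or Y), for example 067Y. The function name and the choice to return None for unparseable values are assumptions of this sketch, not behavior confirmed for DataAnalysis.ipynb.

```python
import re

_AGE_RE = re.compile(r"^(\d{3})([DWMY])$")
# How many of each unit make up one year (approximate for days and weeks).
_UNITS_PER_YEAR = {"D": 365.25, "W": 365.25 / 7, "M": 12.0, "Y": 1.0}


def age_in_years(age_string):
    """Convert a DICOM age string such as '067Y' to years; return None if it cannot be parsed."""
    if age_string is None:
        return None
    match = _AGE_RE.match(str(age_string).strip().upper())
    if match is None:
        return None
    return int(match.group(1)) / _UNITS_PER_YEAR[match.group(2)]


years = age_in_years("067Y")
half_year = age_in_years("006M")
missing = age_in_years("unknown")
```

Returning None rather than raising keeps rows with missing or malformed ages in the merged table, where they can be excluded from the age statistics explicitly.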
Outputs:
- unitochest_train_dataset_dicom_metadata.csv
- unitochest_val_dataset_dicom_metadata.csv
- unitochest_test_dataset_dicom_metadata.csv
- filtered_unitochest_bounding_boxes_annotations_metadata.csv
Purpose: Generate comprehensive statistics for publication and dataset documentation
Statistics Computed:
- Patient & Scan Counts: Unique patients and total scans per split
- Age Statistics: Mean ± SD, age range (min-max)
- Gender Distribution: Female/Male counts and percentages
- Scanner Manufacturers: GE MEDICAL SYSTEMS vs Philips distribution
- Reconstruction Kernels: STANDARD, LUNG, and other kernel frequencies
- Patient Positioning: HFS (Head-First Supine) vs FFS (Feet-First Supine)
Visualizations Created:
- Age Distribution: Histogram with KDE overlay by split
- Slice Thickness: Violin plots with scatter overlay
- Manufacturer Distribution: Count plot by split
Key Findings (Table S2):
- Mean patient age: 67.0 ± 10.2 years
- Gender: 40.0% Female, 60.0% Male
- Scanners: 76.2% GE MEDICAL SYSTEMS, 23.8% Philips
- Kernels: 49.8% STANDARD, 21.5% LUNG
- Positioning: 63.9% HFS, 36.1% FFS
Output Files:
- Summary tables printed in notebook
- Visualization: unitochest_clinical_subplots.png
- Individual plots: age_distribution_by_split.png, slice_thickness_violin.png, etc.
Purpose: Analyze derived 3D bounding box annotations and nodule size distributions
Nodule Statistics Computed (Table S3):
- Total Annotations: 8,321 nodules across all splits
- Training: 7,260 nodules
- Validation: 366 nodules
- Test: 695 nodules
- Diameter Statistics:
- Overall: 21.4 ± 20.2 mm (mean ± SD), median 16.42 mm
- Training: 21.2 ± 19.7 mm, median 16.55 mm
- Validation: 25.0 ± 24.6 mm, median 17.0 mm
- Test: 22.1 ± 22.3 mm, median 15.44 mm
Visualizations:
- Violin Plots: Distribution of nodule diameters by split with overlaid scatter points
- KDE Curves: Kernel density estimation showing normalized size distributions
Output Files:
- nodule_diameter_distribution_by_split.png
- nodule_diameter_kde_by_split.png
Purpose: Create annotation-level dataset for nodule classification and diagnostic tasks
Process:
- Assigns unique annotation IDs per nodule: {PatientID}_exam{ExamID}_{AnnotationIndex}
- Creates UNIQUE_ANNOTATION_ID column for each bounding box
- Formats for patch extraction and classification tasks
- Links spatial coordinates with clinical metadata
Outputs:
- filtered_unitochest_bounding_boxes_annotations_metadata_nifty_name_lesionIdX.csv
- Includes columns:
  - UNIQUE_ANNOTATION_ID: Unique identifier
  - UNIQUE_ANNOTATION_ID_nifti: NIfTI filename format
  - annotation-idx: Sequential annotation number per patient
  - Spatial coordinates and dimensions
  - Split information
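One way the annotation IDs could be generated is sketched below. Two details are assumptions rather than facts from the source: the index is counted per (patient, exam) pair, and it starts at 0. The row keys `patient_id` and `exam_id` are likewise hypothetical.

```python
from collections import defaultdict


def assign_annotation_ids(rows):
    """Add 'annotation-idx' and 'UNIQUE_ANNOTATION_ID' to each bounding-box row."""
    counters = defaultdict(int)
    labeled = []
    for row in rows:
        key = (row["patient_id"], row["exam_id"])
        idx = counters[key]  # assumption: zero-based, counted per patient and exam
        counters[key] += 1
        labeled.append({
            **row,
            "annotation-idx": idx,
            "UNIQUE_ANNOTATION_ID": f"{row['patient_id']}_exam{row['exam_id']}_{idx}",
        })
    return labeled


# Hypothetical bounding-box rows: two nodules in one exam, one in another.
rows = [
    {"patient_id": "12", "exam_id": "1"},
    {"patient_id": "12", "exam_id": "1"},
    {"patient_id": "12", "exam_id": "2"},
]
ids = [r["UNIQUE_ANNOTATION_ID"] for r in assign_annotation_ids(rows)]
```

Whatever the indexing convention, the IDs should be unique across the whole table, which is easy to verify by comparing the number of distinct IDs with the number of rows.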
| Output Type | File Name | Location | Description |
|---|---|---|---|
| Converted CT Images | unitochestPT{ID}_exam{E}_0000.nii.gz | unitochest_nifti_ct/ | CT scans in NIfTI format |
| Segmentation Masks | unitochestPT{ID}_exam{E}_mask.nii.gz | unitochest_nifti_mask/ | Binary lung nodule masks |
| Resampled CT | Same naming | unitochest_nifti_ct_resampled/ | CT at 0.703125 × 0.703125 × 1.25 mm |
| Resampled Masks | Same naming | unitochest_nifti_mask_resampled/ | Resampled segmentation masks |
| Bounding Boxes | filtered_unitochest_bounding_boxes_annotations.csv | Working directory | 3D bounding box annotations |
| DICOM Metadata (Train) | unitochest_train_dataset_dicom_metadata.csv | Working directory | Training set metadata |
| DICOM Metadata (Val) | unitochest_val_dataset_dicom_metadata.csv | Working directory | Validation set metadata |
| DICOM Metadata (Test) | unitochest_test_dataset_dicom_metadata.csv | Working directory | Test set metadata |
| Merged Annotations | filtered_unitochest_bounding_boxes_annotations_metadata.csv | Working directory | Bounding boxes + DICOM metadata |
| Diagnostic Dataset | filtered_unitochest_bounding_boxes_annotations_metadata_nifty_name_lesionIdX.csv | Working directory | Annotation-level classification dataset |
- Total Patients: 623 unique patients
- Total CT Scans: 715
- Total Nodule Annotations: 8,321 expert-annotated nodules
- Image Modality: CT (Computed Tomography)
- Segmentation Type: Expert manual annotations
- Annotation Density: ~11.6 nodules per scan on average
- Original Spacing: Variable across patients
- Resampled Spacing: 0.703125 × 0.703125 × 1.25 mm³
- Image Format: NIfTI (.nii.gz)
- Intensity Unit: Hounsfield Units (HU)
- Age Range: Variable (see Table S2)
- Gender Distribution: 40% Female, 60% Male
- Scanner Manufacturers:
- GE MEDICAL SYSTEMS (76.2%)
- Philips (23.8%)
- Reconstruction Kernels:
- STANDARD (49.8%)
- LUNG (21.5%)
- Others
- Patient Positioning:
- HFS - Head-First Supine (63.9%)
- FFS - Feet-First Supine (36.1%)
- Training: 579 scans, 501 patients, 7,260 nodules (80.0%)
- Validation: 66 scans, 62 patients, 366 nodules (9.9%)
- Test: 70 scans, 63 patients, 695 nodules (10.1%)
- Split Method: Pre-defined by dataset creators
- Overall Diameter: 21.4 ± 20.2 mm (mean ± SD)
- Median Diameter: 16.42 mm
- Range: Varies from small to large nodules
- Distribution: Right-skewed (more small nodules)
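The headline figures above can be cross-checked against the per-split counts with a few lines of arithmetic. The numbers below are copied from this document; nothing is read from the data files. Scans and nodules add up to the stated totals, and the annotation density follows from them. The per-split patient counts add up to 626 rather than 623; the source does not explain the difference, so the sketch only reports it.

```python
splits = {
    "Training": {"scans": 579, "patients": 501, "nodules": 7260},
    "Validation": {"scans": 66, "patients": 62, "nodules": 366},
    "Test": {"scans": 70, "patients": 63, "nodules": 695},
}

total_scans = sum(s["scans"] for s in splits.values())
total_nodules = sum(s["nodules"] for s in splits.values())
patients_summed = sum(s["patients"] for s in splits.values())

# Average number of annotated nodules per scan.
density = total_nodules / total_scans

# Difference between the summed per-split patient counts and the stated 623 unique patients.
patient_gap = patients_summed - 623
```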
# Medical Image I/O
SimpleITK >= 2.0.0 # Medical image reading/writing/processing
nibabel >= 3.2.0 # NIfTI file handling
pydicom >= 2.0.0 # DICOM file handling
# Data Processing
pandas >= 1.3.0 # DataFrame operations
numpy >= 1.20.0 # Numerical operations
Pillow >= 8.0.0 # PNG mask loading
# Visualization
matplotlib >= 3.4.0 # Plotting
seaborn >= 0.11.0 # Statistical visualizations
# Utilities
tabulate # Table formatting
pip install SimpleITK nibabel pydicom pandas numpy Pillow matplotlib seaborn tabulate
The resampling scripts use Docker for reproducible execution:
docker exec -it ft42_g6xai bash -c "python resample_images.py ..."
If you use this dataset or processing pipeline, please cite:
Original UniToChest Dataset:
Chaudhry, H. A. H. et al. (2022). "Unitochest: A lung image dataset for segmentation of cancerous nodules on CT scans." In International Conference on Image Analysis and Processing (ICIAP 2022), pp. 185-196. Springer.
Dataset available at: https://zenodo.org/records/5797912
Pre-processed and resampled CT images with 3D bounding box annotations are available at: Zenodo Repository: https://zenodo.org/uploads/18285682
This dataset is publicly available for research purposes. Users must:
- Acknowledge the dataset creators
- Follow ethical guidelines for medical data usage
- Not attempt to re-identify patients
- All data is de-identified
- Original study approved by institutional review board
- No protected health information (PHI) included
Processing Date: January 2026
Processed By: Fakrul Islam Tushat
Processing Scripts Version: 1.0
Processing Environment: Python 3.8+, Docker, Linux
- Original Dataset: https://zenodo.org/records/5797912
- Original Publication: Chaudhry et al., ICIAP 2022
- Pre-processed Data: https://zenodo.org/uploads/18285682
- Processing Notebooks:
  - Demo_Dicom_to_CT-HAID.ipynb - DICOM to NIfTI conversion demo
  - DataAnalysis.ipynb - Statistical analysis and visualization
UnitoChest/
├── Demo_Dicom_to_CT-HAID.ipynb # DICOM to NIfTI conversion demo
├── DataAnalysis.ipynb # Statistical analysis & visualization
├── trainDataset_nifti.py # Training set conversion script
├── valDataset_nifti.py # Validation set conversion script
├── testDataset_nifti.py # Test set conversion script
├── resample_images.py # Resampling utility script
├── resampled_ct.sh # Bash script for CT resampling
├── resampled_mask.sh # Bash script for mask resampling
├── unitochest_nifti_ct/ # Converted CT images
├── unitochest_nifti_mask/ # Converted segmentation masks
├── filtered_unitochest_bounding_boxes_annotations.csv # Bounding box annotations
├── unitochest_train_dataset_dicom_metadata.csv # Training metadata
├── unitochest_val_dataset_dicom_metadata.csv # Validation metadata
├── unitochest_test_dataset_dicom_metadata.csv # Test metadata
├── filtered_unitochest_bounding_boxes_annotations_metadata.csv # Merged annotations
└── UNITOCHEST_PROCESSING_DOCUMENTATION.md # This documentation
Citation:
If you use this converted dataset, please cite:
- Original UniToChest dataset: Chaudhry, H. A. H. et al. (2022). "Unitochest: A lung image dataset for segmentation of cancerous nodules on CT scans." In International Conference on Image Analysis and Processing (ICIAP 2022), pp. 185-196. Springer. DOI/URL: https://zenodo.org/records/5797912
- This converted NIfTI version (optional but appreciated): [Your Name/Lab] (2026). UniToChest Dataset: Converted NIfTI CT Scans with Expert-Annotated Lung Nodule Segmentation Masks. Zenodo. https://zenodo.org/uploads/18285682
Related Links:
- Original UniToChest dataset: https://zenodo.org/records/5797912
- Original publication: Chaudhry, H. A. H. et al., ICIAP 2022
- Processing code & documentation: [Your GitHub repository]
- Resampled version with 3D bounding boxes: https://zenodo.org/uploads/18285682
License:
This converted dataset inherits the license from the original UniToChest collection. All data is de-identified for research use.
Last Updated: January 17, 2026