This project presents a high-accuracy solution for handwritten letter and character recognition using the EMNIST dataset. It leverages a pre-trained Convolutional Neural Network (CNN), fine-tunes it with advanced augmentation and regularization techniques, and employs a systematic approach to overcome the challenges inherent in the dataset.
- High Accuracy: Achieves 88.61% on the imbalanced EMNIST ByClass split and 91.06% on the EMNIST Balanced split, performing close to or at state-of-the-art benchmarks.
- Transfer Learning: Utilizes pre-trained `EfficientNet-B2` and `EfficientNet-B3` models, modified for grayscale input and the specific EMNIST class structure.
- Advanced Augmentation: Employs `MixUp` and `CutMix` to effectively regularize the model, significantly reducing overfitting and improving generalization.
- Class Imbalance Handling: Implements a custom-weighted `KLDivLoss` function to address the severe class imbalance in the ByClass dataset, ensuring fair training across all characters.
- Systematic Optimization: Uses Weights & Biases for experiment tracking and systematically determines the best optimizer (`Lion`), learning rate scheduler (`CosineAnnealingWarmRestarts`), and hyperparameters.
This project uses the EMNIST (Extended MNIST) dataset, which is a large collection of handwritten characters and digits.
- Structure: The images are reformatted into a $28 \times 28$ grayscale format, similar to the original MNIST dataset.
- Dataset Splits Used:
  - ByClass: 62 classes, highly imbalanced. Contains 697,932 training images and 116,323 test images.
  - Balanced: 47 balanced classes.
- Challenges:
  - Class Imbalance: In the ByClass split, the most frequent class appears over 17 times more often than the least frequent one (33,374 vs. 1,896 samples).
  - Data Quality: The dataset contains mislabeled and pre-augmented images (e.g., rotated by 90 degrees), which complicates classification.
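Inverse-frequency class weights are one standard way to quantify and counter this imbalance; a minimal sketch (the exact weighting scheme used by the project is not shown here, and the normalization is an assumption):

```python
import numpy as np

def inverse_frequency_weights(counts):
    """Weight each class inversely to its frequency.

    Normalized so that a perfectly balanced dataset yields all-ones
    weights (assumed normalization; the report does not specify one).
    """
    counts = np.asarray(counts, dtype=np.float64)
    return counts.sum() / (len(counts) * counts)

# ByClass extremes from the report: 33,374 vs. 1,896 samples
print(33374 / 1896)  # imbalance ratio, ~17.6x
```

Rare classes receive proportionally larger weights, so they contribute comparably to the loss despite appearing far less often.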
The core of this project is a modified EfficientNet model. While several architectures were tested, EfficientNet-B2 and EfficientNet-B3 provided the best balance of performance and computational efficiency.
The pre-trained model was adapted for this task with two key modifications:
- The first convolutional layer was changed to accept 1-channel grayscale images instead of the standard 3-channel RGB input.
- The final classification layer was replaced with a new one tailored for the 62 classes of the EMNIST ByClass dataset.
- Image Resizing: Original $28 \times 28$ images were resized to $112 \times 112$. This resolution was experimentally determined to offer the best trade-off between feature extraction quality and computational load.
- Orientation Fix: A custom transform was applied to correct the orientation of raw EMNIST images, which are stored rotated.
- Normalization: The dataset's mean and standard deviation were recalculated after resizing and applied to all images.
- Augmentation for Regularization: Overfitting was a significant challenge. While initial attempts included dropout and standard augmentations, the most effective strategy was a combination of `MixUp` and `CutMix`. This approach proved so effective that other augmentations were no longer necessary.
- Framework: The model was built and trained using PyTorch.
- Optimizer: After experimenting with Adam, AdamW, and SGD, the `Lion` optimizer was found to deliver the best results.
- Loss Function: To counter class imbalance in the ByClass split, a `KLDivLoss` function was used with pre-computed class weights. This ensures that the model does not become biased towards more frequent classes.
- Learning Rate Scheduler: A warmup schedule was implemented using `SequentialLR`, which transitions to a `CosineAnnealingWarmRestarts` scheduler after 5 epochs. This stabilized initial training and helped convergence.
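A sketch of how these pieces fit together. `WeightedKLDivLoss` is one plausible reading of "custom-weighted `KLDivLoss`" (the exact weighting in the project is not shown), and since Lion is not in torch core (it ships in packages such as `lion-pytorch`), AdamW stands in to keep the sketch self-contained:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import (
    SequentialLR, LinearLR, CosineAnnealingWarmRestarts,
)

class WeightedKLDivLoss(nn.Module):
    """KL-divergence loss with per-class weights (assumed scheme)."""
    def __init__(self, class_weights: torch.Tensor):
        super().__init__()
        self.register_buffer("class_weights", class_weights)
        self.kl = nn.KLDivLoss(reduction="none")

    def forward(self, log_probs, target_probs):
        per_sample = self.kl(log_probs, target_probs).sum(dim=1)
        # Weight each sample by the weights of its (soft) target classes,
        # so rare classes contribute more to the gradient.
        sample_w = (target_probs * self.class_weights).sum(dim=1)
        return (per_sample * sample_w).mean()

model = nn.Linear(10, 62)  # stand-in for the EfficientNet model

# from lion_pytorch import Lion
# optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Linear warmup for 5 epochs, then cosine annealing with warm restarts.
# start_factor and T_0 are assumed values, stepped once per epoch.
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = CosineAnnealingWarmRestarts(optimizer, T_0=10)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])
```

With soft (one-hot or MixUp-mixed) targets, the sample weight reduces to the weight of the dominant class, which is what counteracts the ByClass imbalance.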
The model matched or exceeded established benchmarks.
| Dataset Split | Validation Accuracy | Benchmark | F1 Score |
|---|---|---|---|
| EMNIST ByClass | 88.61% | 88.43% | 87.59% |
| EMNIST Balanced | 91.06% | 91.06% | 90.98% |