A complete perception pipeline — detection, tracking, segmentation, and deployment — trained on KITTI and BDD10K, optimized with TensorRT, and deployed in C++.
This project implements a production-grade autonomous driving perception system covering four core tasks:
- Object Detection — YOLOv8-l/m on KITTI (4 classes), scratch vs fine-tuning comparison
- Multi-Object Tracking — ByteTrack with Kalman filter on KITTI tracking sequences
- Semantic Segmentation — UNet, DeepLabV3+, SegFormer on BDD10K (19 classes)
- Deployment & Optimization — PyTorch → ONNX → TensorRT FP16, standalone C++ inference app
| Phase | Task | Best Result | Model |
|---|---|---|---|
| 1 | Object Detection | 95.70% mAP@50 | YOLOv8-l (KITTI, 4 classes) |
| 2 | Multi-Object Tracking | 84.0% MOTA, 84.9% IDF1 | ByteTrack @ 60.2 FPS |
| 3 | Semantic Segmentation | 64.49% mIoU | SegFormer MiT-B3 (BDD10K, 19 classes) |
| 4 | TensorRT Deployment | 3.2× speedup, C++ @ 91.3 FPS | YOLOv8-l FP16, zero accuracy loss |
```mermaid
flowchart LR
subgraph datasets [Datasets]
KITTI["KITTI\n7.5K images\n21 sequences"]
BDD10K["BDD10K\n7K train / 1K val\n19 classes"]
end
subgraph training [Training — Python / PyTorch]
Det["Phase 1\nDetection\nYOLOv8-l/m"]
Track["Phase 2\nTracking\nByteTrack"]
Seg["Phase 3\nSegmentation\nUNet / DLV3+ / SegFormer"]
end
subgraph deploy [Deployment — Phase 4]
ONNX["ONNX Export"]
TRT["TensorRT FP16\n3.2x speedup"]
CPP["C++ Application\n91.3 FPS E2E"]
end
KITTI --> Det
KITTI --> Track
BDD10K --> Seg
Det --> Track
Det --> ONNX
Seg --> ONNX
ONNX --> TRT
TRT --> CPP
```
| Phase | Status | Documentation |
|---|---|---|
| Phase 1: Detection | Complete | KITTI Report, BDD100K Report |
| Phase 2: Tracking | Complete | Tracking Report |
| Phase 3: Segmentation | Complete | Segmentation Report |
| Phase 4: Deployment | Complete | Deployment Report |
Five models were trained to compare training from scratch against fine-tuning from BDD100K weights.
| Model | Training | mAP@50 | mAP@50-95 | Epochs | Time |
|---|---|---|---|---|---|
| YOLOv8-l | From scratch | 95.70% | 80.60% | 100 | 4.2 hrs |
| YOLOv8-l | Fine-tuned (BDD→KITTI) | 95.28% | 78.91% | 100 | 2.6 hrs |
| YOLOv8-m | From scratch | 95.57% | 79.76% | 100 | 3.0 hrs |
| YOLOv8-l | Fine-tuned (BDD→KITTI) | 94.85% | 76.64% | 30 | 48 min |
| YOLOv8-m | Fine-tuned (BDD→KITTI) | 94.38% | 75.30% | 30 | 35 min |
Key finding: Training from scratch outperformed fine-tuning by +1.7pp mAP@50-95, even though fine-tuning finished in 38% less time (2.6 vs 4.2 hrs) — revealing the BDD100K→KITTI domain gap.
Classes: Car, Truck, Pedestrian, Cyclist (4 classes) — see KITTI report for per-class breakdown and convergence analysis.
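For reference, the fine-tuning recipe in the table maps directly onto the Ultralytics API. A minimal sketch, mirroring the Quickstart command below (the checkpoint path is the one used there; treat the snippet as illustrative rather than the project's training script):

```python
from ultralytics import YOLO

# Start from the BDD100K checkpoint instead of random initialization
model = YOLO("models/detection/yolov8l_bdd100k_1280/weights/best.pt")

# Fine-tune on KITTI: freeze the first 10 layers and switch to AdamW
# with a reduced initial learning rate (the scratch runs used the
# default SGD schedule instead)
model.train(
    data="data/kitti/processed/dataset.yaml",
    epochs=100,
    batch=16,
    imgsz=960,
    optimizer="AdamW",
    lr0=0.001,
    freeze=10,
    name="yolov8l_kitti_finetune",
)
```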
A ByteTrack tracker with a Kalman filter, evaluated on 21 KITTI tracking sequences (853 ground-truth tracks).
| Metric | Target | Result |
|---|---|---|
| MOTA | > 50% | 84.0% |
| IDF1 | > 60% | 84.9% |
| FPS | > 30 | 60.2 |
| Mostly Tracked (MT) | > 50% | 65.4% |
| Recall | — | 87.8% |
| Precision | — | 96.9% |
Key finding: Tuning the IoU matching threshold from 0.8 to 0.5 yielded a +37.7pp MOTA improvement — the default was far too strict for Kalman filter predictions, causing ~17,000 unnecessary missed detections.
See tracking report for parameter sweep analysis and demo videos.
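To make the threshold finding concrete, here is a minimal sketch of IoU-gated association between Kalman-predicted track boxes and detections. Greedy matching stands in for the Hungarian assignment ByteTrack uses in practice, so this is illustrative rather than the `scripts/trackers/` implementation:

```python
import numpy as np

def iou_matrix(tracks: np.ndarray, dets: np.ndarray) -> np.ndarray:
    """Pairwise IoU between [N, 4] track boxes and [M, 4] detections (x1, y1, x2, y2)."""
    x1 = np.maximum(tracks[:, None, 0], dets[None, :, 0])
    y1 = np.maximum(tracks[:, None, 1], dets[None, :, 1])
    x2 = np.minimum(tracks[:, None, 2], dets[None, :, 2])
    y2 = np.minimum(tracks[:, None, 3], dets[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (tracks[:, 2] - tracks[:, 0]) * (tracks[:, 3] - tracks[:, 1])
    area_d = (dets[:, 2] - dets[:, 0]) * (dets[:, 3] - dets[:, 1])
    return inter / (area_t[:, None] + area_d[None, :] - inter + 1e-9)

def associate(pred_boxes, det_boxes, iou_thresh=0.5):
    """Greedy IoU association. A strict threshold (e.g. 0.8) rejects matches
    whose only flaw is Kalman prediction drift, turning them into missed
    detections; 0.5 tolerates that drift."""
    if len(pred_boxes) == 0 or len(det_boxes) == 0:
        return []
    ious = iou_matrix(pred_boxes, det_boxes)
    matches, used = [], set()
    for t in np.argsort(-ious.max(axis=1)):  # best-matched tracks first
        d = int(np.argmax(ious[t]))
        if ious[t, d] >= iou_thresh and d not in used:
            matches.append((int(t), d))
            used.add(d)
    return matches
```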
Three architecture families were compared on BDD10K (7,000 train / 1,000 val images, 19 Cityscapes-compatible classes).
| Model | Encoder | mIoU | Pixel Acc | FPS | Params |
|---|---|---|---|---|---|
| UNet | ResNet-50 | 60.49% | 93.04% | 116.5 | 32.5M |
| DeepLabV3+ | ResNet-101 | 63.14% | 93.65% | 105.9 | 45.7M |
| SegFormer | MiT-B2 | 64.26% | 93.61% | 46.0 | 27.4M |
| SegFormer | MiT-B3 | 64.49% | 93.81% | 34.5 | 47.2M |
Architecture progression: UNet → DeepLabV3+ (+2.65pp mIoU via ASPP multi-scale reasoning) → SegFormer (+1.12pp via transformer global context). Road IoU > 94% across all models (robust ground plane estimation). SegFormer MiT-B2 is the most parameter-efficient — 99.6% of B3's accuracy with 42% fewer parameters.
See segmentation report for per-class IoU, category analysis, and architectural insights.
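The training recipe behind these numbers (see Technical Highlights below) pairs CrossEntropy with Dice loss. A minimal sketch of such a combined loss, with the 50/50 weighting and the `ignore_index` value as illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ce_dice_loss(logits, target, num_classes=19, ignore_index=255,
                 ce_weight=0.5, dice_weight=0.5, eps=1e-6):
    """logits: [B, C, H, W] raw scores; target: [B, H, W] integer labels."""
    ce = F.cross_entropy(logits, target, ignore_index=ignore_index)

    # Soft Dice computed only over pixels with a valid label
    probs = logits.softmax(dim=1)
    valid = (target != ignore_index)
    safe_target = target.clone()
    safe_target[~valid] = 0  # any in-range class id works; masked out below
    onehot = F.one_hot(safe_target, num_classes).permute(0, 3, 1, 2).float()
    mask = valid.unsqueeze(1).float()
    probs, onehot = probs * mask, onehot * mask

    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

    return ce_weight * ce + dice_weight * dice
```

The differential learning rates are handled separately via optimizer parameter groups; the Quickstart command below passes them as `--lr` and `--encoder_lr`.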
Full PyTorch → ONNX → TensorRT FP16 pipeline with a standalone C++ inference application.
Detection — YOLOv8-l (960x960):
| Backend | Precision | Infer FPS | E2E FPS | mAP@50 |
|---|---|---|---|---|
| PyTorch | FP32 | 97.1 | 54.1 | 95.68% |
| TensorRT (Python) | FP16 | 313.0 | 85.9 | 95.62% |
| TensorRT (C++) | FP16 | 309.0 | 91.3 | 95.62% |
Detection — YOLOv8-m (960x960):
| Backend | Precision | Infer FPS | E2E FPS | mAP@50 |
|---|---|---|---|---|
| PyTorch | FP32 | 152.5 | 66.8 | 95.55% |
| TensorRT (Python) | FP16 | 429.1 | 93.3 | 95.53% |
| TensorRT (C++) | FP16 | 429.4 | 102.9 | 95.53% |
Segmentation — DeepLabV3+ (1280x720):
| Backend | Precision | Infer FPS | E2E FPS | mIoU |
|---|---|---|---|---|
| PyTorch | FP32 | 75.0 | 30.2 | 63.03% |
| TensorRT (Python) | FP16 | 342.6 | 43.4 | 63.02% |
C++ cross-validation: 1,122/1,122 KITTI images produce bit-identical detections vs Python TRT reference (0.0 px max coordinate diff). The hardest bug was a sort-stability issue caused by FP16 confidence ties in NMS — documented in debug report.
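To illustrate the fix, a small Python sketch of per-class NMS with deterministic score ordering; `kind="stable"` mirrors the C++ `std::stable_sort` so FP16 confidence ties resolve to the same box on both sides (illustrative, not the project's exact postprocessing):

```python
import numpy as np

def iou_one_to_many(box, others):
    """IoU of one box against [M, 4] boxes in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], others[:, 0]); y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2]); y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a = (box[2] - box[0]) * (box[3] - box[1])
    b = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (a + b - inter + 1e-9)

def nms_per_class(boxes, scores, classes, iou_thresh=0.45):
    """Per-class NMS with deterministic handling of score ties.

    An unstable sort may order equal FP16 confidences differently across
    implementations, changing which box survives; a stable sort keeps the
    original detection order for ties on every run."""
    keep = []
    for c in np.unique(classes):
        idx = np.where(classes == c)[0]
        order = idx[np.argsort(-scores[idx], kind="stable")]
        while order.size:
            i = order[0]
            keep.append(int(i))
            rest = order[1:]
            order = rest[iou_one_to_many(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```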
C++ timing breakdown (YOLOv8-l): Preprocessing 4.0 ms, H2D copy 3.4 ms, inference 3.2 ms, D2H copy 0.3 ms, postprocess 0.1 ms. The bottleneck is CPU preprocessing and memory transfers — a CUDA preprocessing kernel would cut latency by ~35%.
See deployment report for full benchmark tables and implementation details.
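As a pointer to how the export step fits together, a hedged sketch using the Ultralytics export API plus an ONNX Runtime smoke test. The checkpoint path is assumed from the training runs above, and the `trtexec` flags in the closing comment are the standard ones for an FP16 build:

```python
import numpy as np
import onnxruntime as ort
from ultralytics import YOLO

# Export the detector without embedded NMS so Python and C++ can share
# identical postprocessing (checkpoint path assumed for illustration)
model = YOLO("models/detection/yolov8l_kitti_scratch/weights/best.pt")
onnx_path = model.export(format="onnx", imgsz=960, opset=12)

# Smoke-test the exported graph with ONNX Runtime
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 960, 960).astype(np.float32)
(preds,) = sess.run(None, {sess.get_inputs()[0].name: x})
print(preds.shape)  # raw head output; NMS is applied by the caller

# The FP16 engine is then built along the lines of:
#   trtexec --onnx=<model>.onnx --fp16 --saveEngine=<model>_fp16.engine
```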
- Python 3.12+
- CUDA 12.x (for GPU acceleration)
- TensorRT 10.x (for deployment)
- CMake 3.18+ and MSVC (for C++ build)
```bash
git clone https://github.com/Hongye-Chen/ad-perception-system.git
cd ad-perception-system
conda create -n ad_perception python=3.12
conda activate ad_perception
pip install -r requirements.txt
```

```bash
# KITTI (detection + tracking)
# Register at https://www.cvlibs.uni-tuebingen.de/datasets/kitti/
python scripts/kitti/prepare_kitti.py

# BDD10K segmentation (for Phase 3)
# Download from https://www.bdd100k.com/
```

```bash
# YOLOv8-l from scratch on KITTI
python scripts/train_yolov8.py --data data/kitti/processed/dataset.yaml --model yolov8l --epochs 100 --batch 16 --imgsz 960 --device 1 --name yolov8l_kitti_scratch --patience 30
# YOLOv8-l fine-tuned from BDD100K
python scripts/train_yolov8.py --data data/kitti/processed/dataset.yaml --model "models/detection/yolov8l_bdd100k_1280/weights/best.pt" --epochs 100 --batch 16 --imgsz 960 --device 1 --name yolov8l_kitti_finetune --optimizer AdamW --lr0 0.001 --freeze 10
```

```bash
# Run ByteTrack on KITTI sequences
python scripts/track_kitti.py
# Evaluate MOT metrics
python scripts/evaluate_tracking.py
```

```bash
# Train SegFormer MiT-B2 on BDD10K
python scripts/segmentation/train_segmentation.py --model segformer --encoder mit_b2 --crop_size 768 --batch_size 4 --lr 0.00006 --encoder_lr 0.000006 --epochs 100 --patience 20 --output_dir models/segmentation/segformer_mit_b2_bdd10k
# Evaluate
python scripts/segmentation/evaluate_segmentation.py --model segformer --encoder mit_b2 --checkpoint models/segmentation/segformer_mit_b2_bdd10k/best.pt
# Compare all models
python scripts/segmentation/compare_models.py --metrics_files models/segmentation/*/metrics.json --output_dir results/segmentation
```

```bash
# Export YOLOv8-l to ONNX
python scripts/deployment/export_detection.py
# Build TensorRT engines
python scripts/deployment/build_trt_engines.py
# Validate accuracy (ONNX and TRT vs PyTorch)
python scripts/deployment/validate_onnx.py
python scripts/deployment/validate_trt.py
# Benchmark all backends
python scripts/deployment/benchmark_detection.py
```

```bash
cd cpp/build
cmake ..
cmake --build . --config Release
# Run inference
.\Release\ad_perception_infer.exe --engine ..\..\models\deployment\yolov8l_kitti_fp16.engine --image ..\..\data\kitti\processed\images\val\000000.png
# Benchmark
.\Release\ad_perception_infer.exe --engine ..\..\models\deployment\yolov8l_kitti_fp16.engine --benchmark --iterations 200
```

```
ad-perception-system/
├── scripts/
│ ├── train_yolov8.py # YOLOv8 detection training
│ ├── track_kitti.py # ByteTrack on KITTI sequences
│ ├── evaluate_tracking.py # MOT metrics evaluation
│ ├── segmentation/ # Segmentation training, evaluation, comparison, visualization
│ ├── deployment/ # ONNX export, TRT build, validation, benchmarking
│ ├── trackers/ # ByteTrack, Kalman filter implementations
│ ├── kitti/ # KITTI dataset preparation
│ └── bdd100k/ # BDD100K dataset preparation
├── cpp/ # C++ TensorRT inference application
│ ├── CMakeLists.txt
│ ├── include/ # trt_engine.h, preprocess.h, postprocess.h, cuda_utils.h
│ └── src/ # main.cpp, trt_engine.cpp, preprocess.cpp, postprocess.cpp
├── configs/ # YAML configs (detection, tracking, export)
├── notebooks/ # Jupyter notebooks for data exploration
├── docs/ # Phase reports, plans, guides
├── results/ # Benchmark results, metrics JSONs
├── models/ # Trained checkpoints, ONNX, TensorRT engines (git-ignored)
├── data/ # KITTI, BDD100K, BDD10K datasets (git-ignored)
├── requirements.txt
├── environment.yml
└── LICENSE
```
See docs/project_structure.md for a detailed breakdown of every file.
- Phase 1: Detection — KITTI (primary)
- Phase 1: Detection — BDD100K (archived)
- Phase 2: Tracking — KITTI
- Phase 3: Segmentation — BDD10K
- Phase 4: Deployment
- Phase 4: C++ Cross-Validation Debug
- Dataset Transition — BDD100K → KITTI rationale
- Environment Setup
- Training Guide
- YOLOv8-l/m (Ultralytics): Single-stage anchor-free detector, trained at 960x960 on KITTI
- Scratch training vs BDD100K fine-tuning comparison across 5 configurations
- SGD (scratch) vs AdamW (fine-tuning) with layer freezing
- ByteTrack: Two-stage association (high-confidence + low-confidence matches) with Kalman filter motion prediction
- Systematic parameter sweep on IoU matching threshold, revealing +37.7pp MOTA improvement
- UNet (ResNet-50): Encoder-decoder baseline with skip connections
- DeepLabV3+ (ResNet-101): Multi-scale context via Atrous Spatial Pyramid Pooling
- SegFormer (MiT-B2/B3): Hierarchical vision transformer with MLP decoder
- Combined CrossEntropy + Dice loss, differential learning rates, 768x768 training crop
- ONNX export with no embedded NMS (platform-agnostic postprocessing)
- TensorRT FP16 engines via `trtexec` — 3.2x inference speedup, zero mAP@50 loss
- C++ application: TRT engine loading, letterbox preprocessing (see the sketch after this list), per-class NMS with `std::stable_sort`, JSON benchmark output
- Cross-validation: Bit-identical results between C++ and Python on 1,122 images
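A minimal sketch of the letterbox transform named above, written in Python for readability (the project's version lives in cpp/src/preprocess.cpp; the pad value of 114 and center placement are common YOLO conventions, assumed here):

```python
import cv2
import numpy as np

def letterbox(img, new_size=960, pad_value=114):
    """Resize with preserved aspect ratio, then pad to a square canvas.

    Returns the padded image plus the scale and offsets needed to map
    network-space boxes back to original pixel coordinates."""
    h, w = img.shape[:2]
    scale = min(new_size / h, new_size / w)
    nh, nw = round(h * scale), round(w * scale)
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
    top = (new_size - nh) // 2
    left = (new_size - nw) // 2
    canvas = np.full((new_size, new_size, 3), pad_value, dtype=img.dtype)
    canvas[top:top + nh, left:left + nw] = resized
    return canvas, scale, (left, top)

# Mapping a predicted box back: x = (x_net - left) / scale, y = (y_net - top) / scale
```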
- Ultralytics YOLOv8
- ByteTrack
- segmentation-models-pytorch (UNet, DeepLabV3+)
- HuggingFace Transformers (SegFormer)
- KITTI Vision Benchmark
- BDD100K / BDD10K
- NVIDIA TensorRT
This project is licensed under the MIT License — see the LICENSE file for details.