A complete perception pipeline — detection, tracking, segmentation, and deployment — trained on KITTI and BDD10K, optimized with TensorRT, and deployed in C++.
This project implements a production-grade autonomous driving perception system covering four core tasks:
- Object Detection — YOLOv8-l/m on KITTI (4 classes), scratch vs fine-tuning comparison
- Multi-Object Tracking — ByteTrack with Kalman filter on KITTI tracking sequences
- Semantic Segmentation — UNet, DeepLabV3+, SegFormer on BDD10K (19 classes)
- Deployment & Optimization — PyTorch → ONNX → TensorRT FP16, standalone C++ inference app
| Phase | Task | Best Result | Model |
|---|---|---|---|
| 1 | Object Detection | 95.70% mAP@50 | YOLOv8-l (KITTI, 4 classes) |
| 2 | Multi-Object Tracking | 84.0% MOTA, 84.9% IDF1 | ByteTrack @ 60.2 FPS |
| 3 | Semantic Segmentation | 64.49% mIoU | SegFormer MiT-B3 (BDD10K, 19 classes) |
| 4 | TensorRT Deployment | 3.2× speedup, C++ @ 91.3 FPS | YOLOv8-l FP16, zero accuracy loss |
```mermaid
flowchart LR
subgraph datasets [Datasets]
KITTI["KITTI\n7.5K images\n21 sequences"]
BDD10K["BDD10K\n7K train / 1K val\n19 classes"]
end
subgraph training [Training — Python / PyTorch]
Det["Phase 1\nDetection\nYOLOv8-l/m"]
Track["Phase 2\nTracking\nByteTrack"]
Seg["Phase 3\nSegmentation\nUNet / DLV3+ / SegFormer"]
end
subgraph deploy [Deployment — Phase 4]
ONNX["ONNX Export"]
TRT["TensorRT FP16\n3.2x speedup"]
CPP["C++ Application\n91.3 FPS E2E"]
end
KITTI --> Det
KITTI --> Track
BDD10K --> Seg
Det --> Track
Det --> ONNX
Seg --> ONNX
ONNX --> TRT
TRT --> CPP
```
| Phase | Status | Documentation |
|---|---|---|
| Phase 1: Detection | Complete | KITTI Report, BDD100K Report |
| Phase 2: Tracking | Complete | Tracking Report |
| Phase 3: Segmentation | Complete | Segmentation Report |
| Phase 4: Deployment | Complete | Deployment Report |
Five models were trained to compare training from scratch against fine-tuning from BDD100K weights.
| Model | Training | mAP@50 | mAP@50-95 | Epochs | Time |
|---|---|---|---|---|---|
| YOLOv8-l | From scratch | 95.70% | 80.60% | 100 | 4.2 hrs |
| YOLOv8-l | Fine-tuned (BDD→KITTI) | 95.28% | 78.91% | 100 | 2.6 hrs |
| YOLOv8-m | From scratch | 95.57% | 79.76% | 100 | 3.0 hrs |
| YOLOv8-l | Fine-tuned (BDD→KITTI) | 94.85% | 76.64% | 30 | 48 min |
| YOLOv8-m | Fine-tuned (BDD→KITTI) | 94.38% | 75.30% | 30 | 35 min |
Key finding: Training from scratch outperformed fine-tuning by +1.7pp mAP@50-95, even though fine-tuning finished in 38% less time (2.6 vs 4.2 hrs) — revealing the BDD100K→KITTI domain gap.
Classes: Car, Truck, Pedestrian, Cyclist (4 classes) — see KITTI report for per-class breakdown and convergence analysis.
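For reference, the fine-tuning recipe in the table maps directly onto the Ultralytics API. A minimal sketch, mirroring the Quickstart command below (the checkpoint path is the one used there; treat the snippet as illustrative rather than the project's training script):

```python
from ultralytics import YOLO

# Start from the BDD100K checkpoint instead of random initialization
model = YOLO("models/detection/yolov8l_bdd100k_1280/weights/best.pt")

# Fine-tune on KITTI: freeze the first 10 layers and switch to AdamW
# with a reduced initial learning rate (the scratch runs used the
# default SGD schedule instead)
model.train(
    data="data/kitti/processed/dataset.yaml",
    epochs=100,
    batch=16,
    imgsz=960,
    optimizer="AdamW",
    lr0=0.001,
    freeze=10,
    name="yolov8l_kitti_finetune",
)
```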
A ByteTrack tracker with a Kalman filter, evaluated on 21 KITTI tracking sequences (853 ground-truth tracks).
| Metric | Target | Result |
|---|---|---|
| MOTA | > 50% | 84.0% |
| IDF1 | > 60% | 84.9% |
| FPS | > 30 | 60.2 |
| Mostly Tracked (MT) | > 50% | 65.4% |
| Recall | — | 87.8% |
| Precision | — | 96.9% |
Key finding: Tuning the IoU matching threshold from 0.8 to 0.5 yielded a +37.7pp MOTA improvement — the default was far too strict for Kalman filter predictions, causing ~17,000 unnecessary missed detections.
See tracking report for parameter sweep analysis and demo videos.
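To make the threshold finding concrete, here is a minimal sketch of IoU-gated association between Kalman-predicted track boxes and detections. Greedy matching stands in for the Hungarian assignment ByteTrack uses in practice, so this is illustrative rather than the `scripts/trackers/` implementation:

```python
import numpy as np

def iou_matrix(tracks: np.ndarray, dets: np.ndarray) -> np.ndarray:
    """Pairwise IoU between [N, 4] track boxes and [M, 4] detections (x1, y1, x2, y2)."""
    x1 = np.maximum(tracks[:, None, 0], dets[None, :, 0])
    y1 = np.maximum(tracks[:, None, 1], dets[None, :, 1])
    x2 = np.minimum(tracks[:, None, 2], dets[None, :, 2])
    y2 = np.minimum(tracks[:, None, 3], dets[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (tracks[:, 2] - tracks[:, 0]) * (tracks[:, 3] - tracks[:, 1])
    area_d = (dets[:, 2] - dets[:, 0]) * (dets[:, 3] - dets[:, 1])
    return inter / (area_t[:, None] + area_d[None, :] - inter + 1e-9)

def associate(pred_boxes, det_boxes, iou_thresh=0.5):
    """Greedy IoU association. A strict threshold (e.g. 0.8) rejects matches
    whose only flaw is Kalman prediction drift, turning them into missed
    detections; 0.5 tolerates that drift."""
    if len(pred_boxes) == 0 or len(det_boxes) == 0:
        return []
    ious = iou_matrix(pred_boxes, det_boxes)
    matches, used = [], set()
    for t in np.argsort(-ious.max(axis=1)):  # best-matched tracks first
        d = int(np.argmax(ious[t]))
        if ious[t, d] >= iou_thresh and d not in used:
            matches.append((int(t), d))
            used.add(d)
    return matches
```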
Three architecture families were compared on BDD10K (7,000 train / 1,000 val images, 19 Cityscapes-compatible classes).
| Model | Encoder | mIoU | Pixel Acc | FPS | Params |
|---|---|---|---|---|---|
| UNet | ResNet-50 | 60.49% | 93.04% | 116.5 | 32.5M |
| DeepLabV3+ | ResNet-101 | 63.14% | 93.65% | 105.9 | 45.7M |
| SegFormer | MiT-B2 | 64.26% | 93.61% | 46.0 | 27.4M |
| SegFormer | MiT-B3 | 64.49% | 93.81% | 34.5 | 47.2M |
Architecture progression: UNet → DeepLabV3+ (+2.65pp mIoU via ASPP multi-scale reasoning) → SegFormer (+1.12pp via transformer global context). Road IoU > 94% across all models (robust ground plane estimation). SegFormer MiT-B2 is the most parameter-efficient — 99.6% of B3's accuracy with 42% fewer parameters.
See segmentation report for per-class IoU, category analysis, and architectural insights.
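The training recipe behind these numbers (see Technical Highlights below) pairs CrossEntropy with Dice loss. A minimal sketch of such a combined loss, with the 50/50 weighting and the `ignore_index` value as illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ce_dice_loss(logits, target, num_classes=19, ignore_index=255,
                 ce_weight=0.5, dice_weight=0.5, eps=1e-6):
    """logits: [B, C, H, W] raw scores; target: [B, H, W] integer labels."""
    ce = F.cross_entropy(logits, target, ignore_index=ignore_index)

    # Soft Dice computed only over pixels with a valid label
    probs = logits.softmax(dim=1)
    valid = (target != ignore_index)
    safe_target = target.clone()
    safe_target[~valid] = 0  # any in-range class id works; masked out below
    onehot = F.one_hot(safe_target, num_classes).permute(0, 3, 1, 2).float()
    mask = valid.unsqueeze(1).float()
    probs, onehot = probs * mask, onehot * mask

    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

    return ce_weight * ce + dice_weight * dice
```

The differential learning rates are handled separately via optimizer parameter groups; the Quickstart command below passes them as `--lr` and `--encoder_lr`.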
Full PyTorch → ONNX → TensorRT FP16 pipeline with a standalone C++ inference application.
Detection — YOLOv8-l (960x960):
| Backend | Precision | Infer FPS | E2E FPS | mAP@50 |
|---|---|---|---|---|
| PyTorch | FP32 | 97.1 | 54.1 | 95.68% |
| TensorRT (Python) | FP16 | 313.0 | 85.9 | 95.62% |
| TensorRT (C++) | FP16 | 309.0 | 91.3 | 95.62% |
Detection — YOLOv8-m (960x960):
| Backend | Precision | Infer FPS | E2E FPS | mAP@50 |
|---|---|---|---|---|
| PyTorch | FP32 | 152.5 | 66.8 | 95.55% |
| TensorRT (Python) | FP16 | 429.1 | 93.3 | 95.53% |
| TensorRT (C++) | FP16 | 429.4 | 102.9 | 95.53% |
Segmentation — DeepLabV3+ (1280x720):
| Backend | Precision | Infer FPS | E2E FPS | mIoU |
|---|---|---|---|---|
| PyTorch | FP32 | 75.0 | 30.2 | 63.03% |
| TensorRT (Python) | FP16 | 342.6 | 43.4 | 63.02% |
C++ cross-validation: 1,122/1,122 KITTI images produce bit-identical detections vs Python TRT reference (0.0 px max coordinate diff). The hardest bug was a sort-stability issue caused by FP16 confidence ties in NMS — documented in debug report.
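To illustrate the fix, a small Python sketch of per-class NMS with deterministic score ordering; `kind="stable"` mirrors the C++ `std::stable_sort` so FP16 confidence ties resolve to the same box on both sides (illustrative, not the project's exact postprocessing):

```python
import numpy as np

def iou_one_to_many(box, others):
    """IoU of one box against [M, 4] boxes in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], others[:, 0]); y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2]); y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a = (box[2] - box[0]) * (box[3] - box[1])
    b = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (a + b - inter + 1e-9)

def nms_per_class(boxes, scores, classes, iou_thresh=0.45):
    """Per-class NMS with deterministic handling of score ties.

    An unstable sort may order equal FP16 confidences differently across
    implementations, changing which box survives; a stable sort keeps the
    original detection order for ties on every run."""
    keep = []
    for c in np.unique(classes):
        idx = np.where(classes == c)[0]
        order = idx[np.argsort(-scores[idx], kind="stable")]
        while order.size:
            i = order[0]
            keep.append(int(i))
            rest = order[1:]
            order = rest[iou_one_to_many(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```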
C++ timing breakdown (YOLOv8-l): Preprocessing 4.0 ms, H2D copy 3.4 ms, inference 3.2 ms, D2H copy 0.3 ms, postprocess 0.1 ms. The bottleneck is CPU preprocessing and memory transfers — a CUDA preprocessing kernel would cut latency by ~35%.
See deployment report for full benchmark tables and implementation details.
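As a pointer to how the export step fits together, a hedged sketch using the Ultralytics export API plus an ONNX Runtime smoke test. The checkpoint path is assumed from the training runs above, and the `trtexec` flags in the closing comment are the standard ones for an FP16 build:

```python
import numpy as np
import onnxruntime as ort
from ultralytics import YOLO

# Export the detector without embedded NMS so Python and C++ can share
# identical postprocessing (checkpoint path assumed for illustration)
model = YOLO("models/detection/yolov8l_kitti_scratch/weights/best.pt")
onnx_path = model.export(format="onnx", imgsz=960, opset=12)

# Smoke-test the exported graph with ONNX Runtime
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 960, 960).astype(np.float32)
(preds,) = sess.run(None, {sess.get_inputs()[0].name: x})
print(preds.shape)  # raw head output; NMS is applied by the caller

# The FP16 engine is then built along the lines of:
#   trtexec --onnx=<model>.onnx --fp16 --saveEngine=<model>_fp16.engine
```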
- Python 3.12+
- CUDA 12.x (for GPU acceleration)
- TensorRT 10.x (for deployment)
- CMake 3.18+ and MSVC (for C++ build)
```bash
git clone https://github.com/Hongye-Chen/ad-perception-system.git
cd ad-perception-system
conda create -n ad_perception python=3.12
conda activate ad_perception
pip install -r requirements.txt
```

```bash
# KITTI (detection + tracking)
# Register at https://www.cvlibs.uni-tuebingen.de/datasets/kitti/
python scripts/kitti/prepare_kitti.py

# BDD10K segmentation (for Phase 3)
# Download from https://www.bdd100k.com/
```

```bash
# YOLOv8-l from scratch on KITTI
python scripts/train_yolov8.py --data data/kitti/processed/dataset.yaml --model yolov8l --epochs 100 --batch 16 --imgsz 960 --device 1 --name yolov8l_kitti_scratch --patience 30
# YOLOv8-l fine-tuned from BDD100K
python scripts/train_yolov8.py --data data/kitti/processed/dataset.yaml --model "models/detection/yolov8l_bdd100k_1280/weights/best.pt" --epochs 100 --batch 16 --imgsz 960 --device 1 --name yolov8l_kitti_finetune --optimizer AdamW --lr0 0.001 --freeze 10
```

```bash
# Run ByteTrack on KITTI sequences
python scripts/track_kitti.py
# Evaluate MOT metrics
python scripts/evaluate_tracking.py
```

```bash
# Train SegFormer MiT-B2 on BDD10K
python scripts/segmentation/train_segmentation.py --model segformer --encoder mit_b2 --crop_size 768 --batch_size 4 --lr 0.00006 --encoder_lr 0.000006 --epochs 100 --patience 20 --output_dir models/segmentation/segformer_mit_b2_bdd10k
# Evaluate
python scripts/segmentation/evaluate_segmentation.py --model segformer --encoder mit_b2 --checkpoint models/segmentation/segformer_mit_b2_bdd10k/best.pt
# Compare all models
python scripts/segmentation/compare_models.py --metrics_files models/segmentation/*/metrics.json --output_dir results/segmentation
```

```bash
# Export YOLOv8-l to ONNX
python scripts/deployment/export_detection.py
# Build TensorRT engines
python scripts/deployment/build_trt_engines.py
# Validate accuracy (ONNX and TRT vs PyTorch)
python scripts/deployment/validate_onnx.py
python scripts/deployment/validate_trt.py
# Benchmark all backends
python scripts/deployment/benchmark_detection.py
```

```bash
cd cpp/build
cmake ..
cmake --build . --config Release
# Run inference
.\Release\ad_perception_infer.exe --engine ..\..\models\deployment\yolov8l_kitti_fp16.engine --image ..\..\data\kitti\processed\images\val\000000.png
# Benchmark
.\Release\ad_perception_infer.exe --engine ..\..\models\deployment\yolov8l_kitti_fp16.engine --benchmark --iterations 200
```

```
ad-perception-system/
├── scripts/
│ ├── train_yolov8.py # YOLOv8 detection training
│ ├── track_kitti.py # ByteTrack on KITTI sequences
│ ├── evaluate_tracking.py # MOT metrics evaluation
│ ├── segmentation/ # Segmentation training, evaluation, comparison, visualization
│ ├── deployment/ # ONNX export, TRT build, validation, benchmarking
│ ├── trackers/ # ByteTrack, Kalman filter implementations
│ ├── kitti/ # KITTI dataset preparation
│ └── bdd100k/ # BDD100K dataset preparation
├── cpp/ # C++ TensorRT inference application
│ ├── CMakeLists.txt
│ ├── include/ # trt_engine.h, preprocess.h, postprocess.h, cuda_utils.h
│ └── src/ # main.cpp, trt_engine.cpp, preprocess.cpp, postprocess.cpp
├── configs/ # YAML configs (detection, tracking, export)
├── notebooks/ # Jupyter notebooks for data exploration
├── docs/ # Phase reports, plans, guides
├── results/ # Benchmark results, metrics JSONs
├── models/ # Trained checkpoints, ONNX, TensorRT engines (git-ignored)
├── data/ # KITTI, BDD100K, BDD10K datasets (git-ignored)
├── requirements.txt
├── environment.yml
└── LICENSE
```
See docs/project_structure.md for a detailed breakdown of every file.
- Phase 1: Detection — KITTI (primary)
- Phase 1: Detection — BDD100K (archived)
- Phase 2: Tracking — KITTI
- Phase 3: Segmentation — BDD10K
- Phase 4: Deployment
- Phase 4: C++ Cross-Validation Debug
- Dataset Transition — BDD100K → KITTI rationale
- Environment Setup
- Training Guide
- YOLOv8-l/m (Ultralytics): Single-stage anchor-free detector, trained at 960x960 on KITTI
- Scratch training vs BDD100K fine-tuning comparison across 5 configurations
- SGD (scratch) vs AdamW (fine-tuning) with layer freezing
- ByteTrack: Two-stage association (high-confidence + low-confidence matches) with Kalman filter motion prediction
- Systematic parameter sweep on IoU matching threshold, revealing +37.7pp MOTA improvement
- UNet (ResNet-50): Encoder-decoder baseline with skip connections
- DeepLabV3+ (ResNet-101): Multi-scale context via Atrous Spatial Pyramid Pooling
- SegFormer (MiT-B2/B3): Hierarchical vision transformer with MLP decoder
- Combined CrossEntropy + Dice loss, differential learning rates, 768x768 training crop
- ONNX export with no embedded NMS (platform-agnostic postprocessing)
- TensorRT FP16 engines via `trtexec` — 3.2x inference speedup, zero mAP@50 loss
- C++ application: TRT engine loading, letterbox preprocessing (see the sketch after this list), per-class NMS with `std::stable_sort`, JSON benchmark output
- Cross-validation: Bit-identical results between C++ and Python on 1,122 images
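A minimal sketch of the letterbox transform named above, written in Python for readability (the project's version lives in cpp/src/preprocess.cpp; the pad value of 114 and center placement are common YOLO conventions, assumed here):

```python
import cv2
import numpy as np

def letterbox(img, new_size=960, pad_value=114):
    """Resize with preserved aspect ratio, then pad to a square canvas.

    Returns the padded image plus the scale and offsets needed to map
    network-space boxes back to original pixel coordinates."""
    h, w = img.shape[:2]
    scale = min(new_size / h, new_size / w)
    nh, nw = round(h * scale), round(w * scale)
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
    top = (new_size - nh) // 2
    left = (new_size - nw) // 2
    canvas = np.full((new_size, new_size, 3), pad_value, dtype=img.dtype)
    canvas[top:top + nh, left:left + nw] = resized
    return canvas, scale, (left, top)

# Mapping a predicted box back: x = (x_net - left) / scale, y = (y_net - top) / scale
```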
- Ultralytics YOLOv8
- ByteTrack
- segmentation-models-pytorch (UNet, DeepLabV3+)
- HuggingFace Transformers (SegFormer)
- KITTI Vision Benchmark
- BDD100K / BDD10K
- NVIDIA TensorRT
This project is licensed under the MIT License — see the LICENSE file for details.