This tutorial demonstrates a complete monocular visual odometry (VO) pipeline using OpenCV. Visual odometry estimates the camera's motion by analyzing changes in images over time - a fundamental component of Visual SLAM.
Visual Odometry (VO) is the process of estimating the pose (position and orientation) of a camera by analyzing the sequence of images it captures.
| Aspect | Visual Odometry | Visual SLAM |
|---|---|---|
| Output | Incremental pose estimates | Global map + trajectory |
| Loop closure | No | Yes |
| Drift | Accumulates | Corrected by loops |
| Complexity | Simpler | More complex |
| Use case | Local navigation | Global localization |
```
+-------------------+     +-------------------+     +-------------------+
|    Image Input    | --> | Feature Detection | --> | Feature Tracking  |
|  (Frame t, t+1)   |     |  (FAST/ORB/GFTT)  |     |   (KLT Optical    |
+-------------------+     +-------------------+     |      Flow)        |
                                                    +-------------------+
                                                              |
                                                              v
+-------------------+     +-------------------+     +-------------------+
|    Pose Update    | <-- |   Pose Recovery   | <-- | Essential Matrix  |
| (Accumulate R,t)  |     |   (recoverPose)   |     | (findEssentialMat |
+-------------------+     +-------------------+     |   with RANSAC)    |
          ^                                         +-------------------+
          |
+-------------------+
| Scale Estimation  |
| (from GT or other |
|     sensors)      |
+-------------------+
```
Features are distinctive points in an image that can be reliably detected and tracked.
| Detector | Description | Speed | Use Case |
|---|---|---|---|
| FAST | Corner detector | Very fast | Real-time VO |
| GFTT | Shi-Tomasi corners | Fast | General tracking |
| ORB | Oriented FAST + BRIEF | Fast | Feature matching |
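As a quick reference, here is a minimal sketch of how each detector in the table is invoked in OpenCV (the helper name, the FAST threshold of 20, and the 2000-feature cap are illustrative values, not tuned settings):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

void detectAllVariants(const cv::Mat& gray) {
    std::vector<cv::KeyPoint> keypoints;

    // FAST: corner detector only, no descriptors
    cv::Ptr<cv::FastFeatureDetector> fast = cv::FastFeatureDetector::create(20);
    fast->detect(gray, keypoints);

    // ORB: oriented FAST keypoints plus binary BRIEF descriptors
    cv::Ptr<cv::ORB> orb = cv::ORB::create(2000);
    cv::Mat descriptors;
    orb->detectAndCompute(gray, cv::noArray(), keypoints, descriptors);

    // GFTT (Shi-Tomasi): returns point coordinates, ideal as KLT seeds
    std::vector<cv::Point2f> corners;
    cv::goodFeaturesToTrack(gray, corners, 2000, 0.01, 10);
}
```

A KLT-based pipeline like the one in this tutorial feeds the GFTT corners straight into optical flow; a matching-based pipeline would use the ORB descriptors instead.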
Lucas-Kanade optical flow tracks features from one frame to the next:
- Uses image pyramids for handling large motions
- Iterative refinement for sub-pixel accuracy
- Handles feature loss and outliers
The essential matrix E encodes the relative rotation and translation:
`x2^T * E * x1 = 0` (the epipolar constraint)

- Uses RANSAC for robust estimation against outliers
- Requires camera intrinsics (focal length, principal point)
Decompose the essential matrix to get R and t:
`E = [t]_x * R` (the skew-symmetric matrix of t times R)

- Four possible solutions; only one is geometrically valid
- OpenCV's `recoverPose` selects the correct solution via triangulation
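To see the four candidates explicitly, OpenCV also exposes the raw decomposition (a short fragment; `E` is assumed to come from `findEssentialMat` as in the example later in this tutorial):

```cpp
// Decompose E into its two candidate rotations and the translation
// direction; the sign of t is undetermined, giving four combinations.
cv::Mat R1, R2, t;
cv::decomposeEssentialMat(E, R1, R2, t);
// Candidates: (R1, t), (R1, -t), (R2, t), (R2, -t).
// recoverPose triangulates points under each candidate and keeps the one
// that places the points in front of both cameras (positive depth).
```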
Monocular VO cannot determine absolute scale from images alone:
- Ground truth: Use known trajectory (for evaluation)
- Known objects: Use objects with known dimensions
- IMU fusion: Accelerometer provides metric scale
- Ground plane: Assume known camera height
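For the ground-plane option, here is a hypothetical sketch (the helper name `scaleFromGroundPlane` and the median-height heuristic are illustrative assumptions, not the tutorial's implementation; on KITTI the camera sits roughly 1.65 m above the road):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
#include <opencv2/opencv.hpp>

// Triangulate inlier matches at the unit-translation scale, take the median
// height of points below the camera (+Y points down in the camera frame),
// and compare it against the known camera height to get a metric scale.
double scaleFromGroundPlane(const std::vector<cv::Point2f>& prev_pts,
                            const std::vector<cv::Point2f>& curr_pts,
                            const cv::Mat& K, const cv::Mat& R,
                            const cv::Mat& t, double camera_height) {
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);  // previous view: K[I|0]
    cv::Mat Rt;
    cv::hconcat(R, t, Rt);
    cv::Mat P2 = K * Rt;                          // current view:  K[R|t]

    cv::Mat pts4d;
    cv::triangulatePoints(P1, P2, prev_pts, curr_pts, pts4d);

    std::vector<double> heights;
    for (int i = 0; i < pts4d.cols; i++) {
        cv::Mat x;
        pts4d.col(i).convertTo(x, CV_64F);
        double w = x.at<double>(3);
        if (std::abs(w) < 1e-9) continue;         // skip points at infinity
        double y = x.at<double>(1) / w;
        if (y > 0) heights.push_back(y);          // candidate ground points
    }
    if (heights.empty()) return 1.0;              // fall back to unit scale

    std::nth_element(heights.begin(),
                     heights.begin() + heights.size() / 2, heights.end());
    return camera_height / heights[heights.size() / 2];
}
```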
```
2_15/
├── README.md
├── CMakeLists.txt
├── Dockerfile
├── include/
│   └── monocular_vo.hpp              # VO pipeline header
├── src/
│   └── monocular_vo.cpp              # VO pipeline implementation
└── examples/
    ├── feature_tracking_demo.cpp     # Feature detection/tracking demo
    ├── monocular_vo.cpp              # Standalone monocular VO pipeline
    └── run_vo_kitti.cpp              # Complete VO on KITTI dataset
```
- OpenCV 4.x (core, features2d, calib3d, video, highgui)
```bash
mkdir build && cd build
cmake ..
make -j4
```

Or build the Docker image:

```bash
docker build -t slam_zero_to_hero:2_15 .
```

- Download a sequence from the KITTI Visual Odometry benchmark
- Extract to a folder (e.g., `/data/kitti/sequences/00/`)
```bash
# Local
./build/feature_tracking_demo /path/to/images 100
# Docker
docker run -it --rm \
-v /path/to/kitti:/data \
-e DISPLAY=$DISPLAY \
-v /tmp/.X11-unix:/tmp/.X11-unix \
slam_zero_to_hero:2_15 ./feature_tracking_demo /data/sequences/00/image_0 100
```

```bash
# Local
./build/run_vo_kitti /path/to/kitti/sequences/00/image_0 /path/to/kitti/poses/00.txt
# Docker
docker run -it --rm \
-v /path/to/kitti:/data \
-e DISPLAY=$DISPLAY \
-v /tmp/.X11-unix:/tmp/.X11-unix \
slam_zero_to_hero:2_15 ./run_vo_kitti /data/sequences/00/image_0 /data/poses/00.txt
```

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/video/tracking.hpp>
#include <opencv2/calib3d.hpp>
class MonocularVO {
public:
MonocularVO(double focal, cv::Point2d pp)
: focal_(focal), pp_(pp), frame_id_(0) {
R_total_ = cv::Mat::eye(3, 3, CV_64F);
t_total_ = cv::Mat::zeros(3, 1, CV_64F);
}
void processFrame(const cv::Mat& frame) {
cv::Mat gray;
if (frame.channels() == 3)
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);  // color input
else
    gray = frame;  // already grayscale (e.g., KITTI image_0)
if (frame_id_ == 0) {
// First frame: detect features
detectFeatures(gray, prev_pts_);
prev_gray_ = gray.clone();
frame_id_++;
return;
}
// Track features
std::vector<cv::Point2f> curr_pts;
std::vector<uchar> status;
trackFeatures(prev_gray_, gray, prev_pts_, curr_pts, status);
// Filter by status
std::vector<cv::Point2f> prev_good, curr_good;
for (size_t i = 0; i < status.size(); i++) {
if (status[i]) {
prev_good.push_back(prev_pts_[i]);
curr_good.push_back(curr_pts[i]);
}
}
// Estimate motion
if (prev_good.size() >= 8) {
cv::Mat E, R, t, mask;
E = cv::findEssentialMat(curr_good, prev_good, focal_, pp_,
cv::RANSAC, 0.999, 1.0, mask);
cv::recoverPose(E, curr_good, prev_good, R, t, focal_, pp_, mask);
// Accumulate pose (with scale)
double scale = getScale(); // From ground truth or other source
t_total_ = t_total_ + scale * (R_total_ * t);
R_total_ = R * R_total_;
}
// Re-detect if too few features
if (curr_good.size() < 100) {
detectFeatures(gray, prev_pts_);
} else {
prev_pts_ = curr_good;
}
prev_gray_ = gray.clone();
frame_id_++;
}
cv::Mat getPosition() const { return t_total_.clone(); }
cv::Mat getRotation() const { return R_total_.clone(); }
protected:  // subclasses must be able to override getScale() and read frame_id_
void detectFeatures(const cv::Mat& gray, std::vector<cv::Point2f>& pts) {
cv::goodFeaturesToTrack(gray, pts, 2000, 0.01, 10);
}
void trackFeatures(const cv::Mat& prev, const cv::Mat& curr,
std::vector<cv::Point2f>& prev_pts,
std::vector<cv::Point2f>& curr_pts,
std::vector<uchar>& status) {
std::vector<float> err;
cv::calcOpticalFlowPyrLK(prev, curr, prev_pts, curr_pts,
status, err, cv::Size(21, 21), 3);
}
virtual double getScale() { return 1.0; } // Override for real scale
double focal_;
cv::Point2d pp_;
int frame_id_;
cv::Mat prev_gray_;
std::vector<cv::Point2f> prev_pts_;
cv::Mat R_total_, t_total_;
};
```

```cpp
// Detect good features to track
std::vector<cv::Point2f> corners;
cv::goodFeaturesToTrack(
gray, // Input image
corners, // Output corners
2000, // Max corners
0.01, // Quality level
10, // Min distance between corners
cv::noArray(), // Mask (optional)
3, // Block size
false, // Use Harris detector
0.04 // Harris parameter
);
// Track features using Lucas-Kanade
std::vector<cv::Point2f> next_pts;
std::vector<uchar> status;
std::vector<float> error;
cv::calcOpticalFlowPyrLK(
prev_gray, // Previous image
curr_gray, // Current image
prev_pts, // Previous points
next_pts, // Tracked points (output)
status, // Status (1 = found, 0 = lost)
error, // Tracking error
cv::Size(21, 21), // Window size
3, // Pyramid levels
cv::TermCriteria(
cv::TermCriteria::COUNT | cv::TermCriteria::EPS,
30, 0.01) // Termination criteria
);
```

```cpp
// Camera intrinsics (KITTI sequence 00)
double focal = 718.856;
cv::Point2d pp(607.1928, 185.2157);
// Essential matrix from point correspondences
cv::Mat E, mask;
E = cv::findEssentialMat(
curr_pts, // Points in current frame
prev_pts, // Points in previous frame
focal, // Focal length
pp, // Principal point
cv::RANSAC, // RANSAC method
0.999, // Confidence
1.0, // Threshold
mask // Inlier mask
);
// Decompose to R and t (t is unit vector, scale unknown)
cv::Mat R, t;
int inliers = cv::recoverPose(E, curr_pts, prev_pts, R, t, focal, pp, mask);
std::cout << "Rotation:\n" << R << std::endl;
std::cout << "Translation (unit): " << t.t() << std::endl;
std::cout << "Inliers: " << inliers << std::endl;class MonocularVO_KITTI : public MonocularVO {
public:
MonocularVO_KITTI(double focal, cv::Point2d pp,
const std::string& poses_file)
: MonocularVO(focal, pp) {
loadGroundTruth(poses_file);
}
private:
void loadGroundTruth(const std::string& file) {
std::ifstream f(file);
std::string line;
while (std::getline(f, line)) {
std::istringstream iss(line);
std::vector<double> pose(12);
for (int i = 0; i < 12; i++) iss >> pose[i];
ground_truth_.push_back(cv::Point3d(pose[3], pose[7], pose[11]));
}
}
double getScale() override {
if (frame_id_ < 2 || frame_id_ >= static_cast<int>(ground_truth_.size()))
return 1.0;
cv::Point3d p1 = ground_truth_[frame_id_ - 1];
cv::Point3d p2 = ground_truth_[frame_id_];
return cv::norm(p2 - p1);
}
std::vector<cv::Point3d> ground_truth_;
};
```

```cpp
void visualizeTrajectory(const std::vector<cv::Point3d>& trajectory,
const std::vector<cv::Point3d>& ground_truth) {
cv::Mat traj_img(800, 800, CV_8UC3, cv::Scalar(255, 255, 255));
double scale = 1.0; // Adjust based on trajectory extent
for (size_t i = 1; i < trajectory.size(); i++) {
// Draw estimated trajectory (blue)
cv::Point2i p1(400 + scale * trajectory[i-1].x,
400 - scale * trajectory[i-1].z);
cv::Point2i p2(400 + scale * trajectory[i].x,
400 - scale * trajectory[i].z);
cv::line(traj_img, p1, p2, cv::Scalar(255, 0, 0), 2);
// Draw ground truth (green)
if (i < ground_truth.size()) {
cv::Point2i g1(400 + scale * ground_truth[i-1].x,
400 - scale * ground_truth[i-1].z);
cv::Point2i g2(400 + scale * ground_truth[i].x,
400 - scale * ground_truth[i].z);
cv::line(traj_img, g1, g2, cv::Scalar(0, 255, 0), 2);
}
}
cv::imshow("Trajectory", traj_img);
}
```

```cpp
// KITTI sequence 00 camera parameters
double focal = 718.856; // Focal length in pixels
cv::Point2d pp(607.1928, 185.2157); // Principal point
// Camera matrix K
cv::Mat K = (cv::Mat_<double>(3,3) <<
focal, 0, pp.x,
0, focal, pp.y,
0, 0, 1);
```

```cpp
// Frame-to-frame motion: R_curr, t_curr
// World pose: R_world, t_world
// Update world pose
t_world = t_world + scale * (R_world * t_curr);
R_world = R_curr * R_world;
// Extract position
double x = t_world.at<double>(0);
double y = t_world.at<double>(1);
double z = t_world.at<double>(2);
```

| Limitation | Cause | Mitigation |
|---|---|---|
| Scale ambiguity | Single camera | IMU fusion, known objects |
| Pure rotation | No baseline | Detect and skip |
| Texture-less regions | No features | Use direct methods |
| Motion blur | Fast motion | Higher frame rate |
| Drift | No loop closure | Add SLAM backend |
- Feature distribution: Use grid-based detection (see the sketch after this list)
- Outlier rejection: Use bidirectional tracking
- Re-detection: Detect new features when count drops
- Motion model: Use constant velocity assumption for prediction
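For the first tip, one possible grid-based detection sketch (the helper name `detectGridFeatures`, the 4x4 grid, and the per-cell cap are illustrative choices):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Run Shi-Tomasi detection independently in each grid cell so features
// cover the whole image instead of clustering in high-texture regions.
void detectGridFeatures(const cv::Mat& gray, std::vector<cv::Point2f>& pts,
                        int grid_rows = 4, int grid_cols = 4,
                        int max_per_cell = 50) {
    pts.clear();
    const int cell_w = gray.cols / grid_cols;
    const int cell_h = gray.rows / grid_rows;
    for (int r = 0; r < grid_rows; r++) {
        for (int c = 0; c < grid_cols; c++) {
            cv::Rect cell(c * cell_w, r * cell_h, cell_w, cell_h);
            std::vector<cv::Point2f> cell_pts;
            cv::goodFeaturesToTrack(gray(cell), cell_pts, max_per_cell,
                                    0.01, 10);
            // Shift cell-local coordinates back into full-image coordinates
            for (const auto& p : cell_pts)
                pts.emplace_back(p.x + cell.x, p.y + cell.y);
        }
    }
}
```

The next snippet implements the second tip, bidirectional (forward-backward) tracking: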
```cpp
// Bidirectional tracking for outlier rejection: track curr -> prev and
// check that each point lands back where it started.
std::vector<cv::Point2f> back_pts;
std::vector<uchar> back_status;
std::vector<float> back_err;
cv::calcOpticalFlowPyrLK(curr_gray, prev_gray, curr_pts, back_pts,
                         back_status, back_err);
// Reject tracks that fail the round trip or drift more than 1 px
for (size_t i = 0; i < prev_pts.size(); i++) {
    if (!back_status[i] || cv::norm(prev_pts[i] - back_pts[i]) > 1.0) {
        status[i] = 0;  // mark inconsistent track as lost
    }
}
```

```cpp
double computeATE(const std::vector<cv::Point3d>& estimated,
const std::vector<cv::Point3d>& ground_truth) {
double sum_sq = 0.0;
size_t n = std::min(estimated.size(), ground_truth.size());
for (size_t i = 0; i < n; i++) {
    double d = cv::norm(estimated[i] - ground_truth[i]);
    sum_sq += d * d;
}
return std::sqrt(sum_sq / n); // RMSE
}
```

Relative Pose Error (RPE) measures local consistency over fixed distances (e.g., 100 m).
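Here is a minimal sketch of a segment-based relative translation error in that spirit (the helper name `computeRPE` and the single 100 m segment length are simplifications; the official KITTI evaluation averages over segment lengths from 100 m to 800 m and scores rotation as well):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
#include <opencv2/opencv.hpp>

// Walk the ground-truth path; every time ~100 m of path has been covered,
// compare the estimated displacement against the ground-truth displacement.
double computeRPE(const std::vector<cv::Point3d>& estimated,
                  const std::vector<cv::Point3d>& ground_truth,
                  double segment_length = 100.0) {
    size_t n = std::min(estimated.size(), ground_truth.size());
    double err_sum = 0.0;
    int segments = 0;
    size_t start = 0;
    double travelled = 0.0;
    for (size_t i = 1; i < n; i++) {
        travelled += cv::norm(ground_truth[i] - ground_truth[i - 1]);
        if (travelled >= segment_length) {
            double gt_d  = cv::norm(ground_truth[i] - ground_truth[start]);
            double est_d = cv::norm(estimated[i] - estimated[start]);
            if (gt_d > 1e-6) {               // skip near-degenerate segments
                err_sum += std::abs(est_d - gt_d) / gt_d;
                segments++;
            }
            start = i;                       // begin the next segment here
            travelled = 0.0;
        }
    }
    return segments > 0 ? err_sum / segments : 0.0;  // mean relative error
}
```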
- Nister, D., Naroditsky, O., & Bergen, J. (2004). "Visual odometry". CVPR 2004.
- Scaramuzza, D., & Fraundorfer, F. (2011). "Visual Odometry: Part I & II". IEEE RAM.
- KITTI Vision Benchmark Suite: https://www.cvlibs.net/datasets/kitti/
- OpenCV Video Analysis module documentation