This tutorial covers global feature extraction and the Bag of Visual Words (BoVW) approach for place recognition and loop closure detection in Visual SLAM, using the DBoW2 library.
While local features (like ORB or SIFT) describe individual keypoints, global features describe an entire image in a single compact representation. This is crucial for:
- Loop Closure Detection: Recognizing previously visited places
- Place Recognition: Image retrieval from a database
- Relocalization: Finding camera pose when tracking is lost
Bag of Visual Words is a technique borrowed from natural language processing, adapted for image retrieval. The core idea is to represent images as histograms of visual "words" instead of raw pixels.
Just as a text document can be summarized by the frequency of words it contains (ignoring grammar and word order), an image can be summarized by the frequency of visual patterns (local features) it contains.
Text Document:                     Image:
--------------                     ------
Words: ["cat", "sat", ...]         Visual Words: [corner_pattern_1, blob_pattern_2, ...]
Word Histogram                     Visual Word Histogram
A visual vocabulary is a set of representative visual patterns (cluster centers) learned from a large set of training images. Each pattern is called a "visual word."
Step 1: Extract features from many images
Image1 -> [f1, f2, f3, ...]
Image2 -> [f4, f5, f6, ...]
...
Step 2: Cluster all features (e.g., k-means)
Clusters -> Visual Words (Vocabulary)
Step 3: Organize into hierarchical tree for fast lookup
(This is what makes DBoW2 efficient)
DBoW2 uses a hierarchical k-means tree for efficient word lookup:
Parameters:
k = branching factor (children per node)
L = depth levels
Example: k=10, L=6 gives 10^6 = 1,000,000 possible words
Tree Structure:
              [Root]
         /  /   |   \  \
      [.] [.] [.] [.] [.]  ...   (k nodes per level)
      /\  /\  /\  /\  /\
     ... ... ... ... ...         (L levels)
   [w1][w2][w3] ... [wN]         (leaf nodes = visual words)
Once we have a vocabulary, each image is converted to a sparse vector:
// For each feature in the image:
// 1. Traverse tree to find closest visual word
// 2. Increment that word's count
Image -> BowVector: {word_id: weight, word_id: weight, ...}
Example: {42: 0.15, 156: 0.08, 789: 0.12, ...}

To improve discrimination, words are weighted using TF-IDF:
- TF (Term Frequency): How often a word appears in this image
- IDF (Inverse Document Frequency): Penalize common words (appear in many images)
weight = TF * IDF
IDF(word) = log(N / n_word)
where N = total images in database
n_word = images containing this word
Similarity between two images is computed by comparing their BoW vectors:
// L1-norm scoring (default in DBoW2)
score = 1 - 0.5 * |v1 - v2|_L1
// Score ranges from 0 (completely different) to 1 (identical)

| Component | Description |
|---|---|
| Vocabulary | Hierarchical tree of visual words |
| Database | Stores BoW vectors with an inverted index |
| BowVector | Sparse histogram of visual words |
| FeatureVector | Groups features by tree nodes (for geometric verification) |
| Type | Class | Description |
|---|---|---|
| ORB | OrbVocabulary, OrbDatabase | Binary descriptor, fast |
| BRIEF | BriefVocabulary, BriefDatabase | Binary descriptor |
| SURF | SurfVocabulary, SurfDatabase | Floating-point descriptor |
Loop closure detection is critical for:
- Correcting accumulated drift
- Building globally consistent maps
- Reusing the existing map when revisiting known areas
1. Extract Features
Current Frame -> ORB Keypoints + Descriptors
2. Query Database
Descriptors -> BowVector -> Query(database) -> Candidate Matches
3. Geometric Verification
For each candidate:
- Match features using FeatureVector
- Estimate fundamental/essential matrix
- Check inlier ratio
4. Accept Loop Closure
If geometric verification passes:
- Add loop constraint to pose graph
- Trigger global optimization
DBoW2 provides a FeatureVector that groups features by vocabulary tree nodes. This enables efficient feature matching between loop candidates:
// FeatureVector structure: {node_id: [feature_indices...], ...}
// Only compare features that fall under the same tree node
void matchFeatures(const DBoW2::FeatureVector& fv1,
                   const DBoW2::FeatureVector& fv2) {
  // Parallel iteration over the two sorted node-id -> indices maps
  auto it1 = fv1.begin(), it2 = fv2.begin();
  while (it1 != fv1.end() && it2 != fv2.end()) {
    if (it1->first == it2->first) {
      // it1->second and it2->second share a node: compare only these
      // features against each other -- much faster than brute force!
      ++it1; ++it2;
    } else if (it1->first < it2->first) {
      it1 = fv1.lower_bound(it2->first);
    } else {
      it2 = fv2.lower_bound(it1->first);
    }
  }
}

2_9/
├── README.md # This file
├── CMakeLists.txt # Build configuration
├── Dockerfile # Docker environment
└── examples/
├── vocabulary_training.cpp # Create vocabulary from images
├── loop_closure_detection.cpp # Loop detection demo
└── image_retrieval.cpp # Image retrieval example
- OpenCV 4.x
- DBoW2 library
mkdir build && cd build
cmake ..
make -j4

docker build . -t slam_zero_to_hero:2_9

# Vocabulary training (creates vocabulary from images)
./build/vocabulary_training /path/to/training/images output_vocabulary.yml.gz

# Loop closure detection demo
./build/loop_closure_detection vocabulary.yml.gz /path/to/sequence/

# Image retrieval demo
./build/image_retrieval vocabulary.yml.gz /path/to/database/ query_image.jpg

docker run -it --rm \
    -v /path/to/data:/data \
    slam_zero_to_hero:2_9

#include "DBoW2/DBoW2.h"
#include <opencv2/features2d.hpp>
// Extract ORB features from training images
cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);
std::vector<std::vector<cv::Mat>> features;
for (const auto& image : training_images) {
  std::vector<cv::KeyPoint> keypoints;
  cv::Mat descriptors;
  orb->detectAndCompute(image, cv::noArray(), keypoints, descriptors);

  // Convert to vector of single-row matrices
  std::vector<cv::Mat> desc_vec;
  for (int i = 0; i < descriptors.rows; i++) {
    desc_vec.push_back(descriptors.row(i));
  }
  features.push_back(desc_vec);
}
// Create vocabulary with k=10, L=5
OrbVocabulary voc(10, 5, DBoW2::TF_IDF, DBoW2::L1_NORM);
voc.create(features);
voc.save("vocabulary.yml.gz");

// Load vocabulary
OrbVocabulary voc("vocabulary.yml.gz");
// Create database with direct index (level 4 for efficient matching)
OrbDatabase db(voc, true, 4);
// Add keyframes to database
for (size_t i = 0; i < keyframes.size(); ++i) {
  db.add(keyframes[i].descriptors);
}
// Query for loop candidates
DBoW2::QueryResults results;
db.query(current_frame.descriptors, results, 5); // Top 5 matches
// Filter and verify results
for (const auto& result : results) {
  if (result.Score > 0.3 &&
      std::abs((int)result.Id - current_id) > 30) {
    // Potential loop closure candidate
    // Proceed with geometric verification
    std::cout << "Loop candidate: Frame " << result.Id
              << " (score: " << result.Score << ")" << std::endl;
  }
}

DBoW2::BowVector bow_vec;
DBoW2::FeatureVector feat_vec;
// Transform descriptors to BoW representation
voc.transform(descriptors, bow_vec, feat_vec, 4);
// bow_vec contains {word_id: tf-idf_weight, ...}
// feat_vec contains {node_id: [local_feature_indices], ...}

| Parameter | Description | Typical Values |
|---|---|---|
| k | Branching factor | 9-10 |
| L | Tree depth | 3-6 |
| weighting | TF_IDF, TF, IDF, BINARY | TF_IDF |
| scoring | L1_NORM, L2_NORM, CHI_SQUARE, etc. | L1_NORM |
Higher k, L:
+ More discriminative vocabulary
+ Better precision
- Slower training
- Larger vocabulary file
Lower k, L:
+ Faster training and lookup
+ Smaller file
- Less discriminative
- May have more false positives
ORB-SLAM2 and ORB-SLAM3 use DBoW2 for:
- Loop Closure Detection: Query database for similar keyframes
- Relocalization: Find camera pose when tracking is lost
- Fast Feature Matching: Use FeatureVector for efficient matching
// Typical ORB-SLAM loop closure workflow
class LoopClosing {
public:
  bool DetectLoop(KeyFrame* current_kf) {
    // Query database (excluding recent keyframes)
    DBoW2::QueryResults results;
    db_.query(current_kf->bow_vec, results, 10);

    // Find consistent candidates
    for (const auto& result : results) {
      if (CheckTemporalConsistency(result)) {
        // Look up the matched keyframe, then verify geometrically
        // with Sim3 (keyframes_ is a hypothetical container here)
        KeyFrame* candidate_kf = keyframes_[result.Id];
        if (ComputeSim3(current_kf, candidate_kf)) {
          return true;
        }
      }
    }
    return false;
  }
};

- Vocabulary Size: Use k=10, L=5 or L=6 for a good balance (100K to 1M words)
- Training Data: Train on images similar to your target environment
  - Indoor vocabulary for indoor SLAM
  - Outdoor vocabulary for outdoor SLAM
- Direct Index Level: Set to 3-4 for efficient feature matching
- Temporal Consistency: Require consistent matches across multiple frames
- Geometric Verification: Always verify loop candidates with RANSAC
| Operation | Complexity | Time (typical) |
|---|---|---|
| Vocabulary Training | O(n * k * L * iterations) | Minutes to hours |
| Transform to BoW | O(n_features * L) | < 5 ms |
| Database Query | O(n_words * n_images) | < 10 ms |
| Add to Database | O(n_words) | < 1 ms |
- Images with repetitive patterns (windows, tiles)
- Solution: Use geometric verification and temporal consistency
- Vocabulary not trained on representative data
- Solution: Train on similar environment or use larger vocabulary
- Large databases consume memory
- Solution: Create the database without a direct index (use_di = false) if not doing geometric verification
- DBoW2 GitHub
- DBoW2 Paper: "Bags of Binary Words for Fast Place Recognition in Image Sequences"
- ORB-SLAM2: Uses DBoW2 for loop closure
- Vocabulary Tree Paper: "Scalable Recognition with a Vocabulary Tree"