-
Notifications
You must be signed in to change notification settings - Fork 35
Description
related to #232
#232 mentions that when max_hbm_for_vectors is set to a relatively small value, unit tests fail. I debugged the code and found the root cause: when max_hbm_for_vectors is set to a relatively small value and the number of inserted elements B is greater than max_capacity, this results in a high load factor for the hash table. At this point, find_and_lock_when_full is invoked. find_and_lock_when_full locks the key, but in accum_or_assign_kernel, the key is unlocked, which occurs before the vector is actually written via write_with_accum_kernel. This creates a race condition.
template <class K, class V, class S, int Strategy, uint32_t TILE_SIZE = 4>
__global__ void accum_or_assign_kernel(...) {
for (size_t t = (blockIdx.x * blockDim.x) + threadIdx.x; t < N; t += blockDim.x * gridDim.x) {
// ... some code
const int bucket_size = buckets_size[bkt_idx];
do {
if (bucket_size < bucket_max_size) {
occupy_result = find_and_lock_when_vacant<K, V, S, TILE_SIZE>(
g, bucket, insert_key, insert_score, evicted_key, start_idx,
key_pos, src_lane, bucket_max_size);
} else {
start_idx = (start_idx / TILE_SIZE) * TILE_SIZE;
// When buckets are full , invoke this
occupy_result = find_and_lock_when_full<K, V, S, TILE_SIZE,
ScoreFunctor::LOCK_MEM_ORDER,
ScoreFunctor::UNLOCK_MEM_ORDER>(
g, bucket, insert_key, insert_score, evicted_key, start_idx,
key_pos, src_lane, bucket_max_size);
}
occupy_result = g.shfl(occupy_result, src_lane);
} while (occupy_result == OccupyResult::CONTINUE);
// ... some code
if (g.thread_rank() == src_lane) {
*(value_or_deltas + key_idx) = (bucket->vectors + key_pos * dim);
*(founds + key_idx) = is_accum;
bucket->digests(key_pos)[0] = get_digest<K>(insert_key);
ScoreFunctor::update(bucket, key_pos, scores, key_idx, insert_score, (occupy_result != OccupyResult::DUPLICATE));
// This leads to premature unlocking of the key.
(bucket->keys(key_pos))->store(insert_key, ScoreFunctor::UNLOCK_MEM_ORDER);
}
}
}I have drawn a simple diagram to illustrate this.

After testing, I found that any place where unlocking occurs prematurely has the potential to cause issues:
- insert_or_assign
- find_or_insert
- upsert_and_evict
Solution
I performed a quick verification: by directly writing vectors inside accum_or_assign_kernel, rather than saving the value pointer and writing in write_with_accum_kernel, the unit test passes.
Indeed, a more efficient approach is to maintain the lock until the vector is written, and then unlock it. This is similar to what the upsert_kernel_lock_key_hybrid and write_kernel_unlock_key functions do in the insert_or_assign interface.
So I'm curious: why doesn't accum_or_assign follow the same approach as the insert_or_assign interface, which uses upsert_kernel_lock_key_hybrid paired with write_kernel_unlock_key?
Moreover, the insert_or_assign interface also has upsert_kernel and write_by_cpu/write_kernel, which unlock prematurely as well. I’m not clear why the same interface includes these two different implementations.