K truss optimizations#5525
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
| void per_v_pair_dst_nbr_intersection_for_each(raft::handle_t const& handle, | ||
| GraphViewType const& graph_view, | ||
| IntersectionOp intersection_op, | ||
| bool do_expensive_check = false, |
There was a problem hiding this comment.
Let's sync on the purpose of this primitive.
This is mainly to avoid instantiations of creating a temporary (intersection_offsets, intersection_indices) pair, is this correct?
There are several design questions.
- Should we support this for arbitrary vertex pairs or just for edges?
- In case we support this for edges, should we support this for the entire set of edges only or allowing calling this on a subset of edges?
- Is this mainly to update edge property values? Or should we return (v0, v1, value) triplets.
The current K-Truss implementation takes
if constexpr (multi_gpu) {
// no change
}
else {
// new code exploiting the low level details of how we store graph/edge property data, this won't work with new graph/edge property data structure updates in the future (e.g. dynamic graph).
}
I don't want to merge this to the cugraph main. We should design a primitive that serves our purpose (e.g. avoid creating a temporary intersection_offsets, intersection_indices pair) and works for both SG & MG.
There was a problem hiding this comment.
This is mainly to avoid instantiations of creating a temporary (intersection_offsets, intersection_indices) pair, is this correct?
Yes, that's the primary motivation. The existing per_v_pair_dst_nbr_intersection
materializes the full intersection result into (intersection_offsets,
intersection_indices) before returning. K-truss and edge_triangle_count then
iterate those arrays in a second pass to apply per-triangle work.
Another issue is that intersection_indices scales with the total number of
discovered triangles, not with the number of edges, which on a dense graph can be
one to two orders of magnitude larger than the edge count. As a result the
existing edge_triangle_count_impl has to process its edge list in chunks to keep
peak memory bounded, at the expense of repeated setup and additional kernel
launches per chunk that would otherwise collapse into a single pass.
The new primitive per_v_pair_dst_nbr_intersection_for_each invokes the caller's operator inside the
intersection kernel, per common neighbor, once. Neither output array is ever
allocated. Peak memory is whatever the caller's operator needs, typically
O(num_edges) for a counts array, never O(num_triangles). No chunking, no second
pass.
There was a problem hiding this comment.
Should we support this for arbitrary vertex pairs or just for edges?
Although the edge-pair overload takes a generic VertexPairIterator and doesn't enforce that the pairs are edges, the per-pair kernel in the detailed nbr_intersection_for_each.cuh implicitly assumes they are: it computes the position of the edge connecting each pair and passes it to the user operator, which is undefined when the pair is not an edge. K-Truss and edge_triangle_count only invoke this primitive with edges but other algorithms may call it for arbitrary pairs.
There was a problem hiding this comment.
In case we support this for edges, should we support this for the entire set of edges only or allowing calling this on a subset of edges?
Both are supported through two overloads.
- The first overload takes (vertex_pair_first, vertex_pair_last), letting the caller pass an arbitrary subset of edges.
K-trusspeeling(unrolling) uses this overload to process the weak edges identified at each iteration, which are typically a small fraction of the total edges. - The second overload iterates the graph's edges directly from the CSR. If an edge mask is attached to the graph view, only the active edges are visited otherwise all edges are.
edge_triangle_countuses this overload for its initial triangle count pass
There was a problem hiding this comment.
Or should we return (v0, v1, value) triplets.
Could you clarify which of the two you have in mind? Either returning one triplet per discovered triangle, or returning one triplet per edge pair with value being the sum of contributions across every triangle
that touches it (which is O(E) instead of O(triangles)), with value being the reduction (sum). The former reintroduces the same limitation I mentioned above, but I believe you are referring to the latter.
There was a problem hiding this comment.
One triplet per relevant vertex pairs (so all vertex pairs are edges, # triplets <= # edges).
The new primitive performs an in-place update of the edge property whereas returning (v0, v1, value) triplets would require an extra pass just to update the edge property (though it is more compatible with the cuGraph API).
However, for the k-truss unrolling (peeling) phase, returning (v0, v1, value) triplets does not cover the ownership logic and would produce inaccurate results when two or more edges in the same triangle are weak. In fact, when a triangle has multiple weak edges, the intersection discovers it once per weak edge, so each of its three sides receives one contribution per discovery. With a per-edge sum reducer the count for every side gets decremented multiple times, even though the triangle should be peeled exactly once.
The new primitive handles this because the operator sees the full triangle (p, q, r) on each discovery and can apply a lexicographic ownership rule across sibling weak edges so exactly one of the discoveries performs the decrement and the others return without writing. Per-triangle decisions of this kind are not expressible as a per-edge reduction.
There was a problem hiding this comment.
We are emitting one number per edge in each triangle and performing reduction per edge.
Yes, that's accurate, and I believe this is more suitable for cuGraph. Accumulating (atomically) triangle-count contributions per edge, returning triplets, then applying them in an extra pass should be the approach for
cuGraph (while the version that avoids the extra pass can live in a private repo).
Per-triangle decisions of this kind are not expressible as a per-edge reduction.
Actually this can be supported. As long as the primitive accepts a per-triangle operator that sees the full (p, q, r), we can put the ownership logic for k-truss directly into that operator. It runs once per discovered triangle,
checks whether the current edge owns it under the lex rule, and only emits a contribution when it does
There was a problem hiding this comment.
I don't want to merge this to the cugraph main. We should design a primitive that serves our purpose (e.g. avoid creating a temporary intersection_offsets, intersection_indices pair) and works for both SG & MG.
if constexpr (multi_gpu) {
// no change
}
else {
// new code exploiting the low level details of how we store graph/edge property data, this won't work with new graph/edge property data structure updates in the future (e.g. dynamic graph).
}
I agree. I was focused on optimizing the SG path and kept the MG branch on the existing implementation so it stayed correct. I have not yet worked on the MG path.
There was a problem hiding this comment.
https://github.com/rapidsai/cugraph/blob/main/cpp/include/cugraph/prims/transform_reduce_src_dst_nbr_intersection_of_e_endpoints_by_v.cuh#L564
I think this function resembles this but instead of reducing by v, we need to reduce by e.
You are right. This primitive is very similar to what I am trying to achieve but with edges instead. It iterates over each edge and intersect the endpoints neighbor list and the functor returns a tuple of 3 values where the first 2 are added to the source and destination vertex's accumulator and the third to every vertex in the intersection list.
For the edge variant, the same iteration applies but the three return values would map to the three edges of the triangle. Following the convention of the existing primitive, the natural shape is a 3-tuple: the first value is added once to the (p, q) edge being processed, the second is broadcast to each (p, r) edge for every r in the intersection list, and the third is broadcast to each (q, r) edge for every r in the intersection list. The reduction is per edge and the output is an array of length num_edges. (A 2-tuple variant that uses a single value for both (p, r) and (q, r) would also work where the two sides receive the same value, but the 3-tuple mirrors the existing primitive's shape more directly.)
To support the k-truss unrolling (peeling) phase, the functor needs to make a different decision per intersection vertex r, because the ownership rule depends on which sibling weak edges exist in that specific triangle. This matches the per-intersection-vertex variant noted in the docstring as a future addition (transform_reduce_triplet_*). The functor would see (p, q, r) for each r, check the weak bitmask for (p, r) and (q, r), apply the lexicographic comparison, and emit -1 only when (p, q) is the lex-smallest weak edge in the triangle, otherwise 0. The per-edge reduction then sums those per-triangle decisions into the correct net decrement and avoids the overcount.
There was a problem hiding this comment.
For the edge variant, the same iteration applies but the three return values would map to the three edges of the triangle. Following the convention of the existing primitive, the natural shape is a 3-tuple: the first value is added once to the
(p, q)edge being processed, the second is broadcast to each(p, r)edge for every r in the intersection list, and the third is broadcast to each(q, r)edge for every r in the intersection list. The reduction is per edge and the output is an array of length num_edges. (A 2-tuple variant that uses a single value for both(p, r)and(q, r)would also work where the two sides receive the same value, but the 3-tuple mirrors the existing primitive's shape more directly.)
After thinking thoroughly about this, the 2-tuple may actually be more clean. let me elaborate
The 2-tuple (val_pair, val_supporting):
val_pair-> added to the edge being processed (p-> q edge).val_supporting-> broadcast to each(p, r)and(q, r)for every r in the intersection.
For triangle counts, the two supporting edges symmetrically need the same value, so collapsing them into one slot is more natural than the 3-tuple's redundancy.
I was more concerned for the ktruss peeling(unrolling) phase where both values (val_* ) though same, depend on the lexi order and the bitmask
And the case where the same edge endpoints (p, q) have 2 or more neighbors ([r1, r2]) such that :
- Triangle
(p, q, r1): ownership may say "emit -1 to all three". - Triangle
(p, q, r2): ownership may say "emit 0 to all three".
Different triangles, different decisions. A single returned tuple can't express both. That's the per-edge variant's limitation, and it's the reason peeling needs the per-r variant similar to what was discussed but not implemented yet in this primitive
In the per-r variant, the functor is called once per (p, q, r) triplet, sees one triangle, makes one ownership decision, and returns one 2-tuple. Both supporting edges share the value.
The only reason to ever go to 3-tuple is if a future algorithm genuinely needs pr ≠ qr which isn't the case today.
No description provided.