[CodeGen][AMDGPU] ArgCompareOp VectorDistribute Pipeline Support

## Overview

This issue tracks the implementation of full VectorDistribute pipeline support for `ArgCompareOp` (argmax/argmin operations) on AMDGPU targets. The goal is to enable efficient GPU code generation for reduction operations that return both a selected value and its corresponding index.

### ArgCompareOp Overview

`ArgCompareOp` is defined in [LinalgExtOps.td:645-770](https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Dialect/LinalgExt/IR/LinalgExtOps.td#L645-L770). It:
- Reduces a tensor over a specified dimension
- Returns **two outputs**: the selected value and its corresponding index
- Uses a **user-defined comparator region** that receives two values and returns `i1`

**Key Design Feature**: The comparator region provides flexibility to express:
- **argmax**: `arith.cmpf ogt, %a, %b` (greater than)
- **argmin**: `arith.cmpf olt, %a, %b` (less than)
- **Custom logic**: Any boolean predicate comparing two values

```mlir
// Example: argmax (select larger value)
iree_linalg_ext.arg_compare dimension(1)
  ins(%input : tensor<2x10xf32>)
  outs(%out_val, %out_idx : tensor<2xf32>, tensor<2xi32>) {
^bb0(%a: f32, %b: f32):
  %cmp = arith.cmpf ogt, %a, %b : f32
  iree_linalg_ext.yield %cmp : i1
}

// Example: argmin (select smaller value)
iree_linalg_ext.arg_compare dimension(1)
  ins(%input : tensor<2x10xf32>)
  outs(%out_val, %out_idx : tensor<2xf32>, tensor<2xi32>) {
^bb0(%a: f32, %b: f32):
  %cmp = arith.cmpf olt, %a, %b : f32
  iree_linalg_ext.yield %cmp : i1
}

// Example: custom comparator (select value with larger absolute value)
iree_linalg_ext.arg_compare dimension(1)
  ins(%input : tensor<2x10xf32>)
  outs(%out_val, %out_idx : tensor<2xf32>, tensor<2xi32>) {
^bb0(%a: f32, %b: f32):
  %abs_a = math.absf %a : f32
  %abs_b = math.absf %b : f32
  %cmp = arith.cmpf ogt, %abs_a, %abs_b : f32
  iree_linalg_ext.yield %cmp : i1
}
```
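As a scalar reference for what these examples compute, here is a minimal host-side sketch in C. It is not taken from the IREE sources: the function names and the `arg_cmp_fn` type are hypothetical, and it assumes the comparator's first argument is the candidate element and the second is the current best, with the accumulator seeded from element 0 instead of from the `outs` operands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical comparator type: returns true when `a` should be selected
 * over `b`, mirroring the i1 yielded by the comparator region. */
typedef bool (*arg_cmp_fn)(float a, float b);

static float abs_f32(float x) { return x < 0.0f ? -x : x; }

static bool cmp_argmax(float a, float b) { return a > b; }
static bool cmp_argmin(float a, float b) { return a < b; }
static bool cmp_absmax(float a, float b) { return abs_f32(a) > abs_f32(b); }

/* Reduces each row of a row-major rows x cols matrix along dimension 1.
 * Assumption of this sketch: the accumulator starts at element 0, and a
 * strict comparator therefore keeps the smallest index on ties. */
static void arg_compare_rows(const float *input, size_t rows, size_t cols,
                             arg_cmp_fn select_first, float *out_val,
                             int32_t *out_idx) {
  assert(cols > 0 && "reduction dimension must be non-empty");
  for (size_t r = 0; r < rows; ++r) {
    const float *row = input + r * cols;
    float best = row[0];
    int32_t best_idx = 0;
    for (size_t c = 1; c < cols; ++c) {
      /* Candidate first, current best second (assumed argument order). */
      if (select_first(row[c], best)) {
        best = row[c];
        best_idx = (int32_t)c;
      }
    }
    out_val[r] = best;
    out_idx[r] = best_idx;
  }
}
```

Swapping the function pointer is the only change needed to move between argmax, argmin, and the absolute-value comparator, which is the flexibility the comparator region is meant to provide.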

### Current VectorDistribute Pipeline

For ordinary reduction ops, the VectorDistribute pipeline currently lowers through the following stages:

```
linalg.generic → vector.multi_reduction → gpu.subgroup_reduce → amdgpu.dpp → rocdl.update.dpp
```

For `arg_compare`, the proposed pipeline is:

```
iree_linalg_ext.arg_compare (implicit-index mode)
    ↓ (TileAndDistributeToWorkgroups - if split reduction needed)
Partial reductions in scf.forall:
    iree_linalg_ext.arg_compare (implicit-index, computes indices from ivs)
        → produces partial (value, index) pairs
    ↓ (PartialReductionOpInterface::mergeReductions)
Merge reduction:
    iree_linalg_ext.arg_compare (explicit-index mode, 2 inputs)
        ins(%partial_values, %partial_indices)
        → merges (value, index) pairs
    ↓ (GenericVectorizationPass - vectorizeArgCompareOp)
    iree_vector_ext.arg_compare (vectorized form)
        → vector<...xf16>, vector<...xi32>
        → includes cloned comparator region
    ↓ (LLVMGPUConfigureTensorLayouts - setArgCompareAnchor)
    iree_vector_ext.to_layout with NestedLayoutAttr
        → distributed across threads/subgroups
    ↓ (LLVMGPUVectorDistribute - DistributeArgCompare)
    DPP butterfly reduction with cloned comparator
        → amdgpu.dpp for cross-lane data movement
        → execute comparator at each reduction stage
        → handle tie-breaking (prefer smaller index)
    ↓ (GPU lowering)
    amdgpu.dpp + rocdl.ballot + rocdl.readlane + AMD intrinsics
```

**Key implementation:**
- New `DistributeArgCompare` pattern handles all comparator types uniformly
- DPP butterfly reduction pattern (6 stages for 64-thread subgroup)
- Comparator region cloned and executed at each stage
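
The butterfly idea can be simulated on the host. The sketch below is illustrative only and is not the `DistributeArgCompare` pattern: it assumes a 64-lane subgroup, models the cross-lane exchange as an xor shuffle (as the ukernel's `__shfl_xor_f` does) where the real lowering would use DPP lane permutations, and uses hypothetical names (`val_idx`, `combine`, `butterfly_arg_compare`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SUBGROUP_SIZE 64 /* assumed wave64 subgroup */

typedef struct {
  float value;
  int32_t index;
} val_idx;

typedef bool (*arg_cmp_fn)(float a, float b);

static bool cmp_ogt(float a, float b) { return a > b; }

/* One reduction step: run the comparator in both directions so that a tie
 * (neither side selected) can fall back to the smaller index. */
static val_idx combine(val_idx self, val_idx other, arg_cmp_fn cmp) {
  if (cmp(other.value, self.value)) return other;
  if (cmp(self.value, other.value)) return self;
  return other.index < self.index ? other : self;
}

/* Simulates all lanes in lockstep. Returns the number of stages executed. */
static int butterfly_arg_compare(val_idx lanes[SUBGROUP_SIZE], arg_cmp_fn cmp) {
  int stages = 0;
  for (int offset = 1; offset < SUBGROUP_SIZE; offset *= 2) {
    val_idx next[SUBGROUP_SIZE];
    for (int lane = 0; lane < SUBGROUP_SIZE; ++lane) {
      /* Each lane reads the (value, index) pair held by lane ^ offset. */
      next[lane] = combine(lanes[lane], lanes[lane ^ offset], cmp);
    }
    memcpy(lanes, next, sizeof(next));
    ++stages;
  }
  return stages;
}
```

Because `combine` gives the same answer regardless of which side is `self`, every lane ends up holding the same pair after log2(64) = 6 stages, so no separate broadcast is needed in this model.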

### Existing ArgMax Implementation (ROCM UKernel)

The ROCM argmax ukernel in [iree_uk_amdgpu_argmax_f32i64.c](https://github.com/iree-org/iree/blob/main/compiler/plugins/target/ROCM/builtins/ukernel/iree_uk_amdgpu_argmax_f32i64.c) demonstrates the key algorithm:

```c
// 1. Reduce to find maximum across subgroup
float wgMax = laneMax;
for (int i = 1; i < warpSize; i *= 2) {
  wgMax = __builtin_fmaxf(__shfl_xor_f(wgMax, i), wgMax);
}

// 2. Use ballot to find which lanes have the max
uint64_t laneHasMaxValmask = __ballot(wgMax == laneMax);

// 3. Handle index selection
if (__builtin_popcountll(laneHasMaxValmask) == 1) {
  // Single max holder - direct write
  if (wgMax == laneMax) {
    outputBufferIdx[offset] = laneResult;
  }
} else {
  // Multiple max holders - find smallest index (argmax semantics)
  int64_t indexVal = wgMax == laneMax ? laneResult : INT64_MAX;
  laneResult = __ockl_wfred_min_i64(indexVal);
  if (laneID == 0) {
    outputBufferIdx[offset] = laneResult;
  }
}
```
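The three steps of the ukernel excerpt can be checked on the host with a small simulation. This is a sketch under stated assumptions, not the ukernel itself: it assumes 64 lanes and NaN-free inputs, replaces `__ballot` and `__ockl_wfred_min_i64` with plain loops, always takes the min-reduction path instead of special-casing a single holder, and uses hypothetical function names.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 64 /* assumed wave64 subgroup */

/* Host stand-in for __ballot(wgMax == laneMax): bit i is set when lane i
 * holds the subgroup maximum. */
static uint64_t ballot_has_max(const float lane_max[WAVE_SIZE], float wg_max) {
  uint64_t mask = 0;
  for (int lane = 0; lane < WAVE_SIZE; ++lane) {
    if (lane_max[lane] == wg_max) mask |= (uint64_t)1 << lane;
  }
  return mask;
}

/* Returns the smallest index among the lanes that hold the maximum. */
static int64_t select_argmax_index(const float lane_max[WAVE_SIZE],
                                   const int64_t lane_idx[WAVE_SIZE],
                                   float *out_max) {
  /* Step 1: value-only max reduction (sequential stand-in for the shuffle
   * loop). */
  float wg_max = lane_max[0];
  for (int lane = 1; lane < WAVE_SIZE; ++lane) {
    if (lane_max[lane] > wg_max) wg_max = lane_max[lane];
  }

  /* Step 2: find which lanes hold the maximum. */
  uint64_t mask = ballot_has_max(lane_max, wg_max);
  assert(mask != 0 && "sketch assumes NaN-free inputs");

  /* Step 3: min-reduce the indices; non-holders contribute INT64_MAX. */
  int64_t best = INT64_MAX;
  for (int lane = 0; lane < WAVE_SIZE; ++lane) {
    int64_t candidate = ((mask >> lane) & 1) ? lane_idx[lane] : INT64_MAX;
    if (candidate < best) best = candidate;
  }

  *out_max = wg_max;
  return best;
}
```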
---

## Implementation Plan

### Phase 1: Explicit-Index Mode Foundation ✅ (Completed)

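This phase addressed the merge-reduction problem: after a split reduction, each partial result already carries its own index, so the merge step cannot derive indices from element positions. To handle this, `arg_compare` was extended to accept an optional index input (explicit-index mode) alongside the values.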

- [x] **Extend op definition**
  - https://github.com/iree-org/iree/pull/23153
- [x] **Update tiling and split reduction**
  - https://github.com/iree-org/iree/pull/23218
  - https://github.com/iree-org/iree/pull/23193
- [x] **Add verifier checks**
  - https://github.com/iree-org/iree/pull/23198
- [x] ~~**GPU Generalization** [Deprecated]~~ ([PR #23015](https://github.com/iree-org/iree/pull/23015))
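
To illustrate why the merge step needs explicit indices, the following C sketch splits a 1-D reduction into tiles and then merges the partial pairs. It is a simplified model under assumptions, not IREE code: the names `partial_arg_compare`, `merge_arg_compare`, `split_arg_compare`, and the `MAX_TILES` cap are hypothetical, and the comparator is assumed to take the candidate first and the current best second.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TILES 16 /* arbitrary cap for this sketch */

typedef bool (*arg_cmp_fn)(float a, float b);

static bool cmp_ogt(float a, float b) { return a > b; }

/* Implicit-index mode: indices are computed from the tile's base offset
 * (the role played by the scf.forall induction variables). */
static void partial_arg_compare(const float *tile, size_t len, int32_t base,
                                arg_cmp_fn cmp, float *out_val,
                                int32_t *out_idx) {
  assert(len > 0);
  float best = tile[0];
  int32_t best_idx = base;
  for (size_t i = 1; i < len; ++i) {
    if (cmp(tile[i], best)) {
      best = tile[i];
      best_idx = base + (int32_t)i;
    }
  }
  *out_val = best;
  *out_idx = best_idx;
}

/* Explicit-index mode: indices are read from `idxs`, because position `i`
 * in the partial results says nothing about the original element index. */
static void merge_arg_compare(const float *vals, const int32_t *idxs, size_t n,
                              arg_cmp_fn cmp, float *out_val,
                              int32_t *out_idx) {
  assert(n > 0);
  float best = vals[0];
  int32_t best_idx = idxs[0];
  for (size_t i = 1; i < n; ++i) {
    if (cmp(vals[i], best)) {
      best = vals[i];
      best_idx = idxs[i];
    }
  }
  *out_val = best;
  *out_idx = best_idx;
}

/* Split reduction followed by the merge reduction. */
static void split_arg_compare(const float *input, size_t n, size_t tile,
                              arg_cmp_fn cmp, float *out_val,
                              int32_t *out_idx) {
  assert(n > 0 && tile > 0);
  size_t num_tiles = (n + tile - 1) / tile;
  assert(num_tiles <= MAX_TILES);
  float vals[MAX_TILES];
  int32_t idxs[MAX_TILES];
  for (size_t t = 0; t < num_tiles; ++t) {
    size_t start = t * tile;
    size_t len = (n - start < tile) ? n - start : tile;
    partial_arg_compare(input + start, len, (int32_t)start, cmp, &vals[t],
                        &idxs[t]);
  }
  merge_arg_compare(vals, idxs, num_tiles, cmp, out_val, out_idx);
}
```

Because the tiles are merged in increasing offset order and the comparator is strict, ties resolve to the smallest original index in this model.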

### Phase 2: VectorDistribute Pipeline Integration (Current Focus)

- [x] **PartialReductionOuterReduction Support**
  - https://github.com/iree-org/iree/pull/23102
- [x] **Add iree_vector_ext.arg_compare Op**
  - https://github.com/iree-org/iree/pull/23386
- [x] **Vectorization Support**
  - https://github.com/iree-org/iree/pull/23440
  - https://github.com/iree-org/iree/pull/23775
- [ ] **Layout Configuration and Analysis**
  - https://github.com/iree-org/iree/pull/23693
- [ ] **Distribution Pattern Implementation**
  - [WIP] Upstream `gpu.ballot` so that the distribution pattern does not need to emit the target-specific `rocdl.ballot` directly.
  - https://github.com/iree-org/iree/pull/23793
  - **File**: `compiler/src/iree/compiler/Codegen/Common/GPU/GPUNestedLayoutDistributionPatterns.cpp`
  - Add `DistributeArgCompare` pattern for `iree_vector_ext.arg_compare`
  - Generate DPP butterfly reduction (6 stages for 64-thread subgroup)
  - Clone comparator region at each DPP stage and execute inline
  - Use `amdgpu.dpp` for cross-lane data movement (shuffle both values and indices)
  - Handle tie-breaking: when values are equal, prefer smaller index
  - Handle local reduction for thread-local elements before subgroup reduction
  - Lower to `amdgpu.dpp` + `rocdl.ballot` + `rocdl.readlane` intrinsics

- [ ] **KernelConfig Integration**
  - **File**: `compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPUSelectLoweringStrategy.cpp`
  - Add `setArgCompareReductionConfig()` to configure reduction pipeline
  - Set workgroup size, subgroup size, and reduction tile sizes
  - **Note**: This should be the last step, done once the distribution patterns are working

- [ ] **Testing**
  - **Unit tests**: `compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/gpu_vector_distribution.mlir`
    - Test argmax (simple comparator: ogt)
    - Test argmin (simple comparator: olt)
    - Test custom comparator (e.g., absolute value comparison)
    - Test both implicit and explicit-index modes
    - Verify generated DPP instructions and tie-breaking logic
  - **E2E tests**: `tests/e2e/linalg_ext/argcompare_amdgpu.mlir`
    - Real-world argmax/argmin workloads
    - Attention mechanism with argmax
    - Custom comparator examples
    - Performance validation on MI250/MI300
