[experiment] Support unplaced TileOps by fifield · Pull Request #2265 · Xilinx/mlir-aie

fifield · 2025-05-02T23:48:30Z

This branch is an experiment to see what it takes to support unplaced tiles in the aie dialect.

It is based on a simple extension to the aie.tile op that accepts ? as the column or row operand, meaning that the column or row has not been physically placed:

// unplaced tile
%tile_c_r = aie.tile(?, ?)

// unplaced shim
%shim_noc_tile_c_0 = aie.tile(?, 0)

// unplaced memtile
%mem_tile_c_0 = aie.tile(?, 1)
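
As a rough illustration of the notation only, the sketch below renders a (column, row) pair as aie.tile text. It assumes -1 as the in-memory marker for an unplaced coordinate, mirroring the Tile(-1, -1) values the null placer below uses; the helper names tile_text and is_fully_placed are hypothetical and are not part of mlir-aie.

```python
# Sketch only: hypothetical helpers, not the mlir-aie printer.
UNPLACED = -1  # assumed sentinel for "this coordinate is not placed yet"


def tile_text(col: int, row: int) -> str:
    """Render a tile the way the examples above are written."""

    def coord(value: int) -> str:
        return "?" if value == UNPLACED else str(value)

    return f"aie.tile({coord(col)}, {coord(row)})"


def is_fully_placed(col: int, row: int) -> bool:
    """A tile is placed only when both coordinates are known."""
    return col != UNPLACED and row != UNPLACED


print(tile_text(UNPLACED, UNPLACED))  # unplaced compute tile
print(tile_text(UNPLACED, 0))         # unplaced shim tile (row 0 is known)
print(tile_text(UNPLACED, 1))         # unplaced memtile (row 1 is known)
```

The row alone is enough to tell a shim (row 0) from a memtile (row 1) in these examples, which is why only the column is left open for those two cases.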

To test this, I added a "null placer" to IRON's placers.py:

class NullPlacer(Placer):
    """NullPlacer performs no physical placement.

    Every unplaced worker and ObjectFifo endpoint is given a placeholder Tile
    whose unknown coordinates are -1; these are emitted as ? in aie.tile.
    """

    def __init__(self):
        super().__init__()

    def make_placement(
        self,
        device: Device,
        rt: Runtime,
        workers: list[Worker],
        object_fifos: list[ObjectFifoHandle],
    ):
        for worker in workers:
            if worker.tile == AnyComputeTile:
                worker.place(Tile(-1, -1))
                for buffer in worker.buffers:
                    buffer.place(worker.tile)
        # Handle ObjectFifo endpoints once, outside the worker loop, so they
        # are also covered when there are no workers.
        for of in object_fifos:
            of_endpoints = of.all_of_endpoints()
            for ofe in of_endpoints:
                if ofe.tile == AnyMemTile:
                    ofe.place(Tile(-1, 1))
                elif ofe.tile == AnyComputeTile:
                    ofe.place(Tile(-1, -1))
                elif ofe.tile == AnyShimTile:
                    ofe.place(Tile(-1, 0))

With this placer, unplaced IRON emits unplaced MLIR:

# place_test.py
@construct_and_print_module
def shim_three_in(module):
    N = 4096
    n = 1024

    n_ty = np.ndarray[(n,), np.dtype[np.int32]]

    n_inputs = 3
    of_ins = []
    for i in range(n_inputs):
        of_ins.append(ObjectFifo(n_ty, name=f"in_{i}"))

    def core_fn(of_in):
        pass

    workers = []
    for i in range(n_inputs):
        workers.append(Worker(core_fn, [of_ins[i].cons()]))

    rt = Runtime()
    with rt.sequence(n_ty, n_ty, n_ty) as (A, B, C):
        rt.start(*workers)
        rt.fill(of_ins[0].prod(), A)
        rt.fill(of_ins[1].prod(), B)
        rt.fill(of_ins[2].prod(), C)

    module = Program(NPU2Col2(), rt).resolve_program(NullPlacer())
    return module

emits:

module {
  aie.device(npu2_2col) {
    %shim_noc_tile_c_0 = aie.tile(?, 0)
    %tile_c_r = aie.tile(?, ?)
    %shim_noc_tile_c_0_0 = aie.tile(?, 0)
    %tile_c_r_1 = aie.tile(?, ?)
    %shim_noc_tile_c_0_2 = aie.tile(?, 0)
    %tile_c_r_3 = aie.tile(?, ?)
    aie.objectfifo @in_0(%shim_noc_tile_c_0, {%tile_c_r}, 2 : i32) : !aie.objectfifo<memref<1024xi32>> 
    aie.objectfifo @in_2(%shim_noc_tile_c_0_0, {%tile_c_r_1}, 2 : i32) : !aie.objectfifo<memref<1024xi32>> 
    aie.objectfifo @in_1(%shim_noc_tile_c_0_2, {%tile_c_r_3}, 2 : i32) : !aie.objectfifo<memref<1024xi32>> 
    %core_c_r = aie.core(%tile_c_r) {
      aie.end
    }
    %core_c_r_4 = aie.core(%tile_c_r_3) {
      aie.end
    }
    %core_c_r_5 = aie.core(%tile_c_r_1) {
      aie.end
    }
    aiex.runtime_sequence @sequence(%arg0: memref<1024xi32>, %arg1: memref<1024xi32>, %arg2: memref<1024xi32>) {
      %0 = aiex.dma_configure_task_for @in_0 {
        aie.dma_bd(%arg0 : memref<1024xi32>, 0, 1024, [<size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1024, stride = 1>]) {burst_length = 0 : i32}
        aie.end
      }
      aiex.dma_start_task(%0)
      %1 = aiex.dma_configure_task_for @in_1 {
        aie.dma_bd(%arg1 : memref<1024xi32>, 0, 1024, [<size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1024, stride = 1>]) {burst_length = 0 : i32}
        aie.end
      }
      aiex.dma_start_task(%1)
      %2 = aiex.dma_configure_task_for @in_2 {
        aie.dma_bd(%arg2 : memref<1024xi32>, 0, 1024, [<size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1024, stride = 1>]) {burst_length = 0 : i32}
        aie.end
      }
      aiex.dma_start_task(%2)
    }
  }
}

which can be placed with the MLIR pass in this branch:

$ python place_test.py | aie-opt -canonicalize -aie-sequential-placer -canonicalize
module {
  aie.device(npu2_2col) {
    %tile_0_2 = aie.tile(0, 2)
    %tile_0_3 = aie.tile(0, 3)
    %shim_noc_tile_0_0 = aie.tile(0, 0)
    %tile_0_4 = aie.tile(0, 4)
    aie.objectfifo @in_0(%shim_noc_tile_0_0, {%tile_0_2}, 2 : i32) : !aie.objectfifo<memref<1024xi32>> 
    aie.objectfifo @in_2(%shim_noc_tile_0_0, {%tile_0_3}, 2 : i32) : !aie.objectfifo<memref<1024xi32>> 
    aie.objectfifo @in_1(%shim_noc_tile_0_0, {%tile_0_4}, 2 : i32) : !aie.objectfifo<memref<1024xi32>> 
    %core_0_2 = aie.core(%tile_0_2) {
      aie.end
    }
    %core_0_4 = aie.core(%tile_0_4) {
      aie.end
    }
    %core_0_3 = aie.core(%tile_0_3) {
      aie.end
    }
    aiex.runtime_sequence @sequence(%arg0: memref<1024xi32>, %arg1: memref<1024xi32>, %arg2: memref<1024xi32>) {
      %0 = aiex.dma_configure_task_for @in_0 {
        aie.dma_bd(%arg0 : memref<1024xi32>, 0, 1024, [<size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1024, stride = 1>]) {burst_length = 0 : i32}
        aie.end
      }
      aiex.dma_start_task(%0)
      %1 = aiex.dma_configure_task_for @in_1 {
        aie.dma_bd(%arg1 : memref<1024xi32>, 0, 1024, [<size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1024, stride = 1>]) {burst_length = 0 : i32}
        aie.end
      }
      aiex.dma_start_task(%1)
      %2 = aiex.dma_configure_task_for @in_2 {
        aie.dma_bd(%arg2 : memref<1024xi32>, 0, 1024, [<size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1, stride = 0>, <size = 1024, stride = 1>]) {burst_length = 0 : i32}
        aie.end
      }
      aiex.dma_start_task(%2)
    }
  }
}
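
To make the before/after comparison easier to follow, here is a toy model of a sequential placement step. It is a sketch under stated assumptions, not the algorithm of the -aie-sequential-placer pass: the Tile class, the place_sequentially function, the -1 sentinel, the choice of column 0 for shim and memtiles, and the default grid (2 columns, compute rows 2 to 5) are all assumptions made for illustration.

```python
# Toy model only: hypothetical names, not the pass in this branch.
from dataclasses import dataclass

UNPLACED = -1  # assumed sentinel for an unplaced coordinate


@dataclass(frozen=True)
class Tile:
    col: int
    row: int


def place_sequentially(tiles, num_cols=2, compute_rows=(2, 3, 4, 5)):
    """Give each unplaced compute tile the next free core, column by column.

    Tiles with a known row (0 = shim, 1 = memtile) and an unknown column are
    put in column 0. Column hints on compute tiles are ignored in this toy.
    """
    # Cores already claimed by fully placed tiles are not handed out again.
    taken = {(t.col, t.row) for t in tiles
             if t.col != UNPLACED and t.row != UNPLACED}
    free_cores = [(c, r) for c in range(num_cols) for r in compute_rows
                  if (c, r) not in taken]
    next_core = 0
    placed = []
    for t in tiles:
        if t.col != UNPLACED and t.row != UNPLACED:
            placed.append(t)  # already placed, keep as is
        elif t.row == UNPLACED:
            if next_core >= len(free_cores):
                raise ValueError("ran out of compute tiles for placement")
            col, row = free_cores[next_core]
            next_core += 1
            placed.append(Tile(col, row))
        else:
            placed.append(Tile(0, t.row))  # shim or memtile: assume column 0
    return placed


# Same tile order as the unplaced module above: shim, core, shim, core, ...
unplaced = [Tile(-1, 0), Tile(-1, -1)] * 3
for before, after in zip(unplaced, place_sequentially(unplaced)):
    print(before, "->", after)
```

In this toy, the three unplaced shim tiles all land on (0, 0) and the three compute tiles on (0, 2), (0, 3) and (0, 4), which is the same set of coordinates that appears in the placed module above. Duplicate tile values would still need to be merged into a single op afterwards.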

kurtis-b-1 · 2025-10-01T21:06:32Z

At the level of the unplaced MLIR, is it necessary to know the specific memory access patterns? I wonder whether, given the size of the application workload, the loop and tiling variables, and the direction of the tilings, the placement pass could also decide the memory access pattern.
The data movement is partly determined by the kernel implementation, so I think the buffer sizes passed to the Kernel objects would still need to be explicit.
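
To make the question concrete, the following sketch shows one way sizes and strides could be derived from a workload shape and a tile shape. It is an illustration of the idea only: the function names are hypothetical, it assumes a row-major M x N buffer walked tile by tile in row-major order, it lists dimensions outermost first, and it is not something this branch implements.

```python
# Sketch only: hypothetical derivation, not part of this PR.
from itertools import product


def tiled_access_pattern(M, N, m, n):
    """Sizes and strides (in elements) for walking an M x N row-major
    buffer as m x n tiles, tiles visited in row-major order."""
    if M % m or N % n:
        raise ValueError("tile shape must evenly divide the workload shape")
    sizes = [M // m, N // n, m, n]
    strides = [m * N, n, N, 1]
    # A dimension of size 1 is never stepped, so report its stride as 0.
    strides = [st if sz > 1 else 0 for sz, st in zip(sizes, strides)]
    return sizes, strides


def offsets(sizes, strides):
    """Enumerate the element offsets the pattern visits, in order."""
    index_space = product(*(range(sz) for sz in sizes))
    return [sum(i * st for i, st in zip(idx, strides)) for idx in index_space]


# A 4 x 4 buffer read as 2 x 2 tiles.
sizes, strides = tiled_access_pattern(4, 4, 2, 2)
print(sizes, strides)
print(offsets(sizes, strides))

# The 1-D transfer of 1024 elements used in the test above.
print(tiled_access_pattern(1, 1024, 1, 1024))
```

For the 1-D case, the sketch yields three dimensions of size 1 with stride 0 followed by a dimension of size 1024 with stride 1, the same shape as the aie.dma_bd patterns in the emitted MLIR. The tile shape (m, n) still has to come from somewhere, which is consistent with the point that kernel buffer sizes would remain explicit.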

fifield force-pushed the unplaced_tileop branch from 56f003c to f2fbac6 Compare May 6, 2025 20:43

This was referenced May 7, 2025

SequentialPlacer takes into account max channels per tile #2221

Merged

Program.resolve_program has side-effects which prevents reuse #2302

Closed

fifield force-pushed the unplaced_tileop branch 3 times, most recently from b01739a to c2ee6ba Compare May 20, 2025 15:46

fifield force-pushed the unplaced_tileop branch from c2ee6ba to 263fc98 Compare May 29, 2025 03:29

fifield force-pushed the unplaced_tileop branch from 263fc98 to 0887a3f Compare June 6, 2025 22:35

fifield force-pushed the unplaced_tileop branch from 0887a3f to dd5c151 Compare June 18, 2025 20:47

fifield force-pushed the unplaced_tileop branch 2 times, most recently from 0a3e8a9 to 9904e93 Compare July 2, 2025 18:07

etabeta1 mentioned this pull request Jul 28, 2025

RuntimeError: the operation has been invalidated when calling a kernel after a second one has been programmed onto the NPU etabeta1/nengo-aie#2

Closed

fifield force-pushed the unplaced_tileop branch 2 times, most recently from f4e1aad to 0aec197 Compare July 31, 2025 17:37

fifield force-pushed the unplaced_tileop branch from 0aec197 to 7a135c2 Compare August 7, 2025 20:48

fifield force-pushed the unplaced_tileop branch from 7a135c2 to 5e589be Compare August 15, 2025 16:53

fifield force-pushed the unplaced_tileop branch from 5e589be to 9113faf Compare August 27, 2025 21:28

fifield force-pushed the unplaced_tileop branch from 9113faf to 8b71658 Compare September 9, 2025 21:08

fifield added 3 commits September 23, 2025 14:33

Unplaced TileOps

2432efb

fixup tests

eab62e9

update test

4494b93

fifield force-pushed the unplaced_tileop branch from 8b71658 to 4494b93 Compare September 23, 2025 20:33

Update placers.py

9f7ce82

format

fifield force-pushed the unplaced_tileop branch from 5116b67 to 9f7ce82 Compare September 23, 2025 20:54

fifield mentioned this pull request Sep 30, 2025

Cores per column placer 2 #2619

Merged

hunhoffe mentioned this pull request Nov 25, 2025

SequentialPlacer Channel Counting #2738

Open

4 tasks

Merge branch 'main' into unplaced_tileop

0c955db

fifield wants to merge 5 commits into Xilinx:main from fifield:unplaced_tileop
