CUDA_ERROR_ILLEGAL_ADDRESS in dft.numint.eval_ao with shls_slice

### Bug Description
I'm seeing a `CUDA_ERROR_ILLEGAL_ADDRESS` in `gpu4pyscf.dft.numint.eval_ao` when evaluating a subset of shells via `shls_slice`. 

The crash happens when the global `ao_loc` array from `SortedMole` is passed as `ao_loc_slice`. It looks like the kernel (or the Python calling sequence) expects `ao_loc_slice` to contain offsets relative to the start of the slice, not global offsets. Even if you manually re-index `ao_loc` to start at 0 for the slice, the resulting AO values are often scrambled: `SortedMole` reorders atoms and shells, which makes mapping a specific AO subset back to the original molecular basis very error-prone.
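
For reference, this is roughly what I mean by re-indexing `ao_loc` to start at 0 for the slice. It is a NumPy-only sketch on a toy `ao_loc`; the helper name `slice_relative_ao_loc` is my own and not part of gpu4pyscf, and whether `eval_ao` actually expects this layout is my guess from the crash, not something I have confirmed in the kernel source.

```python
import numpy as np

def slice_relative_ao_loc(ao_loc, shls):
    """Return AO offsets for the selected shells, starting at 0.

    Hypothetical helper (not a gpu4pyscf function). `ao_loc` is the global
    offset array of length nbas + 1; `shls` holds the selected shell indices.
    """
    ao_loc = np.asarray(ao_loc)
    shls = np.asarray(shls)
    # Number of AOs carried by each selected shell.
    dims = ao_loc[shls + 1] - ao_loc[shls]
    # Prefix sum with a leading 0 gives len(shls) + 1 slice-relative offsets.
    local = np.zeros(len(shls) + 1, dtype=np.int32)
    local[1:] = np.cumsum(dims)
    return local

# Toy example: five shells carrying 1, 3, 1, 3 and 5 AOs.
ao_loc = np.array([0, 1, 4, 5, 8, 13], dtype=np.int32)
shls = np.array([1, 3, 4], dtype=np.int32)

local = slice_relative_ao_loc(ao_loc, shls)
print(local)           # slice-relative offsets
print(int(local[-1]))  # total AO count of the slice
```

The last entry of the returned array is the AO count of the slice, so it should match whatever is passed as `nao_slice`. This only fixes the offsets; it does nothing about the shell reordering inside `SortedMole`.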

I've had to resort to rebuilding sub-molecules from original atom indices to get correct O(N) evaluation, but it would be much better if `eval_ao` supported this natively without crashing.

### Reproduction
```python
import numpy as np
import cupy as cp
from pyscf import gto
from gpu4pyscf.dft import numint as gni

def reproduce():
    # Setup Molecule (CHEMBL100179_00)
    mol = gto.Mole()
    mol.atom = """
    C      -4.08900000      0.24860000      0.26420000
    N      -2.64900000      0.35520000      0.26430000
    C      -2.12700000      0.52000000     -1.07690000
    C      -0.63450000      0.46570000     -1.11830000
    C       0.11700000      0.03750000     -0.10670000
    C       1.58620000      0.02360000     -0.19000000
    C       2.26090000      1.11590000     -0.73550000
    C       3.63920000      1.13130000     -0.82990000
    C       4.34860000      0.03570000     -0.37350000
    F       5.69730000      0.04180000     -0.45920000
    C       3.71350000     -1.06580000      0.16870000
    C       2.33440000     -1.06260000      0.25960000
    C      -0.53640000     -0.41080000      1.17080000
    C      -2.00220000     -0.75840000      0.92830000
    """
    mol.basis = 'gth-tzv2p'
    mol.pseudo = 'gth-pbe'
    mol.unit = 'Angstrom'
    mol.build()
    
    grid_coords = np.zeros((100, 3))
    grid_coords[:, 0] = np.linspace(-5, 5, 100)
    
    ni_gpu = gni.NumInt()
    # build to get gdftopt/sorted_mol
    ni_gpu.build(mol, grid_coords[:1])
    opt = ni_gpu.gdftopt
    sorted_mol = opt._sorted_mol
    
    # Target a subset of shells (simulating screening)
    active_shls = np.arange(0, min(80, sorted_mol.nbas), dtype=np.int32)
    ao_loc_sorted = sorted_mol.ao_loc_nr()
    active_ao_count = sum(ao_loc_sorted[ish+1] - ao_loc_sorted[ish] for ish in active_shls)

    print(f"Triggering gni.eval_ao with {len(active_shls)} shells...")
    chunk_gpu = cp.asarray(grid_coords)
    
    # This call causes CUDA_ERROR_ILLEGAL_ADDRESS
    ao_chunk_gpu = gni.eval_ao(
        sorted_mol, 
        chunk_gpu, 
        shls_slice=cp.asarray(active_shls),
        ao_loc_slice=cp.asarray(ao_loc_sorted), 
        nao_slice=active_ao_count,
        ctr_offsets_slice=opt.l_ctr_offsets, 
        gdftopt=opt, 
        transpose=True
    )
    
    cp.cuda.Device().synchronize()

if __name__ == "__main__":
    reproduce()
```

### Environment
- **GPU**: NVIDIA GeForce RTX 4090 (Driver 550.120, Compute 8.9)
- **CUDA**: 12.2
- **pyscf**: 2.12.1
- **gpu4pyscf**: 1.6.1
- **cupy**: 14.0.1
- **torch**: 2.10.0+cu128

