Description
Background
According to our ABACUS LCAO profiling, solving the generalized eigenvalue problem dominates the runtime when the structure is large.
For example, for a 4x4x4 supercell containing 512 Si atoms, run on 8 processes, the ELPA overhead is as follows:
| CLASS_NAME | NAME | TIME(Sec) | CALLS | AVG | PER% |
|---|---|---|---|---|---|
| Diago_LCAO_Matrix | elpa_solve | 134.26 | 9 | 15 | 41 |
Consequently, a more efficient eigensolver would speed up the solving procedure in the LCAO module of ABACUS.
Describe the solution you'd like
cuSolver may be the best choice. This conclusion is based on our report, Eigensolver Benchmark, in which the eigenvector APIs of ELPA and cuSolver are benchmarked against different criteria. Focusing on the GPU-accelerated case, the time (in seconds) to compute some or all of the eigenvectors was recorded with 1 process (1 OMP thread), nblk=32, and one V100 GPU.
| Matrix size (eigenvectors in partial solve) | elpa1+gpu(part) | elpa2+gpu(part) | elpa1+gpu(all) | elpa2+gpu(all) | cuSolver(all) |
|---|---|---|---|---|---|
| 104(26) | 0.002 | 0.002 | 0.01 | 0.02 | 0.005 |
| 832(153) | 0.15 | 0.17 | 0.28 | 0.24 | 0.057 |
| 6656(1228) | 48* | 50^ | 63 | 62 | 1.069 |
*: 18 s when nblk is tuned to 512
^: 28 s when nblk is tuned to 128
As shown above, even when only a subset of the eigenvectors is needed, cuSolver, which computes all of them by default, is much faster than ELPA.
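To make the gap concrete, the ratios for the largest case (6656) can be worked out directly from the table and its footnotes. This is only arithmetic on the numbers reported above; the dictionary keys are labels chosen here for illustration.

```python
# Timings in seconds for the 6656 case, copied from the table and footnotes.
elpa_timings = {
    "elpa1+gpu(part)": 48.0,
    "elpa2+gpu(part)": 50.0,
    "elpa1+gpu(all)": 63.0,
    "elpa2+gpu(all)": 62.0,
    "elpa1+gpu(part), nblk=512": 18.0,  # footnote *
    "elpa2+gpu(part), nblk=128": 28.0,  # footnote ^
}
cusolver_all = 1.069

# Ratio of each ELPA timing to the cuSolver timing for the same matrix.
ratio = {name: t / cusolver_all for name, t in elpa_timings.items()}

# Print from the smallest gap to the largest.
for name, r in sorted(ratio.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {r:5.1f}x the cuSolver(all) time")
```

Even the best tuned ELPA result (18 s with nblk=512) takes roughly 17 times as long as cuSolver for this matrix.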
Additional context
We (ByteDance) plan to implement cuSolver support in three steps:
- Step 1: Support acceleration on a single GPU. (We are currently at this step.)
- Step 2: Support multi-GPU acceleration on a single node.
- Step 3: Support multi-node, multi-GPU acceleration.
In Step 1, one possible strategy is for a single GPU to first gather the data from all processes to form the complete H and S matrices. After the cuSolver API cusolverDnDsygvd is called, the results are scattered back to the processes. Steps 2 and 3 depend on the cuSolverMG APIs, which are not yet mature, and will start at an appropriate time.
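The Step 1 data flow (gather, solve once, scatter) can be sketched in plain Python. This is a minimal illustration and not the planned implementation. It assumes H and S are stored in a 2D block-cyclic layout with block size nblk, as is usual for ELPA and ScaLAPACK. It uses no MPI and no GPU. All function names are hypothetical, and `diagonal_solver` is a toy stand-in for cusolverDnDsygvd that only handles diagonal H and S.

```python
import math

def owner(i, nblk, nprocs):
    """Process coordinate owning global index i in a 1D block-cyclic layout."""
    return (i // nblk) % nprocs

def scatter(global_mat, nblk, prow, pcol):
    """Split a full matrix into per-process pieces {(i, j): value}."""
    n = len(global_mat)
    local = {(r, c): {} for r in range(prow) for c in range(pcol)}
    for i in range(n):
        for j in range(n):
            rank = (owner(i, nblk, prow), owner(j, nblk, pcol))
            local[rank][(i, j)] = global_mat[i][j]
    return local

def gather(local, n):
    """Reassemble the full n x n matrix in one place from all pieces."""
    full = [[0.0] * n for _ in range(n)]
    for piece in local.values():
        for (i, j), value in piece.items():
            full[i][j] = value
    return full

def solve_on_single_gpu(local_h, local_s, n, nblk, prow, pcol, solver):
    """Gather H and S, solve H c = lambda S c once, scatter the eigenvectors."""
    h = gather(local_h, n)
    s = gather(local_s, n)
    eigenvalues, eigenvectors = solver(h, s)  # stand-in for cusolverDnDsygvd
    return eigenvalues, scatter(eigenvectors, nblk, prow, pcol)

def diagonal_solver(h, s):
    """Toy solver (assumption): valid only when H and S are both diagonal.

    Returns eigenvalues in ascending order and S-normalized eigenvectors
    stored as columns.
    """
    n = len(h)
    pairs = sorted((h[i][i] / s[i][i], i) for i in range(n))
    vals = [lam for lam, _ in pairs]
    vecs = [[0.0] * n for _ in range(n)]
    for col, (_, i) in enumerate(pairs):
        vecs[i][col] = 1.0 / math.sqrt(s[i][i])
    return vals, vecs

# Small demo: a 6 x 6 problem on a 2 x 2 process grid with nblk = 2.
n, nblk, prow, pcol = 6, 2, 2, 2
h = [[(6.0 - i) if i == j else 0.0 for j in range(n)] for i in range(n)]
s = [[4.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
local_h = scatter(h, nblk, prow, pcol)
local_s = scatter(s, nblk, prow, pcol)
vals, local_vecs = solve_on_single_gpu(
    local_h, local_s, n, nblk, prow, pcol, diagonal_solver
)
print(vals)
```

In a real run the gather and scatter would be MPI communication, and the full H and S would have to fit in the memory of the one GPU, which is part of why Steps 2 and 3 are planned.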