cuSolver may prominently enhance the efficiency of LCAO module

## Background

According to our [ABACUS LCAO profiling](https://bytedance.feishu.cn/docs/doccn5tOaYdxX4fct8vVlkATNng#), solving the generalized eigenvalue problem dominates the cost when the structure is large.

For example, for 512 Si atoms (a 4x4x4 supercell) run with 8 processes, the overhead of ELPA is as follows:

| CLASS_NAME | NAME | TIME(Sec) | CALLS | AVG | PER% |
| :-----:| :----: | :----: | :-----:| :----: | :----: |
|Diago_LCAO_Matrix|elpa_solve|134.26|9|15|41|

Consequently, a more efficient eigensolver would speed up the solve step of the LCAO module in ABACUS.



## Describe the solution you'd like

cuSolver may be the best option. This conclusion is based on our report [Eigensolver Benchmark](https://bytedance.feishu.cn/docs/doccnKve9m8dQPXqpC4RuDn259d), where the eigenvector APIs of ELPA and cuSolver are benchmarked against different criteria. For the GPU-accelerated case, the table below records the time (**in seconds**) to compute either a subset or all of the eigenvectors, using 1 process (1 OMP thread), nblk=32, and one V100 GPU.

|SOLVER|elpa1+gpu(part)|elpa2+gpu(part)|elpa1+gpu(all)|elpa2+gpu(all)|cuSolver(all)|
| :-----:| :----: | :----: | :-----:| :----: | :----: |
|104(26)|0.002|0.002|0.01|0.02|0.005|
|832(153)|0.15|0.17|0.28|0.24|0.057|
|6656(1228)|48\*|50^|63|62|1.069|

\*: 18 s when tuned to nblk=512

^: 28 s when tuned to nblk=128

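As shown above, even when only a subset of the eigenvectors is needed, cuSolver, which computes all of them by default, is much faster than ELPA for the two larger cases: in the largest case it takes 1.069 s, against 48 s for the fastest ELPA partial solve (18 s after tuning nblk). Only in the smallest case is the ELPA partial solve faster (0.002 s against 0.005 s).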

## Additional context

We (i.e. **ByteDance**) plan to divide the cuSolver implementation into three steps:

- Step 1: Support single-GPU acceleration. (*We are now at this step.*)

- Step 2: Support single-node, multi-GPU acceleration.

- Step 3: Support multi-node, multi-GPU acceleration.

In Step 1, one possible strategy is for a single GPU to first gather the local data from all processes to form the full H and S matrices. After the cuSolver API **cusolverDnDsygvd** is called, the results are scattered back to the processes. Steps 2 and 3 depend on the cuSolverMG APIs, which are not yet mature, so they will start at an appropriate time.
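The gather and scatter in Step 1 could look roughly like the sketch below. It is only an illustration, with several assumptions that the issue does not confirm: H and S are taken to be stored in a 2D block-cyclic layout with block size `nblk` (the layout ELPA works with), plain Python dictionaries stand in for the per-process MPI buffers, and the helper names `owner`, `scatter` and `gather` are hypothetical. The actual solve on the gathering rank is only marked by a comment.

```python
# Sketch only: the 2D block-cyclic layout and all helper names are assumptions.

def owner(i, j, nblk, prow, pcol):
    """Map global element (i, j) to (process row, process col, local row, local col)."""
    def split(g, nproc):
        block = g // nblk                       # index of the global block
        return block % nproc, (block // nproc) * nblk + g % nblk
    pr, li = split(i, prow)
    pc, lj = split(j, pcol)
    return pr, pc, li, lj

def scatter(full, nblk, prow, pcol):
    """Split a full n x n matrix into per-process pieces {(li, lj): value}."""
    local = {(pr, pc): {} for pr in range(prow) for pc in range(pcol)}
    n = len(full)
    for i in range(n):
        for j in range(n):
            pr, pc, li, lj = owner(i, j, nblk, prow, pcol)
            local[(pr, pc)][(li, lj)] = full[i][j]
    return local

def gather(local, n, nblk, prow, pcol):
    """Rebuild the full n x n matrix from the per-process pieces."""
    full = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pr, pc, li, lj = owner(i, j, nblk, prow, pcol)
            full[i][j] = local[(pr, pc)][(li, lj)]
    return full

# Toy setup: a 6 x 6 matrix, block size 2, on a 2 x 2 process grid.
n, nblk, prow, pcol = 6, 2, 2, 2
H = [[float(i * n + j) for j in range(n)] for i in range(n)]
H_local = scatter(H, nblk, prow, pcol)          # the distributed starting point
H_full = gather(H_local, n, nblk, prow, pcol)   # what the GPU-owning rank would hold
# Here the gathering rank would call cusolverDnDsygvd on the full H and S,
# then scatter the eigenvectors back with the same index mapping.
assert H_full == H
```

The same index mapping serves both directions, so the scattered eigenvectors end up in the layout that the rest of the LCAO code already expects. The cost is that the full matrices must fit in the memory of one GPU, which is the limit that Steps 2 and 3 are meant to lift.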






