Commit b188f21

perf(TDDFT): Add CUDA acceleration for the snap_psibeta_half function (largely improves its performance) (#6808)
* feat(tddft): Add CUDA acceleration for snap_psibeta with constant-memory grids
  - Implement the GPU-accelerated snap_psibeta_neighbor_batch_kernel
  - Use constant memory for the Lebedev and Gauss-Legendre integration grids
  - Add multi-GPU support via set_device_by_rank
  - Initialize/finalize GPU resources in each calculate_HR call
  - Remove static global variables for cleaner resource management
  - Fall back to the CPU when GPU processing fails

* refactor(snap_psibeta_atom_batch_gpu): add timer and simplify code structure
  - Add a ModuleBase::timer to the snap_psibeta_atom_batch_gpu function
  - Remove the GPU-to-CPU fallback design (returning true/false from a void function)
  - Replace fallback returns with error messages and proper early exits
  - Ensure the timer is called on all exit paths
  - Simplify the code structure for better readability

* Remove snap_psibeta_neighbor_batch_gpu

* perf(gpu): optimize snap_psibeta_atom_batch_kernel with loop restructuring
  - Move the ylm0 computation outside the radial loop (saves 140x redundant calculations)
  - Hoist the A_dot_leb and dR calculations outside the inner loop
  - Add #pragma unroll hints for the radial and m0 loops
  - Achieves a 23.3% speedup on snap_psibeta_gpu (19.27 s -> 14.78 s); numerical correctness verified, with the energy matching the baseline (-756.053 Ry)
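The constant-memory grid layout described in the first commit above can be sketched as follows. This is a generic illustration, not the actual ABACUS code: the array names (`c_leb_grid`, `c_gl_nodes`, `c_gl_weights`), the grid sizes, and the `upload_grids` helper are all assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical sizes; the real Lebedev / Gauss-Legendre grids differ.
constexpr int LEB_NPTS = 110; // Lebedev angular points
constexpr int GL_NPTS  = 140; // Gauss-Legendre radial points

// Constant memory is cached and broadcast to all threads of a warp that
// read the same address, which suits read-only integration grids shared
// by every thread of the kernel.
__constant__ double c_leb_grid[LEB_NPTS * 4]; // x, y, z, weight per point
__constant__ double c_gl_nodes[GL_NPTS];
__constant__ double c_gl_weights[GL_NPTS];

// Host-side one-time upload, e.g. from an initialize_gpu_resources()-style
// setup function called once per calculate_HR invocation.
void upload_grids(const double* leb, const double* nodes, const double* w)
{
    cudaMemcpyToSymbol(c_leb_grid, leb, sizeof(double) * LEB_NPTS * 4);
    cudaMemcpyToSymbol(c_gl_nodes, nodes, sizeof(double) * GL_NPTS);
    cudaMemcpyToSymbol(c_gl_weights, w, sizeof(double) * GL_NPTS);
}
```

Because `__constant__` symbols live for the lifetime of the CUDA context, no explicit teardown is required, which is consistent with the later removal of `finalize_gpu_resources()`.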
* perf(gpu): optimize compute_ylm_gpu with atan2 and sincos
  - Replace conditional atan branches with a single atan2 call
  - Use sincos() instead of separate sin/cos calls
  - Achieves an 8.4% additional speedup (14.78 s -> 13.56 s); combined with the loop restructuring, 29.6% total from baseline. Numerical correctness verified: -756.053 Ry

* Make the code more concise

* perf(cuda): optimize compute_ylm_gpu with template parameters
  - Convert compute_ylm_gpu to a templated version with L as the template parameter
  - Use a linear array for the Legendre polynomials (reduces storage from 25 to 15 doubles)
  - Add a DISPATCH_YLM macro for runtime-to-template dispatch
  - Add a MAX_M0_SIZE constant for sizing the result array
  - Replace C++17 `if constexpr` with a regular if for C++14 compatibility
  - Enable compiler loop unrolling with #pragma unroll
  - Performance: snap_psibeta_gpu improved from 13.27 s to 9.83 s (1.35x speedup)

* perf(cuda): use warp shuffle for reduction in the snap_psibeta kernel
  - Replace the shared-memory tree reduction with a warp shuffle reduction
  - Use warp_reduce_sum for intra-warp reduction (faster shuffle ops)
  - Cross-warp reduction is done by the first warp reading per-warp partials from shared memory
  - Reduces register usage from 94 to 88 and shared memory from BLOCK_SIZE (2 KB) to NUM_WARPS (64 bytes)
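The warp-shuffle reduction mentioned above is a standard CUDA idiom: an intra-warp butterfly using register shuffles, followed by the first warp reducing per-warp partial sums from a NUM_WARPS-sized shared array. The sketch below shows the generic two-stage scheme; names and details are generic, not the exact kernel code.

```cuda
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

// Butterfly reduction within one warp using register shuffles:
// no shared memory and no __syncthreads() needed inside a warp.
__device__ __forceinline__ double warp_reduce_sum(double val)
{
    #pragma unroll
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
    {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val; // lane 0 holds the warp's sum
}

// Cross-warp stage: each warp writes its partial sum into a tiny shared
// array (NUM_WARPS doubles instead of BLOCK_SIZE), then warp 0 reduces it.
template <int BLOCK_SIZE>
__device__ double block_reduce_sum(double val)
{
    constexpr int NUM_WARPS = BLOCK_SIZE / WARP_SIZE;
    __shared__ double warp_sums[NUM_WARPS];

    const int lane = threadIdx.x % WARP_SIZE;
    const int warp = threadIdx.x / WARP_SIZE;

    val = warp_reduce_sum(val);
    if (lane == 0) { warp_sums[warp] = val; }
    __syncthreads();

    // First warp reduces the per-warp partial sums.
    val = (threadIdx.x < NUM_WARPS) ? warp_sums[lane] : 0.0;
    if (warp == 0) { val = warp_reduce_sum(val); }
    return val; // valid in thread 0
}
```

Replacing a BLOCK_SIZE-wide shared-memory tree with this scheme is what shrinks shared-memory use from 2 KB to NUM_WARPS doubles (64 bytes for a 256-thread block), matching the figures in the commit message.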
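The runtime-to-template dispatch behind a macro like DISPATCH_YLM follows a standard pattern: a switch over the runtime angular momentum selects a fully templated instantiation, so the compiler can unroll loops and size arrays at compile time. This minimal host-side sketch uses a simplified stand-in for `compute_ylm_gpu`; the macro body, the case range, and `compute_ylm_size` are illustrative assumptions.

```cpp
#include <cassert>

// Templated worker: L is a compile-time constant, so loops over m can be
// fully unrolled and result arrays sized exactly. Stands in for the real
// templated compute_ylm_gpu.
template <int L>
int compute_ylm_size()
{
    return (L + 1) * (L + 1); // number of Y_lm up to angular momentum L
}

// Map a runtime l onto the templated instantiations (hypothetical range).
#define DISPATCH_YLM(l, result)                        \
    switch (l)                                         \
    {                                                  \
    case 0: result = compute_ylm_size<0>(); break;     \
    case 1: result = compute_ylm_size<1>(); break;     \
    case 2: result = compute_ylm_size<2>(); break;     \
    case 3: result = compute_ylm_size<3>(); break;     \
    case 4: result = compute_ylm_size<4>(); break;     \
    default: result = -1; break;                       \
    }

int dispatch(int l)
{
    int n = 0;
    DISPATCH_YLM(l, n);
    return n;
}
```

The cost is one branch per call on the host (or per thread on the device), traded against unrolled, register-resident inner loops in each instantiation.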
* refactor(cuda): improve snap_psibeta_kernel code organization and documentation
  - Add comprehensive file headers explaining purpose and key features
  - Organize the code into logical sections with clear separators
  - Add doxygen-style documentation for all functions, structs, and constants
  - Fix inaccurate comments (BLOCK_SIZE requirement, direction vector normalization)
  - Remove unused variables (dR, distance01)
  - Remove finalize_gpu_resources(), as it is not needed for constant memory
  - Improve the inline comments explaining the algorithms and optimizations

* refactor(td_nonlocal_lcao): use a runtime check for GPU/CPU branch selection
  - Add a use_gpu runtime flag that checks both the __CUDA macro and PARAM.inp.device
  - The GPU path is now enabled only when __CUDA is defined AND device == "gpu"
  - Makes the conditional logic clearer with if/else instead of nested #ifdef

* refactor(snap_psibeta): unify the CUDA error-checking macros
  - Move the CUDA_CHECK macro to the shared header snap_psibeta_kernel.cuh
  - Remove the duplicate CUDA_CHECK definition from snap_psibeta_gpu.cu
  - Remove the CUDA_CHECK_KERNEL macro and replace all usages with CUDA_CHECK
  - Reduces code duplication and improves consistency

* refactor(snap_psibeta_kernel): use ModuleBase constants
  - Replace local PI, FOUR_PI, and SQRT2 definitions with the ModuleBase:: versions
  - Add an include for source_base/constants.h

* refactor(snap_psibeta): use ModuleBase::WARNING_QUIT for error handling
  - Replace fprintf(stderr, ...) with ModuleBase::WARNING_QUIT
  - Update the CUDA_CHECK macro to use WARNING_QUIT instead of fprintf
  - Add includes for tool_quit.h and the string header
  - Consistent error handling with ABACUS codebase conventions
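A unified CUDA_CHECK macro of the kind described above typically wraps every runtime call and kernel launch. The sketch below assumes ABACUS's `ModuleBase::WARNING_QUIT(file, description)` fatal-error helper; the exact macro body and message format in snap_psibeta_kernel.cuh may differ.

```cuda
#include <cuda_runtime.h>
#include <string>
#include "source_base/tool_quit.h" // ModuleBase::WARNING_QUIT

// Abort with a consistent ABACUS-style error message on any CUDA failure.
#define CUDA_CHECK(call)                                                   \
    do                                                                     \
    {                                                                      \
        cudaError_t err__ = (call);                                        \
        if (err__ != cudaSuccess)                                          \
        {                                                                  \
            ModuleBase::WARNING_QUIT("snap_psibeta_gpu",                   \
                                     std::string("CUDA error: ")           \
                                         + cudaGetErrorString(err__));     \
        }                                                                  \
    } while (0)

// Usage: the same macro covers API calls and kernel-launch checks, which
// is why a separate CUDA_CHECK_KERNEL macro becomes redundant:
//   CUDA_CHECK(cudaMemcpyToSymbol(c_gl_nodes, nodes, bytes));
//   kernel<<<grid, block>>>(args...);
//   CUDA_CHECK(cudaGetLastError());
```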
1 parent 29f141d commit b188f21

File tree: 6 files changed, +1430 −33 lines changed


source/source_lcao/module_operator_lcao/td_nonlocal_lcao.cpp

Lines changed: 75 additions & 33 deletions
@@ -1,16 +1,21 @@
 #include "td_nonlocal_lcao.h"
 
-#include "source_io/module_parameter/parameter.h"
 #include "source_base/timer.h"
 #include "source_base/tool_title.h"
 #include "source_cell/module_neighbor/sltk_grid_driver.h"
-#include "source_lcao/module_operator_lcao/operator_lcao.h"
+#include "source_io/module_parameter/parameter.h"
 #include "source_lcao/module_hcontainer/hcontainer_funcs.h"
+#include "source_lcao/module_operator_lcao/operator_lcao.h"
 #include "source_lcao/module_rt/snap_psibeta_half_tddft.h"
+#ifdef __CUDA
+#include "source_base/module_device/device.h"
+#include "source_lcao/module_rt/kernels/snap_psibeta_gpu.h"
+#endif
+
 #include "source_pw/module_pwdft/global.h"
 #ifdef _OPENMP
-#include <unordered_set>
 #include <omp.h>
+#include <unordered_set>
 #endif
 
 template <typename TK, typename TR>
@@ -127,6 +132,27 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::calculate_HR()
     ModuleBase::TITLE("TDNonlocal", "calculate_HR");
     ModuleBase::timer::tick("TDNonlocal", "calculate_HR");
 
+    // Determine whether to use GPU path:
+    // GPU is only used when both __CUDA is defined AND device is set to "gpu"
+#ifdef __CUDA
+    const bool use_gpu = (PARAM.inp.device == "gpu");
+#else
+    const bool use_gpu = false;
+#endif
+
+    // Initialize GPU resources if using GPU
+    if (use_gpu)
+    {
+#ifdef __CUDA
+        // Use set_device_by_rank for multi-GPU support
+        int dev_id = 0;
+#ifdef __MPI
+        dev_id = base_device::information::set_device_by_rank(MPI_COMM_WORLD);
+#endif
+        module_rt::gpu::initialize_gpu_resources();
+#endif
+    }
+
     const Parallel_Orbitals* paraV = this->hR_tmp->get_atom_pair(0).get_paraV();
     const int npol = this->ucell->get_npol();
     const int nlm_dim = TD_info::out_current ? 4 : 1;
@@ -145,9 +171,27 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::calculate_HR()
             nlm_tot[i].resize(nlm_dim);
         }
 
-#pragma omp parallel
+        if (use_gpu)
         {
-#pragma omp for schedule(dynamic)
+#ifdef __CUDA
+            // GPU path: Atom-level GPU batch processing
+            module_rt::gpu::snap_psibeta_atom_batch_gpu(orb_,
+                                                        this->ucell->infoNL,
+                                                        T0,
+                                                        tau0 * this->ucell->lat0,
+                                                        cart_At,
+                                                        adjs,
+                                                        this->ucell,
+                                                        paraV,
+                                                        npol,
+                                                        nlm_dim,
+                                                        nlm_tot);
+#endif
+        }
+        else
+        {
+            // CPU path: OpenMP parallel over neighbors to compute nlm_tot
+#pragma omp parallel for schedule(dynamic)
             for (int ad = 0; ad < adjs.adj_num + 1; ++ad)
             {
                 const int T1 = adjs.ntype[ad];
@@ -160,35 +204,36 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::calculate_HR()
                 all_indexes.insert(all_indexes.end(), col_indexes.begin(), col_indexes.end());
                 std::sort(all_indexes.begin(), all_indexes.end());
                 all_indexes.erase(std::unique(all_indexes.begin(), all_indexes.end()), all_indexes.end());
-                for (int iw1l = 0; iw1l < all_indexes.size(); iw1l += npol)
+
+                // CPU path: loop over orbitals
+                for (size_t iw1l = 0; iw1l < all_indexes.size(); iw1l += npol)
                 {
                     const int iw1 = all_indexes[iw1l] / npol;
                     std::vector<std::vector<std::complex<double>>> nlm;
-                    // nlm is a vector of vectors, but size of outer vector is only 1 when out_current is false
-                    // and size of outer vector is 4 when out_current is true (3 for <psi|r_i * exp(-iAr)|beta>, 1 for
-                    // <psi|exp(-iAr)|beta>) inner loop : all projectors (L0,M0)
-
-                    // snap_psibeta_half_tddft() are used to calculate <psi|exp(-iAr)|beta>
-                    // and <psi|rexp(-iAr)|beta> as well if current are needed
                     module_rt::snap_psibeta_half_tddft(orb_,
-                                                      this->ucell->infoNL,
-                                                      nlm,
-                                                      tau1 * this->ucell->lat0,
-                                                      T1,
-                                                      atom1->iw2l[iw1],
-                                                      atom1->iw2m[iw1],
-                                                      atom1->iw2n[iw1],
-                                                      tau0 * this->ucell->lat0,
-                                                      T0,
-                                                      cart_At,
-                                                      TD_info::out_current);
+                                                       this->ucell->infoNL,
+                                                       nlm,
+                                                       tau1 * this->ucell->lat0,
+                                                       T1,
+                                                       atom1->iw2l[iw1],
+                                                       atom1->iw2m[iw1],
+                                                       atom1->iw2n[iw1],
+                                                       tau0 * this->ucell->lat0,
+                                                       T0,
+                                                       cart_At,
+                                                       TD_info::out_current);
                     for (int dir = 0; dir < nlm_dim; dir++)
                     {
                         nlm_tot[ad][dir].insert({all_indexes[iw1l], nlm[dir]});
                     }
                 }
             }
+        }
 
+        // 2. calculate <psi_I|beta>D<beta|psi_{J,R}> for each pair of <IJR> atoms
+        // This runs for BOTH GPU and CPU paths
+#pragma omp parallel
+        {
 #ifdef _OPENMP
             // record the iat number of the adjacent atoms
             std::set<int> ad_atom_set;
@@ -205,7 +250,7 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::calculate_HR()
             const int thread_id = omp_get_thread_num();
             std::set<int> ad_atom_set_thread;
             int i = 0;
-            for(const auto iat1 : ad_atom_set)
+            for (const auto iat1: ad_atom_set)
             {
                 if (i % num_threads == thread_id)
                 {
@@ -215,7 +260,6 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::calculate_HR()
                 }
 #endif
 
-            // 2. calculate <psi_I|beta>D<beta|psi_{J,R}> for each pair of <IJR> atoms
             for (int ad1 = 0; ad1 < adjs.adj_num + 1; ++ad1)
             {
                 const int T1 = adjs.ntype[ad1];
@@ -228,7 +272,7 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::calculate_HR()
                     continue;
                 }
 #endif
-
+
                 const ModuleBase::Vector3<int>& R_index1 = adjs.box[ad1];
                 for (int ad2 = 0; ad2 < adjs.adj_num + 1; ++ad2)
                 {
@@ -247,9 +291,9 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::calculate_HR()
                     if (TD_info::out_current)
                     {
                         std::complex<double>* tmp_c[3] = {nullptr, nullptr, nullptr};
-                        for (int i = 0; i < 3; i++)
+                        for (int ii = 0; ii < 3; ii++)
                         {
-                            tmp_c[i] = TD_info::td_vel_op->get_current_term_pointer(i)
+                            tmp_c[ii] = TD_info::td_vel_op->get_current_term_pointer(ii)
                                 ->find_matrix(iat1, iat2, R_vector[0], R_vector[1], R_vector[2])
                                 ->get_pointer();
                         }
@@ -276,13 +320,13 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::calculate_HR()
                     }
                 }
             }
-        }
-        }
-
+        } // end omp parallel for matrix assembly
+    } // end for iat0
     ModuleBase::timer::tick("TDNonlocal", "calculate_HR");
 }
 
 // cal_HR_IJR()
+
 template <typename TK, typename TR>
 void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::cal_HR_IJR(
     const int& iat1,
@@ -396,7 +440,6 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::set_HR_fixed(void* hR_tmp)
     this->allocated = false;
 }
 
-
 // contributeHR()
 template <typename TK, typename TR>
 void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::contributeHR()
@@ -436,7 +479,6 @@ void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::contributeHR()
     return;
 }
 
-
 template <typename TK, typename TR>
 void hamilt::TDNonlocal<hamilt::OperatorLCAO<TK, TR>>::contributeHk(int ik)
 {
source/source_lcao/module_rt/CMakeLists.txt

Lines changed: 7 additions & 0 deletions
@@ -18,6 +18,13 @@ if(ENABLE_LCAO)
     boundary_fix.cpp
   )
 
+  if(USE_CUDA)
+    list(APPEND objects
+      kernels/cuda/snap_psibeta_kernel.cu
+      kernels/cuda/snap_psibeta_gpu.cu
+    )
+  endif()
+
   add_library(
     tddft
     OBJECT
