When submitting jobs on Joule 3.0, users must be mindful of the impact of large core counts and the AMD EPYC architecture on job performance. On Joule 2.0, a 240-core MPI job required 6 nodes; on Joule 3.0, the same 240-core job requires only 2 nodes. Requesting even a few nodes on Joule 3.0 can therefore result in hundreds of MPI processes. For programs that rely solely on MPI for parallelization, this can lead to poor performance due to communication bottlenecks when too many MPI processes are requested. As such, MPI-only programs should request as few nodes as possible. Depending on the program, errors may be reported if too many MPI processes are requested.
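As an illustration, a minimal MPI-only submission for a 240-rank job might look like the sketch below. The executable name is a placeholder and the node and task counts are assumptions for this example only; the documented scripts in /nfs/apps/Submission/ should be used as the starting point for real work.

    #!/bin/bash
    #SBATCH --job-name=mpi_only_240
    #SBATCH --nodes=2                # 240 ranks fit on 2 Joule 3.0 nodes; request no more than needed
    #SBATCH --ntasks-per-node=120    # 240 MPI ranks total across the 2 nodes
    #SBATCH --time=04:00:00

    srun ./my_mpi_app                # placeholder executable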
The design of the AMD EPYC 9534 processor works well for programs that utilize OpenMP. When using OpenMP, OMP_NUM_THREADS should be set to 8 so that each team of threads maps onto a single L3 cache; this ensures that a thread and its associated workers all reside on one CCD. The shared L3 cache allows quick transfer of data between the Zen 4 cores.
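Assuming the OpenMP runtime honors the standard OMP_PLACES and OMP_PROC_BIND controls (OpenMP 4.0 and later), the following settings keep each team of 8 threads on adjacent cores, and therefore within a single CCD and its shared L3 cache:

    export OMP_NUM_THREADS=8      # one thread per Zen 4 core in a CCD
    export OMP_PLACES=cores       # one place per physical core
    export OMP_PROC_BIND=close    # pack the team onto adjacent cores, i.e. one CCD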
Programs that use a Hybrid combination of both MPI and OpenMP tend to perform well on Joule 3.0. Using OpenMP greatly reduces the number of MPI ranks that must communicate. While this allows scaling to a much larger number of nodes, jobs that use the Hybrid approach on a small number of nodes may see worse performance than an MPI-only submission. For programs like VASP, the Hybrid approach usually starts to offer superior performance over MPI-only at node requests of 8 or more, though this tipping point varies from job to job. To simplify this evaluation process, documented example submission scripts can be found in the /nfs/apps/Submission/ directory. Each script was tested against as many jobs as possible to determine the optimum default settings. Typically only the #SBATCH --nodes=1 line will require changing. MPI and Hybrid example scripts are provided for VASP.
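The sketch below illustrates one possible Hybrid layout, assuming a dual-socket EPYC 9534 node (128 cores, 16 CCDs): one MPI rank per CCD with 8 OpenMP threads per rank. The executable name is a placeholder and the counts are assumptions; for production VASP runs, start from the tested scripts in /nfs/apps/Submission/.

    #!/bin/bash
    #SBATCH --job-name=hybrid_example
    #SBATCH --nodes=8                # Hybrid typically pays off at 8 or more nodes
    #SBATCH --ntasks-per-node=16     # one MPI rank per CCD (16 CCDs per dual-socket node)
    #SBATCH --cpus-per-task=8        # 8 OpenMP threads per rank, filling the CCD
    #SBATCH --time=12:00:00

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    export OMP_PLACES=cores
    export OMP_PROC_BIND=close

    srun ./my_hybrid_app             # placeholder executable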