# CP2K {#ref-uenv-cp2k}

CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems.

CP2K provides a general framework for different modeling methods, such as DFT using the mixed Gaussian and plane waves approaches GPW and GAPW. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, …), and classical force fields (AMBER, CHARMM, …). CP2K can perform molecular dynamics, metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimization, and transition state optimization using NEB or the dimer method. See the [CP2K Features](https://www.cp2k.org/features) page for a detailed overview.

!!! note "uenvs"

[CP2K] is provided on [ALPS][platforms-on-alps] via [uenv][ref-uenv].
Please have a look at the [uenv documentation][ref-uenv] for more information about uenvs and how to use them.
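
    As a minimal sketch (the image name and tag below are placeholders; use `uenv image find` to see what is actually available on your system), an interactive session could look like this:

    ```bash
    # list CP2K uenv images available on the current system
    uenv image find cp2k

    # pull the image once (placeholder name/tag), then start a shell with the cp2k view loaded
    uenv image pull cp2k/2025.1:v1
    uenv start --view=cp2k cp2k/2025.1:v1
    ```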

## Dependencies

On our systems, CP2K is built with a number of optional dependencies enabled; those that require special attention at runtime, such as [COSMA], [DLA-Future], and [DBCSR], are described in the notes below.

!!! note "GPU-aware MPI" COSMA and DLA-Future are built with [GPU-aware MPI][ref-communication-cray-mpich-gpu-aware], which requires setting MPICH_GPU_SUPPORT_ENABLED=1. On the HPC platform, MPICH_GPU_SUPPORT_ENABLED=1 is set by default.

!!! note "CUDA cache path for JIT compilation" DBCSR uses JIT compilation for CUDA kernels. The default location is in the home directory, which can put unnecessary burden on the filesystem and lead to performance degradation. Because of this we set CUDA_CACHE_PATH to point to the in-memory filesystem in /dev/shm. On the HPC platform, CUDA_CACHE_PATH is set to a directory under /dev/shm by default.

!!! warning "BLAS/LAPACK on Eiger"

On Eiger, the default BLAS/LAPACK library is Intel oneAPI MKL (oneMKL) until `[email protected]`. 
From `[email protected]` the default BLAS/LAPACK library is [OpenBLAS].

## Running CP2K

### Running on the HPC platform

To start a job, two bash scripts are potentially required: a Slurm submission script, and a wrapper to start the CUDA MPS daemon so that multiple MPI ranks can use the same GPU.

```bash
#!/bin/bash -l

#SBATCH --job-name=cp2k-job
#SBATCH --time=00:30:00 # (1)!
#SBATCH --nodes=4
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32 # (2)!
#SBATCH --cpus-per-task=8 # (3)!
#SBATCH --account=<ACCOUNT>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --no-requeue
#SBATCH --uenv=<CP2K_UENV>
#SBATCH --view=cp2k

export CP2K_DATA_DIR=<PATH_TO_CP2K_DATA_DIR>
export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)!
export MPICH_GPU_SUPPORT_ENABLED=1 # (6)!
export MPICH_MALLOC_FALLBACK=1
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!

ulimit -s unlimited
srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
```

1. Time format: `HH:MM:SS`
2. Number of MPI ranks per node
3. Number of CPUs per MPI rank
4. OpenBLAS spawns an extra thread, therefore it is necessary to set `OMP_NUM_THREADS` to `SLURM_CPUS_PER_TASK - 1` for good performance. With Intel MKL, this is not necessary and one can set `OMP_NUM_THREADS` to `SLURM_CPUS_PER_TASK`.
5. DBCSR relies on extensive JIT compilation, and we store the cache in memory to avoid I/O overhead. This is set by default on the HPC platform, but it is set here explicitly because it is essential to avoid performance degradation.
6. CP2K's dependencies use GPU-aware MPI, which requires enabling support at runtime. This is set by default on the HPC platform, but it is set here explicitly because it is required in general for enabling GPU-aware MPI.

* Change `<ACCOUNT>` to your project account name
* Change `<CP2K_UENV>` to the name (or path) of the actual CP2K uenv you want to use
* Change `<PATH_TO_CP2K_DATA_DIR>` to the actual path to the CP2K data directory
* Change `<CP2K_INPUT>` and `<CP2K_OUTPUT>` to the actual input and output files

With the above scripts, you can launch a CP2K calculation on 4 nodes, with 32 MPI ranks per node and 8 OpenMP threads per rank with

```bash
sbatch run_cp2k.sh
```

!!! note

    The `mps-wrapper.sh` script, required to properly over-subscribe the GPU, is provided on the following page:
    [NVIDIA GH200 GPU nodes: multiple ranks per GPU][ref-slurm-gh200-multi-rank-per-gpu].
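
    As a rough sketch of what such a wrapper does (start one CUDA MPS control daemon per node, run the wrapped command, then shut the daemon down), assuming the standard `nvidia-cuda-mps-control` tool; please use the complete script from the page linked above for production runs:

    ```bash
    #!/bin/bash
    # mps-wrapper.sh (sketch): enable CUDA MPS so several MPI ranks can share one GPU
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-$SLURM_JOB_ID
    export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log-$SLURM_JOB_ID

    # only the first rank on each node starts the MPS daemon
    if [[ $SLURM_LOCALID -eq 0 ]]; then
        nvidia-cuda-mps-control -d
    fi
    sleep 5  # give the daemon time to start

    "$@"     # run the actual command, e.g. cp2k.psmp -i ... -o ...

    # the first rank on each node shuts the daemon down again
    if [[ $SLURM_LOCALID -eq 0 ]]; then
        echo quit | nvidia-cuda-mps-control
    fi
    ```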

!!! warning

    The `--cpu-bind=socket` option is necessary to get good performance.

!!! warning

    Each GH200 node has 4 modules, each composed of an ARM Grace CPU with 72 cores and an H100 GPU directly
    attached to it. Please see [Alps hardware][ref-alps-hardware] for more information.
    It is important that the number of MPI ranks passed to [slurm] with `--ntasks-per-node` is a multiple of 4.

??? note

     In the example above, we use 32 MPI ranks with 8 OpenMP threads, for a total of 64 cores per GPU and 256 cores
     per node. Experiments have shown that CP2K performs and scales better when the number of MPI ranks is a power
     of 2, even if some cores are left idling. 

??? info "Running regression tests"

If you want to run CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version
of the regression tests corresponding to the version of CP2K provided by the uenv. The regression test data is
sometimes adjusted, and using the wrong version of the regression tests can lead to test failures.
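
In the scaling examples below, speedup and efficiency are reported relative to the smallest node count in each table, following the usual definitions (with $t(N)$ the wall time on $N$ nodes and $N_0$ the baseline node count):

$$
S(N) = \frac{t(N_0)}{t(N)}, \qquad E(N) = S(N)\,\frac{N_0}{N}
$$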

??? example "Scaling of QS/H2O-1024 benchmark"

The [`QS/H2O-1024` benchmark](https://github.com/cp2k/cp2k/blob/master/benchmarks/QS/H2O-1024.inp) is a DFT
molecular dynamics simulation of liquid water. It relies on [DBCSR] for block sparse matrix-matrix multiplication.

All calculations were run with 32 MPI ranks per node, and 8 OpenMP threads per rank (best configuration for this
benchmark). 

!!! note 
    `H2O-102.inp` is the largest example of DFT molecular dynamics simulation of liquid water that fits on a single
    GH200 node.

| Number of nodes | Wall time (s) | Speedup | Efficiency |
|:---------------:|:-------------:|:-------:|:----------:|
| 1               |  793.1        | 1.00    | 1.00       |
| 2               |  535.2        | 1.48    | 0.74       |
| 4               |  543.9        | 1.45    | 0.36       |
| 8               |  487.3        | 1.64    | 0.20       |
| 16              |  616.7        | 1.28    | 0.08       |

Scaling is not ideal on more than two nodes.

??? example "Scaling of QS_mp2_rpa/128-H2O/H2O-128-RI-MP2-TZ benchmark"

The `QS_mp2_rpa/128-H2O/H2O-128-RI-MP2-TZ` benchmark is a straightforward modification of the
[`QS_mp2_rpa/64-H2O/H2O-64-RI-MP2-TZ` benchmark](https://github.com/cp2k/cp2k/blob/master/benchmarks/QS_mp2_rpa/64-H2O/H2O-64-RI-MP2-TZ.inp).

It is a RI-MP2 calculation of a water cluster with 128 atoms.

??? info "Input file"
    
    ```Fortran
    &GLOBAL
      PRINT_LEVEL MEDIUM
      PROJECT H2O-128-RI-MP2-TZ
      RUN_TYPE ENERGY
    &END GLOBAL

    &FORCE_EVAL
      METHOD Quickstep
      &DFT
        BASIS_SET_FILE_NAME ./BASIS_H2O
        POTENTIAL_FILE_NAME ./POTENTIAL_H2O
        WFN_RESTART_FILE_NAME ./H2O-128-PBE-TZ-RESTART.wfn
        &MGRID
          CUTOFF 800
          REL_CUTOFF 50
        &END MGRID
        &QS
          EPS_DEFAULT 1.0E-12
        &END QS
        &SCF
          EPS_SCF 1.0E-6
          MAX_SCF 30
          SCF_GUESS RESTART
          &OT
            MINIMIZER CG
            PRECONDITIONER FULL_ALL
          &END OT
          &OUTER_SCF
            EPS_SCF 1.0E-6
            MAX_SCF 20
          &END OUTER_SCF
          &PRINT
            &RESTART OFF
            &END RESTART
          &END PRINT
        &END SCF
        &XC
          &HF
            FRACTION 1.0
            &INTERACTION_POTENTIAL
              CUTOFF_RADIUS 6.0
              POTENTIAL_TYPE TRUNCATED
              T_C_G_DATA ./t_c_g.dat
            &END INTERACTION_POTENTIAL
            &MEMORY
              MAX_MEMORY 16384
            &END MEMORY
            &SCREENING
              EPS_SCHWARZ 1.0E-8
              SCREEN_ON_INITIAL_P TRUE
            &END SCREENING
          &END HF
          &WF_CORRELATION
            MEMORY 1200
            NUMBER_PROC 1
            &INTEGRALS
              &WFC_GPW
                CUTOFF 300
                EPS_FILTER 1.0E-12
                EPS_GRID 1.0E-8
                REL_CUTOFF 50
              &END WFC_GPW
            &END INTEGRALS
            &RI_MP2
            &END RI_MP2
          &END WF_CORRELATION
          &XC_FUNCTIONAL NONE
          &END XC_FUNCTIONAL
        &END XC
      &END DFT
      &SUBSYS
        &CELL
          ABC 15.6404 15.6404 15.6404
        &END CELL
        &KIND H
          BASIS_SET cc-TZ
          BASIS_SET RI_AUX RI-cc-TZ
          POTENTIAL GTH-HF-q1
        &END KIND
        &KIND O
          BASIS_SET cc-TZ
          BASIS_SET RI_AUX RI-cc-TZ
          POTENTIAL GTH-HF-q6
        &END KIND
        &TOPOLOGY
          COORD_FILE_FORMAT XYZ
          COORD_FILE_NAME ./H2O-128.xyz
        &END TOPOLOGY
      &END SUBSYS
    &END FORCE_EVAL
    ```

All calculations for this scaling test used 32 MPI ranks per node and 8 OpenMP threads per rank.
The smallest number of nodes needed to run this calculation is 8.


| Number of nodes | Wall time (s) | Speedup | Efficiency |
|:---------------:|:-------------:|:-------:|:----------:|
| 8               |  2037.0       | 1.00    | 1.00       |
| 16              |  1096.2       | 1.85    | 0.92       |
| 32              |  611.5        | 3.33    | 0.83       |
| 64              |  410.5        | 4.96    | 0.62       |
| 128             |  290.9        | 7.00    | 0.43       |

MP2 calculations scale well on GH200, up to a large number of nodes ($> 50\%$ efficiency with 64 nodes).

??? example "Scaling of QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ benchmark"

The [`QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-T` benchmark](https://github.com/cp2k/cp2k/blob/master/benchmarks/QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ.inp)
is a RPA energy calculation, traditionally used to benchmark the performance of the [COSMA] library. 
It a very large calculation, which requires at least 8 GH200 nodes to run. 
The calculations were run with 16 MPI ranks per node and 16 OpenMP threads per rank.
For RPA workloads, a higher ratio of threads per rank were beneficial.

| Number of nodes | Wall time (s) | Speedup | Efficiency |
|:---------------:|:-------------:|:-------:|:----------:|
| 8               |  575.4        | 1.00    | 1.00       |
| 16              |  465.8        | 1.23    | 0.61       |
| 32              |  281.1        | 2.04    | 0.51       |
| 64              |  205.3        | 2.80    | 0.35       |
| 128             |  185.8        | 3.09    | 0.19       |

This RPA input scales well until 32 GH200 nodes.
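
    With respect to the GH200 job script shown earlier, only the task and thread counts change for this configuration (a sketch; everything else stays the same, and `OMP_NUM_THREADS` is still derived from `SLURM_CPUS_PER_TASK`):

    ```bash
    #SBATCH --ntasks-per-node=16
    #SBATCH --cpus-per-task=16
    ```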

### Running on Eiger

On Eiger, a similar sbatch script can be used:

```bash
#!/bin/bash -l
#SBATCH --job-name=cp2k-job
#SBATCH --time=00:30:00 # (1)!
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32 # (2)!
#SBATCH --cpus-per-task=4 # (3)!
#SBATCH --account=<ACCOUNT>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --constraint=mc
#SBATCH --uenv=<CP2K_UENV>
#SBATCH --view=cp2k

export CP2K_DATA_DIR=<PATH_TO_CP2K_DATA_DIR>
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!

ulimit -s unlimited
srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
```

1. Time format: `HH:MM:SS`
2. Number of MPI ranks per node
3. Number of CPUs per MPI rank
4. OpenBLAS spawns an extra thread, therefore it is necessary to set `OMP_NUM_THREADS` to `SLURM_CPUS_PER_TASK - 1` for good performance. With Intel MKL, this is not necessary and one can set `OMP_NUM_THREADS` to `SLURM_CPUS_PER_TASK`.

* Change `<ACCOUNT>` to your project account name
* Change `<CP2K_UENV>` to the name (or path) of the actual CP2K uenv you want to use
* Change `<PATH_TO_CP2K_DATA_DIR>` to the actual path to the CP2K data directory
* Change `<CP2K_INPUT>` and `<CP2K_OUTPUT>` to the actual input and output files

!!! warning

    The `--cpu-bind=socket` option is necessary to get good performance.

??? info "Running regression tests"

If you want to run CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version
of the regression tests corresponding to the version of CP2K provided by the uenv. The regression test data is
sometimes adjusted, and using the wrong version of the regression tests can lead to test failures.
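
    As a hypothetical invocation (the `do_regtest.py` driver lives in the `tests` directory of the CP2K sources; the exact options differ between versions, so check `--help` first):

    ```bash
    # from a CP2K source checkout matching the uenv's CP2K version
    cd <PATH_TO_CP2K_SOURCE>/tests
    ./do_regtest.py --help   # check the exact options for this version
    # hypothetical run: point the driver at the directory containing cp2k.psmp
    ./do_regtest.py --mpiranks 2 --ompthreads 2 <DIR_WITH_CP2K_PSMP> psmp
    ```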

## Building CP2K from Source

!!! warning

    The following installation instructions are up-to-date with the latest version of CP2K provided by the uenv.
    That is, they work when manually compiling the CP2K source code corresponding to the CP2K version provided by the uenv.
    They are not necessarily up-to-date with the latest version of CP2K available on the `master` branch.

    If you are trying to build CP2K from source, make sure you understand what is different in `master`
    compared to the latest version of CP2K provided by the uenv.

The CP2K uenv provides all the dependencies required to build CP2K from source, with several optional features enabled. You can follow these steps to build CP2K from source:

```bash
uenv start --view=develop <CP2K_UENV> # (1)!

cd <PATH_TO_CP2K_SOURCE> # (2)!

mkdir build && cd build
CC=mpicc CXX=mpic++ FC=mpifort cmake .. \
    -GNinja \
    -DCMAKE_CUDA_HOST_COMPILER=mpicc \
    -DCP2K_USE_LIBXC=ON \
    -DCP2K_USE_LIBINT2=ON \
    -DCP2K_USE_SPGLIB=ON \
    -DCP2K_USE_ELPA=ON \
    -DCP2K_USE_SPLA=ON \
    -DCP2K_USE_SIRIUS=ON \
    -DCP2K_USE_COSMA=ON \
    -DCP2K_USE_PLUMED=ON \
    -DCP2K_USE_DFTD4=ON \
    -DCP2K_USE_DLAF=ON \
    -DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 # (3)!

ninja -j 32
```

1. Start the CP2K uenv and load the `develop` view (which provides all the necessary dependencies)
2. Go to the CP2K source directory
3. The `H100` option enables the `sm_90` architecture for the CUDA backend

!!! note "Eiger: libxsmm"

On `x86` we deploy with `libxmm`. Add `-DCP2K_USE_LIBXSMM=ON` to the CMake invocation to use `libxsmm`.

??? note "Eiger: Intel MKL (before [email protected])"

On `x86` we deployed with `intel-oneapi-mkl` before `[email protected]`. 
If you are using a pre-`[email protected]` uenv, add `-DCP2K_SCALAPACK_VENDOR=MKL` to the CMake invocation to find MKL.

??? note "CUDA architecture for [email protected] and earlier"

`[email protected]` (and earlier) does not support compiling for `cuda_arch=90`. Use `-DCP2K_WITH_GPU=A100` instead, 
which enables the `sm_80` architecture.

See manual.cp2k.org/CMake for more details.
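
As a quick sanity check after the build (assuming the default CMake layout, where the executables end up in `build/bin`), you can query the freshly built binary from within the same uenv session:

```bash
# still inside the uenv develop view and the build directory
./bin/cp2k.psmp --version
```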

## Known issues

### DLA-Future

The `cp2k/2025.1` uenv provides CP2K with DLA-Future support enabled. The DLA-Future library is initialized even if you don't explicitly ask to use it. This can lead to some surprising warnings and failures, described below.

#### `CUSOLVER_STATUS_INTERNAL_ERROR` during initialization

If you are heavily over-subscribing the GPU by running multiple ranks per GPU, you may encounter the following error:

```
created exception: cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
terminate called after throwing an instance of 'pika::cuda::experimental::cusolver_exception'
  what(): cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
```

The reason is that too many cuSOLVER handles are created. If you don't need DLA-Future, you can manually set the number of BLAS and LAPACK handles to 1 by setting the following environment variables:

```bash
export DLAF_NUM_GPU_BLAS_HANDLES=1
export DLAF_NUM_GPU_LAPACK_HANDLES=1
```

#### Warning about pika only using one worker thread

When running CP2K with multiple tasks per node and only one core per task, the initialization of DLA-Future may trigger the following warning:

```
The pika runtime will be started with only one worker thread because the
process mask has restricted the available resources to only one thread. If
this is unintentional make sure the process mask contains the resources
you need or use --pika:ignore-process-mask to use all resources. Use
--pika:print-bind to print the thread bindings used by pika.
```

This warning is triggered because pika, the runtime used by DLA-Future, is typically meant to run with more than one worker thread, so the warning usually indicates a configuration mistake. However, if you are not using DLA-Future, the warning is harmless and can be ignored. The warning cannot be silenced.

### DBCSR GPU scaling

On the GH200 architecture, it has been observed that the GPU-accelerated version of DBCSR does not perform optimally in some cases. For example, in the `QS/H2O-1024` benchmark above, CP2K does not scale well beyond 2 nodes. The CPU implementation of DBCSR does not suffer from this. A workaround was implemented in DBCSR to allow switching GPU acceleration on and off with an environment variable:

```bash
export DBCSR_RUN_ON_GPU=0
```

While GPU acceleration is very good on few nodes, the CPU implementation scales better. Therefore, for CP2K jobs running on many nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU` environment variable.

Some niche application cases, such as the `QS_low_scaling_postHF` benchmarks, only run efficiently with the CPU version of DBCSR. Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report (at the end of the CP2K output file), it is worth investigating the effect of the `DBCSR_RUN_ON_GPU` environment variable.
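
For example, a simple way to check this in an existing output file (the timing report is printed at the end of the CP2K output):

```bash
# show how much time is reported for DBCSR matrix multiplications
grep dbcsr_multiply_generic <CP2K_OUTPUT>
```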

### CUDA grid backend with high angular momenta basis sets

The CP2K grid CUDA backend is currently buggy on Alps. Using basis sets with high angular momenta ($l \ge 3$) results in slow calculations, especially for force calculations with meta-GGA functionals.

As a workaround, you can disable CUDA acceleration for the grid backend:

```Fortran
&GLOBAL
    &GRID
        BACKEND CPU
    &END GRID
&END GLOBAL
```

??? info "Fix available upon request"

A fix for this issue for the HIP backend is currently being tested by CSCS engineers. If you would like to test it,
please contact us and we will be able to provide the source code. The fix will eventually land on the upstream
[CP2K] repository.