HPC scale-up: best parameter selection

Hi everyone, I’m running some tests on an HPC cluster and I noticed that performance worsens as I increase the number of cores (I get the best results with 8 cores). Are there parameters I should turn on when building BDM from source?
Here is some data from the same simulation performed with 8 cores and with 32 cores.

  • 8 cores:
    ThreadInfo:
    max_threads : 8
    num_numa nodes : 4
    thread to numa mapping : 0 0 0 0 0 0 0 0
    thread id in numa node : 0 1 2 3 4 5 6 7
    num threads per numa : 8 0 0 0
    Agents per numa node
    numa node 0 → size: 365
    numa node 1 → size: 0
    numa node 2 → size: 0
    numa node 3 → size: 0
    Nodes: 1
    Cores per node: 8
    CPU Utilized: 00:02:26
    CPU Efficiency: 62.93% of 00:03:52 core-walltime
    Job Wall-clock time: 00:00:29
    Memory Utilized: 6.30 MB
    Memory Efficiency: 0.32% of 1.95 GB

  • 32 cores:
    ThreadInfo:
    max_threads : 32
    num_numa nodes : 4
    thread to numa mapping : 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
    thread id in numa node : 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9
    num threads per numa : 11 11 10 0
    Agents per numa node
    numa node 0 → size: 126
    numa node 1 → size: 125
    numa node 2 → size: 114
    numa node 3 → size: 0
    Nodes: 1
    Cores per node: 32
    CPU Utilized: 00:26:04
    CPU Efficiency: 87.28% of 00:29:52 core-walltime
    Job Wall-clock time: 00:00:56
    Memory Utilized: 6.29 MB
    Memory Efficiency: 0.08% of 7.81 GB

Thank you!

Hi @nicogno,

From the output, it seems that you’re running with a relatively low number of agents (365). Currently, the default batch size for a CPU thread is set to 1000 agents, which means that each thread is responsible for 1000 agents at a time. You could try reducing this by changing the parameter Param::scheduling_batch_size to, for example, 10.
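To see why the default can hurt at this scale: the number of schedulable work batches is ⌈num_agents / batch_size⌉, and at most that many threads can be doing agent work at once. A quick back-of-the-envelope check for 365 agents (plain shell arithmetic, not BioDynaMo code):

```shell
#!/bin/bash
# With 365 agents, the number of schedulable batches is ceil(agents / batch_size);
# at most that many threads can work on agents simultaneously.
agents=365
for batch in 1000 100 10; do
  batches=$(( (agents + batch - 1) / batch ))
  echo "batch_size=$batch -> at most $batches busy thread(s)"
done
# batch_size=1000 -> at most 1 busy thread(s)
# batch_size=100 -> at most 4 busy thread(s)
# batch_size=10 -> at most 37 busy thread(s)
```

So with the default of 1000, a single thread handles all 365 agents while the others idle; dropping the batch size to 10 lets up to 37 threads share the agent work.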

Cheers,
Ahmad

Hi @ahesam, thank you for your support.
I changed the scheduling_batch_size parameter to 10 and then to 100, but the performance is still worse than what I observed running the same simulation on my laptop (8 cores). Here are the results from the usual simulation with 32 and 64 cores, run with scheduling_batch_size = 10.

  • 32 cores:
    ThreadInfo:
    max_threads : 32
    num_numa nodes : 4
    thread to numa mapping : 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
    thread id in numa node : 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
    num threads per numa : 8 8 8 8
    Agents per numa node
    numa node 0 → size: 92
    numa node 1 → size: 91
    numa node 2 → size: 91
    numa node 3 → size: 91
    Nodes: 1
    Cores per node: 32
    CPU Utilized: 00:27:54
    CPU Efficiency: 75.82% of 00:36:48 core-walltime
    Job Wall-clock time: 00:01:09
    Memory Utilized: 6.10 MB
    Memory Efficiency: 0.08% of 7.81 GB

  • 64 cores:
    ThreadInfo:
    max_threads : 64
    num_numa nodes : 4
    thread to numa mapping : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    thread id in numa node : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    num threads per numa : 24 24 16 0
    Agents per numa node
    numa node 0 → size: 138
    numa node 1 → size: 136
    numa node 2 → size: 91
    numa node 3 → size: 0
    Nodes: 1
    Cores per node: 64
    CPU Utilized: 01:39:17
    CPU Efficiency: 84.62% of 01:57:20 core-walltime
    Job Wall-clock time: 00:01:50
    Memory Utilized: 6.34 MB
    Memory Efficiency: 0.04% of 15.62 GB

Since the number of agents is relatively low, could the issue be related to the diffusion ops? The simulation instantiates ~10 different diffusion grids. On my laptop the same simulation runs in approximately 20–25 seconds.

Thank you.

Cheers,
Nicolò

Hey @nicogno ,

To get additional insights into the runtime of the different operations, set Param::statistics to true.
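This parameter can also be set without recompiling, e.g. through a JSON config file passed via --config (a sketch; I’m assuming the usual `bdm::Param` group name used in BioDynaMo JSON configs here):

```json
{
  "bdm::Param": {
    "statistics": true
  }
}
```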
BioDynaMo will then generate output similar to the example below at the end of the simulation:

***********************************************
Simulation Metadata:
***********************************************

General
Command                         : ./cell-grow-divide --config=../bdm.json --config=../scalability.json --config=../500-iterations.json 
Simulation name                 : cell-grow-divide
Total simulation runtime        : 109397 ms
Peak memory usage (MB)          : 5716.78
Number of iterations executed   : 501
Number of agents                : 12662138
Output directory                : output/cell-grow-divide
  size                          : 4.0K
BioDynaMo version:              : v0.9-65-ga475b314

***********************************************

Total execution time per operation
agent ops: 83217
behavior: 0
bound space: 1
diffusion: 2
discretization: 0
load balancing: 2538
mechanical forces: 2
propagate staticness: 1
set up iteration: 236
tear down iteration: 476
update environment: 16161
update staticness: 6
visualize: 0
...

Could you please rerun your experiments with this parameter?

Lukas

Hi @lukas,

currently I can only set Param::statistics to true for the test simulations that I run on the login node, which prints out the following (with Param::scheduling_batch_size set to 10):

Simulation Metadata:
***********************************************

General
Command				: ./build/alveoli_new 
Simulation name			: alveoli_new
Total simulation runtime	: 20949 ms
Peak memory usage (MB)		: 466.488
Number of iterations executed	: 10000
Number of agents		: 365
Diffusion grids
  FGF-2:
	Resolution		: 8
	Size			: 275 x 275 x 275
	Voxels			: 512

  ...

Output directory		: output/alveoli_new
  size				: 512
BioDynaMo version:		: v0.9-28-g63a248bf

***********************************************

Total execution time per operation
agent ops: 1274
boxes_to_alveoli: 1
count_cells_temp: 417
diffusion: 9600
load balancing: 47
measure_subs_concentration: 1
monocytes_flux: 58
set up iteration: 485
substance_depletion: 1394
tear down iteration: 432
update environment: 2236
visualize: 1

***********************************************

Thread Info
max_threads		: 96
num_numa nodes		: 4
thread to numa mapping	: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
thread id in numa node	: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
num threads per numa	: 24 24 24 24 

***********************************************

Agents per numa node
numa node 0 -> size: 92
numa node 1 -> size: 91
numa node 2 -> size: 91
numa node 3 -> size: 91

If I run the same simulation on the cluster with Param::statistics set to true, I get

*** Break *** segmentation violation
 Generating stack trace...
 0x000014b5147391e7 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b5147407f5 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514748c8e in cling::DeclUnloader::VisitFunctionDecl(clang::FunctionDecl*) + 0x1ce from /home/nc/root_install/lib/libCling.so
 0x000014b51470bad5 in cling::TransactionUnloader::unloadDeserializedDeclarations(cling::Transaction*, cling::DeclUnloader&) + 0x115 from /home/nc/root_install/lib/libCling.so
 0x000014b51470bed6 in cling::TransactionUnloader::RevertTransaction(cling::Transaction*) + 0x256 from /home/nc/root_install/lib/libCling.so
 0x000014b5146f1a04 in cling::Interpreter::unload(cling::Transaction&) + 0x174 from /home/nc/root_install/lib/libCling.so
 0x000014b514779ac1 in cling::IncrementalParser::commitTransaction(llvm::PointerIntPair<cling::Transaction*, 2u, cling::IncrementalParser::EParseResult, llvm::PointerLikeTypeTraits<cling::Transaction*>, llvm::PointerIntPairInfo<cling::Transaction*, 2u, llvm::PointerLikeTypeTraits<cling::Transaction*> > >&, bool) at IncrementalParser.cpp:? from /home/nc/root_install/lib/libCling.so
 0x000014b51477bd8b in cling::IncrementalParser::Compile(llvm::StringRef, cling::CompilationOptions const&) + 0x5b from /home/nc/root_install/lib/libCling.so
 0x000014b5146ef99e in cling::Interpreter::DeclareInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions const&, cling::Transaction**) const + 0x3e from /home/nc/root_install/lib/libCling.so
 0x000014b5146efab5 in cling::Interpreter::parseForModule(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x55 from /home/nc/root_install/lib/libCling.so
 0x000014b51461ddb0 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b5146224a7 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514628843 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514682d15 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b5146f87c0 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514f7f020 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b515555b12 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b5155562f5 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514fd708a in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b5150c0786 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514e35fc3 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514e386b6 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514e30e28 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514e33e35 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514e262f1 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514e2e599 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514e2e7e3 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514e2e8d5 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514de39cc in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514e36921 in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b514702955 in cling::LookupHelper::findScope(llvm::StringRef, cling::LookupHelper::DiagSetting, clang::Type const**, bool) const + 0x505 from /home/nc/root_install/lib/libCling.so
 0x000014b5146260ec in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b51462d14b in <unknown> from /home/nc/root_install/lib/libCling.so
 0x000014b52e2ad8d7 in TClassEdit::TSplitType::ShortType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int) + 0x257 from /home/nc/root_install/lib/libCore.so
 0x000014b52e2ae660 in TClassEdit::GetNormalizedName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::experimental::fundamentals_v1::basic_string_view<char, std::char_traits<char> >) at TClassEdit.cxx:? from /home/nc/root_install/lib/libCore.so
 0x000014b52e2c9aad in TClass::GetClass(char const*, bool, bool, unsigned long, unsigned long) + 0x43d from /home/nc/root_install/lib/libCore.so
 0x000014b52e2da6b4 in TClassRef::InternalGetClass() const + 0x34 from /home/nc/root_install/lib/libCore.so
 0x000014b52e2bd61e in TBaseClass::GetClassPointer(bool) + 0x2e from /home/nc/root_install/lib/libCore.so
 0x000014b52e2c29f1 in TClass::GetBaseClassOffsetRecurse(TClass const*) + 0x71 from /home/nc/root_install/lib/libCore.so
 0x000014b52db4c894 in TBufferJSON::JsonSpecialClass(TClass const*) const + 0x64 from /home/nc/root_install/lib/libRIO.so
 0x000014b52db66c83 in TBufferJSON::JsonWriteObject(void const*, TClass const*, bool) + 0xa3 from /home/nc/root_install/lib/libRIO.so
 0x000014b52db6ab40 in TBufferJSON::WriteFastArray(void*, TClass const*, int, TMemberStreamer*) + 0x100 from /home/nc/root_install/lib/libRIO.so
 0x000014b52dd96555 in int TStreamerInfo::WriteBufferAux<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) at TStreamerInfoWriteBuffer.cxx:? from /home/nc/root_install/lib/libRIO.so
 0x000014b52dc0828d in TStreamerInfoActions::GenericWriteAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) + 0x3d from /home/nc/root_install/lib/libRIO.so
 0x000014b52db48547 in TBufferText::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) + 0xc7 from /home/nc/root_install/lib/libRIO.so
 0x000014b52db48dc4 in TBufferText::WriteClassBuffer(TClass const*, void*) + 0x74 from /home/nc/root_install/lib/libRIO.so
 0x000014b52db66f3c in TBufferJSON::JsonWriteObject(void const*, TClass const*, bool) + 0x35c from /home/nc/root_install/lib/libRIO.so
 0x000014b52db68157 in TBufferJSON::StoreObject(void const*, TClass const*) + 0x47 from /home/nc/root_install/lib/libRIO.so
 0x000014b52db69407 in TBufferJSON::ConvertToJSON(void const*, TClass const*, int, char const*) + 0x167 from /home/nc/root_install/lib/libRIO.so
 0x000014b52e80e163 in bdm::Param::ToJsonString[abi:cxx11]() const at /home/nc/root_install/include/TString.h:229 from /home/nc/biodynamo-v0.9.28/lib/libbiodynamo.so
 0x000014b52e833a8d in bdm::operator<<(std::basic_ostream<char, std::char_traits<char> >&, bdm::Simulation&) at /usr/include/c++/8/bits/basic_string.h:6328 from /home/nc/biodynamo-v0.9.28/lib/libbiodynamo.so
 0x000014b52e834076 in bdm::Simulation::~Simulation() at /usr/include/c++/8/ostream:113 from /home/nc/biodynamo-v0.9.28/lib/libbiodynamo.so
 0x000014b52f00c916 in bdm::Simulate(int, char const**) + 0x3626 from /work/home/nc/alveoli_new/build/libalveoli_new.so
 0x000014b5287157b3 in __libc_start_main + 0xf3 from /lib64/libc.so.6
 0x000000000040086e in _start + 0x2e from /home/nc/alveoli_new/build/alveoli_new
srun: error: mpsc0037: task 0: Exited with exit code 139

Thank you,

Nicolò

Hi everyone, the issue was most likely due to differences between the login node environment and the cluster node environment.
It was solved by following the steps below:

  1. The custom ROOT version (v6.24) installed in the home directory (on the login node) was removed and replaced by ROOT v6.22. The latter was installed using the script provided with the BioDynaMo installer (util/build-third-party in the BioDynaMo/biodynamo GitHub repository) by running the command ./util/build-third-party/build-root.sh 6.22.06

  2. The variables CC and CXX were unset and Python 3.9.1 was installed using pyenv (https://biodynamo.org/docs/userguide/prerequisites/) on the cluster node

  3. The latest version of BioDynaMo was downloaded from the BioDynaMo/biodynamo GitHub repository and built on the cluster node

  4. The following script was used to perform steps 2 and 3 and to run the BioDynaMo tests

     #!/bin/bash
     unset CC
     unset CXX
     gcc --version
     g++ --version
     export PATH="$HOME/.pyenv/bin:$PATH"
     eval "$(pyenv init -)"
     pyenv shell 3.9.1
     . $HOME/bdm-build-third-party/root-install/bin/thisroot.sh
     cd $HOME/biodynamo/
     rm -rf build
     mkdir build
     cd build
     cmake -Dparaview=off .. 
     make -j10
     . bin/thisbdm.sh
     bin/biodynamo-unit-tests
    
  5. The tests ran without any problems, but the script above showed that the libnuma development package was missing from the cluster nodes, so the script was modified as follows and run using sbatch (tumor_concept was used as a test case: https://biodynamo.org/docs/userguide/tumor_concept/)

     #!/bin/bash
     #SBATCH ...
     ....
     export OMP_NUM_THREADS=1
     module load gcc/8.3.1
     module load git/2.29.2
     module load valgrind/3.15.0
     module load openucx/1.7.0
     module load openmpi/4.0.5
     module load cmake/3.19.3
     export PATH="$HOME/.pyenv/bin:$PATH"
     eval "$(pyenv init -)"
     pyenv shell 3.9.1
     source $HOME/bdm-build-third-party/root-install/bin/thisroot.sh
     cd $HOME/biodynamo/
     rm -rf build || true
     mkdir build
     cd build
     cmake -Dtest=off -Dparaview=off -Dnuma=off ..
     make -j96
     source bin/thisbdm.sh
     cd $HOME/tumor_concept
     rm -rf build || true
     mkdir build
     cd build
     cmake -Dnuma=off .. 
     make -j96
     ./tumor_concept
    

Using the above script, the tumor_concept simulation ran without problems and the simulation metadata were printed.
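As an aside, the missing libnuma development files from step 5 can be detected up front with a quick compile probe (a sketch; the exact package name varies by distribution, e.g. numactl-devel or libnuma-dev):

```shell
#!/bin/bash
# Try to compile and link a trivial program against libnuma; if this fails,
# either install the development package or configure BioDynaMo with -Dnuma=off.
if printf '#include <numa.h>\nint main(void){return 0;}\n' \
    | gcc -x c - -lnuma -o /dev/null 2>/dev/null; then
  echo "libnuma development files: found"
else
  echo "libnuma development files: missing"
fi
```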
