11.4.4. Working with the acoll Collective Component
11.4.4.1. Introduction
The acoll (AMD Collective) component is a high-performance MPI collective implementation optimized for AMD Zen-based processors. At present, acoll provides optimized implementations of the following commonly used collective operations:
MPI_Allgather
MPI_Allreduce
MPI_Alltoall
MPI_Barrier
MPI_Bcast
MPI_Gather
MPI_Reduce
The component uses topology-aware algorithms that leverage subgroups, NUMA domains, and socket hierarchies to achieve optimal performance on AMD Zen architectures.
11.4.4.2. Enabling the acoll Component
To enable the acoll component, raise its priority (default: 0) above that of the other collective components:
shell$ mpirun <common mpi runtime parameters> --bind-to core \
--mca coll acoll,tuned,libnbc,basic --mca coll_acoll_priority 40 <executable>
The component activates only for communicators that meet the minimum size threshold (default: 16 ranks). Use --bind-to core for best performance.
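The same selection can also be made through the environment, because Open MPI reads any MCA parameter from a variable named OMPI_MCA_<parameter>. The following is a minimal sketch, not a recommended launch script: the executable ./my_app and the rank count of 16 (picked to match the default threshold) are assumptions.

```shell
# Environment equivalent of the --mca options shown above.
export OMPI_MCA_coll=acoll,tuned,libnbc,basic
export OMPI_MCA_coll_acoll_priority=40

# Launch only if mpirun is on PATH and the assumed binary exists.
# The rank count (16) is an assumption chosen to match the default threshold.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    mpirun -np 16 --bind-to core ./my_app
fi
```

Options given on the mpirun command line take precedence over these variables, so an exported value acts as a per-shell default.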
11.4.4.3. MCA Parameters
11.4.4.3.1. Core Configuration
| Parameter | Default | Description |
|---|---|---|
| coll_acoll_priority | 0 | Component selection priority |
| | 16 | Minimum communicator size for which the component activates |
11.4.4.3.2. Topology Parameters
| Parameter | Default | Description |
|---|---|---|
| | 10 | Maximum number of nested layers of subcommunicators to be derived by |
| | -1 | Force NUMA-based subgroups in the hierarchical version of the broadcast collective: enable (1), disable (0), or auto (-1). With the default (-1), NUMA-based subgroups are enabled based on communicator size and message size. |
| | -1 | Enable (1) or disable (0) socket-based hierarchy instead of node-based hierarchy in the broadcast and allreduce collectives. With the default (-1), the hierarchy is pre-configured based on message size and communicator size. |
11.4.4.3.3. Runtime Optimization Parameters
| Parameter | Default | Description |
|---|---|---|
| coll_acoll_use_dynamic_rules | 0 | Dynamically select algorithms for the broadcast collective. If set to 1, uses the hierarchical algorithm specified by the command-line arguments. |
| | 0 | If set to 1, disables the shared-memory data-copy based broadcast collective. |
| coll_acoll_bcast_lin0 | 0 | Choose binomial (0) or flat-tree (1) topology for the first stage of the hierarchical broadcast collective. |
| coll_acoll_bcast_lin1 | 0 | Choose binomial (0) or flat-tree (1) topology for the second stage of the hierarchical broadcast collective. |
| coll_acoll_bcast_lin2 | 0 | Choose binomial (0) or flat-tree (1) topology for the third stage of the hierarchical broadcast collective. |
| | 0 | If set to 1, uses 2 stages instead of 3 in the hierarchical broadcast collective. |
| | 0 | Barrier algorithm selection for the intra-node case: shared-memory hierarchical algorithm (0), shared-memory flat algorithm (1), non-shared-memory algorithm (any other value). This parameter is ignored for multi-node cases. |
| coll_acoll_alltoall_split_factor | 0 | Degree of parallelism used by the parallel-split alltoall algorithm. 0 (default) uses a pre-configured value based on communicator size, message size, and mapping pattern; the supported values are 2, 4, 8, 16, 32, and 64. |
| | 0 | Message size threshold (bytes) below which parallel-split alltoall is enabled. 0 (default) uses a pre-configured value. |
| | 0 | Disable Shared Memory Single Copy (SMSC) based algorithms when set to 1. Applies to the allreduce and reduce collectives. |
| | 1 | Use application send/recv buffers for SMSC registration instead of temporary buffers. This parameter is applicable when SMSC is enabled using |
| | 4 MB for each rank | Maximum size (bytes) for temporary SMSC buffers (default: 4 MB). This parameter is applicable when SMSC is enabled and |
| | 1 | If set to 0, disable allocation of reserved memory for use in |
| | 4 MB for each rank | Use to specify the amount of reserved memory to be allocated in |
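The parameter names, current values, and help strings of the installed build can be read directly with ompi_info. The following is a small sketch: list_acoll_params is a helper name chosen here for illustration, and the output is only informative when Open MPI was built with the acoll component.

```shell
# Print every acoll MCA parameter with its value and help text.
# --level 9 asks ompi_info to show parameters at all levels.
list_acoll_params() {
    if command -v ompi_info >/dev/null 2>&1; then
        ompi_info --param coll acoll --level 9
    else
        echo "ompi_info not found on PATH" >&2
        return 1
    fi
}

# Do not abort the calling script when Open MPI is not installed.
list_acoll_params || true
```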
11.4.4.4. Usage Examples
11.4.4.4.1. Basic Usage
Enable acoll with default settings:
shell$ mpirun --mca coll acoll,tuned,libnbc,basic \
--mca coll_acoll_priority 40 ./my_app
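For repeated runs, the same two settings can be kept in an MCA parameter file instead of being typed on each command line. The sketch below writes the file to a temporary path so that nothing existing is overwritten. The variable name conf is an assumption; the per-user location $HOME/.openmpi/mca-params.conf is the usual default, but check it for your installation.

```shell
# Write the acoll selection to a scratch file, one "name = value" per line.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
coll = acoll,tuned,libnbc,basic
coll_acoll_priority = 40
EOF

# Review the file, then merge it into the per-user parameter file by hand.
echo "wrote $conf; merge into \$HOME/.openmpi/mca-params.conf after review"
```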
11.4.4.4.2. Multi-Node Configuration
For multi-node runs, enable dynamic rules:
shell$ mpirun --mca coll acoll,tuned,libnbc,basic \
--mca coll_acoll_priority 40 \
--mca coll_acoll_bcast_lin0 0 \
--mca coll_acoll_bcast_lin1 1 \
--mca coll_acoll_bcast_lin2 1 \
--mca coll_acoll_use_dynamic_rules 1 ./my_app
If coll_acoll_bcast_lin0, coll_acoll_bcast_lin1, or coll_acoll_bcast_lin2 is not specified when coll_acoll_use_dynamic_rules is set to 1, the default value (0) is used for that parameter.
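The fallback behaviour can be mirrored in a wrapper script so that the effective values are always visible on the command line. This is an illustrative sketch: build_bcast_cmd and its positional arguments are names invented here, and the function only prints the command rather than running it.

```shell
# Print an mpirun command line for hierarchical broadcast tuning.
# Usage: build_bcast_cmd <app> [lin0] [lin1] [lin2]
# Any omitted linN argument falls back to 0, matching the component default.
build_bcast_cmd() {
    app="$1"
    lin0="${2:-0}"
    lin1="${3:-0}"
    lin2="${4:-0}"
    echo "mpirun --mca coll acoll,tuned,libnbc,basic" \
         "--mca coll_acoll_priority 40" \
         "--mca coll_acoll_bcast_lin0 $lin0" \
         "--mca coll_acoll_bcast_lin1 $lin1" \
         "--mca coll_acoll_bcast_lin2 $lin2" \
         "--mca coll_acoll_use_dynamic_rules 1 $app"
}

# Set only coll_acoll_bcast_lin1 to 1; lin0 is given as 0 and lin2 defaults to 0.
build_bcast_cmd ./my_app 0 1
```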
11.4.4.4.4. Alltoall Tuning
Enable the parallel-split alltoall algorithm with a split factor of 4:
shell$ mpirun --mca coll acoll,tuned,libnbc,basic \
--mca coll_acoll_priority 40 \
--mca coll_acoll_alltoall_split_factor 4 ./my_app
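Because only a fixed set of split factors is supported (2, 4, 8, 16, 32, and 64, plus 0 for the pre-configured value), a launch script may want to check the value before passing it on. The following is a minimal sketch; valid_split_factor and the variable factor are hypothetical names.

```shell
# Succeed if the argument is an accepted split factor.
# 0 selects the pre-configured value; the others are the supported factors.
valid_split_factor() {
    case "$1" in
        0|2|4|8|16|32|64) return 0 ;;
        *) return 1 ;;
    esac
}

factor=4
if valid_split_factor "$factor"; then
    # Emit the option to append to the mpirun command line.
    echo "--mca coll_acoll_alltoall_split_factor $factor"
else
    echo "unsupported split factor: $factor" >&2
fi
```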