11.4.1. Available Collective Components
Open MPI’s coll framework provides a number of components
implementing collective communication, each of which targets a
different environment or scenario. Some of these components may not
be available depending on how Open MPI was compiled and what hardware
is available on the system. A run-time decision based on each
component’s self-reported priority selects which component will be
used. These priorities may be adjusted on the command line or with
any of the other usual ways of setting MCA variables, giving us a way
to influence or override component selection. In the end, which of
the available components is selected depends on a number of factors
such as the underlying hardware and whether or not a specific
collective is provided by the component, since not all components
implement all collectives. However, there is always a fallback
basic component that steps in and takes over when another component
fails to provide an implementation.
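The selection logic described above can be sketched as a small conceptual model. This is not Open MPI's implementation: the function name select_components, the priority values, and the sets of implemented collectives are all assumptions made up for illustration. Only the component names and the idea of a priority-driven choice with a basic fallback come from the text.

```python
# Conceptual model only: this is NOT Open MPI's implementation.
# Component names come from this section; the priorities and the sets of
# implemented collectives below are hypothetical values for illustration.

def select_components(components, collectives, fallback="basic"):
    """Pick, for each collective, the highest-priority component that implements it."""
    selection = {}
    for coll in collectives:
        candidates = [
            (info["priority"], name)
            for name, info in components.items()
            if coll in info["implements"]
        ]
        # max() on (priority, name) tuples picks the highest priority (ties are
        # broken by name here, which is an arbitrary choice of this sketch).
        # The fallback is used only when no component offers the collective.
        selection[coll] = max(candidates)[1] if candidates else fallback
    return selection


# Hypothetical priorities and capabilities, chosen only for this example.
components = {
    "tuned": {"priority": 30, "implements": {"bcast", "allreduce", "barrier"}},
    "han": {"priority": 35, "implements": {"bcast", "allreduce"}},
    "basic": {"priority": 10,
              "implements": {"bcast", "allreduce", "barrier", "alltoallw"}},
}

selection = select_components(
    components, ["bcast", "allreduce", "barrier", "alltoallw"]
)
for coll, name in selection.items():
    print(f"{coll} -> {name}")

# Raising a priority changes the outcome, which mirrors the effect of
# adjusting a component's priority through an MCA variable.
components["tuned"]["priority"] = 100
print(select_components(components, ["bcast"]))
```

In this model, different collectives end up with different components simply because each component advertises a different set of operations, which is the behavior the paragraph above describes.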
The following list shows the components and their primary target scenarios:
han: component providing hierarchical algorithms.
libnbc: component providing non-blocking collective operations based on a modified version of the libNBC library.
self: component providing single-process collective algorithms.
tuned: component providing fine-grained mechanisms to switch between algorithms for each operation and message size. See The tuned Component for more details.
ucc: component using the UCC library for collective operations. See The ucc Component for more details.
xhc: shared memory collective component, employing hierarchical & topology-aware algorithms, with XPMEM for data transfers. See XPMEM Hierarchical Collectives (xhc) for more details.
acoll: collective component tuned for AMD Zen architectures. See Working with the acoll Collective Component for more details.
accelerator: component providing host-proxy algorithms for some collective operations using device buffers.
ftagree: component providing fault-tolerant collective operations.
inter: component providing collective operations for inter-communicators.
basic: component providing basic algorithms, used as a fall-back component.
sync: component used in scenarios where some nodes can be overrun with messages. This component can be used to insert a synchronization point at every n-th execution of a collective operation.
portals4: component targeting Portals4 networks.
Different components can and will be used for different collective operations, since no single component provides implementations for all operations defined in the MPI specification.
11.4.1.1. Displaying collective component selection
Open MPI 6.0.x provides a mechanism to display which component has been selected for a particular collective operation and communicator by setting the verbosity level of the coll_base_verbose MCA variable.
Specifically, the value of coll_base_verbose determines which selections are displayed:
values between 1 and 19: will print the selected component for blocking and non-blocking collectives assigned to MPI_COMM_WORLD, but not for persistent collective operations
value 20: will print the selected component for all blocking and non-blocking collectives on all communicators, but not for persistent collectives
values larger than 20: will print the selected component for all communicators and all collective operations
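The three verbosity ranges above can be restated as a small predicate. This is only a reading aid, not code from Open MPI: the function name is_reported and its parameters are hypothetical, and the behavior for values below 1 (nothing printed) is an assumption, since the list does not cover that case.

```python
def is_reported(verbosity, comm_is_world, persistent):
    """Return True if the component selection for a collective would be
    printed, following the three verbosity ranges listed above."""
    if verbosity < 1:
        # Assumption: values below 1 print nothing (not stated in the text).
        return False
    if verbosity > 20:
        return True  # all communicators, all collective operations
    if persistent:
        return False  # persistent collectives need a value larger than 20
    if verbosity == 20:
        return True  # blocking and non-blocking, on all communicators
    return comm_is_world  # 1..19: MPI_COMM_WORLD only


# A blocking collective on a user-created communicator needs at least 20.
print(is_reported(10, comm_is_world=False, persistent=False))
print(is_reported(20, comm_is_world=False, persistent=False))
```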
Example:
shell$ mpiexec --mca coll_base_verbose 10 -n 4 ./<executable>
...
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 allgather -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 allgatherv -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 allreduce -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 alltoall -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 alltoallv -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 alltoallw -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 barrier -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 bcast -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 exscan -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 gather -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 gatherv -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 reduce -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 reduce_scatter_block -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 reduce_scatter -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 scan -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 scatter -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 scatterv -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_allgather -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_allgatherv -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_alltoall -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_alltoallv -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 neighbor_alltoallw -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 reduce_local -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 iallgather -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 iallgatherv -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 iallreduce -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 ialltoall -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 ialltoallv -> libnbc
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 ialltoallw -> libnbc
Note
While this output can provide valuable information, it might not always accurately reflect which component executes the operation, since some components have built-in logic to call the next component in the priority list if certain conditions are not met. For example, the accelerator collective component uses this mechanism to hand off the execution of an operation to the next component in the priority list if the collective operation invoked does not use device buffers.
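Because the verbose output repeats one line per rank and per collective, it can help to condense it into a per-component summary. The following is a minimal sketch that assumes every relevant line has exactly the format shown in the example above. The names LINE_RE and summarize are hypothetical helpers, not part of Open MPI, and the sample is a short excerpt of the example output.

```python
import re
from collections import defaultdict

# Matches lines of the form shown in the example output, e.g.
# "coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 bcast -> tuned"
# Assumption: lines for other communicators follow the same layout.
LINE_RE = re.compile(
    r"coll:base:comm_select: communicator (?P<comm>.+) rank (?P<rank>\d+) "
    r"(?P<coll>\w+) -> (?P<component>\w+)"
)


def summarize(lines):
    """Group collective names by the component selected for them."""
    by_component = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines that do not match are ignored
            by_component[match["component"]].add(match["coll"])
    # Sets remove the duplicates produced by multiple ranks.
    return {name: sorted(colls) for name, colls in by_component.items()}


sample = """\
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 allgather -> accelerator
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 allgatherv -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 alltoallw -> basic
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 barrier -> tuned
coll:base:comm_select: communicator MPI_COMM_WORLD rank 1 iallgather -> libnbc
""".splitlines()

for component, colls in sorted(summarize(sample).items()):
    print(f"{component}: {', '.join(colls)}")
```

As the note above explains, such a summary reflects which component was selected, not necessarily which component ends up executing the operation.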