11.4.3. The ucc Component

The ucc collective component uses the Unified Collective Communication (UCC) library to offload selected MPI collective operations to UCC. This component is useful on systems where UCC has been configured for the target transport or accelerator environment.

11.4.3.1. Building with UCC

Open MPI must be configured with UCC support:

shell$ ./configure --with-ucc=/path/to/ucc-install

If UCC support is explicitly requested and the UCC headers and library cannot be found, configure aborts. The ucc component is disabled when Open MPI is configured with progress thread support, because the UCC driver does not currently support progress threads.

11.4.3.2. Enabling the Component

The component is not enabled by default. Enable it at run time and give it a high enough priority to be selected:

shell$ mpirun --mca coll_ucc_enable 1 \
              --mca coll_ucc_priority 100 \
              -np 64 ./my_mpi_app

The ucc component is considered only for intracommunicators whose size is at least coll_ucc_np. The default value of coll_ucc_np is 2.

11.4.3.3. UCC Layers and Protocols

For each MPI communicator selected for UCC, Open MPI creates a UCC team: the UCC group object used to initialize and execute collective operations. Inside UCC, collective implementations are selected through two kinds of layers:

  • Collective layers (CLs), such as basic and hier, decide how a collective is decomposed.

  • Team layers (TLs), such as ucp, self, cuda, nccl, rccl, sharp, and mlx5, provide the underlying transport or accelerator implementation.

For example, the ucp TL uses UCX/UCP transports such as InfiniBand, RoCE, and shared memory; sharp uses SHARP in-network collective offload; and nccl or rccl can be used for GPU collectives on CUDA or ROCm memory.

The basic CL is the general-purpose layer. The hier CL can use system hierarchy when it is available; for example, it may split work across NODE and NET subgroups, plus the FULL group, and then pipeline phases through different TLs. A typical hierarchical protocol could use an intra-node reduction, an inter-node operation such as SHARP, and an intra-node broadcast.

The exact CLs, TLs, and algorithms available depend on how UCC was built. Use UCC’s own tools to inspect the installed library:

shell$ ucc_info -s   # Show available CLs and TLs
shell$ ucc_info -A   # Show supported collective algorithms
shell$ ucc_info -caf # Show UCC configuration variables

Open MPI’s coll_ucc_cls MCA parameter is passed to UCC as its CLS setting. It can be used to restrict team creation to specific UCC collective layers, for example:

shell$ mpirun --mca coll_ucc_enable 1 \
              --mca coll_ucc_cls hier \
              ./my_mpi_app

For lower-level TL tuning, use UCC environment variables such as UCC_TL_<NAME>_TUNE or a UCC configuration file. UCC scores TLs based on factors including the collective type, message size, memory type, and team size.

11.4.3.4. Selecting Collective Operations

Use coll_ucc_cts to choose which collective operations the component should provide. By default, the component enables all supported blocking and nonblocking operations.

shell$ mpirun --mca coll_ucc_enable 1 \
              --mca coll_ucc_cts allreduce,iallreduce,bcast,ibcast \
              ./my_mpi_app

Prefix the value with ^ to start from all supported operations and disable specific operations from that set:

shell$ mpirun --mca coll_ucc_enable 1 \
              --mca coll_ucc_cts ^alltoall,ialltoall \
              ./my_mpi_app

The supported operation names are:

  • barrier, bcast, allreduce, alltoall, alltoallv, allgather, allgatherv, reduce, gather, gatherv, reduce_scatter_block, reduce_scatter, scatterv, and scatter

  • ibarrier, ibcast, iallreduce, ialltoall, ialltoallv, iallgather, iallgatherv, ireduce, igather, igatherv, ireduce_scatter_block, ireduce_scatter, iscatterv, and iscatter

The aliases colls_b, colls_i (or colls_nb), and colls_p select all blocking, nonblocking, and persistent collective operations, respectively. Individual persistent collective operations can be selected by adding the _init suffix to the blocking operation name, for example allreduce_init.

11.4.3.5. Other MCA Parameters

Parameter

Default

Description

coll_ucc_enable

0

Enable or disable the component.

coll_ucc_priority

10

Component selection priority.

coll_ucc_verbose

0

Verbosity level for component logging.

coll_ucc_np

2

Minimum communicator size for enabling the component.

coll_ucc_cls

UCC default

Comma-separated list of UCC collective layers to use for team creation, passed to UCC as CLS.

coll_ucc_cts

All supported blocking and nonblocking operations

Comma-separated list of UCC collective types to enable.

11.4.3.6. Verifying Selection

Use coll_base_verbose to check which collective component Open MPI selects for each operation:

shell$ mpirun --mca coll_ucc_enable 1 \
              --mca coll_ucc_priority 100 \
              --mca coll_base_verbose 20 \
              ./my_mpi_app

See Available Collective Components for more details about interpreting collective component selection output.