| Internet-Draft | Topology-Aware Collective Communication | May 2026 |
| Wang, et al. | Expires 14 November 2026 | [Page] |
This document describes a topology-aware method for constructing collective communication schedules in distributed systems. Instead of selecting from a small set of predefined communication algorithms, the method expands a target network topology along a time dimension, tracks per-node data state, and incrementally builds a schedule through candidate-source discovery and link-to-chunk matching. The approach is intended for heterogeneous or asymmetric topologies in which fixed communication patterns often underutilize available links or create avoidable bottlenecks. According to the source material, the resulting schedule is intended for collective communication tasks involving data distribution, aggregation, reduction, and synchronization, including all-gather, reduce-scatter, and all-reduce.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 14 November 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Large-scale distributed training systems rely heavily on collective communication. During training, nodes repeatedly exchange model parameters, gradients, or intermediate results. As model size and cluster size increase, the communication subsystem becomes a major factor in overall job completion time.¶
Many existing implementations choose a communication procedure from a predefined algorithm library. This works reasonably well on regular topologies, but it becomes less effective when the underlying network contains heterogeneous links, asymmetric connectivity, or multi-level structure. In such environments, a fixed algorithm can overload some links while leaving others underused.¶
Other approaches attempt to generate schedules through global optimization or exhaustive search. These methods can produce good results for some inputs, but their construction cost often grows quickly with the number of nodes and links, which makes them harder to use in large systems.¶
This document describes a different construction framework. It uses a time-expanded network to represent both topology and time evolution, models which data chunks are currently available at which nodes, and incrementally builds communication steps until the target communication state is reached.¶
The construction method described in this document is motivated by five practical issues.¶
The source material indicates that the construction method is intended to meet the following goals:¶
This section summarizes the key terms used by the construction method.¶
The method takes as input a topology description, link attributes, a collective communication objective, and an initial data placement. It produces a communication schedule consisting of per-time-layer send and receive actions, together with the derived transfer paths of each chunk.¶
The constructor receives a set of nodes, a set of directed links, and link attributes including link bandwidth. It also receives the target collective mode. The source material explicitly mentions all-gather, reduce-scatter, and all-reduce, and more generally describes distribution, aggregation, reduction, and synchronization tasks.¶
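The inputs listed above can be captured in a small data model. The following Python sketch is illustrative only: the names `Link`, `CollectiveInput`, and `validate_input`, the bandwidth unit, and the validation rules are assumptions of this example and are not defined by the source material.

```python
from dataclasses import dataclass

# Collective modes explicitly mentioned in this document.
MODES = ("all-gather", "reduce-scatter", "all-reduce")


@dataclass(frozen=True)
class Link:
    """A directed link; the Gbit/s unit is an assumption of this sketch."""
    src: int
    dst: int
    bandwidth_gbps: float


@dataclass(frozen=True)
class CollectiveInput:
    """Hypothetical container for the constructor inputs."""
    nodes: tuple
    links: tuple
    mode: str


def validate_input(inp: CollectiveInput) -> None:
    """Raise ValueError if the input is not internally consistent."""
    known = set(inp.nodes)
    if inp.mode not in MODES:
        raise ValueError(f"unknown collective mode: {inp.mode}")
    for link in inp.links:
        if link.src not in known or link.dst not in known:
            raise ValueError(f"link references unknown node: {link}")
        if link.bandwidth_gbps <= 0:
            raise ValueError(f"non-positive bandwidth: {link}")
```

A three-node ring with one slower link could then be written as `CollectiveInput(nodes=(0, 1, 2), links=(Link(0, 1, 100.0), Link(1, 2, 100.0), Link(2, 0, 25.0)), mode="all-gather")`.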
The communication objective is expressed as a desired postcondition over chunks. The postcondition defines which nodes are expected to hold which chunks after the collective operation completes.¶
Before schedule construction begins, the payload is partitioned into chunks. Each chunk is treated as an independent schedulable item.¶
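The source material does not state how chunk boundaries are chosen. As one possible illustration, the Python sketch below splits a byte payload into a requested number of near-equal chunks. The function name `partition_payload` and the near-equal sizing policy are assumptions of this example.

```python
def partition_payload(payload: bytes, num_chunks: int) -> list:
    """Split payload into num_chunks contiguous chunks of near-equal size.

    Assumption of this sketch: the first (len % num_chunks) chunks are
    one byte longer than the rest.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    base, extra = divmod(len(payload), num_chunks)
    chunks = []
    offset = 0
    for index in range(num_chunks):
        size = base + (1 if index < extra else 0)
        chunks.append(payload[offset:offset + size])
        offset += size
    return chunks
```

Concatenating the returned chunks in order reproduces the original payload, so each chunk can be scheduled independently and reassembled at the receiver.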
The constructor initializes a precondition describing which chunks are currently available at each node. It also initializes the postcondition that describes the desired final state. These two state descriptions provide the basis for deciding whether construction is complete and which transfers are still needed.¶
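The two state descriptions can be sketched as mappings from node to the set of chunk identifiers held or required. The Python example below uses all-gather for illustration and assumes that chunk i initially resides on node i; that placement, and the names `all_gather_conditions`, `unsatisfied`, and `is_complete`, are assumptions of this example.

```python
def all_gather_conditions(nodes):
    """Return (precondition, postcondition) for an all-gather.

    Assumption of this sketch: chunk i starts on node i.
    """
    precondition = {n: {n} for n in nodes}
    postcondition = {n: set(nodes) for n in nodes}
    return precondition, postcondition


def unsatisfied(precondition, postcondition):
    """List the (node, chunk) pairs that are required but not yet held."""
    return sorted(
        (node, chunk)
        for node, wanted in postcondition.items()
        for chunk in wanted - precondition.get(node, set())
    )


def is_complete(precondition, postcondition):
    """Construction is complete when no required chunk is missing."""
    return not unsatisfied(precondition, postcondition)
```

In this encoding, the list returned by `unsatisfied` is the set of transfers still needed, and `is_complete` is the termination test.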
The original topology is expanded into discrete time layers. Each original node is represented by a sequence of node copies, one for each layer. For each layer t, a directed link from node u to node v in the original topology becomes a temporal edge from the copy of u in layer t to the copy of v in layer t+1; a chunk sent over that link during layer t is available at v in layer t+1.¶
This representation captures spatial connectivity and temporal progression in one model. The source material describes a layer-by-layer expansion process rather than requiring a fixed final time horizon in advance.¶
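The layer-by-layer expansion can be sketched as a lazy generator that produces temporal edges for one layer at a time, so that no final time horizon has to be chosen in advance. In the Python example below, a node copy is written as a `(node, layer)` pair; that encoding and the names `expand_layer` and `layers` are assumptions of this example.

```python
from itertools import count, islice


def expand_layer(links, t):
    """Return the temporal edges that connect layer t to layer t + 1."""
    return [((u, t), (v, t + 1)) for (u, v) in links]


def layers(links):
    """Lazily yield one layer of temporal edges at a time (no fixed horizon)."""
    for t in count():
        yield expand_layer(links, t)


# Usage: materialize only the first two layers of a two-link chain.
first_two = list(islice(layers([(0, 1), (1, 2)]), 2))
```

Because the generator is unbounded, the caller decides when to stop, for example once the postcondition is satisfied.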
For each unsatisfied target, the constructor searches for candidate sources that already hold the required chunk and can reach the destination through a valid temporal edge in the current layer.¶
According to the source material, this process starts from the target node and traces reachable links in the current time layer to identify source nodes that already hold the required chunk. The candidate set therefore reflects both current chunk availability and current temporal connectivity.¶
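A minimal Python sketch of this discovery step follows. It starts from the target node, looks at the links that enter it, and keeps the senders that already hold the required chunk. The function name `candidate_sources` and the representation of node state as a dictionary of sets are assumptions of this example.

```python
def candidate_sources(target, chunk, links, holdings):
    """Return nodes that hold `chunk` and have a directed link into `target`.

    links:    iterable of (src, dst) pairs usable in the current layer
    holdings: dict mapping node -> set of chunks held in the current layer
    """
    return sorted(
        src
        for (src, dst) in links
        if dst == target and chunk in holdings.get(src, set())
    )
```

For example, with links `[(0, 2), (1, 2), (3, 2), (2, 0)]` and chunk `"a"` held by nodes 0 and 3 only, the candidates for target 2 are nodes 0 and 3. Node 1 is excluded because it lacks the chunk, and the link from 2 to 0 is ignored because it points away from the target.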
After candidate sources have been identified, the constructor selects feasible transfers for the current time layer. A feasible transfer binds one chunk to one directed link from one source node to one destination node.¶
According to the source material, valid matching respects link direction and only uses links that are not already occupied in the current layer. The source material also notes that alternative embodiments may consider path length, node load, or link utilization when adjusting matching decisions.¶
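The two constraints named above, link direction and at most one chunk per link per layer, can be illustrated with a simple first-fit matcher. The Python sketch below is not the matching procedure of the source material: the first-fit policy, the processing order, and the name `match_layer` are assumptions of this example, and link bandwidth is ignored.

```python
def match_layer(links, holdings, needs):
    """Bind chunks to free links for one time layer (first-fit sketch).

    links:    list of directed (src, dst) pairs
    holdings: dict mapping node -> set of chunks held in this layer
    needs:    list of (dst, chunk) pairs that are still unsatisfied
    Returns a list of (src, dst, chunk) transfers.
    """
    occupied = set()   # links already carrying a chunk in this layer
    transfers = []
    for dst, chunk in needs:
        for (u, v) in links:
            if v != dst or (u, v) in occupied:
                continue  # wrong direction or link already in use
            if chunk in holdings.get(u, set()):
                occupied.add((u, v))
                transfers.append((u, v, chunk))
                break
    return transfers
```

First-fit is sensitive to processing order. With links `[(0, 1), (2, 1)]`, node 0 holding chunks `a` and `b`, and node 2 holding only `a`, processing the needs in the order `b`, `a` delivers both chunks to node 1 in one layer, whereas the order `a`, `b` delivers only `a`. This is one reason an implementation might weigh additional criteria such as path length, node load, or link utilization.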
Once transfers have been selected for the current layer, the constructor updates node state for the next layer. Any chunk that is successfully delivered becomes part of the receiving node's precondition in the following layer.¶
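The state update can be sketched as a pure function from the current layer's state and the selected transfers to the next layer's state. In the Python example below, the name `apply_transfers` and the check that a sender held the chunk at the start of the layer are assumptions of this example.

```python
def apply_transfers(holdings, transfers):
    """Return the node state for the next layer.

    holdings:  dict mapping node -> set of chunks held in the current layer
    transfers: list of (src, dst, chunk) selected for the current layer
    """
    # Copy so that the current layer's state is left unchanged.
    next_holdings = {node: set(chunks) for node, chunks in holdings.items()}
    for src, dst, chunk in transfers:
        # A sender must already hold the chunk at the start of the layer;
        # a chunk received in this layer cannot be forwarded until the next.
        if chunk not in holdings.get(src, set()):
            raise ValueError(f"node {src} does not hold chunk {chunk!r}")
        next_holdings.setdefault(dst, set()).add(chunk)
    return next_holdings
```

Checking against the current layer's state, rather than the partially updated one, keeps each transfer confined to a single hop per layer.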
If unsatisfied targets remain after the update, the constructor extends the time-expanded network and repeats candidate discovery and link-to-chunk matching. This process continues until the postcondition is fully satisfied.¶
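The overall loop can be sketched end to end as follows. This Python example combines a first-fit matcher with the state update and stops when the postcondition is met. The first-fit policy, the `max_layers` guard, the no-progress check, and the name `build_schedule` are assumptions of this example rather than features of the source material.

```python
def build_schedule(nodes, links, pre, post, max_layers=64):
    """Return a list of layers, each a list of (src, dst, chunk) transfers."""
    holdings = {n: set(pre.get(n, ())) for n in nodes}
    schedule = []
    for _ in range(max_layers):
        # Required chunks that are still missing, in a deterministic order.
        needs = sorted(
            (n, c) for n in nodes for c in post.get(n, set()) - holdings[n]
        )
        if not needs:
            return schedule  # postcondition fully satisfied
        occupied, layer = set(), []
        for dst, chunk in needs:
            for (u, v) in links:
                if v == dst and (u, v) not in occupied and chunk in holdings[u]:
                    occupied.add((u, v))
                    layer.append((u, v, chunk))
                    break
        if not layer:
            # State cannot change any further, so later layers cannot help.
            raise RuntimeError("no feasible transfer; target is unreachable")
        # Delivered chunks become available in the next layer.
        for _, v, chunk in layer:
            holdings[v].add(chunk)
        schedule.append(layer)
    raise RuntimeError("max_layers exceeded")


# Usage: all-gather on a four-node unidirectional ring, chunk i on node i.
ring_nodes = [0, 1, 2, 3]
ring_links = [(0, 1), (1, 2), (2, 3), (3, 0)]
ring_pre = {n: {n} for n in ring_nodes}
ring_post = {n: set(ring_nodes) for n in ring_nodes}
ring_schedule = build_schedule(ring_nodes, ring_links, ring_pre, ring_post)
```

On this ring the sketch produces three layers of four transfers each, with every link busy in every layer.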
The final output is a topology-aware communication schedule. According to the source material, the output includes the transfer path of each chunk, the send/receive relationships in each time layer, and the complete collective communication plan.¶
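If the schedule is stored as a list of layers of `(src, dst, chunk)` transfers, the transfer path of a chunk can be derived by recording, for each node, the sender from which it first received that chunk. The Python sketch below assumes that schedule encoding; the name `chunk_path` is hypothetical.

```python
def chunk_path(schedule, chunk, dst):
    """Return the list of nodes that `chunk` traverses to reach `dst`.

    schedule: list of layers, each a list of (src, dst, chunk) transfers
    If `dst` never receives the chunk, the result is just [dst], which is
    also the result when `dst` is the node where the chunk originated.
    """
    first_sender = {}
    for layer in schedule:
        for src, node, c in layer:
            if c == chunk and node not in first_sender:
                first_sender[node] = src
    path = [dst]
    # Walk backwards toward the origin; the membership test guards
    # against cycles in a malformed schedule.
    while path[-1] in first_sender and first_sender[path[-1]] not in path:
        path.append(first_sender[path[-1]])
    path.reverse()
    return path
```

The per-layer send and receive relationships are the layers themselves, so the schedule and the derived paths together cover the outputs listed above.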
The same construction framework can be applied to several collective communication patterns by changing the initial and target state definitions.¶
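As an illustration of how only the state definitions change, the Python sketch below builds precondition and postcondition pairs for two movement-only patterns: an all-gather and a one-to-all distribution from a single root. The function names, the placement of chunk i on node i, and the one-to-all example are assumptions of this example. Collectives that involve reduction, such as reduce-scatter and all-reduce, additionally need a rule for combining contributions, which this sketch does not model.

```python
def all_gather_state(nodes):
    """Each node starts with its own chunk and must end with all chunks."""
    pre = {n: {n} for n in nodes}
    post = {n: set(nodes) for n in nodes}
    return pre, post


def one_to_all_state(nodes, root, chunks):
    """The root starts with every chunk; all nodes must end with every chunk."""
    pre = {n: (set(chunks) if n == root else set()) for n in nodes}
    post = {n: set(chunks) for n in nodes}
    return pre, post
```

Both pairs have the same shape, so the same construction loop can consume either one without modification.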
The source material explicitly lists the node set, link set, link direction, link bandwidth, and target collective mode as inputs to schedule construction. It also describes the generated schedule as an output that can later be read and executed by sending, receiving, and processing data according to the per-layer communication arrangement.¶
The source material further states that, after construction, the resulting communication schedule can be executed layer by layer until the target completion state is reached, and the final result can then be supplied to later training, scheduling, or control modules.¶
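Layer-by-layer execution can be sketched with an in-memory simulation in which each node owns a buffer keyed by chunk identifier. The Python example below is a stand-in for real send and receive operations; the name `execute_schedule` and the buffer layout are assumptions of this example.

```python
def execute_schedule(schedule, buffers):
    """Apply a schedule to per-node buffers, one layer at a time.

    schedule: list of layers, each a list of (src, dst, chunk) transfers
    buffers:  dict mapping node -> dict of chunk id -> payload bytes
    """
    for layer in schedule:
        # Read every outgoing payload before writing any, so that all
        # transfers in a layer observe the state at the start of the layer.
        in_flight = [(dst, chunk, buffers[src][chunk]) for src, dst, chunk in layer]
        for dst, chunk, payload in in_flight:
            buffers[dst][chunk] = payload
    return buffers
```

After the last layer has been applied, the buffers hold the final result that would be handed to later training, scheduling, or control modules.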
This document includes no request to IANA.¶
This document describes a schedule construction method and does not define a new wire protocol or a new security mechanism.¶