Networked Systems for Distributed Machine Learning at Scale
On July 3rd, 2024, I gave a talk at the UCL/Huawei Joint Lab Workshop on "Building Better Protocols for Future Smart Networks", held on UCL's campus in London.
Talk Abstract
Large-scale distributed machine learning training networks increasingly face scaling problems with respect to the effective FLOPS achieved per deployed compute node. Communication bottlenecks can inhibit the effective utilization of expensive GPU resources. The root cause of these performance problems is not insufficient transmission speed or slow servers; it is the structure of the distributed computation and the communication patterns it incurs. Large machine learning workloads typically exhibit relatively asymmetric, and sometimes centralized, communication structures, such as gradient aggregation and model update distribution. Even when training networks are less centralized, the amount of data that must be sent to aggregate several thousand input values through collective communication operations such as AllReduce can lead to incast problems that overload network resources and servers. This talk discusses challenges and opportunities for developing in-network aggregation systems from a distributed computing and networked systems perspective.
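To make the traffic pattern concrete, here is a minimal sketch (not taken from the talk) of data-parallel gradient aggregation using an AllReduce collective via PyTorch's torch.distributed API. The model size, the gloo backend, and the launch/rendezvous settings are illustrative assumptions; the point is that every worker exchanges data on the order of the full model size per step, which is the collective traffic that can trigger incast at scale.

```python
# Illustrative sketch (assumptions noted above): data-parallel training step
# where each worker sums its local gradients with all other workers via
# AllReduce and then averages them.
import os

import torch
import torch.distributed as dist


def average_gradients(model: torch.nn.Module) -> None:
    """AllReduce each gradient tensor across all workers, then average."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size


def main() -> None:
    # Rank and world size are normally injected by a launcher such as torchrun;
    # the defaults below allow a single-process dry run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Toy model and loss; real workloads have billions of parameters,
    # so the per-step AllReduce volume dominates network load.
    model = torch.nn.Linear(1024, 1024)
    inputs = torch.randn(32, 1024)
    loss = model(inputs).pow(2).mean()
    loss.backward()

    average_gradients(model)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In-network aggregation targets exactly this step: instead of every worker exchanging full gradient tensors with its peers, switches (or other network elements) sum the contributions in flight, reducing both the data volume and the fan-in at the aggregation points.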