Networked Systems for Distributed Machine Learning at Scale
On July 3rd, 2024, I gave a talk at the UCL/Huawei Joint Lab Workshop on "Building Better Protocols for Future Smart Networks", held on UCL's campus in London.
Talk Abstract
Large-scale distributed machine learning training networks increasingly face scaling problems with respect to the effective FLOPS achieved per deployed compute node. Communication bottlenecks can inhibit the effective utilization of expensive GPU resources. The root cause of these performance problems is not insufficient transmission speed or slow servers; it is the structure of the distributed computation and the communication patterns it incurs. Large machine learning workloads typically exhibit relatively asymmetric, and sometimes centralized, communication structures, such as gradient aggregation and model update distribution. Even when training networks are less centralized, the amount of data that must be sent to aggregate several thousand input values through collective communication operations such as AllReduce can lead to incast problems that overload network resources and servers. This talk discusses challenges and opportunities for developing in-network aggregation systems from a distributed computing and networked systems perspective.
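To make the traffic pattern concrete, here is a minimal sketch (not taken from the talk) of data-parallel gradient aggregation using an AllReduce collective via PyTorch's torch.distributed API. The model size, the gloo backend, and the launch/rendezvous settings are illustrative assumptions; the point is that every worker exchanges data on the order of the full model size per step, which is the collective traffic that can trigger incast at scale.

```python
# Illustrative sketch (assumptions noted above): data-parallel training step
# where each worker sums its local gradients with all other workers via
# AllReduce and then averages them.
import os

import torch
import torch.distributed as dist


def average_gradients(model: torch.nn.Module) -> None:
    """AllReduce each gradient tensor across all workers, then average."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size


def main() -> None:
    # Rank and world size are normally injected by a launcher such as torchrun;
    # the defaults below allow a single-process dry run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Toy model and loss; real workloads have billions of parameters,
    # so the per-step AllReduce volume dominates network load.
    model = torch.nn.Linear(1024, 1024)
    inputs = torch.randn(32, 1024)
    loss = model(inputs).pow(2).mean()
    loss.backward()

    average_gradients(model)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In-network aggregation targets exactly this step: instead of every worker exchanging full gradient tensors with its peers, switches (or other network elements) sum the contributions in flight, reducing both the data volume and the fan-in at the aggregation points.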