Distributed Machine Learning at Dirk Kutscher

Archive for the ‘Distributed Machine Learning’ tag

Report from INET4AI Workshop at CoNEXT-2025

Organizers

Antoine Fressancourt
Dirk Kutscher

The 1st workshop on Inter-networking challenges for AI (INet4AI), collocated with ACM CoNEXt'25, was held on the 1st of December 2025 in Hong-Kong. The workshop was inspired by ongoing discussion in the IRTF on research challenges for (inter-)networking technologies for AI workloads.

This full day workshop explored some of the networking challenges of large-scale distributed AI workloads in environments, characterized by node and network heterogeneity, as well as dynamically changing resource availability and utilization. During this inaugural edition, researchers from both academia (HKUST, ETH Zurich, Politecnico di Milano, University of Napoli, Tsinghua University, TU Munich) and industry (Huawei, AMD, Microsoft, and others) discussed possible solutions to address the challenges raised by Internet-scale distributed AI systems with four workshop paper presentations and three invited talks. In this report, we will first give a summary of the workshop papers and invited talks. Then we will draw some general remarks regarding the ongoing efforts done in our community to address INet4AI challenges.

Check out the full report.

Program Overview

Invited talk — Tommaso Bonato — Uno: A One-Stop Solution for Inter- and Intra-Datacenter Congestion Control and Reliable Connectivity — Paper · Slides.
AI4Net paper — Shaked Leibzirer — Self-supervised Application-level Network Traffic Inversion — Paper · Slides.
Net4AI paper — German Sviridov — Latency-Optimal Load Balancing For Distributed MoE Inference — Paper.
Invited talk — Mingxing Zhang — From Homogeneous to Disaggregated Architectures for Large Model Inference — Slides.
Net4AI paper — Jiaheng Xiong — SCALE-CCL: A Scalable Collective Communication Library for Wide-Area Distributed Training — Paper · Slides.
Net4AI paper — Giuseppe Aceto — You’ve got a few GPUs, now what?! — Experimenting with a Nano-Cluster for Distributed Training of AI Models — Paper · Slides.
Invited talk — Wenjia Wei — Debriefing the Open Innovation Platform for UnifiedBus — Slides.

Written by dkutscher

December 24th, 2025 at 11:42 am

Posted in Conferences & Workshops,Events,IRTF

Tagged with AI, CoNEXT, Distributed Machine Learning, IRTF, machine learning

Invited Talk at FNDC: Connecting AI: Inter-Networking Challenges for Distributed Machine Learning

without comments

I gave a talk at the Future Network Development Conference (FNDC) in Nanjing on August 20th, 2025. The title of the talk was Connecting AI: Inter-Networking Challenges for Distributed Machine Learning, and I talked about our recent work on PacTrain, NetSenseML, and some new work on in-network aggregation.

PacTrain is a novel framework that accelerates distributed training by combining pruning with sparse gradient compression. Active pruning of the neural network makes the model weights and gradients sparse. By ensuring the global knowledge of the gradient sparsity among all distributed training workers, we can perform lightweight compression communication without harming accuracy. We show that the PacTrain compression scheme achieves a near-optimal compression strategy while remaining compatible with the all- reduce primitive. Experimental evaluations show that PacTrain improves training throughput by 1.25 to 8.72× compared to state-of-the-art compression-enabled systems for representative vision and language models training tasks under bandwidth-constrained conditions.

NetSenseML is a novel network adaptive distributed deep learning framework that dynamically adjusts quantization, pruning, and compression strategies in response to real-time network conditions. By actively monitoring network conditions, NetSenseML applies gradient compression only when network congestion negatively impacts convergence speed, thus effectively balancing data payload reduction and model accuracy preservation. Our approach ensures efficient resource usage by adapting reduction techniques based on current network conditions, leading to shorter convergence times and improved training efficiency. Experimental evaluations show that NetSenseML can improve training throughput by a factor of 1.55x to 9.84x compared to state-of-the-art compression-enabled systems for representative DDL training jobs in bandwidth-constrained conditions.

Written by dkutscher

August 21st, 2025 at 5:59 am

Posted in Talks

Tagged with Distributed Machine Learning, FNDC, machine learning, NetSenseML, PacTrain

NetSenseML accepted at Euro-Par

without comments

Our paper on NetSenseML: Network-Adaptive Compression for
Efficient Distributed Machine Learning has been accepted at the 31st International European on Parallel and Distributed Computing (Euro-Par-2025).

Abstract:
Training large-scale distributed machine learning models imposes considerable demands on network infrastructure, often resulting in sudden traffic spikes that lead to congestion, increased latency, and reduced throughput, which would ultimately affect convergence times and overall training performance. While gradient compression techniques are commonly employed to alleviate network load, they frequently compromise model accuracy due to the loss of gradient information.

This paper introduces NetSenseML, a novel network adaptive distributed deep learning framework that dynamically adjusts quantization, pruning, and compression strategies in response to real-time network conditions. By actively monitoring network conditions, NetSenseML applies gradient compression only when network congestion negatively impacts convergence speed, thus effectively balancing data payload reduction and model accuracy preservation.

Our approach ensures efficient resource usage by adapting reduction techniques based on current network conditions, leading to shorter convergence times and improved training efficiency. We present the design of the NetSenseML adaptive data reduction function and experimental evaluations show that NetSenseML can improve training throughput by a factor of 1.55x to 9.84x compared to state-of-the-art compression-enabled systems for representative DDL training jobs in bandwidth-constrained conditions.

References

Yisu Wang, Xinjiao Li, Ruilong Wu, Huangxun Chen, Dirk Kutscher; NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning; 31st International European on Parallel and Distributed Computing (Euro-Par-2025); August 2025; Preprint, Euro-Par-2025 Proceedings

Written by dkutscher

April 30th, 2025 at 5:36 pm

Posted in Publications

Tagged with congestion, distributed computing, Distributed Machine Learning, gradient compression, latency, parallel computing, parallelization

PacTrain accepted at DAC-2025

without comments

Our paper on PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning has been accepted at the Design Automation Conference DAC (2025) (CCF-A).

Abstract:
Large-scale deep neural networks (DNN) exhibit excellent performance for various tasks. As DNNs and datasets grow, distributed training becomes extremely time-consuming and demands larger clusters. A main bottleneck is the resulting gradient aggregation overhead. While gradient compression and sparse collective communication techniques are commonly employed to alleviate network load, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. This paper introduces PacTrain, a novel framework that accelerates distributed training by combining pruning with sparse gradient compression. Active pruning of the neural network makes the model weights and gradients sparse.

By ensuring the global knowledge of the gradient sparsity among all distributed training workers, we can perform lightweight compression communication without harming accuracy. We show that the PacTrain compression scheme achieves a near-optimal compression strategy while remaining compatible with the all- reduce primitive. Experimental evaluations show that PacTrain improves training throughput by 1.25 to 8.72× compared to state-of-the-art compression-enabled systems for representative vision and language models training tasks under bandwidth-constrained conditions.

Stay tuned for the pre-print.

References

Yisu Wang, Ruilong Wu, Xinjiao Li , Dirk Kutscher; PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning; Design Automation Conference (DAC) 2025; Preprint

Written by dkutscher

February 27th, 2025 at 4:32 am

Posted in Publications

Tagged with compression, Distributed Machine Learning, publications

Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing

without comments

In our paper at PCDS-2024, we are exploring strategies for academic researchers to optimize computational resources within limited budgets, focusing on building small, efficient computing clusters. We analyzed the comparative costs of purchasing versus renting servers, guided by market research and economic theories on tiered pricing. The paper offers detailed insights into the selection and assembly of hardware components such as CPUs, GPUs, and motherboards tailored to specific research needs. It introduces innovative methods to mitigate the performance issues caused by PCIe switch bandwidth limitations in order to enhance GPU task scheduling. Furthermore, a Graph Neural Network (GNN) framework is proposed to analyze and optimize parallelism in computing networks.

Growing Resource Demands for Large-Scale Machine Learning

Large machine learning (ML) models, such as language models (LLMs), are becoming increasingly powerful and gradually accessible to end users. However, the growth in the capabilities of these models has led to memory and inference computation demands exceeding those of personal computers and servers. To enable users, research teams, and others to utilize and experiment with these models, a distributed architecture is essential.

In recent years, scientific research has shifted from a ”wisdom paradigm” to a ”resource paradigm.” As the number of researchers and the depth of scientific exploration increase, a significant portion of research computing tasks has moved to servers. This shift has been facilitated by the development of computing frameworks and widespread use of computers, leading to an increased demand for computer procurement.

Despite the abundance of online tutorials for assembling personal computers, information on the establishment of large clusters is relatively scarce. Large Internet companies and multinational corporations usually employ professional architects and engineers or work closely with vendors to optimize their cluster performance. However, researchers often do not have access to these technical details and must rely on packaged solutions from service providers to build small clusters.

Towards Affordable HPC

In our paper "Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing", we aim to bridge this gap by providing opportunities for researchers with limited funds to build small clusters from scratch. We compiled the necessary technical details and guidelines to enable researchers to assemble clusters independently. In addition, we propose a method to mitigate the performance degradation caused by the bandwidth limitations of PCIe switches, which can help researchers prioritize GPU training tasks effectively.

The papers discusses:

How to build cost-effective clusters: We provide a comprehensive guide for researchers with limited funds, helping them to independently build small clusters and contribute to the development of large models.
Performance Optimization: We propose a method to address the performance degradation caused by PCIe switch bandwidth limitations. This method allows researchers to prioritize GPU training tasks effectively, thereby improving the overall cluster performance.
GNN for Network and Neural network parallelism: We propose a GNN (Graph Neural Network) framework that combines neural networks with parallel network flows in distributed systems. Our aim is to integrate different types of data flows, communication patterns, and computational tasks, thereby providing a novel perspective for evaluating the performance of distributed systems.

References

Ruilong Wu, Yisu Wang, Dirk Kutscher; Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing; The 1st International Symposium on Parallel Comnputing and Distributed Systems; September 2024; pre-print: https://arxiv.org/abs/2408.15568

Written by dkutscher

September 2nd, 2024 at 5:25 am

Posted in Publications

Tagged with Big Data, distributed computing, Distributed Machine Learning, Graph Computing, HPC

Networked Systems for Distributed Machine Learning at Scale

without comments

On July 3rd, 2024, I gave a talk at the UCL/Huawei Joint Lab Workshop on "Building Better Protocols for Future Smart Networks" that took place on UCL's campus in London.

Talk Abstract

Large-scale distributed machine learning training networks are increasingly facing scaling problems with respect to FLOPS per deployed compute node. Communication bottlenecks can inhibit the effective utilization of expensive GPU resources. The root cause of these performance problems is not insufficient transmission speed or slow servers; it is the structure of the distributed computing and the communication characteristics it incurs. Large machine learning workloads typically provide relatively asymmetric, and sometimes centralized, communication structures, such as gradient aggregation and model update distribution. Even when training networks are less centralized, the amount of data that needs to be sent to aggregate several thousand input values through collective communication functions such as AllReduce can lead to Incast problems that overload network resources and servers. This talk discusses challenges and opportunities for developing in-network aggregation systems from a distributed computing and networked systems perspective.

Written by dkutscher

July 22nd, 2024 at 3:23 pm

Posted in Talks

Tagged with distributed computing, Distributed Machine Learning, information-centric networking, networking

Dirk Kutscher

Archive for the ‘Distributed Machine Learning’ tag

Report from INET4AI Workshop at CoNEXT-2025

Organizers

Program Overview

Invited Talk at FNDC: Connecting AI: Inter-Networking Challenges for Distributed Machine Learning

NetSenseML accepted at Euro-Par

References

PacTrain accepted at DAC-2025

References

Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing

Growing Resource Demands for Large-Scale Machine Learning

Towards Affordable HPC

References

Networked Systems for Distributed Machine Learning at Scale

Talk Abstract

Contents

Categories

Meta