Dirk Kutscher

Personal web page

Archive for the ‘PacTrain’ tag

Invited Talk at FNDC: Connecting AI: Inter-Networking Challenges for Distributed Machine Learning

I gave a talk at the Future Network Development Conference (FNDC) in Nanjing on August 20th, 2025. The title of the talk was Connecting AI: Inter-Networking Challenges for Distributed Machine Learning, and I talked about our recent work on PacTrain, NetSenseML, and some new work on in-network aggregation.

PacTrain is a novel framework that accelerates distributed training by combining pruning with sparse gradient compression. Actively pruning the neural network makes the model weights and gradients sparse. Because all distributed training workers share global knowledge of the gradient sparsity pattern, they can exchange lightweight compressed gradients without harming accuracy. We show that the PacTrain compression scheme achieves a near-optimal compression strategy while remaining compatible with the all-reduce primitive. Experimental evaluations show that PacTrain improves training throughput by 1.25× to 8.72× compared to state-of-the-art compression-enabled systems for representative vision and language model training tasks under bandwidth-constrained conditions.
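To illustrate why a globally shared sparsity pattern keeps compression compatible with all-reduce, here is a minimal sketch in plain Python. It is not the PacTrain implementation: the helper names (`pack`, `unpack`, `all_reduce_sum`), the mask, and the gradient values are all made up for illustration, and the all-reduce is simulated with an element-wise sum. The point is only that when every worker uses the same mask, the packed vectors line up position by position and can be summed directly.

```python
def pack(grad, mask):
    """Keep only the entries that the shared pruning mask marks as active."""
    return [g for g, keep in zip(grad, mask) if keep]


def unpack(packed, mask):
    """Scatter packed values back to their positions; pruned slots become 0.0."""
    values = iter(packed)
    return [next(values) if keep else 0.0 for keep in mask]


def all_reduce_sum(vectors):
    """Stand-in for a real all-reduce: element-wise sum across workers."""
    return [sum(column) for column in zip(*vectors)]


# Hypothetical pruning mask, assumed identical on every worker.
mask = [True, False, True, False, False, True]

# Hypothetical per-worker gradients; pruned positions are already zero.
worker_grads = [
    [0.5, 0.0, -1.0, 0.0, 0.0, 2.0],
    [1.5, 0.0, 0.5, 0.0, 0.0, -1.0],
]

# Each worker sends only 3 values instead of 6.
packed = [pack(g, mask) for g in worker_grads]

# Summing the packed vectors and unpacking gives the same result
# as summing the full dense gradients.
reduced = unpack(all_reduce_sum(packed), mask)
dense_reference = all_reduce_sum(worker_grads)
```

In this toy setup no indices need to be transmitted, because the mask is known to all workers in advance. Schemes where each worker picks its own sparse positions generally cannot be summed this way.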

NetSenseML is a novel network-adaptive distributed deep learning (DDL) framework that dynamically adjusts quantization, pruning, and compression strategies in response to real-time network conditions. By actively monitoring the network, NetSenseML applies gradient compression only when congestion negatively impacts convergence speed, thus balancing data payload reduction against model accuracy preservation. Adapting the reduction techniques to current network conditions uses resources efficiently, leading to shorter convergence times and improved training efficiency. Experimental evaluations show that NetSenseML can improve training throughput by a factor of 1.55× to 9.84× compared to state-of-the-art compression-enabled systems for representative DDL training jobs under bandwidth-constrained conditions.
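The general idea of compressing only when the network is the bottleneck can be sketched as follows. This is a simplified illustration under stated assumptions, not NetSenseML's actual policy: the function names, the bandwidth-ratio rule, the `min_ratio` floor, and the use of top-k sparsification as the compression step are all hypothetical choices made for this example.

```python
def choose_keep_ratio(measured_mbps, required_mbps, min_ratio=0.01):
    """Return the fraction of gradient entries to send.

    Assumed rule: send everything when the measured bandwidth covers the
    uncompressed demand, otherwise scale the payload down proportionally,
    but never below min_ratio.
    """
    if measured_mbps >= required_mbps:
        return 1.0
    return max(min_ratio, measured_mbps / required_mbps)


def top_k_sparsify(grad, keep_ratio):
    """Zero out all but the largest-magnitude entries (top-k sparsification)."""
    k = max(1, round(len(grad) * keep_ratio))
    by_magnitude = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(by_magnitude[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


grad = [0.1, -3.0, 0.5, 2.0]  # hypothetical gradient

# Uncongested link: no compression is applied.
uncongested = top_k_sparsify(grad, choose_keep_ratio(1000, 800))

# Congested link: bandwidth covers half the demand, so half the entries are kept.
congested = top_k_sparsify(grad, choose_keep_ratio(400, 800))
```

A real system would also have to decide how to measure available bandwidth and how to keep the accuracy impact of dropped entries in check, for example with error feedback. Both are left out of this sketch.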

Written by dkutscher

August 21st, 2025 at 5:59 am