Dirk Kutscher

Personal web page


NetSenseML accepted at Euro-Par


Our paper on NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning has been accepted at the 31st International European Conference on Parallel and Distributed Computing (Euro-Par 2025).

Abstract:
Training large-scale distributed machine learning models imposes considerable demands on network infrastructure, often resulting in sudden traffic spikes that lead to congestion, increased latency, and reduced throughput, ultimately affecting convergence times and overall training performance. While gradient compression techniques are commonly employed to alleviate network load, they frequently compromise model accuracy due to the loss of gradient information.

This paper introduces NetSenseML, a novel network-adaptive distributed deep learning framework that dynamically adjusts quantization, pruning, and compression strategies in response to real-time network conditions. By actively monitoring the network, NetSenseML applies gradient compression only when congestion negatively impacts convergence speed, thus effectively balancing data payload reduction and model accuracy preservation.

Our approach ensures efficient resource usage by adapting reduction techniques to current network conditions, leading to shorter convergence times and improved training efficiency. We present the design of the NetSenseML adaptive data reduction function, and our experimental evaluation shows that NetSenseML improves training throughput by a factor of 1.55x to 9.84x compared to state-of-the-art compression-enabled systems for representative DDL training jobs under bandwidth-constrained conditions.
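To make the adaptive idea more concrete, here is a minimal sketch of a network-adaptive gradient reduction policy. The bandwidth measurement, thresholds, and top-k ratio below are hypothetical and for illustration only; they are not the NetSenseML implementation or policy described in the paper.

```python
# Illustrative sketch: pick a gradient reduction strategy per step based on
# observed bandwidth. Thresholds and the measurement source are assumptions.
import numpy as np

BANDWIDTH_OK_MBPS = 1000   # above this, send raw gradients (network not the bottleneck)
BANDWIDTH_LOW_MBPS = 200   # below this, apply aggressive sparsification
TOPK_RATIO = 0.01          # keep the top 1% of gradient entries when sparsifying

def quantize_int8(grad: np.ndarray):
    """Uniform 8-bit quantization: transmit int8 values plus a scale factor."""
    scale = np.abs(grad).max() / 127.0 + 1e-12
    return (grad / scale).astype(np.int8), scale

def topk_sparsify(grad: np.ndarray, ratio: float):
    """Keep only the largest-magnitude entries; transmit (indices, values)."""
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad.ravel()), -k)[-k:]
    return idx, grad.ravel()[idx]

def reduce_gradient(grad: np.ndarray, bandwidth_mbps: float):
    """Choose a reduction strategy based on the currently measured bandwidth."""
    if bandwidth_mbps >= BANDWIDTH_OK_MBPS:
        return ("raw", grad)                          # no compression needed
    if bandwidth_mbps >= BANDWIDTH_LOW_MBPS:
        return ("int8", quantize_int8(grad))          # moderate congestion
    return ("topk", topk_sparsify(grad, TOPK_RATIO))  # severe congestion

# Usage: before each all-reduce/push, select the payload format.
grad = np.random.randn(10_000_000).astype(np.float32)
kind, payload = reduce_gradient(grad, bandwidth_mbps=150.0)
print(kind)  # -> "topk" under the assumed 150 Mbps measurement
```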

References

Yisu Wang, Xinjiao Li, Ruilong Wu, Huangxun Chen, Dirk Kutscher; NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning; 31st International European Conference on Parallel and Distributed Computing (Euro-Par 2025); August 2025; accepted for publication

Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization accepted at ACM APNET


Our paper on Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization has been accepted by the 9th Asia-Pacific Workshop on Networking (APNET'25).

Abstract:
Hybrid parallelism techniques are crucial for the efficient training of large language models (LLMs). However, these techniques often introduce differentiated computational and communication tasks across nodes. Existing automatic parallel planning frameworks typically fail to consider both node heterogeneity and dynamic changes in network topology simultaneously, limiting their practical performance. In this paper, we address this issue by positioning heterogeneous nodes within dynamic network environments and employing a simulator to identify optimal parallel strategies. Our approach achieves fine-grained workload distribution in scenarios featuring node heterogeneity and complex networks, while also matching state-of-the-art performance in regular topologies and stable network conditions. Moreover, to mitigate the excessively long search times caused by large search spaces in existing frameworks, we propose a strategy pruning technique to rapidly eliminate infeasible parallel configurations. We further accelerate the search process by executing search tasks in parallel within the simulator. Preliminary evaluation results demonstrate that our method significantly improves training performance on heterogeneous nodes, and the proposed dynamic network design offers enhanced adaptability for complex scenarios such as cloud computing environments.

References

Ruilong Wu, Xinjiao Li, Yisu Wang, Xinyu Chen, Dirk Kutscher; Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization; The 9th Asia-Pacific Workshop on Networking (APNET'25); August 2025; accepted for publication

Written by dkutscher

April 24th, 2025 at 8:21 am