Dirk Kutscher

Personal web page

Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization accepted at ACM APNET

Our paper "Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization" has been accepted at the 9th Asia-Pacific Workshop on Networking (APNET'25).

Abstract:
Hybrid parallelism techniques are crucial for the efficient training of large language models (LLMs). However, these techniques often introduce differentiated computational and communication tasks across nodes. Existing automatic parallel planning frameworks typically fail to consider both node heterogeneity and dynamic changes in network topology simultaneously, limiting their practical performance. In this paper, we address this issue by positioning heterogeneous nodes within dynamic network environments and employing a simulator to identify optimal parallel strategies. Our approach achieves fine-grained workload distribution in scenarios featuring node heterogeneity and complex networks, while also matching state-of-the-art performance in regular topologies and stable network conditions. Moreover, to mitigate the excessively long search times caused by large search spaces in existing frameworks, we propose a strategy pruning technique to rapidly eliminate infeasible parallel configurations. We further accelerate the search process by executing search tasks in parallel within the simulator. Preliminary evaluation results demonstrate that our method significantly improves training performance on heterogeneous nodes, and the proposed dynamic network design offers enhanced adaptability for complex scenarios such as cloud computing environments.
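To give a rough feel for what pruning infeasible parallel configurations can look like, here is a minimal Python sketch. It is not the method from the paper: the function names, the (data, tensor, pipeline) strategy triple, and the simple memory model are all assumptions made for illustration. It enumerates every way to split a device count across the three parallelism degrees and discards candidates that cannot work before any of them would be handed to a simulator.

```python
from itertools import product


def enumerate_strategies(num_devices):
    """Yield every (dp, tp, pp) triple whose product equals num_devices.

    dp, tp, pp are the data-, tensor-, and pipeline-parallel degrees
    (a hypothetical representation chosen for this sketch).
    """
    for dp, tp in product(range(1, num_devices + 1), repeat=2):
        if num_devices % (dp * tp) == 0:
            yield dp, tp, num_devices // (dp * tp)


def is_feasible(strategy, model_mem_gb, device_mem_gb, num_layers):
    """Cheap feasibility checks that need no simulation.

    Assumptions (illustrative only):
    - a pipeline cannot have more stages than the model has layers;
    - model state is sharded evenly across tp * pp devices;
    - device_mem_gb is the memory of the smallest device, which is a
      conservative stand-in for heterogeneous nodes.
    """
    dp, tp, pp = strategy
    if pp > num_layers:
        return False
    return model_mem_gb / (tp * pp) <= device_mem_gb


def prune(num_devices, model_mem_gb, device_mem_gb, num_layers):
    """Return only the strategies that pass the feasibility checks."""
    return [
        s
        for s in enumerate_strategies(num_devices)
        if is_feasible(s, model_mem_gb, device_mem_gb, num_layers)
    ]


if __name__ == "__main__":
    # 8 devices, a 160 GB model state, 40 GB per device, 4 layers.
    for dp, tp, pp in prune(8, 160, 40, 4):
        print(f"dp={dp} tp={tp} pp={pp}")
```

In a setup like this, only the surviving candidates would need to be evaluated by a cost model or simulator, and because each evaluation is independent, they could be dispatched in parallel.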

References

Ruilong Wu, Xinjiao Li, Yisu Wang, Xinyu Chen, Dirk Kutscher; Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization; The 9th Asia-Pacific Workshop on Networking (APNET'25); August 2025; accepted for publication

Written by dkutscher

April 24th, 2025 at 8:21 am