Archive for the ‘Distributed Machine Learning’ tag
Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing
In our paper at PCDS-2024, we explore strategies for academic researchers to optimize computational resources within limited budgets, focusing on building small, efficient computing clusters. We analyze the comparative costs of purchasing versus renting servers, guided by market research and economic theories on tiered pricing. The paper offers detailed guidance on selecting and assembling hardware components such as CPUs, GPUs, and motherboards to match specific research needs. It introduces methods to mitigate the performance degradation caused by PCIe switch bandwidth limitations and thereby improve GPU task scheduling. Furthermore, it proposes a Graph Neural Network (GNN) framework to analyze and optimize parallelism in computing networks.
Growing Resource Demands for Large-Scale Machine Learning
Large machine learning (ML) models, such as large language models (LLMs), are becoming increasingly powerful and gradually more accessible to end users. However, the growth in model capabilities has driven memory and inference-compute demands beyond what personal computers and single servers can provide. To enable users, research teams, and others to use and experiment with these models, a distributed architecture is essential.
In recent years, scientific research has shifted from a "wisdom paradigm" to a "resource paradigm." As the number of researchers and the depth of scientific exploration increase, a significant portion of research computing tasks has moved to servers. This shift has been facilitated by the development of computing frameworks and the widespread use of computers, leading to increased demand for computer procurement.
Despite the abundance of online tutorials for assembling personal computers, information on the establishment of large clusters is relatively scarce. Large Internet companies and multinational corporations usually employ professional architects and engineers or work closely with vendors to optimize their cluster performance. However, researchers often do not have access to these technical details and must rely on packaged solutions from service providers to build small clusters.
Towards Affordable HPC
In our paper "Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing", we aim to bridge this gap by providing opportunities for researchers with limited funds to build small clusters from scratch. We compiled the necessary technical details and guidelines to enable researchers to assemble clusters independently. In addition, we propose a method to mitigate the performance degradation caused by the bandwidth limitations of PCIe switches, which can help researchers prioritize GPU training tasks effectively.
The paper discusses:
- Cost-effective cluster building: We provide a comprehensive guide for researchers with limited funds, helping them build small clusters independently and contribute to the development of large models.
- Performance optimization: We propose a method to address the performance degradation caused by PCIe switch bandwidth limitations, allowing researchers to prioritize GPU training tasks effectively and thereby improve overall cluster performance.
- GNN for network and neural-network parallelism: We propose a Graph Neural Network (GNN) framework that combines neural networks with parallel network flows in distributed systems. Our aim is to integrate different types of data flows, communication patterns, and computational tasks, providing a novel perspective for evaluating the performance of distributed systems (a toy graph-model sketch follows this list).
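As a loose illustration of the graph-based idea in the last bullet, the PyTorch sketch below treats compute nodes as graph vertices, communication links as edges, and uses a simple GCN-style network with a pooled readout to score a topology. The layer structure, node features, and scalar "score" are assumptions made for illustration; they are not the framework actually proposed in the paper.

```python
# Minimal GCN-style sketch of scoring a distributed-training topology.
# The graph, features, and scalar score are made up for illustration.

import torch
import torch.nn as nn

class TopologyGCN(nn.Module):
    """Two GCN-style layers followed by a mean-pooled regression head."""
    def __init__(self, in_dim, hidden_dim=16):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.head = nn.Linear(hidden_dim, 1)

    @staticmethod
    def normalize(adj):
        # Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

    def forward(self, adj, x):
        a_norm = self.normalize(adj)
        h = torch.relu(a_norm @ self.w1(x))
        h = torch.relu(a_norm @ self.w2(h))
        return self.head(h.mean(dim=0))   # one score per topology graph

# Example: 4 compute nodes; node features = [GPU count, link bandwidth (GB/s)]
adj = torch.tensor([[0., 1., 1., 0.],
                    [1., 0., 1., 1.],
                    [1., 1., 0., 1.],
                    [0., 1., 1., 0.]])
features = torch.tensor([[4., 16.],
                         [8., 32.],
                         [8., 32.],
                         [4., 16.]])

model = TopologyGCN(in_dim=2)
print(model(adj, features))   # untrained scalar score for this topology
```

Trained on measured throughput for different cluster and parallelism configurations, such a model could in principle rank candidate topologies; the sketch only shows the message-passing mechanics.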
References
- Ruilong Wu, Yisu Wang, Dirk Kutscher; Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing; The 1st International Symposium on Parallel Computing and Distributed Systems; September 2024; pre-print: https://arxiv.org/abs/2408.15568
Networked Systems for Distributed Machine Learning at Scale
On July 3rd, 2024, I gave a talk at the UCL/Huawei Joint Lab Workshop on "Building Better Protocols for Future Smart Networks" that took place on UCL's campus in London.
Talk Abstract
Large-scale distributed machine learning training networks increasingly face scaling problems with respect to FLOPS per deployed compute node: communication bottlenecks can inhibit the effective utilization of expensive GPU resources. The root cause of these performance problems is not insufficient transmission speed or slow servers; it is the structure of the distributed computation and the communication patterns it incurs. Large machine learning workloads typically exhibit relatively asymmetric, and sometimes centralized, communication structures, such as gradient aggregation and model-update distribution. Even when training networks are less centralized, the amount of data that must be exchanged to aggregate several thousand input values through collective communication functions such as AllReduce can lead to Incast problems that overload network resources and servers. This talk discusses challenges and opportunities for developing in-network aggregation systems from a distributed computing and networked systems perspective.
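To make the Incast point concrete, here is a minimal, purely illustrative Python sketch (not from the talk or any specific system) that counts the bytes arriving at a single aggregation endpoint when workers send full gradients directly to one server, versus when hypothetical in-network switches pre-aggregate the gradients of their attached workers. The worker count, fan-in, and byte-counting model are assumptions made only for illustration.

```python
# Toy comparison of aggregation strategies for distributed gradient reduction.
# Illustrative sketch only; the "switch" abstraction and byte-counting model
# are simplifying assumptions.

import numpy as np

N_WORKERS = 8
GRAD_SIZE = 1_000_000          # float32 gradient elements per worker
BYTES_PER_ELEM = 4

rng = np.random.default_rng(0)
grads = [rng.standard_normal(GRAD_SIZE).astype(np.float32) for _ in range(N_WORKERS)]

def naive_parameter_server(grads):
    """Every worker sends its full gradient to one server (Incast-prone)."""
    bytes_at_server = len(grads) * GRAD_SIZE * BYTES_PER_ELEM
    aggregate = np.sum(grads, axis=0)
    return aggregate, bytes_at_server

def in_network_aggregation(grads, fan_in=4):
    """Switches sum the gradients of their directly attached workers first,
    so the server only receives one partial aggregate per switch."""
    partials = [np.sum(grads[i:i + fan_in], axis=0)
                for i in range(0, len(grads), fan_in)]
    bytes_at_server = len(partials) * GRAD_SIZE * BYTES_PER_ELEM
    aggregate = np.sum(partials, axis=0)
    return aggregate, bytes_at_server

agg_ps, bytes_ps = naive_parameter_server(grads)
agg_ina, bytes_ina = in_network_aggregation(grads)

assert np.allclose(agg_ps, agg_ina)
print(f"bytes at server, parameter server : {bytes_ps/1e6:.1f} MB")
print(f"bytes at server, in-network aggr. : {bytes_ina/1e6:.1f} MB")
```

The aggregate is identical in both cases; only where the summation happens differs, which is exactly the knob that in-network aggregation turns to relieve the bottleneck link and the receiving server.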