Dirk Kutscher

Personal web page

Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing

without comments

In our paper at PCDS-2024, we are exploring strategies for academic researchers to optimize computational resources within limited budgets, focusing on building small, efficient computing clusters. We analyzed the comparative costs of purchasing versus renting servers, guided by market research and economic theories on tiered pricing. The paper offers detailed insights into the selection and assembly of hardware components such as CPUs, GPUs, and motherboards tailored to specific research needs. It introduces innovative methods to mitigate the performance issues caused by PCIe switch bandwidth limitations in order to enhance GPU task scheduling. Furthermore, a Graph Neural Network (GNN) framework is proposed to analyze and optimize parallelism in computing networks.

Growing Resource Demands for Large-Scale Machine Learning

Large machine learning (ML) models, such as language models (LLMs), are becoming increasingly powerful and gradually accessible to end users. However, the growth in the capabilities of these models has led to memory and inference computation demands exceeding those of personal computers and servers. To enable users, research teams, and others to utilize and experiment with these models, a distributed architecture is essential.

In recent years, scientific research has shifted from a ”wisdom paradigm” to a ”resource paradigm.” As the number of researchers and the depth of scientific exploration increase, a significant portion of research computing tasks has moved to servers. This shift has been facilitated by the development of computing frameworks and widespread use of computers, leading to an increased demand for computer procurement.

Despite the abundance of online tutorials for assembling personal computers, information on the establishment of large clusters is relatively scarce. Large Internet companies and multinational corporations usually employ professional architects and engineers or work closely with vendors to optimize their cluster performance. However, researchers often do not have access to these technical details and must rely on packaged solutions from service providers to build small clusters.

Towards Affordable HPC

In our paper "Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing", we aim to bridge this gap by providing opportunities for researchers with limited funds to build small clusters from scratch. We compiled the necessary technical details and guidelines to enable researchers to assemble clusters independently. In addition, we propose a method to mitigate the performance degradation caused by the bandwidth limitations of PCIe switches, which can help researchers prioritize GPU training tasks effectively.

The papers discusses:

  1. How to build cost-effective clusters: We provide a comprehensive guide for researchers with limited funds, helping them to independently build small clusters and contribute to the development of large models.
  2. Performance Optimization: We propose a method to address the performance degradation caused by PCIe switch bandwidth limitations. This method allows researchers to prioritize GPU training tasks effectively, thereby improving the overall cluster performance.
  3. GNN for Network and Neural network parallelism: We propose a GNN (Graph Neural Network) framework that combines neural networks with parallel network flows in distributed systems. Our aim is to integrate different types of data flows, communication patterns, and computational tasks, thereby providing a novel perspective for evaluating the performance of distributed systems.

References

Written by dkutscher

September 2nd, 2024 at 5:25 am