Archive for the ‘Publications’ Category
Networked Metaverse Systems
The term ‘Metaverse’ often denotes a wide range of existing and fictional applications. Nevertheless, there are actual systems today that can be studied and analyzed. However, whereas a considerable body of work has been published on applications and application ideas, there is less work on the technical implementation of such systems, especially from a networked systems perspective.
In a recently published open access journal article, we share some insights into the technical design of Metaverse systems, their key technologies, and their shortcomings, predominantly from a networked systems perspective. For the scope of this study, we define the ‘Metaverse’ as follows. The ‘Metaverse’ encompasses various current and emerging technologies, and the term is used to describe different applications, ranging from Augmented Reality (AR), Virtual Reality (VR),and Extended Reality (XR) to a new form of the Internet or Web. A key feature distinguishing the Metaverse from simple AR/VR is its inherently collaborative and shared nature, enabling interaction and collaboration among users in a virtual environment.
Building on Existing Platforms and Network Stacks
Most current Metaverse systems and designs are built on existing technologies and networks. For example, massively multiplayer online games such as Fortnite use a generalized client-server model. In this model, the server authoritatively manages the game state, while the client maintains a local subset of this state and can predict game flow by executing the same game code as the server on approximately the same data. Servers send information about the game world to clients by replicating relevant actors and their properties. Commercial social VR platforms such as Horizon Worlds and AltspaceVR use HTTPS to report client-side information and synchronize in-game clocks across users.
Mozilla Hubs, built with A-Frame (a web framework for building virtual reality experiences), uses WebRTC communication with a Selective Forwarding Unit (SFU). The SFU receives multiple audio and video data streams from its peers, then determines and forwards relevant data streams to connected peers. Blockchain or Non-Fungible Token (NFT)-based online games, such as Decentraland, run exclusively on the client side but allow for various data flow models, ranging from local effects and traditional client-server architectures to peer-to-peer (P2P) interactions based on state channels; Upland is built on EOSIO, an open-source blockchain protocol for scalable decentralized applications, and transports data through HTTPS. Connections between peers in Upland are established using TLS or VPN tunnels.
Many studies have focused on improving various aspects of Metaverse systems. For example, EdgeXAR is a mobile AR framework using edge offloading to enable lightweight tracking with six degrees of freedom (DOF) while reducing offloading delay from the user’s view; SORAS is an optimal resource allocation scheme for edgeenabled Metaverse, using stochastic integer programming to minimize the total network cost; Ibrahim et al. explores the issue of partial computation offloading for multiple subtasks in an in-network computing environment, aiming to minimize energy consumption and delay. However, these ideas for offloading computation and rendering tasks to edge platforms often conflict with the existing end-to-end transport protocols and overlay deployment models. Recently, a Deep Reinforcement Learning (DRL)-based multipath network orchestration framework designed for remote healthcare services is presented, automating subflow management to handle multipath networks. However, proposals for scalable multi-party communication would require interdomain multicast services, unavailable on today’s Internet.
Disconnect Between High-Level Concepts and Actual Systems
In practice, there is a significant disconnect between high-level Metaverse concepts, ideas for technical improvements, and systems that are actually developed and partially deployed. A 2022 ACM IMC paper titled Are we ready for metaverse?: a measurement study of social virtual reality platforms analyzes the performance of various social VR systems, pinpointing numerous issues related to performance, communication overhead, and scalability. These issues are primarily due to the fact that current systems leverage existing platforms, protocols, and system architectures, which cannot tap into any of the proposed architectural and technical enhancements, such as scalable multi-party communication, offloading computation, rendering tasks, etc.
Rather than merely layering ‘the Metaverse’ on top of legacy and not always ideal foundations, we consider Metaverse as a driver for future network and web applications and actively develop new designs to that end. In our article, we take a comprehensive systems approach and technically describe current Metaverse systems, focusing on their networking aspects. We document the requirements and challenges of Metaverse systems and propose a principled approach to system design for these requirements and challenges based on a thorough understanding of the needs of Metaverse systems, the current constraints and limitations, and the potential solutions of Internet technologies.
Article Overview
- We present a technical description of the ‘Metaverse’ based on existing and emerging systems, including a discussion of its fundamental properties, applications, and architectural models.
- We comprehensively study relevant enabling technologies for Metaverse systems, including HCI/XR technologies, networking, communications, media encoding, simulation, real-time rendering and AI. We also discuss current Metaverse system architectures and the integration of these technologies into actual applications.
- We conduct a detailed requirements analysis for constructing Metaverse systems. We analyze applications specific requirements and identify existing gaps in four key aspects: communication performance, mobility, large-scale operation,and end system architecture. For each area, we propose candidate technologies to address these gaps.
- We propose a research agenda for future Metaverse systems, based on our gap analysis and candidate technologies discussion. We re-assess the fundamental goals and requirements, without necessarily being constrained by existing system architectures and protocols. Based on a comprehensive understanding of what Metaverse systems need and what end-systems, devices, networks and communication services can theoretically provide, we propose specific design ideas and future research directions to realize Metaverse systems that can meet the expectations often articulated in the literature.
References
- Y. Zhang, D. Kutscher and Y. Cui; Networked Metaverse Systems: Foundations, Gaps, Research Directions; in IEEE Open Journal of the Communications Society, doi: 10.1109/OJCOMS.2024.3426098.
- Tianyuan Yu, Xinyu Ma, Varun Patil, Yekta Kocaogullar, Yulong Zhang, Jeff Burke, Dirk Kutscher, Lixia Zhang; Secure Web Objects: Building Blocks for Metaverse Interoperability and Decentralization; IEEE MetaCom 2024; August 12-14 2024; Hong Kong, China
- Dirk Kutscher, Jeff Burke, Giuseppe Fioccola, Paulo Mendes;
Statement: The Metaverse as an Information-Centric Network; 10th ACM Conference on Information-Centric Networking (ACM ICN '23); October 9 — 10, 2023, Reykjavik, Iceland - Giuseppe Fioccola , Paulo Mendes , Jeff Burke , Dirk Kutscher;
Information-Centric Metaverse; Internet Draft draft-fmbk-icnrg-metaverse-01; Work in Progress; July 2023
Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing
In our paper at PCDS-2024, we are exploring strategies for academic researchers to optimize computational resources within limited budgets, focusing on building small, efficient computing clusters. We analyzed the comparative costs of purchasing versus renting servers, guided by market research and economic theories on tiered pricing. The paper offers detailed insights into the selection and assembly of hardware components such as CPUs, GPUs, and motherboards tailored to specific research needs. It introduces innovative methods to mitigate the performance issues caused by PCIe switch bandwidth limitations in order to enhance GPU task scheduling. Furthermore, a Graph Neural Network (GNN) framework is proposed to analyze and optimize parallelism in computing networks.
Growing Resource Demands for Large-Scale Machine Learning
Large machine learning (ML) models, such as language models (LLMs), are becoming increasingly powerful and gradually accessible to end users. However, the growth in the capabilities of these models has led to memory and inference computation demands exceeding those of personal computers and servers. To enable users, research teams, and others to utilize and experiment with these models, a distributed architecture is essential.
In recent years, scientific research has shifted from a ”wisdom paradigm” to a ”resource paradigm.” As the number of researchers and the depth of scientific exploration increase, a significant portion of research computing tasks has moved to servers. This shift has been facilitated by the development of computing frameworks and widespread use of computers, leading to an increased demand for computer procurement.
Despite the abundance of online tutorials for assembling personal computers, information on the establishment of large clusters is relatively scarce. Large Internet companies and multinational corporations usually employ professional architects and engineers or work closely with vendors to optimize their cluster performance. However, researchers often do not have access to these technical details and must rely on packaged solutions from service providers to build small clusters.
Towards Affordable HPC
In our paper "Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing", we aim to bridge this gap by providing opportunities for researchers with limited funds to build small clusters from scratch. We compiled the necessary technical details and guidelines to enable researchers to assemble clusters independently. In addition, we propose a method to mitigate the performance degradation caused by the bandwidth limitations of PCIe switches, which can help researchers prioritize GPU training tasks effectively.
The papers discusses:
- How to build cost-effective clusters: We provide a comprehensive guide for researchers with limited funds, helping them to independently build small clusters and contribute to the development of large models.
- Performance Optimization: We propose a method to address the performance degradation caused by PCIe switch bandwidth limitations. This method allows researchers to prioritize GPU training tasks effectively, thereby improving the overall cluster performance.
- GNN for Network and Neural network parallelism: We propose a GNN (Graph Neural Network) framework that combines neural networks with parallel network flows in distributed systems. Our aim is to integrate different types of data flows, communication patterns, and computational tasks, thereby providing a novel perspective for evaluating the performance of distributed systems.
References
- Ruilong Wu, Yisu Wang, Dirk Kutscher; Affordable HPC: Leveraging Small Clusters for Big Data and Graph Computing; The 1st International Symposium on Parallel Comnputing and Distributed Systems; September 2024; pre-print: https://arxiv.org/abs/2408.15568
Secure Web Objects: Building Blocks for Metaverse Interoperability and Decentralization
In our upcoming paper at IEEE Metacom-2024, we propose a data-oriented approach for future Web and Metaverse system designs.
Abstract
This position paper explores how to support the Web's evolution through an underlying data-centric approach that better matches the data-orientedness of modern and emerging applications. We revisit the original vision of the Web as a hypermedia system that supports document composability and application interoperability via name-based data access. We propose the use of secure web objects (SWO), a data-oriented communication approach that can reduce complexity, centrality, and inefficiency, particularly for collaborative and local-first applications, such as the Metaverse and other collaborative applications. SWO are named, signed, application-defined objects that are secured independently of their containers or communications channels, an approach that leverages the results from over a decade-long data-centric networking research. This approach does not require intermediation by aggregators of identity, storage, and other services that are common today. We present a brief design overview, illustrated through prototypes for two editors of shared hypermedia documents: one for 3D and one for LaTeX. We also discuss our findings and suggest a roadmap for future research.
References
-
Tianyuan Yu, Xinyu Ma, Varun Patil, Yekta Kocaogullar, Yulong Zhang, Jeff Burke, Dirk Kutscher, Lixia Zhang; Secure Web Objects: Building Blocks for Metaverse Interoperability and Decentralization; IEEE MetaCom 2024, pre-print: https://arxiv.org/abs/2407.15221
-
Dirk Kutscher; Data-oriented, Decentralized, Daring: Opportunities and Research Challenges for an Information-Centric Web; Lightning Talk at NDNComm 2024; March 2024
-
Navin V. Keizer, Onur Ascigil, Michał Król, Dirk Kutscher, and George Pavlou; A Survey on Content Retrieval on the Decentralised Web; ACM Computing Surveys; March 2024; https://doi.org/10.1145/3649132
-
Dirk Kutscher, Jeff Burke, Giuseppe Fioccola, Paulo Mendes;
Statement: The Metaverse as an Information-Centric Network; 10th ACM Conference on Information-Centric Networking (ACM ICN '23); October 9 — 10, 2023, Reykjavik, Iceland; https://doi.org/10.1145/3623565.3623761
Networked Systems for Distributed Machine Learning at Scale
On July 3rd, 2024, I gave a talk at the UCL/Huawei Joint Lab Workshop on "Building Better Protocols for Future Smart Networks" that took place on UCL's campus in London.
Talk Abstract
Large-scale distributed machine learning training networks are increasingly facing scaling problems with respect to FLOPS per deployed compute node. Communication bottlenecks can inhibit the effective utilization of expensive GPU resources. The root cause of these performance problems is not insufficient transmission speed or slow servers; it is the structure of the distributed computing and the communication characteristics it incurs. Large machine learning workloads typically provide relatively asymmetric, and sometimes centralized, communication structures, such as gradient aggregation and model update distribution. Even when training networks are less centralized, the amount of data that needs to be sent to aggregate several thousand input values through collective communication functions such as AllReduce can lead to Incast problems that overload network resources and servers. This talk discusses challenges and opportunities for developing in-network aggregation systems from a distributed computing and networked systems perspective.
RFC 9556: Internet of Things (IoT) Edge Challenges and Functions
Many Internet of Things (IoT) applications have requirements that cannot be satisfied by centralized cloud-based systems (i.e., cloud computing). These include time sensitivity, data volume, connectivity cost, operation in the face of intermittent services, privacy, and security. As a result, IoT is driving the Internet toward edge computing.
We have published RFC 9556, outlining the requirements of the emerging IoT edge and its challenges. It presents a general model and major components of the IoT edge to provide a common basis for future discussions in the Thing-to-Thing Research Group (T2TRG) and other IRTF and IETF groups.
Today, many IoT services leverage cloud computing platforms because they provide virtually unlimited storage and processing power. The reliance of IoT on back-end cloud computing provides additional advantages, such as scalability and efficiency. At the time of writing, IoT systems are fairly static with respect to integrating and supporting computation. It is not that there is no computation, but that systems are often limited to static configurations (edge gateways and cloud services).
However, IoT devices generate large amounts of data at the edges of the network. To meet IoT use case requirements, data is increasingly being stored, processed, analyzed, and acted upon close to the data sources. These requirements include time sensitivity, data volume, connectivity cost, and resiliency in the presence of intermittent connectivity, privacy, and security, which cannot be addressed by centralized cloud computing. A more flexible approach is necessary to address these needs effectively. This involves distributing computing (and storage) and seamlessly integrating it into the edge-cloud continuum. We refer to this integration of edge computing and IoT as "IoT edge computing". RFC 9556 describes the related background, use cases, challenges, system models, and functional components.
Data-oriented, Decentralized, Daring: Opportunities and Research Challenges for an Information-Centric Web
Research and development in ICN has led to different communication patterns such as Sync and API implementations such as CNL. It is now time to think about how to leverage Information-Centric principles for providing better foundations for hypermedia applications in the future web. At NDNComm-2024 I talked about how ICN could possibly help, what could be fruitful future research directions, and why web3 and dweb are not the answer.
Material
Content Retrieval on the Decentralised Web
Trends and Emerging Technologies for Content Retrieval on the Decentralized Web
The control, governance, and management of the web have become increasingly centralised, resulting in security, privacy, and censorship concerns. Decentralised initiatives have emerged to address these issues, beginning with decentralised file systems. These systems have gained popularity, with major platforms serving millions of content requests daily. Complementing the file systems are decentralised search engines and name registry infrastructures, together forming the basis of a decentralised web. We have published a survey paper that analyses research trends and emerging technologies for content retrieval on the decentralised web, encompassing both academic literature and industrial projects.
Challenges
Several challenges hinder the realisation of a fully decentralised web. Achieving comparable performance to centralised systems without compromising decentralisation is a key challenge. Hybrid infrastructures, blending centralised components with verifiability mechanisms, show promise to improve decentralised initiatives. While decentralised file systems have seen more mature deployments, they still face challenges such as usability, performance, privacy, and content moderation. Integrating these systems with decentralised name-registries offers a potential for improved usability with human-readable and persistent names for content. Further research is needed to address security concerns in decentralised name-registries and enhance governance and crypto-economic incentive mechanisms.
References
Navin V. Keizer, Onur Ascigil, Michał Król, Dirk Kutscher, and George Pavlou; A Survey on Content Retrieval on the Decentralised Web; ACM Computing Surveys; March 2024; https://doi.org/10.1145/3649132
Towards a Unified Transport Protocol for In-Network Computing in Support of RPC-based Applications
The emerging term In-Network Computin (INC) [inc] in particular refers applying on-path programmable networking devices (e.g., switches and routers between clients and servers) as an accelerator or function offloader to boost throughput, reduce server load, or improve latency, typically in a well-controlled data center network environment.
Some INC implementations evolved from programmable data plane systems and align with the trend of network programmability at large. In recent year, it has been shown to support many promising applications (e.g., caching, aggregation, and agreement). For example, in distributed machine learning (DML), training nodes produce data (gradients) that needs to be aggregated or reduced -- and the result could be distributed to one or multiple consumers. As another example, the NetClone system [netclone] uses in-network forwarder to replicate RPC invocation messages and to perform more informed forwarding based on observed latencies for accelerating RPC communication.
While it is possible to achieve this kind of operation purely with end-to-end communication between worker nodes, performance can be dramatically improved by offloading both the operation processing and the data dissemination to nodes in the network. These in-network processors are often conceived as semi-transparent performance enhancing on-path elements, i.e., they are not the actual endpoints in transport protocol sessions and would intercept packets with application data and potentially generate new data that they would have to transmit.
In our Internet Draft draft-song-inc-transport-protocol-req-01.txt, we are discussing this problem and are formulating some requirements for the design of future transport protocols in this space.
References
- Collective Communication: Better Network Abstractions for AI
- Computing in the Network – Lessons Learned and New Opportunities
- [I-D.yao-tsvwg-cco-problem-statement-and-usecases] Yao, K., Shiping, X., Li, Y., Huang, H., and D. KUTSCHER, "Collective Communication Optimization: Problem Statement and Use cases", Work in Progress, Internet-Draft, draft-yao-tsvwg-cco-problem-statement-and-usecases-00, 23 October 2023, https://datatracker.ietf.org/doc/html/draft-yao-tsvwg-cco-problem-statement-and-usecases-00.
- [inc] Klenk et al., B., "An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives", ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, <https:dx.doi.org/10.1109/ISCA45697.2020.00085>
- [netclone] Kim, G., "NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs", In Proceedings of the ACM SIGCOMM 2023 Conference (ACM SIGCOMM '23). Association for Computing Machinery, New York, NY, USA, 195-207, 2023 https://dl.acm.org/doi/10.1145/3603269.3604820
AINTEC-2023
It was nice to meet new and old friends at the recent Asian Internet Engineering Conference (AINTEC) in Hanoi last week.
I had the pleasure of contributing three elements to the program:
AINTEC Panel on 6G Research
I had the pleasure of moderating a on panel 6G Research Challenges at AINTEC-2023. The panelists were Serge Fdida, Abhimanyu Gosain, Jim Kurose, and George Michaelson.
Opportunities and Challenges for Future Network Systems Design?
The panel was discussing opportunities and challenges for future network systems design and tried to shed some light on what 6G might actually mean and what interesting research could and should be done.
5G Hype vs Reality
While many people are speculating about possible 6G features, it is quite instructive to review the adoption of current 5G technology. The panel discussed this from different perspectives. It was noted that quite many advanced 5G features, although specified, are not yet available, such as new core designs, low latency communication, positioning, and network slicing.
There may be different reasons for that. One reason that was mentioned the lack of demand. 5G seems to be mostly used as a reasonably fast bitpipe, i.e., as an access technology for mobile broadband. Economically, this means that it is difficult to monetize the network beyond that.
The panel discussed whether WiFi and 5G will integrate as just two "localized" link-level wireless technologies at the Internet edge, or whether 5G will actually provide a global end-to-end network, interconnected to the Internet.
Centralization and new Deployment Models
Another interesting topic is the evolution of deployment models and the changing nature of service provider and infrastructure providers. Not only are hyperscalers providing most of the "over-the-top" functionality and infrastructure today, they are also increasingly providing the cloud infrastructure and telco software functions, such as Microsoft with their "Azure for Operators" platform. The panel also discussed the issues of commercial consolidation and concentration in this regard.
Key Enablers for 6G
We discussed potential key enables for 6G, and the following topics were mentioned:
- AI/ML Native Interface
- New Spectrum Technologies: 7-24 GHz, 300GHz-1THz
- Networking as a Sensor: Shift from Radio KPI to system and service focused
- Communication-Compute-Data Centric
- Zero Trust Architecture (ZTA): Security and Trust
- Open Radio Access Networks
With respect to "Communication-Compute-Data-Centricity", we discussed whether it would be the mobile network infrastructure that would provide features in this direction, e.g., a better integration of computing and networking, or whether the network would just provide the access service, and computing etc. would continue being an application (also see my invited talk on computing in the network at AINTEC-2023). The panel expressed some preference for maintaing a separation of concerns, layering and the end-to-end principle.
Another topic that was discussed was the continuing "softwarization" and the application of Software-Defined Networking (SDN) principles. Future systems may see some more management support for applications (and application-related infrastructure), and there is certainly a trend towards more autonomous management and the use of machine learning for that.
References
- Azure for operators
- Tim Wu; The Master Switch: The Rise and Fall of Information Empires; Columbia Law School; 2010