HKUST Internet Research Workshop 2024

On March 15, 2024, in the week before the IETF-119 meeting in Brisbane, Zili Meng and I organized the 1st HKUST Internet Research Workshop, which brought together networking and systems researchers from around the globe for a live forum to discuss innovative ideas at their early stages. The workshop took place at HKUST's Clear Water Bay campus in Hong Kong.

We ran the workshop like a "one-day Dagstuhl seminar", focusing on discussion and idea exchange rather than conference-style presentations. The objective was to identify topics and connect like-minded people for potential future collaboration, which worked out really well.

The agenda was:

  1. Dirk Kutscher: Networking for Distributed ML
  2. Zili Meng: Overview of the Low-Latency Video Delivery Pipeline
  3. Jianfei He: The philosophy behind computer networking
  4. Carsten Bormann: Towards a device-infrastructure continuum in IoT and OT networks
  5. Zili Meng: Network Research – Academia, Industry, or Both?

Dirk Kutscher: Networking for Distributed ML

With the ever-increasing demand for compute power from large-scale machine learning training, we have come to realize not only that Moore's Law no longer absorbs the growing performance demand automatically, but also that the growth rate of training FLOPs for transformers and other large-scale models exhibits far larger exponential factors.
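
To put "far larger exponential factors" into perspective, here is a back-of-the-envelope comparison; the doubling times are illustrative assumptions on my part (roughly 24 months for hardware performance, a few months for training compute, as reported in several public analyses), not figures from the workshop.

    # Back-of-the-envelope growth comparison (illustrative assumptions only):
    # hardware performance doubling every ~24 months vs. training-compute demand
    # doubling every ~6 months, over a five-year horizon.

    def growth_factor(months: float, doubling_time_months: float) -> float:
        """Multiplicative growth after `months`, given a fixed doubling time."""
        return 2 ** (months / doubling_time_months)

    horizon = 60                          # five years, in months
    hw = growth_factor(horizon, 24.0)     # assumed hardware trend
    ml = growth_factor(horizon, 6.0)      # assumed training-FLOPs trend

    print(f"Hardware performance growth over 5 years: {hw:.1f}x")
    print(f"Training-compute demand growth over 5 years: {ml:.1f}x")
    print(f"Gap left for systems and networking to close: {ml / hw:.0f}x")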

This has been well illustrated by presentations in an AI data center side meeting at IETF-118, for example by Omer Shabtai, who talked about distributed training in data centers.

With increasing scale, communication over networks becomes a bottleneck, and the question arises which system designs, protocols, and in-network support strategies could improve performance.

Current distributed machine learning systems typically use a technology called Collective Communication that was developed as a Message Passing Interface (MPI) abstraction for high-performance computing (HPC). Collective Communication combines standardized aggregation and reduction functions with communication abstractions, e.g., for "broadcasting" or "unicasting" results.
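
As a minimal illustration of this abstraction (assuming the mpi4py MPI binding; NCCL and similar libraries offer the same collective semantics), each worker contributes a local value, a reduction function combines the contributions, and the result is delivered to all participants in one collective call:

    # Minimal Collective Communication example, assuming the mpi4py MPI binding.
    # Every worker contributes a local value (a stand-in for a gradient); allreduce
    # applies the reduction (sum) and delivers the result to all participants.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    local_gradient = float(rank + 1)                          # locally computed value
    global_sum = comm.allreduce(local_gradient, op=MPI.SUM)   # reduce + distribute

    print(f"rank {rank}: local={local_gradient}, allreduce result={global_sum}")
    # Run with, e.g.: mpirun -n 4 python allreduce_demo.py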

Collective Communication is implemented by a few popular libraries such as OpenMPI and Nvidia's NCCL. When used in IP networks, the communication is usually mapped to iterations of peer-to-peer interactions, e.g., organizing nodes in a ring and sending data for aggregation along such rings. One potential way to achieve better performance would be to perform the aggregation "in the network", as in HPC systems, e.g., using the Scalable Hierarchical Aggregation Protocol (SHArP). Previous work has attempted to do this with P4-based dataplane programming; however, such approaches are typically limited due to the mostly stateless operation of the corresponding network elements.
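
To make the ring mapping concrete, below is a self-contained simulation (no MPI required) of the classic ring allreduce: a reduce-scatter phase followed by an allgather phase, each taking N-1 steps for N nodes. It sketches the peer-to-peer iteration structure only, not any particular library's implementation.

    # Self-contained simulation of a ring allreduce over N logical nodes:
    # phase 1 (reduce-scatter): each node ends up with the full sum of one chunk;
    # phase 2 (allgather): the reduced chunks circulate until every node has all of them.
    N = 4
    # Each node holds a vector of N chunks (one scalar per chunk for simplicity).
    data = [[float(node * 10 + chunk) for chunk in range(N)] for node in range(N)]

    # Reduce-scatter: in step s, node i sends chunk (i - s) mod N to node (i + 1) mod N,
    # which accumulates it into its own copy of that chunk.
    for step in range(N - 1):
        for node in range(N):
            chunk = (node - step) % N
            data[(node + 1) % N][chunk] += data[node][chunk]

    # Now node i holds the fully reduced chunk (i + 1) mod N.
    # Allgather: in step s, node i forwards chunk (i + 1 - s) mod N to node (i + 1) mod N.
    for step in range(N - 1):
        for node in range(N):
            chunk = (node + 1 - step) % N
            data[(node + 1) % N][chunk] = data[node][chunk]

    expected = [sum(node * 10.0 + chunk for node in range(N)) for chunk in range(N)]
    assert all(data[node] == expected for node in range(N))
    print("All nodes hold the reduced vector:", expected)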

In large-scale training sessions running over shared infrastructure in multi-tenant data centers, communication needs to respond to congestion, packet loss, server overload, etc.; i.e., the features of typical transport protocols are needed.

I had previously discussed corresponding challenges and requirements in earlier Internet Drafts.

In my talk at HKIRW, I discussed ideas for corresponding transport protocols. There are interesting challenges in bringing together reliable communication, congestion control, flow control, single-destination as well as multi-destination communication, and in-network processing.

Zili Meng: Overview of the Low-Latency Video Delivery Pipeline

Zili talked about requirements for ultra-low latency in interactive streaming for the next generation of immersive applications. Some applications have really stringent low-latency requirements combined with the need for consistent service quality over many hours, and the talk suggested better coordination between all elements of the streaming and rendering pipeline.
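
To put these requirements into perspective, here is a toy motion-to-photon budget for cloud-rendered interactive video; the per-stage numbers and the target are illustrative assumptions on my part, not figures from the talk.

    # Toy motion-to-photon latency budget for cloud-rendered interactive video.
    # All numbers are illustrative assumptions, not measurements from the talk.
    budget_ms = {
        "input capture + uplink": 10.0,
        "rendering": 8.0,
        "encoding": 5.0,
        "downlink network (incl. queuing)": 20.0,
        "decoding": 5.0,
        "display refresh (avg. at 60 Hz)": 8.3,
    }

    total = sum(budget_ms.values())
    target_ms = 60.0   # assumed interactivity target of a few tens of milliseconds

    for stage, ms in budget_ms.items():
        print(f"{stage:<35s} {ms:5.1f} ms")
    print(f"{'total':<35s} {total:5.1f} ms (target: {target_ms:.0f} ms)")
    print("within budget" if total <= target_ms
          else "over budget -- pipeline stages must be coordinated")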

There was a discussion of how achievable these requirements are in the Internet and whether applications might be re-designed to provide acceptable user experience even without guaranteed high-bandwidth, low-latency service, for example by employing technologies such as semantic communication, prediction, and local control loops.

Jianfei He: The philosophy behind computer networking

In his talk, Jianfei He asked how the field of computer networking can be more precisely defined and how a more systematic approach could help with the understanding and design of future networked systems.

Specifically, he suggested basing designs on a solid understanding of the potentials and absolute constraints in a given field, such as Shannon's theory/limit, and on the notion of tradeoffs, i.e., the consequences of certain design decisions, as represented by the CAP theorem in distributed systems. He mentioned two examples: 1) routing protocols and 2) transport protocols.
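
Before turning to the two protocol examples, a small numerical illustration of the first kind of constraint: Shannon's capacity C = B * log2(1 + SNR) is an upper bound that no protocol or coding scheme can exceed (the parameter values below are arbitrary).

    import math

    def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
        """Shannon capacity C = B * log2(1 + SNR) for an AWGN channel."""
        snr_linear = 10 ** (snr_db / 10.0)
        return bandwidth_hz * math.log2(1.0 + snr_linear)

    # Example: a 20 MHz channel at 30 dB SNR -- an absolute limit for any design.
    c = shannon_capacity_bps(20e6, 30.0)
    print(f"Capacity bound: {c / 1e6:.1f} Mbit/s")   # ~199 Mbit/s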

For routing protocols, there are well-known tradeoffs between convergence time, scaling limits, and required bandwidth. With changed network properties (e.g., bandwidth), can we reason about options for shifting these tradeoffs?
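
As a toy illustration of one such tradeoff (the model and numbers are my own simplification, not from the talk): refreshing link-state information more frequently bounds staleness more tightly, but the flooding bandwidth grows with both the refresh rate and the topology size.

    # Toy model of a routing tradeoff: faster link-state refresh reduces worst-case
    # staleness but increases control-plane flooding traffic. The model (each link's
    # advertisement traverses every link once per interval) and the numbers are
    # simplifying assumptions, not from the talk.

    def flooding_bw_bps(num_links: int, lsa_bytes: int, refresh_interval_s: float) -> float:
        """Approximate per-link flooding bandwidth for periodic refreshes."""
        return num_links * lsa_bytes * 8 / refresh_interval_s

    num_links = 10_000    # assumed topology size
    lsa_bytes = 100       # assumed advertisement size

    for interval in (30.0, 5.0, 1.0):
        bw = flooding_bw_bps(num_links, lsa_bytes, interval)
        print(f"refresh every {interval:4.0f} s -> staleness bound ~{interval:.0f} s, "
              f"~{bw / 1e6:5.2f} Mbit/s flooding traffic per link")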

For transport protocols, there are goals such as reliability, congestion control, etc., and tradeoff relationships between packet loss, line utilization, delay, and buffer size. How would designs change if we changed the objective, e.g., to shortest flow completion time or shortest message completion time (or if we looked at collections of flows)? What if we added fairness to these objectives?
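
One concrete instance of the loss/utilization/delay/buffer tradeoff is router buffer sizing: the classical rule of thumb uses the bandwidth-delay product C*RTT, whereas Appenzeller et al. (SIGCOMM 2004) argued that C*RTT/sqrt(n) suffices for n desynchronized long-lived flows, at much lower worst-case queuing delay. The link speed, RTT, and flow count below are illustrative.

    import math

    # Buffer sizing as an instance of the utilization/delay/buffer tradeoff:
    # bandwidth-delay product (BDP) rule vs. BDP/sqrt(n) for n long-lived flows
    # (Appenzeller et al., SIGCOMM 2004). Parameter values are illustrative.
    link_bps = 10e9      # 10 Gbit/s link
    rtt_s = 0.1          # 100 ms round-trip time
    n_flows = 10_000     # concurrent long-lived flows

    bdp_bits = link_bps * rtt_s
    small_bits = bdp_bits / math.sqrt(n_flows)

    def max_queuing_delay_ms(buffer_bits: float) -> float:
        """Worst-case delay added by a full buffer draining at link speed."""
        return buffer_bits / link_bps * 1e3

    for name, bits in (("BDP rule", bdp_bits), ("BDP/sqrt(n) rule", small_bits)):
        print(f"{name:<17s}: {bits / 8 / 1e6:7.2f} MB buffer, "
              f"up to {max_queuing_delay_ms(bits):6.2f} ms extra delay")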

Jianfei asked whether it is possible to develop these tradeoffs/constraints into a more consistent theory.

Carsten Bormann: Towards a device-infrastructure continuum in IoT and OT networks

Carsten talked about requirements and available technologies for secure management of IoT devices in a device-infrastructure continuum in IoT and OT networks, where scale demands a high degree of automation at run-time and only limited individual device configuration (at installation time only). It is no longer possible to manually track each new "Thing" species.

Carsten mentioned technologies such as

  • RFC 8520: Manufacturer Usage Description (MUD);
  • W3C Web of Things description model; and
  • IETF Semantic Definition Format (SDF).

In his talk, Carsten formulated the goal of "Well-Informed Networking", i.e., an approach where networks can obtain sufficient information about the existing devices, their legitimate communication requirements, and their current status (device health).
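
As a rough sketch of what "well-informed networking" could look like operationally: a controller consumes a manufacturer-provided description of a device's legitimate communication patterns and derives per-device allow-list rules. The description format below is a simplified stand-in of my own, not the actual RFC 8520 MUD schema.

    # Rough sketch of "well-informed networking": derive per-device allow-list rules
    # from a manufacturer-provided description of legitimate communication.
    # The description format is a simplified stand-in, NOT the RFC 8520 MUD schema.
    from dataclasses import dataclass

    @dataclass
    class AllowRule:
        device_mac: str
        dst_host: str
        dst_port: int
        protocol: str

    # Hypothetical, simplified device description (as a manufacturer might publish it).
    device_description = {
        "device_type": "temperature-sensor",
        "allowed_flows": [
            {"dst_host": "telemetry.example-vendor.com", "dst_port": 443, "protocol": "tcp"},
            {"dst_host": "ntp.example-vendor.com", "dst_port": 123, "protocol": "udp"},
        ],
    }

    def derive_rules(mac: str, description: dict) -> list:
        """Translate the description into allow-list rules; anything else is denied."""
        return [
            AllowRule(mac, f["dst_host"], f["dst_port"], f["protocol"])
            for f in description["allowed_flows"]
        ]

    for rule in derive_rules("00:11:22:33:44:55", device_description):
        print(rule)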

Zili Meng: Network Research – Academia, Industry, or Both?

Zili discussed the significance of the consistently high number of industry and industry-only papers at major networking conferences. Often such papers are based on operational experience that can only be obtained by companies actually operating the corresponding systems.

Sometimes papers seem to get accepted not necessarily on the basis of their technical merits but because they report on "large-scale deployments".

When academics get involved in such work, it is often not in a driving position, but rather through students who work as interns at the corresponding companies. Naturally, such papers do not question the status quo and are generally not critical of the systems they discuss.

At the workshop, we discussed the changes in the networking research field over the past years, as well as the challenges of successful collaborations between academia and industry.

Written by dkutscher

April 6th, 2024 at 10:55 am