Optical circuit switching in disaggregated cloud and HPC infrastructures
In this blog summary of his white paper, Dr. Michael Enrico, Network Solutions
Architect at HUBER+SUHNER Polatis, explains how network disaggregation,
supported by an optically switched interconnect fabric, can make a critical
contribution to new network designs being developed to support AI and Machine
Learning.
Hyperscalers in cloud computing and other providers of high-performance
computing (HPC) services have to architect and scale their computing platforms to
meet client demand for AI applications while controlling CAPEX and reducing power
requirements. In particular, the processing power these applications require has
increased by orders of magnitude.
Rather than bundling the building blocks of these platforms tightly and inflexibly
into a fairly monolithic unit such as a standard server chassis, "disaggregating" the
requisite component parts or sub-systems avoids the inefficiency, the underutilisation
of key underlying resources and, importantly, the excessive power consumption that
are inevitable when more servers are simply "racked and stacked".
In a disaggregated architecture, these resources (CPU, memory, storage,
acceleration hardware in its various guises) are flexibly combined by interconnecting
them using integrated high-speed digital transceivers and a dedicated interconnect
fabric based on appropriate transport media and switching technologies. Each
resource type can then be combined and scaled independently of the others to meet
the demands of the expected workloads.
The principle of disaggregation is shown in the diagram above.
The required resources are bundled together in bespoke ratios to form flexibly
proportioned "bare metal" hardware hosts, "composed" on-the-fly from a common
pool of underlying fine-grained resources. The key building blocks in this case are
the lower-level resource elements themselves, such as CPUs, memory, storage and
various kinds of accelerators (GPUs, TPUs, FPGAs).
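As an illustration of the composition idea only (the pool, class and resource names below are hypothetical and do not come from the white paper), the following sketch models a shared pool of fine-grained resource elements and a "compose" step that carves out a bare-metal host in a bespoke ratio, returning the elements to the pool when the host is decomposed:

```python
# Illustrative sketch of composable infrastructure: hypothetical names,
# not an API or data model described in the white paper.
from dataclasses import dataclass, field

@dataclass
class ResourcePool:
    """Common pool of fine-grained resource elements, by type."""
    free: dict = field(default_factory=lambda: {
        "cpu": 64, "dram_gib": 4096, "gpu": 32, "fpga": 16, "nvme_tib": 200,
    })

    def compose_host(self, **request):
        """Carve a bare-metal host out of the pool in a bespoke ratio."""
        if any(self.free.get(k, 0) < v for k, v in request.items()):
            raise RuntimeError("insufficient free resources for this composition")
        for k, v in request.items():
            self.free[k] -= v
        return dict(request)            # descriptor of the "composed" host

    def decompose_host(self, host):
        """Return a host's resource elements to the shared pool."""
        for k, v in host.items():
            self.free[k] += v

pool = ResourcePool()
# A GPU-heavy training host and a memory-heavy host composed from one pool.
train_host = pool.compose_host(cpu=8, dram_gib=512, gpu=8)
infer_host = pool.compose_host(cpu=16, dram_gib=1024, gpu=2)
pool.decompose_host(train_host)         # resources freed for reuse elsewhere
```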
Several levels of disaggregation can be defined, related to the granularity at which
the resource blocks can be accessed and consumed.
In the most granular form of disaggregation, each resource block (e.g. a bank of
DRAM, a CPU, an accelerator) has onboard hardware to facilitate the necessary
high-speed, low-latency connection of its resources to an interconnect platform.
Less granular forms of resource disaggregation that are more compatible with
current hardware implementations may be seen as a way to facilitate a more gradual
transition towards fully disaggregated platforms.
These include:
- An approach in which the dynamically interconnected compute resources are
limited to accelerator hardware. By fitting single-mode optical transceivers, the
accelerators in one host can be flexibly and directly interconnected with those in
other hosts using a dedicated optical switching fabric that effectively acts as an
overlay to the packet switching fabric already used to provide most of the
interconnect between the hosts in the cluster.
- Going beyond the interconnection of accelerator cards alone to access more of
the resources already present in fleets of conventional servers, a dedicated PCIe
interconnection card, fitted with specialised SerDes processing hardware and
firmware and high-density, high-speed optical transceivers, acts as a high-performance
gateway between the PCIe-connected compute resources in that chassis and the
optical interconnect fabric.
The Interconnect Fabric
An optical interconnect fabric with transparent optical circuit switching provides
deterministic, circuit-switched, fixed-bandwidth data paths that are well suited to
interconnecting hardware resource elements that would otherwise be directly and
deterministically linked at a low level by dedicated traces on a server
motherboard or via a specific bus technology such as PCI Express.
Compared with an electrical fabric, it also promises significant reductions in the
power consumption of the fabric itself, much lower latencies on the data paths
through it, and a better ability to scale the fabric up and out physically. It is also
significantly more future-proof, thanks to the inherent transparency of the fabric to
the formats and line rates of the serialized data traffic between the optical
transceivers associated with the disaggregated resource elements.
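To make the circuit-switched model concrete, here is a minimal sketch of how a fabric controller might pin a deterministic point-to-point light path between two transceiver ports. The OpticalFabric class and the port naming are hypothetical stand-ins for whatever management interface a real optical circuit switch exposes, not a description of any particular product's API:

```python
# Hypothetical fabric-controller sketch: the class and port names are
# illustrative only, not a real switch management API.
class OpticalFabric:
    def __init__(self):
        self.cross_connects = {}        # ingress port -> egress port

    def connect(self, ingress, egress):
        """Set up a transparent, fixed-bandwidth light path between two ports.
        Once made, the path is deterministic (no packet queuing or contention)
        and agnostic to the line rate and format of the traffic it carries."""
        if ingress in self.cross_connects:
            raise RuntimeError(f"{ingress} is already cross-connected")
        self.cross_connects[ingress] = egress

    def disconnect(self, ingress):
        """Tear the path down so both ports can be re-patched elsewhere."""
        self.cross_connects.pop(ingress, None)

fabric = OpticalFabric()
# Patch a GPU transceiver in host A directly to one in host B, emulating the
# dedicated link that a motherboard trace or PCIe bus would otherwise provide.
fabric.connect("hostA/gpu0/tx", "hostB/gpu3/rx")
```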
The lowest-loss optical circuit switches, such as POLATIS DirectLight™ switches,
allow fabrics to be constructed with four or more stages of switching whilst
keeping within the optical loss budgets of the typical transceivers used with
disaggregated resource elements.
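The constraint here is simple additive loss budgeting. The dB figures below are illustrative placeholders rather than DirectLight or transceiver specifications, but they show why per-stage insertion loss determines how many switching stages a transparent fabric can tolerate:

```python
# Illustrative loss-budget check; the dB values are assumed placeholders,
# not quoted specifications for any particular switch or transceiver.
def max_switch_stages(transceiver_budget_db, per_stage_loss_db, fibre_margin_db):
    """How many switching stages fit inside the end-to-end optical loss budget?"""
    usable = transceiver_budget_db - fibre_margin_db
    return int(usable // per_stage_loss_db)

# Example: a 6 dB transceiver budget, 1 dB reserved for fibre and connectors,
# and ~1.2 dB of insertion loss per switch stage leave room for 4 stages.
print(max_switch_stages(transceiver_budget_db=6.0,
                        per_stage_loss_db=1.2,
                        fibre_margin_db=1.0))   # -> 4
```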
The Benefits of Disaggregated Computing
- Hardware computing platforms can be composed on-the-fly.
- Platforms can be scaled to whatever size and ratio of the available resource
types is appropriate to the kinds of workloads that will be run on the hardware.
- Platforms can be resized during the course of running a particular workload as
the resource consumption requirements evolve.
- Resources not required can be temporarily powered down, resulting in OPEX savings.
Operators can:
- Select best-of-breed vendors for the various component building blocks.
- Use those resources that support only the specific functions they need.
- Upgrade different types and/or blocks of resource elements as and when
required.
For a more detailed discussion of the topic and to discover the benefits of disaggregated computing for operators, please read Michael’s latest white paper here:
https://www.polatis.com/White_Paper_POLATIS_Disaggregation_EN.pdf