What an RDU is

The RDU (Reconfigurable Dataflow Unit) is the processor at the heart of SambaNova's systems - the hardware our platform runs on. Rather than a fixed instruction pipeline, the chip provides a grid of Pattern Compute Units (PCUs) and Pattern Memory Units (PMUs) that the compiler configures per model: operations are laid out spatially, and tensors stream through them as a pipeline. This is the dataflow execution model. SambaNova describes the result as data flowing from one AI operation to the next like an assembly line, with data for the next operation fetched while the current one is still running.
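The difference between kernel-by-kernel execution and a streaming pipeline can be illustrated with a toy sketch in plain Python. This is only an analogy, not how the RDU compiler or hardware works: the function names `run_unfused` and `run_pipelined` are hypothetical, and scalar lambdas stand in for real tensor operations.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Op = Callable[[float], float]


def run_unfused(ops: List[Op], data: List[float]) -> Tuple[List[float], int]:
    """Kernel-by-kernel style: every op processes the whole input and
    writes a full intermediate buffer before the next op starts."""
    buffers_written = 0
    current = data
    for op in ops:
        current = [op(x) for x in current]  # full intermediate materialized
        buffers_written += 1
    return current, buffers_written


def run_pipelined(ops: List[Op], data: Iterable[float]) -> Iterator[float]:
    """Pipeline style: each element flows through every op in turn,
    so no full-size intermediate buffer is built between stages."""
    stream: Iterator[float] = iter(data)
    for op in ops:
        stream = map(op, stream)  # lazily chain this stage onto the stream
    return stream


# Hypothetical three-stage "model": scale, shift, ReLU-like clamp.
ops: List[Op] = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: max(x, 0.0)]
data = [-1.0, 0.5, 2.0]

unfused_result, buffers = run_unfused(ops, data)
pipelined_result = list(run_pipelined(ops, data))
```

Both paths produce the same values; the unfused path writes one full-size intermediate list per operation, while the pipelined path only materializes the final output.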

The current production chip, the SN40L, pairs that compute fabric with a three-tier memory system: 520 MB of on-chip SRAM for the hottest data, 64 GB of co-packaged HBM that streams model weights, and directly attached DDR memory for prompt caching and holding a catalog of models. Each SN40L socket delivers 638 BF16 teraFLOPS from 1,040 compute units.
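To put the headline figures in proportion, a little arithmetic helps. The sketch below uses only the numbers quoted above; the even split of peak throughput across compute units is a simplifying assumption (an average, not a per-unit specification), and decimal units (1 GB = 1000 MB) are assumed for the capacity ratio.

```python
# Figures quoted in the text for one SN40L socket.
SRAM_MB = 520
HBM_GB = 64
PEAK_BF16_TFLOPS = 638
COMPUTE_UNITS = 1040


def average_tflops_per_unit(peak_tflops: float, units: int) -> float:
    """Average peak throughput per compute unit (assumes an even split)."""
    return peak_tflops / units


def hbm_to_sram_ratio(hbm_gb: float, sram_mb: float, mb_per_gb: int = 1000) -> float:
    """How many times larger the HBM tier is than the on-chip SRAM tier."""
    return hbm_gb * mb_per_gb / sram_mb


per_unit = average_tflops_per_unit(PEAK_BF16_TFLOPS, COMPUTE_UNITS)
ratio = hbm_to_sram_ratio(HBM_GB, SRAM_MB)
print(f"~{per_unit:.3f} BF16 TFLOPS per compute unit on average")
print(f"HBM is ~{ratio:.0f}x the capacity of on-chip SRAM")
```

Under these assumptions each compute unit averages roughly 0.6 BF16 teraFLOPS, and the HBM tier holds about 123 times as much as the SRAM tier, which is why only the hottest data lives on-chip.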

Why it exists: the memory wall

SambaNova's engineering team frames the problem bluntly: AI inference is a data movement problem, not a compute problem. During the decode phase of inference, the hardware must read the model weights from memory for every generated token. On instruction-based architectures, that memory traffic, not arithmetic, dominates latency. The RDU's dataflow design attacks this directly: operator fusion keeps intermediate results on-chip, and in SambaNova's published SN40L paper, fusing operators into large pipelines achieved over 85% HBM bandwidth utilization while eliminating per-kernel launch overheads.
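The memory-wall argument can be made concrete with a back-of-the-envelope bound. The sketch below assumes a dense model in which every weight is read once per generated token at batch size 1; mixture-of-experts models read only their active experts, so the bound is looser for them. The model size, precision, and bandwidth in the example are hypothetical placeholders, not SN40L specifications; only the 85% utilization figure comes from the text.

```python
def min_seconds_per_token(
    num_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
    utilization: float = 1.0,
) -> float:
    """Lower bound on decode latency when weight reads dominate.

    Assumes every weight is streamed from memory once per token.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / (bandwidth_bytes_per_s * utilization)


def max_tokens_per_second(
    num_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
    utilization: float = 1.0,
) -> float:
    """Upper bound on single-stream decode speed under the same assumptions."""
    return 1.0 / min_seconds_per_token(
        num_params, bytes_per_param, bandwidth_bytes_per_s, utilization
    )


# Hypothetical example: 8B parameters, 2 bytes each (BF16),
# 1 TB/s of memory bandwidth (placeholder), 85% of it actually achieved.
bound = max_tokens_per_second(8e9, 2, 1e12, utilization=0.85)
print(f"at most ~{bound:.1f} tokens/s per stream")
```

Raising the achieved fraction of bandwidth lifts this ceiling proportionally, which is why sustained bandwidth utilization matters more than peak arithmetic throughput during decode.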

The peer-reviewed SN40L paper (MICRO 2024) reports speedups of 2x to 13x over an unfused baseline and 3.7x over a DGX H100 system on composition-of-experts inference workloads. Because the DDR tier sits directly on the chip's memory system, switching between model checkpoints takes milliseconds - SambaNova measured roughly 60-90 ms for hot-swaps that take around 800 ms on GPU-based serving stacks.
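The hot-swap figures can be turned into ratios with a few lines of arithmetic. The swap times below are the ones quoted above; the 1,000 ms of useful serving time per request is a hypothetical value chosen only to show how swap cost affects a request that needs a model change.

```python
GPU_SWAP_MS = 800.0
RDU_SWAP_MS_RANGE = (60.0, 90.0)  # fastest and slowest quoted hot-swap


def speedup(baseline_ms: float, improved_ms: float) -> float:
    """How many times faster the improved swap is than the baseline."""
    return baseline_ms / improved_ms


def swap_overhead_fraction(swap_ms: float, serve_ms: float) -> float:
    """Share of total request time spent swapping the model in."""
    return swap_ms / (swap_ms + serve_ms)


fastest, slowest = RDU_SWAP_MS_RANGE
low = speedup(GPU_SWAP_MS, slowest)
high = speedup(GPU_SWAP_MS, fastest)
print(f"hot-swap is roughly {low:.1f}x to {high:.1f}x faster")

SERVE_MS = 1000.0  # hypothetical useful work per request
gpu_share = swap_overhead_fraction(GPU_SWAP_MS, SERVE_MS)
rdu_share = swap_overhead_fraction(slowest, SERVE_MS)
print(f"swap overhead: {gpu_share:.0%} vs {rdu_share:.0%} of the request")
```

With the quoted figures the swap is roughly 9 to 13 times faster, and under the hypothetical one-second request the swap shrinks from close to half of the total time to under a tenth.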

RDUs in practice

A SambaRack combines 16 RDUs into a single air-cooled 19-inch rack drawing about 10 kW typical power - standard datacenter infrastructure, no liquid cooling. One rack runs frontier-scale models, including 671B-parameter-class models, that would otherwise require multiple GPU racks. Our platform operates 8 SambaRack SN40L systems (128 RDUs) in Munich, and the speed this architecture delivers is measurable: 713 tokens per second on gpt-oss-120b, independently verifiable through our published benchmarks.
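The deployment numbers follow from simple multiplication, sketched below. Rack count, RDUs per rack, typical rack power, and the throughput figure are taken from the text; dividing rack power evenly across RDUs ignores everything else in the rack, and the 1,000-token response length is a hypothetical example.

```python
RACKS = 8
RDUS_PER_RACK = 16
RACK_POWER_KW = 10.0        # typical, per the text
TOKENS_PER_SECOND = 713.0   # quoted for gpt-oss-120b


def total_rdus(racks: int, rdus_per_rack: int) -> int:
    """Number of RDUs across all racks."""
    return racks * rdus_per_rack


def total_power_kw(racks: int, rack_power_kw: float) -> float:
    """Typical power draw of the whole installation."""
    return racks * rack_power_kw


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response at a steady decode rate."""
    return num_tokens / tokens_per_second


rdus = total_rdus(RACKS, RDUS_PER_RACK)
power = total_power_kw(RACKS, RACK_POWER_KW)
kw_per_rdu = RACK_POWER_KW / RDUS_PER_RACK  # rough share, whole-rack power
seconds = generation_seconds(1000, TOKENS_PER_SECOND)  # hypothetical length
print(f"{rdus} RDUs, ~{power:.0f} kW typical, ~{kw_per_rdu:.3f} kW per RDU")
print(f"1,000 tokens in ~{seconds:.2f} s at {TOKENS_PER_SECOND:.0f} tokens/s")
```

This reproduces the 128-RDU total and puts the whole installation at about 80 kW typical power; at the quoted rate a 1,000-token response takes about 1.4 seconds of generation time.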

Related terms

Dataflow Architecture

The execution model where data streams through operations as a pipeline - eliminating the kernel-by-kernel round trips of GPU execution.

Prefill vs. Decode

The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.

Tokens per Second

The standard unit for LLM generation speed - and why the same number can mean two different things.

Parameters

A model's learned weights - the rough measure of its size and capacity, and the direct driver of its memory, speed, and cost.

RDU (Reconfigurable Dataflow Unit)
