Kernel-by-kernel vs. streaming execution

A GPU executes a neural network as a sequence of kernels: run an operation, write the intermediate result out to memory, fetch it back for the next operation, synchronize, repeat. SambaNova's engineers note that each of those boundaries adds latency, memory traffic, and energy cost - a penalty paid on every token, compounding through the autoregressive decode phase where tokens are generated one at a time.

A dataflow processor instead maps the computation onto a grid of compute and memory units as a continuous pipeline: while one operation executes, data for the next is already being fetched, and intermediate activations stay local on the chip instead of making round trips to external memory. SambaNova's published SN40L paper describes fusing pipelines of 20 or more operators into a single kernel call - where conventional GPU fusion typically combines 1 to 5 operators - amortizing kernel launch overhead and reserving memory bandwidth for what matters: streaming weights and KV cache.
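To make the effect of fusion concrete, the sketch below counts off-chip memory traffic and kernel launches for a chain of operators. It is a toy model, not SambaNova's or any GPU vendor's cost model: the function name, the 1 MiB activation size, and the assumption that every kernel reads its input once and writes its output once are all illustrative.

```python
def kernel_chain_cost(num_ops: int, activation_bytes: int, ops_per_kernel: int):
    """Return (kernel_launches, off_chip_bytes) for a linear chain of operators.

    Toy assumption: each kernel reads one activation tensor from external
    memory and writes one back; intermediates inside a fused kernel stay
    on chip and cost no external traffic.
    """
    if num_ops <= 0 or ops_per_kernel <= 0 or activation_bytes < 0:
        raise ValueError("num_ops and ops_per_kernel must be positive")
    launches = -(-num_ops // ops_per_kernel)  # ceiling division
    off_chip_bytes = launches * 2 * activation_bytes  # one read + one write each
    return launches, off_chip_bytes


MIB = 1 << 20
NUM_OPS = 20  # chain length taken from the "20 or more operators" figure above

for ops_per_kernel in (1, 5, 20):
    launches, traffic = kernel_chain_cost(NUM_OPS, MIB, ops_per_kernel)
    print(f"{ops_per_kernel:>2} ops/kernel: {launches:>2} launches, "
          f"{traffic // MIB} MiB off-chip traffic")
```

Under these assumptions, one operator per kernel costs 20 launches and 40 MiB of traffic, groups of 5 cost 4 launches and 8 MiB, and a single 20-operator kernel costs 1 launch and 2 MiB. Real kernels also move weights, so the absolute numbers mean little; the ratio is the point.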

Why it matters for inference specifically

LLM inference has two phases with opposite hardware appetites. Prefill (processing the prompt) is compute-heavy and parallel - work GPUs are well suited to, as SambaNova itself acknowledges. Decode (generating tokens) is memory-bandwidth-bound: each new token requires streaming the model weights from memory again, so data movement rather than raw compute determines speed. Dataflow execution is built for exactly this phase, and the phase is growing in importance: the current shift toward agentic, generation-heavy workloads is what the industry calls the decode era.
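The bandwidth-bound nature of decode can be expressed as a back-of-the-envelope ceiling: if every generated token requires reading a fixed number of bytes from memory, tokens per second cannot exceed bandwidth divided by that byte count. The sketch below encodes that ratio. The function name and both input figures are hypothetical and do not describe any specific chip or model.

```python
def decode_tokens_per_second_bound(bytes_read_per_token: float,
                                   bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-request decode speed when memory-bound.

    Assumes every token must stream `bytes_read_per_token` bytes (weights
    plus KV cache) and that nothing else limits the pipeline.
    """
    if bytes_read_per_token <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("both arguments must be positive")
    return bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical inputs: 10 GB read per token, 1 TB/s of memory bandwidth.
bound = decode_tokens_per_second_bound(10e9, 1e12)
print(f"at most {bound:.0f} tokens/s per request")  # at most 100 tokens/s per request
```

The bound only moves if bandwidth goes up or bytes per token go down, which is why keeping intermediate activations on chip leaves more of the bandwidth for weights and KV cache.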

The practical consequence shows up in per-request speed at low batch sizes. GPU serving recovers decode efficiency by batching many users together, trading individual latency for aggregate throughput. A dataflow pipeline maintains high utilization without depending on large batches - delivering high single-request speed. On our SN40L-based infrastructure in Munich, that translates to 713 tokens per second on gpt-oss-120b and 428 tokens per second on MiniMax M2.7 Ultraspeed, measured per-request on production hardware.
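The batching trade-off described above can be sketched with a simple step-time model: one decode step streams the weights once (a cost shared by the whole batch) and then does some per-sequence work, such as reading each request's KV cache. The model and its two timing constants are assumptions chosen for illustration, not measurements of any GPU or of the SN40L.

```python
def decode_step_rates(batch_size: int, weight_read_s: float, per_sequence_s: float):
    """Return (per_request_tokens_per_s, aggregate_tokens_per_s).

    Toy model: step time = shared weight-streaming time
                         + batch_size * per-sequence time.
    Each request receives one token per step.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    step_s = weight_read_s + batch_size * per_sequence_s
    return 1.0 / step_s, batch_size / step_s


# Hypothetical timings: 10 ms to stream weights, 0.5 ms of per-sequence work.
for batch in (1, 8, 32):
    per_request, aggregate = decode_step_rates(batch, 0.010, 0.0005)
    print(f"batch {batch:>2}: {per_request:6.1f} tok/s per request, "
          f"{aggregate:7.1f} tok/s aggregate")
```

With these inputs, going from batch 1 to batch 32 raises aggregate throughput from about 95 to about 1,231 tokens per second while each individual request slows from about 95 to about 38. That is the latency-for-throughput exchange the paragraph describes.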

The memory system behind it

Streaming execution needs memory designed around it. The SN40L couples its dataflow fabric to a three-tier memory system - 520 MB on-chip SRAM, 64 GB HBM per socket, and directly attached DDR - which SambaNova describes as the way to scale the AI memory wall: SRAM holds the hottest local data, HBM streams the active model's weights, and the DDR tier holds additional models and prompt caches, enabling model switching in milliseconds rather than the seconds GPU stacks need.
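As a rough illustration of how data might be spread across three tiers, the sketch below places items, hottest first, into the fastest tier that still has room. The SRAM and HBM capacities come from the figures above; the DDR capacity, every item size, and the greedy policy itself are hypothetical and are not taken from SambaNova's software.

```python
from dataclasses import dataclass, field

MB = 10**6
GB = 10**9


@dataclass
class Tier:
    name: str
    capacity_bytes: int
    used_bytes: int = 0
    contents: list = field(default_factory=list)

    def try_place(self, label: str, size_bytes: int) -> bool:
        """Store the item if it fits; report whether it was stored."""
        if self.used_bytes + size_bytes > self.capacity_bytes:
            return False
        self.used_bytes += size_bytes
        self.contents.append(label)
        return True


def place(items, tiers):
    """Greedy placement: items are hottest first, tiers are fastest first."""
    placement = {}
    for label, size_bytes in items:
        for tier in tiers:
            if tier.try_place(label, size_bytes):
                placement[label] = tier.name
                break
        else:
            raise MemoryError(f"{label} does not fit in any tier")
    return placement


tiers = [
    Tier("SRAM", 520 * MB),
    Tier("HBM", 64 * GB),
    Tier("DDR", 1000 * GB),  # hypothetical capacity, not stated in the text
]
items = [  # hypothetical sizes, hottest first
    ("live activations", 200 * MB),
    ("active model weights", 60 * GB),
    ("standby model", 60 * GB),
    ("prompt cache", 20 * GB),
]
placement = place(items, tiers)
for label, tier_name in placement.items():
    print(f"{label:>22} -> {tier_name}")
```

Here the activations land in SRAM, the active model's weights in HBM, and the standby model and prompt cache in DDR, mirroring the division of labour described above. A real allocator would also weigh access frequency and bandwidth, which this greedy pass ignores.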

Related terms

RDU (Reconfigurable Dataflow Unit)

SambaNova's AI processor - purpose-built AI chips designed for dataflow execution instead of instruction-by-instruction processing.

Prefill vs. Decode

The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.

Latency vs. Throughput

The fundamental serving trade-off: total system output vs. each user's speed.

Parameters

A model's learned weights - the rough measure of its size and capacity, and the direct driver of its memory, speed, and cost.
