Dataflow Architecture

Dataflow architecture is a processor design in which computation is laid out as a pipeline and data streams continuously through it. GPUs, by contrast, execute models kernel by kernel, writing intermediate results to memory and fetching them back between operations. For LLM inference, streaming execution eliminates much of the memory traffic that limits token generation speed.

Kernel-by-kernel vs. streaming execution

A GPU executes a neural network as a sequence of kernels: run an operation, write the intermediate result out to memory, fetch it back for the next operation, synchronize, repeat. SambaNova's engineers note that each of those boundaries adds latency, memory traffic, and energy cost - a penalty paid on every token, compounding through the autoregressive decode phase where tokens are generated one at a time.

A dataflow processor instead maps the computation onto a grid of compute and memory units as a continuous pipeline: while one operation executes, data for the next is already being fetched, and intermediate activations stay local on the chip instead of making round trips to external memory. SambaNova's published SN40L paper describes fusing pipelines of 20 or more operators into a single kernel call - where conventional GPU fusion typically combines 1 to 5 operators - amortizing kernel launch overhead and reserving memory bandwidth for what matters: streaming weights and KV cache.
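To make the effect of fusion on memory traffic concrete, here is a minimal sketch in Python. It is a toy model, not a description of any real compiler or chip: it assumes a linear chain of operators, equal-sized intermediates, and that every kernel boundary costs one write plus one read of an intermediate tensor. The function name and the activation size are illustrative assumptions.

```python
def intermediate_traffic_bytes(num_ops: int, ops_per_kernel: int, activation_bytes: int) -> int:
    """Off-chip bytes moved for intermediates in a linear chain of operators.

    Toy model: every kernel boundary writes one intermediate tensor to
    external memory and the next kernel reads it back (2 transfers).
    Operators fused into the same kernel keep their intermediates on chip.
    """
    if num_ops < 1 or ops_per_kernel < 1:
        raise ValueError("num_ops and ops_per_kernel must be >= 1")
    num_kernels = -(-num_ops // ops_per_kernel)  # ceiling division
    boundaries = num_kernels - 1
    return boundaries * 2 * activation_bytes


# Hypothetical intermediate size: 8192 values at 2 bytes each.
ACTIVATION_BYTES = 8192 * 2

# Compare no fusion, 5-operator fusion, and a fully fused 20-operator pipeline.
for ops_per_kernel in (1, 5, 20):
    traffic = intermediate_traffic_bytes(20, ops_per_kernel, ACTIVATION_BYTES)
    print(f"{ops_per_kernel:>2} ops per kernel: {traffic} bytes of intermediate traffic")
```

In this model, a 20-operator chain has 19 kernel boundaries when unfused, 3 when fused in groups of 5, and none when fused end to end, so intermediate traffic drops to zero. The model ignores weights and KV cache, which still have to be streamed in every case.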

Why it matters for inference specifically

LLM inference has two phases with opposite hardware appetites. Prefill (processing the prompt) is compute-heavy and parallel - work GPUs are well suited to, as SambaNova itself acknowledges. Decode (generating tokens) is memory-bandwidth-bound: each token requires streaming the model weights from memory, so execution efficiency and data movement determine speed. Dataflow execution is built for exactly this phase, and the phase is growing in importance: the shift toward agentic, generation-heavy workloads is sometimes called the decode era.
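The bandwidth-bound nature of decode can be expressed as a back-of-the-envelope ceiling: if every generated token requires streaming the active weights once, per-request speed cannot exceed memory bandwidth divided by the bytes streamed per token. The sketch below assumes exactly that and nothing more. The function name and all numbers are illustrative placeholders, not specifications of the SN40L or of any GPU.

```python
def decode_tokens_per_second_bound(
    active_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on single-request decode speed when weight streaming dominates.

    Assumes each token streams all active weights once and ignores KV cache
    reads, compute time, and any overlap or caching effects.
    """
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical example: 10 billion active parameters stored at 1 byte each,
# read over a memory system that sustains 1 TB/s.
bound = decode_tokens_per_second_bound(10e9, 1.0, 1e12)
print(f"bandwidth-bound ceiling: {bound:.0f} tokens/s")
```

Real systems land below this ceiling, and how far below depends on how much of the available bandwidth actually goes to weights and KV cache rather than to intermediate results.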

The practical consequence shows up in per-request speed at low batch sizes. GPU serving recovers decode efficiency by batching many users together, trading individual latency for aggregate throughput. A dataflow pipeline maintains high utilization without depending on large batches - delivering high single-request speed. On our SN40L-based infrastructure in Munich, that translates to 713 tokens per second on gpt-oss-120b and 428 tokens per second on MiniMax M2.7 Ultraspeed, measured per-request on production hardware.
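The batching trade-off can be sketched with a deliberately simple model. It assumes one decode step streams the weights once for the whole batch and then adds a fixed cost per request in the batch. Both timing values are hypothetical, and the model describes batched serving in general rather than any specific hardware.

```python
def decode_step_rates(batch_size: int, weight_stream_s: float, per_request_s: float):
    """Return (per-request tokens/s, aggregate tokens/s) for one decode step.

    Toy model: step time = shared weight streaming time
                           + batch_size * per-request cost.
    Each request in the batch receives one token per step.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    step_s = weight_stream_s + batch_size * per_request_s
    return 1.0 / step_s, batch_size / step_s


# Hypothetical timings: 10 ms to stream weights, 0.5 ms of extra work per request.
for batch_size in (1, 8, 32):
    per_request, aggregate = decode_step_rates(batch_size, 0.010, 0.0005)
    print(f"batch {batch_size:>2}: {per_request:6.1f} tok/s per request, "
          f"{aggregate:7.1f} tok/s aggregate")
```

Under these assumptions, growing the batch raises aggregate throughput because the weight-streaming cost is shared, while each individual request gets slower because every step takes longer.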

The memory system behind it

Streaming execution needs memory designed around it. The SN40L couples its dataflow fabric to a three-tier memory system - 520 MB on-chip SRAM, 64 GB HBM per socket, and directly attached DDR - which SambaNova describes as the way to scale the AI memory wall: SRAM holds the hottest local data, HBM streams the active model's weights, and the DDR tier holds additional models and prompt caches, enabling model switching in milliseconds rather than the seconds GPU stacks need.
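Why a directly attached tier can turn model switching from seconds into milliseconds comes down to transfer-time arithmetic: time equals bytes moved divided by the bandwidth of the path they travel. The sketch below uses placeholder values for model size and for both path bandwidths. None of them are published SN40L or GPU figures.

```python
def transfer_time_ms(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time in milliseconds to move size_bytes over a path with the given bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes / bandwidth_bytes_per_s * 1e3


# All values below are hypothetical placeholders for illustration only.
MODEL_BYTES = 16e9          # a 16 GB set of model weights
FAST_PATH_BYTES_PER_S = 200e9  # assumed bandwidth of a directly attached memory tier
SLOW_PATH_BYTES_PER_S = 4e9    # assumed bandwidth of a slower path, e.g. from host storage

fast_ms = transfer_time_ms(MODEL_BYTES, FAST_PATH_BYTES_PER_S)
slow_ms = transfer_time_ms(MODEL_BYTES, SLOW_PATH_BYTES_PER_S)
print(f"fast path: {fast_ms:.0f} ms, slow path: {slow_ms:.0f} ms")
```

With these placeholder numbers the same weights arrive in tens of milliseconds over the fast path and in several seconds over the slow one. Actual switching times depend on the real bandwidths and on how much of the model has to move.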
