Performance

Don't Take Our Word for It

Real benchmark results from our EU infrastructure. Independently verified technology. Open-source tools you can run yourself.

Measured on Infercom EU Infrastructure

These numbers come from our production API, served from our infrastructure in the EU. Not vendor marketing — real measurements you can reproduce.

Last measured: March 2026

EU Sovereign

MiniMax-M2.5

Time to First Token
266 ms
Output Throughput
426 tok/s
End-to-End Latency
2,615 ms

1K input / 1K output, single request

Variance <1.7%
EU Sovereign

gpt-oss-120b

Time to First Token
69 ms
Output Throughput
772 tok/s
End-to-End Latency
1,362 ms

1K input / 1K output, single request

Variance <0.1%
EU Sovereign

DeepSeek-V3.1

Time to First Token
149 ms
Output Throughput
273 tok/s
End-to-End Latency
3,813 ms

1K input / 1K output, single request

Variance <0.1%

Server-side metrics (p50). Measured using our open-source benchmark tool. Your client-side results will vary based on network location and conditions.

What We Measure

Three metrics that matter for production AI inference

Time to First Token (TTFT)

How quickly the model starts responding after your request. Critical for interactive applications and chat interfaces.

Output Throughput

Tokens generated per second after the first token. Determines how fast a complete response is delivered to the user.

End-to-End Latency

Total time from request to complete response. Includes TTFT plus full generation time. The number that matters for batch workloads.
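Given a request timestamp and per-token arrival timestamps from a streaming response, all three metrics fall out of simple subtraction. A minimal sketch (function and field names are illustrative, not part of our API):

```python
def inference_metrics(request_sent: float, token_times: list[float]) -> dict:
    """Compute TTFT, output throughput, and end-to-end latency from a
    request send time and per-token arrival timestamps (all in seconds)."""
    ttft = token_times[0] - request_sent          # time to first token
    e2e = token_times[-1] - request_sent          # total request-to-last-token time
    # Throughput counts tokens generated *after* the first token.
    gen_tokens = len(token_times) - 1
    gen_time = token_times[-1] - token_times[0]
    throughput = gen_tokens / gen_time if gen_time > 0 else 0.0
    return {"ttft_s": ttft, "throughput_tok_s": throughput, "e2e_s": e2e}

# Example: request at t=0, first token after 266 ms, then 1,000 tokens
# total arriving at a steady 426 tok/s (the MiniMax-M2.5 numbers above).
times = [0.266 + i / 426 for i in range(1000)]
m = inference_metrics(0.0, times)
```

Note that end-to-end latency is not independent of the other two: for a 1K-token response it is approximately TTFT + 999 / throughput.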

Open Source

Run Your Own Benchmark

Our benchmark tool is fully open source. Run it against our API with your free-tier API key, or clone the repository and run it locally. Same code, same methodology, your results.

Synthetic Performance

Fixed input/output token counts for controlled comparisons across models

Real Workload Simulation

Variable request rates mimicking production traffic patterns

Custom Dataset

Upload your own prompts and measure performance on your actual workload

Interactive Chat

Per-response metrics in a live chat interface — see TTFT and throughput on every reply
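Variable request rates of the kind the real-workload mode uses are commonly modeled as a Poisson arrival process, where inter-request gaps are exponentially distributed. A hedged sketch of that idea (the benchmark tool's actual scheduling may differ; names here are illustrative):

```python
import random

def poisson_schedule(rate_rps: float, duration_s: float, seed: int = 0) -> list[float]:
    """Generate request send times (seconds from start) with exponentially
    distributed inter-arrival gaps, i.e. a Poisson arrival process."""
    rng = random.Random(seed)  # fixed seed for reproducible runs
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_rps)  # mean gap = 1 / rate_rps
        if t >= duration_s:
            break
        times.append(t)
    return times

# 5 requests/second on average over a 60-second run: ~300 requests,
# but bursty rather than evenly spaced, like production traffic.
sched = poisson_schedule(rate_rps=5.0, duration_s=60.0, seed=42)
```

Sending requests at these offsets, rather than at a fixed interval, exercises the same queueing behavior that bursty production traffic does.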

Performance Without the Power Bill

Up to 5x more energy efficient than GPU-based inference. Speed doesn't have to come at the planet's expense.

10 kW

Per Rack

vs. 40–50 kW+ for equivalent GPU infrastructure. Dramatic reduction in power consumption and cooling requirements.

Air Cooled

No Liquid Cooling

Standard air cooling simplifies deployment, eliminates water usage, and reduces operational complexity and cost.

Up to 5x

More Efficient

More intelligence per joule of energy consumed. Validated by Stanford Hazy Research methodology.

Ready to shape the future of AI in Europe?

Join forward-thinking companies deploying sovereign AI with world-class performance