Two Ollama servers ran the same prompts on the same models. In every chart, a higher number means faster. Each chart below shows the same comparison in a different way.
How much faster is RunPod for each model? Numbers below show the multiplier (e.g. "10×" = ten times faster).
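As a rough illustration of how a multiplier like this can be derived, the sketch below divides one server's tokens/sec by the other's. The function names and the example figures are hypothetical, not taken from the benchmark, and the charts may compute the ratio differently.

```python
def speedup(fast_tps: float, slow_tps: float) -> float:
    """Return how many times faster the first server is than the second.

    Both arguments are tokens/sec for the same model (assumed to be medians).
    """
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


def format_multiplier(value: float) -> str:
    """Format a ratio the way the chart labels read, e.g. '10.0×'."""
    return f"{value:.1f}×"


# Made-up figures for illustration only: 50 tokens/sec vs 5 tokens/sec.
example = speedup(50.0, 5.0)
print(format_multiplier(example))
```

A ratio above 1 means the first server is faster; a ratio below 1 would mean it is slower.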
Tokens per second is how fast the model produces output (a token is roughly a word or a piece of a word). Higher is better. Bars show the median across all tests.
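One plausible way to get these numbers, sketched below, uses the `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds) fields that Ollama returns with a completed generate response. Whether this benchmark used exactly those fields is an assumption; the helper names and sample values are made up for illustration.

```python
from statistics import median


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's token count and nanosecond duration to tokens/sec."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def median_tps(responses: list[dict]) -> float:
    """Median tokens/sec across a list of response dicts for one model."""
    return median(
        tokens_per_second(r["eval_count"], r["eval_duration"]) for r in responses
    )


# Illustrative responses only; real ones carry many more fields.
sample = [
    {"eval_count": 100, "eval_duration": 2_000_000_000},
    {"eval_count": 300, "eval_duration": 3_000_000_000},
    {"eval_count": 80, "eval_duration": 1_000_000_000},
]
print(median_tps(sample))
```

The median is less sensitive than the mean to a single unusually slow or fast run, which is why it suits a bar chart summarizing many tests.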
Same models, four different load patterns: sequential = one request at a time, concurrent = several at once, queued = a stream of requests, mixed = all models hit at the same time.
When many requests run at once, what is the total tokens/sec the server produces across all of them? Think of this as the kitchen's output rate when the restaurant is busy. Only the "concurrent", "queued", and "mixed" tests stress this.
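A minimal sketch of this kind of total, assuming each request is logged with a start time, an end time, and a token count: add up all tokens and divide by the wall-clock window from the first start to the last end. The `RequestRecord` type and the sample values are hypothetical, and the benchmark may define the window differently.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float  # when the request was sent, in seconds
    end_s: float    # when the last token arrived, in seconds
    tokens: int     # tokens generated for this request


def aggregate_tps(records: list[RequestRecord]) -> float:
    """Total tokens divided by the wall-clock span covering all requests."""
    if not records:
        raise ValueError("no records")
    window = max(r.end_s for r in records) - min(r.start_s for r in records)
    if window <= 0:
        raise ValueError("window must be positive")
    return sum(r.tokens for r in records) / window


# Three overlapping requests, illustrative values only.
overlapping = [
    RequestRecord(0.0, 4.0, 100),
    RequestRecord(1.0, 5.0, 150),
    RequestRecord(2.0, 5.0, 50),
]
print(aggregate_tps(overlapping))
```

Because the requests overlap, this total can exceed the tokens/sec any single request sees, which is the difference between the kitchen's output rate and how fast one table is served.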
How to read this: