Ollama is building an engine for Apple and renting one for NVIDIA

Ollama’s 0.30.0 release notes promised “faster performance on NVIDIA hardware.” No number. No benchmark. Just the word faster. The MLX engine on Apple Silicon, by contrast, shipped with a clean claim you could quote: about 2x decode on supported chips. One platform gets a measurement. The other gets an adjective. That gap repeats across the entire 0.30 line, and it is the whole story.

What actually shipped

I read every release from 0.30.0 through 0.30.10 and sorted the changes by who benefits. The pattern is not subtle.

The Apple Silicon column is first-party engineering. Ollama reworked the MLX sampler for generation quality. It moved MLX embedding layers to an NVFP4 global scale for better quantization. It hardened the linear and embedding layers for inference stability, then taught the MLX runner to snapshot during prompt processing and speculative decoding. It added Metal GPU offload so multimodal models stop falling back to CPU on Macs. Every one of those is Ollama writing and tuning its own code, on a roughly two-week cadence, for one hardware family.

The NVIDIA column reads differently. The llama.cpp backend “got updated.” A gemma4:12b floating-point crash on CUDA got fixed. Windows backend cleanup got more reliable. That is the list. Notice that even the grammar shifts: Apple Silicon gets sentences where Ollama is the subject doing the work, and NVIDIA gets sentences where things passively got done to it.

Apple Silicon (first-party MLX work)	NVIDIA / cross-platform (inherited or fixed)
Reworked MLX sampler (0.24)	“Faster on NVIDIA,” unquantified (0.30.0)
NVFP4 embedding scale (0.30.6)	`gemma4:12b` CUDA crash fixed (0.30.5)
Hardened MLX layers + runner snapshots (0.30.8)	Windows llama.cpp cleanup (0.30.4)
Metal multimodal offload (0.30.4)	llama.cpp backend version bumps (ongoing)

Models are the exception, and they land for everyone. Gemma 4, Nemotron-3-Ultra, the QAT weights that make a 26B-class model fit in sane VRAM: if you run a discrete card, those releases were good for you. But model availability is table stakes. It says nothing about whether the runtime underneath your card is getting faster.
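One way to check that on your own card, rather than inferring it from release notes, is to compute decode throughput from the timing fields Ollama returns with a generation. The following is a minimal sketch, assuming the final `/api/generate` response carries `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The helper names and the sample values are invented for illustration and are not measurements.

```python
def decode_tokens_per_second(response: dict) -> float:
    """Decode throughput from the final response of an Ollama generation.

    Assumes eval_count is the number of generated tokens and
    eval_duration is the time spent generating them, in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def percent_change(before_tps: float, after_tps: float) -> float:
    """Relative throughput change between two runs on the same card."""
    return (after_tps - before_tps) / before_tps * 100.0


# Invented sample payloads: same prompt and model, before and after an upgrade.
before = {"eval_count": 512, "eval_duration": 8_000_000_000}
after = {"eval_count": 512, "eval_duration": 6_400_000_000}

change = percent_change(
    decode_tokens_per_second(before),
    decode_tokens_per_second(after),
)
print(f"decode throughput change: {change:+.1f}%")
```

Run the same prompt against the same model before and after each upgrade, on the same hardware, and the comparison starts to mean something for your setup.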

Why this is structural

I am not calling this a snub. The real explanation is more durable than a snub would be, and worse for you.

Ollama owns the MLX engine. It is their code. When they want Apple Silicon to go faster, they open their own files and make it go faster. NVIDIA inference runs through llama.cpp, which Ollama consumes as a dependency. When they want your card to go faster, they wait for the llama.cpp project to ship it, bump the submodule, and pass it along. They optimize what they own and inherit the rest.

That is a rational call. MLX is young, the unified-memory design left obvious speed on the table, and chasing it produces clean 2x headlines on launch day. llama.cpp is mature, the CUDA paths are well-trodden, and squeezing another five percent out of them is somebody else’s full-time job. Of course the bespoke effort flows toward the easy wins and the marketing.

What this means if you run a discrete card

The practical version, for anyone on a 12 to 24 GB discrete GPU under Windows, WSL2, or Linux: your throughput roadmap does not live in Ollama’s release notes. It lives in the llama.cpp repository.

Stop refreshing the Ollama changelog waiting for your card to get faster. Wrong repo. The CUDA kernel work and the new quant formats that move your tokens-per-second land in llama.cpp first, sometimes well before Ollama bumps to a build that includes them. Watch ggml-org/llama.cpp, and read Ollama’s notes mainly to find out which llama.cpp build you just inherited and when.
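Watching the upstream repo can be as simple as polling its latest release. This is a rough sketch, assuming GitHub's public REST endpoint for the latest release and its `tag_name` and `published_at` fields. `summarize_release` is a hypothetical helper, and the sample payload is made up so the example runs offline.

```python
import json
import urllib.request
from datetime import datetime

LATEST_RELEASE_URL = "https://api.github.com/repos/ggml-org/llama.cpp/releases/latest"


def fetch_latest_release(url: str = LATEST_RELEASE_URL) -> dict:
    """Fetch the latest release payload from GitHub's REST API (needs network)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_release(release: dict) -> tuple[str, datetime]:
    """Extract the tag name and publication time from a release payload."""
    # GitHub timestamps end in "Z"; normalize so older Pythons can parse them.
    published = datetime.fromisoformat(release["published_at"].replace("Z", "+00:00"))
    return release["tag_name"], published


# Made-up payload for illustration; a live call would be
# summarize_release(fetch_latest_release()).
sample = {"tag_name": "b0000", "published_at": "2025-01-01T00:00:00Z"}
tag, published = summarize_release(sample)
print(f"latest upstream release: {tag}, published {published:%Y-%m-%d}")
```

Unauthenticated requests to the GitHub API are rate-limited, so a check once a day is plenty for this purpose.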

There is a second-order cost here too. The distance between what your card can do and what Ollama exposes is now a real lag, set by how often Ollama updates its submodule. In some releases that lag is days. In others it sits for a while, and you are running last month’s kernels without knowing it.
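The lag itself is simple date arithmetic once you know two things: when the newest upstream release was published, and when the build you inherited was published. A small sketch follows. The function names, the 30-day threshold, and the dates are all hypothetical. How you learn which build you inherited is left out here, since it depends on what the release notes tell you.

```python
from datetime import date


def kernel_lag_days(newest_upstream: date, inherited_build: date) -> int:
    """Days between the newest upstream release and the build you are running."""
    # Clamp at zero: if you already run the newest build, there is no lag.
    return max((newest_upstream - inherited_build).days, 0)


def is_stale(lag_days: int, threshold_days: int = 30) -> bool:
    """True when the lag has reached the (arbitrary) staleness threshold."""
    return lag_days >= threshold_days


# Hypothetical dates for illustration only.
lag = kernel_lag_days(date(2025, 3, 30), date(2025, 3, 2))
print(f"lag: {lag} days, stale: {is_stale(lag)}")
```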

Is this bad? Mostly no

I want to be straight rather than sell outrage. llama.cpp is one of the most battle-tested local inference engines in existence. It supports more hardware than any Apple-native stack could, and being downstream of it means your NVIDIA setup benefits from a contributor base Ollama could never staff internally. Sitting downstream of the best open inference project is not a punishment.

You keep the speed. What moves to Apple Silicon is the narrative and the attention. The quantified wins, the launch-day benchmarks, the “look how fast local got” energy all flow there, because that is where Ollama spends its own engineering hours. If you bought a discrete GPU for local AI and you have felt like the story moved on without you, you are reading the room right. It did. The hardware still works exactly as well as it did last month. It is just not the protagonist anymore, and nothing in the 0.30 line suggests that reverses.

So calibrate. Run NVIDIA for the raw compute and the VRAM, not because you expect Ollama to lavish first-party attention on it. Watch the repo that actually governs your performance. And treat every Apple Silicon decode benchmark as what it is: a real number, on a different architecture, that tells you nothing about your card.¹

¹ MLX decode figures are measured Apple-Silicon-against-older-Apple-Silicon, on unified memory. They are not comparable to CUDA throughput on discrete VRAM. Different memory model, different bottleneck, different sampler. A 2x MLX number is not 2x anything on an NVIDIA card.
