whichllm Review: Best Local LLM for Your GPU (2026)

Tue, 09 Jun 2026 00:00:00 +0000

You’ve got a local LLM setup — Ollama, LM Studio, whatever. Now which model do you actually run?

That’s the question nobody’s really answering well. HuggingFace shows you download counts. Ollama search tells you what fits in VRAM. But “fits” and “best” are two very different things. I’ve spent way too many afternoons downloading model after model, testing them one by one, only to wonder if there’s something better I missed.

So when whichllm hit GitHub Trending at #10 with 3.5k stars, I paid attention. The pitch: a CLI tool that detects your hardware, pulls real benchmark data from LiveBench, Chatbot Arena, Aider, and the Open LLM Leaderboard, and tells you not just what can run, but what’s actually best for your machine.

I installed it and ran it across four GPU configurations (my actual machine, plus simulated RTX 4070, 4090, and 5090). Here’s what I found.

What whichllm Actually Does

So what is it? whichllm is a Python CLI that does three things:

Detects your hardware — GPU model, VRAM, CPU cores, system RAM, disk space
Pulls live benchmark data — merges scores from LiveBench, Artificial Analysis, Chatbot Arena ELO, Aider, and Open LLM Leaderboard
Recommends models — ranks them by a weighted score that accounts for benchmark quality, recency (confidence decay for older models), and VRAM estimates
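
The first of those steps, hardware detection, can be sketched with the Python standard library alone. This is a minimal sketch and not whichllm’s actual implementation: the function names are hypothetical, GPU detection assumes an NVIDIA card with nvidia-smi on the PATH, and system RAM is left out because the standard library has no portable call for it.

```python
import os
import shutil
import subprocess


def detect_gpus():
    """Return a list of (name, vram_mib) tuples, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    gpus = []
    for line in out.strip().splitlines():
        # Each line looks like "NVIDIA GeForce RTX 4090, 24564".
        name, _, mem = line.rpartition(",")
        try:
            gpus.append((name.strip(), int(mem.strip())))
        except ValueError:
            continue  # skip rows where memory is reported as N/A
    return gpus


def detect_hardware(path="."):
    """Collect the GPU list, logical CPU count, and free disk space for `path`."""
    return {
        "gpus": detect_gpus(),
        "cpu_cores": os.cpu_count(),
        "disk_free_gb": round(shutil.disk_usage(path).free / 1e9, 1),
    }


if __name__ == "__main__":
    print(detect_hardware())
```

On a machine with only integrated graphics, like mine, the GPU list simply comes back empty, and a real tool has to fall back to other detection paths.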

The key insight: it’s evidence-ranked, not capacity-ranked. Ollama tells you “a 7B model fits in 8GB VRAM,” which is technically true but not enough to choose on. Two models such as Qwen3-8B and Gemma-3-12B can both fit and still perform very differently in practice. whichllm tells you which one actually scores higher on current benchmarks.

Hands-On: Running whichllm on My Machine

Installation is the fastest I’ve seen for a Python CLI this year:

uvx whichllm@latest

That’s it. No pip install, no virtual env, no dependency hell. uvx downloads and runs it in one shot. Here’s what landed on my screen:

#	Model	Params	Quant	Published	Score
1	Qwen/Qwen3.6-27B	27.8B	Q6_K	2026-04-21	78.3
2	google/gemma-4-31B-it	32.7B	Q4_K_M	2026-03-11	73.5
3	Qwen/Qwen3-30B-A3B	30.5B	Q6_K	2025-04-27	67.6
4	google/gemma-4-26B-A4B-it	26.5B	Q6_K	2026-03-11	65.7
5	zai-org/GLM-4.7-Flash	31.2B	Q5_K_M	2026-01-19	64.7

My machine runs on integrated UHD Graphics 630 with shared memory, so it’s not exactly a powerhouse. But the tool correctly detected my hardware constraints and recommended models that would work within them. The #1 pick, Qwen3.6-27B in Q6_K, scored well ahead of the next option (+4.8 gap = high confidence).

What also stood out: the tool attached a speed caution to the top 3 picks because its speed estimates for them were low-confidence. That’s the kind of honest signal I want from a recommendation engine, not just “here’s the biggest model.”
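
The gap between “fits” and “best” is easy to see with the five rows in the results table. The sketch below is my own illustration, not whichllm’s code: the bits-per-weight figures are approximate values for llama.cpp K-quants, the size estimate covers weights only (no KV cache or runtime overhead), and the 32 GB memory budget is an arbitrary assumption.

```python
# Approximate bits per weight for llama.cpp K-quants (assumed values).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.56}

# (model, params in billions, quant, score) copied from the results table.
CANDIDATES = [
    ("Qwen/Qwen3.6-27B", 27.8, "Q6_K", 78.3),
    ("google/gemma-4-31B-it", 32.7, "Q4_K_M", 73.5),
    ("Qwen/Qwen3-30B-A3B", 30.5, "Q6_K", 67.6),
    ("google/gemma-4-26B-A4B-it", 26.5, "Q6_K", 65.7),
    ("zai-org/GLM-4.7-Flash", 31.2, "Q5_K_M", 64.7),
]


def weights_gb(params_b, quant):
    """Rough size of the weights alone, in GB."""
    return params_b * BITS_PER_WEIGHT[quant] / 8


def fits(budget_gb):
    """All candidates whose estimated weights fit in the memory budget."""
    return [c for c in CANDIDATES if weights_gb(c[1], c[2]) <= budget_gb]


def capacity_pick(budget_gb):
    """Largest model that fits: what a 'does it fit?' search optimizes for."""
    return max(fits(budget_gb), key=lambda c: c[1])


def evidence_pick(budget_gb):
    """Highest benchmark score among the models that fit."""
    return max(fits(budget_gb), key=lambda c: c[3])


if __name__ == "__main__":
    print("capacity:", capacity_pick(32)[0])
    print("evidence:", evidence_pick(32)[0])
```

With that budget, the capacity pick is google/gemma-4-31B-it (the most parameters), while the evidence pick is Qwen/Qwen3.6-27B (the highest score).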

Simulating GPU Upgrades: RTX 4070 vs 4090 vs 5090

Now here’s where whichllm gets really useful. The --gpu flag lets you simulate any GPU before you buy it:

whichllm --gpu "RTX 4090"
whichllm --gpu "RTX 5090"

I ran this across three hypothetical GPU setups and my current machine. Here’s the comparison table:

GPU	VRAM	Top Pick	Quant	Score	Est. tok/s
UHD Graphics 630	Shared	Qwen3.6-27B	Q6_K	78.3	~5
RTX 4070	12 GB	Qwen3-14B	Q5_K_M	75.1	~20
RTX 4090	24 GB	Qwen3.6-27B	Q5_K_M	92.4	~27
RTX 5090	32 GB	Qwen3.6-27B	Q6_K	94.3	~40

A few things jumped out:

On the RTX 4070 (12 GB), the top pick shifts to Qwen3-14B in Q5_K_M, scoring 75.1. That’s a solid daily driver for coding and chat: at this VRAM size, the 14B gives better speed and a smoother experience.

On the RTX 4090 (24 GB), things get interesting. Qwen3.6-27B in Q5_K_M scores 92.4 at ~27 tok/s. Compared with the 4070, that’s 17.3 more quality points and ~35% faster token generation.

On the RTX 5090 (32 GB), the best pick stays the same model (Qwen3.6-27B) but shifts to the Q6_K quant, for a 94.3 score at ~40 tok/s. The upgrade command validated this:

whichllm upgrade "RTX 4090" "RTX 5090"
# Verdict: worth it (≥12pt Q & ≥10 tok/s lift)

Going from the 4090 to the 5090 is genuinely worth it: about 13 more tok/s in the comparison table, plus 32 GB of VRAM that lets you push higher quants and bigger context windows.
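
If you want to check the upgrade math yourself, the deltas fall straight out of the comparison table. This is a small sketch using only the scores and tok/s estimates reported there. The function name is hypothetical, and it does not reproduce whatever thresholds the whichllm upgrade command applies internally.

```python
# (top-pick score, estimated tok/s) per GPU, copied from the comparison table.
RESULTS = {
    "RTX 4070": (75.1, 20),
    "RTX 4090": (92.4, 27),
    "RTX 5090": (94.3, 40),
}


def upgrade_delta(old, new):
    """Score gain, absolute tok/s gain, and relative tok/s gain for an upgrade."""
    old_score, old_tps = RESULTS[old]
    new_score, new_tps = RESULTS[new]
    return {
        "score_gain": round(new_score - old_score, 1),
        "tps_gain": new_tps - old_tps,
        "tps_gain_pct": round(100 * (new_tps - old_tps) / old_tps),
    }


if __name__ == "__main__":
    print(upgrade_delta("RTX 4070", "RTX 4090"))
    # {'score_gain': 17.3, 'tps_gain': 7, 'tps_gain_pct': 35}
    print(upgrade_delta("RTX 4090", "RTX 5090"))
    # {'score_gain': 1.9, 'tps_gain': 13, 'tps_gain_pct': 48}
```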

The Benchmark Engine — Why I Trust It More Than Random Reddit Recs

whichllm’s scoring isn’t a black box. It merges:

LiveBench — objective, contamination-avoiding benchmarks
Artificial Analysis — real-world inference speed data
Chatbot Arena ELO — human preference rankings (how actual users rate outputs)
Aider — code-editing benchmarks (scored by running unit tests)
Open LLM Leaderboard V2 — standardized evaluation suite

Each score is weighted, and older benchmarks decay in influence, so a model that topped the leaderboard 6 months ago doesn’t get equal weight with something fresh. That time-weighting alone fixes a huge blind spot in most recommendation tools.
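
To make the time-weighting idea concrete, here is one way a decayed merge could look. It is a sketch under my own assumptions, not whichllm’s formula: the exponential decay, the 180-day half-life, the per-source weights, and every name in the code are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str       # e.g. "LiveBench"
    score: float      # normalized to a 0-100 scale
    weight: float     # how much this source counts (hypothetical)
    age_days: float   # days since the result was published


def decay(age_days, half_life_days=180.0):
    """Exponential decay: a result loses half its influence every half-life."""
    return 0.5 ** (age_days / half_life_days)


def merged_score(evidence, half_life_days=180.0):
    """Weighted mean of scores, with each weight shrunk according to its age."""
    weighted = [
        (e.score, e.weight * decay(e.age_days, half_life_days))
        for e in evidence
    ]
    total = sum(w for _, w in weighted)
    if total == 0:
        raise ValueError("no usable evidence")
    return sum(s * w for s, w in weighted) / total


if __name__ == "__main__":
    evidence = [
        Evidence("old leaderboard run", 90.0, 1.0, 360),  # decays to weight 0.25
        Evidence("fresh benchmark run", 70.0, 1.0, 0),    # keeps weight 1.0
    ]
    print(merged_score(evidence))  # 74.0
```

In this toy example the year-old 90 only pulls the merged score up to 74, where a plain average of the two results would give 80.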

One thing I wish it did: show the individual benchmark breakdowns per model in the default view. You only get an aggregate score, and I’d love to see “this model kills it on coding tasks but is weak on reasoning” at a glance.

Quick Chat: `whichllm run`

The tool also has a one-shot chat command:

whichllm run "qwen 2.5 1.5b gguf"

It downloads the model and starts a conversation right in your terminal, which is handy for quick tests. I wouldn’t use it as a daily chat interface (Ollama is better for that), but as a “try before you commit” option, it works.

Limitations — What whichllm Doesn’t Do Well

Let me be straight about where this tool falls short.

No GPU benchmark data on its own. whichllm doesn’t benchmark your hardware. The token-per-second estimates are inferred from model size and GPU specs, not measured on your actual machine. A real benchmark run (like llama-bench) would give more accurate speed data.
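
For a sense of what “inferred from model size and GPU specs” can mean, a common rule of thumb treats decoding as memory-bound: each generated token has to read all the weights once, so memory bandwidth divided by weight size gives a ceiling on tok/s. The sketch below is that rule of thumb only, and I am not claiming it is how whichllm computes its numbers. The bits-per-weight values are approximate, the function names are hypothetical, and the 1008 GB/s figure is the RTX 4090’s published memory bandwidth.

```python
# Approximate bits per weight for llama.cpp K-quants (assumed values).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.56}


def weights_gb(params_b, quant):
    """Rough size of the weights alone, in GB (no KV cache or overhead)."""
    return params_b * BITS_PER_WEIGHT[quant] / 8


def decode_ceiling_tps(params_b, quant, bandwidth_gb_s):
    """Upper bound on tokens/s for a dense model when decoding is memory-bound."""
    return bandwidth_gb_s / weights_gb(params_b, quant)


if __name__ == "__main__":
    # Qwen3.6-27B (27.8B params) in Q5_K_M on an RTX 4090.
    ceiling = decode_ceiling_tps(27.8, "Q5_K_M", 1008)
    print(f"ceiling: {ceiling:.0f} tok/s")
```

That works out to a ceiling of roughly 51 tok/s, against the ~27 tok/s whichllm estimated for the same setup. Real throughput lands below the ceiling, which is exactly why a measured run with llama-bench beats any estimate.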

Weak offline mode. The benchmark data isn’t cached locally (yet), so if you’re offline the tool drops to a fallback mode that works but with reduced accuracy.

Not a model runner. It recommends models and can start a chat, but you’ll still want Ollama or LM Studio for day-to-day use. Think of it as a pre-purchase advisor and catalog browser, not a runtime.

Pair it with a memory layer like Mnemo and your model keeps context across sessions too.

Is It Worth Using?

Here’s my honest take.

Use it if: You’re shopping for a GPU and want to know what models it can actually run well. Or you have existing hardware and feel like you’re missing out on better models.

Skip it if: You already know your setup and have a model you’re happy with.

Still, for GPU shopping, whichllm saved me hours of cross-referencing VRAM sizes against HuggingFace model cards. I’d call that a win.

Quick Comparison: whichllm vs Alternatives

Feature	whichllm	Ollama Search	HuggingFace Models
Hardware auto-detection	✅	❌	❌
Multi-benchmark scoring	✅	❌	❌
Pre-purchase GPU simulation	✅	❌	❌
Time-weighted scores	✅	❌	❌
One-click chat	✅	✅	❌
JSON output for scripting	✅	❌	❌

Final Verdict

whichllm isn’t trying to replace Ollama or LM Studio. It’s solving a different problem: the “what should I run” question that everyone in the local LLM space hits.

At 3.5k GitHub stars and climbing (Trending #10 today), it’s early but actively maintained. I’ll be keeping it installed for the next time I’m GPU shopping.

If you want to dig deeper into the local AI tool ecosystem, check out my Headroom review — another tool that changes how you think about local LLM deployment.

💡 Recommended Resources

Disclosure: Some of the links below are affiliate links. If you purchase through them, I earn a small commission at no extra cost to you. All testing and opinions are my own.

Shopping for a new GPU to run local LLMs?

NVIDIA GeForce RTX 4090 (24 GB VRAM) — Top-tier consumer card for 27B+ models. Run it at Q5_K_M for ~27 tok/s:
→ RTX 4090 on Amazon (check current price)
NVIDIA GeForce RTX 5090 (32 GB VRAM) — Next-gen flagship. Higher quants, bigger context windows, ~40 tok/s on 27B models:
→ RTX 5090 on Amazon (check current price)
NVIDIA GeForce RTX 4070 (12 GB VRAM) — Solid mid-range for 7B-14B models. Practical daily driver for most users:
→ RTX 4070 on Amazon (check current price)

Already have a GPU but want cloud compute for bigger models?

Vultr Cloud GPU instances — Rent hourly GPU capacity when your local hardware isn't enough. No long-term commitment:
→ Vultr Cloud GPU (get $50-100 credit)

Last tested: June 2026. whichllm v0.5.8 on Windows via uvx. Benchmark data sourced from LiveBench, Chatbot Arena, and Open LLM Leaderboard. Scores are based on current benchmarks and may change — always verify performance for your specific hardware.

Tutorials on ToolGenix — AI Tools Discovery & Reviews