You’ve got a local LLM setup — Ollama, LM Studio, whatever. Now which model do you actually run?

That’s the question nobody’s really answering well. HuggingFace shows you download counts. Ollama search tells you what fits in VRAM. But “fits” and “best” are two very different things. I’ve spent way too many afternoons downloading model after model, testing them one by one, only to wonder if there’s something better I missed.

So when whichllm hit GitHub Trending at #10 with 3.5k stars, I paid attention. The pitch: a CLI tool that detects your hardware, pulls real benchmark data from LiveBench, Chatbot Arena, Aider, and the Open LLM Leaderboard, and tells you — not what can run — but what’s actually the best for your machine.

I installed it, ran it across four GPU configurations (my actual machine, plus simulated RTX 4070 / 4090 / 5090), and here’s what I found.

What whichllm Actually Does

So what is it? whichllm is a Python CLI that does three things:

  1. Detects your hardware — GPU model, VRAM, CPU cores, system RAM, disk space
  2. Pulls live benchmark data — merges scores from LiveBench, Artificial Analysis, Chatbot Arena ELO, Aider, and Open LLM Leaderboard
  3. Recommends models — ranks them by a weighted score that accounts for benchmark quality, recency (confidence decay for older models), and VRAM estimates

The key insight: it’s evidence-ranked, not capacity-ranked. Ollama tells you “a 7B model fits in 8GB VRAM,” which is technically true but useless: Qwen3-8B and Gemma-3-12B both fit, but they have very different real-world performance. whichllm tells you which one actually scores higher on current benchmarks.
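To make the “fits” versus “best” distinction concrete, here is a minimal sketch. It is not whichllm’s code: the candidate names, scores, the 0.5 GB overhead allowance, and the bits-per-weight figures are all placeholder assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Approximate bits per weight for common GGUF quants (rough figures, assumed).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.56}


@dataclass
class Candidate:
    name: str        # hypothetical model name
    params_b: float  # parameters, in billions
    quant: str
    score: float     # placeholder aggregate benchmark score


def est_vram_gb(c: Candidate, overhead_gb: float = 0.5) -> float:
    """Weights-only size plus a flat allowance for cache and buffers (assumed)."""
    return c.params_b * BITS_PER_WEIGHT[c.quant] / 8 + overhead_gb


def capacity_ranked(cands, vram_gb):
    """Everything that fits, biggest first: the 'what can run' answer."""
    fits = [c for c in cands if est_vram_gb(c) <= vram_gb]
    return sorted(fits, key=lambda c: c.params_b, reverse=True)


def evidence_ranked(cands, vram_gb):
    """Everything that fits, best benchmark score first."""
    fits = [c for c in cands if est_vram_gb(c) <= vram_gb]
    return sorted(fits, key=lambda c: c.score, reverse=True)


candidates = [
    Candidate("model-a", 8.0, "Q5_K_M", 70.0),
    Candidate("model-b", 12.0, "Q4_K_M", 62.0),
    Candidate("model-c", 27.0, "Q4_K_M", 80.0),  # too large for 8 GB
]

by_capacity = [c.name for c in capacity_ranked(candidates, vram_gb=8)]
by_evidence = [c.name for c in evidence_ranked(candidates, vram_gb=8)]
print(by_capacity, by_evidence)
```

Both lists contain the same two models; only the order differs, and that order is the whole point of the tool.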

Hands-On: Running whichllm on My Machine

Installation is the fastest I’ve seen for a Python CLI this year:

uvx whichllm@latest

That’s it. No pip install, no virtual env, no dependency hell. uvx downloads and runs it in one shot. Here’s what landed on my screen:

#  Model                      Params  Quant   Published   Score
1  Qwen/Qwen3.6-27B           27.8B   Q6_K    2026-04-21  78.3
2  google/gemma-4-31B-it      32.7B   Q4_K_M  2026-03-11  73.5
3  Qwen/Qwen3-30B-A3B         30.5B   Q6_K    2025-04-27  67.6
4  google/gemma-4-26B-A4B-it  26.5B   Q6_K    2026-03-11  65.7
5  zai-org/GLM-4.7-Flash      31.2B   Q5_K_M  2026-01-19  64.7

My machine runs on integrated UHD Graphics 630 with shared memory, so it’s not exactly a powerhouse. But the tool correctly detected my hardware constraints and recommended models that’d work within them. The #1 pick, Qwen3.6-27B in Q6_K, scored well ahead of the next option (+4.8 gap = high confidence).
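If you want to check that gap yourself, a small sketch like the one below parses the rows as pasted above and subtracts the top two scores. The six-column layout is assumed from the pasted text, not from any documented machine-readable format, and the point at which a gap counts as “high confidence” is the tool’s call, not something this snippet knows.

```python
rows = """\
1 Qwen/Qwen3.6-27B 27.8B Q6_K 2026-04-21 78.3
2 google/gemma-4-31B-it 32.7B Q4_K_M 2026-03-11 73.5
3 Qwen/Qwen3-30B-A3B 30.5B Q6_K 2025-04-27 67.6
4 google/gemma-4-26B-A4B-it 26.5B Q6_K 2026-03-11 65.7
5 zai-org/GLM-4.7-Flash 31.2B Q5_K_M 2026-01-19 64.7
""".splitlines()


def parse_row(row: str) -> dict:
    """Split one whitespace-separated row into typed fields (layout assumed)."""
    rank, model, params, quant, published, score = row.split()
    return {
        "rank": int(rank),
        "model": model,
        "params_b": float(params.rstrip("B")),
        "quant": quant,
        "published": published,
        "score": float(score),
    }


picks = [parse_row(r) for r in rows]
# Gap between the #1 and #2 picks, rounded to one decimal like the table.
gap = round(picks[0]["score"] - picks[1]["score"], 1)
print(f"{picks[0]['model']} leads {picks[1]['model']} by {gap} points")
```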

What also stood out: the tool attached a speed caution to the top 3 picks, marking their speed estimates as low-confidence. That’s the kind of honest signal I want from a recommendation engine, not just “here’s the biggest model.”

Simulating GPU Upgrades: RTX 4070 vs 4090 vs 5090

Now here’s where whichllm gets really useful. The --gpu flag lets you simulate any GPU before you buy it:

whichllm --gpu "RTX 4090"
whichllm --gpu "RTX 5090"

I ran this across three hypothetical GPU setups and my current machine. Here’s the comparison table:

GPU               VRAM    Top Pick     Quant   Score  Est. tok/s
UHD Graphics 630  Shared  Qwen3.6-27B  Q6_K    78.3   ~5
RTX 4070          12 GB   Qwen3-14B    Q5_K_M  75.1   ~20
RTX 4090          24 GB   Qwen3.6-27B  Q5_K_M  92.4   ~27
RTX 5090          32 GB   Qwen3.6-27B  Q6_K    94.3   ~40

A few things jumped out:

On the RTX 4070 (12 GB), the top pick shifts to Qwen3-14B in Q5_K_M, scoring 75.1. That’s a solid daily driver for coding and chat. Within 12 GB, the 14B gives better speed and a smoother experience than squeezing in a larger model.

On the RTX 4090 (24 GB), things get interesting. Qwen3.6-27B in Q5_K_M scores 92.4 at ~27 tok/s. Compared with the 4070, that’s 17.3 more quality points and ~35% faster token generation.

On the RTX 5090 (32 GB), the best pick stays the same model (Qwen3.6-27B) but shifts to the Q6_K quant for 94.3 quality and ~40 tok/s. The upgrade command validated this:

whichllm upgrade "RTX 4090" "RTX 5090"
# Verdict: worth it (≥12pt Q & ≥10 tok/s lift)

Going from 4090 to 5090 is genuinely worth it: that 32 GB of VRAM lets you push higher quants and bigger context windows.
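For a quick sanity check on any upgrade path, the deltas can be computed straight from the comparison table. This sketch only uses the top-pick rows above; the upgrade command may well compare something different, so these numbers are not expected to reproduce its verdict. The function name and dictionary layout are my own.

```python
# Top-pick rows from the comparison table (score, estimated tok/s).
TABLE = {
    "RTX 4070": {"score": 75.1, "tok_s": 20},
    "RTX 4090": {"score": 92.4, "tok_s": 27},
    "RTX 5090": {"score": 94.3, "tok_s": 40},
}


def upgrade_delta(old: str, new: str) -> dict:
    """Quality and speed change between the top picks of two GPUs."""
    a, b = TABLE[old], TABLE[new]
    return {
        "quality_pts": round(b["score"] - a["score"], 1),
        "tok_s_gain": b["tok_s"] - a["tok_s"],
        "speedup_pct": round(100 * (b["tok_s"] / a["tok_s"] - 1)),
    }


step_1 = upgrade_delta("RTX 4070", "RTX 4090")
step_2 = upgrade_delta("RTX 4090", "RTX 5090")
print("4070 -> 4090:", step_1)
print("4090 -> 5090:", step_2)
```

Seeing quality points and tok/s side by side makes it easier to tell whether an upgrade mostly buys quality or mostly buys speed.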

The Benchmark Engine — Why I Trust It More Than Random Reddit Recs

whichllm’s scoring isn’t a black box. It merges:

  • LiveBench: objective, contamination-avoiding benchmarks
  • Artificial Analysis: real-world inference speed data
  • Chatbot Arena Elo: human preference rankings (how actual users rate outputs)
  • Aider: code-editing benchmarks (scored by whether the edited code passes its tests)
  • Open LLM Leaderboard V2: standardized evaluation suite

Each score is weighted, and older benchmarks decay in influence, so a model that topped the leaderboard 6 months ago doesn’t get equal weight with something fresh. That time-weighting alone fixes a huge blind spot in most recommendation tools.
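One way to picture time-weighted merging is an exponential decay applied to each source’s weight. This is a sketch of the general idea only: the 180-day half-life, the per-source weights, and the function names are assumptions of mine, not whichllm’s actual formula.

```python
from datetime import date, timedelta


def decay(age_days: float, half_life_days: float = 180.0) -> float:
    """Weight multiplier that halves every half_life_days (assumed value)."""
    return 0.5 ** (age_days / half_life_days)


def merged_score(results, today, source_weights):
    """Weighted average of (source, score, measured_on) tuples.

    Each result's weight is its source weight times a recency decay.
    Returns None when there is nothing to average.
    """
    num = den = 0.0
    for source, score, measured_on in results:
        w = source_weights[source] * decay((today - measured_on).days)
        num += w * score
        den += w
    return num / den if den else None


today = date(2026, 6, 1)
weights = {"bench_a": 1.0, "bench_b": 1.0}  # hypothetical sources, equal weight
results = [
    ("bench_a", 80.0, today),                        # fresh result
    ("bench_b", 50.0, today - timedelta(days=180)),  # one half-life old
]
score = merged_score(results, today, weights)
print(score)
```

With equal source weights, the older result counts half as much as the fresh one, so the merged score lands closer to the fresh number than a plain average would.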

One thing I wish it did: show the individual benchmark breakdowns per model in the default view. You only get an aggregate score, and I’d love to see “this model kills it on coding tasks but is weak on reasoning” at a glance.

Quick Chat: whichllm run

The tool also has a one-shot chat command:

whichllm run "qwen 2.5 1.5b gguf"

It downloads the model and starts a conversation right in your terminal, which is handy for quick tests. I wouldn’t use it as a daily chat interface (Ollama is better for that), but as a “try before you commit” option, it works.

Limitations — What whichllm Doesn’t Do Well

Let me be straight about where this tool falls short.

No GPU benchmark data on its own. whichllm doesn’t benchmark your hardware. The token-per-second estimates are inferred from model size and GPU specs, not measured on your actual machine. A real benchmark run (like llama-bench) would give more accurate speed data.
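To see why an inferred number can drift from reality, here is a common rule of thumb for decode speed, sketched under stated assumptions: generation is treated as memory-bandwidth-bound, so each token costs roughly one full read of the weights. I don’t know whether whichllm uses this formula; the 0.5 efficiency factor is a guess, and the 19.8 GB model size is an illustrative estimate.

```python
def est_tok_s(model_gb: float, bandwidth_gb_s: float, efficiency: float = 0.5) -> float:
    """Rough decode speed for a bandwidth-bound model.

    Each generated token reads all weights once, so throughput is capped
    near bandwidth / model size; efficiency scales that ceiling down.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return efficiency * bandwidth_gb_s / model_gb


# Illustrative inputs: ~19.8 GB of weights (about 27.8B params at ~5.7 bits each)
# and 1008 GB/s, the published memory bandwidth of the RTX 4090.
ceiling = est_tok_s(19.8, 1008, efficiency=1.0)
realistic = est_tok_s(19.8, 1008)  # efficiency=0.5 is a guess, not a measurement
print(f"ceiling ~{ceiling:.0f} tok/s, at 50% efficiency ~{realistic:.0f} tok/s")
```

The real efficiency depends on context length, drivers, and the runtime, which is exactly what a measured run with llama-bench captures and an estimate cannot.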

Weak offline mode. The benchmark data isn’t cached locally (yet), so when you’re offline the tool drops to a fallback mode that works but with reduced accuracy.

Not a model runner. It recommends models and can start a chat, but you’ll still want Ollama or LM Studio for day-to-day use. Think of it as a pre-purchase advisor and catalog browser, not a runtime.

Pair it with a memory layer like Mnemo and your model keeps context across sessions too.

Is It Worth Using?

Here’s my honest take.

Use it if: You’re shopping for a GPU and want to know what models it can actually run well. Or you have existing hardware and feel like you’re missing out on better models.

Skip it if: You already know your setup and have a model you’re happy with.

For GPU shopping, though, whichllm saved me hours of cross-referencing VRAM sizes against HuggingFace model cards. I’d call that a win.

Quick Comparison: whichllm vs Alternatives

Feature whichllm Ollama Search HuggingFace Models
Hardware auto-detection
Multi-benchmark scoring
Pre-purchase GPU simulation
Time-weighted scores
One-click chat
JSON output for scripting

Final Verdict

whichllm isn’t trying to replace Ollama or LM Studio. It’s solving a different problem: the “what should I run” question that everyone in the local LLM space hits.

At 3.5k GitHub stars and climbing (Trending #10 today), it’s early but actively maintained. I’ll be keeping it installed for the next time I’m GPU shopping.

If you want to dig deeper into the local AI tool ecosystem, check out my Headroom review — another tool that changes how you think about local LLM deployment.


Disclosure: Some of the links below are affiliate links. If you purchase through them, I earn a small commission at no extra cost to you. All testing and opinions are my own.


Last tested: June 2026. whichllm v0.5.8 on Windows via uvx. Benchmark data sourced from LiveBench, Chatbot Arena, and Open LLM Leaderboard. Scores are based on current benchmarks and may change — always verify performance for your specific hardware.