<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Tutorials on ToolGenix — AI Tools Discovery &amp; Reviews</title>
    <link>https://toolgenix.nxtniche.com/categories/tutorials/</link>
    <description>Recent content in Tutorials on ToolGenix — AI Tools Discovery &amp; Reviews</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Tue, 09 Jun 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://toolgenix.nxtniche.com/categories/tutorials/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>whichllm Review: Best Local LLM for Your GPU (2026)</title>
      <link>https://toolgenix.nxtniche.com/posts/whichllm-review-2026/</link>
      <pubDate>Tue, 09 Jun 2026 00:00:00 +0000</pubDate>
      <guid>https://toolgenix.nxtniche.com/posts/whichllm-review-2026/</guid>
      <description>I tested whichllm across 4 GPU scenarios — here&amp;#39;s how LiveBench, Chatbot Arena, and real benchmarks find the best local LLM for your specific hardware setup.</description>
      <content:encoded><![CDATA[<p>You&rsquo;ve got a local LLM setup — Ollama, LM Studio, whatever. Now which model do you actually run?</p>
<p>That&rsquo;s the question nobody&rsquo;s really answering well. HuggingFace shows you download counts. Ollama search tells you what fits in VRAM. But &ldquo;fits&rdquo; and &ldquo;best&rdquo; are two very different things. I&rsquo;ve spent way too many afternoons downloading model after model, testing them one by one, only to wonder if there&rsquo;s something better I missed.</p>
<p>So when whichllm hit GitHub Trending at #10 with 3.5k stars, I paid attention. The pitch: a CLI tool that detects your hardware, pulls real benchmark data from LiveBench, Chatbot Arena, Aider, and the Open LLM Leaderboard, and tells you — not what <em>can</em> run — but what&rsquo;s <em>actually the best</em> for your machine.</p>
<p>I installed it and ran it across four GPU configurations (my actual machine, plus simulated RTX 4070 / 4090 / 5090). Here&rsquo;s what I found.</p>
<h2 id="what-whichllm-actually-does">What whichllm Actually Does</h2>
<p>whichllm is a Python CLI that does three things:</p>
<ol>
<li><strong>Detects your hardware</strong> — GPU model, VRAM, CPU cores, system RAM, disk space</li>
<li><strong>Pulls live benchmark data</strong> — merges scores from LiveBench, Artificial Analysis, Chatbot Arena ELO, Aider, and Open LLM Leaderboard</li>
<li><strong>Recommends models</strong> — ranks them by a weighted score that accounts for benchmark quality, recency (confidence decay for older models), and VRAM estimates</li>
</ol>
<p>The key insight: it&rsquo;s evidence-ranked, not capacity-ranked. Ollama tells you &ldquo;a 7B model fits in 8GB VRAM,&rdquo; which is technically true but not much help when choosing: Qwen3-8B and Gemma-3-12B may both fit, yet they perform very differently in practice. whichllm tells you which one actually scores higher on current benchmarks.</p>
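<p>To make the &ldquo;fits&rdquo; half of that concrete, here is a rough back-of-the-envelope sketch of a capacity-only check. The bits-per-weight figures and the flat 1.2x overhead are my own ballpark assumptions, not values taken from whichllm, and the function names are made up for illustration.</p>

```python
# Approximate bits per weight for common GGUF quants (ballpark assumptions).
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}


def estimate_vram_gb(params_billion: float, quant: str, overhead: float = 1.2) -> float:
    """Weights in GB, padded by a flat factor for KV cache and buffers (assumed)."""
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return weights_gb * overhead


def fits(params_billion: float, quant: str, vram_gb: float) -> bool:
    """Capacity-only check: it says nothing about how good the model is."""
    return vram_gb >= estimate_vram_gb(params_billion, quant)


# Example: a 27.8B-parameter model at two quants on 24 GB and 32 GB cards.
for quant, vram in [("Q5_K_M", 24), ("Q6_K", 24), ("Q6_K", 32)]:
    print(quant, vram, "GB:", fits(27.8, quant, vram))
```

<p>A check like this only answers &ldquo;can it load?&rdquo;. Ranking the models that pass is the part that needs benchmark evidence.</p>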
<h2 id="hands-on-running-whichllm-on-my-machine">Hands-On: Running whichllm on My Machine</h2>
<p>Installation is the fastest I&rsquo;ve seen for a Python CLI this year:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>uvx whichllm@latest
</span></span></code></pre></div><p>That&rsquo;s it. No <code>pip install</code>, no virtual env, no dependency hell. <code>uvx</code> downloads and runs it in one shot. Here&rsquo;s what landed on my screen:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">#</th>
					<th style="text-align: left">Model</th>
					<th style="text-align: center">Params</th>
					<th style="text-align: center">Quant</th>
					<th style="text-align: center">Published</th>
					<th style="text-align: center">Score</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center">1</td>
					<td style="text-align: left">Qwen/Qwen3.6-27B</td>
					<td style="text-align: center">27.8B</td>
					<td style="text-align: center">Q6_K</td>
					<td style="text-align: center">2026-04-21</td>
					<td style="text-align: center">78.3</td>
			</tr>
			<tr>
					<td style="text-align: center">2</td>
					<td style="text-align: left">google/gemma-4-31B-it</td>
					<td style="text-align: center">32.7B</td>
					<td style="text-align: center">Q4_K_M</td>
					<td style="text-align: center">2026-03-11</td>
					<td style="text-align: center">73.5</td>
			</tr>
			<tr>
					<td style="text-align: center">3</td>
					<td style="text-align: left">Qwen/Qwen3-30B-A3B</td>
					<td style="text-align: center">30.5B</td>
					<td style="text-align: center">Q6_K</td>
					<td style="text-align: center">2025-04-27</td>
					<td style="text-align: center">67.6</td>
			</tr>
			<tr>
					<td style="text-align: center">4</td>
					<td style="text-align: left">google/gemma-4-26B-A4B-it</td>
					<td style="text-align: center">26.5B</td>
					<td style="text-align: center">Q6_K</td>
					<td style="text-align: center">2026-03-11</td>
					<td style="text-align: center">65.7</td>
			</tr>
			<tr>
					<td style="text-align: center">5</td>
					<td style="text-align: left">zai-org/GLM-4.7-Flash</td>
					<td style="text-align: center">31.2B</td>
					<td style="text-align: center">Q5_K_M</td>
					<td style="text-align: center">2026-01-19</td>
					<td style="text-align: center">64.7</td>
			</tr>
	</tbody>
</table>
<p>My machine runs on integrated UHD Graphics 630 with shared memory, so not exactly a powerhouse. The tool correctly detected my hardware constraints and recommended models that&rsquo;d work within them. The #1 pick, Qwen3.6-27B in Q6_K, scored well ahead of the next option (a 4.8-point gap, so high confidence).</p>
<p>The tool also flagged a speed caution for the top 3 picks, marking their speed estimates as low-confidence. That&rsquo;s the kind of honest signal I want from a recommendation engine, not just &ldquo;here&rsquo;s the biggest model.&rdquo;</p>
<h2 id="simulating-gpu-upgrades-rtx-4070-vs-4090-vs-5090">Simulating GPU Upgrades: RTX 4070 vs 4090 vs 5090</h2>
<p>Now here&rsquo;s where whichllm gets really useful. The <code>--gpu</code> flag lets you simulate any GPU before you buy it:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>whichllm --gpu <span style="color:#e6db74">&#34;RTX 4090&#34;</span>
</span></span><span style="display:flex;"><span>whichllm --gpu <span style="color:#e6db74">&#34;RTX 5090&#34;</span>
</span></span></code></pre></div><p>I ran this across three hypothetical GPU setups and my current machine. Here&rsquo;s the comparison table:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: left">GPU</th>
					<th style="text-align: center">VRAM</th>
					<th style="text-align: left">Top Pick</th>
					<th style="text-align: center">Quant</th>
					<th style="text-align: center">Score</th>
					<th style="text-align: center">Est. tok/s</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: left">UHD Graphics 630</td>
					<td style="text-align: center">Shared</td>
					<td style="text-align: left">Qwen3.6-27B</td>
					<td style="text-align: center">Q6_K</td>
					<td style="text-align: center">78.3</td>
					<td style="text-align: center">~5</td>
			</tr>
			<tr>
					<td style="text-align: left">RTX 4070</td>
					<td style="text-align: center">12 GB</td>
					<td style="text-align: left">Qwen3-14B</td>
					<td style="text-align: center">Q5_K_M</td>
					<td style="text-align: center">75.1</td>
					<td style="text-align: center">~20</td>
			</tr>
			<tr>
					<td style="text-align: left">RTX 4090</td>
					<td style="text-align: center">24 GB</td>
					<td style="text-align: left">Qwen3.6-27B</td>
					<td style="text-align: center">Q5_K_M</td>
					<td style="text-align: center">92.4</td>
					<td style="text-align: center">~27</td>
			</tr>
			<tr>
					<td style="text-align: left">RTX 5090</td>
					<td style="text-align: center">32 GB</td>
					<td style="text-align: left">Qwen3.6-27B</td>
					<td style="text-align: center">Q6_K</td>
					<td style="text-align: center">94.3</td>
					<td style="text-align: center">~40</td>
			</tr>
	</tbody>
</table>
<p>A few things jumped out:</p>
<p>On the <strong>RTX 4070 (12 GB)</strong>, the top pick shifts to Qwen3-14B in Q5_K_M, scoring 75.1. That&rsquo;s a solid daily driver for coding and chat. The 14B gives better speed and a smoother experience at this VRAM size.</p>
<p>The <strong>RTX 4090 (24 GB)</strong> is where things get interesting. Qwen3.6-27B in Q5_K_M scores 92.4 at ~27 tok/s. Compared with the 4070, that is a 17.3-point quality gain and roughly 35% faster token generation.</p>
<p>On the <strong>RTX 5090 (32 GB)</strong>, the best pick stays the same model (Qwen3.6-27B) but shifts to the Q6_K quant, for a score of 94.3 at ~40 tok/s. The <code>upgrade</code> command validated this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>whichllm upgrade <span style="color:#e6db74">&#34;RTX 4090&#34;</span> <span style="color:#e6db74">&#34;RTX 5090&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Verdict: worth it (≥12pt Q &amp; ≥10 tok/s lift)</span>
</span></span></code></pre></div><p>Going from the 4090 to the 5090 is genuinely worth it: that 32 GB of VRAM lets you push higher quants and bigger context windows.</p>
<h2 id="the-benchmark-engine--why-i-trust-it-more-than-random-reddit-recs">The Benchmark Engine — Why I Trust It More Than Random Reddit Recs</h2>
<p>whichllm&rsquo;s scoring isn&rsquo;t a black box. It merges:</p>
<ul>
<li><strong>LiveBench</strong> — objective, contamination-avoiding benchmarks</li>
<li><strong>Artificial Analysis</strong> — real-world inference speed data</li>
<li><strong>Chatbot Arena ELO</strong> — human preference rankings (how actual users rate outputs)</li>
<li><strong>Aider</strong> — code-editing benchmarks (scored by whether the edited code passes its unit tests)</li>
<li><strong>Open LLM Leaderboard V2</strong> — standardized evaluation suite</li>
</ul>
<p>Each score is weighted, and older benchmarks decay in influence, so a model that topped the leaderboard 6 months ago doesn&rsquo;t get equal weight with something fresh. That time-weighting alone fixes a huge blind spot in most recommendation tools.</p>
<p><strong>One thing I wish it did:</strong> show the individual benchmark breakdowns per model in the default view. Right now you get an aggregate score, but I&rsquo;d love to see &ldquo;this model kills it on coding tasks but is weak on reasoning&rdquo; at a glance.</p>
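<p>For a feel of how time-weighting can work, here is a minimal sketch of a weighted average where each benchmark entry loses influence as it ages. This is my own illustration, not whichllm&rsquo;s actual formula: the exponential decay, the 180-day half-life, the equal base weights, and the function names are all assumptions.</p>

```python
from datetime import date


def decayed_weight(base_weight: float, published: date, today: date,
                   half_life_days: float = 180.0) -> float:
    """Halve an entry's weight for every half_life_days of age (assumed scheme)."""
    age_days = max((today - published).days, 0)
    return base_weight * 0.5 ** (age_days / half_life_days)


def merged_score(entries, today: date) -> float:
    """entries: list of (score, base_weight, published_date) tuples."""
    weights = [decayed_weight(w, d, today) for _, w, d in entries]
    total = sum(weights)
    if total == 0:
        raise ValueError("no usable benchmark entries")
    return sum(s * w for (s, _, _), w in zip(entries, weights)) / total


# Hypothetical entries: a fresh score of 80 and a 180-day-old score of 60.
today = date(2026, 6, 9)
entries = [(80.0, 1.0, date(2026, 6, 9)), (60.0, 1.0, date(2025, 12, 11))]
print(round(merged_score(entries, today), 1))
```

<p>With equal base weights, the older result counts half as much as the fresh one, so the merged score lands closer to 80 than a plain average would.</p>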
<h2 id="quick-chat-whichllm-run">Quick Chat: <code>whichllm run</code></h2>
<p>The tool also has a one-shot chat command:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>whichllm run <span style="color:#e6db74">&#34;qwen 2.5 1.5b gguf&#34;</span>
</span></span></code></pre></div><p>It downloads the model and starts a conversation right in your terminal, which is handy for quick tests. I wouldn&rsquo;t use it as a daily chat interface (Ollama is better for that), but as a &ldquo;try before you commit&rdquo; option, it works.</p>
<h2 id="limitations--what-whichllm-doesnt-do-well">Limitations — What whichllm Doesn&rsquo;t Do Well</h2>
<p>Let me be straight about where this tool falls short.</p>
<p><strong>No GPU benchmark data on its own.</strong> whichllm doesn&rsquo;t benchmark <em>your</em> hardware. The token-per-second estimates are inferred from model size and GPU specs, not measured on your actual machine. A real benchmark run (like <code>llama-bench</code>) would give more accurate speed data.</p>
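<p>To show what an estimate &ldquo;inferred from model size and GPU specs&rdquo; can look like, here is a minimal sketch built on the common rule of thumb that token generation is limited by memory bandwidth. I don&rsquo;t know whichllm&rsquo;s actual formula; the function name, the 0.55 efficiency factor, and the example inputs are my assumptions.</p>

```python
def estimate_tok_per_s(bandwidth_gb_s: float, model_size_gb: float,
                       efficiency: float = 0.55) -> float:
    """Rough decode speed: each generated token reads all weights once (dense model)."""
    if not model_size_gb > 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb * efficiency


# Example inputs: about 1008 GB/s (the RTX 4090's listed memory bandwidth)
# and a model file of roughly 19.8 GB (an assumed size for a 27B model at Q5_K_M).
print(round(estimate_tok_per_s(1008, 19.8)))
```

<p>An estimate like this ignores context length, offloading, and mixture-of-experts models, which is exactly why a measured run with <code>llama-bench</code> is more trustworthy.</p>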
<p><strong>Weak offline mode.</strong> The benchmark data isn&rsquo;t cached locally (yet), so if you&rsquo;re offline the tool falls back to a mode that works but with reduced accuracy.</p>
<p><strong>Not a model runner.</strong> It recommends models and can start a chat, but you&rsquo;ll still want Ollama or LM Studio for day-to-day use. Think of it as a pre-purchase advisor and catalog browser, not a runtime.</p>
<p>Pair it with a memory layer like <a href="/posts/mnemo-ai-memory-layer-rust-review/">Mnemo</a> and your model keeps context across sessions too.</p>
<h2 id="is-it-worth-using">Is It Worth Using?</h2>
<p>Here&rsquo;s my honest take.</p>
<p><strong>Use it if:</strong> You&rsquo;re shopping for a GPU and want to know what models it can actually run well. Or you have existing hardware and feel like you&rsquo;re missing out on better models.</p>
<p><strong>Skip it if:</strong> You already know your setup and have a model you&rsquo;re happy with.</p>
<p>For GPU shopping, though, whichllm saved me hours of cross-referencing VRAM sizes against HuggingFace model cards. I&rsquo;d call that a win.</p>
<h2 id="quick-comparison-whichllm-vs-alternatives">Quick Comparison: whichllm vs Alternatives</h2>
<table>
	<thead>
			<tr>
					<th style="text-align: left">Feature</th>
					<th style="text-align: center">whichllm</th>
					<th style="text-align: center">Ollama Search</th>
					<th style="text-align: center">HuggingFace Models</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: left">Hardware auto-detection</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
			</tr>
			<tr>
					<td style="text-align: left">Multi-benchmark scoring</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
			</tr>
			<tr>
					<td style="text-align: left">Pre-purchase GPU simulation</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
			</tr>
			<tr>
					<td style="text-align: left">Time-weighted scores</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
			</tr>
			<tr>
					<td style="text-align: left">One-click chat</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">❌</td>
			</tr>
			<tr>
					<td style="text-align: left">JSON output for scripting</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
			</tr>
	</tbody>
</table>
<h2 id="final-verdict">Final Verdict</h2>
<p>whichllm isn&rsquo;t trying to replace Ollama or LM Studio. It&rsquo;s solving a different problem: the &ldquo;what should I run&rdquo; question that everyone in the local LLM space hits.</p>
<p>At 3.5k GitHub stars and climbing (Trending #10 today), it&rsquo;s early but actively maintained. I&rsquo;ll be keeping it installed for the next time I&rsquo;m GPU shopping.</p>
<p>If you want to dig deeper into the local AI tool ecosystem, check out my <a href="/posts/headroom-quick-review-2026/">Headroom review</a> — another tool that changes how you think about local LLM deployment.</p>
<hr>
<h2 id="-recommended-resources">💡 Recommended Resources</h2>
<!-- BEGIN AFFILIATE LINKS (generated by ads-center for ToolGenix) -->
<p><em>Disclosure: Some of the links below are affiliate links. If you purchase through them, I earn a small commission at no extra cost to you. All testing and opinions are my own.</em></p>
<p><strong>Shopping for a new GPU to run local LLMs?</strong></p>
<ul>
<li><p><strong>NVIDIA GeForce RTX 4090 (24 GB VRAM)</strong> — Top-tier consumer card for 27B+ models. Run it at Q5_K_M for ~27 tok/s:<br>
<a href="https://toolgenix.nxtniche.com/go/amazon/B0BJFRT43X" rel="nofollow sponsored" target="_blank">→ RTX 4090 on Amazon (check current price)</a></p></li>
<li><p><strong>NVIDIA GeForce RTX 5090 (32 GB VRAM)</strong> — Next-gen flagship. Higher quants, bigger context windows, ~40 tok/s on 27B models:<br>
<a href="https://toolgenix.nxtniche.com/go/amazon/B0DT7GBNWQ" rel="nofollow sponsored" target="_blank">→ RTX 5090 on Amazon (check current price)</a></p></li>
<li><p><strong>NVIDIA GeForce RTX 4070 (12 GB VRAM)</strong> — Solid mid-range for 7B-14B models. Practical daily driver for most users:<br>
<a href="https://toolgenix.nxtniche.com/go/amazon/B0C3SPXZJ8" rel="nofollow sponsored" target="_blank">→ RTX 4070 on Amazon (check current price)</a></p></li>
</ul>
<p><strong>Already have a GPU but want cloud compute for bigger models?</strong></p>
<ul>
<li><p>Vultr Cloud GPU instances — Rent hourly GPU capacity when your local hardware isn't enough. No long-term commitment:<br>
<a href="https://toolgenix.nxtniche.com/go/vultr" rel="nofollow sponsored" target="_blank">→ Vultr Cloud GPU (get $50-100 credit)</a></p></li>
</ul>
<!-- END AFFILIATE LINKS -->
<hr>
<p><em>Last tested: June 2026. whichllm v0.5.8 on Windows via uvx. Benchmark data sourced from LiveBench, Chatbot Arena, and Open LLM Leaderboard. Scores are based on current benchmarks and may change — always verify performance for your specific hardware.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
