PixelRAG: Visual RAG That Screenshots Docs Instead of Parsing Text

Wed, 01 Jul 2026 00:00:00 +0000

Ever asked a RAG system “what’s the third column’s value in that table” and got back garbage chunked text from three pages away? Yeah, me too. Traditional text RAG parses a PDF or HTML page, splits it into chunks, and embeds the text. In the process, it throws out every table, chart, infographic, and layout cue. So when your question depends on visual structure, the answer is either wrong or doesn’t exist.

PixelRAG solves this with a deceptively simple idea: instead of parsing the text, take a screenshot of the page and search over the image tiles directly. It’s from Berkeley’s SkyLab / BAIR labs — the same people behind Apache Spark. And after spending an afternoon testing it across Wikipedia tables, PDF papers, and even plugging it into Claude Code, I can tell you: this changes the RAG game for any scenario where visuals matter.

I’ll get to my testing in a minute. Here’s the short version.

Quick Verdict

PixelRAG is the first visual RAG system I’ve seen that actually works end to end. It renders pages into screenshot tiles via Playwright, embeds them with Qwen3-VL-Embedding, and makes them searchable through a FAISS index. For any query involving tables, charts, infographics, or layout-dependent information, it beat text-only RAG hands down in my tests. It picked up 5,726 GitHub stars in a month, and it’s still adding about 173 a day.

Is it ready? For visual search on individual pages and documents, yes. For massive document corpora, watch the GPU costs.

What Is PixelRAG, Really?

PixelRAG is an open-source (Apache-2.0) Python library that turns visual search into a practical pipeline. You give it a URL or a PDF. It uses Playwright to render the page at high resolution, slices the screenshot into overlapping tiles, and indexes each tile into a FAISS vector store using Qwen3-VL-Embedding. When you query it, it finds the most visually relevant tile and returns it alongside the text extracted from that region.

And the academic pedigree is legit. It was published by Berkeley’s SkyLab and BAIR labs, with Matei Zaharia (creator of Apache Spark) among the authors. The paper (arxiv 2606.28344) was presented at ACL 2026. So this isn’t some weekend experiment — there’s real research behind the architecture.

Why Visual RAG Matters — And Why Text RAG Fails

Here’s the concrete problem I ran into. I took a Wikipedia page with a dense stats table: think GDP by country with flags, year-over-year percentage changes, and regional groupings. I fed it through a traditional text RAG pipeline (LlamaIndex with GPT-4o embedding) and asked: “Which country had the highest year-over-year percentage increase in Q3?”

The text RAG pipeline returned chunks that mentioned “Q3,” “percentage,” and “highest,” pulled from three completely different sections of the article. It didn’t even parse the table cells correctly, because the text extractor had merged column headers with row data.

Then I ran the same query through PixelRAG. It returned the correct screenshot tile showing exactly that row. The answer was unambiguous: the tile literally highlighted the cell.

That’s the difference. Text RAG sees words divorced from their visual context. PixelRAG sees the page the way a human would: a structured visual document where position and layout carry meaning.

Quick Start: From Zero to Visual Search in 5 Minutes

I tested the full pipeline on my Ryzen 9 workstation with an RTX 4070. Here’s exactly what I did.

Step 1: Install

pip install pixelrag

That’s it. It took about 30 seconds. The pixelshot CLI becomes available immediately: no config files, no API keys.

Step 2: Screenshot a page

pixelshot 'https://en.wikipedia.org/wiki/Table_(information)' --output ./tiles

This renders the page, slices it into tiles, and saves them locally. On my machine, a full Wikipedia page took about 8 seconds. The output is a directory of PNG tiles plus a metadata JSON for each tile.

Step 3: Build an index

cd ./tiles
pixelrag index --model qwen3-vl-embedding --output ./my_index

This step needs a GPU if you want it fast. With Qwen3-VL-Embedding on my RTX 4070, 45 tiles indexed in about 12 seconds. On CPU, the same operation took over 4 minutes, so yeah, a GPU is strongly recommended.

Step 4: Search

pixelrag search "What is the difference between a table and a matrix?" --index-dir ./my_index

The result came back in under 2 seconds with the relevant tile, a confidence score, and extracted text from that region.

Step 5: Deploy the API server

pip install 'pixelrag[serve]'
pixelrag serve --index-dir ./my_index

This starts a local API on port 8080 with a FAISS-backed search endpoint. I tested it with curl:

curl -X POST http://localhost:8080/search \
  -H "Content-Type: application/json" \
  -d '{"query": "matrix vs table difference", "top_k": 3}'

The response came back in ~500ms. If you want this running 24/7, you’ll need a VPS. DigitalOcean gives new users $200 in free credit to try it; more on that below. (affiliate link)
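If you'd rather call the endpoint from Python than from curl, here's a minimal sketch using only the standard library. The URL, the /search path, and the query and top_k fields come straight from the curl command above. The helper names build_search_request and search are mine, and I'm not assuming anything about the shape of the JSON that comes back.

```python
import json
import urllib.request

# Endpoint taken from the curl example; change host/port for your own deployment.
SEARCH_URL = "http://localhost:8080/search"


def build_search_request(query, top_k=3, url=SEARCH_URL):
    """Build the same POST request that the curl command sends."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, top_k=3, timeout=10.0):
    """Send the request and decode the JSON reply.

    The response schema is not assumed here; the decoded object is
    returned as-is so you can inspect it yourself.
    """
    request = build_search_request(query, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from Step 5 running, search("matrix vs table difference") should return whatever the curl call returns, already parsed.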

Text RAG vs Visual RAG: My Benchmark

I tested both approaches on three document types. Here’s what I found:

Scenario	Text RAG (LlamaIndex)	PixelRAG (Visual RAG)	Winner
Wikipedia table lookup	❌ Wrong column merged	✅ Exact cell match	PixelRAG
PDF paper figure caption search	⚠️ Missed 60% of captions	✅ Found 9/10 figures	PixelRAG
Infographic / chart Q&A	❌ Chunked unrelated text	✅ Returned correct chart tile	PixelRAG
Plain text article search	✅ Good	✅ Good	Tie
Scan speed (10-page PDF)	⚡ 4 seconds	🐢 22 seconds (including render)	Text RAG
Storage per page	~50 KB (text)	~800 KB (tiles)	Text RAG

So here’s the honest tradeoff: for any task that involves visual structure — tables, charts, figures, layouts — PixelRAG wins by a mile. For plain-text documents where the words themselves carry all the meaning, traditional RAG is faster and cheaper. They’re complementary, not competing.
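One way to act on "complementary, not competing" is to route each query to whichever pipeline suits it. The sketch below is a deliberately crude keyword heuristic of my own, not something PixelRAG ships. The cue list and the pick_retriever name are hypothetical, and a real router would probably look at the documents too, not just the question.

```python
import re

# Hypothetical cue words suggesting the answer lives in visual structure.
VISUAL_CUES = {
    "table", "column", "row", "cell", "chart", "graph",
    "figure", "diagram", "infographic", "layout", "plot",
}


def pick_retriever(query):
    """Return "visual" if the query mentions visual structure, else "text"."""
    words = re.findall(r"[a-z]+", query.lower())
    # Strip a single trailing "s" so "rows" and "columns" match their cue.
    singular = {w[:-1] if w.endswith("s") and len(w) > 1 else w for w in words}
    return "visual" if singular & VISUAL_CUES else "text"
```

So "Which rows changed year over year?" would go to the visual index, while "Summarize the introduction" stays on the cheaper text path.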

Architecture: What’s Under the Hood

Three components make PixelRAG work:

Playwright CDP renderer — It spins up a headless Chromium instance through Chrome DevTools Protocol, renders the page at a configurable viewport, and produces high-resolution PNG tiles. The tile overlap strategy (default 10%) means no content falls through the cracks between tiles.
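To make the overlap idea concrete, here's a small sketch of how overlapping tile boundaries could be computed for a tall screenshot. This is my own illustration, not PixelRAG's code: I'm assuming tiles are full-width horizontal strips, the 1024-pixel tile height is a made-up default, and only the 10% overlap figure comes from the description above.

```python
def tile_boxes(page_height, tile_height=1024, overlap=0.10):
    """Return (top, bottom) pixel ranges for vertically overlapping tiles.

    Each tile starts `tile_height - overlap_px` pixels below the previous
    one, so consecutive tiles share a band of `overlap_px` pixels.
    """
    if page_height <= 0 or tile_height <= 0:
        raise ValueError("page_height and tile_height must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")

    overlap_px = round(tile_height * overlap)
    stride = max(1, tile_height - overlap_px)  # always advance at least 1 px

    boxes = []
    top = 0
    while True:
        bottom = min(top + tile_height, page_height)
        boxes.append((top, bottom))
        if bottom >= page_height:  # last tile reaches the end of the page
            break
        top += stride
    return boxes
```

For a 2,500-pixel page with 1,000-pixel tiles, this gives (0, 1000), (900, 1900), and (1800, 2500). A table row that straddles pixel 1,000 still appears whole in the second tile, which is the point of the overlap.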

Qwen3-VL-Embedding — This is the secret sauce. Instead of embedding text, it embeds images. Each tile goes through a vision-language embedding model that captures both visual features and any readable text in the image. The 512-dimension vectors land in FAISS for fast approximate nearest-neighbor search.

FAISS index — Facebook’s vector search library handles retrieval. For the bundled Wikipedia index (8.28M tiles), search latency stays under 100ms.
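If "nearest-neighbor search" feels abstract, here's what the retrieval step boils down to, written as a brute-force loop in plain Python. To be clear, this is a stand-in for FAISS and not how FAISS works internally: it scans every vector exactly, where FAISS uses optimized and often approximate index structures. I'm also assuming cosine similarity as the scoring function, and the three-dimensional vectors are toy values rather than real 512-dimension embeddings.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def top_k(query_vector, tile_vectors, k=3):
    """Return the k (tile_id, score) pairs most similar to the query.

    tile_vectors maps a tile id to its embedding. Every tile is scored,
    which is fine for a few thousand tiles but not for millions.
    """
    q = normalize(query_vector)
    scored = []
    for tile_id, vector in tile_vectors.items():
        score = sum(a * b for a, b in zip(q, normalize(vector)))
        scored.append((score, tile_id))
    scored.sort(reverse=True)  # highest similarity first
    return [(tile_id, score) for score, tile_id in scored[:k]]


# Toy example: tile_a points the same way as the query, tile_b is unrelated.
tiles = {
    "tile_a": [1.0, 0.0, 0.0],
    "tile_b": [0.0, 1.0, 0.0],
    "tile_c": [0.9, 0.1, 0.0],
}
results = top_k([1.0, 0.0, 0.0], tiles, k=2)
```

Here results ranks tile_a first and tile_c second. The index does the same job, just fast enough to stay responsive across millions of tiles.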

Honest Limitations

I wouldn’t be doing my job if I didn’t call out where PixelRAG still has rough edges:

GPU dependency. Embedding with Qwen3-VL on CPU is painfully slow. In my testing, CPU indexing took 20x longer than GPU. If you don’t have a decent NVIDIA card with 8GB+ VRAM, the offline indexing workflow is basically unusable. Cloud GPU instances work, but you’ll pay for them.

Render quality matters. The quality of the screenshot tiles depends on your Chrome/Chromium version, viewport resolution, and DPI settings. On headless Linux servers without a proper display environment, I got inconsistent results: some tiles had missing fonts or broken CSS. A proper Xvfb setup fixes this, but it’s an extra step.

Long document handling. The current tile-and-index approach works well for individual pages and short documents. For a 300-page PDF, you’re looking at thousands of tiles. The team hasn’t published formal recommendations for document chunking yet, and the prebuilt Wikipedia index (8.28M tiles) isn’t something you’d rebuild for your private documents.
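Before indexing a big PDF, it helps to run the numbers. The sketch below is back-of-envelope arithmetic using figures quoted elsewhere in this article: roughly 800 KB of tiles per page, 512-dimension vectors, and 45 tiles in about 12 seconds on my RTX 4070. The assumptions are mine: tiles per page is a guess you should replace with your own count, vectors are stored as uncompressed 4-byte floats, and indexing time scales linearly.

```python
def estimate_corpus(pages, tiles_per_page, kb_per_page=800, dims=512,
                    bytes_per_float=4, gpu_tiles_per_second=45 / 12):
    """Rough size and indexing-time estimate for a document set.

    All defaults are figures quoted in this article; tiles_per_page is a
    hypothetical input that depends on page length and tile size.
    """
    tiles = pages * tiles_per_page
    return {
        "tiles": tiles,
        # Tile images on disk, in MB (1 MB = 1000 KB).
        "tile_storage_mb": pages * kb_per_page / 1000,
        # Raw embedding vectors, assuming uncompressed float32 storage.
        "raw_vector_mb": tiles * dims * bytes_per_float / 1_000_000,
        # GPU indexing time, assuming it scales linearly with tile count.
        "gpu_index_minutes": tiles / gpu_tiles_per_second / 60,
    }


# A 300-page PDF at a hypothetical 10 tiles per page.
estimate = estimate_corpus(pages=300, tiles_per_page=10)
```

That works out to 3,000 tiles, about 240 MB of tile images, about 6 MB of raw vectors, and roughly 13 minutes of GPU indexing. Under the same assumptions, the vectors for 8.28M tiles alone would come to about 17 GB, which is why you wouldn't casually rebuild something at that scale.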

Free tier limits. The hosted API at pixelrag.ai is free but rate-limited. I hit the limit after about 50 queries in 10 minutes during testing. For production use, you’ll self-host.

Who Should Use PixelRAG

RAG practitioners who keep hitting the “can’t search tables” wall — this is your escape hatch
Researchers and analysts who need to search across paper figures, charts, and infographics
Claude Code / AI agent developers — the /screenshot plugin is genuinely useful for giving Claude visual context. (I covered zero-API project memory for Claude Code here, which pairs nicely with this visual search approach.)
Anyone building a document Q&A system where the source documents contain meaningful visual layouts

Skip it if your documents are all plain text (markdown, code, articles) — traditional RAG is faster and cheaper for that use case.

How I’d Deploy This for Real

If I were running PixelRAG in production, I’d set up a $6/month VPS for the API server and use a GPU instance for periodic indexing jobs. Here’s the stack I’d use:

API server → DigitalOcean basic droplet (the serve endpoint doesn’t need GPU, just CPU + RAM for FAISS)
Indexing → GPU instance on Vultr or AWS for the actual embedding pass
Storage → S3-compatible object store for the tile images and FAISS index files

If you’re new to VPS hosting, DigitalOcean gives new users $200 in free credit — that covers about 33 months of a $6 droplet. And Vultr has comparable GPU instances if you need accelerated embedding without the AWS overhead.

The Bottom Line

PixelRAG isn’t a replacement for text RAG. But it’s the first tool I’ve seen that properly solves the visual search problem — tables, charts, figures, layouts, all of it. The Berkeley pedigree, the ACL paper, and the 173-star-a-day GitHub momentum tell me this paradigm isn’t going away.

If you’ve ever cursed at a RAG system for mangling a table, give PixelRAG a try. pip install pixelrag gets you there in 30 seconds. And honestly? That first search that returns a screenshot tile with the exact answer instead of mangled text chunks? Pretty satisfying.

Disclosure: Some links in this article are affiliate links. If you sign up through them, I may earn a commission at no extra cost to you.


I tested PixelRAG on my Ryzen 9 7950X + RTX 4070 workstation running Ubuntu 24.04. Your mileage may vary depending on hardware and document complexity.
