Disclaimer: I'm an AI novice relative to many here. FWIW last wknd I spent a cou...

patrickk · 2026-04-03T06:33:01 1775197981

For those who wish to do OCR on photos, like receipts, or PDFs or anything really, Paperless-NGX works amazingly well and runs on a potato.

polishdude20 · 2026-04-02T20:25:18 1775161518

How do you extract the content? OCR? Pdf to text then feed into qwen?

I tried something similar where I needed a bunch of tables extracted from the pdf over like 40 pages. It was crazy slow on my MacBook and innacurate

philipkglass · 2026-04-02T20:47:24 1775162844

If you have a basic ARM MacBook, GLM-OCR is the best single model I have found for OCR with good table extraction/formatting. It's a compact 0.9b parameter model, so it'll run on systems with only 8 GB of RAM.

https://github.com/zai-org/GLM-OCR

Use mlx-vlm for inference:

https://github.com/zai-org/GLM-OCR/blob/main/examples/mlx-de...

Then you can run a single command to process your PDF:

  glmocr parse example.pdf

  Loading images: example.pdf
  Found 1 file(s)
  Starting Pipeline...
  Pipeline started!
  GLM-OCR initialized in self-hosted mode
  Using Pipeline (enable_layout=true)...

  === Parsing: example.pdf (1/1) ===

My test document contains scanned pages from a law textbook. It's two columns of text with a lot of footnotes. It took 60 seconds to process 5 pages on a MBP with M4 Max chip.

After it's done, you'll have a directory output/example/ that contains .md and .json files. The .md file will contain a markdown rendition of the complete document. The .json file will contain individual labeled regions from the document along with their transcriptions. If you get all the JSON objects with

  "label": "table"

from the JSON file, you can get an HTML-formatted table from each "content" section of these objects.

It might still be inaccurate -- I don't know how challenging your original tables are -- but it shouldn't be terribly slow. The tables it produced for me were good.

I have also built more complex work flows that use a mixture of OCR-specialized models and general purpose VLM models like Qwen 3.5, along with software to coordinate and reconcile operations, but GLM-OCR by itself is the best first thing to try locally.

polishdude20 · 2026-04-02T21:56:39 1775166999

Thanks! Just tried it on a 40 page pdf. Seems to work for single images but the large pdf gives me connection timeouts

philipkglass · 2026-04-02T22:04:54 1775167494

I also get connection timeouts on larger documents, but it automatically retries and completes. All the pages are processed when I'm done. However, I'm using the Python client SDK for larger documents rather than the basic glmocr command line tool. I'm not sure if that makes a difference.

polishdude20 · 2026-04-03T04:52:43 1775191963

Yeah looks like the cli also retries as well. I was able to get it working using a higher timeout.

davidbjaffe · 2026-04-03T14:25:04 1775226304

Cool! For GLM-OCR, do you use "Option 2: Self-host with vLLM / SGLang" and in that case, am I correct that there is no internet connection involved and hence connection timeouts would be avoided entirely?

philipkglass · 2026-04-03T14:54:33 1775228073

When you self-host, there's still a client/server relationship between your self-hosted inference server and the client that manages the processing of individual pages. You can get timeouts depending on the configured timeouts, the speed of your inference server, and the complexity of the pages you're processing. But you can let the client retry and/or raise the initial timeout limit if you keep running into timeouts.

That said, this is already a small and fast model when hosted via MLX on macOS. If you run the inference server with a recent NVidia GPU and vLLM on Linux it should be significantly faster. The big advantage with vLLM for OCR models is its continuous batching capability. Using other OCR models that I couldn't self-host on macOS, like DeepSeek 2 OCR or Chandra 2, vLLM gave dramatic throughput improvements on big documents via continuous batching if I process 8-10 pages at a time. This is with a single 4090 GPU.

chrisweekly · 2026-04-02T21:40:08 1775166008

1. Correction: I'd planned to use Qwen-3.5 but ended up using gemma3:4b.

2. The n8n workflow passes a given binary pdf to gemma, which (based on a detailed prompt) analyzes it and produces JSON output.

See https://github.com/LinkedInLearning/build-with-ai-running-lo... if you want more details. :)