Hacker News | new | past | comments | ask | show | jobs | submit | aazo11's comments

I tested on-device LMs (Gemma, DeepSeek) across prompt cleanup, PII redaction, math, and general knowledge on my M2 Max laptop using LM Studio + DSPy.

Some observations:

- Gemma-3 is the best model for on-device inference.
- 1B models look fine at first but break under benchmarking.
- 4B can handle simple rewriting and PII redaction. It also did math reasoning surprisingly well.
- General knowledge Q&A does not work with a local model. This might work with a RAG pipeline or additional tools.
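To make the benchmarking concrete, here is a hypothetical sketch of how a PII-redaction output could be scored against gold labels. The model call is stubbed; in practice the completion would come from LM Studio's local server, and the metric shown (span-level recall) is just one reasonable choice, not the exact one used above.

```python
# Hypothetical sketch: score a model's PII-redaction output by checking
# what fraction of known-sensitive spans no longer appear in it.

def redaction_recall(predicted: str, gold_spans: list[str]) -> float:
    """Fraction of gold PII spans that were removed from the output."""
    if not gold_spans:
        return 1.0
    removed = sum(1 for span in gold_spans if span not in predicted)
    return removed / len(gold_spans)

# Stubbed model output for: "My name is Jane Doe, call 555-0100."
predicted = "My name is [NAME], call [PHONE]."
print(redaction_recall(predicted, ["Jane Doe", "555-0100"]))  # 1.0
```

Averaging this over a held-out set is enough to see the 1B-vs-4B gap described above.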

I plan on training and fine-tuning 1B models in the future to see if I can build high-accuracy, task-specific models under 1GB.


Hope to see your results on the new Gemma 3n, and benchmarks on multi-modal inputs.


The trend is that the length of tasks AI can do is doubling every 7 months.
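The doubling claim can be sketched as a simple exponential; the 1-hour baseline below is an arbitrary illustration, not a figure from the source.

```python
# Projected task length t months from now, given a 7-month doubling time.
def projected_task_hours(baseline_hours: float, months: float,
                         doubling_months: float = 7.0) -> float:
    return baseline_hours * 2 ** (months / doubling_months)

print(projected_task_hours(1.0, 14))  # two doublings -> 4.0 hours
```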

Accompanying YT video https://www.youtube.com/watch?v=evSFeqTZdqs


This is a huge unlock for on-device inference. The download time of larger models makes local inference unusable for non-technical users.
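A back-of-envelope calculation shows why download time hurts: the model size and bandwidth below are illustrative assumptions, not measurements.

```python
# Rough download time for a model of a given size at a given bandwidth.
def download_minutes(model_gb: float, mbps: float) -> float:
    """Minutes to download model_gb gigabytes at mbps megabits/second."""
    seconds = model_gb * 8_000 / mbps  # 1 GB ~= 8,000 megabits
    return seconds / 60

# A 4 GB quantized model on a 50 Mbps connection:
print(round(download_minutes(4, 50), 1))  # ~10.7 minutes
```

Ten-plus minutes before first use is a non-starter for most consumer apps, which is what makes smaller on-device models appealing.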


Thanks for calling that out. It was 32GB. I updated the post as well.


Very interesting. I had not thought about gaming at all but that makes a lot of sense.

I also agree the goal should not be to replace ChatGPT. I think ChatGPT is way overkill for a lot of the workloads it is handling. A good solution should probably use the cloud LLM outputs to train a smaller model to deploy in the background.


They look awesome. Will try it out.


Exactly. Why does this not exist yet?


It's an if statement on whether the model has downloaded or not.
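That if statement, sketched out. The path and function names are hypothetical placeholders, not a real API:

```python
# Route to the local model once its weights are on disk; otherwise fall
# back to the cloud. Everything here is a stub for illustration.
from pathlib import Path

MODEL_PATH = Path("models/gemma-3-4b-q4.gguf")  # hypothetical path

def run_local(prompt: str) -> str:
    return f"[local] {prompt}"       # stand-in for local inference

def call_cloud_api(prompt: str) -> str:
    return f"[cloud] {prompt}"       # stand-in for a cloud API call

def answer(prompt: str) -> str:
    if MODEL_PATH.exists():
        return run_local(prompt)
    return call_cloud_api(prompt)

print(answer("hello"))  # "[cloud] hello" until the download finishes
```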


A better solution would train/fine-tune the smaller model on the responses of the larger model, and only push inference to the edge if the smaller model is performant and the hardware specs can handle the workload?


Yeah, that'd be nice: some kind of self-bootstrapping system where you start with a strong cloud model, then fine-tune a smaller local one over time until it's good enough to take over. The tricky part is managing quality drift and deciding when it's 'good enough' without tanking UX. Edge hardware is catching up though, so it feels more feasible by the day.
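The handover gate could be as simple as the sketch below: serve from the edge only once the distilled model clears a quality bar relative to the cloud model and the device can hold it. The thresholds and numbers are illustrative assumptions.

```python
# Hypothetical "good enough" gate for promoting a distilled model to the edge.
def use_edge_model(edge_score: float, cloud_score: float,
                   device_vram_gb: float, model_vram_gb: float,
                   quality_ratio: float = 0.95) -> bool:
    good_enough = edge_score >= quality_ratio * cloud_score
    fits = device_vram_gb >= model_vram_gb
    return good_enough and fits

print(use_edge_model(0.90, 0.92, 16, 4))  # True: within 95% of cloud, fits
print(use_edge_model(0.80, 0.92, 16, 4))  # False: quality gap too large
```

Re-running this check on fresh eval data as the cloud model's outputs keep training the small one is one way to handle the quality-drift concern.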


By "too hard" I do not mean getting started with them to run inference on a prompt. Ollama especially makes that quite easy. But as an application developer, I feel these platforms are too hard to build around. The main issues are finding the right small-enough, task-specific model, and how long these models take to download for the end user.


I guess it depends on expectations: if your expectation is a CRUD app that opens in 5 seconds, then sure, it's definitely tedious. People do install things though; the companion app for DJI action cameras is 700 MB (which is an abomination, but still). Modern games are over 100 GB on the high side, so downloading 8-16 GB of tensors one time is no big deal. You mentioned that there are 663 different models of dsr1-7b on Hugging Face; sure, but if you want that model on ollama it's just `ollama run deepseek-r1`.

As a developer, the amount of effort I'm likely to spend on the infra side of getting the model onto the user's computer and getting it running is now FAR FAR below the amount of time I'll spend developing the app itself or getting together a dataset to tune the model I want. Inference is solved enough. "Getting the correct small enough model" is something I would spend a day or two thinking about and testing when building something regardless. It's not hard to check how much VRAM someone has and pick the right model; the decision tree for that will have like 4 branches. It's just so little effort compared to everything else you're going to have to do to deliver something of value to someone. Especially in the set of users that have a good reason to run locally.
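That 4-branch decision tree might look like the sketch below; the VRAM cutoffs and model names are made up for illustration.

```python
# Hypothetical VRAM -> model-size decision tree, roughly 4 branches.
def pick_model(vram_gb: float) -> str:
    if vram_gb >= 24:
        return "14b-q4"
    if vram_gb >= 12:
        return "7b-q4"
    if vram_gb >= 6:
        return "4b-q4"
    return "1b-q4"

print(pick_model(16))  # "7b-q4"
```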


I spent a couple of weeks trying out local inference solutions for a project. Wrote up my thoughts with some performance benchmarks in a blog.

TLDR -- What these frameworks can do on off-the-shelf laptops is astounding. However, it is very difficult to find and deploy a task-specific model, and the models themselves (even with quantization) are so large that the download would kill UX for most applications.


There are ways to improve the performance of local LLMs with inference-time techniques. You can try optillm (https://github.com/codelion/optillm); it is possible to match the performance of larger models on narrow tasks by doing more at inference time.
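One such inference-time technique, self-consistency voting, in miniature: sample several answers to the same prompt and return the majority. This is a generic sketch of the idea, not optillm's actual API; the sample list stands in for repeated calls to a small local model.

```python
# Majority vote over multiple sampled answers to one prompt.
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    return Counter(samples).most_common(1)[0][0]

# Five stubbed samples for the same math question:
print(majority_vote(["42", "42", "41", "42", "17"]))  # "42"
```

Trading extra compute at inference for accuracy this way is exactly how a small local model can close the gap with a larger one on narrow tasks.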


Yes, in the future. We already share the source code in both commercial and non-commercial engagements. Drop me a line at amir [at] relta.dev if interested.


I am building an AI Slack Moderator bot [0] as a side project. I was thinking that this could be a cool intermediate layer to allow a user to ask questions about moderation logs. However, I am not ready to build this right now. Feel free to add me to an email list for people who want to know when you OSS it down the line.

[0] - https://popsia.com

