
LM Studio seems pretty good at making local models easier to use


A less known feature of LM Studio I really like is speculative decoding: https://lmstudio.ai/blog/lmstudio-v0.3.10

Basically you let a very small model speculate on the next few tokens, and the large model then blesses/rejects those predictions. Depending on how well the small model performs, you get massive speedups that way.

The small model has to be as close to the big model as possible - I tried this with models from different vendors and it slowed generation down by about 3x. So you need to pair a small Qwen 2.5 with a big Qwen 2.5, etc.


How exactly does this give a speedup? If you have to wait for the large model to confirm the small model's predictions, wouldn't it always be slower than just running the large model?


As far as I understand it works like this:

1. Small model generates k tokens (probably k>=4 or even higher; there's a tradeoff to be made here, depending on the model sizes)

2. Big model computes logits for all k tokens in parallel.

3. Ideally, all tokens pass the probability threshold. That might be the case for standard phrases that the model likes to use, like "Alright, the user wants me to". If not all tokens pass the probability threshold, then the first unsuitable token and all after it are discarded.

4. Return to 1., maybe with an adjusted k.
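The loop above can be sketched in toy Python. The draft and target "models" here are random stand-ins (my own illustration, not LM Studio's implementation) - the point is the accept-a-prefix control flow:

```python
import random

random.seed(0)
VOCAB = list(range(100))  # toy vocabulary of token ids

def draft_model(prefix, k):
    """Stand-in for the small draft model: proposes k next tokens."""
    return [random.choice(VOCAB) for _ in range(k)]

def target_verify(prefix, proposed):
    """Stand-in for the big model: in reality it scores all k proposed
    tokens in ONE forward pass, then accepts a prefix of them."""
    accepted = []
    for tok in proposed:
        if random.random() < 0.7:  # pretend the big model agrees 70% of the time
            accepted.append(tok)
        else:
            # First mismatch: discard this token and everything after it,
            # and emit the big model's own choice for this position instead.
            accepted.append(random.choice(VOCAB))
            break
    return accepted

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        proposed = draft_model(out, k)
        out.extend(target_verify(out, proposed))  # always >= 1 token per round
    return out[len(prompt):len(prompt) + n_tokens]
```

Because every round yields at least one token (the big model's own pick on rejection), the output is guaranteed to match what the big model alone would accept, just produced in fewer big-model passes.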


Apparently for the bigger model, checking a token is faster than generating a fresh one. So if the tiny model gets it right you get a speed bump. Can't say I fully understand why it's faster to check, either.

Needs a pretty large difference in size to result in a speedup. A 0.5B draft against a 27B target is the only pairing where I've seen a speed bump.
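A rough back-of-envelope model (my own simplification, not from the LM Studio docs) shows why both the acceptance rate and the draft/target size ratio matter. Assume the big model verifies all k drafted tokens in one pass costing the same as generating one token, and each drafted token is accepted independently with probability p:

```python
def expected_speedup(k, accept_rate, draft_cost):
    """Illustrative speedup estimate for speculative decoding.

    k           - number of tokens drafted per round
    accept_rate - per-token probability the big model accepts a draft token
    draft_cost  - cost of one draft-model step, relative to one big-model step
    """
    p = accept_rate
    # Expected accepted draft tokens per round: p + p^2 + ... + p^k
    e_accepted = sum(p**i for i in range(1, k + 1))
    # On rejection (prob 1 - p^k) the big model still emits one token itself.
    tokens_per_round = e_accepted + (1 - p**k)
    cost_per_round = 1 + k * draft_cost  # one big verify pass + k draft steps
    # Baseline is 1 token per unit cost (big model alone).
    return tokens_per_round / cost_per_round
```

With a high acceptance rate and a cheap draft model (say p=0.8, draft 5% of the big model's cost) this comes out well above 1x; with a poor match between the models (low p) or a draft model that isn't much cheaper, it drops below 1x - i.e. slower than just running the big model, which matches the "different vendors made it 3x slower" experience above.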


I'm genuinely afraid it's going to do telemetry one day.

I'm sure someone is watching its network traffic, but I'm not.

I take the risk now, but I ask questions about myself, relationships, conversations, etc... Stuff I don't exactly want Microsoft/ChatGPT to have.


Local inference is synonymous with privacy for me. As it stands, there is no universe in which your online LLM usage is private, until laws get put into effect. I suspect most of these companies are going to put in a Microsoft Clippy style assistant soon that will act as a recommendation/ad engine, and this of course requires parsing every convo you've ever had. A paid tier may remove Clippy, but boy oh boy the free tier (which most people will use) won't.

Clippy is coming back guys, and we have to be ready for it.


I've configured Little Snitch to only allow it access to huggingface. I think for updates I need to set it back to "ask for each connection" or something like that.


If you want privacy, use local models and an open-source chat interface such as Open WebUI or Jan (avoid proprietary systems such as Msty or LM Studio).

https://github.com/janhq/jan

https://github.com/open-webui/open-webui


Here is another:

https://msty.app/


Can anyone vouch for this? I (personally) don't mind that it's closed source, but I've never heard of it and can't find much about it. The website makes it look fantastic, so I'm intrigued, but I'm hesitant to give it all of my API keys.


Reddit is probably your best bet. I think we all take some risk, even with something like LMStudio which is closed source, since all these apps are basically a new genre.


they made it so easy to do specdec, that alone sold it for me

Some models even have a 0.5B draft model. The speed increase is incredible.


They look awesome. Will try it out.



