tkp-415's comments

tkp-415 · 2026-02-20T21:29:14 1771622954

My first use case of an LLM for security research was feeding Gemini Semgrep scan results of an open source repo. It definitely was a great way to get the LLM to start looking at something, and provide a usable sink + source flow for manual review.

I assumed I was still dealing with lots of false positives from Gemini due to using the free version and not being able to have it memorize the full code base. Either way combining those two tools makes the review process a lot more enjoyable.

tkp-415 · 2026-02-20T14:54:56 1771599296

Can anyone point me in the direction of getting a model to run locally and efficiently inside something like a Docker container on a system with not so strong computing power (aka a Macbook M1 with 8gb of memory)?

Is my only option to invest in a system with more computing power? These local models look great, especially something like https://huggingface.co/AlicanKiraz0/Cybersecurity-BaronLLM_O... for assisting in penetration testing.

I've experimented with a variety of configurations on my local system, but in the end it turns into a make shift heater.

0xbadcafebee · 2026-02-20T17:42:34 1771609354

8GB is not enough to do complex reasoning, but you could do very small simple things. Models like Whisper, SmolVLM, Quen2.5-0.5B, Phi-3-mini, Granite-4.0-micro, Mistral-7B, Gemma3, Llama-3.2 all work on very little memory. Tiny models can do a lot if you tune/train them. They also need to be used differently: system prompt preloaded with information, few-shot examples, reasoning guidance, single-task purpose, strict output guidelines. See https://github.com/acon96/home-llm for an example. For each small model, check if Unsloth has a tuned version of it; it reduces your memory footprint and makes inference faster.

For your Mac, you can use Ollama, or MLX (Mac ARM specific, requires different engine and different model disk format, but is faster). Ramalama may help fix bugs or ease the process w/MLX. Use either Docker Desktop or Colima for the VM + Docker.

For today's coding & reasoning models, you need a minimum of 32GB VRAM combined (graphics + system), the more in GPU the better. Copying memory between CPU and GPU is too slow so the model needs to "live" in GPU space. If it can't fit all in GPU space, your CPU has to work hard, and you get a space heater. That Mac M1 will do 5-10 tokens/s with 8GB (and CPU on full blast), or 50 token/s with 32GB RAM (CPU idling). And now you know why there's a RAM shortage.

BoredomIsFun · 2026-02-21T16:23:48 1771691028

> Mistral-7B

Is hopelessly dated. There are much better newer models around.

mft_ · 2026-02-20T15:25:52 1771601152

There’s no way around needing a powerful-enough system to run the model. So you either choose a model that can fit on what you have —i.e. via a small model, or a quantised slightly larger model— or you access more powerful hardware, either by buying it or renting it. (IME you don’t need Docker. For an easy start just install LM Studio and have a play.)

I picked up a second-hand 64GB M1 Max MacBook Pro a while back for not too much money for such experimentation. It’s sufficiently fast at running any LLM models that it can fit in memory, but the gap between those models and Claude is considerable. However, this might be a path for you? It can also run all manner of diffusion models, but there the performance suffers (vs. an older discrete GPU) and you’re waiting sometimes many minutes for an edit or an image.

ryandrake · 2026-02-20T15:41:28 1771602088

I wasn't able to have very satisfying success until I bit the bullet and threw a GPU at the problem. Found an actually reasonably priced A4000 Ada generation 20GB GPU on eBay and never looked back. I still can't run the insanely large models, but 20GB should hold me over for a while, and I didn't have to upgrade my 10 year old Ivy Bridge vintage homelab.

sigbottle · 2026-02-20T15:29:21 1771601361

Are mac kernels optimized compared to CUDA kernels? I know that the unified GPU approach is inherently slower, but I thought a ton of optimizations were at the kernel level too (CUDA itself is a moat)

liuliu · 2026-02-20T20:36:49 1771619809

Depending on what you do. If you are doing token generations, compute-dense kernel optimization is less interesting (as, it is memory-bounded) than latency optimizations else where (data transfers, kernel invocations etc). And for these, Mac devices actually have a leg than CUDA kernels (as pretty much Metal shaders pipelines are optimized for latencies (a.k.a. games) while CUDA shaders are not (until cudagraph introduction, and of course there are other issues).

ttoinou · 2026-02-21T01:30:42 1771637442

There’s this developer called nightmedia who converts a lot of models to apple MLX. I can run Qwen3 coder next at 60 tps on my m4 max. It works

bigyabai · 2026-02-20T18:43:12 1771612992

Mac kernels are almost always compute shaders written in Metal. That's the bare-minimum of acceleration, being done in a non-portable proprietary graphics API. It's optimized in the loosest sense of the word, but extremely far from "optimal" relative to CUDA (or hell, even Vulkan Compute).

Most people will not choose Metal if they're picking between the two moats. CUDA is far-and-away the better hardware architecture, not to mention better-supported by the community.

zozbot234 · 2026-02-20T15:00:18 1771599618

The general rule of thumb is that you should feel free to quantize even as low as 2 bits average if this helps you run a model with more active parameters. Quantized models are not perfect at all, but they're preferable to the models with fewer, bigger parameters. With 8GB usable, you could run models with up to 32B active at heavy quantization.

zargon · 2026-02-21T04:58:55 1771649935

A large model (100B+, the more the better) may be acceptable at 2-bit quantization, depending on the task. But not a small model. Especially not for technical tasks. On top of that, one still needs room for OS, software and KV cache. 8GB is just not very useful for local LLMs. That said, it can still be entertaining to try out a 4-bit 8B model for the fun of it.

zozbot234 · 2026-02-21T12:18:54 1771676334

100B+ is the amount of total parameters, whereas what matters here is active - very different for sparse MoE models. You're right that there's some overhead for the OS/software stack but it's not that much. KV-cache is a good candidate for being swapped out, since it only gets a limited amount of writes per emitted token.

zargon · 2026-02-21T15:01:47 1771686107

Total parameters, not active parameters, is the property that matters for model robustness under extreme quantization.

Once you're swapping from disk, the performance will be quite unusable for most people. And for local inference, KV cache is the worst possible choice to put on disk.

ontouchstart · 2026-02-20T17:08:33 1771607313

This is the easiest set up on a Mac. You need at least 16gb on a MacBook:

https://github.com/ggml-org/llama.cpp/discussions/15396

xrd · 2026-02-20T15:00:08 1771599608

I think a better bet is to ask on reddit.

https://www.reddit.com/r/LocalLLM/

Everytime I ask the same thing here, people point me there.

yjftsjthsd-h · 2026-02-20T18:18:32 1771611512

With only 8 GB of memory, you're going to be running a really small quant, and it's going to be slow and lower quality. But yes, it should be doable. In the worst case, find a tiny gguf and run it on CPU with llamafile.

HanClinto · 2026-02-20T16:13:37 1771604017

Maybe check out Docker Model Runner -- it's built on llama.cpp (in a good way -- not like Ollama) and handles I think most of what you're looking for?

https://www.docker.com/blog/run-llms-locally/

As far as how to find good models to run locally, I found this site recently, and I liked the data it provides:

https://localclaw.io/

Hamuko · 2026-02-20T17:19:47 1771607987

I tried to run some models on my M1 Max (32 GB) Mac Studio and it was a pretty miserable experience. Slow performance and awful results.

tkp-415 · 2026-02-15T20:32:04 1771187524

This is neat. What VPS service do you use? I am trying to replace my tendency to spin up small EC2 instances just to deploy a simple web app.

djkurlander · 2026-02-15T20:43:56 1771188236

My $6.75 per year VPS was a Black Friday sale from Dedirock on https://lowendtalk.com. Some of the Black Friday sales are still being honored. The site https://cheapvpsbox.com/ has a nice search engine for cheap VPS sales.

KronisLV · 2026-02-16T07:06:31 1771225591

Note: just be sure to have some sort of backup solution because when a deal seems to be too good to be true, sometimes the company will go under.

I had that happen years ago, consequently it meant my first ever VPS disappearing.

I think the deal back then was like 15 EUR per year.

Scaleway has small instances (Stardust) btw: https://www.scaleway.com/en/pricing/virtual-instances/

They seem expensive otherwise so I’d go with Hetzner for most other stuff. Heck I’ve even used Contabo too (they don’t have the best reputation, but it worked out okay for me).

winrid · 2026-02-15T23:30:44 1771198244

I recommend a dedicated $40 hetzner or OVH box and just keep all your projects on that. They're pretty powerful. I was spending a lot on a bunch of $5 linodes until recently and you have to keep them upgraded etc...

fragmede · 2026-02-15T23:34:53 1771198493

how deep are your WebApps? Cloudflare pages and workers have a generous free tier, depending on what you're doing.

tkp-415 · 2026-01-28T22:03:32 1769637812

Curious to see how you can get Gemini fully intercepted.

I've been intercepting its HTTP requests by running it inside a docker container with:

-e HTTP_PROXY=http://127.0.0.1:8080 -e HTTPS_PROXY=http://host.docker.internal:8080 -e NO_PROXY=localhost,127.0.0.1

It was working with mitmproxy for a very brief period, then the TLS handshake started failing and it kept requesting for re-authentication when proxied.

You can get the whole auth flow and initial conversation starters using Burp Suite and its certificate, but the Gemini chat responses fail in the CLI, which I understand is due to how Burp handles HTTP2 (you can see the valid responses inside Burp Suite).

jmuncor · 2026-01-28T22:30:08 1769639408

Tried with gemini and gave more headaches than anything else, would love if you can help me adding it to sherlock... I use claude and gemini, claude mainly for coding, so wanted to set it up first. With gemini, ran into the same problem that you did...

paulirish · 2026-01-28T23:18:10 1769642290

Gemini CLI is open source. Don't need to intercept at the network when you can just add inspectGeminiApiRequest() in the source. (I suggest it because I've been maintaining a personal branch with exactly that :)

tkp-415 · 2026-01-29T16:27:29 1769704049

Ahh, that seems much simpler. Dump the request / response directly. Now I'm wondering if I can use Gemini to patch Gemini.

paulirish · 2026-01-29T21:02:32 1769720552

Yup. It does a great job in there.

tkp-415 · 2026-01-06T23:06:52 1767740812

Hashes generated using https://pypi.org/project/ImageHash/, then a hamming distance is calculated with SQL.

Unfortunately I don't currently have a repo that can be forked.

tkp-415 · 2026-01-06T23:04:56 1767740696

The confusion is understandable as the comparison is basic and uses image hashes (https://pypi.org/project/ImageHash/), which are pretty surface level and don't always provide reliable "this image is obviously very similar to that one" results.

You are correct that when you click something in the brackets, the results returned are covers similar to what you clicked.

Still have a lot of room for improvement as I go further down this image matching rabbit hole, but the comparison's current state does provide some useful results every so often.

smusamashah · 2026-01-07T07:38:48 1767771528

As suggested in another comment, CLIP should give the images that actually look similar. This is a great collection of images, and using those you will find very similar covers.

http://same.energy/search?i=yN5ou

Also https://mood.zip/