
It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I think I could probably do 256K since I still have around 4GB of VRAM free). Prompt processing runs at something like 1K tokens/second and generation at around 100 tokens/second. Plenty fast for agentic use via Opencode.
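
If you want to reproduce this with llama.cpp, something like the following should be in the right ballpark (a sketch only: I'm assuming llama-server here, and the -ngl/-c values are just "offload everything, 128K context" rather than anything tuned):

  # Sketch, assuming llama-server: offload all layers, 128K context
  llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    -ngl 99 \
    -c 131072
Since it's an MoE with only ~3B active parameters per token, generation stays fast even though all of the weights sit in VRAM.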


There seem to be a lot of different Q4s of this model: https://www.reddit.com/r/LocalLLaMA/s/kHUnFWZXom

I'm curious which one you're using.


Unsloth Dynamic. Don't bother with anything else.


For anyone else trying to run this on a Mac with 32GB unified RAM, this is what worked for me:

First, make sure enough memory is allocated to the GPU:

  sudo sysctl -w iogpu.wired_limit_mb=24000
Then run llama.cpp but reduce RAM needs by limiting the context window and turning off vision support. (And turn off reasoning for now as it's not needed for simple queries.)

  llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --no-mmproj \
    --no-warmup \
    -np 1 \
    -c 8192 \
    -b 512 \
    --chat-template-kwargs '{"enable_thinking": false}'
You can also enable/disable thinking on a per-request basis:

  curl 'http://localhost:8080/v1/chat/completions' \
    --data-raw '{
      "messages": [{"role": "user", "content": "hello"}],
      "stream": false, "return_progress": false, "reasoning_format": "auto",
      "temperature": 0.8, "max_tokens": -1, "dynatemp_range": 0, "dynatemp_exponent": 1,
      "top_k": 40, "top_p": 0.95, "min_p": 0.05,
      "xtc_probability": 0, "xtc_threshold": 0.1, "typ_p": 1,
      "repeat_last_n": 64, "repeat_penalty": 1, "presence_penalty": 0, "frequency_penalty": 0,
      "dry_multiplier": 0, "dry_base": 1.75, "dry_allowed_length": 2, "dry_penalty_last_n": -1,
      "samplers": ["penalties", "dry", "top_n_sigma", "top_k", "typ_p", "top_p", "min_p", "xtc", "temperature"],
      "chat_template_kwargs": {"enable_thinking": true}
    }' | jq .
If anyone has any better suggestions, please comment :)


Shouldn't you be using MLX because it's optimised for Apple Silicon?

Many user benchmarks report up to 30% better memory usage and up to 50% higher token generation speed:

https://reddit.com/r/LocalLLaMA/comments/1fz6z79/lm_studio_s...

As the post says, LM Studio has an MLX backend which makes it easy to use.

If you still want to stick with llama-server and GGUF, look at llama-swap, which gives you a single frontend that presents a list of models and dynamically starts a llama-server process with the right one:

https://github.com/mostlygeek/llama-swap

(actually you could run any OpenAI-compatible server process with llama-swap)


I didn't know about llama-swap until yesterday. Apparently you can set it up such that it gives different 'model' choices which are the same model with different parameters. So, e.g. you can have 'thinking high', 'thinking medium' and 'no reasoning' versions of the same model, but only one copy of the model weights would be loaded into llama server's RAM.
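
Roughly, the config would look like this (a sketch based on my reading of llama-swap's example config; the model names, path, and flags below are placeholders):

  models:
    "qwen-thinking":
      cmd: |
        llama-server --port ${PORT}
        -m /models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
        --jinja -c 32768
    "qwen-no-think":
      cmd: |
        llama-server --port ${PORT}
        -m /models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
        --jinja -c 32768 --chat-template-kwargs '{"enable_thinking": false}'
Only one of those is live at a time: picking a different name from the list makes llama-swap stop the current llama-server process and start a new one with the other flags, so there's only ever one copy of the weights in RAM.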

Regarding mlx, I haven't tried it with this model. Does it work with unsloth dynamic quantization? I looked at mlx-community and found this one, but I'm not sure how it was quantized. The weights are about the same size as unsloth's 4-bit XL model: https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-4bit/tr...


Yes that's right. The config is described by the developer here:

https://www.reddit.com/r/LocalLLaMA/comments/1rhohqk/comment...

And is in the sample config too:

https://github.com/mostlygeek/llama-swap/blob/main/config.ex...

iiuc MLX quants are not GGUFs for llama.cpp. They are a different file format which you use with the MLX inference server. LM Studio abstracts all that away so you can just pick an MLX quant and it does all the hard work for you. I don't have a Mac so I have not looked into this in detail.
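
That said, if you want to skip LM Studio, the mlx-lm package is the usual way to run those mlx-community quants directly. I believe the commands look roughly like this (untested on my end, flag names as I remember them from the mlx-lm docs):

  pip install mlx-lm

  # one-off generation with the 4-bit quant mentioned above
  mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-4bit --prompt "hello"

  # or serve an OpenAI-compatible API, analogous to llama-server
  mlx_lm.server --model mlx-community/Qwen3.5-35B-A3B-4bit --port 8080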


FYI UD quants of 3.5-35BA3B are broken, use bartowski or AesSedai ones.


They've uploaded the fix. If those are still broken something bad has happened.


UD-Q4_K_XL?


I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?


> I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?

Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulkan.

Vulkan is easier to get going using the Mesa OSS drivers under Linux; HIP might give you slightly better performance.
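
If you end up building llama.cpp yourself, the backend is just a cmake option. Roughly (flag spellings have shifted a bit between llama.cpp versions, and gfx1030 is the RDNA2 target for a 6900 XT):

  # Vulkan backend: runs on the Mesa RADV driver, no ROCm install required
  cmake -B build -DGGML_VULKAN=ON
  cmake --build build --config Release -j

  # HIP backend: needs the ROCm toolchain installed
  cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030
  cmake --build build --config Release -j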


The Vulkan backend for llama.cpp isn't that far behind ROCm for prompt processing and token generation speeds.


I think AMD just added ROCm support for RDNA2 recently? I can run torch and aisudio with it just fine.

They also finally fixed building all the AI-related stuff on Windows, so you're no longer limited to Linux for that.
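
Quick sanity check that the ROCm build of torch actually sees the card (the ROCm wheels reuse the regular torch.cuda API):

  python -c "import torch; print(torch.version.hip, torch.cuda.is_available())"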



