Hacker News | past | comments | ask | show | jobs | submit | liuliu's comments

Depends on what you do. If you are doing token generation, compute-dense kernel optimization is less interesting (it is memory-bound) than latency optimization elsewhere (data transfers, kernel invocations, etc.). And for these, Mac devices actually have a leg up on CUDA: Metal shader pipelines are optimized for latency (a.k.a. games), while CUDA kernels were not (until the introduction of CUDA Graphs, and of course there are other issues).
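To see why token generation is memory-bound rather than compute-bound, a back-of-envelope sketch: every decoded token has to stream all active weights through memory once, so memory bandwidth sets the ceiling. The numbers below are hypothetical, not measurements of any specific device.

```python
# Rough upper bound on single-stream decode speed for a memory-bound model:
# each generated token reads all active weights from memory once, so
# tokens/sec is capped by bandwidth / weight size.
def max_tokens_per_sec(weight_bytes: float, mem_bandwidth_bytes_per_sec: float) -> float:
    return mem_bandwidth_bytes_per_sec / weight_bytes

# Hypothetical numbers: an 8B-parameter model quantized to 4 bits (~4 GB of
# weights) on a machine with ~400 GB/s of memory bandwidth.
print(max_tokens_per_sec(4e9, 400e9))  # ~100 tokens/s ceiling
```

At that point, shaving microseconds off kernel launches and data transfers matters more than squeezing FLOPs out of the matmul kernels themselves.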

It is very simple. Storage / bandwidth is not expensive. Residential bandwidth is. If you can convince people to install bandwidth-related software in their homes, you can then charge other people $5 to $10 per 1 GiB of bandwidth (useful mostly for botnets: getting around DDoS protections and other reCAPTCHA tasks).

Thank you for your suggestion. Below are only our plans/intentions; we welcome feedback on them:

We are not going to do what you suggest. Instead, our approach is to use the RAM people aren't using at the moment for a fast edge cache close to their area.

We've tried this architecture and got very low latency and high bandwidth. People would not be contributing their resources to anything they don't know about.


> We have a local model we would like to distribute but don't have a good CDN.

That is not true. I am serving models off Cloudflare R2. It is 1 petabyte per month in egress use and I basically pay peanuts (~$200 everything included).


1 petabyte per month is 1 million downloads of a 1 GB file. We intend to scale to more than 1 million downloads per month. We have a specific scaling architecture in mind. We're qualified to say this because we've ported a billion-parameter model to run in your browser, fast, on either WebGPU or WASM. (You can see us doing it live at the YouTube link in my comment above.) There is a lot of demand for that.

The bandwidth is free on Cloudflare R2. I paid money for storage (~10 TiB of different models). If you only host a 1 GiB file there, you are only paying about $0.01 per month, I believe.
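The arithmetic behind those numbers is easy to sketch. This assumes R2's published pricing at the time of writing (roughly $0.015 per GB-month of storage, $0 egress); check the current price sheet before relying on it.

```python
# Back-of-envelope Cloudflare R2 cost: egress is free, so the bill is
# essentially just storage. Price per GB-month is an assumption based on
# R2's published pricing (~$0.015/GB-month); verify before budgeting.
def r2_monthly_cost(stored_gb: float, price_per_gb_month: float = 0.015) -> float:
    return stored_gb * price_per_gb_month

print(round(r2_monthly_cost(10 * 1024), 2))  # ~10 TiB of models -> ~$153.6/month
```

That lines up with the "~$200 everything included" figure above for a petabyte of monthly egress, since the egress side contributes nothing.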

The solution is to make the model stronger so that malicious intents can be better distinguished (and no, that is not a guarantee, like many things in life). Sandboxing is a baseline, but as long as you give the model your credentials, there aren't many guardrails that can be applied other than making the model stronger (a separate guard model is the wrong path, IMHO).

I think it is generally correct to say "hey, we need stronger models", but rather ambitious to think we will really solve alignment with current attention-based models and RL side effects. A guard model gives an additional layer of protection, and probably a stronger posture when used as an early-warning system.

Sure. If you treat the "guard model" as a diversification strategy, it is another layer of protection, just like diversification in compilation helps solve the root-of-trust issue (Reflections on Trusting Trust). I am just generally suspicious of weak-to-strong supervision.

I think it is in general pretty futile to implement permission systems / guardrails that basically insert a human in the loop (humans need to review the work to fully understand why it needs to send that email, and at that point, why do you need an LLM to send the email again?).


fair enough

Agree. I think it is just that people have their own simplified mental models of how it works. However, there is no reason to believe these simplified mental models are accurate (otherwise we would have been here 20 years earlier with HMM models).

The simplest way to stop people from thinking is to have a semi-plausible / "made-me-smart" incorrect mental model of how things work.


Did you mean to use the word "mental"?

Note that Qwen Image 1.0 (2512) wasted ~8B weights on timestep embedding. Both the Z-Image and FLUX.2 series corrected that.
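For intuition on how timestep conditioning can eat billions of weights: in a common diffusion-transformer design, a sinusoidal timestep embedding is passed through per-block projections that produce AdaLN-style scale/shift/gate vectors, and those projections scale with hidden_dim squared times the number of blocks. This is an illustrative sketch with hypothetical dimensions, not the actual Qwen Image architecture.

```python
# Illustrative parameter count for AdaLN-style timestep conditioning in a
# diffusion transformer: each block owns a linear layer mapping the timestep
# embedding (hidden_dim) to mods_per_block modulation vectors of hidden_dim.
def adaln_timestep_params(hidden_dim: int, num_blocks: int, mods_per_block: int = 6) -> int:
    per_block = hidden_dim * mods_per_block * hidden_dim + mods_per_block * hidden_dim  # weight + bias
    return num_blocks * per_block

# Hypothetical dims for a large DiT: prints ~3.4 (billions of parameters).
print(adaln_timestep_params(hidden_dim=3072, num_blocks=60) / 1e9)
```

Sharing one modulation projection across blocks (or shrinking its input) removes most of that cost, which is roughly the kind of correction the comment refers to.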

I agree. It is either ads, or the Anthropic way (which is: you are too poor to use our chatbot). There is no other way to pay for the >$1 trillion per year of CapEx for building these chatbots.

Would there be another way? Sure, it could be government-funded, like our public school system. But that is not possible in the current political climate.

Money doesn't grow on trees, and tokens cost a lot of money. There will be a divide between people who can afford these tokens and people who cannot. I feel it is better to have ways for people who cannot afford these tokens to try them.


Why doesn't MLX just detect apple10 support (for Metal)? That excludes all the devices without NA.


Collaborative software development is a high-trust activity. It simply doesn't work in a low-trust environment. This is not an issue with code review; it is an issue with maintaining a trusted environment for collaboration.


Are you sure it is choked on writes, not on reads and writes? SQLite's default setup is inefficient in many ways (as are its default compilation options), and that often causes issues.

(I am just asking: are you sure WAL is on?)
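For reference, a minimal sketch of checking and enabling WAL from Python's stdlib sqlite3 module (the `app.db` file name is made up; `synchronous=NORMAL` is a common companion setting, not mandatory):

```python
import sqlite3

# WAL lets readers proceed concurrently with a writer, and synchronous=NORMAL
# avoids an fsync on every transaction (a common trade-off for app workloads).
conn = sqlite3.connect("app.db")
conn.execute("PRAGMA journal_mode=WAL;")   # persists in the database file
conn.execute("PRAGMA synchronous=NORMAL;") # per-connection setting

print(conn.execute("PRAGMA journal_mode;").fetchone()[0])  # prints "wal"
conn.close()
```

Note that `journal_mode=WAL` sticks to the database file, while `synchronous` must be set again on each new connection.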


I'd imagine that's it. With WAL you can probably hit >1,000 writes a second.

