When you buy a subscription plan, you’re buying use of the harness, not the underlying compute/tokens; buying those tokens on their own via the API is way more expensive. This is probably because:
* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.
* There is some speculation that there is cooperative optimization between the harness and the backend (cache-related and the like).
* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.
He wasn't using the regular paid API (i.e. per-token pricing). He was using the endpoints for their subscribed customers (i.e. paid monthly and heavily subsidized).
For the following statement, I'm assuming he was using Gemini the same way he was using Claude.
I don’t believe it’s unique or new that companies will revoke access if you use an unpublished API that their apps rely on. I don’t see anything wrong with it myself. If you want that kind of access, pay for normal token use on the published APIs. There is no expectation that you can use an application's internal APIs, even as a paid user, when they are not explicitly published for that purpose.
Indeed, that's why Anthropic, OpenAI and other LLM providers are known to adhere to published APIs to gather the world's data, obeying licensing and robots.txt.
I was under the impression that they do obey robots.txt now? There are clearly a lot of dumb agents that don’t, but I didn’t think the major AI labs were among them.
Obeying robots.txt (now) is still better than not obeying it, regardless of what they did before.
The alternative is to say that bugs shouldn’t be fixed because it’s a ladder pull or something. But that’s crazy. What’s the point of complaining if not to get people to fix things?
That snippet is trying to say that you can compute the keys and values for all the input tokens at once; you don't need to loop over them, since they're all available up front.
For decode, by contrast, you have to generate each output token sequentially.
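A rough sketch of the difference (toy numpy, with made-up shapes; `Wk`/`Wv` stand in for the key/value projections):

```python
import numpy as np

d = 64                       # head dimension (illustrative)
T = 10                       # number of input (prompt) tokens
Wk = np.random.randn(d, d)   # stand-in key projection weights
Wv = np.random.randn(d, d)   # stand-in value projection weights
x = np.random.randn(T, d)    # embeddings of all T input tokens

# Prefill: every input token is known up front, so the keys and values
# fall out of two matrix multiplies -- no loop over positions.
K = x @ Wk   # shape (T, d)
V = x @ Wv   # shape (T, d)

# Decode: each new token depends on the previously generated one, so
# generation is inherently sequential; the KV cache grows a row per step.
for _ in range(5):
    new_tok = np.random.randn(1, d)        # stand-in for the sampled token's embedding
    K = np.concatenate([K, new_tok @ Wk])  # append one key row
    V = np.concatenate([V, new_tok @ Wv])  # append one value row
```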
Prefill is part of inference. It's the first major step, where you compute all the keys and values for the input tokens.
Decode is the next major step where you start generating output tokens one at a time.
Both run on GPUs but have quite different workloads (rough numbers after the list):
1. Prefill does comparatively little HBM I/O per unit of compute, since each weight read is reused across all the input tokens.
2. Decode is light on compute but has to read the keys and values computed in the prefill stage from HBM for every output token.
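A back-of-envelope way to see this (all numbers are assumptions for illustration): modern GPUs deliver on the order of a few hundred FLOPs per byte of HBM bandwidth, so prefill can keep the compute units busy while single-stream decode stalls on memory.

```python
# Rough arithmetic intensity (FLOPs per byte of HBM traffic) of a weight
# matmul. Numbers are illustrative assumptions, not measurements.
bytes_per_param = 2            # fp16 weights
flops_per_param_per_token = 2  # one multiply + one add

def flops_per_byte(tokens: int) -> float:
    # A weight read from HBM is reused for every token processed together.
    return flops_per_param_per_token * tokens / bytes_per_param

print(flops_per_byte(2048))  # prefill, 2048-token prompt: ~2048 FLOPs/byte
print(flops_per_byte(1))     # decode, one token per step:  ~1 FLOP/byte
```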
Yes, decoding is very I/O heavy: it has to stream all of the model weights in from HBM for every token decoded. However, that cost can be shared between the requests in the same batch, so if the system has more GPU RAM to hold larger batches, the I/O cost per request goes down.
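Putting rough, assumed numbers on that (a 70B fp16 model and a bandwidth figure picked purely for illustration):

```python
# Made-up but plausible numbers to show the amortization.
weights_gb = 140     # e.g. a 70B-parameter model in fp16
hbm_gb_per_s = 3000  # assumed aggregate HBM bandwidth

# Each decode step streams all the weights from HBM once, no matter how
# many requests are batched, so bigger batches cut the per-token cost.
for batch_size in (1, 8, 32):
    gb_per_token = weights_gb / batch_size
    tokens_per_s = hbm_gb_per_s / weights_gb * batch_size
    print(f"batch={batch_size:>2}: {gb_per_token:6.1f} GB streamed per token, "
          f"~{tokens_per_s:.0f} tokens/s total")
```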