When you buy a subscription plan, you’re buying use of the harness, not the underlying compute/tokens; buying those tokens on their own via the API is way more expensive. This is probably because:
* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.
* There is some speculation that there is cooperative optimization between the harness and the backend (cache-related and the like).
* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.
He wasn't using the regular paid API (i.e. per-token pricing). He was using the endpoints for their subscribed customers (i.e. paid monthly and heavily subsidized).
For the following statement, I'm assuming he was using Gemini the same way he was using Claude.
I don’t believe it’s unique or new that companies will revoke access if you use an unpublished API that their apps rely on. I don’t see anything wrong with it myself. If you want that kind of access, pay for normal token use on the published APIs. There is no expectation that you can use an application's internal APIs, even as a paid user, when they are not explicitly published for that purpose.
Indeed, that's why Anthropic, OpenAI and other LLM providers are known to adhere to published APIs to gather the world's data, obeying licensing and robots.txt.
I was under the impression that they do obey robots.txt now? There are clearly a lot of dumb agents that don’t, but I didn’t think the major AI labs were among them.
Obeying robots.txt (now) is still better than not obeying it, regardless of what they did before.
The alternative is to say that bugs shouldn’t be fixed because it’s a ladder pull or something. But that’s crazy. What’s the point of complaining if not to get people to fix things?
That snippet is trying to say that you can compute the keys and values for all the input tokens at once; you don't need to loop over them, since they're all available up front.
For decode, by contrast, you have to generate each output token sequentially.
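A rough sketch of the difference (toy numpy, with made-up shapes; `Wk`/`Wv` stand in for the key/value projections):

```python
import numpy as np

d = 64                       # head dimension (illustrative)
T = 10                       # number of input (prompt) tokens
Wk = np.random.randn(d, d)   # stand-in key projection weights
Wv = np.random.randn(d, d)   # stand-in value projection weights
x = np.random.randn(T, d)    # embeddings of all T input tokens

# Prefill: every input token is known up front, so the keys and values
# fall out of two matrix multiplies -- no loop over positions.
K = x @ Wk   # shape (T, d)
V = x @ Wv   # shape (T, d)

# Decode: each new token depends on the previously generated one, so
# generation is inherently sequential; the KV cache grows a row per step.
for _ in range(5):
    new_tok = np.random.randn(1, d)        # stand-in for the sampled token's embedding
    K = np.concatenate([K, new_tok @ Wk])  # append one key row
    V = np.concatenate([V, new_tok @ Wv])  # append one value row
```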
Prefill is part of inference. It's the first major step, where you compute all the keys and values for the input tokens.
Decode is the next major step where you start generating output tokens one at a time.
Both run on GPUs but have quite different workloads (rough numbers after the list):
1. Prefill does comparatively little HBM I/O per unit of compute, since each weight read is reused across all the input tokens.
2. Decode is light on compute but has to read the keys and values computed in the prefill stage from HBM for every output token.
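A back-of-envelope way to see this (all numbers are assumptions for illustration): modern GPUs deliver on the order of a few hundred FLOPs per byte of HBM bandwidth, so prefill can keep the compute units busy while single-stream decode stalls on memory.

```python
# Rough arithmetic intensity (FLOPs per byte of HBM traffic) of a weight
# matmul. Numbers are illustrative assumptions, not measurements.
bytes_per_param = 2            # fp16 weights
flops_per_param_per_token = 2  # one multiply + one add

def flops_per_byte(tokens: int) -> float:
    # A weight read from HBM is reused for every token processed together.
    return flops_per_param_per_token * tokens / bytes_per_param

print(flops_per_byte(2048))  # prefill, 2048-token prompt: ~2048 FLOPs/byte
print(flops_per_byte(1))     # decode, one token per step:  ~1 FLOP/byte
```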
Yes, decoding is very I/O heavy: it has to stream all of the model weights in from HBM for every token decoded. However, that cost can be shared between the requests in the same batch, so if the system has more GPU RAM to hold larger batches, the I/O cost per request goes down.
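Putting rough, assumed numbers on that (a 70B fp16 model and a bandwidth figure picked purely for illustration):

```python
# Made-up but plausible numbers to show the amortization.
weights_gb = 140     # e.g. a 70B-parameter model in fp16
hbm_gb_per_s = 3000  # assumed aggregate HBM bandwidth

# Each decode step streams all the weights from HBM once, no matter how
# many requests are batched, so bigger batches cut the per-token cost.
for batch_size in (1, 8, 32):
    gb_per_token = weights_gb / batch_size
    tokens_per_s = hbm_gb_per_s / weights_gb * batch_size
    print(f"batch={batch_size:>2}: {gb_per_token:6.1f} GB streamed per token, "
          f"~{tokens_per_s:.0f} tokens/s total")
```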