Can you share more about what adaptations you made when using smaller models?
I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it.
You'd do most of the planning/cognition yourself, down to the module/method signature level, and then have it loop through the plan to "fill in the code". Need a strong testing harness to loop effectively.
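As a sketch, that fill-in-the-code loop could look like the harness below. Everything here is illustrative: `fill_in` and the generator are hypothetical names, and the toy generator stands in for an actual call to a local model.

```python
# Sketch of a "fill in the code" loop: the human supplies the plan as
# function signatures plus tests; the model (a placeholder here) supplies
# bodies, and the harness retries until the test suite passes.

def fill_in(signature, tests, generate_body, max_attempts=5):
    """Ask the generator for a body until all (args, expected) tests pass."""
    namespace = {}
    for attempt in range(max_attempts):
        source = signature + "\n" + generate_body(signature, attempt)
        try:
            exec(source, namespace)                      # compile the candidate
            fn = namespace[signature.split("(")[0].split()[-1]]
            if all(fn(*args) == want for args, want in tests):
                return source                            # candidate passes: done
        except Exception:
            pass                                         # bad candidate: retry
    raise RuntimeError("no passing candidate within budget")

# Toy generator standing in for the model: cycles through two candidates.
def toy_generator(signature, attempt):
    candidates = ["    return a - b", "    return a + b"]
    return candidates[attempt % len(candidates)]

tests = [((1, 2), 3), ((5, 7), 12)]
code = fill_in("def add(a, b):", tests, toy_generator)   # second candidate wins
```

The point is the shape of the loop, not the toy generator: without a strong test suite to gate each candidate, the loop has no stopping criterion.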
General claims about a model are rarely useful; only very specific claims are, i.e. claims that state the exact parameter count and quantization method of each of the compared models.
If you perform the inference locally, there is a huge space of compromise between the inference speed and the quality of the results.
Most open weights models are available in a variety of sizes. Thus you can choose anywhere from very small models with a little more than 1B parameters to very big models with over 750B parameters.
For a given model, you can evaluate it at its native precision, normally BF16, or at any of a great variety of smaller quantized precisions, in order to fit the model in less memory or simply to reduce the time spent accessing that memory.
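A quick back-of-the-envelope sketch of that trade-off. This counts weights only; the KV cache and activations add more on top, so treat the numbers as approximate lower bounds.

```python
# Approximate weight memory for a given parameter count and quantization.

def weight_gib(params_billion, bits_per_weight):
    """Weights-only memory footprint in GiB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# A 7B model at native BF16 (16 bits/weight) vs a 4-bit quantization:
bf16 = weight_gib(7, 16)   # ~13.0 GiB: too big for an 8 GiB consumer GPU
q4   = weight_gib(7, 4)    # ~3.3 GiB: fits, with room left for the KV cache
```

This is why quantization is usually the first knob to turn on consumer hardware: it is the difference between a model fitting in VRAM or spilling to system RAM.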
Therefore, if you choose big models without quantization, you may obtain results very close to those of SOTA proprietary models.
If you choose models so small and so quantized as to run in the memory of a consumer GPU, then it is normal to get results much worse than with a SOTA model that is run on datacenter hardware.
Choosing to run models that do not fit inside the GPU memory reduces the inference speed a lot, and choosing models that do not fit even inside the CPU memory reduces the inference speed even more.
Nevertheless, slow inference that produces better results may reduce the overall time for completing a project, so one should do a lot of experiments to determine an appropriate compromise.
When you use your own hardware, you do not have to worry about token cost or subscription limits, which may change the optimal strategy for using a coding assistant. Moreover, it is likely that in many cases it may be worthwhile to use multiple open-weights models for the same task, in order to choose the best solution.
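A minimal sketch of that multi-model strategy, assuming you have some way to query each local model (the `ask_model` callable is hypothetical; wire it to whatever runtime you use, e.g. a llama.cpp server or Ollama):

```python
# Run the same review task through several local models and merge findings.

def merge_findings(task, models, ask_model):
    """Collect bug reports from each model and deduplicate them."""
    findings = {}
    for name in models:
        for bug in ask_model(name, task):
            # remember the first model that reported each distinct bug
            findings.setdefault(bug, name)
    return findings

# Fake backend for illustration: each model has different blind spots.
fake = {
    "model-a": ["off-by-one in loop", "unchecked malloc"],
    "model-b": ["unchecked malloc", "race on shared counter"],
}
bugs = merge_findings("review foo.c", fake, lambda model, task: fake[model])
```

Since local inference has no per-token bill, the only cost of running every model you have is time.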
For example, when comparing older open-weights models with Mythos, with appropriate prompts all the bugs that Mythos could find could also be found by the old models. The difference was that Mythos found all the bugs on its own, while with the free models you had to run several of them to find all the bugs, because each model had different strengths and weaknesses.
(In other HN threads there have been some bogus claims that Mythos was somehow much smarter, but that does not appear to be true. The other company has provided the precise prompts used for finding the bugs, and it would not have been too difficult to generate them automatically with a harness. Anthropic has also admitted that the bugs found by Mythos had not been found with a prompt like "find the bugs", but by running Mythos many times on each file with increasingly specific prompts, until the final run requested only a confirmation of the bug, not a search for it. So in reality the difference between SOTA models like Mythos and the open-weights models does exist, but it is far smaller than Anthropic claims.)
> Anthropic has also admitted that the bugs found by Mythos had not been found with a prompt like "find the bugs", but by running Mythos many times on each file with increasingly specific prompts, until the final run requested only a confirmation of the bug, not a search for it.
- provide a container with running software and its source code
- prompt Mythos to prioritize source files based on the likelihood they contain vulnerabilities
- use this prioritization to prompt parallel agents to look for and verify vulnerabilities, focusing on but not limited to a single seed file
- as a final validation step, have another instance evaluate the validity and interestingness of the resulting bug reports
This amounts to at most three invocations of the model per file: once for prioritization, once for the main vulnerability run, and once for the final check. The prompts only became more specific as a result of information the model itself produced, not because an external process injected additional information.
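The three passes listed above can be sketched as a small harness. The `run` callable is a hypothetical stand-in for a real model invocation; the fake backend below only exists so the control flow is concrete.

```python
# Three-pass audit: prioritize files, scan each, then validate the reports.

def audit(files, run):
    """Return (file, bug) pairs that survive the confirmation pass."""
    ranked = run("prioritize files by likely vulnerabilities", files)  # pass 1
    findings = []
    for f in ranked:
        for bug in run("look for and verify vulnerabilities", [f]):   # pass 2
            if run("confirm this report is valid and interesting", [bug]):  # pass 3
                findings.append((f, bug))
    return findings

# Fake backend for illustration (a real one would call the model):
def fake_run(prompt, payload):
    if prompt.startswith("prioritize"):
        return sorted(payload)              # pretend alphabetical = risk order
    if prompt.startswith("look"):
        return ["bug in " + payload[0]] if payload[0] == "auth.c" else []
    return True                             # confirmation pass always agrees

findings = audit(["main.c", "auth.c"], fake_run)
```

Note that the later, more specific prompts take their specificity entirely from earlier model output, which is the point being made above.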
Before you start to feel smug about this and think you're above it:
I've read three blog posts in as many days where their authors quietly reflect that Claude is so good that it has effectively hijacked their own decision making processes when they weigh the value of starting a project.
Do they embark on it, or hand it over to Claude, even if that process is mind-numbing and they learn nothing?
> Just a few years ago, AI essentially could not program at all. In the future, a given AI instance may “program better” than any single human in history. But for now, real programmers will always win.
For how long? Do I get to feel smug about this for 10 days, 10 weeks, or 10 years? That radically changes the planned trajectory of my life.
These posts are just programmers trying to understand their new place in the hierarchy. I'm in the same place and get it, but truisms like "will always win" are basically just wild guesses at what the future will look like. A better attitude is to attempt to catch the wave.
TFA's author is literally saying it may happen. He's using AI so he already caught the wave. He's augmenting himself with AI tools. He's not saying "AI will never surpass humans at writing programs". He writes:
" At this particular moment, human developers are especially valuable, because of the transitional period we’re living through."
You and GP are both attacking him on a strawman: it's not clear why.
We're seeing countless examples of AI slop, enshittification, and lower uptime for services, day after day.
To anyone using these tools seriously on a daily basis it's totally obvious there are, TODAY, shortcomings.
TFA doesn't talk about tomorrow. It talks about today.
Doctorow recently spoke (remotely) at FediMTL in Montreal. He laid out the stakes as usual, but added that in Canada, we are legally not allowed to reverse engineer competitor APIs, and we gave up this right as part of some free trade negotiation.
> Considering they trained their model on open-source software, the least they could do is give it to open-source maintainers for free with no time limit.
Why? The resulting code generated by Claude is unfit for training, so any work product produced after the start of the subsidized program should be ignored.
Therefore it makes sense to charge them for the service after 6 months, no? Heh.
What do you mean it's unfit for training? It's a form of reinforcement learning; the end result has been selected based on whether it actually solved the need.
You need to be careful of the amount of reinforcement learning vs continued pretraining you do, but they already do plenty of other forms of reinforcement learning, I'm sure they have it dialed in.