gpt-5-mini and gpt-5.5 (had to tweak the code a bit to make it work)
Surprisingly not as big of a difference as one would hope. It turns out that smarter models are more conservative. Smarter model / More thinking = slightly worse recall sometimes.
I think it says more about the benchmark itself perhaps. Reviews are highly opinionated. And it could be that the smarter models are actually better, just the “golden” state is very opinionated.
That must have been expensive! Thanks for running the benchmark and sharing.
I tested Coducky (my AI review macOS app) on the full 50-PR Martian benchmark, using qwen3.7-plus via OpenRouter as the reviewer and deepseek-v4-flash as a lightweight pre-save precision gate. The score (gpt 5.2 judge) was 43.0% precision / 35.8% recall / 39.0 F1. That puts it about in line with CodeRabbit. Running the full 50 PRs cost around $7.
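For anyone who wants to sanity-check numbers like these: F1 is the harmonic mean of precision and recall. Below is a minimal sketch that plugs in the rounded percentages quoted above. The function name `f1_score` is just illustrative and not taken from the benchmark's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        # Avoid division by zero when nothing was predicted and nothing matched.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the comment above (43.0% precision, 35.8% recall).
precision = 0.430
recall = 0.358

f1 = f1_score(precision, recall)
print(f"F1 = {f1 * 100:.1f}")
```

With the rounded inputs this comes out at roughly 39.1, so the reported 39.0 was presumably computed from unrounded precision and recall.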
Your post inspired me to set up a test harness for my app to continue to test model combinations. Coducky allows you to select whichever models/subscriptions you like to run reviews, but it could make sense to build a collection of model combinations that work well for this.
Qwen3.6 runs on a single GPU and beats Claude's Sonnet, both in benchmarks and in real-world tests by humans. Kimi is awesome, but most people won't be able to host it themselves.
A lot of people are slowly realizing that, as of the last few weeks, the moat of 1T closed-source models is gone. It's going to change the industry. April was a huge month for open models; it'll be interesting to see if that continues.
This Mistral submission is another nail in the coffin.
Yes. The constant full-screen color flashing made Windows 8 not just unpleasant but unusable for me: I literally got migraines after using it for too long.
Click on a PDF? The whole screen turns bright red for a second before loading. Click on a Word file? Same, but blue. It was hell to use for people sensitive to flashing lights.
I got special permission at work to stick with Windows 7 longer than the rest of the company for medical reasons.
I think another ace up Django's sleeve is that it has had a remarkably stable API for a long time, with very few breaking changes, so almost all blog posts about Django that the LLM has gobbled up will still be mostly correct whether they are a year or a decade old.
I get remarkably good and correct LLM output for Django projects compared to what I get in projects that use faster-moving frameworks whose APIs break frequently.
Companies have been doing that for a long time now. You're not a person, you're an ID number in the payroll database that needs to produce more profit than you cost the company.
When you emigrate to another country, the system views you, as a non-citizen, in much the same way.
Well, although this particular quote is from a book from 2010, Terry Pratchett had been writing about these kinds of problems for decades at that point too, so that fits. For example, in Reaper Man from 1991 shopping malls were invasive parasitical life-forms from another universe.
Even in your own country you’re viewed the same way, it’s just that your database entry has higher privileges. I mean this universally, no particular country in mind.
I relate completely. I usually solve it by sticking a post-it note on the screen so it covers my face in the call. It makes the calls feel a lot more natural and less awkward.
Probably, but I fail to see how that's relevant here.
This is not a "dataclass heavy" library in any sense; they just used dataclasses in the examples to make them shorter.
Based on everything I see in the documentation, you should be able to use Pydantic models as well, or standard Python objects, or anything else, as long as it has a method `def htmy(self, context: Context) -> Component`.
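To illustrate the point, here is a rough sketch of the duck typing described above, written without importing the library at all. `Context`, `Component`, `HasHTMY`, and `render` are hypothetical stand-ins I made up for the example, not the library's real types or API. Only the `htmy(self, context)` method shape comes from the signature quoted above.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Protocol, runtime_checkable

# Hypothetical stand-ins for illustration only, NOT the library's real types.
Context = Mapping[Any, Any]
Component = Any


@runtime_checkable
class HasHTMY(Protocol):
    """Anything exposing htmy(context) satisfies this structural type."""

    def htmy(self, context: Context) -> Component: ...


@dataclass
class DataclassGreeting:
    name: str

    def htmy(self, context: Context) -> Component:
        return f"<p>Hello, {self.name}!</p>"


class PlainGreeting:
    """A plain Python class, no dataclass involved."""

    def __init__(self, name: str) -> None:
        self.name = name

    def htmy(self, context: Context) -> Component:
        return f"<p>Hello, {self.name}!</p>"


def render(component: HasHTMY, context: Context) -> Component:
    # Toy renderer (an assumption): it only calls the method and returns the result.
    return component.htmy(context)


outputs = [render(c, {}) for c in (DataclassGreeting("Ada"), PlainGreeting("Ada"))]
print(outputs)
```

Both classes go through the same `render` call because only the method matters, not how the class was declared. A Pydantic model with the same method would presumably fit the same way.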