Hacker News | tirpen's comments

Which LLM did you use? I assume that will make a pretty big difference.


gpt-5-mini and gpt-5.5 (had to tweak the code a bit to make it work)

Surprisingly not as big of a difference as one would hope. It turns out that smarter models are more conservative. Smarter model / More thinking = slightly worse recall sometimes.

I think it says more about the benchmark itself perhaps. Reviews are highly opinionated. And it could be that the smarter models are actually better, just the “golden” state is very opinionated.


That must have been expensive! Thanks for running the benchmark and sharing.

I tested Coducky (my AI review macOS app) on the full 50-PR Martian benchmark, using qwen3.7-plus via OpenRouter as the reviewer and deepseek-v4-flash as a lightweight pre-save precision gate. The score (gpt 5.2 judge) was 43.0% precision / 35.8% recall / 39.0 F1. That puts it about in line with CodeRabbit. Running the full 50 PRs cost around $7.
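For anyone who wants to sanity-check numbers like these, F1 is just the harmonic mean of precision and recall. Here is a minimal sketch using the rounded percentages quoted above; the function name is my own and this is not the benchmark's scoring code. From the rounded inputs it comes out at roughly 39.1, a tenth off the reported 39.0, which I assume is because the reported figure was computed from unrounded counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when nothing was found
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted in the comment; the underlying counts are not given.
precision = 0.430
recall = 0.358

f1 = f1_score(precision, recall)
print(f"F1 = {f1 * 100:.1f}")
```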

Your post inspired me to set up a test harness for my app to continue to test model combinations. Coducky allows you to select whichever models/subscriptions you like to run reviews, but it could make sense to build a collection of model combinations that work well for this.


This is propaganda, not data.

If the Chinese government published a graph that said the opposite, would you consider that a serious and objective source?


If the methodology in the accompanying write-up did look credible, yes. Though I have significantly more trust in US agencies, like NIST in this case.


> China is not competing, it is distilling US models

China are cheating by using data obtained without permission to train their models in an evil commie way!

They should have done what the US did instead and trained models on data obtained without permission in a fair and freedum way!

> Where are the Chinese models that are blowing US ones out of the water?

Kimi2 blows every US model out of the water in any comparison that includes both costs and performance.


Qwen3.6 runs on a single GPU and beats Claude Sonnet, both in benchmarks and in real-world tests by humans. Kimi is awesome, but most people won't be able to host it themselves.

A lot of people are slowly realizing that the moat of 1T closed source models is gone as of the last few weeks. It's going to change the industry. April was a huge month for open models, and I'm curious to see if that continues.

This Mistral submission is another nail in the coffin.


I run Qwen 3.6. You need to drink some settle-down juice.


No way it's awesome.


> beats Claude Sonnet

Based on benchmarks which don’t mean that much these days.

> models is gone as of the last few weeks.

Yes, that’s exactly what people were saying after every major release for the past year or so. It’s always a couple of weeks away.


That's not nearly as big a problem in Germany as the US.

The median age of Bundestag members is 45.4.

https://data.ipu.org/parliament/DE/DE-LC01/

In the US Senate it's 63.9.

https://data.ipu.org/parliament/US/US-UC01/


Yes. The constant full screen color flashing made Windows 8 not just unpleasant but unusable for me: I literally got migraines after using it for too long.

Click on a pdf? The whole screen turns bright red for a second before loading. Click on a Word file, same but blue. It was hell to use for people sensitive to flashing lights.

I got special permission at work to stick with Windows 7 longer than the rest of the company for medical reasons.


I think another ace up Django's sleeve is that it has had a remarkably stable API for a long time with very few breaking changes, so almost all blog posts about Django that the LLM has gobbled up will still be mostly correct whether they are a year or a decade old.

I get remarkably good and correct LLM output for Django projects compared to what I get in projects with faster-moving frameworks that break their APIs more often.


The "one way" / "batteries included" aspect of Django may also make it easier for LLMs.


"Evil begins when you begin to treat people as things."

-Terry Pratchett


"If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."

— John Brunner, "The Shockwave Rider"


Companies have been doing that for a long time now. You're not a person, you're an ID number in the payroll database, that needs to produce more profit than you cost the company.

When you emigrate to another country, you're viewed much the same way as a non-citizen by the system.


Well, although this particular quote is from a book from 2010, Terry Pratchett had been writing about these kinds of problems for decades at that point too, so that fits. For example, in Reaper Man from 1991 shopping malls were invasive parasitical life-forms from another universe.


Even in your own country you’re viewed the same way, it’s just that your database entry has higher privileges. I mean this universally, no particular country in mind.


I relate completely. I usually solve it by sticking a post-it note on the screen so it covers my face in the call. It makes the calls feel a lot more natural and less awkward.


"Running a marathon is hard work, but a car will do it for you."

Sure, but then what's the point?


Probably, but I fail to see how that's relevant here. This is not a "dataclass heavy" library in any sense, they just used dataclass in the examples to make them shorter.

Based on everything I see in the documentation, you should be able to use Pydantic models as well, or standard Python objects, or anything else, as long as it has a method `def htmy(self, context: Context) -> Component`.
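To illustrate what I mean, here is a rough sketch of that kind of duck typing. It does not import the library: `Context` and `Component` are stand-in aliases I made up, `HasHtmy` and `render` are hypothetical helpers, and returning a plain string as the component is an assumption on my part. The only thing taken from the docs is the method signature.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Protocol, runtime_checkable

# Stand-in aliases for illustration only; the real library defines its own
# Context and Component types.
Context = Mapping[Any, Any]
Component = Any


@runtime_checkable
class HasHtmy(Protocol):
    """Anything exposing an htmy(context) method (hypothetical helper)."""

    def htmy(self, context: Context) -> Component: ...


@dataclass
class Greeting:
    """A dataclass, as in the library's examples."""

    name: str

    def htmy(self, context: Context) -> Component:
        return f"<p>Hello, {self.name}!</p>"


class PlainGreeting:
    """A plain Python class with the same method, no dataclass involved."""

    def __init__(self, name: str) -> None:
        self.name = name

    def htmy(self, context: Context) -> Component:
        return f"<p>Hello, {self.name}!</p>"


def render(component: HasHtmy, context: Context) -> Component:
    # Hypothetical stand-in for a renderer: it only relies on the method.
    return component.htmy(context)


for obj in (Greeting("Ada"), PlainGreeting("Ada")):
    assert isinstance(obj, HasHtmy)
    print(render(obj, {}))
```

Both objects are accepted because only the presence of the method matters, not how the class was defined.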

