Hacker News | zylepe's comments

Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.

If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.

You can’t benchmaxx an eval that comes after your model release.

Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models is directly correlated with how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.

Remember the famous case of asserted benchmaxxing from Llama 4? The entire org was gutted and the CEO spent billions hiring better people. Every lab takes evaluations extremely seriously.


> You can’t benchmaxx an eval that comes after your model release

Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.


> Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which no lab is doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on, say, Humanity's Last Exam, to take one benchmark as an example? You could trivially get those numbers close to 100% if you wanted to.


Yeah, nobody's ever silently changed a model while it was deployed. That would be illegal!

What does this have to do with what I’m saying? Of course the models are updated. I’m saying a new benchmark isn’t public and the model wouldn’t know it is being evaluated on a new benchmark.

Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.


Models are actually pretty good at figuring out when they are being tested:

"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."

* https://www.anthropic.com/engineering/eval-awareness-browsec...

"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"

* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...

"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"

* https://www.edtechinnovationhub.com/news/anthropic-says-clau...


Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.

Um kind of but not really, it’s a mix of UX and actual measurements of what tasks it can do. Also UX is virtually the same thing: scaled quantitative surveys and preference metrics. It’s again, just benchmarking, and it’s done carefully and with best practices.

Imagine unironically starting your comment with "Um" in 2026.

As opposed to your incredibly useful contribution to this thread, thanks!

You don't have to imagine!

I've been testing some models that score higher than Opus 4.6.

They:

- hallucinate constantly

- can't follow basic instructions

- think they're Claude for some reason ;)


Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.


> students are evaluated by teachers with more knowledge and experience than them

This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.


> This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration)

I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)


It certainly is true in physics and engineering that a PhD student at least halfway through their PhD should know more than their supervisor about their topic (and usually much earlier). Even a Masters thesis project student should understand the intricacies of their project better than their supervisor. I'm speaking as someone who has supervised a significant number of both PhD and Masters students.

The original post said “in college”. It might be true for PhD candidates halfway through their program, but that’s like 0.5% of college students. The vast majority of students are leagues behind their instructors in domain knowledge.

I wouldn't say leagues behind, but otherwise I think we are on the same page, though I guess I worded it wrong. It is common for a couple students in any class to know more than the instructor in some niche part of the field even though the instructor has much more knowledge overall.

Yes, I intentionally left out the next part of the quote about graduate school, since that seems more accurate. I was disputing only the part that I took to be pertaining to undergraduate education. The full quote is:

> This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.


Ah apologies, that's what I get for skim reading and kneejerk replying. I completely agree with you, undergrads are highly unlikely to know more about a subject than their professor (obviously there can always be exceptions).

> Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??


> How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

That is what benchmarks and intelligence tests are, and they are vulnerable to benchmaxxing etc. You won't be able to do this by gut feel, though you can create a personal benchmark.

But the point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much, but it is more vulnerable.


Yet human judgement isn’t subject to side effects like fluency and persuasiveness? It’s like everyone in this thread dismisses benchmarks and then…describes a crappy benchmark.

Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.

Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?
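The questions above (who judges, how success is scored, whether the judge is blind, which model goes first) can be pinned down in a few lines. This is only a minimal sketch under stated assumptions: `blind_compare`, the `models` callables, and the `judge` function are all hypothetical stand-ins, not any lab's or leaderboard's actual harness.

```python
import random

def blind_compare(tasks, models, judge, seed=0):
    """Score every model on every task without the judge seeing model names.

    tasks:  list of prompt strings
    models: dict of name -> callable(prompt) -> output (hypothetical stand-ins)
    judge:  callable(prompt, output) -> bool, never given the model name
    """
    rng = random.Random(seed)
    passed = {name: 0 for name in models}
    for prompt in tasks:
        outputs = [(name, run(prompt)) for name, run in models.items()]
        # Shuffle so the judging order carries no hint about which model is which.
        rng.shuffle(outputs)
        for name, output in outputs:
            if judge(prompt, output):  # the judge sees only prompt and output
                passed[name] += 1
    # Report the fraction of tasks each model passed.
    return {name: passed[name] / len(tasks) for name in models}

# Toy usage: "prompts" are sums, one model adds correctly, the other always says 4.
def add_model(prompt):
    return str(sum(int(x) for x in prompt.split("+")))

def constant_model(prompt):
    return "4"

def exact_judge(prompt, output):
    return int(output) == sum(int(x) for x in prompt.split("+"))

scores = blind_compare(
    ["1+1", "2+2", "3+5", "10+7"],
    {"model_a": add_model, "model_b": constant_model},
    exact_judge,
)
```

Even a toy like this forces the choices a "personal benchmark" has to make: a fixed task list, an explicit pass criterion, and a judge that cannot see the model name.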


ya gotta have a vibe for everything if you want to compare vibes, though. you can't just have a vibe for fable 5 alone AND say that it's better than anything out there. there's no weight in that verdict at all, no meaning. it's like reviewing a book without reading it.

throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.


You are literally describing a benchmark

Cool! I definitely felt the pain of current options when I added parquet support to Planetiler to process overture data. I ended up using parquet-floor to trim the dependencies but it’s a bit of a hacky approach. If there’s a way to use the lower level utilities from my own threads without hardwood spawning its own then I’ll have to give it a shot.


Planetiler currently supports generating MLT by adding the --tile-format=mlt CLI argument. It’s only on latest main right now but I should be able to get a release out in the next few days. In my testing I’ve seen ~10% reduction in overall OpenMapTiles archive size with default settings but there are some more optimizations the team is working on that should bring it down even further.


Pole of Inaccessibility is also a useful technique for placing the label for a polygon at a visually pleasing location on a map. @mourner came up with a more efficient algorithm for computing the point https://blog.mapbox.com/a-new-algorithm-for-finding-a-visual... (https://github.com/mapbox/polylabel) which JTS's MaximumInscribedCircle utility is based on, which I use for "innermost point" label placement in planetiler.
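For readers who want to see the idea in code, here is a rough Python sketch of a quadtree plus priority-queue search for the point farthest from a polygon's boundary, in the spirit of the linked post. It is an illustration under assumptions, not the code from mapbox/polylabel or JTS: it handles a single ring with no holes, seeds the search with the bounding-box center only, and the names `signed_dist` and `pole_of_inaccessibility` are made up for this example.

```python
import heapq
import math

def _seg_dist_sq(px, py, ax, ay, bx, by):
    """Squared distance from point (px, py) to the segment a-b."""
    dx, dy = bx - ax, by - ay
    if dx or dy:
        t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
        t = max(0.0, min(1.0, t))
        ax, ay = ax + t * dx, ay + t * dy
    return (px - ax) ** 2 + (py - ay) ** 2

def signed_dist(x, y, ring):
    """Distance to the ring boundary: positive inside, negative outside."""
    inside = False
    best = math.inf
    for i in range(len(ring)):
        ax, ay = ring[i]
        bx, by = ring[i - 1]  # i - 1 wraps around, closing the ring
        # Ray-casting parity test for point-in-polygon.
        if (ay > y) != (by > y) and x < (bx - ax) * (y - ay) / (by - ay) + ax:
            inside = not inside
        best = min(best, _seg_dist_sq(x, y, ax, ay, bx, by))
    d = math.sqrt(best)
    return d if inside else -d

def pole_of_inaccessibility(ring, precision=1.0):
    """Return (x, y, distance) for the interior point farthest from the boundary."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    min_x, min_y, max_x, max_y = min(xs), min(ys), max(xs), max(ys)
    cell = min(max_x - min_x, max_y - min_y)
    if cell == 0:
        return min_x, min_y, 0.0
    queue = []  # max-heap on the upper bound, stored negated

    def push(cx, cy, half):
        d = signed_dist(cx, cy, ring)
        # No point in this cell can be farther from the boundary than this bound.
        heapq.heappush(queue, (-(d + half * math.sqrt(2)), d, cx, cy, half))

    # Cover the bounding box with square cells.
    x = min_x
    while x < max_x:
        y = min_y
        while y < max_y:
            push(x + cell / 2, y + cell / 2, cell / 2)
            y += cell
        x += cell

    best_x, best_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    best_d = signed_dist(best_x, best_y, ring)
    while queue:
        neg_bound, d, cx, cy, half = heapq.heappop(queue)
        if d > best_d:
            best_x, best_y, best_d = cx, cy, d
        # Skip cells that cannot beat the current best by more than `precision`.
        if -neg_bound - best_d <= precision:
            continue
        half /= 2
        for ox in (-half, half):
            for oy in (-half, half):
                push(cx + ox, cy + oy, half)
    return best_x, best_y, best_d
```

For an L-shaped ring such as `[(0, 0), (10, 0), (10, 2), (2, 2), (2, 10), (0, 10)]` the bounding-box center falls outside the polygon, while the point found this way stays inside, which is what makes it useful for label placement.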


I haven’t used markdown in javadoc yet but this seems like at least 3/10? I often want to put paragraphs or bulleted lists in javadoc and find myself wanting to use markdown syntax for readability in the code but need to switch to less readable html tags for tooling to render it properly.


Personally it’s fine, I haven’t used it though.

I really wish Javadoc was just plain text that honored line breaks. I really don’t care about the fact I can put HTML in there, that just seems dumb to me. I get you can’t remove it but I would be happy if you could.

I do like markdown. But I don’t see myself ever using it in a Javadoc.


I hate using html in comments.

Markdown in javadoc is at least 7/10 for me. Improves comment readability for humans while allowing formatted javadocs.


I wish C# would add this too. I despise having to write docs in XML style


A friend of mine is on a 25 year running every day streak. He flew to Australia and landed 2 days after taking off and said that day “never existed for him” ¯\_(ツ)_/¯


I like your friend's rule. If a guy literally runs every day, through stress fractures, flu, winter snow, thunderstorms, etc, I'm going to cut him a break on an international flight and continue to celebrate his streak with him.


I’ve stopped at that Lee McDonald’s and remember thinking the price seemed kind of high. Little did I know it was the most expensive in the country. Even the eastbound one across I-90 is $1.40 cheaper!


I’m looking forward to being able to memory-map an entire large file without having to split it up into 2 GB segments, and to being able to reliably unmap it when done. So many hacks to work around this lack of functionality today…


Me too! It's funny: having done it a few times, I got used to it, but I would love to get rid of it.


I came up with an approach for placing labels where each label gets a min zoom at which it first shows up, chosen so that there are no collisions, and it always appears at every zoom beyond that. Personally I like the feel of it a lot better, but the downside is that at a given zoom labels are less dense than if they could show/hide to fill in empty spaces. Also, it makes it impossible to dynamically style the labels, or rotate/tilt.

Demo: https://msbarry.github.io/maplibre-minzoom-demo/world
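One way such a min-zoom assignment could work is sketched below in Python. This is a guess at the general shape, not the code behind the demo: it assumes equal-sized axis-aligned label boxes, points in [0, 1] world coordinates, a world that is 256 px wide at zoom 0, integer zooms, and labels supplied in priority order. The names `separation_zoom` and `assign_min_zooms` are hypothetical.

```python
TILE_PX = 256  # assumed pixel width of the whole world at zoom 0

def separation_zoom(a, b, label_w, label_h, max_zoom):
    """Smallest integer zoom at which two equal-sized label boxes stop overlapping."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    for z in range(max_zoom + 1):
        scale = TILE_PX * 2 ** z
        # Boxes are disjoint once their centers are a full box apart on either axis.
        if dx * scale >= label_w or dy * scale >= label_h:
            return z
    return max_zoom + 1  # never separates within the zoom range

def assign_min_zooms(labels, label_w=100, label_h=20, max_zoom=14):
    """labels: list of (name, x, y) in priority order, x and y in [0, 1].

    Returns {name: min_zoom}, with None for labels that never fit.
    """
    placed = []  # (x, y, min_zoom) of labels already assigned
    result = {}
    for name, x, y in labels:
        z = 0
        for px, py, pz in placed:
            sep = separation_zoom((x, y), (px, py), label_w, label_h, max_zoom)
            # The earlier label is visible before the pair separates,
            # so this label has to wait until the separation zoom.
            if pz < sep:
                z = max(z, sep)
        if z <= max_zoom:
            placed.append((x, y, z))
            result[name] = z
        else:
            result[name] = None
    return result

min_zooms = assign_min_zooms([
    ("A", 0.5, 0.5),
    ("B", 0.6, 0.5),   # close to A horizontally, has to wait
    ("C", 0.5, 0.9),   # far from both, can show immediately
    ("D", 0.5, 0.5),   # same spot as A, never fits
])
```

Because screen distances double with every zoom level, two labels that are clear of each other at one zoom stay clear at every higher zoom, so a label that has appeared never needs to be hidden again.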


The protomaps basemap is also built on planetiler: https://github.com/protomaps/basemaps - and Brandon is one of the main contributors to planetiler!


planetiler is such a sophisticated piece of software. Hats off to Michael!

Therefore I'm a bit proud that a tiny part of the fast and memory-efficient import pipeline is using code from GraphHopper :) (see Acknowledgement)

