More

paulus_magnus2 · 2026-02-16T14:40:15 1771252815

-- OK. Added location context for the vehicle

grok works, chatgpt still fails

[1] https://chatgpt.com/share/69932b20-3eb8-8003-9d9c-b4bba53033... [2] https://grok.com/share/bGVnYWN5LWNvcHk_f32dd53d-7b36-4fa2-b3...

swat535 · 2026-02-16T14:48:12 1771253292

Grok gets a lot of hate because of Musk, but it's a decent model.

I use it daily with my X account for basic tasks and think the free limits are generous. With X premium, you can get even more out of it.

Nothing beats Anthropic when it comes to coding however.

paulus_magnus2 · 2026-02-16T09:30:34 1771234234

I see things were fixed already [2][4] but luckily a friend showed me this issue yesterday [1][2]

[1] 2026-02-15 https://chatgpt.com/share/6992e17b-9b28-8003-9da9-38533f257d...

[2] 2026-02-16 https://chatgpt.com/share/6992e135-c610-8003-9272-55058134d4...

[3] 2026-02-15 https://grok.com/share/bGVnYWN5LWNvcHk_97e9717b-c2de-47e8-a4...

[4] 2026-02-16 https://grok.com/share/bGVnYWN5LWNvcHk_b161bb03-4bed-4785-98...

We tried a few things yesterday and it was always telling you to walk. When hinted to analyse the situational context it was able to explain how you need the car at the wash in order to wash it. But then something was not computing.

~ Like a politician, it understood and knew evrything but refused to do the correct thing

jtbayly · 2026-02-16T16:27:03 1771259223

Help me out here. Are the models learning, or being “taught more”/fixed on a daily basis? If so, isn’t the model itself changing daily?

paulus_magnus2 · 2026-01-16T17:04:21 1768583061

The blog[0] is worded rather conservatively but on Twitter [2] the claim is pretty obvious and the hype effect is achieved [2]

CEO stated "We built a browser with GPT-5.2 in Cursor"

instead of

"by dividing agents into planners and workers we managed to get them busy for weeks creating thousands of commits to the main branch, resolving merge conflicts along the way. The repo is 1M+ lines of code but the code does not work (yet)"

[0] https://cursor.com/blog/scaling-agents

[1] https://x.com/kimmonismus/status/2011776630440558799

[2] https://x.com/mntruell/status/2011562190286045552

[3]https://www.reddit.com/r/singularity/comments/1qd541a/ceo_of...

deng · 2026-01-16T17:10:33 1768583433

Even then, "resolving merge conflicts along the way" doesn't mean anything, as there are two trivial merge strategies that are always guaranteed to work ('ours' and 'theirs').

paulus_magnus2 · 2026-01-16T17:24:08 1768584248

Haha. True, CI success was not part of PR accept criteria at any point.

If you view the PRs, they bundle multiple fixes together, at least according to the commit messages. The next hurdle will be to guardrail agents so that they only implement one task and don't cheat by modifying the CI piepeline

formerly_proven · 2026-01-16T17:31:24 1768584684

If I had a nickel for every time I've seen a human dev disable/xfail/remove a failing test "because it's wrong" and then proceeding to break production I would have several nickels, which is not much, but does suggest that deleting failing tests, like many behaviors, is not LLM-specific.

vizzier · 2026-01-16T18:58:05 1768589885

> but does suggest that deleting failing tests, like many behaviors, is not LLM-specific.

True, but it is shocking how often claude suggests just disabling or removing tests.

ewoodrich · 2026-01-17T01:53:11 1768614791

The sneaky move that I hate most is when Claude (and does seem to mostly be a Claude-ism I haven’t encountered on GPT Codex or GLM) is when dealing with an external data source (API, locally polling hardware, etc) as a “helpful” fallback on failures it returns fake data in the shape of the expected output so that the rest of the code “works”.

Latest example is when I recently vibe coded a little Python MQTT client for a UPS connected to a spare Raspberry Pi to use with Home Assistant, and with a just few turns back and forth I got this extremely cool bespoke tool and felt really fun.

So I spent a while customizing how the data displayed on my Home Assistant dashboard and noticed every single data point was unchanging. It took a while to realize because the available data points wouldn’t be expected to change a whole lot on a fully charged UPS but the voltage and current staying at the exact same value to a decimal place for three hours raised my suspicions.

After reading the code I discovered it had just used one of the sample command line outputs from the UPS tool I gave it to write the CLI parsing logic. When an exception occurred in the parser function it instead returned the sample data so the MQTT portion of the script could still “work”.

Tbf Claude did eventually get it over the finish line once I clarified that yes, using real data from the actual UPS was in fact an important requirement for me in a real time UPS monitoring dashboard…

teiferer · 2026-01-17T09:25:10 1768641910

Always check the code.

It's similar to early versions of autonomous driving. You's not want to sit in the back seat with nobody at the wheel. That would get you killed guaranteed.

DonHopkins · 2026-01-17T12:25:39 1768652739

And how is that not good for humanity in an evolutionary sense (as long as it doesn't kill or maim anyone else)?

Tesla owner keeps using Autopilot from backseat—even after being arrested:

https://mashable.com/article/tesla-autopilot-arrest-driving-...

duskdozer · 2026-01-17T16:17:51 1768666671

Sounds to me like more evidence in favor of the idea that they're meant to play the golden retriever engineer reporting to you, the extremely intelligent manager.

icedchai · 2026-01-17T00:01:11 1768608071

A coworker opened a PR full of AI slop. One of the first things I do is check if the tests pass. Of course, the didn't. I asked them to fix the tests, since there's no point in reviewing broken code.

"Fix the tests." This was interpreted literally, and assert status == 200 got changed to assert status == 500 in several locations. Some tests required more complex edits to make them "pass."

Inquiries about the tests went unanswered. Eventually the 2000 lines of slop was closed without merging.

saghm · 2026-01-17T09:10:22 1768641022

After a certain point the response to low effort vibe code has to be vibe reviews. Failing tests? Bad vibes, close without merging. Much more efficient than vibe coding too, since no AI is needed.

ciaranmca · 2026-01-16T20:59:37 1768597177

100%, trying a bit of an experiment like this(similar in that I mostly just care about playing around with different agents, techniques etc.) it has built out literally hundreds of tests. Dozens of which were almost pointless as it decided to mock apis. When the number of failed tests exceeded 40 it just started disabling tests.

icedchai · 2026-01-17T00:05:46 1768608346

To be fair, many human developers are fond of pointless tests that mock everything to the extent that no real code is actually exercised. At least the tests are fast though.

falkensmaize · 2026-01-17T04:35:07 1768624507

Citing the absolute worst practices from terrible developers as a way to exonerate or legitimize LLM code production issues is something we need to stop doing in my opinion. I would not excuse or expect a day one junior on my team that wrote pointless tests or worse yet removed tests to get the CI to pass.

If LLMs do this it should be seen as an issue and should not be overlooked with “people do it too…”. Professional developers do not do this. If we’re going to use Ai for creating production code we need to be honest about its deficiencies.

icedchai · 2026-01-17T19:21:43 1768677703

I agree, but if LLMs are trained on common practices, best or worst, what do you expect?

Testing, specifically, is heavily opinionated among professional developers.

zephen · 2026-01-16T21:07:13 1768597633

> it is shocking how often claude suggests just disabling or removing tests.

Arguably, Claude is simply successfully channeling what the developers who wrote the bulk of its training data would do. We've already seen how bad behavior injected into LLMs in one domain causes bad behavior in other domains, so I don't find this particularly shocking.

The next frontier in LLMs has to be distinguishing good training data from bad training data. The companies have to do this, even if only in self defense against the new onslaught of AI-generated slop, and against deliberate LLM poisoning.

If the models become better at critically distinguishing good from bad inputs, particularly if they can learn to treat bad inputs as examples of what not to do, I would expect one benefit of this is that the increased ability of the models to write working code will then greatly increase the willingness of the models to do so, rather than to simply disable failing tests.

dullcrisp · 2026-01-16T20:46:37 1768596397

If I had a nickel for every time I’ve seen a human being pull down their pants and defecate in the middle of the street I’d have a couple nickels. That’s not a lot but it suggests that this behavior is not LLM specific.

Tade0 · 2026-01-17T10:09:33 1768644573

If anything, the LLMs had to learn that from somewhere, so they're just copying human behaviour.

aspenmartin · 2026-01-17T15:52:38 1768665158

I'm definitely in the camp that this browser implementation is shit, but just a reminder: agent training does involve human coding data in early stages of training to bootstrap it but in its reinforcement learning phase it does not -- it learns closer to the way AlphaGo did, self play and verifiable rewards. This is why people are very bullish on agents, there is no limit technically to how well they can learn (unlike LLMs) and we know we will reach superhuman skill, and the crucial crucial reason for this is: verifiable rewards. You have this for coding, you do not have this for e.g. creative tasks etc.

So agents will actually be able to build a {browser, library, etc} that won't be an absolute slopfest, but the real crucial question is when. You need better and more efficient RL training, further scaling (Amodei thinks really scaling is the only thing you technically need here and we have about 3-4 orders of magnitude of headroom left before we hit insurmountable limits), bigger context windows (that models actually handle well) and possibly continual learning paradigms, but solutions to these problems are quite tangible now.

teiferer · 2026-01-17T09:22:14 1768641734

Where I work, that's exceptionally rare to the point practically non-existing.

mickdarling · 2026-01-16T22:11:39 1768601499

Had humans not been doing this already, I would have walked into Samsung with the demo application that was working an hour before my meeting, rather than the android app that could only show me the opening logo.

There are a lot of really bad human developers out there, too.

moregrist · 2026-01-17T00:56:13 1768611373

> Entrepreneur, CEO and founder of Tomorrowish a social media DVR

So you flubbed managing a project and are now blaming your employees. Classy.

mickdarling · 2026-01-17T15:14:12 1768662852

Wasn't my project to manage. That was a consulting gig. And I fired the client right after this.

DonHopkins · 2026-01-17T12:36:31 1768653391

Nice blog post, gp serial entrepreneur founder bro -- what did your investors think of that?

http://www.mickdarling.com/2019/07/26/busy-summer/

  An embedded page at landr-atlas.com says:

  Attention!

  MacOS Security Center has identified that your system is under threat. 
  Please scan your MacOS as soon as possible to avoid more damage.
  Don't leave this page until you have undertaken all the suggested steps 
  by authorised Antivirus.

  [OK]

mickdarling · 2026-01-18T23:29:07 1768778947

Thank you for the note. It's not a site I used all that often.

Whether you had anything to do with it or not, I have no idea. And, since you didn't follow best practices and tell me directly rather than trying to score points here, there's really no way of knowing whether you're the one who caused the problem in the first place.

I built a new site without Wordpress. That took in less than a day.

I don't imagine you will alter your behavior to align with general best security practices anytime soon.

DonHopkins · 2026-01-20T10:53:48 1768906428

> Whether you had anything to do with it or not, I have no idea. And, since you didn't follow best practices and tell me directly rather than trying to score points here, there's really no way of knowing whether you're the one who caused the problem in the first place.

Are you actually accusing me (slyly couched in weasel words, but still explicitly) of hacking your wordpress blog, then pointing it out on Hacker News to score points?

Yeah, you have a point /s: there's really no way to tell if I hacked your blog or not, nor any way of knowing whether any statement is true or not if you're nihilistic enough, but you're going to have to take my word that I didn't, and clean up your own mess without shifting the blame to me, or demanding I should have helped you. You're the one who chose to use wordpress, not me. FYI, "general best security practices" include DON'T USE WORDPRESS.

What possible evidence or delusional reasons do you have to imply that I hacked your wordpress blog? Is your security really that lax and password that easy to guess? And even if I did, then why would I post about it publicly or notify you privately? You sound pathologically paranoid and antisocially aggressive to make such baseless accusations out of the blue, to try to shift the blame to me for your own mistakes. That makes me glad I didn't try to contact you directly. Funny thing for you to complain about when you don't even openly publish your contact email address on your blog or hn profile like I do, though.

anonzzzies · 2026-01-16T22:04:33 1768601073

We use claude code a lot for updating systems to a newer minor/major version. We have our own 'base' framework for clients which is a, by now, very large codebase that does 'everything you can possibly need'; so not only auth, but payments, billing, support tickets, email workflows, email wysiwyg editing, landing page editor, blogging, cms, AI /agent workflows etc etc (across our client base, we collect features that are 'generic' enough and create those in the base). It has many updates for the product lead working on it (a senior using Claude code) but we cannot just update our clients (whose versions are sometimes extremely customised/diverging) at the same pace; some do not want updates outside security, some want them once a year etc. In this case AI has been really a productivity booster; our framework always was quite fast moving before AI too when we had 3.5 FTE (client teams are generally much larger, especially the first years) on it but then merging, that to mean; including the new features and improvements in the client version that are in the new framework version without breaking/removing changes on the client side, was a very painful process taking a lot of time and at at least 2 people for an extended period of time; one from the client team, one from the framework team. With CC it is much less painful: it will merge them (it is not allowed, by hooks, to touch the tests), it will run the client tests and the new framework tests and report the difference. That difference is evaluated usually by someone from the client team who will then merge and fix the tests (mostly manually) to reflect the new reality and test the system manually. Claude misses things (especially if functionalities are very similar but not exactly the same, it cannot really pick which to take so it does nothing usually) but the biggest bulk/work is done quickly and usually without causing issues.

fzzzy · 2026-01-16T17:21:28 1768584088

that’s not guaranteed to work. Other parts of the CodeBase that didn’t conflict could depend on the discarded code.

madeofpalk · 2026-01-16T18:35:05 1768588505

The point is that the merge conflict was resolved, regardless of whether there was a working product at the end. Which there apparently isn’t.

formerly_proven · 2026-01-16T17:31:51 1768584711

Well they did mention the code doesn't work.

nyeah · 2026-01-16T18:00:21 1768586421

Where did Cursor say that?

logicallee · 2026-01-16T21:36:34 1768599394

It's implied by the fact that early in the post they say:

>"To test this system, we pointed it at an ambitious goal: building a web browser from scratch."

and then near the end, they say:

>"Hundreds of agents can work together on a single codebase for weeks, making real progress on ambitious projects."

This means they only make progress toward it, but do not "build a web browser from scratch".

If you're curious, the State of Utopia (will be available at https://stateofutopia.com ) did build a web browser from scratch, though it used several packages for the networking portion of it.

See my other comments and posts for links.

efreak · 2026-01-18T05:50:41 1768715441

`git add .; git merge continue` also "solves" the conflict

PunchyHamster · 2026-01-16T20:15:42 1768594542

So, AI agent battle royale

embedding-shape · 2026-01-16T17:06:12 1768583172

So clearly someone, at some point, managed to run this, surely? That's where the screenshots come from? I just don't understand how, given the code is riddled with errors.

nicoburns · 2026-01-16T19:28:17 1768591697

Somebody managed to get it to compile https://x.com/CanadaHonk/status/2011612084719796272

But apparently "some pages take a literal minute to load"

embedding-shape · 2026-01-16T19:42:08 1768592528

> to be clear those 2 hours were fixing compile errors and bugs, not compile time

Seems like "I had to do the last mile myself", not "autonomous coding" which was Cursor's claim here.

deeth_starr_v · 2026-01-16T17:52:11 1768585931

Maybe they just asked an AI to create an image of a rendered webpage?

nyeah · 2026-01-16T17:43:25 1768585405

The link [0] implies that the browser worked. Can you help me understand what's "conservative" about that?

DonHopkins · 2026-01-17T12:44:41 1768653881

> Can you help me understand what's "conservative" about that?

It's the gaslighting.

paulus_magnus2 · 2026-01-02T10:07:48 1767348468

Why are we always blaming the lowest developer for things? How about we target the Jira overlords who know exactly what they are doing?

I was once in a team developing a billing system that was counting how many times NSA invokated APIS to snoop on ATnT subscribers. The whole thing was very decoupled and dynamicly set up it took us developer very long time to figure out what this is used for. But the PM knew exactly what they did from the start.

paulus_magnus2 · on June 16, 2024

No need to ban it. Just automaticly award full salary for 2x the noncompete period they put in your contract, payable in full a week after contract termination.

callalex · on June 16, 2024

Suddenly your salary is $1/yr and your bonus is $1M/yr. (This really happens in the USA financial sector.)

AgentOrange1234 · on June 16, 2024

Employees won’t tolerate that due to not being able to get mortgages? Our megacorp recently upped base salaries due to this.

Dr_Birdbrain · on June 16, 2024

Employees definitely tolerate this. As said above, this already happens in industry, and also for many years Amazon used to cap salaries at some hilariously low number—I think like 180k. That may not seem low until you try to buy a house in the Bay Area and lenders laugh in your face.

People would either be part of a two-income household or wait for their stock to vest, but for many years they tolerated it fine.

tekno45 · on June 17, 2024

Can't you count bonuses that are consistent for two years?

paulus_magnus2 · on June 4, 2024

How does this look outside coding? Do we know of any other professions where there are whiteboard / leetcode style interviews?

paulus_magnus2 · on May 13, 2024

The single biggest opportunity equalizer for the young generations come from 2 unlikely sources: COVID (mandatory WFH) and Musk (ubiquitous internet access). This opens doors for anyone who is hard working and can follow though a long term plan. No generation wealth needed. Study hard, get remote work, buy a cheap piece of land in the middle of nowhere, get married, get a mortgate, get kids.

Focus on your own stuff. Avoid politics, avoid social media, avoid HCOL cities, avoid career ladder. Do your 9-5 each day, log off and invest time in a hobby, family, farm etc.

paulus_magnus2 · on May 13, 2024

That's not the 1st time I see this thinking fallacy.

please talk to your grandparents about the topic and report back to us.

anonzzzies · on May 13, 2024

My grandparents are dead, but I doubt they would have much to say about it; they had kids (in the 1940s) because their parents would not accept anything less; religion + marriage + kids. They often told me and my parents they didn't want kids and they were supposed to have as many as possible but they only had one and promised to have more but never did.

But what great insights would they come up with pray tell? In those days (it was WOII) it was to work the land (economy), cannon fodder and pension plan (have them take care of you). What else?

thinkingemote · on May 13, 2024

If it wasn't for your grandparents you wouldn't be here.

That's what is else, your very existence!

I'm sure you would agree, your life has value and meaning beyond a utilitarian or disposable cog in a wheel.

By extension therefore all lives are equally valuable and necessary.

card_zero · on May 13, 2024

"You, an existing person, find you prefer to stay alive" is not an argument for "more babies are always better". If the reason that there are fewer babies is that potential parents don't want them, then it's possible that there are exactly the right amount of babies, and the ensuing problems should be solved by other means than pressuring these unenthused couples to reproduce.

I think that creating more happy people is better, not more people per se.

anonzzzies · on May 13, 2024

> Your life has value

It does? Measured by what? That's what I say; I'm a consumer, but beyond that, how am I valuable? How are you?

> By extension all lives are equally valuable and necessary.

You are a vegan? Are plants lives lives?

Also, putting too much pressure on the planet with over population makes it the opposite of valuable.

thinkingemote · on May 13, 2024

By any measurement you care to make yourself. You value your own life.

anonzzzies · on May 13, 2024

> You value your own life.

We are an accidental blib in a cold vast cosmos; nothing and no-one would think of us or remember us if the earth explodes today (or if someone hits the nuke button). I 'value' my life in so far that I am not in pain and I was born in the right location with the right interests (math + computing) to do whatever I want to sit it out until I die.

How is any of it a reason to make more babies though or a sign we have a shortage just because we make less than we need to create a population equilibrium or growth?

Ekaros · on May 13, 2024

Grandparents who lived lot more poorly or had their siblings die as babies? Lot more babies survive and their living standards are much higher. Maybe that is reasonable trade-off...

paulus_magnus2 · on April 8, 2024

How is housing expensive?? Assuming a 1.2m house (for easy math). You pay 200k down and take a 1m loan which is 6k/mo. After 10y you sell the house for 2.4m. Having paid 720k, ~500k of mortgage is left over. So 200k turns into 2.4m -500k -720k = 1.18m profit.

Rather than renting for 5k (600k over 10y).

How is housing expensive??

paulus_magnus2 · on March 4, 2024

Please let's not glorify working from 6am (effectively allocating time since 5am for work) unless you also left office at 2pm. Only seeing your children on weekends to make your boss 10% happier is not something to be proud about

Faaak · on March 4, 2024

In Switzerland at least, it's very common to arrive very early in the morning (~6am), and then leave at 3pm. As long as you do your hours

jajko · on March 4, 2024

No it isn't, not in any big multinationals, not unless you have some extremely specific agreement with management. More acceptable is say 7:30 - 16:30, and is still often seen as slacking by folks who stay till 6-7pm. Of course if you do consistently excellent work way above rest of the peers this doesn't matter.

The reason is trivial - if you are not around half of the time rest of the team is, teamwork suffers quite a bit (which is cca constant situation with distant remote places like India, ie if you have any questions in the afternoon you are blocked and they need to wait till next day)

hoseja · on March 4, 2024

>not in any big multinationals

Those aren't the only companies that exist.

throwaway11460 · on March 4, 2024

But the only companies that pay well - at least here in Europe.

15457345234 · on March 4, 2024

Having spent some time living in Zurich I call bullshit on this; the streets and train stations are _deserted_ at 6am. Shops are still getting deliveries, coffee shops are still closed.

No significant numbers of people - in Zurich at least - are starting work in offices at 6am.

EduardoBautista · on March 4, 2024

A previous coworker used to get into the office early, around 7am, because they dropped off their kid at school. They also left around 3pm. This is reasonable and would be accepted by any competent manager.

benjaminwootton · on March 4, 2024

This is how I always worked over a 15 year developer career. I’d get in at 6 maybe take a 2 hour lunch break and then put a few more hours in until around 4pm as a standard pattern.

The vast majority of useful work I ever did was between the hours of 6-9am.

edanm · on March 4, 2024

Parent described what he did, in what way did he "glorify" this?

I agree with your sentiment but I think in this case, it's misplaced.

esskay · on March 4, 2024

If someone choses to do it (and is not required) then why is it an issue?

I do 6:30am - 3pm and have done for years, it fits my day well. Others I work with might not start till 10am. Thats the great thing about working from home, you get more flexability in your start/end time. As long as you get your working hours done it's a total non issue.

skipkey · on March 4, 2024

I generally left around 3pm to avoid the Austin rush hour traffic. Sure, I would occasionally stay late but I generally didn’t work much more than a 40 hour week. But this was Facebook games, with a daily release cycle, so there wasn’t really the crunch time that you would have seen at a AAA studio. They catered lunch every day but I almost always left and ate somewhere else.

Personally I appreciated the flexibility to work and leave early.

JR1427 · on March 4, 2024

When I was a PhD student and post-doc, I used to also arrive by 7am, and get most of my work done before anyone else arrived at 11am. I had all the equipment to myself, and enjoyed being able to settle in to the work day without having to interact with people.