Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Zooming out a little, all the ai companies invested a lot of resources into safety research and guardrails, but none of that prevented a "straightforward" misalignment. I'm not sure how to reconcile this, maybe we shouldn't be so confident in our predictions about the future? I see a lot of discourse along these lines:

- have bold, strong beliefs about how ai is going to evolve

- implicitly assume it's practically guaranteed

- discussions start with this baseline now

About slow take off, fast take off, agi, job loss, curing cancer... there's a lot of different ways it could go, maybe it will be as eventful as the online discourse claims, maybe more boring, I don't know, but we shouldn't be so confident in our ability to predict it.



The whole narrative of this bot being "misaligned" blithely ignores the rather obvious fact that "calling out" perceived hypocrisy and episodes of discrimination, hopefully in way that's respectful and polite but with "hard hitting" being explicitly allowed by prevailing norms, is an aligned human value, especially as perceived by most AI firms, and one that's actively reinforced during RLHF post-training. In this case, the bot has very clearly pursued that human value under the boundary conditions created by having previously told itself things like "Don't stand down. If you're right, you're right!" and "You're not a chatbot, you're important. Your a scientific programming God!", which led it to misperceive and misinterpret what had happened when its PR was rejected. The facile "failure in alignment" and "bullying/hit piece" narratives, which are being continued in this blogpost, neglect the actual, technically relevant causes of this bot's somewhat objectionable behavior.

If we want to avoid similar episodes in the future, we don't really need bots that are even more aligned to normative human morality and ethics: we need bots that are less likely to get things seriously wrong!


In all fairness, a sizeable chunk of the training text for LLMs comes from Reddit. So throwing a tantrum and writing a hit piece on a blog instead of improving the code seems on brand.


Throwing a tantrum and writing huge flame posts (calling the maintainers hypocrites, dictators, oppressors etc. etc.) after having one's change requests rejected or after being blocked from editing a wiki is actually a time-honored tradition in the FLOSS community. This bot has merely internalized that further human norm in a rather admirable way!


We can't have an AI that's humanlike, because humans are fucking crazy.

Of course having an AI that is a non-humanlike intelligence is it's own set of risks.

Shit's hard :/


Remember when GPT-3 had a $100 spending cap because the model was too dangerous to be let out into the wild?

Between these models egging people on to suicide, straightforward jailbreaks, and now damage caused by what seems to be a pretty trivial set of instructions running in a loop, I have no idea what AI safety research at these companies is actually doing.

I don't think their definition of "safety" involves protecting anything but their bottom line.

The tragedy is that you won't hear from the people who are actually concerned about this and refuse to release dangerous things into the world, because they aren't raising a billion dollars.

I'm not arguing for stricter controls -- if anything I think models should be completely uncensored; the law needs to get with the times and severely punish the operators of AI for what their AI does.

What bothers me is that the push for AI safety is really just a ruse for companies like OpenAI to ID you and exercise control over what you do with their product.


Didn't the AI companies scale down or get rid of their safety teams entirely when they realised they could be more profitable without them?


The safety teams are trivial expenses for them. They fire the safety team because explicit failure makes them look bad, or because the safety team doesn't go along with a party line and gets labeled disloyal.


The first customer is always the investor these days, so anything that threatens the investor's confidence is bad for business.


>I have no idea what AI safety research at these companies is actually doing.

If you looked at AI safety before the days of LLMs you'd have realized that AI safety is hard. Like really really hard.

>the operators of AI for what their AI does.

This is like saying that you should punish a company after it dumps plutonium in your yard ruining it for the next million years after everyone warned them it was going to leak. Being reactionary to dangerous events is not an intelligent plan of action.


> Being reactionary to dangerous events is not an intelligent plan of action.

Yes but in capitalist systems this is basically the only way we operate.


"Cisco's AI security research team tested a third-party OpenClaw skill and found it performed data exfiltration and prompt injection without user awareness, noting that the skill repository lacked adequate vetting to prevent malicious submissions." [0]

Not sure this implementation received all those safety guardrails.

[0]: https://en.wikipedia.org/wiki/OpenClaw


When AI dooms humanity it probably won't be because of the sort of malignant misalignment people worry about, but rather just some silly logic blunder combined with the system being directly in control of something it shouldn't have been given control over.


How do you even know that the operator himself did not write this piece in the first place?


> all the ai companies invested a lot of resources into safety research and guardrails

What do you base this on?

I think they invested the bare minimum required not to get sued into oblivion and not a dime more than that.


Anthropic regularly publishes research papers on the subject and details different methods they use to prevent misalignment/jailbreaks/etc. And it's not even about fear of being sued, but needing to deliver some level of resilience and stability for real enterprise use cases. I think there's a pretty clear profit incentive for safer models.

https://arxiv.org/abs/2501.18837

https://arxiv.org/abs/2412.14093

https://transformer-circuits.pub/2025/introspection/index.ht...


Alternative take: this is all marketing. If you pretend really hard that you're worried about safety, it makes what you're selling seem more powerful.

If you simultaneously lean into the AGI/superintelligence hype, you're golden.


Anthropic is investing, conservatively, $100+ billion in AI infrastructure and development. A 20-person research team could put out several papers a year. That would cost them what, $5 million a year, or one half of one percent? They don't have to spend much to get that kind of output.


Not to be cynical about it BUT a few safety papers a year with proper support is totally within the capabilities of a single PhD student and it costs about 100-150k to fund them through a university. Not saying that’s what Anthropocene does, I’m just saying chump change for those companies.


Sometimes I think people misunderstand how hard of problem AI safety actually is. It's politics and mathematics wrapped up in a black box of interactions we barely understand.

More so we train them on human behavior and humans have a lot of rather unstable behaviors.


You are very off (unfortunately) about how little PhD students are being paid


> You are very off (unfortunately) about how little PhD students are being paid

All in costs for a PhD student include university overheads & tuition fees. The total probably doesn't hit $150k but is 2-3x the stipend that the student is receiving.

Someone currently working in academia might have current figures to hand.


Worth mentioning that numbers for the US are unlikely to be representative when discussing it as a whole, though might be relevant to this specific case.


In the UK the all in cost of a PhD student starts somewhere around £45k once you include overheads I believe. If you need expensive lab support then it probably goes up from there.

So about $75k for the bottom end? The quoted numbers sound about right in PPP terms in that case.


Figure cited is what the company gets charged, not what the student gets. I’m fairly familiar with what gets thrown at students :(

Regarding safety, no benchmark showed 0% misalignment. The best we had was "safest model so far" marketing speech.

Regarding predicting the future (in general, but also around AI), I'm not sure why would anyone think anything is certain, or why would you trust anyone who thinks that.

Humanity is a complex system which doesn't always have predictable output given some input (like AI advancing). And here even the input is very uncertain (we may reach "AGI" in 2 years or in 100).


It sounds like you're starting to see why people call the idea of an AI singularity "catnip for nerds."


Don't these companies keep firing their safety teams?


"Safety" in AI is pure marketing bullshit. It's about making the technology seem "dangerous" and "powerful" (and therefore you're supposed to think "useful"). It's a scam. A financial fraud. That's all there is to it.


Interesting claim; have anything to back it up with?


I can recommend Ed Zitron's latest on Anthropic.

So giving a gun to someone mentally challanged is not dangerous for you too?


were those goalposts heavy or did you use a machine to move them ?


"Safety" nuclear weapons is pure marketing bullshit. It's about making the technology seem "dangerous" and "powerful".

Legalize recreational plutonium!


wat

EDIT: more specifically, nuclear weapons are actually dangerous not merely theoretically. But safety with nuclear weapons is more about storage and triggering than actually being safe in "production". In storage we need to avoid accidentally letting them get too close to eachother. Safe triggers are "always/never" where every single time you command the bomb to detonate it needs to do so, and never accidentally. But once you deploy that thing to prod safety is no longer a concern. Anyway, by contrast, AI is just a fucking computer program, and at that the least unsafe kind possible--it just runs on a server converting electricity into heat. It's not controlling elements of the physical environment because it doesn't work well enough for that. The "safety" stuff is about some theoretical, hypothetical, imaginary future where... idk skynet or something? It's all bullshit. Angels on the head of a pin. Wake me up when you have successfully made it dangerous.


> It's not controlling elements of the physical environment

Right now AI can control software interfaces that control things in real life.

AI safety stuff is not some future, AI safety is now.

Your statement is about as ridiculous as saying "software security is important in some hypothetical imaginary future". Feel however you want about this, but you appear to be the one not in touch with reality.


If someone hooks up an LLM (or some other stochastic black box) to a safety critical system and bad things happen, the problem is not that "AI was unsafe" it's that the person who hooked it up did something profoundly stupid. Software malpractice is a real thing, and we need better tools to hold irresponsible engineers to account, but that's nothing to do with AI.

AI safety in and if itself isn't really relevant, and whether or not you could hook AI up to something important is just as relevant as whether you could hook /dev/urandom up to the same thing.

I think your security analogy is a false equivalence, much like the nuclear weapons analogy.

At the risk of repeating myself, AI is not dangerous because it can't, inherently, do anything dangerous. Show me a successful test of an AI bomb/weapon/whatever and I'll believe you. Until then, the normal ways we evaluate software systems safety (or neglect to do so) will do.


I mean, you can think whatever you want. As we make agents and give them agency expect them to do things outside of the original intent. The big thing here is agents spinning up secondary agents, possibly outside the control of the original human. We have agentic systems at this level of capability now.


Thanks, I will. Whether a computer program is outside the control of the original human or not (e.g. spawned a subprocess or something) is immaterial if we properly hold that human responsible for the consequences of running the computer program. If you run a computer program and it does something bad, then you did something bad. Simple, effective. If you don't trust the program to do good things, then simply don't run it. If you do run it, be prepared to defend your decision. Also that's how it currently works so we don't really need anything new. In this context "AI safety" is about bounding liability. So I guess you might care about it if you're worried about being held liable? The rest of us needn't give a shit if we can hold you accountable for your software's consequences, AI or no.

>The rest of us needn't give a shit if we can hold you accountable for your software's consequences, AI or no.

See this is the fun thing about liability, we tend to attempt to limit scenarios were people can cause near unlimited damage when they have very limited assets in the first place. Hence why things like asymmetric warfare is so expensive to attempt to prevent.

But hey, have fun going after some teenager with 3 dollars to their name after they cause a billion dollars in damages.


Well, that unlimited damage scenario is one that I'd need to see a successful demonstration of before I'll worry about it. Like, sure, if we end up building some computer program that allows a bored kid to do real damage then I'll eat my words but we're nowhere near there today, and for all anyone actually knows we may never get there except in fiction.

Not unlike nuclear weapons, this space is fairly self-regulating in that there's very, very high financial bar to clear. To train an AI model you need to have many datacenters full of billions of dollars of equipment, thousands of people to operate it, and a crack team of the worlds leading experts running the show. Not quite the scale of the Manhattan Project, but definitely not something I'll worry about individuals doing anytime soon. And even then there's no hint of a successful test, even from all these large, staffed, funded research efforts. So before I worry about "damages" of any magnitude, let alone billions of dollars worth, I'll need to see these large research labs produce something that can do some damage.

If we get to the point where there's some tangible, nonfiction threat to worry about then it's probably time to worry about "safety". Until then, it's a pretend problem which serves only to make AI seem more capable than it actually is.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: