And spotting stuff in review! Sometimes they're false positives, but on several occasions I've spent ~15-30 minutes teaching-reviewing a PR in person, checked afterwards, and it matched every one of the points.
The flood of Disney+ cancellations likely contributed to their decision to back down on Kimmel, kind of heartening to know we've still got some power over these mega corps.
It gets worse x2: the executor of the estate having signed up for Disney+ means the estate of the deceased loses the right to sue, despite the deceased having never signed up. Like a client being bound by all unrelated legal agreements their lawyer entered into.
(If I recall the details correctly; it has been a while since I read into that case.)
It's all just games, they just want to win. Dollars are the overall points, but they're even willing to sacrifice some of those to win bigger cases more brutally.
Except it is a stretch to say it is "their theme park restaurant". This story was dramatically oversimplified in the media and Disney's position was nowhere near as unreasonable as everyone understands it to be.
The argument was not "they agreed to a EULA 5 years ago and therefore mandatory arbitration in all disputes with Disney".
This is a privately owned restaurant at a glorified shopping mall within the larger Walt Disney World resort. If you died due to a severe allergic reaction at a normal restaurant in a normal shopping mall in Florida the mall owners would generally not be liable unless there's something else going on.
The theory that Disney is liable here is more than anything based on the *restaurant featuring on their app.* The EULA for *that app* would certainly be relevant to this argument.
Now, the Disney lawyers also tried to argue that the Disney+ EULA would actually (at least plausibly) be relevant. That is more than a bit of a stretch, especially for a free trial from years ago, and I'd be surprised (but IANAL) if such a theory would actually hold up in court. Still, on a spectrum from "person died due to maintenance failure on a Magic Kingdom ride" to "person died from going to a restaurant featured on a Disney+ program", if you're arguing that the Disney+ EULA is relevant, this is a whole lot closer to the latter than the former.
It's my belief the Disney+ EULA claim was just the lawyers doing the "throw everything at the wall and see what sticks" shtick (no pun intended). They knew it was likely to not hold up, but tried it anyway because, if it did, it helps future claims.
>Disney's position was nowhere near as unreasonable as everyone understands it to be.
>Now, the Disney lawyers also tried to argue that the Disney+ EULA would actually (at least plausibly) be relevant.
Well, you know, they also could have not done _that_. With it they deserve all the flak that they've got and more, simply because they resorted to a scummy tactic, whatever the reason.
Except that the theme park did present the restaurant as being part of the park, which makes it quite reasonable to hold the theme park responsible financially for the entire debacle.
If a chainsaw juggler on a cruise ship cuts my dad in half while he's sleeping on his deck chair, "That entertainer was not a direct employee of Royal Caribbean" will hold exactly zero water in determining liability.
There are very substantial differences between your chainsaw juggler scenario and the Disney one. Notably, the cruise ship is access controlled and your dad didn't actively engage with the chainsaw juggler.
To be clear, this isn't part of Magic Kingdom or one of the proper Disney theme parks. This is a shopping area, open to the public without admission.
For a closer scenario: the cruise ship docks at one of its stops for a day. The area around where the ship docks is owned by Royal Caribbean but open to the public. Most of the stores are privately owned and operated, leasing space from Royal Caribbean. One of those stores is a theater that runs a chainsaw juggling show. Royal Caribbean's website/app includes the full schedule of that theater and highlights that show as perfectly-safe-we-assure-you. Your dad attends that show and gets bisected.
The key point here, entirely not captured by your scenario: the theory making Disney plausibly liable is that Disney's own online services presented this restaurant and its menus which made the plaintiff believe that the restaurant was subject to Disney's allergy standards. It is not at all unreasonable to say that EULAs for those online services are relevant to this dispute.
Parent post brought in the comparison to stealing a CD, but torrenting isn't just taking a copy, it's distributing to others, hence the absurd damages claims.
> and it's has spent a fair bit of time going through an infinite seeming loop before finally unjamming itself
I can live with this on my own hardware. But Opus 4.6 has developed this tendency to the point where it will happily chew through the entire 5-hour allowance on the first instruction, going in endless circles. I've stopped using it for anything except the extreme planning now.
I've been playing with 3.5:122b on a GH200 the past few days for rust/react/ts, and while it's clearly sub-Sonnet, with tight descriptions it can get small-medium tasks done OK - as well as Sonnet if the scope is small.
The main quirk I've found is that it has a tendency to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked, and I find it has stripped all the preliminary support infrastructure for the new feature out of the code.
That sounds awfully similar to what Opus 4.6 does on my tasks sometimes.
> Blah blah blah (second guesses its own reasoning half a dozen times then goes). Actually, it would be simpler to just ...
Specifically on Antigravity, I've noticed it doing that trying to "save time" to stay within some artificial deadline.
It might have something to do with the system messages and the reinforcement/realignment messages that are interwoven into the context (but never displayed to end-users) to keep the agents on task.
As someone that started using Co-work, I feel like I am going insane with the frequency that I have to keep telling it to stay on task.
If you ask it to do something laborious like reviewing a bunch of websites for specific content, it will constantly give up, providing you information on how you can continue the process yourself to save time. It's maddening.
That’s pretty funny when compared with the rhetoric like “AI doesn’t get tired like humans.” No, it doesn’t, but it roleplays like it does. I guess there is too much reference to human concerns like fatigue and saving effort in the training.
This is what happens when a bunch of billionaires convince people autocomplete is AI.
Don't get me wrong, it's very good autocomplete and if you run it in a loop with good tooling around it, you can get interesting, even useful results. But by its nature it is still autocomplete and it always just predicts text. Specifically, text which is usually about humans and/or by humans.
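To make the "run it in a loop" framing concrete, here's a toy sketch of autocomplete-style generation: repeatedly predict one next token and append it. Everything here is illustrative; the lookup table stands in for a real model, but the loop structure is the same one greedy decoding uses.

```python
# Toy "autocomplete in a loop": a bigram lookup table plays the role of
# the model. A real LLM replaces the table with a neural network, but
# the generation loop is structurally identical.
BIGRAMS = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
}

def next_token(prev: str) -> str:
    # Predict the next token given only the previous one.
    return BIGRAMS.get(prev, "<eos>")

def generate(prompt: str, max_tokens: int = 5) -> str:
    tokens = prompt.split()
    for _ in range(max_tokens):
        tok = next_token(tokens[-1])
        if tok == "<eos>":
            break
        tokens.append(tok)
    return " ".join(tokens)

print(generate("the"))  # prints "the cat sat down"
```

The point of the sketch: nothing in the loop "plans" beyond the next token; any apparent long-range coherence comes from the predictor itself.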
You are not wrong, but after having started working with LLMs, I have this feeling that many humans are simply autocomplete engines too. So LLMs might be actually close to AGI, if you define "general" as "more than 50% of the population".
Humans are absolutely auto-complete engines, and regularly produce incorrect statements and actions with full confidence that they are precisely correct.
Just think about how many thousands of times you've heard "good morning" after noon both with and without the subsequent "or I guess I should say good afternoon" auto-correct.
Well, the essence of software engineering is taking these complex real-world tasks and breaking them down into simpler parts until they can be done by (conceptually) simple digital circuits.
So it's not surprising that eventually autocomplete can reach up from those circuits and take on some tasks that have already been made simple enough.
I think what's so interesting is how uneven that reach is. On some tasks it is better than at least 90% of devs, maybe even superhuman (by which I mean better than any single human; I've never seen an LLM do something that a small team couldn't do better given a reasonable amount of time). In other cases, actual old-school autocomplete might do a better job: the extra capabilities added up to negative value, and their presence was a distraction.
Sometimes there is an obvious reason why (solving a problem with lots of example solutions online vs working with poorly documented proprietary technologies), but other times there isn't. They certainly have raised the floor somewhat, but the peaks and valleys remain enormous, which is interesting.
To me that implies there is both lots of untapped potential and challenges the LLM developers have not even begun to face.
Yep. The veil of coherence extends convincingly far by means of absurd statistical power, but the artifacts of next-token prediction become far more obvious when you're running models that can work on commodity hardware.
> As someone that started using Co-work, I feel like I am going insane with the frequency that I have to keep telling it to stay on task.
Used to have the same thing happening when using Sonnet or Opus via Windsurf.
After switching to Claude Code directly though (and using "/plan" mode), this isn't a thing any more.
So, I reckon the problem lies in some of these UIs/tools, and probably not in the models they're sending the data to. Windsurf, for example, we no longer use due to the inferior results.
In my experience all of the models do that. It's one of the most infuriating things about using them, especially when I spend hours putting together a massive spec/implementation plan and then have to sit there babysitting it going "are you sure phase 1 is done?" and "continue to phase 2"
I tend to work on things where there is a massive amount of code to write but once the architecture is laid down, it's just mechanical work, so this behavior is particularly frustrating.
I hope you will excuse my ignorance on this subject, so as a learning question for me: is it possible to add what you put there as an absolute condition, that all available functions and data are present as an overarching mandate, and it’s simply plug and chug?
Recently it seems that even if you add those conditions, the LLMs will tend to ignore them. So you have to repeatedly prompt them. Sometimes strong or emphatic language will help them keep it "in mind".
I found it better to split the work into smaller tasks from a first overall analysis, have it do only that subtask, and have it give me the next prompt once finished (or feed that to a system of agents). There is a real threshold beyond which quality is lost.
Yeah that happened to me with Claude code opus 4.6 1M for the first time today. I had to check the model hadn’t changed. It was weird. I was imagining that maybe anthropic have a way of deciding how much resource a user actually gets and they had downgraded me suddenly or something.
But how do you see the current thinking level and how do you change it? I’ve been clicking around and searching and adding “effortLevel”:”high” to .claude/settings.json but no idea if this actually has any effect etc.
Opus 4.6 found in my documentation how to flash the device and wanted to be clever and helpful and flash it for me after doing a series of fixes. I've got used to approving commands and missed that one. So it bricked it.
Then I wrote extra instructions saying flashing of any kind is forbidden. A few days later it did it again and apologised...
Haha yeah I've had this happen to me too (inside copilot on GitHub). I ask it to make a field nullable, and give it some pointers on how to implement that change.
It just decided halfway that, nah, removing the field altogether means you don't have to fix the fallout from making that thing nullable.
> to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked
That's likely coming from the 3:1 ratio of linear to quadratic attention usage. The latest DeepSeek also suffers from it, which the original R1 never exhibited.
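For anyone unfamiliar with what a "3:1 ratio" means architecturally, here's a purely hypothetical sketch of how such an interleave might be laid out. The function name and layout are illustrative assumptions, not any real model's configuration.

```python
# Hypothetical sketch of a 3:1 linear-to-quadratic attention interleave:
# three linear-attention layers, then one full (quadratic) attention
# layer, repeating. Illustrative only; not a real model's config.
def attention_kind(layer_idx: int, ratio: int = 3) -> str:
    # Every (ratio + 1)-th layer uses full quadratic attention;
    # the rest use cheaper linear attention.
    return "quadratic" if layer_idx % (ratio + 1) == ratio else "linear"

layout = [attention_kind(i) for i in range(8)]
print(layout)  # three "linear" entries, then "quadratic", repeating
```

The trade-off hinted at above: linear attention layers are cheaper on long contexts but compress history lossily, so models leaning on them can drift on long, instruction-heavy sessions.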
For the uninitiated: Interestingly, it is not advisable to take this to the extreme and set temperature to 0.
That would seem logical, as the results are then completely deterministic, but it turns out that a suboptimal token may result in a better answer in the long run. Also, allowing for a little bit of noise gives the model room to talk itself out of a suboptimal path.
I like to think of this like tempering the output space. With a temperature of zero, there is only one possible output and it may be completely wrong. With even a low temperature, you drastically increase the chances that the output space contains a correct answer, through containing multiple responses rather than only one.
I wonder if determinism will be less harmful to diffusion models, because they perform multiple iterations over the response rather than taking a single, no-lookahead shot at each position. I'm looking forward to finding out and have been playing with a diffusion model locally for a few days.
Yup. I think of it as how off the rails do you want to explore?
For creative things or exploratory reasoning, a temperature of 0.8 leads to all sorts of excursions down the rabbit hole. However, when coding and needing something precise, a temperature of 0.2 is what I use. If I don't like the output, I'll rephrase or add context.
> The main quirk I've found is that it has a tendency to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked,
This is my experience with the Qwen3-Next and Qwen3.5 models, too.
I can prompt with strict instructions saying "** DO NOT..." and it follows them for a few iterations. Then it has a realization that it would be simpler to just do the thing I told it not to do, which leads it to the dead end I was trying to avoid.
Here’s what Gemini says about copy-pasting AI answers:
> Avoid "lazy" posting—copying a prompt result and pasting it without any context. If the user wanted a raw AI answer, they likely would have gone to the AI themselves.
Anthropic officially spends more on lobbying than other huge companies like Microsoft or Amazon. Its latest $20M outlay [1] alone is more than either company's spend. Its lobbying spend is now on par with that of companies like Meta, which has tons of regulatory battle fronts (unlike Anthropic).
Right. I had a 100% manual hobby project that did a load of parametric CAD in Python. The problem with sharing this was either actively running a server, trying to port the stack to emscripten including OCCT, or rewriting in JS, something I am only vaguely experienced in.
In ~5 hours of prompting, coding, testing, tweaking, the STL outputs are 1:1 (having the original is essential for this) and it runs entirely locally once the browser has loaded.
I don’t pretend that I’m a frontend developer now but it’s the sort of thing that would have taken me at least days, probably longer if I took the time to learn how each piece worked/fitted together.