I see frame drops when opening the Start Menu on a clean Windows 11 install on my work laptop (a two-year-old quad-core Intel with 32GB of memory). I have seen the same on 3D V-Cache Ryzen systems from people who claimed there was no lag or sluggishness. It was there, and they saw it once it was pointed out; the standards of Windows users have simply gotten so low that they've gotten used to the status quo.
On macOS, meanwhile, Finder refuses to reflect major changes made via CLI operations without a full Finder restart, and search indexing is currently broken, after prior versions of Ventura were stable functionality-wise. I am, however, firm that Liquid Glass is a misstep, made more by the Figma crowd than by actual UX experts. It is designed to look good in static screenshots rather than to have any UX advantage or purpose compared to e.g. skeuomorphism.
If I may be a bit snarky, I'd advise anyone who does not see the window corner inconsistencies on current macOS or the appalling lag on Windows 11 to seek an ophthalmologist right away…
KDE and GNOME are the only projects that are still purely UX-focused, though personal preference can make one far more appealing than the other.
Title was editorialised from "BACK FROM THE GRAVE" to "Unreleased LG Rollable Smartphone Teardown", as the original title does not convey what the video is about.
Having owned a Fold in the past, I am still firm that if we get flexible displays to be sufficiently resilient, this rollable design will win in the long term. It only requires one display, can be deployed far more seamlessly than opening via a hinge, and allows for numerous display sizes between minimum and maximum deployment, rather than a fixed number.
I had an LG Wing and a V60 and used them beyond the death of LG's smartphone division. Software/the skin was never their strong suit, but they did deviate from the pack in very interesting ways. Sadly, that likely contributed to their ultimate demise, though I maintain that they could have staved off their fate had they heavily advertised being the only "normal looking" smartphone at the time that followed MIL-STD-810G. That they barely advertised a differentiator which could have caught the attention of a large consumer base in the outdoors, hunting, etc. spaces is a missed opportunity in my eyes.
After these findings, any rational person would take a step back and consider whether they are actually using these models properly.
Maybe. Even if you believe that LLM code output nowadays is both 100% correct and always as performant as possible (it isn't), having the lowest LOC is still the ideal, because the simplest functional implementation will always stay the best, all else being equal. Even more so considering this is a bloody Rails blog, not a highly complex project with no existing reference point.
But Garry Tan, he isn't most people.
Instead, double down, call a teenager names for doing some frankly fair, polite, and professional analysis of a poor codebase, and do anything but reflect that maybe, just maybe, you might be wrong.
Mind you, this would be childish and stupid even if he had written these offences himself. At least with handcrafted poor code there is a sunk-cost element to it. But here there is not: his emotional involvement in this code should be zero, just like the actual effort expended.
We are talking about code he has likely never even skimmed. Code that is unusably unoptimised. Code for a simple blog that contains deficiencies such as uncompressed PNGs, broken accessibility, etc., which any decent hobbyist or old-school automated tooling would catch pretty quickly without "AI" magic. One run of e.g. Lighthouse shows that this is unusably poor, though for that one must focus on something other than "look, I am spending thousands to get ever more unaudited output".
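To illustrate how little tooling the low-hanging fruit requires, here is a minimal sketch that flags oversized PNGs (the "public" asset directory and the 200 KB threshold are my assumptions for illustration, not taken from the repo):

    import os

    SITE_ROOT = "public"         # hypothetical asset directory, adjust per project
    MAX_PNG_BYTES = 200 * 1024   # flag anything over ~200 KB

    for dirpath, _, filenames in os.walk(SITE_ROOT):
        for name in filenames:
            if name.lower().endswith(".png"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > MAX_PNG_BYTES:
                    # Candidates for lossless recompression or WebP/AVIF conversion.
                    print(f"{path}: {size / 1024:.0f} KB")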
LLMs for coding, even agentic processes with limited intervention, are incredibly powerful and valuable. But even though I audit every line of code I receive from a model, I have little to no emotional investment in said code and have no issue throwing it out completely if I find any fault with it, far more readily than before.
Despite all of that, rather than saying, "Yeah, this is poor, let's just get rid of it, thanks for pointing that out, egg on my face, let me just vibe code a better replacement now that I know what to look for", he became emotional and enraged, over code he never wrote.
gstack overall looks very odd to someone like me who does evals. I view it as built by someone who struggles to view these models through any lens beyond quantity = productivity, which is the exact opposite of my goals. I will always tend towards fewer output tokens of much higher quality. Faster, less expensive, easier to audit; what's there not to prefer?
In any case, if gstack makes LLMs struggle to create a maintainable blog (something these models, with all their flaws, most certainly can do), that should give major pause: maybe this isn't barking up the right tree. Setting gstack aside for a while and seeing that a solution in the hundreds of LOC is just as achievable (and likely better overall) might do a world of good.
Godspeed Garry, may we soon finish the DSM-VI with some new entries focused on the harm these LLMs can cause in certain people, so they may get the help they so desperately need. Alternatively, there is always starting his own FS and trying to get that into the Linux kernel...
If the vast majority of CEOs in this industry are to be believed, any company that achieves "AGI" will be undefeatable, its model improvements and research findings impossible to catch up to. Why risk that company being Anthropic, Moonshot, or any other competitor to OpenAI by spending your money on this?
In the few months/years before "Everyone Dies", wouldn't OpenAI want to be the "Anyone" that "builds it" and is in control during that time? Unless, of course, OpenAI does not actually believe that is a possibility, as was suspected when they were working on social media...
I admit I'm surprised by the move, from a company that reportedly just talked about how they need to focus more on fewer, more strategic products.
But I also see the potential value. This is an entertaining and highly influential podcast; a lot of top VCs and founders watch it, and it definitely punches well above its audience KPIs in strategic value. I've seen many interviews and op-eds on the platform pretty clearly shape the startup discourse on X.
I also think it should run mostly autonomously; it'll only be as much of a distraction for OpenAI execs as they want it to be.
OpenAI just raised $122 billion (including future commitments), so whatever the purchase price was (we have no idea), it is not going to be even a rounding error on their financial resources or their ability to pay their datacenter bills.
No disrespect to them, but unless there is a financial incentive at stake for them (beyond S&P 500 exposure), I've come to view this through the lens of sports teams, gaming consoles and religions. You pick your side early, guided by hype, and there is no way that choice can have been wrong (just like the Wii U, Dreamcast, etc. were the best).
Their viewpoint on this technology has unfortunately become part of some people's identity, and any position that isn't either "AGI imminent" or "this is useless" can trigger some major emotions.
Thing is, this finding (along with all the other LLM limits) does not mean these models aren't impactful or shouldn't be scrutinised, nor does it mean they are useless. The truth is likely just a bit more nuanced than either narrow extreme.
Also: the mental health impact, job losses for white-collar workers, privacy issues, rights holders' concerns about training data collection, all the present-day impacts of LLMs are easily brushed aside by someone who believes LLMs are near the "everyone dies" stage, which just so happens to be convenient if one were running a lab. The same goes if you believe these models are useless and will never get better: any discussion of real-life impacts is seen as an attempt to slowly get them to accept LLMs as a reality, when to them, they never were and never will be.
I have a friend, a Microsoft stan, who feels this way about LLMs too. He's convinced he'll become the most powerful, creative and productive genius of all time if he just manages to get the LLM workflow right.

He's retired, so I guess there's no harm in letting him try.
Respectfully, toddlers cannot output usable code, nor have they memorised the results of an immense number of maths equations.
What this points at is the abstraction/emergence crux of it all. Why does an otherwise very capable LLM such as the GPT-5 series, despite having been trained on vastly more examples of frontend code of all shapes, sizes and quality levels, struggle to abstract all that training data to the point of outputting any frontend that deviates from the examples it clearly leaned on?
If LLMs, as they are now, were comparable to human learning, there would be no scenario where a model that can output solutions to highly advanced equations cannot count properly.
Similarly, a model such as GPT-5, trained on nearly all frontend code ever committed to any repo online, would have internalised more than the one template OpenAI predominantly leaned on.
These models, and at this point I think there is little doubt about this, are impressive tools, but they still do not generalise or abstract information the way a human mind does. That doesn't make them less impactful for industries, etc., but it makes any comparison to humans not very suitable.
> What this points at is the abstraction/emergence crux of it all. Why does
This paper has nothing to do with any questions starting with "why". It provides a metric for quantifying error on specific tasks.
> If LLMs, as they are now, were comparable with human learning
I think I missed the part where they need to be.
> struggle to abstract all that training data to the point where outputting any frontend that deviates from the clearly used examples? ... a model such as GPT-5 trained on nearly all frontend code ever committed to any repo online, would have internalised more than that one template OpenAI predominantly leaned on
There is a very big and very important difference between producing the same thing again and not being able to produce something else. When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.
Long before AI there was this thing called Twitter Bootstrap. It dominated the web for... much longer than it should have. And that tragedy was done entirely by us meatsacks (not me personally). Where there's no goal calling for different output, there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).
[...] common trope that was proven false years ago by the existence of zero shot learning.
OK, that's better than comparing LLMs to humans. ZSL, however, has not proven anything of the sort; it was mainly concerned with assessing whether LLMs rely solely on precise instruction training or can generalise to a very limited degree beyond the initial tuning. That has never allowed for comparing human learning to LLM training.
Ironically, you are writing this under a paper that shows just that:
A model that cannot determine a short string's parity cannot have abstracted from the training data to arrive at the far more impressive and complicated maths challenges which it successfully solves in output. Some of the solutions we have seen in output require such innate understanding that, absent generalisation far deeper than ZSL has ever shown, they must come from training. Simple multiplication, etc., maybe, but not the tasks people such as Easy Riders [0] throw at these models.
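For context on just how trivial the failing task is, here is a minimal sketch of a string-parity check (the paper's exact task framing may differ; this is an assumption for illustration):

    def parity(bits: str) -> str:
        # Even number of 1s -> "even", odd number -> "odd".
        return "even" if bits.count("1") % 2 == 0 else "odd"

    assert parity("0110") == "even"
    assert parity("10110") == "odd"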
This paper shows exactly that: even with ZSL, these models only abstract in an incredibly limited manner, and many of the capabilities we see in the output are specifically trained, not generalised. Yes, generalisation can happen in a limited capacity, but no, it is not nearly enough to yield some of the results we are seeing. I have also never said, neither here nor in my initial comment, that LLMs are only capable of outputting what their training data provides; merely that, given what GPT-5 has been trained on, if these models gained any deeper abstraction during training, GPT-5 would be able to provide more than one frontend style.
Or to put it more simply: if the output can be useful for maths at the bachelor level and beyond, and this capability is generalised as you believe, these tasks would not be a struggle for the model.
> When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.
Ignoring the comparison with humans, yes, LLMs don't output something unless prompted specifically, of course. My point with GPT-5 was that, no matter how you prompt, you cannot get salvageable frontend code from this line of models.
OpenAI themselves tried and failed appallingly [0]. Call it "constraints", call it "reason", call it "prompting": you cannot get frontend code that deviates significantly from their card-laden training data. Despite GPT-5 having been trained on more high-quality frontend code examples than any human could read in a lifetime, that one template is overrepresented, because the model never generalised anything akin to an understanding of UI principles or of which code yields a specific design.
These are solvable problems, mind you, but not because a model at some stage gains anything one could call an abstract understanding of these concepts; rather, by providing better training data or being clever about how existing training data is provided.
Gemini 3 and Claude 4-class models have a more varied training set, specifically of frontend templates, yielding better results. But if you do any extended testing, you will see these repeat constantly, because again, these models never abstract beyond that template collection [1].
Moonshot, meanwhile, made a major leap with K2.5 by tying their frontend code tightly to visual input, leveraging the added vision encoder [2]. They are likely not the only ones doing that, but going by the system cards, they are the first to state it clearly. Even there, the gains are limited to a selection of very specific templates.
In either case, more specific data, not abstraction by these models, yields the improvements.
> Twitter Bootstrap [...] entirely by us meatsacks (not me personally). Where there's no goal for different output there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).
What? So because some devs relied on Bootstrap, that means... what, exactly? That no one asked/told them to leverage a different solution, to be more creative, what?
Again ignoring the comparison to humans, which just is not appropriate for this tech: we can and do prompt models for specific frontend output. We are, if you must, providing the goal. The model, however, cannot accomplish said goal; even OpenAI cannot get GPT-5's lineage to deviate from their one template.
If we must stick with the human comparison, and if we must further limit it to Bootstrap: GPT-5, despite being specifically prompted never to use the Bootstrap Carousel, cannot output any website without including a Carousel, because the template it was trained on included one. Any human developer asked to do so would simply not include a Carousel, because their abilities are abstracted beyond the one Bootstrap template they first learned with. But if we truly wanted to make this fair, it would have to be a human trained on thousands of Bootstrap example pages, yet on just one template really well, who never connected anything between that one and the others. Which isn't very human, but then again, that's why this comparison is not a solid one.
[0] Subjectively, not one good result; objectively, even their team of experts could not get their own model to avoid the telltale signs of GPT frontend slop originating from a template they have been training on since Horizon: https://developers.openai.com/blog/designing-delightful-fron...
I have made a commitment to reduce my overly long and excessively hedged comments on here, so, if I may: what the heck. Is this a belated April Fools' joke?
This is not what a company on the precipice of AGI or even one that has faith in LLMs being a consistent growth driver across the industry would realistically do.
Is this a good investment financially? I don't know and seeing as I have never heard of TBPN before this post, I am not the right person to gauge that.
But any investment, be it building your own social network (Sora 2), a news show, or anything else beyond model training, is frankly, to me at least, a clear admission that OpenAI does not see nearly as much value in models as they have been selling investors on.
Considering the rest of the economy, that is more terrifying than any "AI will kill us" prediction.
If OpenAI believed even a tenth of what they have tried to sell investors, governments and the public on, they'd not have a penny to invest in anything akin to this, plain and simple.
I think there will be an AI correction and OpenAI will be the center of it. I have no clue what their plan is; they seem to throw everything at the wall and nothing sticks. Gonna take MicroSlop down with them. Anthropic and Google will come out the other end in great shape, though.
Picking winners at this stage is hard, but maybe you are right on Alphabet and Anthropic. The former has use for data centres full of ML hardware in a way MSFT and AWS may not if LLMs crash, simply by virtue of YouTube and other services that relied on ML long before the hype started. Their buildout also started earlier, which may help them avoid overbuilding to keep up with the hype, but who knows.
On Anthropic, it is hard to tell from the outside whether they are just quieter about their buildouts or whether they truly rent most of their hardware, which could give them some flexibility if the market craters.
On MSFT, I'd rather they fix their products, at least to keep their mass of employees from being affected negatively.
I can only speculate of course, but I'd suspect a far simpler reason: driving subscriptions. Even limited video editing can fill up the free 5GB tier very quickly, and if iCloud's rather aggressive nudges to upgrade, and their success amongst the people around me, are anything to go by, there is a lot of potential in getting people to subscribe for a few bucks.
The scenario I envision is basically: a user edits a few 4K files together and starts approaching the 5GB limit. Because OneDrive is by default also used for other things, such as documents and even the desktop nowadays, they soon hit the 5GB limit and start being inundated with offers to upgrade to the 100GB tier. "Just 1.99 per month, nothing major, barely felt", at least that's the pitch. Maybe they acquiesce, maybe they ignore it. That's the play, I'd wager. And if enough people resist, maybe an upcoming update makes a full OneDrive lead to (artificially) degraded functionality; not in the sense of local storage becoming impossible, just slightly less convenient than going through OneDrive directly. Say what you want about the iCloud nagging (I have complained about that as well and will continue to), at least free iCloud storage isn't required to locally edit one's videos in iMovie.
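To put rough numbers on that (the bitrate is an assumption; real figures vary by device and codec), a back-of-the-envelope sketch:

    BITRATE_MBPS = 50   # assumed 4K HEVC bitrate; phones often record at 50-100 Mbps
    FREE_TIER_GB = 5

    # GB -> gigabits -> seconds of footage, then minutes.
    seconds = FREE_TIER_GB * 8 * 1000 / BITRATE_MBPS
    print(f"~{seconds / 60:.0f} minutes of 4K footage fills the free tier")  # ~13 min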
In any case, forcing OneDrive on users of a video editor (arguably one of the highest-storage-requirement programs most people will ever use at the moment) is anti-consumer and showcases how little any commitment [0] by them actually means. It took less than a week, which is honestly longer than I'd have expected...