This study came up on my feed but I'm not seeing much chatter on it. I'm interested in some perspectives from people who are better at reading if studies are quality or not.
> Abstract
>
> AI assistants, like GitHub Copilot and Cursor, are transforming software engineering. While several studies highlight productivity improvements, their impact on maintainability requires further investigation. [Objective] This study investigates whether co-development with AI assistants affects software maintainability, specifically how easily other developers can evolve the resulting source code. [Method] We conducted a two-phase controlled experiment involving 151 participants, 95% of whom were professional developers. In Phase 1, participants added a new feature to a Java web application, with or without AI assistance. In Phase 2, a randomized controlled trial, new participants evolved these solutions without AI assistance. [Results] Phase 2 revealed no significant differences in subsequent evolution with respect to completion time or code quality. Bayesian analysis suggests that any speed or quality improvements from AI use were at most small and highly uncertain. Observational results from Phase 1 corroborate prior research: using an AI assistant yielded a 30.7% median reduction in completion time, and habitual AI users showed an estimated 55.9% speedup. [Conclusions] Overall, we did not detect systematic maintainability advantages or disadvantages when other developers evolved code co-developed with AI assistants. Within the scope of our tasks and measures, we observed no consistent warning signs of degraded code-level maintainability. Future work should examine risks such as code bloat from excessive code generation and cognitive debt as developers offload more mental effort to assistants.
> …perspectives from people who are better at reading if studies are quality or not…
I'm by no means an expert at telling what the quality of a given research paper is.
But, without even reading the whole PDF [1] of this particular paper, it's super easy to tell that the authors are all from pro-vibe-coding consultancies [2][3][4].
Telerik, DevExpress, and a lot of other companies have made profitable businesses that have lasted well over a decade on that business premise. Selling solid and easy to integrate pre-made components has been a pretty good business for a while.
I wonder how they're doing now, then, since we don't have public stats about them (Telerik was acquired by Progress Software, a public company, but they don't break down revenue for Telerik specifically). Ultimately, this business of selling components is not sound in the age of AI.
Another thing to consider: it seems JS devs use AI for work more than, say, .NET devs, who may be in more old-school companies and industries. I can't verify this, but there seems to be a correlation between companies that use hip new CSS and JS frameworks and their AI usage, accelerating Tailwind Corp's cannibalization by AI: most vibe coders are building web apps from what I've seen, and Tailwind and React are very well represented in the training set.
> Another thing to consider, it seems JS devs use more AI for work than .NET devs for example, which might be in more old-school companies and industries.
Speaking from years of .NET work in state and federal government, the sort of dev groups that lean on Telerik or DevExpress have less leverage to build new things for themselves than you would expect, so the use of AI inside of them is predominantly for maintaining existing software. Decisions on how things get built at most public agencies still revolve around MS Access and WebForms due to a whole bunch of BS ordinances that legislators put in place; for those sorts of places a reliable vendor can absorb the blame if concerns surrounding accessibility, compliance, or security of your ancient web services crop up, while Claude and Codex put the liability back on your org.
Yeah I don't disagree that selling components is going to be hard business in the age of AI. Just mostly pointing out that it was a good business previously.
EVERY DX survey that comes out (surveying over 20k developers) says the exact same thing.
Staff engineers get the most time savings out of AI tools, and their weekly time savings is 4.4 hours for heavy AI users. That's a little more than a 10% productivity gain, so not anywhere close to 10x.
What's more telling about the survey results is that they are also consistent in their findings between heavy and light users of AI. Staff engineers who are heavy users of AI save 4.4 hours a week, while staff engineers who are light users of AI save 3.3 hours a week. To put it another way, the DX survey is pretty clear that the time savings between heavy and light AI users is minimal.
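Back-of-the-envelope, those survey figures work out to small fractions of a work week. A minimal sketch of the arithmetic, assuming a 40-hour week (my assumption, not the survey's):

```python
# Back-of-the-envelope on the DX survey numbers. The 40-hour week is my
# assumption, not the survey's; the hours saved per week are from the survey.
WEEK_HOURS = 40.0

heavy_savings = 4.4  # hours/week saved, staff engineers, heavy AI users
light_savings = 3.3  # hours/week saved, staff engineers, light AI users

heavy_gain = heavy_savings / WEEK_HOURS  # fraction of the week saved
light_gain = light_savings / WEEK_HOURS
gap_hours = heavy_savings - light_savings  # extra benefit of heavy use

print(f"heavy users save {heavy_gain:.0%} of the week")  # 11%
print(f"light users save {light_gain:.0%} of the week")
print(f"heavy-vs-light gap: {gap_hours:.1f} hours/week")
```

So even the best-reported group lands around an 11% gain, and going from light to heavy use buys roughly one extra hour a week.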
Yes, surveys are all flawed in different ways, but an N of 20k is nothing to sneeze at. Every study with actual data points shows that code generation is not a significant time saver, and zero studies show significant time savings from it. All the productivity gains DX reports come from debugging and investigation/codebase-spelunking help.
In my experience, the productivity measured in created merge requests increased massively.
More merge requests because the same senior developers are now creating more bugs, 4x compared to 2025. Same developers, same codebase, but now with Cursor!
Past survey results are hidden in some presentations I've seen, and I only have full access to the latest survey because my company pays for it. So I'm not sure it's legal for me to reproduce it.
I think there is going to be a 2-3 year lag in understanding how LLMs actually impact developer productivity. There are way too many balls in the air, and anyone claiming specific numbers on productivity increases is likely very, very wrong.
For example, citing staff engineers will introduce a bias: they have years of traditional training and are obviously not representative of software engineers in general.
FWIW I only mentioned staff engineers because the survey found staff+ engineers reported the highest time savings. The survey itself had weekly time-savings averages for junior (3.9 hours), mid-level (4.3), senior (4.1), and staff (4.4) engineers.
This is my main frustration with the pushback against visibility modifiers. It's treated as an all-or-nothing approach, as if any support for visibility modifiers locks everyone out from touching those fields.
It could just be a compiler error/warning that has to be explicitly opted into to touch those fields. This allows you to say "I know this is normally a footgun to modify these fields, and I might be violating an invariant condition, but I know what I'm doing".
"Visibility" invariably bites you in the ass at some point. See: Rust, newtypes and the Orphan rule, for example.
As such, I'm happy to not have visibility modifiers at all.
I do absolutely agree that "std" needs a good design pass to make it consistent--groveling in ".len" fields instead of ".len()" functions is definitely a bad idea. However, the nice part about Zig is that replacement doesn't need extra compiler support. Anyone can do that pass and everyone can then import and use it.
> This allows you to say "I know this is normally a footgun to modify these fields, and I might be violating an invariant condition, but I know what I'm doing".
Welcome to "Zig will not have warnings." That's why it's all or nothing.
It's the single thing that absolutely grinds my gears about Zig. However, it's also probably the single thing that can be relaxed at a later date and not completely change the language. Consequently, I'm willing to put up with it given the rest of the goodness I get.
Tbf, C++/C#-style public/private/protected is definitely too restrictive in many situations, and other, more flexible approaches are pretty much still research territory.
They could do something like OCaml modules and signatures but more permissive. Module author writes the public signature and that's what the type checker runs against, but if you want, you can "downcast" the module to the actual signature, revealing implementation details.
Fwiw that's not necessarily true, because if Sonnet ends up using reasoning, then you are using more tokens than GPT-4 would have used for the same task. Same with GPT-5 since it will decide (using an LLM) if it should use the thinking model for it (and you don't have as much control over it).
Right, but if I understand you, the counterargument is dumb, since the context we are discussing is business viability (VCs investing in businesses where the unit economics require inference-cost decreases), so actual dollars out, rather than some imaginary cost per token, is the metric that matters.
Inference is getting so much cheaper that Cursor and Zed have had to raise prices.
Why do the unit economics require a decrease in inference spend per user? This is discussed at the end of the post. I think this is based on the very strange assumption that these businesses must charge $20 a month no matter how much inference their customers want to use. This is precisely what the move to usage-based pricing was about. End users want to use more inference because they like it so much, and are knocking down these companies’ doors demanding to be allowed to pay them more money to get more inference.
> so actual dollars out rather than some imaginary cost per token is the metric that matters.
Even if we take this as true, the point is that this is different than "the cost of inference isn't going down." It is going down, it's just that people want more performance, and are willing to pay for it. Spend going up is not the same as cost going up.
I don't disagree that there are a wide variety of things to talk about here, but that means it's extra important to get what you're talking about straight.
Playing word games by labeling inference narrowly as the cost per token, rather than the per-X dollars going to your LLM API provider per customer/user/use/whatever, is kinda silly?
The cost of inference -- i.e. the dollars that go to your LLM API provider -- has increased and certainly appears to continue to increase.
> The cost of inference -- i.e. the dollars that go to your LLM API provider
This is the crux of it: when talking about "the cost of inference" for the purposes of the unit economics of the business, what's being discussed is not what they charge you. It's their COGS.
That's not word games. It's about being clear about what's being talked about.
Increased prices could certainly be talked about! But that's a different thing. For example, what you're describing here is total spend, not individual pricing going up or down. That's a third thing again!
You can't come to agreement unless you agree on what's being discussed.
I don't know about for people using CC on a regular basis, but according to `ccusage`, I can trivially go over $20 of API credits in a few days of hobby use. I'd presume if you are paying for a $200 plan then you know you have heavy usage and can easily exceed that.
It's important to look closely at the details of how these models actually do these things.
If you look at the details of how Google got gold at IMO, you'll see that AlphaGeometry only relies on LLMs for a very specific part of the whole system, and the LLM wasn't the core problem solving system in play.
Most of AlphaGeometry is standard algorithms at play, solving geometry problems using known constraints. When the algorithmic system gets stuck, it reaches out to LLMs that were fine-tuned specifically for creating new geometric constraints. The LLM creates new geometric constraints and passes them back to the algorithmic parts to get them unstuck, and the process repeats.
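The loop described above can be sketched roughly like this. This is a paraphrase of the published high-level description, not DeepMind's actual code; every name in it is hypothetical:

```python
# Hypothetical sketch of the AlphaGeometry-style loop described above:
# a symbolic deduction engine does the actual proving, and a fine-tuned
# LLM is consulted only to propose new auxiliary constructions when the
# engine gets stuck. All names here are made up for illustration.

def solve(problem, symbolic_engine, construction_lm, max_rounds=10):
    state = list(problem)
    for _ in range(max_rounds):
        proof = symbolic_engine(state)   # deterministic deduction pass
        if proof is not None:
            return proof                 # solved without further LLM help
        # Stuck: ask the LLM for a new construction (e.g. an auxiliary
        # point or line), add it, and let the symbolic engine retry.
        state.append(construction_lm(state))
    return None  # gave up after max_rounds
```

The key structural point is that the LLM never writes the proof; it only widens the search space for the symbolic engine.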
Without more details, it's not clear whether this win also reflects the GPT-5 and Gemini models we use, or specially fine-tuned models integrated with other non-LLM and non-ML systems.
Not being solved purely by an LLM isn't a knock on it, but in the current conversations around LLMs, these results are heavily marketed as "LLMs did this all by themselves," which doesn't match a lot of the evidence I've personally seen.
>This achievement is a significant advance over last year’s breakthrough result. At IMO 2024, AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit.
AlphaGeometry/AlphaProof (the one you're thinking of, where they used LLMs + Lean) was last year! And they "only" got silver. The IMO gold results this year were end-to-end in natural language.
My PR branches usually have a bunch of WIP commits, especially if I've worked on a PR across day boundaries. It's common for more complex PRs that I start down one path and then change to another, in which case a lot of the work that went into earlier commits is no longer relevant to the picture as a whole.
Once a PR has been submitted for review, I NEVER want to change previous commits and force push, because that breaks common tooling that other teammates rely on to see what changed since their last review. When you do a force push, they now have to review the full PR because they can't be guaranteed exactly which lines changed, and your commit messages for the old PR are now muddled.
Once the PR has been merged, I prefer it merged as a single squashed commit so it's reflective of the single atomic PR (because most of the intermediary commits have never actually mattered to debugging a bug caused by a PR).
And if I've already merged a commit to main, then I 100% don't want to rewrite the history of that other commit.
So personally I have never found the commit history of a PR branch useful enough that rewriting past commits was beneficial. The commit history of main is immensely useful, enough that you never want to rewrite that either.
> Once the PR has been merged, I prefer it merged as a single squashed commit so it's reflective of the single atomic PR (because most of the intermediary commits have never actually mattered to debugging a bug caused by a PR).
While working on a maintenance team, most of the projects we handled were on SVN, where we couldn't squash commits, and that has been a huge help enough times that I've turned against blind squashing in general. For example, once a bug was introduced during the end-of-work linting cleanup, and a couple of times after a code review suggestion. They were rarely-triggered edge cases (one came up several years after the code was changed; others were only revealed after a change somewhere else exposed them), but because there was no squash happening afterwards, it was easy to look at what should have been happening and quickly fix it.
By all means manually squash commits together to clean stuff up, but please keep the types of work separate. Especially once a merge request is opened, changes made from comments on it should not be squashed into the original work.
Going by your last comment, I wonder if this is just us talking past each other.
I try very hard to keep my PRs very focused on one complete unit of work at a time. So when the squash happens that single commit represents one type of change being made to the system.
So when going through history to pinpoint the cause of a bug, I can still see which logical change and unit of work caused it. I don't see the intermediary commits of that unit of work, but I have never personally gotten value out of that level of granularity (especially on team projects where each person's commit practices are different).
If I start working on one PR that grows to contain a refactor or change that makes sense to isolate, I'll make that its own PR, which will be squashed.
When you force push a PR, Gitlab shows the changes from the last push. So it also depends which forge you use. I could see that working less well on Github or simpler Git forges
Yeah, I don't have much experience outside of GitHub for team projects, so maybe GitLab works better. GitHub just gives up and claims it can't give you a diff since the last review.
GitHub is basically worst in class here, it’s true. Some forges are slightly better, others are way better. It’s so sad because I like GitHub overall but this is a huge weakness of it.
In my experience, people who say "no I don't [care], actually" and who squash their PRs as a matter of policy do not care and cannot be convinced to care.
Like the GP said
> > If not, it doesn’t matter.
You might as well just conclude that it’s subjective since there’s no progress to be made.
Since DeepSeek R1 is open weight, wouldn't it be better to check the napkin math by measuring how many realistic full LLM inferences can be done on a single H100 in a given time period, and calculating the token cost from that?
Without in-depth knowledge of the industry, the margin difference between input and output tokens in your napkin math versus the R1 prices seems very odd to me. That matters because any reasoning model explodes its reasoning tokens, which means you'll encounter a lot more output tokens for fewer input tokens, and that's going to heavily cut into the profit from the high-margin ("essentially free") input tokens.
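To make that concern concrete, here's a toy cost model. The per-million-token prices are made up for illustration (not R1's actual pricing, where output tokens also cost several times more than input tokens); the point is that reasoning tokens bill at the expensive output rate:

```python
# Toy cost model with made-up per-million-token prices; the point is the
# input/output ratio, not the absolute numbers. Reasoning tokens are
# billed as output tokens even though the user never sees them.
PRICE_IN = 0.50    # $ per 1M input tokens (hypothetical)
PRICE_OUT = 4.00   # $ per 1M output tokens (hypothetical)

def request_cost(input_toks, answer_toks, reasoning_toks=0):
    out_toks = answer_toks + reasoning_toks
    return (input_toks * PRICE_IN + out_toks * PRICE_OUT) / 1_000_000

# Same prompt, same visible answer; the only difference is reasoning.
plain = request_cost(input_toks=10_000, answer_toks=1_000)
reasoning = request_cost(input_toks=10_000, answer_toks=1_000,
                         reasoning_toks=8_000)

print(f"plain:     ${plain:.4f}")      # $0.0090
print(f"reasoning: ${reasoning:.4f}")  # $0.0410 -- over 4x for one answer
```

So any margin analysis that treats input tokens as the cost driver understates how quickly reasoning models shift spend onto the output side.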
I am so glad someone else called this out, I was reading the napkin math portions and struggling to see how the numbers really worked out and I think you hit the nail on the head. The author is assuming 'essentially free' input token cost and extrapolating in a business model that doesn't seem to connect directly to any claimed 'usefulness'. I think the bias on this is stated in the beginning of the article clearly as the author assumes 'given how useful the current models are...'. That is not a very scientific starting point and I think it leads to reasoning errors within the business model he posits here.
There were some oddities with the numbers themselves as well but I think it was all within rounding, though it would have been nice for the author to spell it out when he rounded some important numbers (~s don't tell me a whole lot).
TL;DR I totally agree, there are some napkin math issues going on here that make this pretty hard to see as a very useful stress test of cost.
I'm not the OP and I'm not saying you are wrong, but I am going to point out that the data doesn't necessarily back up significant productivity improvements with LLMs.
In this video (https://www.youtube.com/watch?v=EO3_qN_Ynsk) they present a slide by the company DX that surveyed 38,880 developers across 184 organizations, and found the surveyed developers claiming a 4 hour average time savings per developer per week. So all of these LLM workflows are only making the average developer 10% more productive in a given work week, with a bunch of developers getting less. Few developers are attaining productivity higher than that.
In this video by stanford researchers actively researching productivity using github commit data for private and public repositories (https://www.youtube.com/watch?v=tbDDYKRFjhk) they have a few very important data points in there:
1. They've found zero correlation between how productive respondents claim to be and how productive they actually measure as, meaning people are poor judges of their own productivity. This would refute my previous point, but only if you assume people are, on average, wildly more productive than they claim.
2. They have measured an actual increase in rework and refactoring commits in those repositories as AI tools come into wider use in the organizations. So even with being able to ship things faster, they observe an increased number of pull requests that need to fix those previous pushes.
3. They have measured pretty good productivity gains for greenfield, low-complexity systems, but once you get toward higher-complexity or brownfield systems, they measure much lower productivity gains, and even negative productivity with AI tools.
This goes hand in hand with this research paper: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... in which experienced devs on significant long-term projects lost productivity when using AI tools, while being convinced the AI tools were making them even more productive.
Yes, all of these studies have their flaws and nitpicks we could go over, which I'm not interested in rehashing. However, there are a lot more data and studies showing AI having a very marginal productivity boost compared to what people claim than vice versa. I'm legitimately interested in other studies that show significant productivity gains in brownfield projects.