Generally I think this happens when people don’t monitor for errors on a regular basis. People only notice if things are actively broken for customers, and tons of small non-fatal bugs slip through and build up over time.
It is not just startups or small companies embracing agentic engineering… Stripe published blog posts about their autonomous coding agents. Amazon is blowing up production because they gave their agents access to prod. Google and Microsoft develop their own agentic engineering tools. It’s not just tech companies either, massive companies are frequently announcing their partnerships with OpenAI or Anthropic.
You can’t just pretend it’s startups doing all the agentic engineering. They’re just the ones pushing the boundaries on best practices the most aggressively.
That is why a fully automated firm would be a paradigm shift. Instead of requiring someone to be responsible and to QA things, you just let AI systems be responsible internally, and hold the company as a whole responsible for legal concerns.
This idea of an automated firm relies on the premise that AI will become more capable and reliable than people.
In this regard, a company cannot be created without a single person tied to it, at least legally; even shell corporations have a person on record as responsible. So there needs to be some human who is a part of it. And in any "normal" organization, if a person is tied to the outcome of the company, they presumably care about it, and if the AI does good work 99.99% of the time but can still make mistakes, that person will still be checking off on all its work. Which leads to a system of people reviewing and signing off on work, not exactly a fully autonomous firm.
RL on LLMs has changed things. LLMs are not stuck in continuation predicting territory any more.
Models build up this big knowledge base by predicting continuations. But then their RL stage gives rewards for completing problems successfully. This requires learning and generalisation to do well, and indeed RL marked a turning point in LLM performance.
A year after RL was made to work, LLMs can now operate in agent harnesses over 100s of tool calls to complete non-trivial tasks. They can recover from their own mistakes. They can write 1000s of lines of code that works. I think it’s no longer fair to categorise LLMs as just continuation-predictors.
Thanks for saying this. It never ceases to amaze me how many people still talk about LLMs like it’s 2023, completely ignoring the RLVR revolution that gave us models like Opus that can one-shot huge chunks of works-first-time code for novel use cases. Modern LLMs aren’t just trained to guess the next token, they are trained to solve tasks.
Forget 2023: the advances in coding ability in just the last two months are amazing. But they are still not AGI, and it is almost certainly going to take more than just a new training regime such as RL to get there. Demis Hassabis estimates we need another 2-3 "transformer-level" discoveries to get there.
RL adds a lot of capability in the areas where it can be applied, but I don't think it really changes the fundamental nature of LLMs - they are still predicting training set continuations, but now trying to predict/select continuations that amount to reasoning steps steering the output in a direction that had been rewarded during training.
At the end of the day it's still copying, not learning.
RL seems to mostly only generalize in-domain. The RL-trained model may be able to generate a working C compiler, but the "logical reasoning" it had baked into it to achieve this still doesn't stop it from telling you to walk to the car wash, leaving your car at home.
There may still be more surprises coming from LLMs - ways to wring more capability out of them, as RL did, without fundamentally changing the approach, but I think we'll eventually need to adopt the animal intelligence approach of predicting the world rather than predicting training samples to achieve human-like, human-level intelligence (AGI).
You can’t really say it is just predicting continuations when it is learning to write proofs for Erdos problems, formalise significant math results, or perform automated AI research. Those are far beyond what you get by just being a copying and re-forming machine, a lot of these problems require sophisticated application of logic.
I don’t know if this can reach AGI, or if that term makes any sense to begin with. But to say these models have not learnt from their RL seems a bit ludicrous. What do you think training to predict when to use different continuations is other than learning?
I would say LLM’s failure cases like failing at riddles are more akin to our own optical illusions and blind spots rather than indicative of the nature of LLMs as a whole.
I think you're conflating mechanism with function/capability.
I'm not sure what I wrote that made you conclude that I thought these models are not learning anything from their RL training?! Let me say it again: they are learning to steer towards reasoning steps that during training led to rewards.
The capabilities of LLMs, both with and without RL, are a bit counter-intuitive, and I think that, at least in part, comes down to the massive size of the training sets and the even more massive number of novel combinations of learnt patterns they can therefore potentially generate...
In a way it's surprising how FEW new mathematical results they've been coaxed into generating, given that they've probably encountered a huge portion of mankind's mathematical knowledge, and can potentially recombine all of these pieces in at least somewhat arbitrary ways. You might have thought that there are results A, B and C hiding away in some obscure mathematical papers that no human has thought to put together before (just because of the vast number of such potential combinations), that might lead to some interesting result.
If you are unsure yourself about whether LLMs are sufficient to reach AGI (meaning full human-level intelligence), then why not listen to someone like Demis Hassabis, one of the brightest and best placed people in the field to have considered this, who says the answer is "no", and that a number of major new "transformer-level" discoveries/inventions will be needed to get there.
> they are still predicting training set continuations
But this is underselling what they do. Probably a large part of what they predict is learnt from their training set, but RL has added a layer on top that does not come from mimicry alone.
Again, I doubt this is enough for “AGI” but I think that term is not very well-defined to begin with. These models have now shown they are capable of novel reasoning, they just have to be prodded in the right way.
It’s not clear to me that there isn’t scaffolding that can use LLMs to search for novel improvements, like Karpathy’s recent autoresearch. The models, with the help of RL, seem to be getting to the point where this actually works to some extent, and I would expect this to happen in other fields in the next few years as well.
In general there's a difference between novel and discovering something new.
Pretraining has given the LLM a huge set of lego blocks that it can assemble in a huge variety of ways (although still limited by the "assembly patterns" it has learnt). If the LLM assembles some of these legos into something that wasn't directly in the training set, then we can call that "novel", even though everything needed to do it was present in the training set. I think maybe a more accurate way to think of this is that these "novel" lego assemblies are all part of the "generative closure" of the training set.
Things like generating math proofs are an example of this - the proof itself, as an assembled whole, may not be in the training set, but all the piece parts and thought patterns necessary to construct the proof were there.
I'm not much impressed with Karpathy's LLM autoresearch! I guess this sort of thing is part of the day to day activities of an AI researcher, so might be called "research" in that regard, but all he's done so far is just hyperparameter tuning and bug fixing. No doubt this can be extended to things that actually improve model capability, such as designing post-training datasets and training curriculums, but the bottleneck there (as any AI researcher will tell you) isn't the ideas - it's the compute needed to carry out the experiments. This isn't going to lead to the recursive self-improvement singularity that some are fantasizing about!
I would say these types of "autoresearch" model improvements, and pretty much anything current LLMs/agents are capable of, all fall under the category of "generative closure", which includes things like tool use that they have been trained to do.
It may well be possible to retrofit some type of curiosity onto LLMs, to support discovery and go beyond the "generative closure" of things they already know, and I expect that's the sort of thing we may see from Google DeepMind in the next 5 years or so in their first "AGI" systems - hybrids of LLMs and hacks that add functionality but don't yet have the elegance of an animal cognitive architecture.
You laid out the theoretical limitations well, and I tend to agree with them.
I just get frustrated when people downplay how big of an impact filling in the gaps at the frontier of knowledge would have. 99.9% of researchers will never have an idea that adds a new spike to the knowledge frontier (rather than filling in holes), and 99.99% of research is just filling in gaps by combining existing ideas (numbers made up). In this realm, autoresearch may not be groundbreaking, but it can do the job. AlphaEvolve is similar.
If LLMs can actually get closer to something like that, it leaves human researchers a whole lot more time to focus on new ideas that could move entire fields forward. And their iteration speed can be a lot faster if AI agents can help with the implementation and testing of them.
> What do you think training to predict when to use different continuations is other than learning?
Sure, training = learning, but the problem with LLMs is that that is where it stops, other than a limited amount of ephemeral in-context learning/extrapolation.
With an LLM, learning stops post-training when it is "born" and deployed, while with an animal that's when it starts! The intelligence of an animal is a direct result of its lifelong learning, whether that's imitation learning from parents and peers (and subsequent experimentation to refine the observed skill), or the never ending process of observation/prediction/surprise/exploration/discovery which is what allows humans to be truly creative - not just behaving in ways that are endless mashups of things they have seen and read about other humans doing (cf training set), but generating truly novel behaviors (such as creating scientific theories) based on their own directed exploration of gaps in mankind's knowledge.
Application of AGI to science and new discovery is a large part of why Hassabis defines AGI as human-equivalent intelligence, and understands what is missing, while others like Sam Altman are content to define AGI as "whatever makes us lots of money".
Memory systems built on top of LLMs could provide continual learning. I do not agree that it is some fundamental limitation.
Claude Code already writes its own memory files. And people already finetune models. There is clear potential to use the former as a form of short-term memory and the latter for long-term “learning”.
The main blockers to this are that models aren’t good enough at managing their own memory, and finetuning is expensive and difficult. But both of these seem like solvable engineering problems.
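The short-term half of this already works with very simple machinery. A minimal sketch, assuming a hypothetical JSONL memory file that the agent appends notes to during a session and re-reads at startup (the file name and record schema here are invented for illustration, not how Claude Code's memory files actually work):

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.jsonl")  # hypothetical per-project memory file

def remember(note, tags=()):
    """Append a structured note; the agent re-reads these at session start."""
    with MEMORY_FILE.open("a") as f:
        f.write(json.dumps({"note": note, "tags": list(tags)}) + "\n")

def recall(tag=None):
    """Load notes back, optionally filtered by tag, to prepend to the context."""
    if not MEMORY_FILE.exists():
        return []
    notes = [json.loads(line) for line in MEMORY_FILE.open()]
    return [n for n in notes if tag is None or tag in n["tags"]]
```

The long-term half would then be periodically distilling accumulated notes into fine-tuning data, which is where the expense comes in.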
Continual learning isn't a "fundamental limitation" or unsolvable problem. Animal brains are an existence proof that it's possible, but it's tough to do, and quite likely SGD is not the way to do it, so any attempt to retrofit continual learning to LLMs as they exist today is going to be a hack...
Memory and learning are two different things. Memorization is a small subset of learning. Memorizing declarative knowledge and personal/episodic history (cf. LLM context) are certainly needed, but an animal (or AI intern) also needs to be able to learn procedural skills which need to become baked into the weights that are generating behavior.
Fine tuning is also no substitute for incremental learning. You might think of it as addressing somewhat the same goal, but really fine tuning is about specializing a model for a particular use, and if you repeatedly fine tune a model for different specializations (e.g. what I learnt yesterday, vs what I learnt the day before) then you will run into the catastrophic forgetting problem.
I agree that incremental learning seems more like an engineering problem rather than a research one, or at least it should succumb to enough brain power and compute put into solving it, but we're now almost 10 years into the LLM revolution (attention paper in 2017) and it hasn't been solved yet - it's not easy.
Fundamentally, I’m more optimistic on how far current approaches can scale. I see no reason why RL could not be used to train models to use memory, and fine-tuning already works, it’s just expensive.
The continual learning we get may be a bit hamfisted, and not fit into a neat architecture, but I think we could actually see it work at scale in the next few years. Whereas new techniques like what Yann LeCun has demonstrated still live heavily in the realm of research. Cool, but not useful yet.
Fine tuning is also not so limited as you suggest. For one, we don’t need to fine tune the same model over and over, you can just start with a frontier model each time. And two, modern models are much better at generating synthetic data or environments for RL. This could definitely work, but it might require a lot of work in data collection and curation, and the ROI is not clear. But if large companies continue to allocate more and more resources to AI in the next few years, I could see this happening.
OpenAI already has a custom model service, and labs have stated they already have custom models built for the military (although how custom those models are is unclear). It doesn’t seem like a huge leap to also fine-tune models on a company’s internal codebases and tooling. Especially for large companies like Google, Amazon, or Stripe that employ tens of thousands of software engineers.
Concerns about wasting maintainers’ time, onboarding, or copyright are of great interest to me from a policy perspective. But I find some of the debate around the quality of AI contributions to be odd.
Quality should always be the responsibility of the person submitting changes. Whether a person used LLMs should not be a large concern if someone is acting in good-faith. If they submitted bad code, having used AI is not a valid excuse.
Policies restricting AI-use might hurt good contributors while bad contributors ignore the restrictions. That said, restrictions for non-quality reasons, like copyright concerns, might still make sense.
The core issue is that it takes a large amount of effort to even assess this, because LLM generated code looks good superficially.
It is said that static FP languages make it hard to implement something if you don't really understand what you are implementing, while dynamically typed languages make it easier to implement something you don't fully understand.
LLMs take this to another level: they enable one to implement something with zero understanding of what they are implementing.
The people likely to submit low-effort contributions are also the people most likely to ignore policies restricting AI usage.
The people following the policies are the most likely to use AI responsibly and not submit low-effort contributions.
I’m more interested in how we might allow people to build trust so that reviewers can spend time productively on their contributions, whilst avoiding wasting reviewers’ time on drive-by contributors. This seems like a hard problem.
The real invariant is responsibility: if you submit a patch, you own it. You should understand it, be able to defend the design choices, and maintain it if needed.
Ownership and responsibility are useless when a YouTuber tells their million followers that GitHub contributions are valued by companies and this is how you can create a pull request with AI in three minutes, and you get a hundred low-value noise PRs opened by university students from the other side of the globe. It’s Hacktoberfest on steroids.
"You committed it, you own it" can't even be enforced effectively at large companies, given employee turnover and changes in team priorities and recorgs. It's hard to see how this could be done effectively in open source projects. Once the code is in there, end users will rely on it. Other code will rely on it. If the original author goes radio silent it still can't be ripped out.
Trusted contributors using LLMs do not cause this problem though. It is the larger volume of low-effort contributions causing this problem, and those contributors are the most likely to ignore the policies.
Therefore, policies restricting AI-use on the basis of avoiding low-quality contributions are probably hurting more than they’re helping.
I'm not sure I agree. If you have a blanket "you must disclose how you use AI" policy it's socially very easy to say "can you disclose how you used AI", and then if they say Claude code wrote it, you can just ignore it, guilt-free.
Without that policy it feels rude to ask, and rude to ignore in case they didn't use AI.
I’d argue this social angle is not very nuanced or effective. Not all people who used Claude Code will be submitting low-effort patches, and bad-faith actors will just lie about their AI-use.
For example, someone might have done a lot of investigation to find the root cause of an issue, followed by getting Claude Code to implement the fix, which they then tested. That has a good chance of being a good contribution.
I think tackling this from the trust side is likely to be a better solution. One approach would be to only allow new contributors to make small patches. Once those are accepted, then allow them to make larger contributions. That would help with the real problem, which is higher volumes of low-effort contributions overwhelming maintainers.
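A sketch of how such a tiered policy might be automated in a CI bot. The thresholds and function names here are entirely made up; real projects would tune them to their own review capacity:

```python
# Hypothetical policy: gate patch size on a contributor's merge history.
MAX_NEW_CONTRIBUTOR_LINES = 50   # assumed threshold, not from any real project
TRUSTED_AFTER_MERGES = 3         # merged PRs before size limits are lifted

def patch_allowed(merged_pr_count: int, changed_lines: int) -> bool:
    """New contributors may only submit small patches until trust is earned."""
    if merged_pr_count >= TRUSTED_AFTER_MERGES:
        return True
    return changed_lines <= MAX_NEW_CONTRIBUTOR_LINES
```

The appeal of a rule like this is that it targets the actual failure mode (volume of large unreviewed patches) rather than the tool used to produce them.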
We have been using something similar for editing Confluence pages. Download XML, edit, upload. It is very effective, much better than direct edit commands. It’s a great pattern.
You can use the Copilot CLI with the Atlassian MCP to super easily edit/create Confluence pages. After having the agent complete a meaningful amount of work, I have it go create a Confluence page documenting what has been done. Super useful.
I'm afraid I can't easily share this, as we have embedded a lot of company-specific information in our setup, particularly for cross-linking between confluence/jira/zendesk and other systems. I can try to explain it, though, and then Claude Code is great at implementing these simple CLI tools and writing the skills.
We wrote CLIs for Confluence, Jira, and Zendesk, with skills to match. We use a simple OAuth flow for users to log in (e.g., they would run jira login). Then confluence/jira/zendesk each have REST APIs to query pages/issues/tickets and submit changes, which is what our CLIs use. Claude Code was exceptional at finding the documentation for these and implementing them. It only took a couple of days to set these up and Claude Code is now remarkably good at loading the skills and using the CLIs. We use the skills to embed a lot of domain-specific information about projects, organisation of pages, conventions, standard workflows, etc.
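For the Confluence case, a hedged sketch of the requests such a CLI might build against the Confluence Cloud REST content API. The example.atlassian.net base URL and bearer-token auth are placeholders; a real setup would use the OAuth flow described above, and the exact endpoint paths may differ by Confluence version:

```python
import json
import urllib.request

BASE = "https://example.atlassian.net/wiki/rest/api/content"  # placeholder site

def get_request(page_id, token):
    """Build a request for a page body in storage format plus its version."""
    url = f"{BASE}/{page_id}?expand=body.storage,version"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

def put_request(page_id, title, new_body, old_version, token):
    """Build the update request; Confluence requires the version to be bumped."""
    payload = {
        "id": page_id,
        "type": "page",
        "title": title,
        "version": {"number": old_version + 1},
        "body": {"storage": {"value": new_body, "representation": "storage"}},
    }
    return urllib.request.Request(
        f"{BASE}/{page_id}",
        data=json.dumps(payload).encode(),
        method="PUT",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
```

The read-modify-write cycle (fetch storage-format body, edit, PUT with an incremented version) is the same pattern as the "download XML, edit, upload" workflow mentioned earlier in the thread.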
Being able to embed company-specific links between services has been remarkably useful. For example, we look for specific patterns in pages like AIT-553 or zd124132 and then can provide richer cross-links to Jira or Zendesk that help agents navigate between services. This has made agents really efficient at finding information, and it makes them much more likely to actually read from multiple systems. Before we made changes like this, they would often rabbit-hole only looking at confluence pages, or only looking at jira issues, even when there was a lot of very relevant information in other systems.
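A sketch of the kind of pattern matching this involves. The regexes and URL templates below are invented placeholders, not the real company-specific mapping:

```python
import re

# Hypothetical ticket patterns mapped to URL templates (placeholders).
PATTERNS = {
    re.compile(r"\b([A-Z]{2,}-\d+)\b"): "https://example.atlassian.net/browse/{}",  # Jira keys like AIT-553
    re.compile(r"\bzd(\d+)\b"): "https://example.zendesk.com/agent/tickets/{}",     # Zendesk refs like zd124132
}

def cross_links(text):
    """Scan free text for ticket references and return resolvable URLs."""
    links = []
    for pattern, template in PATTERNS.items():
        for match in pattern.findall(text):
            links.append(template.format(match))
    return links
```

Surfacing these links next to the raw text is what nudges an agent to actually follow a Jira reference it finds inside a Confluence page, rather than staying in one system.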
My favourite is the confluence integration though, as I like to record a lot of worklog-style information in there that I would previously write down as markdown files. It's nicer to have these in Confluence as then they are accessible no matter what repo I am working in, what region I am working in, or what branch or feature I'm working on. I've been meaning to try to set something similar up for my personal projects using the new Obsidian CLI.
We have been doing something similar, but it sounds like you have come further along this way of working. We (with help from Claude) have built a tool like the one you describe to interface with our task- and project-management system, and use it together with the GitLab and GitHub CLI tools to let agents read tickets, formulate a plan, create solutions, and open MRs/PRs in the relevant repos. Most of our knowledge base is in Markdown, but some of it is tied up in Confluence, which is why I have an interest in that part. And some workflows are even in Google Docs, which makes the OP tool interesting as well: currently our tool outputs Markdown and we just "paste from markdown" into Google Docs. We might be able to revise and improve that too.
Thank you! Sounds like a fantastic setup. Are the claude code agents acting autonomously from any trigger conditions or is this all manual work with them? And how do you manage write permissions for documents amongst team members/agents, presumably multiple people have access to this system?
(Not OP, but have been looking into setting up a system for a similar use case)
This is all manual, so people ask their agent to load Jira issues, edit Confluence pages, etc. Users sign in with their own accounts via the CLIs, so the agents inherit their permissions. Then we have the Claude Code permissions set up so any write commands are on the Ask list, meaning it always prompts the user before running them.
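For reference, a Claude Code settings fragment implementing this might look roughly like the following. The jira/confluence/zendesk command names are the hypothetical in-house CLIs from this thread, and the exact permission-rule schema may differ between Claude Code versions:

```json
{
  "permissions": {
    "allow": [
      "Bash(jira view:*)",
      "Bash(confluence get:*)"
    ],
    "ask": [
      "Bash(jira update:*)",
      "Bash(confluence put:*)",
      "Bash(zendesk:*)"
    ]
  }
}
```

Reads go through without friction while every write surfaces a confirmation prompt, which matches the "agents inherit user permissions, humans approve mutations" setup described above.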
I think this is the main point where many people’s work differs. Most of my work I know roughly what needs changing and how things are structured but I jump between codebases often enough that I can’t always remember the exact classes/functions where changes are needed. But I can vaguely gesture at those specific changes that need to be made and have the AI find the places that need changing and then I can review the result.
I rarely get the luxury of working in a single codebase for a long enough period of time to get so familiar with it that I can jump to particular functions without much thought. That means AI is usually a better starting point than me fumbling around trying to find what I think exists but I don’t know where it is.
ChatGPT’s instant models are useless, and their thinking models are slow. This makes Claude more pleasant to use, despite them not being SOTA.
But ChatGPT is still SOTA in search and hard problem solving.
GPT-5.2 Pro is the model people are using to solve Erdos problems after all, not Claude or Gemini. The Thinking models are noticeably better if you have difficult problems to work on, which justifies my sub even if I use Claude for everything else. Their Codex models are also much smarter, but also less pleasant to use.
IME ChatGPT is pretty mid at search. Grok although significantly dumber, is really strong at diligently going through hundreds of search results, and is much more tuned to rely on search results instead of its internal knowledge (which depending on the case can be better or worse). It's the only situation where Grok is worth using IMO.
Gemini is really good with many topics. Vastly superior to ChatGPT for agronomy.
You should always use the best model for the job, not just stick to one.
People use UIs for git despite it working so well in the terminal... Many people I knew at uni doing computer science wouldn’t even know what tmux is. I would bet that the demand for these types of UIs is going to be a lot bigger than the demand for CLI tools like Claude Code. People already rave about cowork and the new codex UI. This falls into the same category.