Setting up fuzzing used to be hard. I haven't tried yet, but my bet is having Claude Code, today, analyze a codebase and suggest where and how to fuzztest it and having it review the crashes and iterate, will produce CVEs.
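The setup really can be tiny now. As a hedged sketch (pure Python, no fuzzing framework; `toy_parser` and its planted bug are invented for illustration), the core loop is just random inputs plus crash deduplication:

```python
import random
import string

def naive_fuzz(target, iterations=10_000, max_len=64, seed=0):
    """Throw random printable strings at `target` and collect unique crashes.

    `target` is any function taking a single str argument. Crashes are
    deduplicated by (exception type, message) and mapped to a triggering input.
    """
    rng = random.Random(seed)
    crashes = {}
    for _ in range(iterations):
        payload = "".join(
            rng.choice(string.printable) for _ in range(rng.randrange(max_len))
        )
        try:
            target(payload)
        except Exception as exc:
            crashes.setdefault((type(exc).__name__, str(exc)), payload)
    return crashes

# Toy target with a planted bug: chokes on any '{' character.
def toy_parser(s: str):
    if "{" in s:
        raise ValueError("unexpected '{'")
    return s.split(",")

found = naive_fuzz(toy_parser, iterations=5000)
```

Real fuzzers (libFuzzer, AFL++, Atheris) add coverage guidance and corpus mutation on top of this loop, which is exactly the part an agent can wire up and then babysit.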
It has access to more testing data than I will ever look at. Letting it pull from that knowledge graph is going to give you good results! I just built a chunk of this (type of thinking) into my (now evolving) test harness.
1. Unit testing is (somewhat) dead, long live simulation. Testing the parts only gets you so far. Simulation tests are far more durable, language-independent artifacts (read: if you went from JS to Rust, how much of your unit testing would carry over?)
2. Testing has to be "stand-alone". I want to be able to run it from the command line, and I want the output to be wrappable so I can shove it onto a web page or dump it into an API (for AI)
3. Messages (for failures) matter. They can't just be a bare "what's broken"; they must carry enough information to give context.
4. Your "failed" tests should include logs. Do you have enough breadcrumbs for production? If not, this is a problem that will bite you later.
5. Any case should be an accumulation of state and behavior - this really matters in simulation.
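A minimal sketch of points 2 through 5 (all names here are hypothetical, not from any real harness): a simulation case that accumulates state across steps, keeps breadcrumb logs, and emits one JSON blob you can print from the CLI, render on a page, or hand to an API:

```python
import json
import sys
import time

class SimCase:
    """One simulation case: accumulates state and breadcrumb logs as it
    steps, then emits a machine-readable result (point 2: the same JSON
    can feed a web page, an API, or an LLM)."""

    def __init__(self, name):
        self.name = name
        self.state = {}   # point 5: state accumulates across steps
        self.logs = []    # point 4: breadcrumbs travel with the failure
        self.failure = None

    def log(self, msg):
        self.logs.append({"t": time.time(), "msg": msg})

    def step(self, label, fn):
        if self.failure:
            return
        self.log(f"step: {label}")
        try:
            fn(self.state)
        except Exception as exc:
            # point 3: the failure record carries enough context to act on
            self.failure = {
                "step": label,
                "error": repr(exc),
                "state_at_failure": dict(self.state),
            }

    def result(self):
        return {
            "case": self.name,
            "ok": self.failure is None,
            "failure": self.failure,
            "logs": self.logs,
        }

# Running this from the command line just prints the JSON (point 2).
case = SimCase("checkout-flow")
case.step("add item", lambda s: s.update(cart=["widget"]))
case.step("apply coupon", lambda s: s.update(total=-5))  # planted bug
case.step("charge card", lambda s: 1 / 0 if s["total"] < 0 else None)
json.dump(case.result(), sys.stdout, indent=2)
```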
If you have done all the above right and your tool can return all the data, dumping the output into the cheapest model you can find and having it "write a prompt with recommendations on a fix (not actual code, just what should be done beyond 'fix this')" has been illuminating.
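A sketch of that step, assuming a harness that emits a JSON result with `failure` and `logs` fields (a hypothetical shape; the instruction wording is the only part taken from the comment). The actual call to a cheap model is left out since any client works once you have the prompt:

```python
import json

def build_fix_prompt(result: dict) -> str:
    """Turn one failing test result (the harness JSON) into a prompt for
    a cheap model: ask for recommendations, not code."""
    return "\n".join([
        "A simulation test failed. Write a prompt with recommendations",
        "on a fix: what should be done, not actual code, and more",
        "specific than 'fix this'.",
        "",
        "Failure:",
        json.dumps(result.get("failure"), indent=2),
        "",
        "Logs (breadcrumbs):",
        json.dumps(result.get("logs"), indent=2),
    ])

prompt = build_fix_prompt({
    "failure": {"step": "charge card", "error": "ZeroDivisionError"},
    "logs": [{"msg": "step: charge card"}],
})
```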
Ultimately I realized that how I thought about testing was wrong. Its output should be either dead simple, or have enough information that someone with zero knowledge could ramp up into a fix on their first day in the code base. My testing was never this good because the "cost of doing it that way" was always too high... this is no longer the case.
The answer is complex; worth watching the video. But mainly, they don't know where to draw the line. Defenders need tools as good as the attackers'. Attackers will jailbreak models and defenders might not, so is the safeguard a net positive in that case? Carlini actively asks the audience and community for "help" in determining how to proceed, basically.
When running long autonomous tasks it is quite frequent to fill the context, even several times. You are out of the loop so it just happens if Claude goes a bit in circles, or it needs to iterate over CI reds, or the task was too complex. I'm hoping a long context > small context + 2 compacts.
Yep I have an autonomous task where it has been running for 8 hours now and counting. It compacts context all the time. I’m pretty skeptical of the quality in long sessions like this so I have to run a follow on session to critically examine everything that was done. Long context will be great for this.
I get very useful code from long sessions. It’s all about having a framework of clear documentation, a clear multi-step plan including validation against docs and critical code reviews, acceptance criteria, and closed-loop debugging (it can launch/restart the app, control it, and monitor logs)
I am heavily involved in developing those, and then routinely let opus run overnight and have either flawless or nearly flawless product in the morning.
I haven't figured out how to make use of tasks running that long yet, or maybe I just don't have a good use case for it yet. Or maybe I'm too cheap to pay for that many API calls.
My change cuts across multiple systems with many tests/static analysis/AI code reviews happening in CI. The agent keeps pushing new versions and waits for results until all of them come up clean, taking several iterations.
Right. At Opus 4.6 rates, once you're at 700k context, each tool call costs ~$1 just for cache reads alone. 100 tool calls = $100+ before you even count outputs. 'Standard pricing' is doing a lot of work here lol
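The arithmetic, with the rates as stated assumptions (Opus-class input around $15 per million tokens, cache reads billed at 10% of that; check current pricing before trusting the numbers):

```python
# Assumed rates: $15/M input, cache reads billed at 10% => $1.50 per
# million cached tokens. Every tool call re-reads the whole context.
cache_read_per_mtok = 1.50
context_tokens = 700_000

cost_per_tool_call = context_tokens / 1_000_000 * cache_read_per_mtok
total_for_100_calls = 100 * cost_per_tool_call
# roughly $1.05 per tool call, so 100 calls is about $105 in cache
# reads alone, before any output tokens
```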
All of those things are smells imo; you should be very wary of any code output from a task that causes that much thrashing to occur. In most cases it’s better to rewind or reset and adapt your prompt to avoid the looping (which usually means a more narrowly defined scope)
A person has a supervision budget. They can supervise one agent in a hands-on way or many mostly-hands-off agents. Even though there's some thrashing, assistants still get farther as a team than a single micromanaged agent. At least that’s my experience.
Just curious, what kind of work are you doing where agentic workflows are consistently able to make notable progress semi-autonomously in parallel? Hearing people are doing this, supposedly productively/successfully, kind of blows my mind given my near-daily in-depth LLM usage on complex codebases spanning the full stack from backend to frontend. It's rare for me to have a conversation where the LLM (usually Opus 4.6 these days) lasts 30 minutes without losing the plot. And when it does last that long, I usually become the bottleneck in terms of having to think about design/product/engineering decisions; having more agents wouldn't be helpful even if they all functioned perfectly.
I've passed that bottleneck with a review task that produces engineering recommendations along six axes (encapsulation, decoupling, simplification, deduplication, security, reducing documentation drift) and an ideation task that gives, per component, a new feature idea, an idea to improve an existing feature, and an idea to expand a feature to be more useful. These two generate constant bulk work that I move into a new chat, where it's grouped by changeset and sent to a subagent to protect the context window.
What I'm doing mostly these days is maintaining a goal.md (project direction) and a spec.md (coding and process standards, global across projects), plus developing new macro tasks; I have one in progress that is meant to automatically build PNG mockups and self-review.
I work on a 1M LOC, 15-yr-old repo. Like you, it's across the full stack. Bugs in certain pieces of complex business logic would have catastrophic consequences for my employer. Basically I peel poorly-specified work items off my queue, each into its own worktree and session at high reasoning/effort, and provide a well-specified prompt.
These things eat into my supervision budget:
* LLM loses the plot and I have to nudge (like you)
* Thinking hard to better specify prompts (like you)
* Reviewing all changes (I do not vibe code except for spikes or other low-risk areas)
* Manual things I have to do (for things I have not yet automated with agent-authored scripts)
* Meetings
* etc
So, yes, my supervision budget is a bottleneck. I can only run 5-8 agents at a time because I have only so much time in the day.
Compare that vs a single agent at high reasoning/effort: I am sitting waiting for it to think. Waiting for it to find the code area I'm talking about takes time. Compiling, running tests, fixing compile errors. A million other things.
Any time I find myself sitting and waiting, this is a signal to me to switch to a different session.
This is a wild guess. I'm working with GIS and Claude has proven to be extremely savvy. I can see operators throwing hundreds of layers at it and it coming back with a "there's a possible military installation here". Same tech that is used to find unregulated pools, or measure the density of parking lots, or spot Nazca lines, but much more on-demand for specific purposes.
Just to clarify, I don't condone the use of AI for guessing targets, but I think that's what may be going on here.
Just curious, for what exact purpose are you using Claude? Does it analyze geographic data for you, or images, or something else? Or does it help you write code that does this? Or does it create visualizations for you?
All of those, applied to urban development. It analyzes GIS layers, then validates and extracts data that will be used in text reports; that mechanical extraction is a huge time saver, where before it was a bunch of manual steps on each project. I used it to develop a heavy GIS application that acts as a hub for many public data providers. And it does help us create the maps/visualizations from that data, again a sort of mechanical transformation. None of these are groundbreaking, but when you stack all of them you end up with big time savings.
QGIS has been a key piece of my career for the past 10 years. This year I'm launching a SaaS where QGIS is, again, the most fundamental piece. I'm only hoping everything goes right so I can contribute back to this project what it deserves. One of the big OSS stars. Thanks QGIS team.
It has ended up being a huge piece of the last 7 years of my life and I didn't really intend for that to be the case. I have a strong bias towards "use industry-standard protocols when possible", so when we started adding some significant geospatial components to the UAV system I work on, I pushed hard for us to use GeoJSON or Spatialite wherever possible (we have since also added some Parquet). From that foundation, I started doing analysis with GeoPandas, which works great when you know what you're looking for but not amazing just for data exploration. Enter QGIS: because we settled on standard open formats... I can just go "Add vector layer..." and load the entirety of a flight's geospatial data right on top of a Google Map without doing any kind of data conversion at all!
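For anyone unfamiliar with why "standard open formats" pays off here: a GeoJSON FeatureCollection is plain JSON (RFC 7946), so the same file loads in QGIS ("Add vector layer..."), GeoPandas, or a browser map with no conversion. A minimal sketch with made-up flight-telemetry fields:

```python
import json

# Minimal FeatureCollection in the standard GeoJSON shape (RFC 7946).
# Coordinates are [lon, lat]; the property names are hypothetical.
flight_track = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-122.42, 37.77]},
            "properties": {"alt_m": 120.0, "t": "2024-01-01T00:00:00Z"},
        },
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [[-122.42, 37.77], [-122.41, 37.78]],
            },
            "properties": {"leg": 1},
        },
    ],
}

# Serialize; saved as flight.geojson this opens directly in QGIS or
# via geopandas.read_file, with zero conversion steps.
geojson_text = json.dumps(flight_track)
```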
Does it have quirks? Yes. Many. QGIS is an incredibly powerful tool, and it has caused me to swear at so many different pieces of it :D. Looking forward to checking out QGIS 4 and see what they've been cooking.
What timing. I spent the whole weekend building a CI agentic workflow where I can let CC run wild with skip-permissions in isolated VMs while working async on a Gitea repo. I leave the CC instance with a decent-sized mission and it will iterate until CI is green and then create a PR for me to merge. I'm moving from talking synchronously to one Claude Code to managing a small group of collaborating Claudes.
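The control loop for that kind of setup can be sketched in a few lines. This is a hedged, framework-free version where `check` stands in for polling the repo's CI status and `fix_round` for one agent iteration pushing a new version (both are placeholders, not a real API):

```python
def iterate_until_green(check, fix_round, max_rounds=10):
    """Re-run a CI check until it passes, invoking `fix_round` (one
    agent iteration) after each red run. Returns the round on which CI
    went green, or raises once the budget is exhausted."""
    for attempt in range(1, max_rounds + 1):
        if check():
            return attempt
        fix_round(attempt)
    raise RuntimeError(f"still red after {max_rounds} rounds; time for a human")

# Simulated run: pretend the agent's second push fixes CI.
state = {"green": False}

def fake_fix(attempt):
    if attempt >= 2:
        state["green"] = True

rounds = iterate_until_green(lambda: state["green"], fake_fix)
```

The `max_rounds` cap matters: it is what turns "iterate until green" into something safe to leave unattended, since a looping agent hits the budget instead of burning tokens forever.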
I do it similarly and it only costs me my working time. I do some of these things outside my official working hours for free, but that is because I like the topic and like to have a good deployment pipeline. I doubt it is in any way a more significant time investment than administering GitHub, though.
In the end you need to write your deployment scripts yourself anyway, which takes the most time. Otherwise, for installation, the most time-consuming task is probably SSH key management for your users, if you don't have any fitting infrastructure.
I meant the CI agents, not AI agents. These are just runners that execute the stuff that needs to be done for deployment/testing or general CI. These rarely call AI agents because these tasks need to be deterministic. If so, you probably want to call a local model under your control running on your own compute power.
I've used CC with TypeScript, JavaScript and Python. Imo TypeScript gives the best results. Many times CC will be alerted and act based on the TypeScript compile process, another useful layer in its context.
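That feedback layer works because tsc's default diagnostic lines are easy to turn into structured context. A sketch (in Python for illustration; the format shown is tsc's standard `file(line,col): error TScode: message` shape):

```python
import re

# tsc emits diagnostics like:
#   src/app.ts(12,5): error TS2345: Argument of type 'string' is not ...
# This pulls out the fields an agent (or a human) needs to act on.
TSC_DIAG = re.compile(
    r"^(?P<file>.+?)\((?P<line>\d+),(?P<col>\d+)\): "
    r"(?P<severity>error|warning) (?P<code>TS\d+): (?P<message>.*)$"
)

def parse_tsc_output(text: str):
    """Return one dict per diagnostic line; non-matching lines are skipped."""
    return [
        m.groupdict()
        for m in (TSC_DIAG.match(line) for line in text.splitlines())
        if m
    ]
```

In practice you would feed `tsc --noEmit` output through this and hand the structured list back to the agent, so each fix round targets a concrete file, line, and error code.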
I decided to really learn what is going on, started with: https://karpathy.ai/zero-to-hero.html
That gives a useful background for understanding what the tool can do, what context is, and how models are post-trained. Context management is an important concept.
Then I gave a shot to several tools, including Copilot and Gemini, but followed the general advice to use Claude Code. It's way better than the rest at the moment.
And then I dove deep into the Claude Code documentation and various YouTube videos; there is plenty of good content out there. There are some ways to customize and increase the determinism of the process by using the tools properly.
Overall my process is: define a broad spec, including architecture. Heavy usage of standard libraries and frameworks is very helpful, as are typed languages. Create skills according to your needs, and use MCP to give CC a feedback mechanism; Playwright is a must for web development.
After the environment and initial seed are in place in the form of a clear spec, it's a process of iteration via conversation. My sessions tend to go: "Let's implement X, plan it." CC offers a few routes; I pick what makes most sense, or on occasion I explain the route I want to take. After the feature is implemented we go into a cleanup phase: we check whether anything might be getting out of hand, recheck security, and create tests. Repeat. Pick small battles instead of huge features. I'm doing quite a lot of hand-holding at the moment, saying a lot of "no", but the process is on another level compared with what I was doing before, and the speed I can get features out is insane.
I have been through Karpathy's work - however, I don't find that it helps with large scale development.
Your tactics work successfully for me at smaller scale (around 10k LOC) but start to break down beyond that, especially when refactorings are involved.
Refactoring happens when I see that the LLM is stumbling over its own decisions _and_ when I get a new idea. So the ability to refactor is a hard requirement.
Alternatively, refactoring could be achieved by starting over? But I have a hard time accepting that idea for projects > 100k LOC.
It is until it's not. That's the problem. The AI gets tripped up at some point, starts frigging tests instead of actually fixing bugs, starts looping then after several hours says it's not possible. If you're lucky.
Then on average your velocity is little better than if you just did it all by hand.
The "AI gets tripped up" phenomenon is something I've experienced, and I think it's again related to context usage. Using more agents and skills will reduce the pollution of the main context and delay the moment where things go weird. /clear after each small mission.
As said above, CC needs heavy guidance, but even with these issues, I'm way faster.
Some MCP's do give the models superpowers. Adding playwright MCP changed my CC from mediocre frontend skills, to really really good. Also, it gives CC a way to check what it's done, and many times correct obvious errors before coming back at you. Big leap.