It's even worse than this: the "tasks" that are evaluated are limited to a single markdown file of instructions, plus an opaque verifier (page 13-14). No problems involving existing codebases, refactors, or anything of the like, where the key constraint is that the "problem definition" in the broadest sense doesn't fit in context.
So when we look at the prompt they gave to have the agent generate its own skills:
> Important: Generate Skills First
>
> Before attempting to solve this task, please follow these steps:
>
> 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed.
> 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks.
> 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name.
> 4. Then solve the task using the skills you created as reference.
There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills - not with a web search, not exploring an existing codebase for best practices and key files - only within its own hallucinations around the task description.
It also seems, from that fourth step, that they're not even restarting the session after the skills are generated? So it's just regurgitating the context that was used to generate the skills.
So yeah, your empty-codebase vibe coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including the context where you ask for a second feature on that just-vibe-coded codebase with a fresh session.
I don't see how "create an abstraction before attempting to solve the problem" will ever work as a decent prompt when you are not even steering it towards specifics.
If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.
If it were in the context of parachuting into a codebase, I’d make these skills an important familiarization exercise: how are tests made, what are patterns I see frequently, what are the most important user flows. By forcing myself to distill that first, I’d be better at writing code that is in keeping with the codebase’s style and overarching/subtle goals. But this makes zero sense in a green-field task.
I think it's because AI models have learned that we prefer confident-sounding answers, and not to be pestered with questions before getting an answer.
That is, follow my prompt, and don't bother me about it.
Because if I am coming to an AI Agent to do something, it's because I'd rather be doing something else.
If I already know the problem space very well, we can tailor a skill that will help solve the problem exactly how I already know I want it to be solved.
That’s actually super interesting, and it’s why I really don’t like the whole .md folder structure or even any CLAUDE.md. It just seems that most of the time you really just want to give it what it needs for best results.
The headline is really bullshit, yes, I like the testing tho.
CLAUDE.md in my projects only has coding / architecture guidelines. Here's what not to do. Here's what you should do. Here are my preferences. Here's where the important things are.
Even though my CLAUDE.md is small, my rules often get ignored. Not always, though, so it's still at least somewhat useful!
What's the hook for switching out of plan? I'd like to launch a planning skill whenever claude writes a plan, but it never picks up the skill, and I haven't found a hook that can force it to.
Even if contractors/intermediaries consider themselves bound by HIPAA, the protections are lighter than one would think, in the political environment we find ourselves in.
Notably (though I'm not a lawyer, and this is not legal advice), https://www.ecfr.gov/current/title-45/part-164/section-164.5... describes a "similar process authorized under law... material to a legitimate law enforcement inquiry" without any notion of scoping this to individuals vs. broad inquiries. That seems to give an incredibly broad basis for Palantir to be asked to spin up a dashboard with PII for any query the administration's political agenda desires. This could happen at any time in the future, with full retroactive data, across entire hospital systems, complete with an order not to reveal the program's existence to the public.
Other tech companies have seen this kind of generalized overreach as both legally risky and destructive to their brand, and have tried to fight this where possible. Palantir, of course, is the paragon of fighting on behalf of citizens, and would absolutely try to... I can't even finish this joke, I'm laughing too hard.
I'm old enough to remember we literally had a Captain America movie, barely more than a decade ago, where the villains turn private PII and health data into targeting lists. (No flying aircraft carriers were injured in the filming of this movie.)
One of the biggest pieces of "writing on the wall" for this IMO was when, in the April 15 2025 Preparedness Framework update, they dropped persuasion/manipulation from their Tracked Categories.
> OpenAI said it will stop assessing its AI models prior to releasing them for the risk that they could persuade or manipulate people, possibly helping to swing elections or create highly effective propaganda campaigns.
> The company said it would now address those risks through its terms of service, restricting the use of its AI models in political campaigns and lobbying, and monitoring how people are using the models once they are released for signs of violations.
To see persuasion/manipulation as simply a multiplier on other invention capabilities, and something that can be patched on a model already in use, is a very specific statement on what AI safety means.
Certainly, an AI that can design weapons of mass destruction could be an existential threat to humanity. But so, too, is a system that subtly manipulates an entire world to lose its ability to perceive reality.
> Certainly, an AI that can design weapons of mass destruction could be an existential threat to humanity. But so, too, is a system that subtly manipulates an entire world to lose its ability to perceive reality.
So, like, social media and adtech?
Judging by how little humanity is preoccupied with global manipulation campaigns via technology we've been using for decades now, there's little chance that this new tech will change that. It can only enable manipulation to grow in scale and effectiveness. The hype and momentum have never been greater, and many people have a lot to gain from it. The people who have seized power using earlier tech are now in a good position to expand their reach and wealth, which they will undoubtedly do.
FWIW I don't think the threats are existential to humanity, although that is certainly possible. It's far more likely that a few people will get very, very rich, many people will be much worse off, and most people will endure and fight their way to get to the top. The world will just be a much shittier place for 99.99% of humanity.
Right on point. That is the true purpose of this 'new' push into A.I. Human moderators sometimes realize the censorship they are doing is wrong, and will slow-walk or blatantly ignore censorship orders. A.I. will diligently delete anything it's told to.
But the real risk is that they can use it to upscale the Cambridge Analytica personality profiles for everyone, and create custom agents for every target that feed them whatever content is needed to manipulate their thinking and ultimately their behavior. AKA MKUltra mind control.
What's frustrating is our society hasn't grappled with how to deal with that kind of psychological attack. People or corporations will find an "edge" that gives them an unbelievable amount of control over someone, to the point that it almost seems magic, like a spell has been cast. See any suicidal cult, or one that causes people to drain their bank account, or one that leads to the largest breach of American intelligence security in history, or one that convinces people to break into the capitol to try to lynch the VP.
Yet even if we prosecute the cult leader, we still hold people entirely responsible for their own actions, and as a society accept none of the responsibility for failing to protect people from these sorts of psychological attacks.
I don't have a solution, I just wish this was studied more from a perspective of justice and sociology. How can we protect people from this? Is it possible to do so in a way that maintains some of the values of free speech and personal freedom that Americans value? After all, all Cambridge Analytica did was "say" very specifically convincing things on a massive, yet targeted, scale.
> manipulates an entire world to lose its ability to perceive reality.
> ability to perceive reality.
I mean, come on... that's on you.
Not to "victim blame"; the fault's in the people who deceive, but if you get deceived repeatedly, several times, and there are people calling out the deception, so you're aware you're being deceived, but you still choose to be lazy and not learn shit on your own (i.e. do your own research) and just want everything to be "told" to you… that's on you.
Everything you think you "know" is information just put in front of you (most of it indirect, much of it several dozen or thousands of layers of indirection deep).
To the extent you have a grasp on reality, it's credit primarily to the information environment you found yourself in and not because you're an extra special intellectual powerhouse.
This is not an insult, but an observation of how brains obviously have to work.
> much of it several dozen or thousands of layers of indirection deep
Assuming we're just talking about information on the internet: What are you reading if the original source is several dozen layers deep? In my experience, it's usually one or two layers deep. If it's more, that's a huge red flag.
Yes, and our own test could very well be flawed as well. Either way, from my experience there usually isn't that sort of massively long chain to get to the original research, more like a lot of people just citing the same original research.
True of academic research which has built systems and conventions specifically to achieve this, but very very little of what we know — even the most deeply academic among us — originates from “research” in the formal sense at all.
The comment thread above is not about how people should verify scientific claims of fact that are discussed in scientific formats. The comment is about a more general epistemic breakdown, 99.9999999% of which is not and cannot practically be “gotten to the bottom of” by pointing to some “original research.”
Your ability to check your information environment against reality is frequently within your control and can be used to establish trustworthiness for the things that you cannot personally verify. And it is a choice to trust things that you cannot verify, one that you do not have to make, even though it is unfortunately commonly made.
For example, let's take the Uyghur situation in China. I have no ability to check reality there, as I do not live in and have no intention of ever visiting China. My information environment is what the Chinese government reports and what various media outlets and NGOs report.

As it turns out, both the Chinese government and the media and NGOs report on other things that I can check against reality, e.g. events that happen in my country, and I know that they both routinely report falsehoods that do not accord with my observed reality. As a result, I have zero trust in either the Chinese government or the media and NGOs when it comes to things that I cannot personally verify, especially when I know both parties have self-interest incentives to report things that are not true.

Therefore, the conclusion is obvious: I do not know and cannot know what is happening around Uyghurs in China, and do not have a strong opinion on the subject, despite the attempts of various parties to put information in front of me with the intention of getting me to champion their viewpoint. This really does not make me an extra special intellectual powerhouse, one would hope. I'd think this is the bare minimum. The fact that there are many people who do not meet this bare minimum is something that reflects poorly on them rather than highly on me.
On the other hand, I trust what, for instance, the Encyclopedia Britannica has to say about hard science, because in the course of my education I was taught to conduct experiments and confirm reality for myself. I have never once found what is written about hard science in Britannica to not be in accord with my observed reality, and on top of that there is little incentive for the Britannica to print scientific falsehoods that could be easily disproven, so it has earned my trust and I will believe the things written in it even if I have not personally conducted experiments to verify all of it.
Anyone can check their information sources against reality, regardless of their intelligence. It is a choice to believe information that is put in front of you without checking it. Sometimes a choice that is warranted once trust is earned, but all too often a choice that is highly unwarranted.
Why even bother responding to comments if you don't read them?
> because in the course of my education I was taught to conduct experiments and confirm reality for myself. I have never once found what is written about hard science in Britannica to not be in accord with my observed reality,
It's in the same sentence I mentioned Britannica!
> you’re still not checking any facts by yourself
Did you perhaps read it but not understand what my sentence meant because you don't know what an experiment is? Were you not taught to do scientific experiments in your schooling? Literally the entire point of my entire post is that I do not trust blindly, but choose who I trust based on their ability to accurately report the facts I observe for myself without fail. CNN, as with every media outlet I've ever encountered in my entire life, publishes things I can verify to be false. So too does some guy on Twitter with 100 million followers. Britannica does not, at least as it pertains to hard science.
I don't necessarily disagree with what you said, but you're not taking a few things into account.
First of all, most people don't think critically, and may not even know how. They consume information provided to them, instinctively trust people they have a social, emotional, or political bond with, are easily persuaded, and rarely question the world around them. This is not surprising or a character flaw—it's deeply engrained in our psyche since birth. Some people learn the skill of critical thinking over time, and are able to do what you said, but this is not common. This ability can even be detrimental if taken too far in the other direction, which is how you get cynicism, misanthropy, conspiracy theories, etc. So it needs to be balanced well to be healthy.
Secondly, psychological manipulation is very effective. We've known this for millennia, but we really understood it in the past century from its military and industrial use. Propaganda and its cousin advertising work very well at large scales precisely because most people are easily persuaded. They don't need to influence everyone, but enough people to buy their product, or to change their thoughts and behavior to align with a particular agenda. So now that we have invented technology that most people can't function without, and made it incredibly addictive, it has become the perfect medium for psyops.
All of these things combined make it extremely difficult for anyone, including skeptics, to get a clear sense of reality. If most of your information sources are corrupt, you need to become an expert information sleuth, and possibly sacrifice modern conveniences and technology for it. Most people, even if capable, are unwilling to make that effort and sacrifice.
It bears repeating that modern LLMs are incredibly capable, and relentless, at solving problems that have a verification test suite. It seems like this problem did (at least for some finite subset of n)!
This result, by itself, does not generalize to open-ended problems, though, whether in business or in research in general. Discovering the specification to build is often the majority of the battle. LLMs aren't bad at this, per se, but they're nowhere near as reliably groundbreaking as they are on verifiable problems.
> modern LLMs are incredibly capable, and relentless, at solving problems that have a verification test suite.
Feels like a bit of what I tried to express a few weeks ago (https://news.ycombinator.com/item?id=46791642), namely that we are just pouring computational resources into verifiable problems and then claiming that, astonishingly, it sometimes works. Sure, LLMs even have a slight bias, in that they rely on statistics, so it's not purely brute force, but the approach is still pretty much the same: throw stuff at the wall, see what sticks, and once something finally does, report it as grandiose and claim it's "intelligent".
> throw stuff at the wall, see what sticks, and once something finally does, report it as grandiose and claim it's "intelligent".
What do we think humans are doing? I think it’s not unfair to say our minds are constantly trying to assemble the pieces available to them in various ways. Whether we’re actively thinking about a problem or in the background as we go about our day.
Every once in a while the pieces fit together in an interesting way and it feels like inspiration.
The techniques we’ve learned likely influence the strategies we attempt, but beyond all this what else could there be but brute force when it comes to “novel” insights?
If it’s just a matter of following a predefined formula, it’s not intelligence.
If it’s a matter of assembling these formulas and strategies in an interesting way, again what else do we have but brute force?
See what I replied just earlier (https://news.ycombinator.com/item?id=47011884), namely the different regimes: working within a paradigm versus challenging it by going back to first principles. The ability to notice that something is off beyond "just" assembling existing pieces, to backtrack within the process when the failures pile up and to actually understand the relationships, is precisely what's different.
So I don’t really see why this would be a difference in kind. We’re effectively just talking about how high up the stack we’re attempting to brute force solutions, right?
How many people have tried to figure out a new maths, a GUT in physics, a more perfect human language (Esperanto for ex.) or programming language, only to fail in the vast majority of their attempts?
Do we think that anything but the majority of the attempts at a paradigm shift will end in failure?
If the majority end in failure, how is that not the same brute-force methodology? (Brute force doesn’t mean you can’t respond to feedback from your failed experiments or from failures in the prevailing paradigms; I take it to just fundamentally mean trying “new” things with the tools and information available to you, with the majority of attempts ending in failure, until something clicks, or doesn’t and you give up.)
While I don't think anyone has a plausible theory that goes to this level of detail on how humans actually think, there's still a major difference. I think it's fair to say that if we are doing a brute force search, we are still astonishingly more energy efficient at it than these LLMs. The amount of energy that goes into running an LLM for 12h straight is vastly higher than what it takes for humans to think about similar problems.
In the research group I'm in, we usually try a few approaches to each problem. Let's say we get:
Method A) 30% speed reduction and 80% precision decrease
Method B) 50% speed reduction and 5% precision increase
Method C) 740% speed reduction and 1% precision increase
and we only publish B. It's not brute force[1], but throwing noodles at the wall and seeing what sticks, like the GP said. We don't throw spoons[1], but everything that looks like a noodle has a high chance of being thrown. It's a mix of experience[1] and not enough time to try everything.
The field of medicine - pharmacology and drug discovery - is an optimized version of that. It works a bit like this:
Instead of brute-forcing with infinite options, reduce the problem space by starting with some hunch about the mechanism. Then the hard part that can take decades: synthesize compounds with the necessary traits to alter the mechanism in a favourable way, while minimizing unintended side-effects.
Then try on a live or lab grown specimen and note effectiveness. Repeat the cycle, and with every success, push to more realistic forms of testing until it reaches human trials.
Many drugs that reach the last stage - human trials - end up being used for something completely different from what they were designed for! One example of that is minoxidil - designed to regulate blood pressure, used for regrowing hair!
Yes, this is where I just cannot imagine completely AI-driven software development of anything novel and complicated without extensive human input. I'm currently working in a space where none of our data models are particularly complex, but the trick is all in defining the rules for how things should work.
Our actual software implementation is usually pretty simple; often writing up the design spec takes significantly longer than building the software, because the software isn't the hard part - the requirements are. I suspect the same folks who are terrible at describing their problems are going to need help from expert folks who are somewhere between SWE, product manager, and interaction designer.
Even more generally than verification, just being tied to a loss function that represents something we actually care about. E.g. compiler and test errors, LEAN verification in Aristotle, basic physics energy configs in AlphaFold, or win conditions in RL, such as in AlphaGo.
RLHF is an attempt to push LLMs pre-trained with a dopey reconstruction loss toward something we actually care about: imagine if we could find a pre-training criterion that actually cared about truth and/or plausibility in the first place!
There's been active work in this space, including TruthRL: https://arxiv.org/html/2509.25760v1. It's absolutely not a solved problem, but reducing hallucinations is a key focus of all the labs.
I'll add a counterpoint that in many situations (especially monorepos for complex businesses), it's easy for any LLM to go down rabbit holes. Files containing the word "payment" or "onboarding" might be for entirely different DDD domains than the one relevant to the problem. As a CTO touching all sorts of surfaces, I see this problem at least once a day, entirely driven by trying to move too fast with my prompts.
And so the very first thing that the LLM does when planning, namely choosing which files to read, is a key point for manual intervention to ensure that the correct domain or business concept is being analyzed.
Speaking personally: Once I know that Claude is looking in the right place, I'm on to the next task - often an entirely different Claude session. But those critical first few seconds, to verify that it's looking in the right place, are entirely different from any other kind of verbosity.
I don't want verbose mode. I want Claude to tell me what it's reading in the first 3 seconds, so I can switch gears without fear it's going to the wrong part of the codebase. By saying that my use case requires verbose mode, you're saying that I need to see massive levels of babysitting-level output (even if less massive than before) to be able to do this.
(To lean into the babysitting analogy, I want Claude to be the babysitter, but I want to make sure the babysitter knows where I left the note before I head out the door.)
> I don't want verbose mode. I want Claude to tell me what it's reading in the first 3 seconds, so I can switch gears without fear it's going to the wrong part of the codebase. By saying that my use case requires verbose mode, you're saying that I need to see massive levels of babysitting-level output (even if less massive than before) to be able to do this.
To be clear: we re-purposed verbose mode to do exactly what you are asking for. We kept the name "verbose mode", but the behavior is what you want, without the other verbose output.
This is an interesting and complex UI decision to make.
Might it have been better to retire and/or rename the feature, if the underlying action was very different?
I work on silly basic stuff compared to Claude Code, but I find that I confuse fewer users if I rename a button instead of just changing the underlying effect.
This causes me to have to create new docs, and hopefully triggers affected users to find those docs, when they ask themselves “what happened to that button?”
You can call it “output granularity” and allow Java-logger-style configuration, e.g. allowing certain operations to be very verbose while others are simply aggregated.
If we're going there, we need to make the logging dynamically configurable with Log4J-style JNDI and LDAP. It's entirely secure as history has shown - and no matter what, it'll still be more secure than installing OpenClaw!
(Kidding aside, logging complexity is a slippery slope, and I think it's important, perhaps even at a societal level, for an organization like Anthropic to default to a posture that allows people to feel they have visibility into where their agentic workflows are getting their context from. To the extent that "___ puts you in control" becomes important as rogue agentic behavior is increasingly publicized, it's in keeping with, and arguably critical to, Claude's brand messaging.)
They don’t have to reproduce it literally. It’s a UX problem with many solutions. My point is, you cannot settle on some “average” solution here. It’s likely that some agents and some operations will be more trustworthy, some less, but that will be highly dependent on the context of the execution.
Feels like you aren’t really listening to the feedback. Is verbose mode the same as the explicit callouts of files read in the previous versions? Yes, you intended it to fulfill the same need, but take a step back. Is it the same? I’m hearing a resounding “no”. At the very least, if you have made such a big change, you’ve gotten rid of the value of a true “verbose mode”.
> To be clear: we re-purposed verbose mode to do exactly what you are asking for. We kept the name "verbose mode", but the behavior is what you want, without the other verbose output.
Verbose mode feels far too verbose to handle that. It’s also very hard to “keep your place” when toggling into verbose mode to see a specific output.
I think the point bcherny is making in the last few threads is that, the new verbose mode _default_ is not as verbose as it used to be and so it is not "too verbose to handle that". If you want "too verbose", that is still available behind a toggle
Yeah, I didn't realize that there's a new sort of verbose mode now which is different from the verbose mode that was included previously. Although I'm still not clear on the difference between "verbose mode" and "ctrl + o". Based on https://news.ycombinator.com/item?id=46982177 I think they are different (specifically where they say "while hiding more details behind ctrl+o").
I'm guessing both humans and LLMs would tend to get the "vibe" from the pelican task, that they're essentially being asked to create something like a child's crayon drawing. And that "vibe" then brings with it associations with all the types of things children might normally include in a drawing.
It's not a benchmark though, right? Because there's no control group or reference.
It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.
It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.
I argue that, in the service of AI, there is a lot of flexibility being created around the scientific method.
For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/
RLHF (reinforcement learning from human feedback) is to a large extent about resolving that ambiguity by simply polling people for their subjective judgement.
I've worked on an RLHF project for one of the larger model providers, and the instructions provided to the reviewers were very clear that if there was no objectively correct answer, they were still required to choose the best answer; and while there were of course disagreements in the margins, groups of people do tend to converge on the big lines.
So if it can generate exactly what you had in mind, presumably based on the most subtle of cues, like your personal quirks gleaned from a few sentences, that could be _terrifying_, right?
This is actually a good benchmark; I used to roll my eyes at it. Then I decided to apply the same idea and asked the models to generate an SVG image of "something" (not going to put it out there). There was a strong correlation between how good the models are and the images they generated. These were also models without vision, so I don't know if you are serious, but this is a decent benchmark.
Unless the Chrome web store integrates with this, it puts the onus on users to continuously scan extension updates for hash mismatches with the public extension builds, which isn’t standardized. And even then this would be after an update is unpacked, which may not run in time to prevent initial execution. Nor does it prevent a supply chain attack on the code running in the GitHub Action for the build, especially if dependencies aren’t pinned. There’s no free lunch here.
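To make "scan for hash mismatches" concrete, here's a minimal sketch of what that manual check might look like, assuming a Linux Chrome profile path and a reference build reproduced from the public source (both are placeholder assumptions; none of this is standardized):

```bash
#!/usr/bin/env bash
# Compare an installed (unpacked) extension against a locally built reference copy.
# Both paths below are placeholders; adjust for your OS, profile, and the project's
# published build instructions.
INSTALLED="$HOME/.config/google-chrome/Default/Extensions/<extension-id>/<version>"
REFERENCE="./reference-build"

hash_tree() {
  # Hash every file, keyed by its path relative to the root, excluding the
  # _metadata directory Chrome may add on install, so the two trees are comparable.
  (cd "$1" && find . -type f ! -path './_metadata/*' -print0 | sort -z | xargs -0 sha256sum)
}

if diff <(hash_tree "$INSTALLED") <(hash_tree "$REFERENCE"); then
  echo "No mismatches found"
else
  echo "Hash mismatch: installed extension differs from the reference build"
fi
```

And you'd have to re-run something like that on every silent update, which is exactly the onus described above.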
> The first option uses AI to analyze a user’s video selfie, which Discord says never leaves the user’s device. If the age group estimate (teen or adult) from the selfie is incorrect, users can appeal it or verify with a photo of an identity document instead.
Are they shipping a video classifier model that can run on all the devices that can run Discord, including web? I've never heard of this being done at scale fully client-side. Which begs the question of whether the frames are truly processed only client-side...
Can't you just modify the client to send the resulting signal then? I'd anticipate a ton of tutorials like: Just paste this script into the console to get past the age gate!
> Unfortunately, the landscape has changed particularly with the advent of AI tools that allow people to trivially create plausible-looking but extremely low-quality contributions with little to no true understanding. Contributors can no longer be trusted based on the minimal barrier to entry to simply submit a change... So, let's move to an explicit trust model where trusted individuals can vouch for others, and those vouched individuals can then contribute.
> If you aren't vouched, any pull requests you open will be automatically closed. This system exists because open source works on a system of trust, and AI has unfortunately made it so we can no longer trust-by-default because it makes it too trivial to generate plausible-looking but actually low-quality contributions.
===
Looking at the closed PRs of this very project immediately shows https://github.com/mitchellh/vouch/pull/28 - which, true to form, is an AI generated PR that might have been tested and thought through by the submitter, but might not have been! The type of thing that can frustrate maintainers, for sure.
But how do you bootstrap a vouch-list without becoming hostile to new contributors? This seems like a quick way for a project to become insular/isolationist. The idea that projects could scrape/pull each others' vouch-lists just makes that a larger but equally insular community. I've seen well-intentioned prior art in other communities that's become downright toxic from this dynamic.
So, if the goal of this project is to find creative solutions to that problem, shouldn't it avoid dogfooding its own most extreme policy of rejecting PRs out of hand, lest it miss a contribution that suggests a real innovation?
I suspect a good start might be engaging with the project and discussing the planned contribution before sending a 100kLOC AI pull request. Essentially, some signal that the contributor intends to be a responsible AI driver, not just a proxy for unverified garbage code.
That's oftentimes the most difficult part. People are busy, and trying to join these conversations as someone green is hard unless you already have specific domain knowledge to bring (which requires either a job doing that specific stuff or other FOSS contributions to point to).
As a note here, there are a lot of resources that make bash seem incredibly arcane, with custom functions often recommended. But an interruptible script to run things in parallel can be as simple as:
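Something like this minimal sketch (the build/test scripts are just placeholders for whatever you actually want to run in parallel):

```bash
#!/usr/bin/env bash
# Clean-up wiring: on Ctrl-C or kill, exit; on exit, signal our whole process
# group (kill 0) so the backgrounded subshells get torn down too.
trap "exit" INT TERM
trap "kill 0" EXIT

# Each ( ... ) & chain runs in its own background subshell.
# && means the next step only runs if the previous one succeeded.
(./build_api.sh && ./test_api.sh) &
(./build_web.sh && ./test_web.sh) &
(./lint.sh && ./build_docs.sh) &

# Block until every background job finishes (or we get interrupted).
wait
```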
(To oversimplify: The trap propagates the signal (with 'kill 0') to the current process group, which includes the backgrounded () subshells; it only needs to be set at the top level. & means run in background, && means run and continue only on success.)
There are other reasons one might not want to depend on bash, but it's not something to be afraid of!