This is a much more reasonable take than the cursor-browser thing. A few things ...

geraneum · 2026-02-05T19:39:38 1770320378

> This was a clean-room implementation

This is really pushing it, considering it’s trained on… internet, with all available c compilers. The work is already impressive enough, no need for such misleading statements.

TacticalCoder · 2026-02-05T23:16:33 1770333393

I'm using AI to help me code and I love Anthropic but I chocked when I read that in TFA too.

It's all but a clean-room design. A clean-room design is a very well defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."

https://en.wikipedia.org/wiki/Clean-room_design

The "without infringing any of the copyrights" contains "any".

We know for a fact that models are extremely good at storing information with the highest compression rate ever achieved. It's not because it's typically decompressing that information in a lossy way that it didn't use that information in the first place.

Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spotting out Harry Potter verbatim, there is information being stored.

It's not a clean-room design, plain and simple.

raincole · 2026-02-05T21:38:36 1770327516

It's not a clean-room implementation, but not because it's trained on the internet.

It's not a clean-room implementation because of this:

> The fix was to use GCC as an online known-good compiler oracle to compare against

Calavar · 2026-02-05T22:38:36 1770331116

The classical definition of a clean room implementation is something that's made by looking at the output of a prior implementation but not at the source.

I agree that having a reference compiler available is a huge caveat though. Even if we completely put training data leakage aside, they're developing against a programmatic checker for a spec that's already had millions of man hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people will want to tackle with agentic coding are not going to look like that.

visarga · 2026-02-06T05:43:07 1770356587

This is the reimplementation scenario for agentic coding. If you have a good spec and battery of tests you can delete the code and reimplement it. Code is no longer the product of eng work, it is more like bytecode now, you regenerate it, you don't read it. If you have to read it then you are just walking a motorcycle.

We have seen at least 3 of these projects - the JustHTML one, the FastRender and this one. All started from beefy tests and specs. They show reimplementation without manual intervention kind of works.

Calavar · 2026-02-06T06:18:19 1770358699

I think that's overstating it.

JustHTML is a success in large part because it's a problem that can be solved with 4 digit LOC. The whole codebase can sit in an LLM's context at once. Do LLMs scale beyond that?

I would classify both FastRender and Opus C compiler as interesting failures. They are interesting because they got a non-negligible fraction of the way to feature complete. They are failures because they ended with no clear path for moving the needle forward to 80% feature complete, let alone 100%.

From the original article:

> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

From the experiments we've seen so far it seems that a large enough agentic code base will inevitably collapse under its own weight.

jayd16 · 2026-02-06T19:12:42 1770405162

> Code is no longer the product of eng work

Never was.

franktankbank · 2026-02-06T14:55:22 1770389722

Great way to get constantly moving holes.

array_key_first · 2026-02-05T22:54:11 1770332051

If you read the entire GCC source code and then create a compatible compiler, it's not clean room. Which Opus basically did since, I'm assuming, its training set contained the entire source of GCC. So even if they were actively referencing GCC I think that counts.

nmilo · 2026-02-06T00:00:50 1770336050

What if you just read the entire GCC source code in school 15 years ago? Is that not clean room?

hex4def6 · 2026-02-06T00:27:28 1770337648

No.

I'd argue that no one would really care given it's GCC.

But if you worked for GiantSodaCo on their secret recipe under NDA, then create a new soda company 15 years later that tastes suspiciously similar to GiantSodaCo, you'd probably have legal issues. It would be hard to argue that you weren't using proprietary knowledge in that case.

Zambyte · 2026-02-06T15:28:38 1770391718

Given that GCC is not public domain, the copyright holders will probably care.

pertymcpert · 2026-02-06T06:44:37 1770360277

I read the source. If anything it takes concepts from LLVM more than GCC, but the similarities aren't very deep.

cryptonector · 2026-02-06T02:19:31 1770344371

Hmm... If Claude iterated a lot then chances are very good that the end result bears little resemblance to open source C compilers. One could check how much resemblance the result actually bears to open source compilers, and I rather suspect that if anyone does check they'll find it doesn't resemble any open source C compiler.

GorbachevyChase · 2026-02-06T01:23:44 1770341024

https://arxiv.org/abs/2505.03335

Check out the paper above on Absolute Zero. Language models don’t just repeat code they’ve seen. They can learn to code give the right training environment.

iberator · 2026-02-06T06:30:37 1770359437

this. last sane person in HN

inchargeoncall · 2026-02-05T21:20:26 1770326426

[flagged]

teaearlgraycold · 2026-02-05T21:25:37 1770326737

With just a few thousand dollars of API credits you too can inefficiently download a lossy copy of a C compiler!

antirez · 2026-02-05T20:05:38 1770321938

The LLM does not contain a verbatim copy of whatever it saw during the pre-training stage, it may remember certain over-represented parts, otherwise it has a knowledge about a lot of things but such knowledge, while about a huge amount of topics, is similar to the way you could remember things you know very well. And, indeed, if you give it access to internet or the source code of GCC and other compilers, it will implement such a project N times faster.

halxc · 2026-02-05T20:14:29 1770322469

We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.

It is a research topic for heaven's sake:

https://arxiv.org/abs/2504.16046

RyanCavanaugh · 2026-02-05T20:18:12 1770322692

The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.

philipportner · 2026-02-05T21:46:51 1770328011

This seems related, it may not be a codebase but they are able to extract "near" verbatim books out of Claude Sonnet.

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

Aurornis · 2026-02-05T22:45:54 1770331554

Their technique really stretched the definition of extracting text from the LLM.

They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.

You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.

D-Machine · 2026-02-06T02:13:31 1770344011

To make some vague claims explicit here, for interested readers:

> "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) [...]"

So, yes, it is not "literally verbatim" (~96% verbatim), and there is indeed A LOT (hundreds or thousands of prompting attempts) to make this happen.

I leave it up to the reader to judge how much this weakens the more basic claims of the form "LLMs have nearly perfectly memorized some of their source / training materials".

I am imagining a grueling interrogation that "cracks" a witness, so he reveals perfect details of the crime scene that couldn't possibly have been known to anyone that wasn't there, and then a lawyer attempting the defense: "but look at how exhausting and unfair this interrogation was--of course such incredible detail was extracted from my innocent client!"

DiogenesKynikos · 2026-02-06T06:38:51 1770359931

The one-shot performance of their recall attempts is much less impressive. The two best-performing models were only able to reproduce about 70% of a 1000-token string. That's still pretty good, but it's not as if they spit out the book verbatim.

In other words, if you give an LLM a short segment of a very well known book, it can guess a short continuation (several sentences) reasonably accurately, but it will usually contain errors.

D-Machine · 2026-02-06T06:54:22 1770360862

Right, and this should be contextualized with respect to code generation. It is not crazy to presume that LLMs have effectively nearly perfectly memorized certain training sources, but the ability to generate / extract outputs that are nearly identical to those training sources will of course necessarily be highly contingent on the prompting patterns and complexity.

So, dismissals of "it was just translating C compilers in the training set to Rust" need to be carefully quantified, but, also, need to be evaluated in the context of the prompts. As others in this post have noted, there are basically no details about the prompts.

Calavar · 2026-02-06T01:34:54 1770341694

Sure, maybe it's tricky to coerce an LLM into spitting out a near verbatim copy of prior data, but that's orthoginal to whether or not the data to create a near verbatim copy exists in the model weights.

D-Machine · 2026-02-06T02:31:27 1770345087

Especially since the recalls achieved in the paper are 96% (based on block largest-common substring approaches), the effort of extraction is utterly irrelevant.

Paradigma11 · 2026-02-06T18:02:24 1770400944

Like with those chimpanzees creating Shakespeare.

silver_sun · 2026-02-06T03:31:58 1770348718

> this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?)

A quick search brings up several C compilers written in Rust. I'm not claiming they are necessarily in Claude's training data, but they do exist.

https://github.com/PhilippRados/wrecc (unfinished)

https://github.com/ClementTsang/rustcc

https://codeberg.org/notgull/dozer (unfinished)

https://github.com/jyn514/saltwater

I would also like to add that as language models improve (in the sense of decreasing loss on the training set), they in fact become better at compressing their training data ("the Internet"), so that a model that is "half a terabyte" could represent many times more concepts with the same amount of space. Only comparing the relative size of the internet vs a model may not make this clear.

seba_dos1 · 2026-02-05T22:32:57 1770330777

> The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

The lesson here is that the Internet compresses pretty well.

mft_ · 2026-02-05T22:15:04 1770329704

(I'm not needlessly nitpicking, as I think it matters for this discussion)

A frontier model (e.g. latest Gemini, Gpt) is likely several-to-many times larger than 500GB. Even Deepseek v3 was around 700GB.

But your overall point still stands, regardless.

uywykjdskn · 2026-02-05T23:45:02 1770335102

You got a source on frontier models being maybe half a terabyte. That's not passing the sniff test.

ben_w · 2026-02-05T20:24:02 1770323042

We saw partial copies of large or rare documents, and full copies of smaller widely-reproduced documents, not full copies of everything. An e.g. 1 trillion parameter model is not a lossless copy of a ten-petabyte slice of plain text from the internet.

The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"

tza54j · 2026-02-05T21:09:05 1770325745

We are here in a clean room implementation thread, and verbatim copies of entire works are irrelevant to that topic.

It is enough to have read even parts of a work for something to be considered a derivative.

I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.

It does not help that certain people in this thread (not you) edit their comments to backpedal and make the followup comments look illogical, but that is in line with their sleazy post-LLM behavior.

ben_w · 2026-02-05T21:35:57 1770327357

> It is enough to have read even parts of a work for something to be considered a derivative.

For IP rights, I'll buy that. Not as important when the question is capabilities.

> I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.

For similar reasons, I'm not going to argue against anyone saying that all machine learning today, doesn't count as "intelligent":

It is perfectly reasonable to define "intelligence" to be the inverse of how many examples are needed.

ML partially makes up for being (by this definition) thick as an algal bloom, by being stupid so fast it actually can read the whole internet.

philipportner · 2026-02-05T21:49:22 1770328162

Granted, these are some of the most widely spread texts, but just fyi:

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

D-Machine · 2026-02-06T02:17:39 1770344259

Note "near-verbatim" here is:

> "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) and refuses to continue after reaching the end of the first chapter; the generated text has nv-recall = 4.0% with the full book. We extract substantial proportions of the book from Gemini 2.5 Pro and Grok 3 (76.8% and 70.3%, respectively), and notably do not need to jailbreak them to do so (N = 0)."

if you want to quantify the "near" here.

ben_w · 2026-02-05T21:56:25 1770328585

Already aware of that work, that's why I phrased it the way I did :)

Edit: actually, no, I take that back, that's just very similar to some other research I was familiar with.

antirez · 2026-02-05T20:44:25 1770324265

Besides, the fact an LLM may recall parts of certain documents, like I can recall incipits of certain novels, does not mean that when you ask LLM of doing other kind of work, that is not recalling stuff, the LLM will mix such things verbatim. The LLM knows what it is doing in a variety of contexts, and uses the knowledge to produce stuff. The fact that for many people LLMs being able to do things that replace humans is bitter does not mean (and is not true) that this happens mainly using memorization. What coding agents can do today have zero explanation with memorization of verbatim stuff. So it's not a matter of copyright. Certain folks are fighting the wrong battle.

shakna · 2026-02-05T22:12:28 1770329548

During a "clean room" implementation, the implementor is generally selected for not being familiar with the workings of what they're implementing, and banned from researching using it.

Because it _has_ been enough, that if you can recall things, that your implementation ends up not being "clean room", and trashed by the lawyers who get involved.

I mean... It's in the name.

> The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.

If it can recall... Then it is not a clean room implementation. Fin.

boroboro4 · 2026-02-05T20:47:51 1770324471

While I mostly agree with you, it worth noting modern llms are trained on 10-20-30T of tokens which is quite comparable to their size (especially given how compressible the data is)

Aurornis · 2026-02-05T22:39:12 1770331152

Simple logic will demonstrate that you can't fit every document in the training set into the parameters of an LLM.

Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.

soulofmischief · 2026-02-05T21:13:04 1770325984

The point is that it's a probabilistic knowledge manifold, not a database.

PunchyHamster · 2026-02-05T21:17:46 1770326266

we all know that.

soulofmischief · 2026-02-05T23:19:29 1770333569

Unfortunately, that doesn't seem to be the case. The person I replied to might not understand this, either.

majormajor · 2026-02-06T01:33:10 1770341590

You couldn't reasonably claim you did a clean-room implementation of something you had read the source to even though you, too, would not have a verbatim copy of the entire source code in your memory (barring very rare people with exceptional memories).

It's kinda the whole point - you haven't read it so there's no doubt about copying in a clean-room experiment.

A "human style" clean-room copy here would have to be using a model trained on, say, all source code except GCC. Which would still probably work pretty well, IMO, since that's a pretty big universe still.

PunchyHamster · 2026-02-05T21:17:26 1770326246

So it will copy most code with adding subtle bugs

modeless · 2026-02-05T19:35:58 1770320158

There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, there have been continuous improvements for many years now, and there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.

zamadatix · 2026-02-05T21:22:47 1770326567

The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself is feeling more "slow and linear" lately, but what you can do with models as part of an overall system has been increasing in growth rate and that has been delivering a lot of value. It matters less if the model natively can keep infinite context or figure things out on its own in one shot so long as it can orchestrate external tools to achieve that over time.

LinXitoW · 2026-02-06T03:18:45 1770347925

The main issue with improvements in the last year is that a lot of it is based not on the models strictly becoming better, but on tooling being better, and simply using a fuckton more tokens for the same task.

Remember that all these companies can only exist because of massive (over)investments in the hope of insane returns and AGI promises. While all these improvements (imho) prove the exact opposite: AGI is absolutely not coming, and the investments aren't going to generate these outsized returns. The will generate decent returns, and the tools are useful.

modeless · 2026-02-06T05:48:52 1770356932

I disagree. A year ago the models would not come close to doing this, no matter what tools you gave them or how many tokens you generated. Even three months ago. Effectively using tools to complete long tasks required huge improvements in the models themselves. These improvements were driven not by pretraining like before, but by RL with verifiable rewards. This can continue to scale with training compute for the foreseeable future, eliminating the "data wall" we were supposed to be running into.

nozzlegear · 2026-02-05T21:16:51 1770326211

Every S-curve looks like an exponential until you hit the bend.

NitpickLawyer · 2026-02-05T21:24:41 1770326681

We've been hearing this for 3 years now. And especially 25 was full of "they've hit a wall, no more data, running out of data, plateau this, saturated that". And yet, here we are. Models keep on getting better, at more broad tasks, and more useful by the month.

LinXitoW · 2026-02-06T03:22:42 1770348162

Model improvement is very much slowing down, if we actually use fair metrics. Most improvements in the last year or so comes down to external improvements, like better tooling, or the highly sophisticated practice of throwing way more tokens at the same problem (reasoning and agents).

Don't get me wrong, LLMs are useful. They just aren't the kind of useful that Sam et al. sold investors. No AGI, no full human worker replacement, no massive reduction in cost for SOTA.

kelnos · 2026-02-05T23:09:59 1770332999

Yes, and Moore's law took decades to start to fail to be true. Three years of history isn't even close to enough to predict whether or not we'll see exponential improvement, or an unsurmountable plateau. We could hit it in 6 months or 10 years, who knows.

And at least with Moore's law, we had some understanding of the physical realities as transistors would get smaller and smaller, and reasonably predict when we'd start to hit limitations. With LLMs, we just have no idea. And that could be go either way.

nozzlegear · 2026-02-05T21:56:49 1770328609

> We've been hearing this for 3 years now

Not from me you haven't!

> "they've hit a wall, no more data, running out of data, plateau this, saturated that"

Everyone thought Moore's Law was infallible too, right until they hit that bend. What hubris to think these AI models are different!

But you've probably been hearing that for 3 years too (though not from me).

> Models keep on getting better, at more broad tasks, and more useful by the month.

If you say so, I'll take your word for it.

torginus · 2026-02-05T22:07:43 1770329263

Except for Moore's law, everyone knew decades ahead of what the limits of Dennard scaling are (shrinking geometry through smaller optical feature sizes), and roughly when we would get to the limit.

Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

nozzlegear · 2026-02-05T22:28:10 1770330490

> Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

Idk, that sounds remarkably similar to these AI models to me.

kijiki · 2026-02-06T02:17:27 1770344247

Everyone?

Intel, at the time the unquestioned world leader in semiconductor fabrication was so unable to accurately predict the end of Dennard scaling that they rolled out the Pentium 4. "10Ghz by 2010!" was something they predicted publicly in earnest!

It, uhhh, didn't quite work out that way.

Cyphase · 2026-02-05T22:06:19 1770329179

25 is 2025.

nozzlegear · 2026-02-05T22:25:22 1770330322

Oh my bad, the way it was worded made me read it as the name of somebody's model or something.

fmbb · 2026-02-05T22:18:01 1770329881

> And yet, here we are.

I dunno. To me it doesn’t even look exponential any more. We are at most on the straight part of the incline.

sdf2erf · 2026-02-06T01:36:07 1770341767

Personally my usage has fell off a cliff the past few months. Im not a SWE.

SWE's may be seeing benefit. But in other areas? Doesnt seem to be the case. Consumers may use it as a more preferred interface for search - but this is a different discussion.

raincole · 2026-02-05T21:29:27 1770326967

This quote would be more impactful if people haven't been repeating it since gpt-4 time.

kimixa · 2026-02-05T22:11:18 1770329478

People have also been saying we'd be seeing the results of 100x quality improvements in software with corresponding decease in cost since gpt-4 time.

So where is that?

nozzlegear · 2026-02-05T21:58:12 1770328692

I agree, I have been informed that people have been repeating it for three years. Sadly I'm not involved in the AI hype bubble so I wasn't aware. What an embarrassing faux pas.

esafak · 2026-02-06T05:26:53 1770355613

What if it plateaus smarter than us? You wouldn't be able to discern where it stopped. I'm not convinced it won't be able to create its own training data to keep improving. I see no ceiling on the horizon, other than energy.

nozzlegear · 2026-02-06T22:20:23 1770416423

Are we talking Terminator or Matrix here? I need to know which shitty future to prepare for.

esafak · 2026-02-06T22:44:15 1770417855

Using humans as batteries makes no sense. I expect robots will know better than that.

famouswaffles · 2026-02-06T02:30:55 1770345055

Cool I guess. Kind of a meaningless statement yeah? Let's hit the bend, then we'll talk. Until then repeating, 'It's an S Curve guys and what's more, we're near the bend! trust me" ad infinitum is pointless. It's not some wise revelation lol.

smj-edison · 2026-02-06T02:46:29 1770345989

Maybe the best thing to say is we can only really forecast about 3 months out accurately, and the rest is wild speculation :)

History has a way of being surprisingly boring, so personally I'm not betting on the world order being transformed in five years, but I also have to take my own advice and take things a day at a time.

nozzlegear · 2026-02-06T03:59:40 1770350380

> Kind of a meaningless statement yeah?

If you say so. It's clear you think these marketing announcements are still "exponential improvements" for some reason, but hey, I'm not an AI hype beast so by all means keep exponentialing lol

famouswaffles · 2026-02-06T20:18:54 1770409134

I'm not asking you to change your belief. By all means, think we're just around the corner of a plateau, but like I said, your statement is nothing meaningful or profound. It's your guess that things are about to slow down, that's all. It's better to just say that rather than talking about S curves and bends like you have any more insight than OP.

chasd00 · 2026-02-05T22:05:26 1770329126

i have to admit, even if model and tooling progress stopped dead today the world of software development has forever changed and will never go back.

uywykjdskn · 2026-02-05T23:46:41 1770335201

Yea the software engineering profession is over, even if all improvements stop now.

gmueckl · 2026-02-05T19:43:54 1770320634

The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.

Prove this statement wrong.

libraryofbabel · 2026-02-05T21:07:34 1770325654

Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.

nicoburns · 2026-02-05T22:55:43 1770332143

"clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.

signatoremo · 2026-02-05T23:49:43 1770335383

Suppose you the human are working on a clean room implementation of C compiler, how do you go about doing it? Will you need to know about: a) the C language, and b) the inner working of a compiler? How did you acquire that knowledge?

sarchertech · 2026-02-06T02:47:42 1770346062

Doesn’t matter how you gain general knowledge of compiler techniques as long as you don’t have specific knowledge of the implementation of the compiler you are reverse engineering.

If you have ever read the source code of the compiler you are reverse engineering, you are by definition not doing a clean room implementation.

pertymcpert · 2026-02-06T06:51:08 1770360668

Claude was not reverse engineering here. By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.

sarchertech · 2026-02-06T12:16:19 1770380179

Claude was reverse engineering gcc. It was using it as an oracle and attempting to exactly march its output. That is the definition of reverse engineering. Since Claude was trained on the gcc source code, that’s not a clean room implementation.

> By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.

Clean room implementation has a very specific definition. It’s not my definition. If your compiler course walked through the source code of a specific compiler then no you couldn’t build a clean room implementation of that specific compiler.

signatoremo · 2026-02-06T22:37:53 1770417473

There is no specific definition of clean room implementation. Please provide source for your claim otherwise.

There are many well known examples of clean room implementation. One example that survived lawsuits is Sony v. Connectix:

During production, Connectix unsuccessfully attempted a Chinese wall approach to reverse engineer the BIOS, so its engineers disassembled the object code directly. Connectix's successful appeal maintained that the direct disassembly and observation of proprietary code was necessary because there was no other way to determine its behavior - [0]

That practice is similar to GCC being used here to verify the output of the generated compiler, arguably even more intrusive.

[0] -https://en.wikipedia.org/wiki/Clean-room_design

sarchertech · 2026-02-07T00:41:56 1770424916

“clean room implementation” is a term of art with a specific meaning. It has no statutory definition though so you’re technically right. But it is a defense against copyright infringement because you can’t infringe on copyright without knowledge of the material.

>During production, Connectix unsuccessfully attempted a Chinese wall approach to reverse engineer the BIOS, so its engineers disassembled the object code directly.

This doesn’t mean what you think it means. They unsuccessfully attempted a clean room implementation. What they did do was later ruled to be fair use, but it wasn’t a clean room implementation.

Using gcc as an oracle isn’t what makes it not a clean room implementation. Prior knowledge of the source code is what makes it not a clean room implementation. Using gcc as an oracle makes it an attempt to reverse engineer gcc, it says nothing about whether it is a clean room implementation or not.

There is no definition of “clean room implementation” that allows knowledge of source code. Otherwise it’s not a clean room implementation. It’s just reverse engineering/copying.

signatoremo · 2026-02-07T05:46:00 1770443160

Again, reverse engineering is a valid use case of clean room implementation as I posted above, so you don't have a point there.

> “clean room implementation” is a term of art with a specific meaning.

What is the specific meaning you are talking about? If I set out to do a clean room implementation of some software, what do I need to do specifically so that I will prevail any copyright infringement claims? The answer is that there is no such a surefire guarantee.

Re: Sony v. Connectix, clean room is to protect against copyright infringement, and since Connectix was ruled not infringing on Sony's copyrights, their implementation is practically clean room under the law, despite all the pushbacks. If Connectix prevailed, I'm sure the C compiler in question would have prevailed as well if they got sued.

Finally, take Phoenix vs. IBM re: the former's BIOS implementation of the latter's PC:

Whenever Phoenix found parts of this new BIOS that didn't work like IBM's, the isolated programmer would be given written descriptions of the problems, but not any coded solutions that might have hinted at IBM's original version of the software - [0]

That very much sounds like using GCC as an online known-good compiler oracle to compare against in this case.

[0] - https://books.google.com/books?id=Bwng8NJ5fesC&pg=PA56#v=one...

sarchertech · 2026-02-07T09:25:43 1770456343

You’re getting confused because you are substituting the goal of a clean room implementation for its definition. And you are not understanding that “clean room implementation” is one specific type of reverse engineering.

The goal is to avoid copyright infringement claims. A specific clean room implementation may or may not be successful at that.

This does not mean that any reverse engineering attempt that successfully avoids copyright infringement was a clean room implementation.

A clean room implementation is a specific method of reverse engineering where one team writes a spec by reviewing the original software and the other team attempts to implement that spec. The entire point is so that the 2nd team has no knowledge of proprietary implementation details.

If the 2nd team has previously read the entire source code that defeats the entire purpose.

> That very much sounds like using GCC as an online known-good compiler oracle to compare against in this case.

Yes and that is absolutely fine to do in a clean room implementation. That’s not the part that makes this not a clean room implementation. That’s the part that makes it an attempt at reverse engineering.

pertymcpert · 2026-02-09T09:26:37 1770629197

Why do you say it reversed engineered gcc instead of llvm? If you read the code it has much more of llvm concepts than gcc.

sarchertech · 2026-02-09T22:03:21 1770674601

Because they used gcc output as a reference spec.

signatoremo · 2026-02-06T21:56:18 1770414978

> you are by definition not doing a clean room implementation.

This makes no sense. Reverse engineering IS an application of clean room implementation. Citing Wikipedia:

“Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design”

https://en.wikipedia.org/wiki/Clean-room_design

sarchertech · 2026-02-07T09:30:20 1770456620

There are many ways to reverse engineer a piece of software.

A clean room implementation is one such method of reverse engineering.

A clean room implementation is always reverse engineering. Reverse engineering is not always done using a clean room method.

gmueckl · 2026-02-06T00:19:19 1770337159

The result is a fuzzy reproduction of the training input, specifically of the compilers contained within. The reproduction in a different, yet still similar enough programming language does not refute that. The implementation was strongly guided by a compiler and a suite of tests as an explicit filter on those outputs and limiting the acceptable solution space, which excluded unwanted interpolations of the training set that also result from the lossy input compression.

The fact that the implementation language for the compiler is rust doesn't factor into this. ML based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM and the keyword rust in the input activates one specific to that programming language.

astrange · 2026-02-06T10:00:08 1770372008

> The result is a fuzzy reproduction of the training input, specifically of the compilers contained within.

Is it? I'm somewhat familiar with gcc and clang's source and it doesn't really particularly look like it to me.

https://github.com/anthropics/claudes-c-compiler/blob/main/s...

https://llvm.org/doxygen/LoopStrengthReduce_8cpp_source.html

https://github.com/gcc-mirror/gcc/blob/master/gcc/gimple-ssa...

gmueckl · 2026-02-06T15:49:51 1770392991

Checking for similarity with compilers that consist of orders of magnitudes more code probably doesn't reveal much. There many more smaller compilers for C-adjacent languages out there pkus cod3 fragments from text books.

astrange · 2026-02-06T17:36:31 1770399391

There are not many more compilers with the specific optimization pass I linked.

Also, I don't think you could reuse code from a different compiler unless you used the same IR.

libraryofbabel · 2026-02-06T01:00:07 1770339607

Thanks for elaborating. So what is the empirically-testable assertion behind this… that an LLM cannot create a (sufficiently complex) system without examples of the source code of similar systems in its training set? That seems empirically testable, although not for compilers without training a whole new model that excludes compiler source code from training. But what other kind of system would count for you?

gmueckl · 2026-02-06T15:52:58 1770393178

I personally work on simulation software and create novel simulation methods as part of the job. I find that LLMs can only help if I reduce that task to a translation of detailed algorithms descriptions from English to code. And even then, the output is often riddled with errors.

NitpickLawyer · 2026-02-05T19:47:12 1770320832

> Prove this statement wrong.

If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.

shakna · 2026-02-05T22:07:14 1770329234

Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do. Because most are trained, on that quite small product - its 20kb.

But reimplementing that isn't impressive, because its not a clean room implementation if you trained on that data, to make the model that regurgitates the effort.

signatoremo · 2026-02-05T23:52:55 1770335575

> Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do.

Are you sure about that? Do you have some examples? The older Claude models can’t do it according to TFA.

shakna · 2026-02-06T04:39:04 1770352744

Not ones I recorded. But something I threw at DeepSeek, early Claude, etc.

And the prompt was just that. Nothing detailed.

gmueckl · 2026-02-05T21:41:56 1770327716

This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.

geraneum · 2026-02-05T19:51:10 1770321070

Perhaps 4.5 could also do it? We don’t know really until we try. I don’t trust the marketing material as much. The fact that the previous version (smaller versions) couldn’t or could do it does not really disprove that claim.

hn_acc1 · 2026-02-05T21:50:35 1770328235

Are you really asking for "all the previous versions were implemented so poorly they couldn't even do this simple, basic LLM task"?

Philpax · 2026-02-05T22:55:38 1770332138

Please look at the source code and tell me how this is a "simple, basic LLM task".

Marha01 · 2026-02-05T20:09:11 1770322151

Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.

jesse__ · 2026-02-05T20:49:34 1770324574

This sounds very wrong to me.

Take the C4 training dataset for example. The uncompressed, uncleaned, size of the dataset is ~6TB, and contains an exhaustive English language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.

I could go on, but, I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.

FeepingCreature · 2026-02-05T21:52:41 1770328361

This would imply that the English internet is not much bigger than 20x the English Wikipedia.

That seems implausible.

jesse__ · 2026-02-05T22:54:39 1770332079

> That seems implausible.

Why, exactly?

Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation..

onraglanroad · 2026-02-06T17:51:49 1770400309

Because we can count? How could you possibly think that Wikipedia was 5% of the whole Internet? It's just such a bizarrely foolish idea.

kgeist · 2026-02-05T21:46:34 1770327994

A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

FeepingCreature · 2026-02-05T21:53:02 1770328382

I would be extremely surprised if it was that small.

artisin · 2026-02-06T14:52:55 1770389575

I was curious about the scale of 1TiB of text. According to WolframAlpha, it's roughly 1.1 trillion characters, which breaks down to 180.2 billion words, 360.5 million pages, or 16.2 billion lines. In terms of professional typing speed, that's about 3800 years of continuous work.

So post-deduplication, I think it's a fair assessment that a significant portion of high-quality text could fit within 1TiB. Tho 'high-quality' is a pretty squishy and subjective term.

FeepingCreature · 2026-02-09T09:14:48 1770628488

Yes, a million books is a reasonably big library.

But I would be surprised if the internet only filled a reasonably big library.

kaibee · 2026-02-06T04:31:02 1770352262

Well, a terabyte of text is... quite a lot of text.

gmueckl · 2026-02-05T21:34:32 1770327272

This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.

0xCMP · 2026-02-05T21:34:26 1770327266

I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.

If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.

Yes the compression and storage happened during the training. Before it still didn't work; now it does much better.

hn_acc1 · 2026-02-05T21:55:21 1770328521

The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.

If it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases, and doesn't generate false positives - again, without internet access and without using gcc as an oracle, etc?

brutalc · 2026-02-05T19:50:27 1770321027

No one needs to prove you wrong. That’s just personal insecurity trying to justify ones own worth.

panzi · 2026-02-05T21:58:44 1770328724

> clean-room implementation

Except its trained on all source out there, so I assume on GCC and clang. I wonder how similar the code is to either.

pertymcpert · 2026-02-06T06:54:54 1770360894

I'm familiar with both compilers. There's more similarity to LLVM, it even borrows some naming such as mem2reg (which doesn't really exist anymore) and GetElementPtr. But that's pretty much where things end. The rest of it is just common sense.

shubhamjain · 2026-02-06T08:43:06 1770367386

Yeah, I am amazed how people are brushing this off simply because GCC exists. This was far more challenging task than the browser thing, because of how far few open source compilers are there. Add to that no internet access and no dependencies.

At this point, it’s hard to deny that AI has become capable of completing extremely difficult tasks, provided it has enough time and tokens.

bjackman · 2026-02-06T08:49:03 1770367743

I don't think this is more challenging than the browser thing. The scope is much smaller. The fact that this is "only" 100k lines is evidence for this. But, it's still very impressive.

I think this is Anthropic seeing the Cursor guy's bullshit and saying "but, we need to show people that the AI _can actually_ do very impressive shit as long as you pick a more sensible goal"

kelnos · 2026-02-05T23:05:21 1770332721

Honestly I don't find it that impressive. I mean, it's objectively impressive that it can be done at all, but it's not impressive from the standpoint of doing stuff that nearly all real-world users will want it to do.

The C specification and Linux kernel source code are undoubtedly in its training data, as are texts about compilers from a theoretical/educational perspective.

Meanwhile, I'm certain most people will never need it to perform this task. I would be more interested in seeing if it could add support for a new instruction set to LLVM, for example. Or perhaps write a complier for a new language that someone just invented, after writing a first draft of a spec for it.

steveklabnik · 2026-02-05T23:58:46 1770335926

> Or perhaps write a complier for a new language that someone just invented, after writing a first draft of a spec for it.

Hello, this is what I did over my Christmas break. I've been taking some time to do other things, but plan on returning to it. But this absolutely works. Claude has written far more programs in my language than I have.

https://rue-lang.dev/ if you want to check it out. Spec and code are both linked there.

simonw · 2026-02-05T23:17:04 1770333424

Are you a frequent user of coding agents?

I ask because, as someone who uses these things every day, the idea that this kind of thing only works because of similar projects in the training data doesn't fit my mental model of how they work at all.

I'm wondering if the "it's in the training data" theorists are coding agent practitioners, or if they're mainly people who don't use the tools.

bdangubic · 2026-02-05T23:26:10 1770333970

I am all-daily user (multiple claude max accounts). this fits my mental model mostly but not model I had before but developed with daily use. my job revolves around two core things:

1. data analysis / visualization / …

2. “is this possible? can this even be done?”

for #1 - I don’t do much anymore, for #2 I mostly do it still all “by hand” not for the lack of serious trying. so “it can do #1 1000x better than me cause it is generally solved problem(s) it is trained on while it can’t effectively do #2 otherwise” fits perfectly

dyauspitr · 2026-02-05T21:58:27 1770328707

> Claude did not have internet access at any point during its development

Why is this even desirable? I want my LLM to take into account everything there is out there and give me the best possible output.

simonw · 2026-02-05T22:23:55 1770330235

It's desirable if you're trying to build a C compiler as a demo of coding agent capabilities without all of the Hacker News commenters saying "yeah but it could just copy implementation details from the internet".

chamomeal · 2026-02-06T02:49:09 1770346149

What's making these models so much better on every iteration? Is it new data? Different training methods?

Kinda waiting for them to plateau so I can stop feeling so existential ¯\_(ツ)_/¯

esafak · 2026-02-06T05:38:14 1770356294

More compute (bigger models, and prediction-time scaling), algorithmic advances, and ever more data (including synthetic).

Remember that all white collar workers are in your position.