Hacker News | new | past | comments | ask | show | jobs | submit | fuhsnn's comments

It's one of the most commonly adopted features among C successor languages (D, Zig, Odin, C3, Hare, Jai); given how opinionated some of them are on these topics, I think it's safe to say it's generally well regarded in PL communities.

What I always hated about defer is that you can simply forget it or place it in the wrong position. Destructors or linear types (must-use-once types) are a much better solution.

In Nim too, although the author doesn't like it much.

The fact that pointer types can't be used with this pattern without typedef still seems kinda primitive to me.

You can use pointer types by using a typedef first, but I agree this is not nice (I hope we will fix this in future C). But then, I think this is a minor inconvenience for having an otherwise working span type in C.

Structural diff tools like difftastic[1] are a good middle ground and still underexplored IMO.

[1] https://github.com/Wilfred/difftastic


IntelliJ diffs are also really good; they are somewhat semi-structural I'd say. Not going as far as difftastic it seems (but I haven't used that one).

Intel's next gen will add 16 more general purpose registers. Can't wait for the benchmarks.

So every function call will need to spill even more call-clobbered registers to the stack!

Like, I get that leaf functions with truly huge computational cores are a thing that would benefit from more ISA-visible registers, but... don't we have GPUs for that now? And TPUs? NPUs? Whatever those things are called?


With an increase in available registers, every value that a compiler might newly choose to keep in a register was a value that would previously have lived in the local stack frame anyway.

It's up to the compiler to decide how many registers it needs to preserve at a call. It's also up to the compiler to decide which registers shall be the call-clobbered ones. "None" is a valid choice here, if you wish.


> It's up to the compiler to decide how many registers it needs to preserve at a call.

But the compiler is bound by the ABI, isn't it? (at least for externally visible entrance points / calls to sub routines external to the current compilation unit)


Someone decides how to define the ABI too, and that is a free choice. CPU register count doesn't constrain this particular system design question.

Most function calls are aggressively inlined by the compiler such that they are no longer "function calls". More registers will make that even more effective.

That depends on whether something like LTO is possible and on the function not being declared with one of the plethora of calling conventions. What it means is that new calling conventions will be needed and that this new platform will be able to use pass-by-register for higher-arity functions.

LTO is orthogonal to inlining functions.

ARM Aarch64, IBM POWER and most RISC CPUs have had 32 general-purpose registers for many decades, and this has always been an advantage for them, not a disadvantage.

So there already is plenty of experience with ABIs using 32 GPRs.

Even so, unnecessary spilling could be avoided, but that would require some changes in programmer habits. For instance, one could require an ordering of the compilation of source files, e.g. to always compile the dependencies of a program module before it. Then the compiler could annotate the module interfaces with register usage and then use these annotations when compiling a program that imports the modules. This would avoid the need to define an ABI that is never optimal in most cases, by either saving too many or too few registers.


Yeah, this approach has been proposed, in an even more aggressive form; see David W. Wall, "Global Register Allocation at Link Time", 1986 [0], specifically for machines with a large number of registers: the paper straight up says that it's much less of a problem if you a) have fewer registers, b) don't compile things separately. And this problem with large register files had been predicted long before; see the wonderfully named "How to Use 1000 Registers", 1979, by Richard L. Sites [1] (he would later go on to work on the VAX and design the Alpha ISA at DEC).

[0] https://dl.acm.org/doi/10.1145/12276.13338

[1] https://files01.core.ac.uk/download/pdf/9412584.pdf


Thanks for the links.

Register allocation at link time would work only for ISAs where all registers are equivalent or they are partitioned only in a small number of equivalence classes.

It cannot work for an ISA like x86-64, where register allocation and instruction choice are extremely interdependent (e.g. the 16 Intel-AMD general-purpose registers are partitioned into 11 equivalence classes, instead of in at most 2 to 4 equivalence classes, like in the majority of other ISAs, where only a stack pointer and perhaps a return link pointer and/or a null register may have a special behavior). With such an ISA, the compiler must know the exact register allocation before choosing instructions, otherwise the generated program can be much worse than an optimum program. Moreover, an ideal optimizing compiler for x86-64 might need to generate instructions for several alternative register allocations, then select the best allocation and generate the definitive instruction sequence.

With such a non-orthogonal ISA, like x86-64, for the best results you need to allocate registers based on the register usage of the invoked procedures, as assembly programmers typically do, instead of using a uniform sub-optimal ABI, like most compilers.


Why does having more registers lead to spilling? I would assume, probably incorrectly, that more registers means less spill. Are you talking about calls inside other calls which cause the outer scope arguments to be preemptively spilled so the inner scope data can be pre-placed in registers?

More registers leads to less spilling not more, unless the compiler is making some really bad choices.

Any easy way to see that is that the system with more registers can always use the same register allocation as the one with fewer, ignoring the extra registers, if that's profitable (i.e. it's not forced into using extra caller-saved registers if it doesn't want to).


Yes, the argument for increasing the number of GPRs is precisely to eliminate the register spilling that is necessary in x86-64 programs whenever a program has already used all available architectural registers, a register spilling that does not happen whenever the same program is compiled for Aarch64 or IBM POWER.

So, let's take a function with 40 alive temporaries at a point where it needs to call a helper function of, say, two arguments.

On a 16 register machine with 9 call-clobbered registers and 7 call-invariant ones (one of which is the stack pointer) we put 6 temporaries into call-invariant registers (so there are 6 spills in the prologue of this big function), another 9 into the call-clobbered registers; 2 of those 9 are the helper function's arguments, but 7 other temporaries have to be spilled to survive the call. And the rest 25 temporaries live on the stack in the first place.

If we instead take a machine with 31 registers, 19 being call-clobbered and 12 call-invariant ones (one of which is a stack pointer), we can put 11 temporaries into call-invariant registers (so there are 11 spills in the prologue of this big function), and another 19 into the call-clobbered registers; 2 of those 19 are the helper function's arguments, so 17 other temporaries have to be spilled to survive the call. And the rest of 10 temporaries live on the stack in the first place.

So, there seems to be more spilling/reloading whether you count pre-emptive spills or the on-demand-at-the-call-site spills, at least to me.


You’re missing the fact that the compiler isn’t forced to fill every register in the first place. If it was less efficient to use more registers, the compiler simply wouldn’t use more registers.

The actual counter proof here would be that in either case, the temporaries have to end up on the stack at some point anyways, so you’d need to look at the total number of loads/stores in the proximity of the call site in general.


> You’re missing the fact that the compiler isn’t forced to fill every register in the first place.

Temporaries start their lives in registers (on RISCs, at least). So if you have 40 alive values, you can use the same one register to calculate them all and immediately save all 40 of them on the stack, or e.g. keep 15 of them in 15 registers, and use the 16th register to compute 25 other values and save those on the stack. But if you keep them in the call-invariant registers, those registers need to be saved at the function's prologue, and the call-clobbered registers need to be saved and restored around inner call sites. That's why academia has been playing with register windows, to get around this manual shuffling.

> The actual counter proof here would be that in either case, the temporaries have to end up on the stack at some point anyways, so you’d need to look at the total number of loads/stores in the proximity of the call site in general.

Would you be willing to work through that proof? There may very well be less total memory traffic for a machine with 31 registers than with 16; but it would seem to me that there should be some sort of local optimum for the number of registers (and their clobbered/invariant assignment) for minimizing stack traffic: four registers is way too few, but 192 (there's been CPUs like that!) is way too many.


32 registers have been used in many CPUs since the mid-seventies until today, i.e. for half a century.

During all this time there has been a consensus that 32 registers are better than less registers.

A few CPUs, e.g. SPARC and Itanium have tried to use more registers than that, and they have been considered unsuccessful.

There have been some inconclusive debates about whether 64 architectural registers might be better than 32 registers, but the fact that 32 are better than 16 has been pretty much undisputed. So there are chances that 32 is close to an optimal number of GPRs.

Using 32 registers is something new only for Intel-AMD, due to their legacy, but it is something that all their competition has used successfully for many decades.

I have written many assembly language programs for ARM, POWER or x86. Whenever I had 32 registers, this made it much easier for me to avoid most memory accesses, for spilling or for other purposes. It is true that a compiler is dumber than a human programmer and it frequently has to adhere to a rigid ABI, but even so, I expect that on average even a compiler will succeed to reduce the register spilling when using 32 registers.


The game is deeper than that. Your model is probably about right for the compiler you're using. It shouldn't be - compilers can do better - but it's all a work in progress.

Small scale stuff is you don't usually spill around every call site. One of the calls is the special "return" branch, the other N can probably share some of the register shuffling overhead if you're careful with allocation.

Bigger is that the calling convention is not a constant. Leaf functions can get special cased, but so can non-leaf. Change the pattern of argument to fixed register / stack, change which registers are callee/caller saved. The entry point for calls from outside the current module needs to match the platform ABI you claimed it'll follow but nothing else does.

The inlining theme hints at this. Basic blocks _are_ functions that are likely to have a short list of known call sites, each of which can have the calling convention chosen by the backend, which is what the live in/out of blocks is about. It's not inlining that makes any difference to regalloc, it's being more willing to change the calling convention on each function once you've named it "basic block".


Why is almost no one in this comment thread willing to face the scenario where the function call has to actually happen, and be an actual function call? The reactions are either "no-no-no-no, the call will be inlined, don't you worry your pretty head" or "well, then the compiler will just use fewer registers to make fewer spills" — which precisely agrees with my point that having more registers ain't necessarily all that useful.

> Small scale stuff is you don't usually spill around every call site.

Well duh: it's small, so even just 8 registers is likely enough for it. So again, why bother with cumbersome schemes to extend to 32 registers?

And this problem actually exists; that's why SPARC tried register windows, and even crazier schemes on the software side of things had been proposed, e.g. [0] — seriously, read it. And it's 30 years old, and IIUC nothing much came out of it, so excuse me if I'm somewhat skeptical about "compilers can do better - but it's all a work in progress" claims. Perhaps they already do the best they can for general-purpose CPUs. Good thing we have other kinds of processing units readily available nowadays.

[0] David W. Wall, "Global Register Allocation at Link Time", 1986, https://dl.acm.org/doi/10.1145/12276.13338


This argument doesn’t make sense to me. Generally speaking, having more registers does not result in more spilling, it results in less spilling. Obviously, if you have 100 registers here, there’s no spilling at all. And think through what happens in your example with a 4 register machine or a 1 register machine, all values must spill. You can demonstrate the general principle yourself by limiting the number of registers and then increasing it using the ffixed-reg flags. In CUDA you can set your register count and basically watch the number of spills go up by one every time you take away a register and go down by one every time you add a register.

> Obviously, if you have 100 registers here, there’s no spilling at all.

No, you still need to save/spill all the registers that you use: the call-invariant ones need to be saved at the beginning of the function, the call-clobbered at an inner call site. If your function is a leaf function, only then you can get away with using only call-clobbered registers and not preserving them.


Okay, I see what you’re saying. I was assuming the compiler or programmer knows the call graph, and you’re assuming it’s a function call in the middle of a potentially large call stack with no knowledge of its surroundings. Your assumption is for sure safer and more common for a compiler compiling a function that’s not a leaf and not inlined.

So I can see why it might seem at first glance like having more registers would mean more spilling for a single function. But if your requirement is that you must save/spill all registers used, then isn’t the amount of spilling purely dependent on the function’s number of simultaneous live variables, and not on the number of hardware registers at all? If your machine has fewer general purpose registers than live state footprint in your function, then the amount of function-internal spill and/or remat must go up. You have to spill your own live state in order to compute other necessary live state during the course of the function. More hardware registers means less function-internal spill, but I think under your function call assumptions, the amount of spill has to be constant.

For sure this topic makes it clear why inlining is so important and heavily used, and once you start talking about inlining, having more registers available definitely reduces spill, and this happens often in practice, right? Leaf calls and inlined call stacks and specialization are all a thing that more regs help, so I would expect perf to get better with more registers.


Thanks for actually engaging with my argument.

> assuming it’s a function call in the middle of a potentially large call stack with no knowledge of its surroundings.

Most of the decision logic/business logic lives exactly in functions like this, so while I wouldn't claim that 90% of all of the code is like that... it's probably at least 50% or so.

> then isn’t the amount of spilling purely dependent on the function’s number of simultaneous live variables

Yes, and this ties precisely back to my argument: whether or not a larger number of GPRs "helps" depends on what kind of code is usually being executed. And most of the code, empirically, doesn't have all that many scalar variables alive simultaneously. And the code that does benefit from more registers (huge unrolled/interleaved computational loops with no function calls or with calls only to intrinsics/inlinable thin wrappers of intrinsics) would benefit even more from using SIMD or, even better, being off-loaded to a GPU or the like.

I actually once designed a 256-register fantasy CPU but after playing with it for a while I realised that about 200 of its registers go completely unused, and that's with globals liberally pinned to registers. Which, I guess, explains why Knuth used some idiosyncratic windowing system for his MMIX.


It took me a minute, but yes I completely agree that whether more GPRs helps depends on the code & compiler, and that there’s plenty of code you can’t inline. Re: MMIX Yes! Theoretically it would help if the hardware could dynamically alias registers, and automatically handle spilling when the RF is full. I have heard such a thing physically exists and has been tried, but I don’t know which chip/arch it is (maybe AMD?) nor how well it works. I would bet that it can’t be super efficient with registers, and maybe the complexity doesn’t pay off in practice because it thwarts and undermines inlining.

I recall there were some new instructions added that greatly help with this. Unfortunately I'm not finding any good _webpages_ that describe the operation generally to give me a good overview / refresher. Everything seems to either directly quote published PDF documents or otherwise not present the information in a form that's effective for end use, e.g. https://www.felixcloutier.com/x86/ -- However, availability is problematic for even slightly older silicon: https://en.wikipedia.org/wiki/X86-64

- XSAVE / XRSTOR

- XSAVEOPT / XRSTOR

- XSAVEC / XRSTOR

- XSAVES / XRSTORS


Eh, pretty much nobody uses them (outside of OS kernels?); and mind you, RISC-V with its 32 registers has nothing similar to those, which is why 14-instruction prologues (adjust sp, save ra and s0 through s11) and epilogues are not that uncommon there.

A good compiler will only do that if the register spilling is more efficient than using more stack variables, so I don't really see the problem.

You can't operate on stack variables without loading them into the registers first, not on RISCs anyway. My main point is that this memory-shuffling traffic is unavoidable in non-leaf functions, so an extremely large amount of available registers doesn't really help them.

op is probably referring to the push all/pop all approach.

No, I don't. I use a common "spill definitely reused call-invariant registers at the prologue, spill call-clobbered registers that need to survive a call at precisely the call site" approach, see the sibling comment for the arithmetic.

That’s the push all/pop all approach.

Big non-trivial functions generally profit from having more registers. They don't have to be saved and restored by functions that don't use them. Not every computationally intense application has just a few numerical kernels that can be pushed on to GPUs and fit the GPU uarch well. Just have a look at the difference between optimized compiler-generated ia32 and amd64 code to see how much difference more registers can make.

You could argue that 16 (integer) registers is a sweet spot, but you failed to state a proper argument.


Most modern compilers for modern languages do an insane amount of inlining, so the problem you're mentioning isn't a big issue. And, basically, GPUs and TPUs don't handle branches well. CPUs do.

I looked it up. It's called APX (Advanced Performance Extensions)[1].

[1]: https://www.intel.com/content/www/us/en/developer/articles/t...


Oh my, it has three-operand instructions now. VAX vindicated...

How are they adding GPRs? Won’t that utterly break how instructions are encoded?

That would be a major headache — even if current instruction encodings were somehow preserved.

It’s not just about compilers and assemblers. Every single system implementing virtualization has a software emulation of the instruction set - easily 10k lines of very dense code/tables.


The same way AMD added 8 new GPRs, I imagine: by introducing a new instruction prefix.

Yes, two in fact. One is the same prefix (with subtle differences of course, it's x86) that was introduced for AVX512's 32 registers. The other is new and it's a two byte extension (0xd6 0x??) of the REX prefix (0x40-0x4f).

The longer prefix has extra functionality such as adding a third operand (i.e. add r8, r15, r16), blocking flags update, and accessing a few new instructions (push2, pop2, ccmp, ctest, cfcmov).


x86 is broadly extendable. APX adds a REX2 prefix to address the new registers, and also allows using the EVEX prefix in new ways. And there's new conditional instructions where the encoding wasn't really described on the summary page.

Presumably this is gated behind cpuid and/or model specific registers, so it would tend to not be exposed by virtualization software that doesn't support it. But yeah, if you decode and process instructions, it's more things to understand. That's a cost, but presumably the benefit outweighs the cost, at least in some applications.

It's the same path as any x86 extension. In the beginning only specialty software uses it, at some point libraries that have specialized code paths based on processor features will support it, if it works well it becomes standard on new processors, eventually most software requires it. Or it doesn't work out and it gets dropped from future processors.


I imagine a turning point will be if/when the large HPC centers will significantly benefit from this. I need to see whether DOE facilities compile their software with these new features on a regular basis. (Well, for the average scientific user, whether the jellybean Python stack will take advantage…)

Those general purpose registers will also need to grow to twice their size, once we get our first 128bit CPU architecture. I hope Intel is thinking this through.

That's a ways out. We're not even using all bits in addresses yet. Unless they want hardware pointer tagging a la CHERI there's not going to be a need to increase address sizes, but that doesn't expose the extra bits to the user.

Data registers could be bigger. There's no reason `sizeof int` has to equal `sizeof intptr_t`, many older architectures had separate address & data register sizes. SIMD registers are already a case of that in x86_64.


> There's no reason `sizeof int` has to equal `sizeof intptr_t`

Well, there is no reason `sizeof int` should be 4 on 64-bit platforms except for historical baggage (which was so heavy for Windows that they couldn't move even long to 64 bits). But having int be a wider type than intptr_t probably wouldn't hurt things (as in, most software would work as-is after simple recompilation).


You can do a lot of pointer tagging in 64 bit pointers. Do we have CPUs with true 64 bit pointers yet? Looks like the Zen 4 is up to 57 bits. IIRC the original x86_64 CPUs were 48 bit addressing and the first Intel CPUs to dabble with larger pointers were actually only 40 bit addressing.

Doubling the number of bits squares the number range that can be stored, so there's a point of diminishing returns.

* Four-bit processors can only count to 15, or from -8 to 7, so their use has been pretty limited. It is very difficult for them to do any math, and they've mostly been used for state machines.

* Eight-bit processors can count to 255, or from -128 to 127, so much more useful math can run in a single instruction, and they can directly address hundreds of bytes of RAM, which is low enough that an entire program still often requires paging, but at least a routine can reasonably fit in that range. Very small embedded systems still use 8-bit processors.

* Sixteen-bit processors can count to 65,535, or from -32,768 to 32,767, allowing far more math to work in a single instruction, and a computer can have tens of kilobytes of RAM or ROM without any paging, which was small but not uncommon when sixteen-bit processors initially gained popularity.

* Thirty-two-bit processors can count to 4,294,967,295, or from -2,147,483,648 to 2,147,483,647, so it's rare to ever need multiple instructions for a single math operation, and a computer can address four gigabytes of RAM, which was far more than enough when thirty-two-bit processors initially gained popularity. The need for more bits in general-purpose computing plateaus at this point.

* Sixty-four-bit processors can count to 18,446,744,073,709,551,615, or from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, so only special-case calculations need multiple instructions for a single math operation, and a computer can address up to sixteen exabytes of RAM, which is thousands of times more than current supercomputers use. There's so many bits that programs only rarely perform 64-bit operations, and 64-bit instructions are often performing single-instruction-multiple-data operations that use multiple 8-, 16-, or 32-bit numbers stored in a single register.

We're already at the point where we don't gain a lot from true 64-bit instructions, with the registers more often used by vector instructions that store multiple numbers in a single register, so a 128-bit processor is kind of pointless. Sure, we'll keep growing the registers specific to vector instructions, but those are already 512 bits wide on the latest processors, and we don't call them 512-bit processors.

Granted, before 64-bit consumer processors existed, no one would have conceived that simultaneously running a few chat interfaces, like Slack and Discord, while browsing a news web page, could fill up more RAM than a 32-bit processor can address, so software using zettabytes of RAM will likely happen as soon as we can manufacture it, thanks to Wirth's Law (https://en.wikipedia.org/wiki/Wirth%27s_law), but until then there's no likely path to 128-bit consumer processors.


Well argued. You do have a point for physical world data. But there are also some digital world constructs that need 128 bits:

- IPv6 addresses

- UUIDs

I don't think it is critical that we get a 128 CPU anytime soon, but it would still be nice if we could handle these values with only one CPU register.


There's a first time for everything.

Didn't see people mention this in the initial discussion, but despite not having access to internet, the agents actually had access to the source code of GNU binutils (has assembler, linker and readelf), and many C compilers (SDCC, PCC, chibicc, cproc, etc) in their working directory [1]. These are supposed to test compilation, but there's no way to prove that Claude didn't get crucial info from these projects.

I also found the compiler to have uncanny resemblance with chibicc. With 30-ish edge cases[2] yielding the same behavior, it's hard to believe there's no influence of chibicc in its algorithms.

[1] https://github.com/anthropics/claudes-c-compiler/blob/6f1b99...

[2] https://github.com/anthropics/claudes-c-compiler/issues/232


The binary would not be nearly as useful as the source code. And even if the AI read the binary and copied it, the story is still that it reverse-engineered and rewrote a C compiler all on its own. Which is still pretty impressive to me and has real-world use cases.

Maybe Anthropic could release the logs to show how the AI was accessing the files.


They definitely have access to the source code, look at the binutils entry [1], Claude was debugging compilation of "preprocessor #define inside #ifdef"'s in binutils's code.

[1] https://github.com/anthropics/claudes-c-compiler/blob/6f1b99...


Fil-C, the new memory-safe C/C++ compiler, actually achieved that by introducing a GC; with that in mind I'd say D was kind of a misunderstood prodigy in retrospect.

There's two classes of programs: stuff written in C for historic reasons that could have been written in a higher-level language but a rewrite is too expensive - Fil-C. Stuff where you need low level - Rust/C++/Zig.

Na, there are only three classes: stuff where you need simplicity, fast compilation times, portability, interoperability with legacy systems, or high performance - C. Stuff where you need perfect memory safety: Fil-C. And stuff where you need a combination: C + formal verification. Not sure where I would put Rust/C++/Zig (none of those offers perfect memory safety in practice).

FillC works fine with all C code no matter how low level. There’s a small performance overhead but for almost every scenario it’s an acceptable overhead!

up to 5x is not what most people mean by small.

For most apps it's much less than that and in most cases it's unnoticeable. I think it would be more productive if you could point out an app that has noticeably worse performance on FillC, so that the cause for that could be looked at and perhaps even fixed, so that eventually there would be nearly zero examples like that.

The fact that inserting runtime checks and doing garbage collecting slows things down is not a bug, not something that needs to be fixed.

As for "I think it would be more productive if you ...", that's just rude. I wasn't criticizing Fil-C, just pointing out an inaccuracy.


You did not point out any inaccuracy; you literally made a baseless comment about it being "up to 5 times" slower without providing an example where that's the case.

"That's just rude" - oh my, never thought asking for evidence of a performance problem on HN would be "rude". And yes, if something is actually 5x slower, it's very likely a performance bug that can be fixed, as FillC's author has made clear on several occasions.


> you literally made baseless comment

It's sad that people lie here. It's commonly recognized that programs run 1.5-5 times slower.

> if something is actually 5x slower, it's very likely a performance bug that can be fixed, as FillC's author has made it clear in several opportunities.

Another lie from someone who doesn't even know what the program is called.


Depends on what the multiplicand is.

Not really.

Maybe the strategy from TCC itself is useful here: emit case bodies as a giant block first then jump back with cascaded if's at end. https://godbolt.org/z/TdE11jjxb

> Did you know that C has a keyword that is only a keyword in some places and that there is a function that can have two or three parameters?

What are those? Please tell!


A good idea that I had not thought about.

defined is a pre-processor 'keyword' that is only used in conditions after #if and #elif. Elsewhere the pre-processor treats it as an ordinary identifier. You could argue that it is not a real keyword. But let's see what happens when you define it as a macro with: '#define defined 1'.

The function main can be defined with two or three parameters. The third contains the environment. You can also get the environment with 'argv + (argc + 1)'.


I did a version with explicit capture objects for fun: https://github.com/fuhsnn/c-extensions#extended-capture

Basically, with each capturing lambda, a context type _Ctxof(fn) is created that users can then declare on stack or on heap themselves.


Interesting! Can you tell me about your experience? How does it compare to the versions in N3654? (where this idea comes from) https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3654.pdf


- Non-capturing isn't that much simpler to implement for the front-end, because the compiler still needs to distinguish between a file-scope name and a shadowing local in the parent function. And once you have the code to do that distinction, it's not that far away from implementing captures.

- From implementor point of view I just don't like the idea of a new category of function qualifiers, so I used "_Wideof(int(void))" over "int(void) _Wide", and "int fn({...}, void)" over "int fn(void) _Closure/_Capture(...)".

- During prototype scope of the inner function, parameters may have VM type that references the parent's local scope, so the capturing (or not) effectively starts at prototype scope, earlier than C++'s model. My syntax has the capture clause right before the parameters so it is somewhat manageable in one pass; for function qualifier syntax the parser probably needs to either delay semantic checking of the VMs or skip ahead to find _Closure/_Capture before processing the parameters.

- Both N3654-example6 and N3694-4.4.6 implicitly create capture context on stack then provide reference to it (N3654 with &_Closure(), N3694 with inner function's name). My version is lower level in that only the type of capture context (_Ctxof(fn)) is implicitly created, users then call a helper (modeled after va_start) to initialize the context object. I believe it also reduces surprises from by-value captures, since users clearly see where they initialize a context object relative to local variable changes.

- Sans syntax, N3654's "specified argument" is pretty close to my implementation, just need to change the source of context pointer from r10 to the argument in function prologue.

- I implemented wide function pointers as generalized "function pointers with context pointer pushed in r10", so it's also possible to make wide pointers from regular functions, they'll just be called with an extra nullptr in r10. I think it's more flexible this way.

What I did was mostly "the compiler will need these anyways" so should be adaptable to whatever WG14 settled on, or if I ever want to implement C++11 lambda.


Thanks! What I do not quite understand is the need for a capture clause that lists the captured variables. As a user, I do not want to list my captures. As a code reviewer I certainly do not want value captures that create a modifiable copy with the same name, or any ambiguity about whether something is a copy or not. I just want to access my variables. So what is really the point here?

Regarding "also possible to make wide pointers from regular functions", this is exactly the reason why I like the qualifier version: the usual conversion rules for qualifiers would allow this conversion implicitly, while the back-conversion is not allowed (or only with a cast).


This one[1] is similar in style and supports a bit more C constructs. [1] https://github.com/Lulzx/sectorc/blob/master/stage5/cc.c


I would love to see how you handle variably-modified types, the way I retrofitted them onto chibicc never felt quite right, and I've never seen another non-gcc/clang compiler fully support noplate's VM type patterns.


Maybe I will find the time to clean it up and make it public. But I did not find VM types difficult to implement: the type just depends on some run-time value, so at the point in time where the size expression is evaluated, one stores the result in a hidden variable which the type refers to.


Looking forward to the release!

What I found troublesome was not really the caching of the array count, but when and where the side effect should be represented in the AST. For example, in this one: https://godbolt.org/z/rcT1d8WWe the puts() call is a side effect for the automatic variable but completely ignored for the static variable.


Yes, it is a bit of a mess. We started to make it more precise in the C standard when those expressions have to be evaluated, but there are still issues. And when extensions are involved there are more issues. I think in this example such an initializer would not be allowed in ISO C. I also fixed many bugs in GCC.

