IMO, it's a little subtler than that. It's not the compiler's fault, it's the language's. Plenty of languages have no undefined behavior that can be written by accident. Go and safe Rust, for instance, have just about no undefined behavior at all, and are both performance-competitive with C. (Go has UB if you cause race conditions, and Rust has UB within `unsafe` blocks analogous to C's UB.)
A C compiler, meanwhile, has to aggressively take advantage of undefined behavior to get good performance, and the C specification has been keeping behavior undefined for the benefit of compilers.
You can hope that you find all such problems in C (which you might not) and "fix the program", but you can also "fix the program" by switching to a better language.
Yes, the C language specification is a bit shit: it leaves too much leeway to compilers so that C compilers can be written for broken, niche architectures.
However, I disagree with you. All this badness in the C specification did not stop people from writing reasonable C compilers for reasonable architectures for decades.
The real problem here is competition. Gcc is in competition with clang to produce fast code, which makes the gcc developers feel justified in exploiting undefined behaviours for marginal optimizations.
This is a case of following the letter of the law (in this case the C standard) while disregarding its spirit: all the undefined behaviour was there so that C compilers could accommodate odd architectures while remaining close to the metal, not so that compiler writers could go out of their way to turn their compiler into a minefield.
More specifically, they're competing to produce the fastest code for software that follows the C specification to the letter. That's not necessarily the same as producing the fastest code that actually achieves the intended goal.
For example, if I recall correctly the popular Opus codec overflows signed integers when decoding invalid data, and so long as this can be guaranteed to produce some (possibly implementation-specific) result this is perfectly safe. However, this is technically undefined behaviour - a particularly malevolent optimising C compiler could decide to give the sender of the data arbitrary code execution, because it's allowed to do whatever it likes. This might even make the code run faster, but it'd make decoding Opus correctly and safely slower because the decoder would have to do a bunch of gratuitous overflow checks on operations it could otherwise just let overflow. Fortunately, gcc hasn't reached that level of advanced malevolence yet.
> More specifically, they're competing to produce the fastest code for software that follows the C specification to the letter. That's not necessarily the same as producing the fastest code that actually achieves the intended goal.
That's true, but "producing the fastest code that actually achieves the intended goal" is not a problem which can be solved by a compiler--the compiler can't read your mind.
Attempting to represent a reasonable approximation of reading your mind is the responsibility of the specification. You can definitely do better than C, but I doubt you could have done better 4 decades ago when C was designed.
Yes. The worst part of it is that compilers have the option to do something sensible when handling undefined behaviour. The standard even suggests doing that, describing one possible option as behaving "in a documented manner characteristic of the environment". So why doesn't gcc do that?
People are far too keen to assume that because the standard leaves it undefined, and "undefined" sounds a bit here-be-dragons, it's therefore inevitable that undefined behaviour has to be some nasty creepy ugly thing. That it renders your entire program instantly meaningless. That it's a perfect excuse for the compiler to look at your program and turn it into something completely different. But... it doesn't have to be.
I don't know why people don't treat handling of undefined behaviour as a quality of implementation issue, rather than just rolling over and letting gcc make their lives worse.
> This is a case of following the letter of the law (in this case the C standard) while disregarding its spirit: all the undefined behaviour was so that C compilers could accomodate for odd architectures while remaining close to the metal, not so that compiler programmers could go out of their way to turn their compiler into a mine field.
Computers don't have spirits; they work as you tell them to work, to the letter, and if you're remaining close to the metal, your language will indicate that fact. Optimizing undefined behaviors doesn't make C a minefield; low-level programming for different architectures just is inherently a minefield. C was a minefield before these optimizations were added.
Rust is extremely impressive because they've found so many ways to do high-level programming while maintaining low-level performance. But they can only do that because they have the benefit of the 4 decades of programming language research that have occurred since the basics of C were designed.
>Rust is extremely impressive because they've found so many ways to do high-level programming while maintaining low-level performance. But they can only do that because they have the benefit of the 4 decades of programming language research that have occurred since the basics of C were designed.
I doubt Rust could be ported to an 8-bit PIC microcontroller, or to a 6502, while keeping reasonable performance characteristics or letting the programmer take advantage of the platform's quirks. It's not just "4 decades of programming language research"; it's also that it's intended to work only on "modern" processors.
Agreed. Which is why you should choose a language which was standardized by a standards committee whose goals better align with your goals.
> I doubt Rust could be ported to an 8-bit PIC microcontroller, or to a 6502, while keeping reasonable performance characteristics or letting the programmer take advantage of the platform's quirks.
I don't think that's true; I think that the current state of Rust tools is such that this is true now, but it's nothing inherent to the design of the language, and I think you'll be able to do quite a bit with Rust in the situations you describe when the tools around Rust are more mature. I can't really speak to this more because I'm not sure why you think this can't be done.
> I doubt Rust could be ported to an 8-bit PIC microcontroller, or to a 6502, while keeping reasonable performance characteristics
I don't believe this is correct. Most of why Rust avoids UB is that it uses static types much more effectively than C does. Static types are an abstraction between the programmer and the compiler for conveying intent, that cease to exist at runtime. So the runtime processor architecture should be irrelevant.
For instance, in C, dereferencing a null pointer is UB. This allows a compiler to optimize out checks for null pointers if it "knows" that the pointer can't be null, and it "knows" that a pointer can't be null if the programmer previously dereferenced it. This is, itself, a form of communication between the programmer and the compiler, but an imperfect one. In Rust, safe pointers (references) cannot be null. A nullable pointer is represented with the Option<T> type, which has two variants, Some(T) and None. In order to extract an &Something from an Option<&Something>, a programmer has to explicitly check for these two cases. Once you have a &Something, both you and the compiler know it can't be null.
But at the output-code level, a documented compiler optimization allows Option<&Something> to be stored as just a single pointer -- since &Something cannot be null, a null-valued pointer must represent None, not Some(NULL). So the resulting code from the Rust compiler looks exactly like the resulting code from the C compiler, both in terms of memory usage and in terms of which null checks are present and which can be skipped. But the communication is much clearer, preventing miscommunications like the Linux kernel's
int flags = parameter->flags;
if (parameter == NULL)
return -EINVAL;
Here the compiler thinks that the first line is the programmer saying "Hey, parameter cannot be null". But the programmer did not actually intend that. In Rust, the type system requires that the programmer write the null check before using the value, so that miscommunication is not possible.
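A minimal Rust sketch of the same situation (`Param` and `get_flags` are hypothetical names, and -22 stands in for EINVAL): the type system simply will not give you the flags until you've handled the null case.

```rust
// Hypothetical analogue of the kernel snippet above: the flags field
// can only be read after the None (null) case has been handled.
struct Param {
    flags: i32,
}

fn get_flags(parameter: Option<&Param>) -> Result<i32, i32> {
    match parameter {
        None => Err(-22), // the moral equivalent of returning -EINVAL
        Some(p) => Ok(p.flags), // p is statically known to be non-null here
    }
}

fn main() {
    assert_eq!(get_flags(None), Err(-22));
    assert_eq!(get_flags(Some(&Param { flags: 7 })), Ok(7));
}
```

The miscommunication in the kernel snippet is a compile error here: there is no way to write the dereference before the check.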
There are similar stories for bounds checks and for loops, branches and indirect jumps and the match statement, etc. And none of this differs whether you're writing for a Core i7 or for a VAX.
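The Option-as-single-pointer layout mentioned above is easy to check directly; a quick sketch:

```rust
use std::mem::size_of;

fn main() {
    // Because a valid &u8 can never be null, the compiler uses the
    // null bit pattern to encode None. So Option<&u8> needs no
    // discriminant byte: it occupies exactly the space of a bare pointer.
    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
}
```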
> or letting the programmer take advantage of the platform quirks.
I'm not deeply familiar with that level of embedded systems, but at least on desktop-class processors, compilers are generally better than humans at writing stupidly-optimized code that's aware of particular opcode sequences that work better, etc.
(There are a few reasons why porting Rust to an older processor would be somewhat less than trivial, but they mostly involve assumptions made in the language definition about things like size_t and uintptr_t being the same, etc. You could write a language with a strong type system but C's portability assumptions, perhaps even a fork of Rust, if there were a use case / demand for it.)
Did rust find a way to defeat the halting problem and push all the array bound checks to compile time? How well does rust deal with memory bank switching where an instruction here makes that pointer there refer to a different area of memory?
> Did rust find a way to defeat the halting problem
I don't understand why the halting problem is relevant to this conversation. Just about all practical programs don't care about the halting problem in a final sense, anyway; see e.g. the calculus of inductive constructions for a Turing-incomplete language that lets you implement just about everything you actually care about implementing.
The halting problem merely prevents a program from evaluating a nontrivial property of another program with perfect accuracy. It does not prevent a program from evaluating a nontrivial property (bounds checks, type safety, whatever) with possible outputs "Yes" or "Either no, or you're trying to trick me, so cut that out and express what you mean more straightforwardly kthx."
This realization is at the heart of all modern language design.
> How well does rust deal with memory bank switching where an instruction here makes that pointer there refer to a different area of memory?
This problem boils down to shared mutable state, so the conceptual model of Rust deals with it very well. The current Rust language spec does not actually have a useful model of this sort of memory, but it would be a straightforward fork. As I said in my comment above, if there was interest and a use case for a safe language for these processors, it could be done easily.
You don't need 4 decades of programming language research to specify that e.g. signed integer overflow either returns an implementation-defined value or the program aborting. There are languages older than C that allowed the useful forms of bit-twiddling but offered much stronger safety guarantees.
But you pay a performance cost for either of those decisions. Consider code like this:
for (int i=0; i <= N; i++) func();
Most processors have special support for looping a fixed number of times, e.g. "decrement then branch if zero." If overflow is UB, the compiler can use this support.
But if overflow returns an implementation defined value, then it is possible that N is INT_MAX and the loop will not terminate. In this case the compiler cannot use the fixed-iteration form, and must emit a more expensive instruction sequence.
A correctly predicted branch is almost free; the compiler could check for that case.
The real problem of course is that C requires the programmer to obscure their intent by messing with a counter variable. Is there really no "loop exactly N times" construct in the language?
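For contrast, Rust's range loops express "loop exactly N times" directly: the iterator carries its own end state, so even an inclusive loop up to the type's maximum value terminates. A small sketch:

```rust
fn main() {
    let n = 4;
    let mut calls = 0;
    // The inclusive range 0..=n iterates exactly n + 1 times, matching
    // C's `i <= N`, but the iterator tracks completion internally, so
    // there is no counter to overflow even when n is i32::MAX.
    for _i in 0..=n {
        calls += 1;
    }
    assert_eq!(calls, n + 1);
}
```

The intent ("run the body n + 1 times") reaches the compiler intact, so no overflow reasoning is needed to emit the tight fixed-iteration form.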
I don't think it's the language's fault either: C is four decades old, and being bound by backward compatibility, it can't integrate much of the programming language research that has happened in the last four decades. I'm less concerned with placing fault for the problem than I am with placing the responsibility for fixing the problem, and that's clearly on the writer of the program which uses undefined behavior.
I think choosing Rust might be a reasonable way to avoid the problem in the first place, so I agree with you there, but there are also reasons to choose C over Rust. Personally, I write a lot of C code because I prefer to build on a GPL stack. This is one of the reasons I'd like to see Rust added as a GCC language. Sadly I don't have the time to do it myself.
Umm wait. Undefined behavior is where the language specification is not 100% precise, and compiler implementations can differ on produced code.
Go and Rust only have single implementations. The specifications for both are very brief. Are you claiming that a clean-room implementation of Go or Rust would always behave identically?
Your definition of undefined behaviour is actually the definition for unspecified behaviour.
Unspecified behaviour is usually intentional ambiguity either to give wiggle room for an optimizer, or to accommodate platform variance. Writing a program with that invokes unspecified behaviour isn't normally a problem, as long as you're not relying on a specific result. Order of argument evaluation is a common example.
Relying on undefined behaviour is almost always bad, and almost always avoidable. That's where the nasal demons come from. Dereferencing null, alias-creating pointer typecasting, etc.
Undefined behavior has a very specific meaning for C. It doesn't mean "not 100% precise". It means 100% imprecise. The C standard only gives any guarantees on what your program will do if you never invoke UB. If you do, well then it can do literally whatever it wants including deleting all your files.
Literally deleting all your files at that. It's the part of the language that says your compiler is completely justified in allowing you to write that buffer overflow vulnerability that can trample executable memory, which is then used by a malicious attacker to do literally whatever they want, including deleting all your files. Or worse.
No, "undefined behavior" is a term with a specific meaning, because it is used for a very specific purpose. "Undefined behavior" means that a compiler can assume that a particular scenario will never happen in correct code, and therefore if it ever thinks it has to care about that scenario, it can in fact ignore it for the purpose of optimization.
For instance, if you have a bool in C++, the only defined behavior is for it to contain 0 or 1. If you somehow force the memory cell to contain 2 or 255 or anything else, you have triggered undefined behavior. This means the compiler never has to check for it. If it's faster for the compiler to implement, say, `if (b) x[1] = a; else x[0] = a;` as `x[b] = a`, it can do that. Even though the `b == 2` case might lead to a buffer overflow, the UB rules means that, as far as the compiler cares, the `b == 2` case cannot exist. If `x.operator[]` is a function that does a bounds check, the compiler can inline it and remove the bounds check. And so forth.
Hence, the rule that on encountering UB, the compiler may do anything. It is not so much that the compiler is intentionally doing anything, as that it is outputting code that assumes the UB can't happen. If the compiler chooses to implement an `if (b)` via a computed jump, the `b == 2` case may land on some completely ridiculous code to start executing. It is not that the compiler wants to land there to punish you, it's that the compiler's only responsibility is to make sure the `b == 0` and `b == 1` cases work.
What you're pointing to is unspecified values. This means that the implementation can return any value, but must actually return some coherent value. If a C++ function returning bool returns an "unspecified value", it is still only returning either 0 or 1. You can act with the resulting bool as if it is in fact a bool. There are no optimization gotchas to worry about. It's just that you don't know what it is.
For the particular case here, when you're inspecting a runtime error, the only useful thing to do with it is to call the Error() method from the error interface. The language guarantees you that it will work (i.e., you have defined behavior, that you have an object that indeed implements the error interface). It doesn't give you any particular guidance on exact error values or the resulting string. But that's no more "undefined behavior" than the result of reading from a file is "undefined".
The entire reason people care about undefined behavior is the fact that compilers can do arbitrarily-stupid(-seeming) things when it is present, which is solely because compilers want to optimize. In the case of unspecified values, there is nothing to optimize.
(Anyway, I am not a Go programmer at all. Maybe there is actually UB somewhere in safe Go. But what I have seen is not it.)
This makes it sound like Go and Rust have some secret sauce that enables C's performance without UB. But it is not so.
For example, consider an expression like (x*2)/2. clang and gcc will both optimize this to just x, but Go and Rust will actually perform the multiplication and division, as required by their overflow semantics. So the performance vs safety tradeoff is real.
I can't speak for Go, but your intuition regarding Rust code is incorrect. See https://play.rust-lang.org/?gist=09b464627f856e0ebdcd&versio... , click the "Release" button, then click the "LLVM IR" button to view the generated code for yourself. TL;DR: the Rust code `let x = 7; let y = (x*2)/2; return y;` gets compiled down to `ret i32 7` in LLVM.
In fact, there is "secret sauce" here. The secret sauce is that Rust treats integer overflow specially: in debug mode (the default compilation mode), integer overflow is checked and will result in a panic. In release mode, integer overflow is unchecked. It's not "undefined behavior" in the C sense, because of the fact that it always returns a value--there are no nasal demons possible here (Rust disallows UB in non-`unsafe` code entirely, because memory unsafety is a subset of nasal demons). The exact value that it returns is unspecified, and the language provides no guarantee of backward compatibility if you rely on it (it also provides explicit wrapping arithmetic types if you explicitly desire wrapping behavior). And though it's a potential correctness hazard if you don't do any testing in debug mode, it's not a memory safety hazard, even in the unchecked release mode, because Rust's other safety mechanisms prevent you from using an integer like this to cause memory errors.
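To illustrate, the standard integer types make the intended overflow behavior explicit in the API (these methods are all in Rust's std):

```rust
fn main() {
    // Wrapping arithmetic: a defined two's-complement result.
    assert_eq!(i32::MAX.wrapping_add(1), i32::MIN);

    // Checked arithmetic: overflow is reported, not silently produced.
    assert_eq!(i32::MAX.checked_add(1), None);
    assert_eq!(1i32.checked_add(2), Some(3));

    // Saturating arithmetic: clamp at the representable bounds.
    assert_eq!(i32::MAX.saturating_add(1), i32::MAX);
}
```

Whichever variant you pick, the compiler gets your intent in writing instead of having to guess it from a bare `+`.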
A constant expression is not a good test: any compiler worth its salt (which includes LLVM) will be able to optimise a case like that. Change the function to:
pub fn test(x: i32) -> i32 {
let y = (x*2)/2;
y
}
and you'll see the difference (e.g. compare against a similar function on https://gcc.godbolt.org/ ).
> The secret sauce is that Rust treats integer overflow specially
Debug vs. release mode is irrelevant: the panicking case is more expensive than any arithmetic, and the RFC[0] was changed before it landed:
> The operations +, -, *, can underflow and overflow. When checking is enabled this will panic. When checking is disabled this will two's complement wrap.
The compiler still assumes that signed overflow may happen in release mode, and that the result needs to be computed as per two's complement, i.e. not unspecified.
Bah, darn you for changing the RFC out from under me. :P
I don't understand the point of specifying wrapped behavior in unchecked mode rather than leaving the value unspecified. Surely we don't care about properly accommodating use cases that will panic in debug mode.
I finally understood your point and where I was unclear, mostly after reading the other comment about the for loop.
You're right that C requires UB to optimize things like (x2)/2. My argument is that Go and Rust's secret sauce is you wouldn't have to write things spiritually similar to that in the first place. (x2)/2 is a bad example, since nobody would write that on purpose, but loops are a great one. C's only interface to elements of an array is a pointer. So C has to say that accessing out-of-bounds pointers are UB, as is even having a pointer that's neither in-bounds or right at the end, because it simply has no way to distinguish "pointer" from "pointer that is in-bounds for this array". The type of an array iterator in Rust (and I think also Go) carries knowledge of what array it's iterating over, and how far it can iterate; since it's part of the type system, the compiler has access to that knowledge. So you can just say `for element in array`, and the compiler knows that you're only accessing in-bounds pointers, and generate the same code C would, without needing to define a concept of UB.
Of course if you do generate pointers in unsafe code, you are subject to UB, same as in C. Rust and Go simply give you a language where most of the time, you don't need to reach for constructs that require UB to be performant.
(There is, however, an actual bit of secret sauce, at least in Rust but I suspect Go has an analogue: Rust's ownership system allows it to do far stronger alias analysis than even C's -fstrict-aliasing, without the risk of false positives -- you cannot construct overlapping mutable pointers in safe code.)
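One concrete sketch of that guarantee: safe Rust rejects two overlapping `&mut` borrows at compile time, and `split_at_mut` is the sanctioned way to get provably disjoint mutable views, which is exactly the non-aliasing fact the optimizer gets for free.

```rust
fn main() {
    let mut data = [1, 2, 3, 4];
    // Taking two &mut into the same array directly would be rejected at
    // compile time; split_at_mut returns provably non-overlapping halves.
    let (left, right) = data.split_at_mut(2);
    left[0] += 10;
    right[0] += 100;
    // The compiler may freely reorder the two writes above, because it
    // knows `left` and `right` cannot alias.
    assert_eq!(data, [11, 2, 103, 4]);
}
```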
Leaving aside the fact that I'm also comparing Rust... why not? It's a language that produces fast, static executables. I bet that a good fraction of the Debian archive (not all of it, for sure) could be reimplemented in Go without causing any problems. What "different universes" are these?
(To be fair, I haven't written any Go because my personal use cases involve things like shared libraries and C-ABI compatibility, so I'm going off what I've heard about Go, not personal experience. But out of what I've heard about Go, it's a fine language for this purpose, because the requirement here is just portability to all Debian architectures and comparable performance, and whether GC is used is an implementation detail.)