> How would this look in the kernel? Wilcox had originally thought that, on a 128-bit system, an int should be 32 bits, long would be 64 bits, and both long long and pointer types would be 128 bits. But that runs afoul of deeply rooted assumptions in the kernel that long has the same size as the CPU's registers, and that long can also hold a pointer value. The conclusion is that long must be a 128-bit type.
Can anyone explain the rationale for not simply naming types after their size? In many programming languages, rather than this arcane terminology, “i16”, “i32”, “i64”, and “i128” simply exist.
Legacy: there’s lots of dumb stuff in C. As you note, in modern languages the rule is generally to have fixed-size integers.
Though there are portability concerns: that world is mostly gone (it survives in some corners of computing, e.g. DSPs), but if you’re only using fixed-size integers, what do you do when a platform doesn’t have that size? With a more flexible scheme you have fewer issues there. However, as the weirdness landscape contracts, the risk of making technically incorrect assumptions (about the relations between type sizes, or the actual limits and behaviour of a given type) increases dramatically.
Finally, there’s the issue at hand here: even with fixed-size integers, a pointer is a variable-size datum, so you still need a variable-size integer to go with it. C historically lacked one (nowadays it’s called uintptr_t), so the kernel made assumptions which turn out to be incorrect.
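As a minimal C sketch of the pattern in question (assuming nothing beyond the standard headers), uintptr_t is the portable spelling of what the kernel historically spelled long:

    #include <stdint.h>

    /* Round-trip a pointer through an integer. uintptr_t is guaranteed
     * (where it exists) to be wide enough to hold the pointer's value;
     * using long instead bakes in the unportable assumption that
     * sizeof(long) == sizeof(void *). */
    void *roundtrip(void *p)
    {
        uintptr_t bits = (uintptr_t)p; /* value-preserving conversion */
        return (void *)bits;           /* compares equal to the original p */
    }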
Note that you can still get it wrong even if you try: e.g. Rust believes, and generally assumes, that usize and pointers correspond, but that gets iffy with concepts like pointer provenance, which decouple pointer size from address-space size.
How many languages actually take advantage of the fact that they could get more performance out of "small" register-sized integers? You tend to lose a lot of performance to arbitrary-sized integers in most languages, even if your inputs are small.
> but always correct seems like the better tradeoff.
Not if you are dealing with time constraints. Glitchy output isn't good, but locking up the system because some buggy code path is trying to allocate a 10 GB integer can be worse.
> How many languages actually take advantage of the fact that they could get more performance out of "small" register sized integers?
Er...all of them? I am not aware of one that uses growable integers as their default integer representation that does not use some form of tagged pointer optimisation.
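For the curious, here's a toy C sketch of that tagged-pointer trick (all the names are invented for illustration): the low bit marks a small integer stored inline, relying on real heap pointers being at least 2-byte aligned:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uintptr_t value; /* either a tagged small int or a real pointer */

    /* Low bit set: a small integer stored inline, shifted left one bit.
       Low bit clear: an ordinary (aligned) heap pointer. */
    static inline value    make_small(intptr_t n) { return ((uintptr_t)n << 1) | 1; }
    static inline bool     is_small(value v)      { return (v & 1) != 0; }
    /* Right-shifting a negative intptr_t is implementation-defined, but
       it is an arithmetic shift on all mainstream compilers. */
    static inline intptr_t small_value(value v)   { return (intptr_t)v >> 1; }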
> Glitchy output isn't good
Hmm...understating the problem here just a tad.
"In 2021, they ranked 12th in the updated Common Weakness Enumeration (CWE) list of the most common flaws, bugs, faults, and other errors in either hardware or software. The team behind the list ranked integer overflows just after “Missing Authentication for Critical Function”, due to the severity and prevalence of integer overflows."
I'm sure someone will come along and explain why I have no idea what I'm talking about, but so far my understanding is those names exist because of the difference in CPU word size. Typically "int" represents the natural word size for that CPU, which matches the register size as well, so 'int plus int' is as fast as addition can run by default, on a variety of CPUs. That's one reason chars and shorts are promoted to ints automatically in C.
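For concreteness, a quick example of that promotion:

    #include <stdio.h>

    int main(void)
    {
        char a = 100, b = 100;
        int sum = a + b;     /* both chars are promoted to int before the
                                addition, so it runs at native int width */
        printf("%d\n", sum); /* prints 200, even though 200 > CHAR_MAX
                                where char is signed and 8 bits */
        return 0;
    }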
Let's say you want to work with numbers and you want your program to run as fast as possible. If you specify the number of bits you want, like i32, then on 64-bit CPUs, where the register holding this value has an extra 32 bits available, the compiler must make sure those extra bits are not garbage and cannot influence a subsequent operation (like a signed right shift). So it might be forced to insert an instruction to clear or sign-extend the upper 32 bits, and you end up with two instructions for a single operation, meaning that your code now runs slower on that machine.
However, had you used 'int' in your code, the compiler could have represented those values with a 64-bit data type on 64-bit machines and a 32-bit data type on 32-bit machines, and your code would run optimally regardless of the CPU. This of course means it's up to you to make sure that whatever values your program handles fit in a 32-bit data type, and sometimes that's difficult to guarantee.
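Roughly, the difference being described looks like this; whether the compiler really emits an extra extension instruction depends heavily on the target ISA and the surrounding code, so take it as a sketch of the argument rather than a guaranteed cost:

    #include <stdint.h>

    /* With a fixed 32-bit type the compiler must behave as if the value
       is exactly 32 bits wide, which on some 64-bit targets can require
       an extra sign-extension before the shift. */
    int32_t shift_fixed(int32_t x, int n)  { return x >> n; }

    /* With a native-width type there are no extra bits to police. */
    int64_t shift_native(int64_t x, int n) { return x >> n; }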
If you decide to have your cake and eat it too by saying "fine, I'll just select i32 or i64 at compile time with a condition" and you add some alias, like "word" -> either i32 or i64, "half word" -> either i16 or i32, etc. depending on the target CPU, then congrats, you've just reinvented 'int', 'short', 'long', et al.
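Something like this hypothetical header, in other words, which is just 'int' and 'short' with extra steps:

    #include <stdint.h>

    /* A hypothetical "flexible width" header that picks types at
       compile time by inspecting the target's pointer width. */
    #if UINTPTR_MAX > 0xFFFFFFFFu
    typedef int64_t word;     /* 64-bit targets */
    typedef int32_t halfword;
    #else
    typedef int32_t word;     /* 32-bit targets */
    typedef int16_t halfword;
    #endif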
Personally, I'm finding it useful to use fixed integer sizes (e.g. int32_t) when writing and reading binary files, so that I know how many bytes of data to read when loading the file; but once those values are read, I cast them to (int) so that the rest of the program can use them optimally regardless of the CPU the program is running on.
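Concretely, that pattern looks something like this (load_count and its -1 error sentinel are invented for the example):

    #include <stdint.h>
    #include <stdio.h>

    /* Read a fixed-width 32-bit field from a binary file, then hand it
       to the rest of the program as a plain int. */
    int load_count(FILE *f)
    {
        int32_t raw;
        if (fread(&raw, sizeof raw, 1, f) != 1)
            return -1;   /* read failed; -1 is an invented sentinel */
        return (int)raw; /* on-disk width stays fixed, in-memory use is int */
    }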
That explains "int", but it doesn't explain short or long or long long. Rust has "usize" for the "int" case, and then fixed sizes for everything else, which works much better. If you want portable software, it's usually more important to know how many bits you have available for your calculation than it is to know how efficiently that calculation will happen.
I suppose short and long have to do with register sizes being available as half-word and dword, and there are instructions that work with smaller data sizes on both x86 and ARM, but I agree that in today's world you want to know the number of bits. On those weak 4 MHz machines, squeezing out a few extra cycles was typically very important.
> But a better solution might just be to switch to Rust types, where i32 is a 32-bit, signed integer, while u128 would be unsigned and 128 bits. This convention is close to what the kernel uses already internally, though a switch from "s" to "i" for signed types would be necessary. Rust has all the types we need, he said, it would be best to just switch to them.
That's pretty much the mentioned proposal of "just use Rust types", which are i8/u8 through i128/u128, plus usize/isize for pointer-sized things.
The only improvement that you really need over that is to differentiate between what C calls size_t and uintptr_t: the size of the largest possible array, and the size of a pointer. On "normal" architectures they're the same, but on architectures that do pointer tagging or segmented memory a pointer might be bigger than the biggest possible array.
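A small C program makes the distinction concrete; on a typical flat-memory target the two widths printed will match, but nothing in the standard requires it:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* size_t bounds the size of any single object; uintptr_t must be
           wide enough to hold a pointer's bits. The standard never ties
           the two together. */
        printf("size_t:    %zu bits\n", sizeof(size_t) * 8);
        printf("uintptr_t: %zu bits\n", sizeof(uintptr_t) * 8);
        return 0;
    }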
But you still have to deal with legacy C code, and C was dreamt up when running code written for 16 bits on a 14-bit architecture without losing speed was a consideration, so the C types are weird.
Plus, it was a language for writing systems, where "size of a register on the current machine" was a nice shortcut for "int", and registers could be anywhere from 8 to 32 bits, with 48 or even 12 also possibilities.
C still supports (I think; C23 may have finally killed support*) architectures like Unisys ClearPath mainframes, which have a 36-bit word, a 9-bit byte, a 36-bit (IIRC) data pointer, and an 81-bit function pointer.
The changes to the arithmetic rules mean you can't have sign-magnitude or 1s complement anymore IIRC
C99 specifies stdint.h/inttypes.h as part of the standard library for exactly this purpose. I'd expect using it would be a best practice at this point. But I'm no C expert, so maybe there's a good reason for not always using those explicitly sized types.
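Using them looks like this; inttypes.h layers the matching printf format macros on top of stdint.h's types:

    #include <inttypes.h> /* PRId32, PRIu64; also pulls in stdint.h */
    #include <stdio.h>

    int main(void)
    {
        int32_t  a = -42;               /* exactly 32 bits on every platform */
        uint64_t b = UINT64_C(1) << 40; /* exactly 64 bits on every platform */

        /* The PRI* macros expand to the right conversion specifiers
           for the fixed-width types on the current platform. */
        printf("a = %" PRId32 ", b = %" PRIu64 "\n", a, b);
        return 0;
    }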