> How would this look in the kernel? Wilcox had originally thought that, on a 128-bit system, an int should be 32 bits, long would be 64 bits, and both long long and pointer types would be 128 bits. But that runs afoul of deeply rooted assumptions in the kernel that long has the same size as the CPU's registers, and that long can also hold a pointer value. The conclusion is that long must be a 128-bit type.
Can anyone explain the rationale for not simply naming types after their size? In many programming languages, rather than this arcane terminology, “i16”, “i32”, “i64”, and “i128” simply exist.
Legacy: there’s lots of dumb stuff in C. As you note, in modern languages the rule is generally to have fixed-size integers.
Though there are portability concerns: that world is mostly gone (it survives in some corners of computing, e.g. DSPs), but if you’re only using fixed-size integers, what do you do when a platform doesn’t have that size? With a more flexible scheme you have fewer issues there. However, as the weirdness landscape contracts, the risk of making technically incorrect assumptions (about the relations between type sizes, or the actual limits and behaviour of a given type) increases dramatically.
Finally, there’s the issue at hand here: even with fixed-size integers, a pointer is a variable-size datum, so you still need a variable-size integer to go with it. C historically lacked one (nowadays it’s called uintptr_t), so the kernel made assumptions which turn out to be incorrect.
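As a minimal C sketch of the pattern in question (assuming nothing beyond the standard headers), uintptr_t is the portable spelling of what the kernel historically spelled long:

    #include <stdint.h>

    /* Round-trip a pointer through an integer. uintptr_t is guaranteed
     * (where it exists) to be wide enough to hold the pointer's value;
     * using long instead bakes in the unportable assumption that
     * sizeof(long) == sizeof(void *). */
    void *roundtrip(void *p)
    {
        uintptr_t bits = (uintptr_t)p; /* value-preserving conversion */
        return (void *)bits;           /* compares equal to the original p */
    }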
Note that you can still get it wrong even if you try: e.g. Rust believes, and generally assumes, that usize and pointers correspond, but that gets iffy with concepts like pointer provenance, which decouple pointer size from address-space size.
How many languages actually take advantage of the fact that they could get more performance out of "small" register-sized integers? You tend to lose a lot of performance to arbitrary-sized integers in most languages, even if your inputs are small.
> but always correct seems like the better tradeoff.
Not if you are dealing with time constraints. Glitchy output isn't good, but locking up the system because some buggy code path is trying to allocate a 10 GB integer can be worse.
> How many languages actually take advantage of the fact that they could get more performance out of "small" register sized integers?
Er...all of them? I am not aware of one that uses growable integers as their default integer representation that does not use some form of tagged pointer optimisation.
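For the curious, here's a toy C sketch of that tagged-pointer trick (all the names are invented for illustration): the low bit marks a small integer stored inline, relying on real heap pointers being at least 2-byte aligned:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uintptr_t value; /* either a tagged small int or a real pointer */

    /* Low bit set: a small integer stored inline, shifted left one bit.
       Low bit clear: an ordinary (aligned) heap pointer. */
    static inline value    make_small(intptr_t n) { return ((uintptr_t)n << 1) | 1; }
    static inline bool     is_small(value v)      { return (v & 1) != 0; }
    /* Right-shifting a negative intptr_t is implementation-defined, but
       it is an arithmetic shift on all mainstream compilers. */
    static inline intptr_t small_value(value v)   { return (intptr_t)v >> 1; }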
> Glitchy output isn't good
Hmm...understating the problem here just a tad.
"In 2021, they ranked 12th in the updated Common Weakness Enumeration (CWE) list of the most common flaws, bugs, faults, and other errors in either hardware or software. The team behind the list ranked integer overflows just after “Missing Authentication for Critical Function”, due to the severity and prevalence of integer overflows."
I'm sure someone will come along and explain why I have no idea what I'm talking about, but so far my understanding is those names exist because of the difference in CPU word size. Typically "int" represents the natural word size for that CPU, which matches the register size as well, so 'int plus int' is as fast as addition can run by default, on a variety of CPUs. That's one reason chars and shorts are promoted to ints automatically in C.
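For concreteness, a quick example of that promotion:

    #include <stdio.h>

    int main(void)
    {
        char a = 100, b = 100;
        int sum = a + b;     /* both chars are promoted to int before the
                                addition, so it runs at native int width */
        printf("%d\n", sum); /* prints 200, even though 200 > CHAR_MAX
                                where char is signed and 8 bits */
        return 0;
    }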
Let's say you want to work with numbers and you want your program to run as fast as possible. If you specify the number of bits you want, like i32, then on 64-bit CPUs, where the register holding this value has an extra 32 bits available, the compiler must make sure those extra bits are not garbage and cannot influence a subsequent operation (like a signed right shift). So it might be forced to insert an instruction to clear or sign-extend the upper 32 bits, and you end up with two instructions for a single operation, meaning that your code now runs slower on that machine.
However, had you used 'int' in your code, the compiler could have represented those values with a 64-bit data type on 64-bit machines and a 32-bit data type on 32-bit machines, and your code would run optimally regardless of the CPU. This of course means it's up to you to make sure that whatever values your program handles fit in a 32-bit data type, and sometimes that's difficult to guarantee.
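Roughly, the difference being described looks like this; whether the compiler really emits an extra extension instruction depends heavily on the target ISA and the surrounding code, so take it as a sketch of the argument rather than a guaranteed cost:

    #include <stdint.h>

    /* With a fixed 32-bit type the compiler must behave as if the value
       is exactly 32 bits wide, which on some 64-bit targets can require
       an extra sign-extension before the shift. */
    int32_t shift_fixed(int32_t x, int n)  { return x >> n; }

    /* With a native-width type there are no extra bits to police. */
    int64_t shift_native(int64_t x, int n) { return x >> n; }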
If you decide to have your cake and eat it too by saying "fine, I'll just select i32 or i64 at compile time with a condition" and you add some alias, like "word" -> either i32 or i64, "half word" -> either i16 or i32, etc. depending on the target CPU, then congrats, you've just reinvented 'int', 'short', 'long', et al.
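Something like this hypothetical header, in other words, which is just 'int' and 'short' with extra steps:

    #include <stdint.h>

    /* A hypothetical "flexible width" header that picks types at
       compile time by inspecting the target's pointer width. */
    #if UINTPTR_MAX > 0xFFFFFFFFu
    typedef int64_t word;     /* 64-bit targets */
    typedef int32_t halfword;
    #else
    typedef int32_t word;     /* 32-bit targets */
    typedef int16_t halfword;
    #endif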
Personally, I'm finding it useful to use fixed integer sizes (e.g. int32_t) when writing and reading binary files, so that I know how many bytes of data to read when loading the file; but once those values are read, I cast them to (int) so that the rest of the program can use them optimally regardless of the CPU the program is running on.
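Concretely, that pattern looks something like this (load_count and its -1 error sentinel are invented for the example):

    #include <stdint.h>
    #include <stdio.h>

    /* Read a fixed-width 32-bit field from a binary file, then hand it
       to the rest of the program as a plain int. */
    int load_count(FILE *f)
    {
        int32_t raw;
        if (fread(&raw, sizeof raw, 1, f) != 1)
            return -1;   /* read failed; -1 is an invented sentinel */
        return (int)raw; /* on-disk width stays fixed, in-memory use is int */
    }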
That explains "int", but it doesn't explain short or long or long long. Rust has "usize" for the "int" case, and then fixed sizes for everything else, which works much better. If you want portable software, it's usually more important to know how many bits you have available for your calculation than it is to know how efficiently that calculation will happen.
I suppose short and long have to do with register sizes being available as half-word and dword, and there are instructions that work with smaller data sizes on both x86 and ARM, but I agree that in today's world you want to know the number of bits. On those weak 4 MHz machines, squeezing out a few extra cycles was typically very important.
> But a better solution might just be to switch to Rust types, where i32 is a 32-bit, signed integer, while u128 would be unsigned and 128 bits. This convention is close to what the kernel uses already internally, though a switch from "s" to "i" for signed types would be necessary. Rust has all the types we need, he said, it would be best to just switch to them.
That's pretty much the mentioned proposal of "just use Rust types", which are i8/u8 through i128/u128, plus usize/isize for pointer-sized things.
The only improvement that you really need over that is to differentiate between what C calls size_t and uintptr_t: the size of the largest possible array, and the size of a pointer. On "normal" architectures they're the same, but on architectures that do pointer tagging or segmented memory a pointer might be bigger than the biggest possible array.
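A small C program makes the distinction concrete; on a typical flat-memory target the two widths printed will match, but nothing in the standard requires it:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* size_t bounds the size of any single object; uintptr_t must be
           wide enough to hold a pointer's bits. The standard never ties
           the two together. */
        printf("size_t:    %zu bits\n", sizeof(size_t) * 8);
        printf("uintptr_t: %zu bits\n", sizeof(uintptr_t) * 8);
        return 0;
    }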
But you still have to deal with legacy C code, and C was dreamt up when running code written for 16 bits on a 14-bit architecture without losing speed was a consideration, so the C types are weird.
Plus, it was a language for writing systems, where "size of a register on the current machine" was a nice shortcut for "int", and registers could be anywhere from 8 to 32 bits, with 48 or even 12 also possibilities.
C still supports (I think; C23 may have finally killed support*) architectures like Unisys ClearPath mainframes, which have a 36-bit word, a 9-bit byte, a 36-bit (IIRC) data pointer, and an 81-bit function pointer.
The changes to the arithmetic rules mean you can't have sign-magnitude or 1s complement anymore IIRC
C99 specifies stdint.h/inttypes.h as part of the standard library for exactly this purpose. I'd expect using it would be a best practice at this point. But I'm no C expert, so maybe there's a good reason for not always using those explicitly sized types.
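Using them looks like this; inttypes.h layers the matching printf format macros on top of stdint.h's types:

    #include <inttypes.h> /* PRId32, PRIu64; also pulls in stdint.h */
    #include <stdio.h>

    int main(void)
    {
        int32_t  a = -42;               /* exactly 32 bits on every platform */
        uint64_t b = UINT64_C(1) << 40; /* exactly 64 bits on every platform */

        /* The PRI* macros expand to the right conversion specifiers
           for the fixed-width types on the current platform. */
        printf("a = %" PRId32 ", b = %" PRIu64 "\n", a, b);
        return 0;
    }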