HLE is a pretty nifty implementation (note: I work for Intel) and an interesting use of CISC.
IA instructions can optionally contain prefixes, such as LOCK, address- and operand-size overrides (use 16, 32, or 64 bits), and repeat (REP). For example, in Linux, a single "REP STOS" (repeat store string) instruction is used to zero out the kernel's BSS region. REP can also be added to instructions that don't use it, and the CPU will simply ignore it--this is defined behavior.
For HLE, using the REP-style prefixes (named XACQUIRE and XRELEASE) in front of certain instructions that operate on a spinlock will begin or end a memory transaction without actually writing the spinlock value. Of course, on older CPUs, the prefixes will be ignored and the spinlock read/write will occur normally. On CPUs that support HLE: if another memory agent accesses the spinlock, the transaction will be aborted.
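As a sketch of the fallback behavior (this is not Intel's implementation, just a conventional C11 spinlock with comments marking where the HLE prefixes would apply): the locking exchange would carry XACQUIRE and the release store XRELEASE, and because those are just REP-class prefix bytes, older CPUs execute the same code as a plain spinlock.

```c
#include <stdatomic.h>
#include <assert.h>

static atomic_int lock = 0;

/* With HLE, this exchange would carry the XACQUIRE prefix: older CPUs
 * ignore it and write the lock normally; HLE-capable CPUs elide the
 * write and begin a memory transaction instead. */
void spin_lock(atomic_int *l) {
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        ; /* spin until the previous value was 0 */
}

/* With HLE, this store would carry the XRELEASE prefix, committing
 * the transaction rather than performing the store. */
void spin_unlock(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}
```

Either way the program-visible semantics are the same, which is what lets HLE binaries run unmodified on pre-HLE hardware.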
I work for Intel. This is not correct. A lot depends on your cache type. The two basic ones are uncacheable and write-back.
What you wrote is true for UC. For WB, reads can happen in any order (especially due to cache prefetchers). Writes always happen in program order, unless you are using another cache type such as write-combining (WC). WC is mainly used for graphics memory-mapped pixmaps, where the order doesn't matter.
But don't let this scare you too much. From the viewpoint of a single CPU, everything is in order. It's only when you look at it from the memory bus point-of-view that things get confusing.
Unless we are dealing with a memory-mapped IO device that has read side-effects, in which case you need to carefully choose a cache type.
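To make the single-CPU-vs-bus distinction concrete, here's a hedged C11 illustration (names are made up): on x86 WB memory the two stores below become visible in program order anyway, but the release/acquire pairing is what portable code relies on, since it also stops the compiler from reordering and pairs the reader's loads correctly.

```c
#include <stdatomic.h>

static int payload;               /* ordinary WB data */
static atomic_int ready = 0;      /* publication flag */

/* Writer: the payload store happens-before the release store to
 * 'ready', so a reader that observes ready == 1 sees the payload. */
void publish(int v) {
    payload = v;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: the acquire load pairs with the writer's release store. */
int try_consume(int *out) {
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;
        return 1;
    }
    return 0;
}
```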
I don't think you can from a user mode program. If you are dealing with normal memory, it will be typically WB, and that is the only thing most user-mode programs will encounter.
I see this more as a "team builder"; a reminder why we are involved in our software communities. It's great to take a moment every once in a while (especially something historic like this) and give ourselves or others some recognition.
Savvy managers and organizations know when to give praise--the key thing is not to trivialize it. This seems appropriate to me.
I feel that a lot of sw folks have a lack of imagination when it comes to hw. A lot of very smart people here work on making an efficient, fast front-end. There's lots of research in this area. To my eyes, the implementations are stunning.
To say that a variable-length CISC instruction set can never be as fast or efficient as ARM is sort of like saying that an interpreted language can never be as fast as C. And we all know how that goes over on HN.
I'm sure the front end has had a lot of work and cleverness put into it, and I'm sure that it's able to, say, correctly guess where instruction boundaries are more than 99% of the time without knowing for sure, and then remove those instructions from the queue if it turns out that the guess was wrong. But that just means that 4-wide decode is only on the same order of power consumption as execution, instead of being an order of magnitude more like if you'd done things the naive way. I'm sure that the work they do _is_ stunning, but it's work that doesn't even have to be done on more regular instruction sets.
It's not a problem with being variable length or CISC; it's a problem with the lack of self-synchronization. If, say, you had 8- or 16-bit blocks where the first bit of each block told you whether it was the beginning of an instruction, you wouldn't have this problem at all. I'd generally say that for a modern general-purpose computer, variable-length instructions and a large number of opcodes are a good idea, though a large number of addressing modes still seems to be a disadvantage. It's probably no accident that ARM, the "CISCiest" of the RISC processors, is so popular right now.
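To make the self-synchronization idea concrete (a toy encoding, not any real ISA): suppose the top bit of a byte is set only when that byte begins an instruction. Then every boundary can be found by inspecting bytes independently--exactly the property that lets a wide decoder skip the speculative boundary-guessing described above.

```c
#include <stddef.h>
#include <assert.h>

/* Toy encoding: bit 7 of a byte is 1 iff that byte starts an instruction. */
#define STARTS_INSN(b) (((b) & 0x80) != 0)

/* Record the offset of every instruction start. Each byte is classified
 * on its own, so all boundaries can be found in parallel, no guessing. */
size_t find_boundaries(const unsigned char *code, size_t len,
                       size_t *out, size_t max_out) {
    size_t n = 0;
    for (size_t i = 0; i < len && n < max_out; i++)
        if (STARTS_INSN(code[i]))
            out[n++] = i;
    return n;
}
```

With x86's encoding, by contrast, you can't classify a byte without first decoding everything before it, which is why the front end has to guess.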
But don't ARM/Marvell/Qualcomm/Nvidia have a lot of smart people working on making an efficient fast front-end?
Overcoming an obstacle is great, but I feel like companies who don't have to overcome that obstacle can focus their energy on other aspects of the product.
"Software written by MIT programmers was woven into core rope memory by female workers in factories. Some programmers nicknamed the finished product LOL memory, for Little Old Lady memory"
The graphics in IVB are not just on the die, but are part of the internal memory ring and thus the shared L3 cache. So CPU cores are able to directly send and receive shared memory from the GPU, and the GPU is able to use the same last-level cache (LLC) as the CPU cores.
This is why you see the simple fill-rate benchmarks of IVB blow away SNB and discrete GPUs.