HLE is a pretty nifty implementation (note: I work for Intel) and an interesting use of CISC.
IA instructions can optionally contain prefixes, such as LOCK, address- and operand-size overrides (use 16, 32, or 64 bits), and repeat (REP). For example, in Linux, a single "REP STOS" (repeat store string) instruction is used to zero out the kernel's BSS region. REP can also be added to instructions that don't use it, and the CPU will simply ignore it--this is defined behavior.
For HLE, using the REP-style prefixes (named XACQUIRE and XRELEASE) in front of certain instructions that operate on a spinlock will begin or end a memory transaction without actually writing the spinlock value. Of course, on older CPUs, the prefixes will be ignored and the spinlock read/write will occur normally. On CPUs that support HLE: if another memory agent accesses the spinlock, the transaction will be aborted.
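As a sketch of the fallback behavior (this is not Intel's implementation, just a conventional C11 spinlock with comments marking where the HLE prefixes would apply): the locking exchange would carry XACQUIRE and the release store XRELEASE, and because those are just REP-class prefix bytes, older CPUs execute the same code as a plain spinlock.

```c
#include <stdatomic.h>
#include <assert.h>

static atomic_int lock = 0;

/* With HLE, this exchange would carry the XACQUIRE prefix: older CPUs
 * ignore it and write the lock normally; HLE-capable CPUs elide the
 * write and begin a memory transaction instead. */
void spin_lock(atomic_int *l) {
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        ; /* spin until the previous value was 0 */
}

/* With HLE, this store would carry the XRELEASE prefix, committing
 * the transaction rather than performing the store. */
void spin_unlock(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}
```

Either way the program-visible semantics are the same, which is what lets HLE binaries run unmodified on pre-HLE hardware.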
I work for Intel. This is not correct. A lot depends on your cache type. The two basic ones are uncacheable and write-back.
What you wrote is true for UC. For WB, reads can happen in any order (especially due to cache prefetchers). Writes always happen in program order, unless you are using another cache type such as write-combining (WC). WC is mainly used for graphics memory-mapped pixmaps, where the order doesn't matter.
But don't let this scare you too much. From the viewpoint of a single CPU, everything is in order. It's only when you look at it from the memory bus point-of-view that things get confusing.
Unless we are dealing with a memory-mapped IO device that has read side-effects, in which case you need to carefully choose a cache type.
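To make the single-CPU-vs-bus distinction concrete, here's a hedged C11 illustration (names are made up): on x86 WB memory the two stores below become visible in program order anyway, but the release/acquire pairing is what portable code relies on, since it also stops the compiler from reordering and pairs the reader's loads correctly.

```c
#include <stdatomic.h>

static int payload;               /* ordinary WB data */
static atomic_int ready = 0;      /* publication flag */

/* Writer: the payload store happens-before the release store to
 * 'ready', so a reader that observes ready == 1 sees the payload. */
void publish(int v) {
    payload = v;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: the acquire load pairs with the writer's release store. */
int try_consume(int *out) {
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;
        return 1;
    }
    return 0;
}
```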
I don't think you can from a user mode program. If you are dealing with normal memory, it will be typically WB, and that is the only thing most user-mode programs will encounter.
I see this more as a "team builder"; a reminder why we are involved in our software communities. It's great to take a moment every once in a while (especially something historic like this) and give ourselves or others some recognition.
Savvy managers and organizations know when to give praise--the key thing is not to trivialize it. This seems appropriate to me.
I feel that a lot of sw folks have a lack of imagination when it comes to hw. A lot of very smart people here work on making an efficient, fast front-end. There's lots of research in this area. To my eyes, the implementations are stunning.
To say that a variable-length CISC instruction set can never be as fast or efficient as ARM is sort of like saying that an interpreted language can never be as fast as C. And we all know how that goes over on HN.
I'm sure the front end has had a lot of work and cleverness put into it, and I'm sure that it's able to, say, correctly guess where instruction boundaries are more than 99% of the time without knowing for sure, and then remove those instructions from the queue if it turns out that the guess was wrong. But that just means that 4-wide decode is only on the same order of power consumption as execution, instead of being an order of magnitude more like if you'd done things the naive way. I'm sure that the work they do _is_ stunning, but it's work that doesn't even have to be done on more regular instruction sets.
It's not a problem with being variable length or CISC; it's a problem with the lack of self-synchronization. If, say, you had 8- or 16-bit blocks where the first bit of each block told you whether it was the beginning of an instruction, you wouldn't have this problem at all. I'd generally say that for a modern general-purpose computer, variable-length instructions and a large number of opcodes are a good idea, though a large number of addressing modes still seems to be a disadvantage. It's probably no accident that ARM, the "CISCiest" of the RISC processors, is so popular right now.
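To make the self-synchronization idea concrete (a toy encoding, not any real ISA): suppose the top bit of a byte is set only when that byte begins an instruction. Then every boundary can be found by inspecting bytes independently--exactly the property that lets a wide decoder skip the speculative boundary-guessing described above.

```c
#include <stddef.h>
#include <assert.h>

/* Toy encoding: bit 7 of a byte is 1 iff that byte starts an instruction. */
#define STARTS_INSN(b) (((b) & 0x80) != 0)

/* Record the offset of every instruction start. Each byte is classified
 * on its own, so all boundaries can be found in parallel, no guessing. */
size_t find_boundaries(const unsigned char *code, size_t len,
                       size_t *out, size_t max_out) {
    size_t n = 0;
    for (size_t i = 0; i < len && n < max_out; i++)
        if (STARTS_INSN(code[i]))
            out[n++] = i;
    return n;
}
```

With x86's encoding, by contrast, you can't classify a byte without first decoding everything before it, which is why the front end has to guess.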
But don't ARM/Marvell/Qualcomm/Nvidia have a lot of smart people working on making an efficient fast front-end?
Overcoming an obstacle is great, but I feel like companies who don't have to overcome that obstacle can focus their energy on other aspects of the product.
"Software written by MIT programmers was woven into core rope memory by female workers in factories. Some programmers nicknamed the finished product LOL memory, for Little Old Lady memory"
The graphics in IVB are not just on the die, but are part of the internal memory ring and thus the shared L3 cache. So CPU cores are able to directly send and receive shared memory from the GPU, and the GPU is able to use the same last-level cache (LLC) as the CPU cores.
This is why you see the simple fill-rate benchmarks of IVB blow away SNB and discrete GPUs.