The Alpha AXP, part 9: The memory model and atomic memory operations

Date: August 17, 2017 / year-entry #186
Tags: code
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20170817-00/?p=96835
Comments: 13
Summary: When does it happen? I have no idea!

The Alpha AXP has a notoriously weak memory model. When a processor writes to memory, the result becomes visible to other processors eventually, but there are very few constraints beyond that.

For example, writes can become visible out of order. One processor writes a value to a location, and then writes a value to another location, and another processor can observe the second write without the first. Similarly, reads can complete out of order. One processor reads a value from a location, then reads from another location, and the result could be that the second read happens before the first.¹

Assume that memory locations x and y are both initially zero. The following sequence of operations is valid.

Processor 1             Processor 2
write 1 to x            read y yields 1
write 1 to y            read x yields 0

In the above example, processor 2 observed the second write without the first, either because processor 1's writes became visible out of order or because processor 2's reads completed out of order.

The memory barrier instruction MB instructs the processor to make all previous loads and stores complete to memory before starting any new loads and stores. However, it doesn't force other processors to do anything; other processors can still complete their memory operations out of order.

Similarly, the following sequence is also legal:

Processor 1             Processor 2
write 1 to x            read y yields 1
MB (memory barrier)
write 1 to y            read x yields 0

This is also legal because the memory barrier on processor 1 ensures that the value of x gets updated before the value of y, but it doesn't prevent processor 2 from performing the reads out of order.

In order to prevent x and y from appearing to be updated out of order, both sides need to issue memory barriers. Processor 1 needs a memory barrier to ensure that the write to x happens before the write to y, and processor 2 needs a memory barrier to ensure that the read from y happens before the read from x.
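
With barriers on both sides, the troublesome result is ruled out: if processor 2's read of y yields 1, then the write to y is visible, the barrier on processor 1 guarantees the earlier write to x is also visible, and the barrier on processor 2 guarantees the read of x happens afterward, so it must yield 1.

Processor 1             Processor 2
write 1 to x            read y yields 1
MB (memory barrier)     MB (memory barrier)
write 1 to y            read x yields 1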

Okay, onward to atomic operations.

Performing atomic operations on memory requires the help of two new pairs of instructions:

    LDL_L   Ra, disp16(Rb)  ; load locked (longword)
    LDQ_L   Ra, disp16(Rb)  ; load locked (quadword)

    STL_C   Ra, disp16(Rb)  ; store conditional (longword)
    STQ_C   Ra, disp16(Rb)  ; store conditional (quadword)

The load locked instruction performs a traditional read from memory, but it also sets the processor's lock_flag and records the physical address in the phys_locked register. The processor monitors for any changes to that physical address from any processor, and if a change is detected,² the lock_flag is cleared.

The lock_flag is also cleared by a variety of other conditions, most notably when the processor returns from kernel mode back to user mode. This means that any hardware interrupt or trap (such as a page fault, or executing an emulated instruction) will clear the lock_flag. It is recommended that operating systems allow at least 40 instructions to execute between timer interrupts.

You can later perform a store conditional operation, which stores a value to a memory address provided the lock_flag is still set. If it is, then the store is performed and the source register is set to 1. If not, then memory is left unmodified and the source register is set to 0. Either way, the lock_flag is cleared.

A typical atomic increment looks like this:

retry:
    LDL_L   t1, (t0)        ; load locked
    ADDL    t1, #1, t1      ; increment value
    STL_C   t1, (t0)        ; store conditional
                            ; t1 = 1 if store was successful
    BEQ     t1, failed      ; jump if store failed
    ... continue execution ...

failed:
    BR      zero, retry     ; try again

In the case where the store failed, we jump forward, and then back. Recall that conditional jumps backward are predicted taken, and conditional jumps forward are predicted not taken. If we had simply jumped backward on failure, then the processor would have a branch prediction miss in the common case that there is no contention.

Note that the above sequence does not impose any memory ordering. In practice, you will see a MB before and/or after the atomic sequence in order to enforce acquire and/or release semantics.
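
For example, a fully fenced version of the increment might look like this sketch; which barriers you actually need depends on whether you want acquire semantics, release semantics, or both:

    MB                      ; release: prior writes become visible first
retry:
    LDL_L   t1, (t0)        ; load locked
    ADDL    t1, #1, t1      ; increment value
    STL_C   t1, (t0)        ; store conditional
    BEQ     t1, failed      ; jump if store failed
    MB                      ; acquire: later reads cannot move earlier
    ... continue execution ...

failed:
    BR      zero, retry     ; try again

Note that the release barrier sits outside the retry loop: it only needs to execute once, before the first attempt.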

There are a number of practical rules regarding the LDx_L and STx_C instructions. The most important ones are these:

  • The STx_C should be to the same address as the most recently preceding LDx_L. This isn't a problem in practice because storing back to the location of the previous load is the intended use of the instructions.³
  • The processor may lose track of your LDx_L if you perform any memory access other than a matching STx_C, or if you perform a branch instruction, or if you trigger a trap (such as executing an emulated instruction), or if you execute more than 20 instructions after the LDx_L.

Although each STx_C should be preceded by a matching LDx_L, it is legal to perform a LDx_L with no matching STx_C. This can happen with conditional interlocked operations, where you discover after the LDx_L that the condition is not satisfied and you abandon the interlocked operation.

The second rule basically says that the state created by the LDx_L instruction is ephemeral. After performing the LDx_L instruction, do as little work as possible to determine what value you want to store, and then store it right away. You are not allowed to take any branches, but CMOVcc is okay.
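
For example, a compare-and-exchange (one of the conditional interlocked operations mentioned above) might look like this sketch, with illustrative register assignments: t0 holds the address, a1 the expected value, and a2 the new value.

retry:
    LDL_L   t1, (t0)        ; load locked the current value
    CMPEQ   t1, a1, t2      ; t2 = 1 if current value equals expected value
    BEQ     t2, done        ; mismatch: abandon without storing
    MOV     a2, t3          ; copy new value (STL_C overwrites its source)
    STL_C   t3, (t0)        ; store conditional
    BEQ     t3, failed      ; jump if store failed
done:
    ... continue execution ...

failed:
    BR      zero, retry     ; try again

On the success path, no branch is taken between the LDL_L and the STL_C; the branch to done is taken only when the operation is being abandoned, at which point losing the lock_flag no longer matters.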

The requirement that you get around to the STx_C within 20 instructions is a consequence of the requirement on operating systems that they allow at least 40 instructions to execute between timer interrupts: a sequence that short is guaranteed to eventually fit entirely within an interrupt-free window, so a retry loop will eventually succeed.

Next time, we'll do a little exercise based on what we've learned so far.

¹ Mind you, out-of-order reads are pretty common on all architectures. Store-to-load forwarding means that a speculated read operation to speculatively-written memory can complete before a read operation that occurred notionally earlier in the instruction stream. However, as Fabian Giesen notes, the x86 has extra logic to avoid getting caught doing so!

² The architecture permits implementations to be a little sloppy with the change detection. In particular, any modification within 128 bytes of the locked address is permitted to clear the lock_flag. This means that targets of atomic operations should be at least 128 bytes apart in order to minimize the likelihood of false positives.
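
For example, a data layout like the following sketch keeps two frequently-updated interlocked counters from triggering each other's false positives (the labels are hypothetical; .quad and .space are common assembler data directives):

counter1:
    .quad   0               ; target of interlocked operations
    .space  120             ; padding: next target begins 128 bytes later
counter2:
    .quad   0               ; independent target, no false sharing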

³ There are complicated rules about what happens if you violate this guideline (including some parts which are left implementation-defined), but they are largely irrelevant because you should just follow the guideline already.


Comments (13)
  1. Matthew Vincent says:

    Wow, what a crazily weak memory model! Explicit optimistic locking just to perform consistent updates!

    1. Fabian Giesen says:

      This set of primitives (LL/SC, load-link/store-conditional) for atomic operations is very common among 90s RISCs. It was very attractive then because it’s close to the minimum amount of dedicated hardware you can spend that lets programs build more useful higher-level atomic primitives (atomic loads/stores, fetch-and-add, fetch-and-or, swaps, compare-exchange etc.). That list gives you a hint to the original problem: there’s many such useful operations, everyone wants something different, and adding 6+ atomic operation primitives to a user-mode instruction set that only has around 30 instructions is not an attractive prospect, especially since next year’s OS might really want another new primitive! Today we have more or less settled on a set of useful atomic operations, but this is after 25 more years of experience with multi-processor systems. It was less clear then.

      The flipside is that you’re really exposing low-level, nuts-and-bolts details of your processor pipeline and memory subsystem to get there. LL/SC is a natural fit for an in-order classic RISC pipeline, but gets decidedly messier once you build an out-of-order version – what happens if speculative operations clear your lock bit? Oops. – or one with several hardware threads: can you guarantee that, with two hardware threads simultaneously trying to perform unrelated atomic operations, at least one of them will eventually make forward progress? The list of rules of exactly what you can and cannot do between a load-link and store-conditional is impressively long and technical for modern architectures.

      It’s similar to branch delay slots: a simple and elegant solution to a real problem if your architecture is implemented in a certain way, but a source of extra complexity when it isn’t.

      ARM recently added a bunch of explicit atomic instructions to their ARMv8.1A architecture, including the aforementioned fetch-and-{various operations}, swap, and compare-exchange. What seemed like a substantial amount of extra hardware in the early 90s is less of an issue in the age of CPU cores with 100+ floating-point and SIMD instructions and dozens of kilobytes of caches even at the low end, and knowing that a core is trying to perform say an atomic add enables hardware implementations that try to do so locally first but let a shared cache level perform the operation if there’s any contention. This takes less time and less power than having two cores race each other in a spin loop, and these days, HW designers are much more willing to throw extra silicon and design effort at a problem if it increases scalability and lowers power consumption.

      1. Yuhong Bao says:

        The fun thing is that I found out that LL/SC for SList in Windows still has to be double-wide because of the depth. Of course, ARM64 for example uses 48-bit addresses, leaving 16 bits for the depth.

  2. Lance says:

    While more complicated than the x86, if you compare this with the PPC and ARM weak memory models, it is neither unusual nor strange.

    1. Voo says:

      Oh no, ARM and PPC are positively sane when compared to the Alpha's memory system.

      The fact that loads or stores that depend on previous reads are not ordered is pretty much an Alpha-only thing. Meaning that in the case of
      Load x; Load x.field
      making sure that you read the right x is not enough – you need an additional barrier before loading the field as well.

      1. Raymond Chen says:

        You don’t need a barrier if the same processor is performing both the write and the read. And certainly the read from x.field will use the value of x read by the previous instruction (there are no load delay slots). The weakness comes into play when sharing memory between processors. For example, if processor A writes x.field and then x, processor B can read x and then x.field, but get a stale value for x.field due to the lack of a barrier.

      2. Lance says:

        For ARM, I was thinking of the memory ordering with multiple CPUs. You need to add memory barriers for the same reasons.
        For PPC, I was thinking of the optimistic locking sequence, which is the same. And ARM did require an optimistic sequence for atomic operations at one time, but I suspect that is no longer required.

  3. Nico says:

    Clearly the safest option is to halt all but one processor during any memory writes :)

  4. DWalker07 says:

    Raymond: There’s that time machine you always wanted! The Alpha AXP processor implements it!

  5. DWalker07 says:

    For atomic increment, I am familiar with how it’s done in IBM 370/390 assembler… and it’s fairly similar to what’s shown here.

    BUT, I always wondered why a CPU can’t just implement an atomic increment… in hardware. Why does the programmer need to write code to test the result to see if the increment happened?

    I realize that the memory address is in memory; that is, it’s inside the memory chip and not inside the CPU. Still, the requirement for an atomic increment is pretty common, so it seems that the CPU and the northbridge chipset and the microcode and the memory support chips could cooperate and implement an atomic increment without having the programmers go to all this trouble.

    The same ought to be true for decrement, or decrement and test (for zero).

    1. DWalker07 says:

      But then again, I’m not a chipset designer or a CPU designer. :-)

    2. Ben Voigt says:

      As Fabian pointed out, the reason these are left out isn’t because they are hard, but because they are inflexible (you can’t use them to build other *efficient* atomic operations) and use up a scarce resource: instruction codes.

      1. DWalker07 says:

        Decrement and test for zero, implemented in hardware, would be great for building locks and related things. (Although lock-free code is cool too.)


