Date: | May 6, 2004 / year-entry #175 |
Tags: | history |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20040506-00/?p=39463 |
Comments: | 16 |
Summary: | If you read the fine print of the InterlockedIncrement and InterlockedDecrement functions, you'll see that on Windows NT 3.51 and earlier and on Windows 95, the return value only matches the sign of the result of the increment or decrement. Why is that? The 80386 instruction set supports interlocked increment and decrement, but the result... |
If you read the fine print of the InterlockedIncrement and InterlockedDecrement functions, you'll see that on Windows NT 3.51 and earlier and on Windows 95, the return value only matches the sign of the result of the increment or decrement. Why is that? The 80386 instruction set supports interlocked increment and decrement, but the result of the increment/decrement operation is not returned. Only the flags are updated by the operation. As a result, the only information you get back from the CPU about the result of the operation is whether it was zero, positive, or negative. (Okay, you also get some obscure information like whether there were an even or odd number of 1 bits in the result, but that's hardly useful nowadays.) Since those operating systems supported the 80386 processor, their implementations of the InterlockedIncrement and InterlockedDecrement functions were limited by the capabilities of the processor. The 80486 introduced the XADD instruction which returns the original value of the operand. With this additional information, it now becomes possible to return the result of the operation exactly. Windows NT 4 dropped support for the 80386 processor, requiring a minimum of an 80486, so it could take advantage of this instruction. Windows 98 still had to support the 80386, so it couldn't. So how did Windows 98 manage to implement an operation that was not supported by the CPU? Windows 98 detected whether you had a CPU that supported the new XADD instruction. If not, then it used an alternate mechanism which was mind-bogglingly slow: It called a driver whenever you wanted to increment or decrement a variable. The driver would then emulate the XADD instruction by disabling interrupts and performing the operation in locked memory. Since Windows 98 was a uniprocessor operating system, it didn't have to worry about a second processor changing the memory at the same time; all it needed to ensure was that the single processor didn't get interrupted while it was performing the "atomic" operation. |
Comments (16)
Comments are closed. |
Raymond, XADD link broken !
Hm, works for me. It’s not important; you can go to any old XADD page.
Raymond,
The link is broken for me as well. For anyone who cares, here’s a link that does work:
http://courses.ece.uiuc.edu/ece291/archive/fall2001/books/labmanual/inst-ref-xadd.html
Eeeew. I really did not want to know that.
Sounds exactly like what was done back in the x86/x87 days, when you weren’t sure you had an FPU. I still have memories of tracing way down into compilers, only to find "#ifdef IS_FPU" blocks where one side was one line, the other was inline ASM.
Hmmm… I replaced most of Microsoft’s Z-80 FORTRAN library, to use an AM9511 math coprocessor. Even converting between Microsoft and IEEE floating point format for each operation it was faster overall.
Does this qualify for an eeeew?:)
(This wasn’t a Microsoft project – Microsoft didn’t even know about it as far as I know.)
Geesz, calling into kernel-mode and disabling interrupts just to manipulate a LONG seems a bit extreme.
Hmm… I was just about to ask why the legacy 80386 support couldn’t just protect all InterlockedXxx() functions with a single global critical section. Then I realized that, under NT at least, critical sections are implemented using the InterlockedXxx() functions.
GRRRRR…
Consider:
initial value: L = 9
Thread A: InterlockedIncrement(&L);
Thread B: L = 3
If A executes before B, then the result is L = 3; if B executes before A then the result is L = 4. Since both operations are atomic, these are the only possible results.
But what if you used a critical section instead of a kernel trap? Then this would be possible:
A: EnterCriticalSection
A: load eax = L (loads 9)
A: eax++ (eax = 10 now)
B: L = 3
A: L = eax (L = 10, overwriting 3 [not 9])
A: LeaveCriticalSection
A: return eax (returns 10)
L now has the impossible value 10. Atomicity has been violated.
It must be painful to run Windows 98 on a 386, but I did used to make do with a 386sx running Windows 95.
As Keith Moore suggests, for the Win9x case they could have implemented InterlockedXXXXXX using a simple xchg semaphore common to all InterlockedXXXXXX functions:
InterlockedIncrement(DWORD* addend)
mov ebx, [addend]
loop:
mov eax, 1
; lock prefix makes it also work in multiproc (yeah, not very useful in 9x)
; ilocksem is a variable which guarantees serialization to Interlocked funcs
lock xchg [ilocksem], eax
cmp eax, 0
je loop
; Now it’s safe to increment addend
mov eax, [ebx]
inc eax
mov [ebx], eax
; Free the semaphore
mov [ilocksem], 0
; result already in eax, so we are ok
}
That would be so much faster than a ring switch.
The problem with this approach, is this sentence of the MSDN:
"The threads of different processes can use this mechanism if the variable is in shared memory."
This would mean that [ilocksem] needs to be in memory area shared among all processes (in case two processes use InterlockedXXXXX over a shared "addend"), and also means that a process could manually set it to 1 and starve all processes trying to use InterlockedXXXXX functions, which is bad.
Other approaches would be possible, like only share [ilocksem] among processes which already share some memory; for the rest of the cases [ilocksem] would be a per process variable.
This would mean that a process can only hang (by setting [ilocksem] to 1) another process it shares memory with (which is not that bad).
Oh, yes, and as Raymond says, serializing InterlockedXXXX functions won’t cover the case where a process accesses addend directly.
Atomicity guarantees are described here.
http://msdn.microsoft.com/library/en-us/dllproc/base/interlocked_variable_access.asp
Interesting, Raymond. I was focused on just the interlocked case.
So I must ask: Does the system make *any* atomicity guarantees when mixing interlocked non-interlocked operations?
Interesting, Raymond. I was focused on just the interlocked case.
So I must ask: Does the system make *any* atomicity guarantees when mixing interlocked non-interlocked operations?
That page doesn’t seem to be fully accurate. It discusses the atomicity of 32-bit and 64-bit read/writes, but then states: "Reads and writes to variables of other sizes are not guaranteed to be atomic on any platform." On x86, reads/writes to 8-bit and properly-aligned 16-bit variables are also atomic, at least according to the Intel manual. Quote here: http://lists.freebsd.org/pipermail/freebsd-smp/2003-September/000334.html
A particular CPU may guarantee it but Win32 doesn’t.
I took "not guaranteed to be atomic on any platform" to mean "not guaranteed to be atomic on any CPU". I guess by "any platform" they really mean "32-bit or 64-bit Windows in general".