Date: | April 15, 2011 / year-entry #93 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20110415-00/?p=10923 |
Comments: | 16 |
Summary: | The last lock-free pattern for this week isn't actually lock-free, but it does run without blocking. The pattern for what I'll call try/commit/(hand off) is more complicated than the other patterns, so I'll start off by describing it in words rather than in code, because the code tends to make things more complicated. First, you... |
The last lock-free pattern for this week isn't actually lock-free, but it does run without blocking. The pattern for what I'll call try/commit/(hand off) is more complicated than the other patterns, so I'll start off by describing it in words rather than in code, because the code tends to make things more complicated.
First, you take the state variable and chop it up into pieces. You need some bits to be used as a lock and as a "work has been handed off" flag. And if the work that has been handed off is complicated, you may need some more bits to remember the details of the hand-off. A common way of doing this is to use a pointer-sized state variable, require that the objects being pointed to are suitably aligned, and reuse the bottom bits as flags. For example, if you require that the objects be DWORD-aligned, then the bottom two bits of the pointer will always be zero, and you can reuse them as your flags.

To perform an operation, you first try to lock the state variable. If you can't because the state variable is already locked, then you record the details of the operation in the state variable and update it atomically. If you succeed in locking the state variable, then you perform the desired operation, but before you unlock the state variable, you look to see if any work has been handed off. (This hand-off work is the result of attempts to perform the operation while you held the lock.) If there is hand-off work, then you perform that work as well. Of course, while you're doing that, more hand-off work may arrive. You can't unlock the state variable until you've drained off all the pent-up hand-off work.
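The pattern described above can be sketched in portable C++ with std::atomic. This is a hypothetical illustration, not code from the article: the bottom bit of the state word is the lock, and the bits above it count operations handed off while the lock was held. All names here (state, counter, Increment, UnlockAndDrain) are invented for the sketch.

```cpp
#include <atomic>

// Hypothetical sketch of try/commit/(hand off): an "increment" operation.
// The bottom bit of 'state' is the lock; the bits above it count
// increments handed off while the lock was held.
constexpr unsigned kLocked = 1;

std::atomic<unsigned> state{0}; // lock bit + handed-off increment count
int counter = 0;                // protected data; touched only by the lock owner

// Release the lock, first draining any increments handed off to us.
void UnlockAndDrain() {
    for (;;) {
        unsigned expected = kLocked;
        if (state.compare_exchange_strong(expected, 0))
            return;             // nothing pending: unlocked cleanly
        // 'expected' now holds lock bit + pending count.  Claim the
        // pending work by resetting the state to "locked, nothing pending".
        unsigned pending = expected >> 1;
        if (state.compare_exchange_strong(expected, kLocked))
            counter += pending; // perform the pent-up increments
    }
}

void Increment() {
    unsigned s = state.load();
    for (;;) {
        if (s & kLocked) {
            // Locked: hand the increment off (add one unit above the
            // lock bit) and let the lock owner do the work for us.
            if (state.compare_exchange_weak(s, s + 2))
                return;
        } else if (state.compare_exchange_weak(s, s | kLocked)) {
            break;              // we took the lock
        }
    }
    ++counter;                  // commit our own operation
    UnlockAndDrain();
}
```

Note how the caller never blocks: if the lock is taken, the operation is recorded in the state word and the lock owner performs it before unlocking.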
The code for this pattern tends to be a tangle of loops, since there is a lot of backing off and retrying. Every atomic operation is its own loop, draining the hand-off work is another loop, and any time an InterlockedCompareExchange fails, you have to back off and try again. I trust only about five people in the world to write code that is this advanced, and I'm not one of them.
But just to illustrate the principle (although I will certainly get the details wrong), here's an implementation of a synchronization-like object which I will call a group wait. The group wait object is just a linked list of NODEs, each of which holds an event handle to be set when the group is signaled.

Actually, this type of object doesn't need to use the try/commit/hand off model. It can be implemented in a much more straightforward manner by having SignalAll detach the entire list with a single atomic exchange and then signal the detached nodes at its leisure. But I'm going to do it the complicated way for demonstration purposes.

Since the bottom two bits of the pointer must be zero due to alignment, we repurpose them as a lock bit and a signal bit. The lock bit is set when the list is locked, and the signal bit is set when a signal was requested but had to be handed off because the list was locked.

// WARNING! IF YOU USE THIS CODE YOU ARE AN IDIOT - READ THE TEXT ABOVE
struct NODE;
NODE *Node(LONG_PTR key) { return reinterpret_cast<NODE*>(key); }

enum {
    Locked = 1,
    Signaled = 2,
};

struct NODE {
    NODE *pnNext;
    HANDLE hEvent;

    LONG_PTR Key() { return reinterpret_cast<LONG_PTR>(this); }
    NODE *Ptr() { return Node(Key() & ~(Locked | Signaled)); }
};

#define NODE_INVALID Node(-1)

class GroupWait {
public:
    GroupWait() : m_pnRoot(NULL) { }
    ~GroupWait();
    BOOL AddWait(HANDLE hEvent);
    void SignalAll();
private:
    NODE *m_pnRoot;
};
Since I will be viewing the NODE* both as a pointer and as a bundle of bits (which the code calls a key), the helpers convert between the two views: Key() exposes a pointer's raw bits, Node() turns those bits back into a pointer, and Ptr() masks off the flag bits to recover a pointer that can safely be dereferenced.
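The pointer-tagging idea behind those helpers can be shown in a portable, standalone form. This is a hypothetical sketch with invented names (TagNode, LockBit, SignalBit), not the article's code: alignment guarantees the two low bits of a pointer are zero, so they can be borrowed as flags and must be masked off before the pointer is used.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the Key()/Ptr()/Node() idea: borrow the two
// low bits of an aligned pointer as flags.
constexpr std::uintptr_t LockBit   = 1;
constexpr std::uintptr_t SignalBit = 2;

struct alignas(4) TagNode { TagNode *next; };

// Raw bits of a pointer.
std::uintptr_t Key(TagNode *p) { return reinterpret_cast<std::uintptr_t>(p); }
// Bits back to a pointer (flags may still be present).
TagNode *AsNode(std::uintptr_t k) { return reinterpret_cast<TagNode *>(k); }
// Strip the flags so the pointer can be dereferenced.
TagNode *Ptr(std::uintptr_t k) { return AsNode(k & ~(LockBit | SignalBit)); }
```

The alignas(4) guarantee is what makes the trick legal: without it, a genuine pointer could collide with a flag bit.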
For notational purposes, a NODE* is drawn as a box holding the pointer value followed by its two flag bits. For example, this diagram

m_pnRoot
   +--------+-+-+
   |   *    |0|1|
   +---|----+-+-+
       |
       v
   +--------+-+-+---------+
 A |   *    |1|?| hEvent1 |
   +---|----+-+-+---------+
       |
       v
   +--------+-+-+---------+
 B |   *    |?|?| hEvent2 |
   +---|----+-+-+---------+
       |
       v
   +--------+-+-+---------+
 C |  NULL  |?|?| hEvent3 |
   +--------+-+-+---------+

represents a group wait with three registered event handles.
The flag bits matter only in m_pnRoot; in the pnNext fields of the nodes themselves, their values are shown as ? because Ptr() masks them off before the pointers are dereferenced.

Okay, let's start by adding a handle to the wait list:

BOOL GroupWait::AddWait(HANDLE hEvent)
{
    NODE *pnInsert = new(nothrow) NODE;
    if (pnInsert == NULL) return FALSE;

    pnInsert->hEvent = hEvent;
    NODE *pn;
    NODE *pnNew;
    do {
        pn = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);
        pnInsert->pnNext = pn;
        pnNew = Node(pnInsert->Key() | (pn->Key() & Locked));
    } while (InterlockedCompareExchangeRelease(&m_pnRoot, pnNew, pn) != pn);
    return TRUE;
}
To add a handle to the wait list, we just prepend it to the linked list, being careful to propagate the lock bit into the new head pointer so that a locked list stays locked.

Exercise: Is there an ABA race condition here?
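The flag-preserving prepend can be sketched with std::atomic. This is a hypothetical rewrite with invented names (WaitNode, Prepend, root), not the article's code: the new node stores the raw old head, flag bits and all (to be masked off later, just as Ptr() does), and the compare-exchange copies the current lock bit into the new head.

```cpp
#include <atomic>
#include <cstdint>

constexpr std::uintptr_t LockBit = 1;

struct alignas(4) WaitNode { WaitNode *next; };

std::atomic<std::uintptr_t> root{0};   // head pointer plus flag bits

// Prepend n, preserving whatever lock bit is currently in the root.
void Prepend(WaitNode *n) {
    std::uintptr_t old = root.load();
    std::uintptr_t want;
    do {
        // Store the raw old head, flags and all; readers mask later.
        n->next = reinterpret_cast<WaitNode *>(old);
        // New head = our node, plus the current lock bit.
        want = reinterpret_cast<std::uintptr_t>(n) | (old & LockBit);
        // On failure, compare_exchange_weak reloads 'old' and we rebuild.
    } while (!root.compare_exchange_weak(old, want));
}
```

Because the loop rebuilds both n->next and the flag bits from the freshly reloaded head, a concurrent locker or another prepender never has its state clobbered.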
The nasty part of the code is in SignalAll:

void GroupWait::SignalAll()
{
    NODE *pnCapture;
    NODE *pnNew;
    do {
        pnCapture = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);
        if (pnCapture->Key() & Locked) {
            pnNew = Node(pnCapture->Key() | Signaled);
        } else {
            pnNew = Node(Locked);
        }
    } while (InterlockedCompareExchangeAcquire(&m_pnRoot,
                                  pnNew, pnCapture) != pnCapture);

    if (pnCapture->Key() & Locked) return;
    ...
If the list is locked, then all we do is try to set the Signaled bit; otherwise, we try to lock and detach the list in a single atomic operation. Either way, we retry until the compare-exchange succeeds.

If the list was locked, then all we had to do was set the Signaled bit, and we can return immediately: the signal has been handed off to the thread that owns the lock, which will do the signaling on our behalf.

Exercise: What if the Signaled bit was already set? Did we just lose a signal?
Otherwise, we are the ones to lock the list. We also detach the node list (the new root is NULL plus the lock bit), for if another thread calls AddWait while we are busy signaling, we don't want that new wait to be signaled by this call to SignalAll:
    ...
    NODE *pnNext;
    NODE *pn;
    for (pn = pnCapture->Ptr(); pn; pn = pnNext) {
        SetEvent(pn->hEvent);
        pnNext = pn->pnNext->Ptr();
        delete pn;
    }
    ...
That little fragment above is basically what you would do in a naïve implementation that didn't worry about multithreading: it walks the list of nodes, signals each event, and then frees the node. The only trick is sending each node pointer through Ptr() to strip off any flag bits that may be riding along in the stored pointers.

Next comes the unlock code. First, a preparatory step:

    ...
    pnCapture = pnNew;
    ...
We exchanged pnNew into m_pnRoot when we locked the list, so pnCapture now records the value we expect m_pnRoot to hold, assuming no hand-off work has arrived in the meantime.

    ...
    for (;;) {
        pnNew = Node(pnCapture->Key() & ~Locked);
        if (InterlockedCompareExchangeRelease(&m_pnRoot,
                              pnNew, pnCapture) == pnCapture) {
            return;
        }
        ...

We start a new loop whose job is to drain off all the handed-off work items that built up while the list was locked. First, we see whether anything has changed since the last time we looked; if not, then we unlock and we're done. Otherwise, we proceed to pick up all the handed-off work:

        ...
        pnCapture = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);

        NODE *pnNew = Node(pnCapture->Key() & ~(Locked | Signaled));
        NODE **ppn = &pnNew;
        NODE *pn;
        NODE *pnNext;
        BOOL fSignalSeen = FALSE;
        for (pn = pnNew; pn->Ptr(); pn = pnNext) {
            pnNext = pn->Ptr()->pnNext;
            if (fSignalSeen) {
                SetEvent(pn->Ptr()->hEvent);
                delete pn->Ptr();
            } else if (pn->Key() & Signaled) {
                fSignalSeen = TRUE;
                (*ppn) = Node(Locked); // detach but retain lock
                SetEvent(pn->Ptr()->hEvent);
                delete pn->Ptr();
            } else {
                ppn = &pn->Ptr()->pnNext;
            }
        }
    } // retry unlock
} // end of function
To drain the handed-off work, we walk the list of nodes, keeping track of whether we've seen a Signaled bit. Nodes that come before the Signaled bit were added after the signal arrived, so we leave them in the list; once we see the Signaled bit, we detach the remainder of the list (retaining the lock) and signal and free every node from that point onward.

Once that's done, we go back and try to unlock again. Eventually, there will be no more hand-off work, and we can finally return.

And that's it, a demonstration of the try/commit/hand off model. The basic idea is simple, but getting all the details right is what makes your head hurt. I leave this sort of thing to the kernel folks, who have the time and patience and brainpower to work it all through. An example of this pattern can be found, for example, in this talk that describes the dismantling of the dispatcher spinlock.
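To make the drain walk concrete, here is a hypothetical single-threaded model of it in portable C++ (DNode, Mask, and Drain are invented names; the real code interleaves this walk with the interlocked retry loop, and "signaling" here just collects ids instead of calling SetEvent).

```cpp
#include <cstdint>
#include <vector>

// The Signaled mark rides inside a stored pointer; nodes at or after the
// mark were in the list when the signal arrived, so they are signaled
// and freed, while nodes added afterward are kept.
constexpr std::uintptr_t kSignaled = 2;

struct alignas(4) DNode { DNode *next; int id; };

// Strip any flag bits so the pointer can be dereferenced.
DNode *Mask(DNode *p) {
    return reinterpret_cast<DNode *>(
        reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t{3});
}

// Signal-and-free every node at or after the first Signaled mark,
// recording its id; return the head of the surviving (newer) nodes.
DNode *Drain(DNode *head, std::vector<int> &signaled) {
    DNode **ppn = &head;     // slot holding the pointer to the current node
    bool fSignalSeen = false;
    for (DNode *pn = head; Mask(pn); ) {
        DNode *next = Mask(pn)->next;
        if (fSignalSeen ||
            (reinterpret_cast<std::uintptr_t>(pn) & kSignaled)) {
            fSignalSeen = true;
            *ppn = nullptr;  // detach the marked node and everything older
            signaled.push_back(Mask(pn)->id);   // stand-in for SetEvent
            delete Mask(pn);
        } else {
            ppn = &Mask(pn)->next;              // keep this newer node
        }
        pn = next;
    }
    return head;
}
```

The key observation is the same as in the article: because AddWait prepends, everything downstream of the Signaled mark predates the signal, and everything upstream of it must survive the drain.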
Comments (16)
Comments are closed. |
Ex 1: There's no ABA because even if the previous head of the list is replaced with a different head at the same address, the pointer in the new head is still correct.
Ex 2: A second call to SignalAll should only signal events added after the first call to SignalAll. If the S bit is already set then no extra events have been added, so there's nothing extra to signal, so no signals are lost.
Maybe I can't count, and it's probably not crucial to the discussion… but aren't the bottom three bits (not two bits) zero for DWORD-aligned addresses?
Yeah, whatever. I'm too dumb for all of this so I think I'll just stick with critical sections and the occasional mutex.
I still find these articles fascinating, even if they are over my head. They at least give me some insight into how different things work. The points discussed in the articles or comments may pop up in my head if I work on something similar in the future, preventing errors in the program I'm working on.
This makes me wonder if there is a way to specify the requirements for a concurrent algorithm in some efficient format and have the computer synthesize an efficient implementation.
Thanks for this series, Raymond. Complicated topics like these are what make your blog compelling.
@tobi: I believe that is subject to the Turing incompleteness theorem.
I suppose if I were encountering this particular case I'd see if I couldn't use a lock-free queue instead.
Lock-free queues exist. The implementation is left as an exercise to the reader.
@Joshu: I have seen in microsoft research a project that was able to synthesize sequential programs when given constraints and a control flow template. Not usable in practical way but impressive. Also, you cannot argue that static analysis tools cannot exist because of Turings theorem. They do exist, just not complete.
I just can't wait to see a DailyWTF article that involves a file that starts with this comment:
// WARNING! IF YOU USE THIS CODE YOU ARE AN IDIOT – READ THE TEXT ABOVE
And of course: no text above ;) Well… at least they would know they are idiots XD
I'm trying to go over the code (yes today's is much more intensive, btw thanks for this great series).
I could not find the documentation for InterlockedReadAcquire on MSDN. Did I miss something?
Also the following comment from SList documentation makes a programmer much more humble:
"SLists are straightforward to implement and use in 32-bit code. However, it is challenging to implement them in 64-bit code because the amount of data exchangeable by the native interlocked exchange primitives is not double the address size, as it is in 32-bit code. Therefore, SLists enable porting high-end scalable algorithms to Windows."
"The articles in the series build on each other. If you jump into the middle you're going to be confused."
A bit of searching helped. I had actually read that article when it was published, but probably have forgotten about it.
If there's a call to AddWait while the list is locked – won't this cause a dangling pointer? Am I missing something other than the all-caps comment at the beginning?
That's some seriously cool code. This whole series has been fascinating, thanks!
@David Walker – try counting again ;-) 2 bits have a range of 0-3 and a DWORD has 4 bytes…
Apart from that, very interesting series. Though I, too, will simply stick to critical sections and the like as I prefer "is slower" over "occasionally doesn't work". But then I usually don't do high performance servers software or OS kernels anyway, where I can see the appeal for it.
We implemented ProcRun similarly in OS/2 1.1. We used a two-byte variable. The low byte was the "busy" byte and the high byte was the "hand-off" byte. The "hand-off" thread would perform the following sequence:
pushf
cli
mov al, 0FFh
xchg lockword.lobyte, al
or al, al
jz IAmTheWorkerNow
mov lockword.hibyte, al
popf
ret
IAmTheWorkerNow:
sti
Meanwhile, when the "worker" thread was done, it would do the following:
shr lockword, 8
jnz IHaveMoreWorkToDo
ret
The key was to let the bulk of ProcRun run with interrupts enabled. We ran on uniprocessor systems, and kernel code could be interrupted, but not preempted. Reenabling interrupts was a very expensive operation on some buggy chip steppings.
Wow, Raymond. Could you have picked a more contentious issue to blog about?
;)