Lock-free algorithms: The try/commit/(hand off) model

Date:April 15, 2011 / year-entry #93
Tags:code
Orig Link:https://blogs.msdn.microsoft.com/oldnewthing/20110415-00/?p=10923
Comments:    16
Summary:The last lock-free pattern for this week isn't actually lock-free, but it does run without blocking. The pattern for what I'll call try/commit/(hand off) is more complicated than the other patterns, so I'll start off by describing it in words rather than in code, because the code tends to make things more complicated. First, you...

The last lock-free pattern for this week isn't actually lock-free, but it does run without blocking.

The pattern for what I'll call try/commit/(hand off) is more complicated than the other patterns, so I'll start off by describing it in words rather than in code, because the code tends to make things more complicated.

First, you take the state variable and chop it up into pieces. You need some bits to be used as a lock and as a work has been handed off flag. And if the work that has been handed off is complicated, you may need some more bits to remember the details of the handoff. A common way of doing this is to use a pointer-sized state variable, require that the objects being pointed to are suitably aligned, and reusing the bottom bits as flags. For example, if you require that the objects be DWORD-aligned, then the two bottom bits will always be zero and you can reuse them as flags.

To perform an operation, you first try to lock the state variable. If you can't because the state variable is already locked, then you record the details of the operation in the state variable and update it atomically.

If you succeed in locking the state variable, then you perform the desired operation, but before you unlock the state variable, you look to see if any work has been handed off. (This hand-off work is the result of attempts to perform the operation while you held the lock.) If there is hand-off work, then you perform that work as well. Of course, while you're doing that, more hand-off work may arrive. You can't unlock the state variable until you've drained off all the pent-up hand-off work.

The code for this pattern tends to be a tangle of loops since there is a lot off backing off and retrying. Every atomic operation is its own loop, draining the hand-off work is another loop, and any time an Interlocked­Compare­Exchange fails, you have to undo the work you did and retry—another loop.

I trust only about five people in the world to write code that is this advanced, and I'm not one of them. But just to illustrate the principle (although I will certainly get the details wrong), here's an implementation of a synchronization-like object which I will call a Group­Wait for lack of any other name. It has the following operations:

  • Add­Wait: Register an event handle with the group wait.
  • Signal­All: Signals all events that are registered with the group wait. Once an event is signalled, it is automatically unregistered from the group wait. If you want the event to be signalled at the next call to Signal­All you have to re-add it.

The group wait object is just a linked list of NODEs containing the handles being waited on.

Actually, this type of object doesn't need to use the try/commit/hand off model. It can be implemented in a much more straightforward manner by having Add­Wait atomically prepend the node to a list and having Signal­All atomically steal the list. There are even prewritten functions to perform these atomic linked list operations for you. But I'm going to implemented it the complicated way for demonstration purposes. In real life, the code would be much simpler.

Since the bottom two bits of the pointer must be zero due to alignment, we repurpose them as a lock bit and a signal bit. The lock bit is set when the list is locked, and the signal bit is set when a signal was requested but had to be handed off because the list was locked.

// WARNING! IF YOU USE THIS CODE YOU ARE AN IDIOT - READ THE TEXT ABOVE

struct NODE;
NODE *Node(LONG_PTR key) { return reinterpret_cast<NODE*>(key); }

enum {
 Locked = 1,
 Signalled = 2,
};

struct NODE {
 NODE *pnNext;
 HANDLE hEvent;

 LONG_PTR Key() { return reinterpret_cast<LONG_PTR>(this); }
 NODE *Ptr() { return Node(Key() & ~(Locked | Signalled)); }
};

#define NODE_INVALID Node(-1)

class GroupWait {
public:
 GroupWait() : m_pnRoot(NULL) { }
 ~GroupWait();
 BOOL AddWait(HANDLE hEvent);
 void SignalAll();
private:
 NODE *m_pnRoot;
};

Since I will be viewing the NODE* as both a pointer and as a bunch of bits (which I call a key), I created some helper methods to save typing. Node and Key convert back and forth between node pointers and keys, and Ptr strips off the tag bits and returns a usable pointer.

For notational purposes, a NODE* will be written as the combination p|S|L where p is a pointer to the next node, S is the signalled bit, and L is the lock bit. The signalled bit is set to indicate that we need to signal all the nodes in the list starting with the next node. (Think of the S bit as being attached to the outgoing arrow.) For example, this linked list:

   m_pnRoot
  +--------+-+-+
  |   *    |0|1|
  +---|----+-+-+
      |
      v
  +--------+-+-+---------+
A |   *    |1|?| hEvent1 |
  +---|----+-+-+---------+
      |
      v
  +--------+-+-+---------+
B |   *    |?|?| hEvent2 |
  +---|----+-+-+---------+
      |
      v
  +--------+-+-+---------+
C |  NULL  |?|?| hEvent3 |
  +--------+-+-+---------+

represents a group wait with three registered event handles. The S bit is clear on the root pointer, which means that nobody has yet requested that hEvent1 be signalled. On the other hand, the S bit is set on node A, which means that all the events after node A need to be signaled, specifically, hEvent2 and hEvent3. Note that this means that it doesn't matter whether the S bit is set on nodes B or C; those events are getting set regardless because the S bit on node A already requested it. (In particular, the S bit on the last node is meaningless since there are no nodes which come after it.)

The L bit is meaningless on all pointers other than m_pnRoot.

Okay, let's start be adding a handle to the wait list:

BOOL GroupWait::AddWait(HANDLE hEvent)
{
 NODE *pnInsert = new(nothrow) NODE;
 if (pnInsert == NULL) return FALSE;
 pnInsert->hEvent = hEvent;

 NODE *pn;
 NODE *pnNew;
 do {
  pn = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);
  pnInsert->pnNext = pn;
  pnNew = Node(pnInsert->Key() | (pn->Key() & Locked));
 } while (InterlockedCompareExchangeRelease(&m_pnRoot, pnNew, pn) != pn);
 return TRUE;
}

To add a handle to the wait list, we just prepend it to the linked list, being careful to propagate the L bit into the new pointer so we don't accidentally release a lock that somebody else took. We add the node with the S bit clear on the inbound pointer since nobody has yet asked for this handle to be signalled. After setting up the node, we attempt to insert it into the head of the list, and if we can't (because somebody else beat us to it), then we restart and try again. This is a standard try/commit/try again pattern.

Exercise: Is there an ABA race condition here?

The Add­Wait method illustrates one extreme case of the try/commit/hand off model, where there is really nothing to hand off; we did it all ourselves. Of course, this does make other parts of the code trickier since they have to go back and deal with nodes that were added while the list was locked.

The nasty part of the code is in Signal­All. I'll present it in pieces.

void GroupWait::SignalAll()
{
 NODE *pnCapture;
 NODE *pnNew;
 do {
  pnCapture = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);

  if (pnCapture->Key() & Locked) {
   pnNew = Node(pnCapture->Key() | Signaled);
  } else {
   pnNew = Node(Locked);
  }
 } while (InterlockedCompareExchangeAcquire(&m_pnRoot,
                              pnNew, pnCapture) != pnCapture);

 if (pnCapture->Key() & Locked) return;

 ...

If the list is locked, then all we do is try to set the S bit on the root. If the list is not locked, then we try to lock it and simultaneously detach all the nodes by replacing the root pointer with NULL|0|1. Either way, we perform the operation with the try/commit/try again pattern until we finally get through.

If the list was locked, then all we had to do was set the S bit on the root pointer. Setting the S bit on the root pointer means that all the nodes reachable from this pointer (i.e., all nodes after the root, which is all nodes) should be signalled, which is exactly what we want. Since the list is locked, we leave the actual signalling to the code that unlocks the list. (This is the hand off part of try/commit/hand off.)

Exercise: What if the S bit is already set? Did we lose a signal?

Otherwise, we are the ones to lock the list. We also detach the node list, for if another thread calls Signal­All, we don't want that signal to affect the nodes that we're signalling. (Otherwise we might end up double-signalling the event.)

 ...
 NODE *pnNext;
 NODE *pn;
 for (pn = pnCapture->Ptr(); pn; pn = pnNext) {
  SetEvent(pn->hEvent);
  pnNext = pn->pnNext->Ptr();
  delete pn;
 }
 ...

That little fragment above is basically what you would do in a naïve implementation that didn't worry about multithreading: It walks the list of nodes, signals each event, and then frees the node. The only trick is sending each node pointer through ->Ptr() to strip off the tag bits.

Next comes the unlock code. First, a preparatory step:

 ...
 pnCapture = pnNew;
 ...

We exchanged pnNew into m_pnRoot up above, and if that's still the value of m_pnRoot, then it means that nobody tried to perform any operations while the list was locked, and we got off easy.

 ...
 for (;;) {
  pnNew = Node(pnCapture->Key() & ~Locked);
  if (InterlockedCompareExchangeRelease(&m_pnRoot,
                      pnNew, pnCapture) == pnCapture) {
   return;
  }
 ...

We start a new loop whose job is to drain off all the handed-off work items that built up while the list was locked. First, we see whether anything has changed since the last time we looked; if not, then we unlock and we're done. Otherwise, we proceed to pick up all the handed-off work:

 ...
  pnCapture = InterlockedReadAcquire(&m_pnRoot, NODE_INVALID);

  NODE *pnNew = Node(pnCapture->Key() & ~(Locked | Signaled));
  NODE **ppn = &pnNew;
  NODE *pn;
  NODE *pnNext;

  BOOL fSignalSeen = FALSE;
  for (pn = pnNew; pn->Ptr(); pn = pnNext) {
   pnNext = pn->Ptr()->pnNext;
   if (fSignalSeen) {
    SetEvent(pn->Ptr()->hEvent);
    delete pn->Ptr();
   } else if (pn->Key() & Signaled) {
    fSignalSeen = TRUE;
    (*ppn) = Node(Locked); // detach but retain lock
    SetEvent(pn->Ptr()->hEvent);
    delete pn->Ptr();
   } else {
    ppn = &pn->Ptr()->pnNext;
   }
  }
 } // retry unlock
} // end of function

To drain the handed-off work, we walk the list of nodes, keeping track of whether we've seen an S bit. If so, then we signal the event and free the node. And the first time we see an S bit, we null out the inbound pointer to detach the list from the chain so we do not double-signal the event in the future.

Once that's done, we go back and try to unlock again. Eventually, there will be no more hand-off work, and we can finally return.

And that's it, a demonstration of the try/commit/hand off model. The basic idea is simple, but getting all the details right is what makes your head hurt.

I leave this sort of thing to the kernel folks, who have the time and patience and brainpower to work it all through. An example of this pattern can be found, for example, in this talk that describes the dismantling of the dispatcher spinlock.


Comments (16)
  1. laonianren says:

    Ex 1: There's no ABA because even if the previous head of the list is replaced with a different head at the same address, the pointer in the new head is still correct.

    Ex 2: A second call to SignalAll should only signal events added after the first call to SignalAll.  If the S bit is already set then no extra events have been added, so there's nothing extra to signal, so no signals are lost.

  2. David Walker says:

    Maybe I can't count, and it's probably not crucial to the discussion… but aren't the bottom three bits (not two bits) zero for DWORD-aligned addresses?

  3. John says:

    Yeah, whatever.  I'm too dumb for all of this so I think I'll just stick with critical sections and the occasional mutex.

  4. John D'Oriocourt says:

    I still find these articles fascinating, even if they are over my head. They at least give me some insight into how different things work. The points discussed in the articles or comments may pop up in my head if I work on something similar in the future, preventing errors in the program I'm working on.

  5. tobi says:

    This makes me wonder if there is a way to specify the requirements for a concurrent algorithm in some efficient format and have the computer synthesize an efficient implementation.

  6. Jeff says:

    Thanks for this series, Raymond. Complicated topics like these are what make your blog compelling.

  7. Joshua says:

    @tobi: I believe that is subject to the Turing incompleteness theorem.

    I suppose if I were encountering this particular case I'd see if I couldn't use a lock-free queue instead.

    Lock free queue's exist. The implementation is left as an exercise to the reader.

  8. tobi says:

    @Joshu: I have seen in microsoft research a project that was able to synthesize sequential programs when given constraints and a control flow template. Not usable in practical way but impressive. Also, you cannot argue that static analysis tools cannot exist because of Turings theorem. They do exist, just not complete.

  9. pplu says:

    I just can't wait to see a DailyWTF article that involves a file that starts with this comment:

    // WARNING! IF YOU USE THIS CODE YOU ARE AN IDIOT – READ THE TEXT ABOVE

    And of course: no text above ;) Well… at least they would know they are idiots XD

  10. sukru says:

    I'm trying to go over the code (yes today's is much more intensive, btw thanks for this great series).

    I could not find the documentation for InterlockedReadAcquire on MSDN. Did I miss something?

    Also the following comment from SList documentation makes a programmer much more humble:

     "SLists are straightforward to implement and use in 32-bit code. However, it is challenging to implement them in 64-bit code because the amount of data exchangeable by the native interlocked exchange primitives is not double the address size, as it is in 32-bit code. Therefore, SLists enable porting high-end scalable algorithms to Windows."

    [The articles in the series build on each other. If you jump into the middle you're going to be confused. -Raymond]
  11. sukru says:

    "The articles in the series build on each other. If you jump into the middle you're going to be confused."

    A bit of searching helped. I had actually read that article when it was published, but probably have forgotten about it.

  12. J says:

    If there's a call to AddWait while the list is locked – won't this cause a dangling pointer?   Am I missing something other than the all-caps comment at the beginning?

  13. Andrew says:

    That's some seriously cool code. This whole series has been fascinating, thanks!

  14. Marcel says:

    @David Walker – try counting again ;-) 2 bits have a range of 0-3 and a DWORD has 4 bytes…

    Apart from that, very interesting series. Though I, too, will simply stick to critical sections and the like as I prefer "is slower" over "occasionally doesn't work". But then I usually don't do high performance servers software or OS kernels anyway, where I can see the appeal for it.

  15. Pete says:

    We implemented ProcRun similarly in OS/2 1.1.  We used a two-byte variable.  The low byte was the "busy" byte and the high byte was the "hand-off" byte.  The "hand-off" thread would perform the following sequence:

       pushf

       cli

       mov     al, 0FFh

       xchg    lockword.lobyte, al

       or      al, al

       jz      IAmTheWorkerNow

       mov     lockword.hibyte, al

       popf

       ret

    IAmTheWorkerNow:

       sti

    Meanwhile, when the "worker" thread was done, it would do the following:

       shr     lockword, 8

       jnz     IHaveMoreWorkToDo

       ret

    The key was to let the bulk of ProcRun run with interrupts enabled.  We ran on uniprocessor systems, and kernel code could be interrupted, but not preempted.  Reenabling interrupts was a very expensive operation on some buggy chip steppings.

  16. Gabe says:

    Wow, Raymond. Could you have picked a more contentious issue to blog about?

    ;)

Comments are closed.


*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog (most recent link is here). It may have slight formatting modifications for consistency and to improve readability.

WHY DID I DUPLICATE THIS CONTENT HERE? Let me first say this site has never had anything to sell and has never shown ads of any kind. I have nothing monetarily to gain by duplicating content here. Because I had made my own local copy of this content throughout the years, for ease of using tools like grep, I decided to put it online after I discovered some of the original content previously and publicly available, had disappeared approximately early to mid 2019. At the same time, I present the content in an easily accessible theme-agnostic way.

The information provided by Raymond's blog is, for all practical purposes, more authoritative on Windows Development than Microsoft's own MSDN documentation and should be considered supplemental reading to that documentation. The wealth of missing details provided by this blog that Microsoft could not or did not document about Windows over the years is vital enough, many would agree an online "backup" of these details is a necessary endeavor. Specifics include:

<-- Back to Old New Thing Archive Index