Using fibers to simplify enumerators, part 3: Having it both ways

Date: December 31, 2004 / year-entry #440
Tags: code
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20041231-00/?p=36833
Comments: 28
Summary:As we discovered in the previous two entries [second], the problem with enumeration is that somebody always loses. Now we will use fibers to fight back. Before you decide to use fibers in your programs, make sure to read the dire warnings at the end of this article. My goal here is to show one...

As we discovered in the previous two entries [second], the problem with enumeration is that somebody always loses.

Now we will use fibers to fight back. Before you decide to use fibers in your programs, make sure to read the dire warnings at the end of this article. My goal here is to show one use of fibers, not to say that fibers are the answer to all your problems. Fibers can create more problems than they solve. We'll come back to all the dire warnings later.

As with most clever ideas, it has a simple kernel: Use fibers to run the caller and the enumerator each on its own stack.

#include <windows.h>
#include <shlwapi.h>
#include <stdio.h>
#include <strsafe.h>

enum FEFOUND {
 FEF_FILE,          // found a file
 FEF_DIR,           // found a directory
 FEF_LEAVEDIR,      // leaving a directory
 FEF_DONE,          // finished
};

enum FERESULT {
 FER_CONTINUE,      // continue enumerating
                    // (if directory: recurse into it)
 FER_SKIP,          // skip directory (do not recurse)
};

class __declspec(novtable) FiberEnumerator {
public:
 FiberEnumerator();
 ~FiberEnumerator();

 FEFOUND Next();
 void SetResult(FERESULT fer) { m_fer = fer; }
 void Skip() { SetResult(FER_SKIP); }

 virtual LPCTSTR GetCurDir() = 0;
 virtual LPCTSTR GetCurPath() = 0;
 virtual const WIN32_FIND_DATA* GetCurFindData() = 0;

protected:
 virtual void FiberProc() = 0;

 static void DECLSPEC_NORETURN WINAPI
    s_FiberProc(void* pvContext);

 FERESULT Produce(FEFOUND fef);

protected:
 void* m_hfibCaller;
 void* m_hfibEnumerator;
 FEFOUND  m_fef;
 FERESULT m_fer;
};

FiberEnumerator::FiberEnumerator()
 : m_fer(FER_CONTINUE)
{
 m_hfibEnumerator = CreateFiber(0, s_FiberProc, this);
}

FiberEnumerator::~FiberEnumerator()
{
 DeleteFiber(m_hfibEnumerator);
}

void DECLSPEC_NORETURN WINAPI FiberEnumerator::
    s_FiberProc(void *pvContext)
{
 FiberEnumerator* self =
    reinterpret_cast<FiberEnumerator*>(pvContext);
 self->FiberProc();

 // Theoretically, we need only produce Done once,
 // but keep looping in case a consumer gets
 // confused and asks for the Next() item even
 // though we're Done.
 for (;;) self->Produce(FEF_DONE);
}

This helper class does the basic bookkeeping of fiber-based enumeration. At construction, it creates the fiber that will produce the enumeration; the fiber that consumes the enumeration is captured each time Next() is called. At destruction, it cleans up the fiber. The derived class is expected to implement the FiberProc method and call Produce() every so often.

The real magic happens in the (somewhat anticlimactic) Produce() and Next() methods:

FERESULT FiberEnumerator::Produce(FEFOUND fef)
{
 m_fef = fef; // for Next() to retrieve
 m_fer = FER_CONTINUE; // default
 SwitchToFiber(m_hfibCaller);
 return m_fer;
}

FEFOUND FiberEnumerator::Next()
{
 m_hfibCaller = GetCurrentFiber();
 SwitchToFiber(m_hfibEnumerator);
 return m_fef;
}

To Produce() something, we remember the production code, pre-set the enumeration result to its default of FER_CONTINUE, and switch to the consumer fiber. When the consumer fiber comes back with an answer, we return it from Produce().

To get the next item, we remember the identity of the calling fiber, then switch to the enumerator fiber. This runs the enumerator until it decides to Produce() something, at which point we take the production code and return it.

That's all there is to it. The m_fef and m_fer members are for passing the parameters and results back and forth across the fiber boundary.

Okay, with that groundwork out of the way, writing the producer itself is rather anticlimactic.

Since we want to make things easy for the consumer, we use the interface the consumer would have designed, with some assistance from the helper class.

class DirectoryTreeEnumerator : public FiberEnumerator {
public:
 DirectoryTreeEnumerator(LPCTSTR pszDir);
 ~DirectoryTreeEnumerator();

 LPCTSTR GetCurDir() { return m_pseCur->m_szDir; }
 LPCTSTR GetCurPath() { return m_szPath; }
 const WIN32_FIND_DATA* GetCurFindData()
    { return &m_pseCur->m_wfd; }

private:
 void FiberProc();
 void Enum();

 struct StackEntry {
   StackEntry* m_pseNext;
   HANDLE m_hfind;
   WIN32_FIND_DATA m_wfd;
   TCHAR m_szDir[MAX_PATH];
 };
 bool Push(StackEntry* pse);
 void Pop();

private:
 StackEntry *m_pseCur;
 TCHAR m_szPath[MAX_PATH];
};

DirectoryTreeEnumerator::
 DirectoryTreeEnumerator(LPCTSTR pszDir)
 : m_pseCur(NULL)
{
 StringCchCopy(m_szPath, MAX_PATH, pszDir);
}

DirectoryTreeEnumerator::~DirectoryTreeEnumerator()
{
 while (m_pseCur) {
   Pop();
 }
}

bool DirectoryTreeEnumerator::
      Push(StackEntry* pse)
{
 pse->m_pseNext = m_pseCur;
 pse->m_hfind = INVALID_HANDLE_VALUE; // so Pop() is harmless if we fail below
 m_pseCur = pse;
 return
  SUCCEEDED(StringCchCopy(pse->m_szDir,
                 MAX_PATH, m_szPath)) &&
  PathCombine(m_szPath, pse->m_szDir, TEXT("*.*")) &&
  (pse->m_hfind = FindFirstFile(m_szPath,
       &pse->m_wfd)) != INVALID_HANDLE_VALUE;
}

void DirectoryTreeEnumerator::Pop()
{
 StackEntry* pse = m_pseCur;
 if (pse->m_hfind != INVALID_HANDLE_VALUE) {
  FindClose(pse->m_hfind);
 }
 m_pseCur = pse->m_pseNext;
}

void DirectoryTreeEnumerator::FiberProc()
{
 Enum();
}

void DirectoryTreeEnumerator::Enum()
{
 StackEntry se;
 if (Push(&se)) {
  do {
   if (lstrcmp(se.m_wfd.cFileName, TEXT(".")) != 0 &&
       lstrcmp(se.m_wfd.cFileName, TEXT("..")) != 0 &&
       PathCombine(m_szPath, se.m_szDir, se.m_wfd.cFileName)) {
    FEFOUND fef = (se.m_wfd.dwFileAttributes &
                    FILE_ATTRIBUTE_DIRECTORY) ?
                    FEF_DIR : FEF_FILE;
    if (Produce(fef) == FER_CONTINUE && fef == FEF_DIR) {
     Enum(); // recurse into the subdirectory we just produced
    }
   }
  } while (FindNextFile(se.m_hfind, &se.m_wfd));
 }
 Produce(FEF_LEAVEDIR);
 Pop();
}

As you can see, this class is a mix of the two previous classes. Like the consumer-based class, information about the item being enumerated is obtained by calling methods on the enumerator object. But like the callback-based version, the loop that generates the objects themselves is a very simple recursive function, with a call to Produce in place of a callback.

In fact, it's even simpler than the callback-based version, since we don't have to worry about the FER_STOP code. If the consumer wants to stop enumeration, the consumer simply stops calling Next().

Most of the complexity in the class is just bookkeeping to permit abandoning the enumeration prematurely.
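For instance, here is a hypothetical consumer (my own illustration, not part of the series) that never recurses into subdirectories and abandons the enumeration as soon as it finds a file matching *.txt. When it breaks out of the loop, the enumerator fiber simply never runs again; the DirectoryTreeEnumerator destructor pops the outstanding stack entries and closes their find handles, and the FiberEnumerator destructor deletes the fiber.

void FindFirstTxt(FiberEnumerator* pfe)
{
 FEFOUND fef;
 while ((fef = pfe->Next()) != FEF_DONE) {
  if (fef == FEF_DIR) {
   pfe->Skip();    // FER_SKIP: do not recurse into any directory
  } else if (fef == FEF_FILE &&
             PathMatchSpec(pfe->GetCurPath(), TEXT("*.txt"))) {
   break;          // to stop enumerating, just stop calling Next()
  }
 }
}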

Okay, let's take this fiber out for a spin. You can use the same TestWalk function as last time, but for added generality, change the first parameter from DirectoryTreeEnumerator* to FiberEnumerator*. (The significance of this will become apparent next time.)
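The TestWalk function itself isn't reproduced in this entry, so here is a rough sketch of what such a consumer might look like with the generalized signature. Only the FiberEnumerator interface comes from this series; the output format is my own, and _tprintf assumes an extra #include <tchar.h>.

void TestWalk(FiberEnumerator* pfe)
{
 FEFOUND fef;
 while ((fef = pfe->Next()) != FEF_DONE) {
  switch (fef) {
  case FEF_FILE:
   _tprintf(TEXT("file: %s\n"), pfe->GetCurPath());
   break;
  case FEF_DIR:
   _tprintf(TEXT("dir:  %s\n"), pfe->GetCurPath());
   break;          // leave the default FER_CONTINUE so we recurse into it
  case FEF_LEAVEDIR:
   _tprintf(TEXT("done: %s\n"), pfe->GetCurDir());
   break;
  }
 }
}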

A little tweak needs to be made to the main function, though.

int __cdecl main(int argc, char **argv)
{
 ConvertThreadToFiber(NULL);
 DirectoryTreeEnumerator e(TEXT("."));
 TestWalk(&e);
 ConvertFiberToThread();
 return 0;
}

Since the enumerator is going to switch between fibers, we'd better convert the thread to a fiber so it'll have something to switch back to!

Here's a schematic of what happens when you run this fiber-based enumerator:

  Main fiber                            Enumerator fiber
  ----------                            ----------------
  ConvertThreadToFiber
  construct DirectoryTreeEnumerator
    CreateFiber                         (not running)
    initialize variables
  Next(CONTINUE)
    SwitchToFiber()                     starts running
                                        FindFirstFile etc
                                        Produce(FILE)
  Next(CONTINUE) "returns" FILE         SwitchToFiber()
  use the result
  Next(CONTINUE)
    SwitchToFiber()                     Produce(FILE) "returns" CONTINUE
                                        FindNextFile etc
                                        Produce(DIR)
  Next(CONTINUE) "returns" DIR          SwitchToFiber()
  use the result
  Next(CONTINUE)
    SwitchToFiber()                     Produce(DIR) "returns" CONTINUE
  ... and so on ... until ...
                                        Produce(DONE)
  Next(CONTINUE) "returns" DONE         SwitchToFiber()
  cleanup
  destruct DirectoryTreeEnumerator
    DeleteFiber
  ConvertFiberToThread

Observe that from each fiber's point of view, the other fiber is just a subroutine!

Coding subtlety: Why do we capture the caller's fiber each time the Next() method is called? Why not capture it when the FiberEnumerator is constructed?

Next time, we'll see how this fiber-based enumerator easily admits higher-order operations such as filtering and composition.

Dire warnings about fibers

Fibers are like dynamite. Mishandle them and your process explodes.

The first dire warning is that fibers are expensive in terms of address space, since each one gets its own stack (typically a megabyte).
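One mitigation, not discussed in the article: the reservation can be shrunk with CreateFiberEx, which lets you specify both the commit and the reserve sizes instead of taking the executable's defaults. This is only a sketch; it assumes CreateFiberEx is available on your target OS, and the 64KB figure is an arbitrary assumption that must cover your enumerator's deepest recursion.

 // Sketch: what the FiberEnumerator constructor might do to reserve only
 // 64KB of address space for the enumerator's stack instead of ~1MB.
 m_hfibEnumerator = CreateFiberEx(0,          // default commit size
                                  64 * 1024,  // stack reserve size (assumed)
                                  0,          // no flags
                                  s_FiberProc, this);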

And since each fiber has its own stack, it also has its own exception chain. This means that if a fiber throws an exception, only that fiber can catch it. (Same as threads.) That's a strong argument against using an STL std::stack object to maintain our state: STL is based on an exception-throwing model, but you can't catch exceptions raised by another fiber. (You also can't throw exceptions past a COM boundary, which severely limits how much you can use STL in a COM object.)
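To make that concrete, here is a sketch of the trap, using the DirectoryTreeEnumerator e from the main function above; the try/catch lives on the consumer fiber, so it can never see anything thrown on the enumerator fiber.

 // Sketch of the hazard: SwitchToFiber() is not a function call into the
 // enumerator, so this catch block never sees exceptions thrown inside
 // FiberProc(); those unwind the enumerator fiber's own stack instead.
 try {
  while (e.Next() != FEF_DONE) { /* use the result */ }
 } catch (...) {
  // not reached for exceptions raised on the enumerator fiber
 }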

One of the big problems with fibers is that everybody has to be in cahoots. You need to decide on one person who will call the ConvertThreadToFiber function since fiber/thread conversion is not reference-counted. If two people call ConvertThreadToFiber on the same thread, the first will convert it, and so will the second! This results in two fibers for the same thread, and things can only get worse from there.

You might think, "Well, wouldn't the GetCurrentFiber function return NULL if the thread hasn't been converted to a fiber?" Try it: It returns garbage. (It's amazing how many people ask questions without taking even the slightest steps towards figuring out the answer themselves. Try writing a test program.)

But even if GetCurrentFiber told you whether or not the thread had been converted to a fiber, that still won't help. Suppose two people want to do fibrous activity on the thread. The first converts, the second notices that the thread is already a fiber (somehow) and skips the conversion. Now the first operation completes and calls the ConvertFiberToThread function. Oh great, now the second operation is stranded doing fibrous activity without a fiber!

Therefore, you can use fibers safely only if you control the thread and can get all your code to agree on who controls the fiber/thread conversion.

An important consequence of the "in cahoots" rule is that you have to make sure all the code you use on a fiber is "fiber-safe", a level of safety even beyond thread-safety. The C runtime library keeps information in per-thread state: there's errno, plus all sorts of bonus bookkeeping that happens when you create a thread or call functions that maintain state in per-thread data (such as strerror, _fcvt, and strtok).

In particular, C++ exception handling is managed by the runtime, and the runtime tracks this data in per-thread state (rather than per-fiber state). Therefore, if you throw a C++ exception from a fiber, strange things happen.

(Note: Things may have changed in the C runtime lately; I'm operating from information that's a few years old.)

Even if you carefully avoid the C runtime library, you still have to worry about any other libraries you use that use per-thread data. None of them will work with fibers. If you see a call to the TlsAlloc function, then there's a good chance that the library is not fiber-safe. (The fiber-safe version is the FlsAlloc function.)
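As a point of reference, here is a sketch of the fiber-local-storage pattern that parenthetical alludes to. It assumes FlsAlloc, FlsSetValue, and FlsGetValue are available (Windows Server 2003 and later); the helper names and the LocalAlloc/LocalFree choice are my own for illustration.

 // Sketch: a fiber-local slot. The callback runs when a fiber that stored
 // data in the slot is deleted, so per-fiber allocations get cleaned up.
 static DWORD g_flsIndex = FLS_OUT_OF_INDEXES;

 VOID WINAPI CleanupFiberData(PVOID pData)
 {
  LocalFree(pData);  // assumes the value was allocated with LocalAlloc
 }

 bool InitFiberSlot()
 {
  g_flsIndex = FlsAlloc(CleanupFiberData);
  return g_flsIndex != FLS_OUT_OF_INDEXES;
 }

 // FlsSetValue(g_flsIndex, pv) and FlsGetValue(g_flsIndex) then behave like
 // TlsSetValue/TlsGetValue, except the value is per-fiber, not per-thread.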

Another category of things that are not fiber-safe are windows. Windows have thread affinity, not fiber affinity.


Comments (28)
  1. Thanks for this Raymond.

    I’ve always wanted to see a real use of fibers shown somewhere. Do newer versions of windows make much use of them?

  2. Memet says:

    Thanks Raymond,

    btw, did the Windows team introduce fibers for any reason? (e.g. MS SQL?) Since from previous conversations, I get the impression that only Win32 has such an explicitly defined Fiber API, and also because Fibers are a relatively obscure thing that most people probably wouldn’t ask for, let alone know about.

  3. Grant says:

    Just FYI, C# iterators (new in v2.0) allow this same feat without using fibers. Instead the compiler transforms the producers’ preferred code into the consumers’ preferred interface.

  4. Joe Game Developer says:

    Very cool example!

    As a point of information, some games use fibers. You give each object in the game world its own fiber.

    You could use threads (or even processes) instead, but then you have much more overhead because of having to lock to avoid potential race conditions.

  5. Sven G. Ali says:

    Yes, this is incredibly cool.

    When I started reading this series I was immediately reminded of my own struggles creating a game engine some time ago.

    In a perfect world each game entity would be on its own thread running its own script. The renderer would be on another thread, getting samples of the game world as necessary.

    The problem of course is that threads are expensive, locks are required, and the thread scheduler doesn’t provide enough control.

    The only way to make it work was, as Raymond says, to make life difficult for one side or the other. In my case, and I believe in most game engines, life was made easy for the renderer.

  6. Fibers are definitely designed to help applications which really understand their data sets, their I/O patterns and don’t have a big plug-in/extensibility mechanism. At some point large applications like this become CPU bound since they’re actually used by enterprises who do insanely expensive things like buy large storage arrays that let the number of concurrent I/Os scale as far as the CPU and operating system can push them.

    So then they need to start optimizing CPU usage and they can’t afford the statistical goodness of things like thread scheduling and I/O completion ports.

    I bet you can figure out which applications really take advantage of fibers and therefore which bugs around using fibers have gotten fixed. They’re enterprise applications which do a lot of I/O but are run by large enough physical systems that the real limiting factor is CPU use.

    Caveat emptor. To solve the easy enumerators problem you really want to have smaller things like closures/continuations without having to keep the entire stack around. This is hard without compiler support and even then managing the lifetime of the storage that the closure references is hard without GC. This is effectively how iterators/"yield" works in C#.

  7. RJ says:

    It seems that SwitchToFiber would invalidate all CPU predictions and stall the pipeline. If that is true, another warning for CPU bound apps would be to not call it in a tight loop.

  8. Ben Cooke says:

    I co-wrote a simple game engine some time ago which used coroutines, which fibers seem to be an implementation of, to allow objects to independently operate.

    In my case, each "object" in the game was able to register one or more coroutines which would be resumed each frame. Each time the coroutine was resumed the current tick count was passed in so that the coroutine would know how much to adjust its associated object by, depending on what it was supposed to be doing.

    This wasn’t done with Win32 Fibers, though. Indeed, Raymond’s series here was the first I’ve ever heard of them. My co-author had implemented a simple scripting system, so my coroutine implementation just played with the stack kept inside the script interpreter. The coroutines didn’t exist outside the script context.

    The nice thing about coroutines vs. threads is that it’s up to each routine when it yields, so things like locking become less important. If a coroutine is going to do something unsafe, it must just make sure it leaves everything in a sensible state before yielding.

    In practice, most applications other than "real-time" games are better off with threads, of course.

  9. Rob Earhart says:

    One nitpick: ConvertThreadToFiber() should fail with GetLastError() == ERROR_ALREADY_FIBER if the thread’s already a fiber. So try it; if it fails for that reason, keep going, but skip the call to ConvertFiberToThread() at the end, and you can work with components which called yours on a fiber.

    In Longhorn, there’s also IsThreadAFiber(), which should always work.

    GetCurrentFiber() and GetFiberData() don’t return garbage; they just don’t return what you’re expecting (the field in the TEB is overloaded with OS2 information). If you file a bug against me, I’ll look into fixing them.

  10. Raymond Chen says:

    ERROR_ALREADY_FIBER is returned only on Windows Server 2003. For Windows 98, 2000, and XP, converting a thread to a fiber twice results in random behavior.

  11. Given all the dire warnings about fibers, has anyone actually measured how much slower the equivalent version using threads would be?

    I suspect that switching back and forth between two fiber stacks for each produced element is also not very good for performance.

    If threads are used instead then for large collections you could batch the elements by putting a producer-consumer queue between the enumerator thread and the main thread. Then the overhead of thread switching will be amortized over a large number of elements, and you will not have to worry about fiber-safety.

  12. Bug Hunter says:

    "ERROR_ALREADY_FIBER is returned only on Windows Server 2003. For Windows 98, 2000, and XP, converting a thread to a fiber twice results in random behavior."

    How did this ever get designed this way?!?

  13. Raymond Chen says:

    Because fibers are a brain-the-size-of-a-planet feature. You are expected to have already worked through all the difficulties yourself – like deciding who will do the thread-to-fiber conversion. In the same way you don’t get an error if you free memory twice; you’re expected simply not to do that in the first place.

    Remember, the ENTIRE PROGRAM must coordinate fiber management – see all the dire warnings – if you’re coordinating fibers on a program-wide basis surely you can decide who will be responsible for the thread-to-fiber conversion; that’s the least of your problems.

  14. M Knight says:

    Pavel Lebedinsky, switching between fibers is *cheap* compared to switching between threads.

    A switch between fibers involves saving the register state and then loading the new register state from the new fiber. This also switches the stack. You are looking at a few dozen bytes which need to be moved about.

    With a thread, it involves a trip to kernel land, saving the register state, loading the new register state, and then a trip back to user land.

    The timesaver is the removal of the trips between user mode and kernel mode to do a context switch.

  15. If memory serves, SQL Server gets something like a 5-10% boost when running in fiber mode. However I believe that it then can’t use XSPs (the DLL-based plug-in mechanism) because, in general, code can’t deal with fibers / shifting across threads. (This is why it’s not the default; it does break some code that may be running in-process with SQL.)

    Maybe they’ve done the work so that XSP invocations are remoted to run on threads that aren’t fibrous; it’s been too many years since I’ve paid deep attention to SQL server, so please research the details here (XSPs) yourself.

    Re: cost:

    I may be wrong here, but I believe that a context switch forces reloading the PCR (on x86) which is a very expensive operation for the CPU. Fiber switches switch a lot less state, don’t go into kernel mode and don’t change the PCR.

    Mike

  16. > switching between fibers is *cheap* compared to switching between threads.

    My point was that any context switching (thread or fiber) has additional costs because of poor locality of reference. These costs may dominate everything else (things like saving the registers, transition to KM and back, running the scheduler code etc). If that’s the case then using fibers doesn’t really buy you anything, except all the bugs you’re going to discover because most code has never been tested with fibers and probably is not fiber-safe.

    Now I’m not an expert on this to be able to say one way or another. But when I googled for ["context switch" "cache miss" cost] the first hit was this article:

    The Effect of Context Switches on Cache Performance

    http://www.eecs.harvard.edu/cs245/papers/Shieyuan.html

    They say that at least in some cases, cache-performance dominates the cost of a context switch. Note that this was back in 1991, and by all accounts the situation has only become worse in recent years because CPU speeds grew much faster than memory speeds.

    So it seems to me that unless you’ve actually *measured* the performance of both solutions and found fibers significantly faster, you’re better off staying with threads.

    Any kind of context switch — whether an explicit one made in fibers, or a true thread change — is painful in terms of cache swap. It’s still true, though, that SwitchToFiber() is cheaper than a normal thread context switch, as fiber switching can be done almost entirely in non-privileged (usermode) code.

    And yes, it is still true today that context switches kill the cache. Those of us in the performance team have been on several crusades to reduce the number of switches needed in an "idle" system over the last few years — pushing for fewer services, fewer ISRs/DPCs, etc. In my particular case, my rallying cry has been high-definition video playback; we’ve found that having a high number of ISRs/DPCs per second will absolutely sink decoding, even if every single interrupt executes quickly, because they kill the cache. (This is in addition, of course, to the problem of ISRs/DPCs that take too long and lock up the CPU for extended periods of time, such as long NDIS chains or non-conformant ATAPI drives.)

  18. Frank says:

    Just a thought – couldn’t you do something similar with "good-old" POSIX setjmp()/longjmp()? Not sure though, I guess it lacks the ‘double stack’.

    Yes, a fiber is very similar to a jmp_buf. The thing is that setjmp/longjmp are explicitly defined only for jumping up the stack, not across between stacks. There’s no defined way in standard C to allocate a new stack, get code running on it, and save its context into a jmp_buf.

    Note also that since a jmp_buf is a struct, its size can’t really vary and depends on the version of the C runtime (regardless of platform) that you are built with. Fibers are provided by the OS on windows and therefore if someone discovers that some critical piece of state wasn’t captured/switched, it can be added.

    Unfortunately those additions do tend to break apps who were depending on that piece of context not being switched. Luckily most people have heeded the warning and aren’t using fibers except for people who think they already understand their per-fiber context pretty well.

  20. lukaszg says:

    >> (It’s amazing how many people ask questions without taking even the slightest steps towards figuring out the answer themselves. Try writing a test program.) <<

    Isn’t that exactly how people come to rely on undocumented behavior? Just because a test program behaves in a certain way on your machine doesn’t mean it will on others. Answers to questions like this should be in the docs (don’t know if this one is, just making a general point). Of course, sometimes writing a test program is the only way, but it shouldn’t be the first one tried.

  21. Raymond Chen says:

    The test program is *one step* towards figuring out the answer. A test program can’t tell you that you’re right, but it can tell you that you’re wrong.

  22. A while back there was an article in MSDN magazine about wrapping up the unmanaged fibers API to implement…

  23. Kannan Goundan says:

    While C# iterators cover a good chunk of what you’d normally want to do, they don’t use separate stacks and so you can’t do everything fibres can. The main restriction is that all your "yield"s have to be in the top function instance.

    IEnumerator<X> InOrder(Tree<X> t) {
      if (t != null) {
        InOrder(t.left);
        yield return t.value;
        InOrder(t.right);
      }
    }

    At the least, you’d have to put a for loop around the recursive calls and "re-yield" the values to the top. This is pretty inefficient.

    Full continuation support would fix this. Unfortunately (AFAIK), adding continuation support causes a significant performance hit (even if you don’t use them).

  24. This has been discussed fairly frequently on the Web. Chris Brumme discusses this here: http://blogs.msdn.com/cbrumme/archive/2003/04/15/51351.aspx

Comments are closed.

