Date: December 31, 2004 / year-entry #440
Tags: code
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20041231-00/?p=36833
Comments: 28
Summary: As we discovered in the previous two entries, the problem with enumeration is that somebody always loses. Now we will use fibers to fight back. Before you decide to use fibers in your programs, make sure to read the dire warnings at the end of this article. My goal here is to show one...
As we discovered in the previous two entries, the problem with enumeration is that somebody always loses. Now we will use fibers to fight back. Before you decide to use fibers in your programs, make sure to read the dire warnings at the end of this article. My goal here is to show one use of fibers, not to say that fibers are the answer to all your problems. Fibers can create more problems than they solve. We'll come back to all the dire warnings later.

As with most clever ideas, this one has a simple kernel: use a fiber to run the caller and the enumerator each on its own stack.

```cpp
#include <windows.h>
#include <shlwapi.h>
#include <stdio.h>
#include <strsafe.h>

enum FEFOUND {
 FEF_FILE,     // found a file
 FEF_DIR,      // found a directory
 FEF_LEAVEDIR, // leaving a directory
 FEF_DONE,     // finished
};

enum FERESULT {
 FER_CONTINUE, // continue enumerating
               // (if directory: recurse into it)
 FER_SKIP,     // skip directory (do not recurse)
};

class __declspec(novtable) FiberEnumerator {
public:
 FiberEnumerator();
 ~FiberEnumerator();

 FEFOUND Next();
 void SetResult(FERESULT fer) { m_fer = fer; }
 void Skip() { SetResult(FER_SKIP); }

 virtual LPCTSTR GetCurDir() = 0;
 virtual LPCTSTR GetCurPath() = 0;
 virtual const WIN32_FIND_DATA* GetCurFindData() = 0;

protected:
 virtual void FiberProc() = 0;

 static void DECLSPEC_NORETURN WINAPI
   s_FiberProc(void* pvContext);

 FERESULT Produce(FEFOUND fef);

protected:
 void* m_hfibCaller;
 void* m_hfibEnumerator;
 FEFOUND m_fef;
 FERESULT m_fer;
};

FiberEnumerator::FiberEnumerator()
 : m_fer(FER_CONTINUE)
{
 m_hfibEnumerator = CreateFiber(0, s_FiberProc, this);
}

FiberEnumerator::~FiberEnumerator()
{
 DeleteFiber(m_hfibEnumerator);
}

void DECLSPEC_NORETURN FiberEnumerator::
 s_FiberProc(void* pvContext)
{
 FiberEnumerator* self =
   reinterpret_cast<FiberEnumerator*>(pvContext);
 self->FiberProc();

 // Theoretically, we need only produce Done once,
 // but keep looping in case a consumer gets
 // confused and asks for the Next() item even
 // though we're Done.
 for (;;) self->Produce(FEF_DONE);
}
```
This helper class does the basic bookkeeping of fiber-based enumeration. At construction, it remembers the fiber that is consuming the enumeration, as well as creating a fiber that will produce the enumeration. At destruction, it cleans up the fiber. The derived class is expected to implement the FiberProc method and generate items by calling the Produce method.

The real magic happens in the (somewhat anticlimactic) Produce and Next methods:
```cpp
FERESULT FiberEnumerator::Produce(FEFOUND fef)
{
 m_fef = fef;          // for Next() to retrieve
 m_fer = FER_CONTINUE; // default
 SwitchToFiber(m_hfibCaller);
 return m_fer;
}

FEFOUND FiberEnumerator::Next()
{
 m_hfibCaller = GetCurrentFiber();
 SwitchToFiber(m_hfibEnumerator);
 return m_fef;
}
```
To Produce an item, the enumerator records what it found, then switches to the caller's fiber; when control eventually comes back, Produce returns the caller's verdict (continue or skip). To get the next item, we remember the identity of the calling fiber, then switch to the enumerator fiber. This runs the enumerator until it decides to Produce something, at which point control switches back to the calling fiber and Next returns what was produced.
That's all there is to it. The m_fef and m_fer members carry the found-item code and the result code back and forth across the fiber switch.

Okay, with that groundwork out of the way, writing the producer itself is rather anticlimactic. Since we want to make things easy for the consumer, we use the interface the consumer would have designed, with some assistance from the helper class. (One fix over the version as originally posted: m_hfind is initialized up front so that Pop is safe even when Push fails partway through.)

```cpp
class DirectoryTreeEnumerator : public FiberEnumerator {
public:
 DirectoryTreeEnumerator(LPCTSTR pszDir);
 ~DirectoryTreeEnumerator();

 LPCTSTR GetCurDir() { return m_pseCur->m_szDir; }
 LPCTSTR GetCurPath() { return m_szPath; }
 const WIN32_FIND_DATA* GetCurFindData()
    { return &m_pseCur->m_wfd; }

private:
 void FiberProc();
 void Enum();

 struct StackEntry {
   StackEntry* m_pseNext;
   HANDLE m_hfind;
   WIN32_FIND_DATA m_wfd;
   TCHAR m_szDir[MAX_PATH];
 };
 bool Push(StackEntry* pse);
 void Pop();

private:
 StackEntry* m_pseCur;
 TCHAR m_szPath[MAX_PATH];
};

DirectoryTreeEnumerator::
 DirectoryTreeEnumerator(LPCTSTR pszDir)
 : m_pseCur(NULL)
{
 StringCchCopy(m_szPath, MAX_PATH, pszDir);
}

DirectoryTreeEnumerator::~DirectoryTreeEnumerator()
{
 while (m_pseCur) {
   Pop();
 }
}

bool DirectoryTreeEnumerator::Push(StackEntry* pse)
{
 pse->m_hfind = INVALID_HANDLE_VALUE; // so Pop() is safe if we fail early
 pse->m_pseNext = m_pseCur;
 m_pseCur = pse;
 return SUCCEEDED(StringCchCopy(pse->m_szDir,
                                MAX_PATH, m_szPath)) &&
   PathCombine(m_szPath, pse->m_szDir, TEXT("*.*")) &&
   (pse->m_hfind = FindFirstFile(m_szPath, &pse->m_wfd))
                != INVALID_HANDLE_VALUE;
}

void DirectoryTreeEnumerator::Pop()
{
 StackEntry* pse = m_pseCur;
 if (pse->m_hfind != INVALID_HANDLE_VALUE) {
   FindClose(pse->m_hfind);
 }
 m_pseCur = pse->m_pseNext;
}

void DirectoryTreeEnumerator::FiberProc()
{
 Enum();
}

void DirectoryTreeEnumerator::Enum()
{
 StackEntry se;
 if (Push(&se)) {
   do {
     if (lstrcmp(se.m_wfd.cFileName, TEXT(".")) != 0 &&
         lstrcmp(se.m_wfd.cFileName, TEXT("..")) != 0 &&
         PathCombine(m_szPath, se.m_szDir,
                     se.m_wfd.cFileName)) {
       FEFOUND fef =
         (se.m_wfd.dwFileAttributes &
                   FILE_ATTRIBUTE_DIRECTORY)
         ? FEF_DIR : FEF_FILE;
       if (Produce(fef) == FER_CONTINUE && fef == FEF_DIR) {
         Enum(); // recurse into the subdirectory we just produced
       }
     }
   } while (FindNextFile(se.m_hfind, &se.m_wfd));
 }
 Produce(FEF_LEAVEDIR);
 Pop();
}
```
As you can see, this class is a mix of the two previous
classes. Like the consumer-based class, information about
the item being enumerated is obtained by calling methods
on the enumerator object. But like the callback-based
version, the loop that generates the objects themselves is
a very simple recursive function, with a call to Produce at the point where each item is generated.
In fact, it's even simpler than the callback-based version,
since we don't have to worry about the FER_STOP code.
If the consumer wants to stop the enumeration, it simply stops calling Next(). Most of the complexity in the class is just bookkeeping to permit abandoning the enumeration prematurely.
Okay, let's take this fiber out for a spin. You can use the same TestWalk function as in the previous entries. A little tweak needs to be made to the main function, though:

```cpp
int __cdecl main(int argc, char** argv)
{
 ConvertThreadToFiber(NULL);
 DirectoryTreeEnumerator e(TEXT("."));
 TestWalk(&e);
 ConvertFiberToThread();
 return 0;
}
```

Since the enumerator is going to switch between fibers, we'd better convert the thread to a fiber so it'll have something to switch back to! Here's a schematic of what happens when you run this fiber-based enumerator: each call to Next() switches control to the enumerator fiber, which runs until its next call to Produce() switches control back, at which point Next() returns the item.
Observe that from each fiber's point of view, the other fiber is just a subroutine!
Coding subtlety: Why do we capture the caller's fiber each time the Next() method is called, rather than just once in the constructor?

Next time, we'll see how this fiber-based enumerator easily admits higher-order operations such as filtering and composition.

Dire warnings about fibers

Fibers are like dynamite. Mishandle them and your process explodes.

The first dire warning is that fibers are expensive in terms of address space, since each one gets its own stack (typically a megabyte).

And since each fiber has its own stack, it also has its own exception chain. This means that if a fiber throws an exception, only that fiber can catch it. (Same as threads.) That's a strong argument against using an STL std::stack object to maintain our state: STL is based on an exception-throwing model, but you can't catch exceptions raised by another fiber. (You also can't throw exceptions past a COM boundary, which severely limits how much you can use STL in a COM object.)
One of the big problems with fibers is that
everybody has to be in cahoots.
You need to decide on one person who will call the ConvertThreadToFiber function.
You might think, "Well, wouldn't the GetCurrentFiber function tell me whether or not the thread had been converted to a fiber?"
But even if GetCurrentFiber told you whether or not the thread
had been converted to a fiber, that still won't help.
Suppose two people want to do fibrous activity on the thread.
The first converts, the second notices that the thread
is already a fiber (somehow) and skips the conversion.
Now the first operation completes and calls the ConvertFiberToThread function, leaving the second operation running on a thread that is no longer a fiber.

Therefore, you can use fibers safely only if you control the thread and can get all your code to agree on who controls the fiber/thread conversion.

An important consequence of the "in cahoots" rule is that you have to make sure all the code you use on a fiber is "fiber-safe" - a level of safety even beyond thread-safety. The C runtime library keeps information in per-thread state: there's errno, all sorts of bonus bookkeeping when you create a thread, or call various functions that maintain state in per-thread data (such as strerror, _fcvt, and strtok). In particular, C++ exception handling is managed by the runtime, and the runtime tracks this data in per-thread state (rather than per-fiber state). Therefore, if you throw a C++ exception from a fiber, strange things happen. (Note: Things may have changed in the C runtime lately; I'm operating from information that's a few years old.)
Even if you carefully avoid the C runtime library,
you still have to worry about any other libraries you use
that use per-thread data. None of them will work with fibers.
If you see a call to the TlsAlloc function, then there's a good chance that the library is not fiber-safe. (The fiber-safe version is FlsAlloc.)

Another category of things that are not fiber-safe are windows. Windows have thread affinity, not fiber affinity.
Comments (28)
Comments are closed.
Thanks for this Raymond.
I’ve always wanted to see a real use of fibers shown somewhere. Do newer versions of windows make much use of them?
Thanks Raymond,
btw, did the Windows team introduce fibers for any reason? (e.g. MS SQL?) Since from previous conversations, I get the impression that only Win32 has such an explicitly defined Fiber API, and also because Fibers are a relatively obscure thing that most people probably wouldn’t ask for, let alone know about.
Just FYI, C# iterators (new in v2.0) allow this same feat without using fibers. Instead the compiler transforms the producer's preferred code into the consumer's preferred interface.
Very cool example!
As a point of information, some games use fibers. You give each object in the game world its own fiber.
You could use threads (or even processes) instead, but then you have much more overhead because of having to lock to avoid potential race conditions.
Yes, this is incredibly cool.
When I started reading this series I was immediately reminded of my own struggles creating a game engine some time ago.
In a perfect world each game entity would be on its own thread running its own script. The renderer would be on another thread, getting samples of the game world as necessary.
The problem of course is that threads are expensive, locks are required, and the thread scheduler doesn’t provide enough control.
The only way to make it work was, as Raymond says, to make life difficult for one side or the other. In my case, and I believe in most game engines, life was made easy for the renderer.
Fibers are definitely designed to help applications which really understand their data sets, their I/O patterns and don’t have a big plug-in/extensibility mechanism. At some point large applications like this become CPU bound since they’re actually used by enterprises who do insanely expensive things like buy large storage arrays that let the number of concurrent I/Os scale as far as the CPU and operating system can push them.
So then they need to start optimizing CPU usage and they can’t afford the statistical goodness of things like thread scheduling and I/O completion ports.
I bet you can figure out which applications really take advantage of fibers and therefore which bugs around using fibers have gotten fixed. They’re enterprise applications which do a lot of I/O but are run by large enough physical systems that the real limiting factor is CPU use.
Caveat emptor. To solve the easy enumerators problem you really want to have smaller things like closures/continuations without having to keep the entire stack around. This is hard without compiler support and even then managing the lifetime of the storage that the closure references is hard without GC. This is effectively how iterators/"yield" works in C#.
It seems that SwitchToFiber would invalidate all CPU predictions and stall the pipeline. If that is true, another warning for CPU bound apps would be to not call it in a tight loop.
I co-wrote a simple game engine some time ago which used coroutines, which fibers seem to be an implementation of, to allow objects to independently operate.
In my case, each "object" in the game was able to register one or more coroutines which would be resumed each frame. Each time the coroutine was resumed the current tick count was passed in so that the coroutine would know how much to adjust its associated object by, depending on what it was supposed to be doing.
This wasn’t done with Win32 Fibers, though. Indeed, Raymond’s series here was the first I’ve ever heard of them. My co-author had implemented a simple scripting system, so my coroutine implementation just played with the stack kept inside the script interpreter. The coroutines didn’t exist outside the script context.
The nice thing about coroutines vs. threads is that it's up to each routine when it yields, so things like locking become less important. If a coroutine is going to do something unsafe, it just has to make sure it leaves everything in a sensible state before yielding.
In practice, most applications other than "real-time" games are better off with threads, of course.
One nitpick: ConvertThreadToFiber() should fail with GetLastError() == ERROR_ALREADY_FIBER if the thread’s already a fiber. So try it; if it fails for that reason, keep going, but skip the call to ConvertFiberToThread() at the end, and you can work with components which called yours on a fiber.
In Longhorn, there’s also IsThreadAFiber(), which should always work.
GetCurrentFiber() and GetFiberData() don’t return garbage; they just don’t return what you’re expecting (the field in the TEB is overloaded with OS2 information). If you file a bug against me, I’ll look into fixing them.
ERROR_ALREADY_FIBER is returned only on Windows Server 2003. For Windows 98, 2000, and XP, converting a thread to a fiber twice results in random behavior.
Given all the dire warnings about fibers, has anyone actually measured how much slower the equivalent version using threads would be?
I suspect that switching back and forth between two fiber stacks for each produced element is also not very good for performance.
If threads are used instead then for large collections you could batch the elements by putting a producer-consumer queue between the enumerator thread and the main thread. Then the overhead of thread switching will be amortized over a large number of elements, and you will not have to worry about fiber-safety.
"ERROR_ALREADY_FIBER is returned only on Windows Server 2003. For Windows 98, 2000, and XP, converting a thread to a fiber twice results in random behavior."
How did this ever get designed this way?!?
Because fibers are a brain-the-size-of-a-planet feature. You are expected to have already worked through all the difficulties yourself – like deciding who will do the thread-to-fiber conversion. In the same way you don’t get an error if you free memory twice; you’re expected simply not to do that in the first place.
Remember, the ENTIRE PROGRAM must coordinate fiber management – see all the dire warnings – if you’re coordinating fibers on a program-wide basis surely you can decide who will be responsible for the thread-to-fiber conversion; that’s the least of your problems.
Pavel Lebedinsky, switching between fibers is *cheap* compared to switching between threads.
A switch between fibers involves saving the current register state and then loading the new register state from the new fiber. This switches the stack as well. You are looking at a few dozen bytes which need to be moved about.

With a thread, it involves a trip to kernel land, saving the register state, loading the new register state, and then a trip back to user land.

The time saver is the removal of the trips between user mode and kernel mode to do a context switch.
If memory serves, SQL Server gets something like a 5-10% boost when running in fiber mode. However I believe that it then can’t use XSPs (the DLL-based plug-in mechanism) because, in general, code can’t deal with fibers / shifting across threads. (This is why it’s not the default; it does break some code that may be running in-process with SQL.)
Maybe they’ve done the work so that XSP invocations are remoted to run on threads that aren’t fibrous; it’s been too many years since I’ve paid deep attention to SQL server, so please research the details here (XSPs) yourself.
Re: cost:
I may be wrong here, but I believe that a context switch forces reloading the PCR (on x86) which is a very expensive operation for the CPU. Fiber switches switch a lot less state, don’t go into kernel mode and don’t change the PCR.
Mike
> switching between fibers is *cheap* compared to switching between threads.
My point was that any context switching (thread or fiber) has additional costs because of poor locality of reference. These costs may dominate everything else (things like saving the registers, transition to KM and back, running the scheduler code etc). If that’s the case then using fibers doesn’t really buy you anything, except all the bugs you’re going to discover because most code has never been tested with fibers and probably is not fiber-safe.
Now I’m not an expert on this to be able to say one way or another. But when I googled for ["context switch" "cache miss" cost] the first hit was this article:
The Effect of Context Switches on Cache Performance
http://www.eecs.harvard.edu/cs245/papers/Shieyuan.html
They say that at least in some cases, cache-performance dominates the cost of a context switch. Note that this was back in 1991, and by all accounts the situation has only become worse in recent years because CPU speeds grew much faster than memory speeds.
So it seems to me that unless you’ve actually *measured* the performance of both solutions and found fibers significantly faster, you’re better off staying with threads.
Any kind of context switch — whether an explicit one made in fibers, or a true thread change — is painful in terms of cache swap. It's still true, though, that SwitchToFiber() is cheaper than a normal thread context switch, as fiber switching can be done almost entirely in non-privileged (user-mode) code.
And yes, it is still true today that context switches kill the cache. Those of us in the performance team have been on several crusades to reduce the number of switches needed in an "idle" system over the last few years — pushing for fewer services, fewer ISRs/DPCs, etc. In my particular case, my rallying cry has been high-definition video playback; we’ve found that having a high number of ISRs/DPCs per second will absolutely sink decoding, even if every single interrupt executes quickly, because they kill the cache. (This is in addition, of course, to the problem of ISRs/DPCs that take too long and lock up the CPU for extended periods of time, such as long NDIS chains or non-conformant ATAPI drives.)
Just a thought – couldn’t you do something similar with "good-old" POSIX setjmp()/longjmp()? Not sure though, I guess it lacks the ‘double stack’.
Yes, a fiber is very similar to a jmp_buf. The thing is that setjmp/longjmp explicitly only are enabled for jumping up the stack, not across between stacks. There’s no defined way to allocate a new stack, get code running on it and save its context into a jmp_buf in standard C etc.
Note also that since a jmp_buf is a struct, its size can’t really vary and depends on the version of the C runtime (regardless of platform) that you are built with. Fibers are provided by the OS on windows and therefore if someone discovers that some critical piece of state wasn’t captured/switched, it can be added.
Unfortunately those additions do tend to break apps who were depending on that piece of context not being switched. Luckily most people have heeded the warning and aren’t using fibers except for people who think they already understand their per-fiber context pretty well.
The test program is *one step* towards figuring out the answer. A test program can’t tell you that you’re right, but it can tell you that you’re wrong.
A while back there was an article in MSDN magazine about wrapping up the unmanaged fiber API to implement…
While C# iterators cover a good chunk of what you’d normally want to do, they don’t use separate stacks and so you can’t do everything fibres can. The main restriction is that all your "yield"s have to be in the top function instance.
```csharp
IEnumerator<X> InOrder(Tree<X> t) {
  if (t != null) {
    InOrder(t.left);
    yield return t.value;
    InOrder(t.right);
  }
}
```
At the least, you’d have to put a for loop around the recursive calls and "re-yield" the values to the top. This is pretty inefficient.
Full continuation support would fix this. Unfortunately (AFAIK), adding continuation support causes a significant performance hit (even if you don’t use them).
This has been discussed fairly frequently on the Web. Chris Brumme discusses this here: http://blogs.msdn.com/cbrumme/archive/2003/04/15/51351.aspx