What’s up with the mysterious inc bp in function prologues of 16-bit code?

Date:March 16, 2011 / year-entry #65
Tags:history
Orig Link:https://blogs.msdn.microsoft.com/oldnewthing/20110316-00/?p=11203
Comments:    24
Summary:A little while ago, we learned about the EBP chain. The EBP chain in 32-bit code is pretty straightforward because there is only one type of function call. But in 16-bit code there are two types of function calls, the near call and the far call. A near call pushes a 16-bit return address on...

A little while ago, we learned about the EBP chain. The EBP chain in 32-bit code is pretty straightforward because there is only one type of function call. But in 16-bit code there are two types of function calls, the near call and the far call.

A near call pushes a 16-bit return address on the stack before branching to the function entry point, which must reside in the same code segment as the caller. The function then uses a ret instruction (a near return) when it wants to return to the caller, indicating that the CPU should resume execution at the specified address within the same code segment.

By comparison, a far call pushes both the segment (or selector if in protected mode) and the offset of the return address on the stack (two 16-bit values), and the function being called is expected to use a retf instruction (a far return) to indicate that the CPU should pop two 16-bit values off the stack to determine where execution should resume.

When Windows was first introduced, it ran on an 8086 with 384KB of RAM. This posed a challenge because the 8086 processor had no memory manager, had no CPU privilege levels, and had no concept of task switching. And in order to squeeze into 384KB of RAM, Windows needed to be able to load code from disk on demand and discard it when memory pressure required it.

One of the really tricky parts of the real-mode memory manager was fixing up all the function pointers when code was loaded and unloaded. When you unloaded a function, you had to make sure that any existing code in memory that called that function didn't actually call it, because the function wasn't there. If you had a memory manager, you could mark the segment or page not present, but there is no such luxury on the 8086.

There are multiple parts to the solution, but the part that leads to the answer to the title question is the way the memory manager patched up all the stacks in the system. After all, if you discarded a function, you had to make sure that any reference to that function as a return address on somebody's stack got fixed up before the code tried to execute that retf instruction and found itself returning to a function that didn't exist.

And that's where the mysterious inc bp came from.

The first rule of stack frames in real-mode Windows is that you must have a bp-based stack frame. FPO was not permitted. (Fortunately, FPO was also not very tempting because the 16-bit instruction set made it cumbersome to access stack memory by means other than the bp register, so the easiest way to do something was also the right way.) In other words, the first rule required that every stack have a valid bp chain at all times.

The second rule of stack frames in real-mode Windows is that if you are going to return with a retf, then you must increment the bp register before you push it (and must therefore perform the corresponding decrement after you pop it). This second rule means that code which walks the bp chain can find the next function up the stack. If bp is even, then the function will use a near return, so it looks at the 16-bit value stored on the stack after the bp and doesn't change the cs register. On the other hand, if the bp is odd, then it knows to look at both the 16-bit offset and the 16-bit segment that were pushed on the stack.

Okay, so let's put it all together: When code got discarded, the kernel walked all the stacks in the system (which it could now do due to these two rules), and if it saw that a return address corresponded to a function that got discarded, it patched the return address to point to a chunk of code which called back into the memory manager to reload the function, re-patch all the return addresses so they now point to the new address where the function got loaded (probably different from where the function was when it was discarded), and then jumped back to the original code as if nothing had happened.

I continue to be amazed at how much Windows 1.0 managed to accomplish given that it had so little to work with. It even used an LRU algorithm to choose which functions to discard by implementing a software version of the "accessed bit", something that modern CPUs manage in hardware.


Comments (24)
  1. Tyler Durden says:

    The first rule of stack frames in real-mode Windows is: do not talk about stack frames in real-mode Windows.

    The second rule of stack frames in real-mode Windows is: DO NOT TALK ABOUT STACK FRAMES IN REAL-MODE WINDOWS.

  2. David Walker says:

    Ah, an 8086.  Was that the one that ran at 4.77 MHz?

  3. Joshua says:

    As far as I know, these could be omitted without issue on Win95. I tried it but well could have missed something nasty.

    [As far as I know, Windows 95 didn't run the real-mode kernel. (Mind you, debuggers would generate bad stack traces if you didn't mark your frames correctly.) -Raymond]
  4. nathan_works says:

    That is really, really fascinating.. I remember writing a very interesting memory manager for the NACHOS in college, where you didn't get quite down to this level.. Brilliant engineering..

  5. blah says:

    I'm not surprised at the accomplishment. It was not the Ballmer era.

  6. DonH says:

    Actually, blah, it was the Ballmer era.  Back in the early Windows era Steve wasn't the company president, he was the head of the Systems division.

  7. benjamin says:

    Nice. These are the kinds of articles that make me love this blog.

  8. rs says:

    I have often wondered why the sp/esp register is not default choice for referencing the stack. Supposedly the compiler has to keep track of the pushs/pops anyway, so it could just compute the resulting offset from sp/esp, no need to use bp/ebp.

  9. Matt says:

    Wow, implementing paging without a memory manager–kind of reminds me of the Boehm garbage collector, which does garbage collection in native C.

    So that hot-patching maneuver handles fixing up the running stacks, but how about fixing jumps in the rest of the code to the discarded functions?  As in, say I'm running along, and a few instructions ahead of EIP is a jump to another procedure.  If the kernel wants to discard that procedure, it has to find that jump address in my code, and redirect it to a page fault handler, so that when my process gets to it, it will call the procedure and fault the code back in.  How does it find all of the references to that function across the program, so that it can patch them all up?

    It would get even crazier if you're calculating the addresses procedurally, or storing function pointers for things like callbacks or C++ vtbl entries–I'm not sure I understand how the kernel could possibly find all of these.  Could you elaborate a bit more on how it pulled off this voodoo?

  10. silasd says:

    @rs: the 8086 instruction set doesn't let you reference the stack via SP.  Only BP, or (with segment override) BX, SI, DI, or various combinations of those four registers.  See http://www.sandpile.org/…/opc_rm16.htm for details.

    The '386 instruction set lets you use just about any 32-bit register or combo you want, and it even works in 16-bit mode.  (The high 16 bits of the 32-bit register must be zero, or you get a segmentation violation.)

  11. Daev says:

    rs: Congratulations, you've just invented FPO (Frame Pointer Optimization).

  12. Henning Makholm says:

    @David Walker: The 4.77 MHz was the original 8088-based IBM Personal Computer. The 8088 was essentially an 8086 with the external data bus narrowed to 8 bits and bus driver logic that would divide each word-sized memory access into two separately addressed byte accesses. It could run somewhat faster than 4.77 MHz, but IBM underclocked it such that it could share an oscillator with circuitry that modulated the NTSC color signal in the composite video output. Once non-IBM manufacturers started producing PC clones, all or most of them used real 8086's usually clocked at 8 MHz.

  13. Kyle says:

    Absolutely fascinating.  An interesting side effect is that if *any* of the call stacks are hosed and Windows discards a function, the kernel will trash random parts of memory.  Not a whole lot could be done about that, though.

    Regardless, the solution was quite ingenious.

  14. Bekenn says:

    Matt: I believe those offsets are stored as data in the executable image.

  15. Yuhong Bao says:

    [As far as I know, Windows 95 didn't run the real-mode kernel.]

    FYI, real mode Windows was removed in Windows 3.1.

    [I think you missed the joke. -Raymond]
  16. Yuhong Bao says:

    Even with a 16-bit app running on a 32-bit OS, iretd did not restore the high word of ESP if the SS was a 16-bit stack.

    [You have an annoying knack for stating things which are factually correct but completely irrelevant. -Raymond]
  17. Gabe says:

    I'd really like to know how they implemented an "access bit" in software! The only thing I can think of is to patch function calls to redirect to an "access bit setter" function, which would set the access bit, patch itself out, and jump to the real function. Does anybody know if I am right, or was it something more clever?

  18. Anonymous says:

    It is such a pity that we might never see the source code for these old versions of Windows (1.0, 3.0, 3.1…). I hope Microsoft management is enlightened enough to at least escrow the code so it can be published in the future, perhaps after copyright in it runs out.

    It might not be of any practical use anymore, but it is an important piece of human computing history.

  19. Worf says:

    You do realize that copyright won't run out until probably the next century, if not longer, right?

    No, it would be neat if Microsoft released the code to be studied way before copyright on it runs out – maybe even this year as say an academic study exercise on how to do frugal computing.

    Of course, it makes the assumption that Microsoft can even find the original code…

  20. Klimax says:

    I don't think find it would be trouble (Microsoft archives a lot of things), but copyright would be stil lproblem due to third  party code and possibly other things like patents… (some might by still in effect)

  21. Anonymous says:

    @Klimax: patents are not a problem, since you would not be using the source code for anything, just reading it. If patents were a problem in this case, you would have to license a patent just to read the patent document itself, which is plainly ridiculous.

  22. Klimax says:

    Patent wouldn't be for "reading", but distribution of implementation of patent. I wouldn't be suprised if system is not that broken. (Not aplicable here no sw patents…)

  23. j_random_hacker says:

    One thing I'm not quite understanding: If f() calls g() with a far call, then I take it g() will INC BP before PUSH BP; MOV BP,SP.  The pushed BP will be odd, and all is good in the world.  Now if g() calls h() with a far call, won't BP already be odd?  So that when h() calls INC BP; PUSH BP; MOV BP,SP itself, the pushed value of BP will be *even*.

    Perhaps when you said "(and must therefore perform the corresponding decrement after you pop it)" you actually meant "(and must perform the corresponding decrement immediately after *pushing* it)".  At least doing things that way seems to preserve the property that all the "images" of BP that get pushed on the stack are odd if and only if the function expects to be far-called, since BP will be its usual even self at all times, except possibly for a short window around PUSH BP instructions.

    [The "MOV BP,SP" makes BP even again. -Raymond]
  24. Joseph Koss says:

    @j_random_hacker: The value of BP that is put on the stack is the CALLERS frame point (+ 1 if a far call) Then BP is immediately changed to the current functions frame pointer (mov bp, sp)

    So at any given time after a functions prologue, the BP register is correctly the frame pointer for that function, untainted by these +1 shenanigans. If it was the callers job to save its own frame pointer, then FPO would not even be possible… so it will always be the callee's job, not the callers, to save the frame pointer state.

Comments are closed.


*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog (most recent link is here). It may have slight formatting modifications for consistency and to improve readability.

WHY DID I DUPLICATE THIS CONTENT HERE? Let me first say this site has never had anything to sell and has never shown ads of any kind. I have nothing monetarily to gain by duplicating content here. Because I had made my own local copy of this content throughout the years, for ease of using tools like grep, I decided to put it online after I discovered some of the original content previously and publicly available, had disappeared approximately early to mid 2019. At the same time, I present the content in an easily accessible theme-agnostic way.

The information provided by Raymond's blog is, for all practical purposes, more authoritative on Windows Development than Microsoft's own MSDN documentation and should be considered supplemental reading to that documentation. The wealth of missing details provided by this blog that Microsoft could not or did not document about Windows over the years is vital enough, many would agree an online "backup" of these details is a necessary endeavor. Specifics include:

<-- Back to Old New Thing Archive Index