Why does the x64 calling convention reserve four home spaces for parameters even for functions that take fewer than four parameters?

In the comments to "Can an x64 function repurpose parameter home space as general scratch space?", many people questioned why the full four home spaces are allocated for all functions, even those that have fewer than four parameters.

MSDN gives a very brief answer: "This aids in the simplicity of supporting C unprototyped functions, and vararg C/C++ functions."

Let's dig into that sentence.

Classic C allows you to call an unprototyped function. You can just say "There's a function called f, and it returns an integer, but that's all I'm going to tell you. You'll just have to trust me on the rest."

int f();

You can call this function and pass however many parameters you like. As long as you pass enough parameters to satisfy the function, that's good enough. (Modern C has tightened the rules, but classic C didn't care.) The catch is that the number of parameters you pass could be fewer than the number of parameters the function actually accepts.

Huh?

The classic example of this is the open function. It takes three parameters, but the third parameter is required only when creating a file. If you aren't creating a file, then the third parameter can be omitted.

fd = open(filename, O_RDONLY);
fd = open(filename, O_CREAT | O_WRONLY,  0777);
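
In prototyped C, the usual way to express "the third parameter is sometimes there" is a variadic declaration, which is how POSIX declares open. Here is a minimal sketch of that idea. The names demo_open and DEMO_* are made up for illustration, and the function only reports which mode it would have used rather than opening anything.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical flag values, for illustration only. */
enum { DEMO_RDONLY = 0, DEMO_WRONLY = 1, DEMO_CREAT = 0x100 };

/* Hypothetical open-like function. Returns the creation mode it
   would use, or -1 if the flags say no mode was passed. */
int demo_open(const char *filename, int flags, ...)
{
    int mode = -1;
    assert(filename != 0);
    if (flags & DEMO_CREAT) {
        /* Only touch the third parameter when the flags promise it exists. */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, int);
        va_end(ap);
    }
    return mode;
}
```

The function never looks at the third parameter unless the second one says it is there, which is the same bargain the classic unprototyped open made with its callers.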

Now think about how this function would be implemented. On entry, the rcx register points to the file name, the rdx register contains the flags, and the r8 register might or might not contain the creation mode.

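Suppose home space were reserved only for the parameters actually passed. How would you know whether it is safe to spill r8? If you spill it when the caller passed only two parameters, then you have corrupted the caller's stack.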

Okay, so you can work around this by spilling to the local frame instead of the home space. But the compiler doesn't know whether this function will ever be called with fewer parameters than prototyped, so it effectively could never use the home space at all. After all, who knows, maybe the function goes like this:

int open(char *filename, int flags, int mode)
{
  if (is_auto_fail_mode()) return -1;
  ... rest of the code ...
}

and the caller cleverly arranged that is_auto_fail_mode() will return 1, and then called open() and intentionally passed no parameters at all. Under classic C, this is perfectly legal.

If you say that the caller needs to allocate home space only for the actual parameters, then the result is that the compiler can never rely on the existence of home space. Which kind of renders home space useless.

Variadic functions are in a similar boat. If home space were guaranteed only for actual parameters, then variadic functions would not know whether any of the inbound register parameters are spillable. They would have to spill them into the local frame, but that makes walking the variable parameter list more cumbersome because the parameters are no longer contiguous in memory. You have the four locally-spilled parameters, followed by the function exception state, the frame pointer, the return address, and then the parameters beyond the fourth.
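
To see why contiguity matters, here is a toy model in C of what the guaranteed home space buys a variadic function. It is a sketch, not how any real compiler implements va_arg: the stack is modeled as an array of 8-byte slots, and demo_spill_home and demo_next_arg are hypothetical names. Slot 0 stands for the return address, slots 1 through 4 for the home space, and slots 5 onward for the parameters the caller put on the stack.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the stack on entry to the callee:
   slot 0 = return address, slots 1..4 = home space,
   slots 5.. = parameters beyond the fourth. */

/* Spill the four register parameters into the home space and
   return a cursor pointing at parameter 1. */
uint64_t *demo_spill_home(uint64_t *entry_rsp,
                          uint64_t rcx, uint64_t rdx,
                          uint64_t r8, uint64_t r9)
{
    assert(entry_rsp != 0);
    entry_rsp[1] = rcx;
    entry_rsp[2] = rdx;
    entry_rsp[3] = r8;
    entry_rsp[4] = r9;
    return &entry_rsp[1];
}

/* Fetch the next 8-byte parameter and advance the cursor.
   No special case is needed when crossing from parameter 4 to 5. */
uint64_t demo_next_arg(uint64_t **cursor)
{
    return *(*cursor)++;
}
```

Once the registers are spilled, walking the parameters is a plain pointer increment, and the return address in slot 0 is never touched.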

I guess you could hack around this by changing the prologue of variadic functions to go something like this:

    sub     rsp, N+32       ; make room for locals plus four spill slots
    mov     rax, [rsp+N+32] ; recover return address
    mov     [rsp+N+32], r9  ; spill parameter 4 (if it exists)
    mov     [rsp+N+24], r8  ; spill parameter 3 (if it exists)
    mov     [rsp+N+16], rdx ; spill parameter 2 (if it exists)
    mov     [rsp+N+ 8], rcx ; spill parameter 1 (if it exists)
    mov     [rsp+N   ], rax ; restore return address
    .. rest of standard prologue to build the frame and stuff ..

and the function epilogue would go

    add     rsp, N
    ret     32

to clean up the four parameters that got rewritten on the stack.

This would fall into the "pay for play" category, where the ugliness is localized only to the people who need the ugly thing. It would make stack walking harder, but stack walking is relatively infrequent, so making it harder isn't that big of a deal, as long as it isn't rendered impossible.

But wait, why is reserving the full four home spaces so distasteful in the first place? Maybe it's because people consider it wasteful to allocate memory that isn't being used.

But is it really going unused?

No, it's not going unused. The whole point of the original article was to say that the four home spaces aren't actually required to be used for spilling parameters. A function is welcome to treat them as "hey look, free memory," and in practice, that's what most of them do. They will spill rbx and rsi into those spaces rather than spilling the actual parameters. So the memory isn't going to waste.

Requiring the full four home spaces to be preallocated also saves you the trouble of having to keep realigning your stack on a 16-byte boundary for each function call. The value 32 is a multiple of 16, so whatever alignment you already performed is unaffected by subtracting another 32. On the other hand, if you had callee-clean and variable-sized home space, you would have to push an extra dummy parameter for half of the functions (the ones with an odd number of parameters), just so that the stack stayed aligned.

    ; about to call a function that takes only 1 parameter.
    ; need to push a dummy parameter to keep the stack 16-byte aligned.
    push    0
    push    rax
    call    f
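
The arithmetic behind that dummy push can be written out in a few lines of C. This is only a sketch of the hypothetical variable-sized scheme, assuming 8-byte parameter slots and a stack that is 16-byte aligned before the pushes begin. The function names are invented for the example.

```c
#include <assert.h>

enum { SLOT_BYTES = 8, STACK_ALIGN = 16, FIXED_HOME_BYTES = 32 };

/* Hypothetical variable-sized scheme: how many dummy slots must be
   pushed so that param_count pushes leave the stack 16-byte aligned? */
int demo_dummy_slots(int param_count)
{
    assert(param_count >= 0);
    int misalign = (param_count * SLOT_BYTES) % STACK_ALIGN;
    return misalign ? (STACK_ALIGN - misalign) / SLOT_BYTES : 0;
}

/* Fixed scheme: reserving 32 bytes never disturbs 16-byte alignment. */
int demo_fixed_home_keeps_alignment(void)
{
    return FIXED_HOME_BYTES % STACK_ALIGN == 0;
}
```

Every odd parameter count costs one dummy slot, whereas the fixed 32 bytes never costs any.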

But you can solve that problem by saying that functions are always caller-clean. That way, you just set up your register spill space once, and then you keep reusing it. It also has the nice property that local variables remain at the same offset relative to the stack pointer for the lifetime of the function.

Or maybe the objection is that it creates more work for the caller, having to allocate 32 extra bytes of memory.

Except that in practice, it's not any extra work at all. As we noted, the x64 calling convention is caller-clean, which means that the space for parameters gets reused from function call to function call. You merely allocate the space in your prologue and it's good for the whole function. And you already had to reserve space on the stack in the function prologue when you did a sub rsp, N; you just need to bump it up to sub rsp, N+32. This is arithmetic done at compile time, so there is no additional runtime cost. (It's not like the CPU goes faster if the subtrahend is smaller.)
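
Here is roughly what that compile-time arithmetic looks like, as a simplified model in C. It assumes the only things in the frame are locals (a multiple of 8 bytes), some pushed nonvolatile registers, and the outgoing parameter area. Real compilers account for more than this, and demo_frame_bytes is a made-up name.

```c
#include <assert.h>

enum { SLOT_BYTES = 8, HOME_SLOTS = 4 };

/* Simplified model: bytes to subtract from rsp in the prologue.
   local_bytes:         size of locals, assumed a multiple of 8
   max_outgoing_params: most parameters passed to any callee
   pushed_regs:         8-byte pushes done before the sub */
unsigned demo_frame_bytes(unsigned local_bytes,
                          unsigned max_outgoing_params,
                          unsigned pushed_regs)
{
    assert(local_bytes % SLOT_BYTES == 0);
    /* The outgoing area is never smaller than the four home slots. */
    unsigned slots = max_outgoing_params > HOME_SLOTS
                         ? max_outgoing_params : HOME_SLOTS;
    unsigned total = local_bytes + slots * SLOT_BYTES;
    /* On entry, rsp is 8 past a 16-byte boundary because the call
       instruction pushed the return address. */
    unsigned used = SLOT_BYTES + pushed_regs * SLOT_BYTES + total;
    if (used % 16 != 0) {
        total += 16 - used % 16; /* pad so calls see an aligned rsp */
    }
    return total;
}
```

In this model, a function with no locals that calls only functions taking four or fewer parameters gets 40 bytes, and the answer is the same whether its callees take zero parameters or four. Either way it is a single constant baked into the prologue.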

The upshot of this is that trying to minimize parameter home space doesn't really save you anything. Memory isn't being saved, because the called function was going to use the extra space anyway. Execution time isn't being saved, because the stack pointer adjustment is already being done; you're just changing the amount of the adjustment, which has no incremental cost. You're creating more work for the compiler, because it now needs to keep track of the maximum number of parameters passed to any function called by this function. (Mind you, that's not a big cost anyway. Itanium relied on the compiler doing this work.) And you're creating more work for the function being called, because it has to be careful not to spill any registers which might not correspond to actual parameters (information that is not available at compile time).

So reserving home space for all four parameters (even if the caller passes fewer than four actual parameters) makes some people slightly happier (variadic functions), a lot of people significantly happier (classic C code), and adds zero additional cost. Free money.

mikeb says:

June 23, 2016 at 11:18 am

Very nice write up. I particularly like the perfect summary at the end.

Myria says:

June 23, 2016 at 11:37 am

How does va_arg work on the Windows x86-64 calling convention?

skSdnW says:

June 28, 2016 at 2:50 pm

It is probably implementation defined. I think it would be legal (but crazy) to not spill the registers to the space provided to you but since C99 has va_copy you have to save those registers anyway so why make it more complicated than you have to?

At the ABI level the only gotcha is that floats must be passed in XMM* _and_ R* by the caller.

asdf says:

June 23, 2016 at 3:56 pm

I’ve always wondered why the MIPS O32 ABI reserves stack space for its four register parameters, and I guess the reasons were the same. That was long before ANSI C, but is there really still enough prototype-less code around that it should be a concern when designing a modern ABI?

Raymond Chen - MSFT says:

June 23, 2016 at 5:00 pm

My guess is that there’s a lot of scientific code that falls into this category. Scientific code is often really old.

Evan says:

June 24, 2016 at 6:59 am

There’s probably a lot of accidental code like that too. I don’t know what MSVC allows when compiling in C mode, but at least GCC will let you call functions without them having been declared; the calls (I think) will work as long as you call them with the right types. It gives you a warning, but how many projects have tons of warnings?

Azarien says:

June 25, 2016 at 4:09 am

I don’t like the x64 calling convention because it makes writing assembly code by hand too hard. And it’s very, very hard to do it right with proper exception support.

John Doe says:

June 27, 2016 at 9:54 am

I still don’t like it. In practice, 32 bytes in every function is going to translate to garbage most of the time.

See what I did there? Anecdote, but so is your claim, or your claim’s source.

Show me the data.

Despite the lack of more testing, there’s someone agreeing with actual observations (note: a bit old, it doesn’t compare with optimized code, etc): http://www.viva64.com/en/b/0069/

So, Windows 64-bit ABI is basically optimizing for printf, or similar, to be able to spill the 4 registers.

It could be so that you’d require the caller to fill the shadow space with the actual 4 arguments for varargs (not for prototypeless), so they’d be automatically spilled (which they most probably will be at some point), but no.

Caller: here’s 32 bytes for you, my dear friend.
(read: they made me do it)
Callee: how kind of you to have thought out of my needs
(read 1: what made them think I’d really need or want those bytes, gruh!)
(read vararg func: brainz! [ zombie drooling and chewing ])
(read unprototyped function: what? where? ah, )

smf says:

June 28, 2016 at 3:59 am

Windows 64 bit ABI isn’t just optimizing for printf. They chose the best balance to cope with all situations, while only supporting one calling convention.

Address space is 4 billion times cheaper on 64 bit than 32 bit while the cost of executing instructions is roughly the same.

John Doe says:

June 30, 2016 at 8:13 am

The memory point is moot if you have a single memory-monster application.

If you have several applications, you gain substantial memory by running them in 32-bit. More so with managed environments that don’t compress object pointers. Or if your application has recursive algorithms. Or if…

Well, I could go on and on, but you should start to see there are obviously more losing points about using 64-bit than 32-bit in terms of memory usage. In practice, you can count on needing at least double the memory when transitioning from 32-bit to 64-bit. Having 32 bytes on every stack, optimized or not, ends up counting further against the 64-bit memory consumption myth.

xcomcmdr says:

June 29, 2016 at 5:29 am

Your link says this :
“You should not worry if your 64-bit program consumes more stack memory. There is much more physical memory in 64-bit systems. The stack with the size 2 Mbytes on a 64-bit system with 8 Gbytes of memory takes fewer percent of memory than 1 Mbyte of stack in a 32-bit system with 2 Gbytes.”

So, apart from a point of view of “pureness”, why do you care so much ?
And how would you support variadic functions otherwise ?

Date:	June 23, 2016 / year-entry #131
Tags:	code
Orig Link:	https://blogs.msdn.microsoft.com/oldnewthing/20160623-00/?p=93735
Comments:	11
Summary:	Variadic functions.