Date: | December 16, 2004 / year-entry #425 |
Tags: | other |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20041216-00/?p=36973 |
Comments: | 25 |
Summary: | Anybody who's done intensive optimization knows that optimization is often counter-intuitive. Things you think would be faster often aren't. Consider, for example, the exercise of obtaining the current instruction pointer. There's the naïve solution: __declspec(noinline) void *GetCurrentAddress() { return _ReturnAddress(); } ... void *currentInstruction = GetCurrentAddress(); If you look at the disassembly, you'll get something... |
Anybody who's done intensive optimization knows that optimization is often counter-intuitive. Things you think would be faster often aren't.

Consider, for example, the exercise of obtaining the current instruction pointer. There's the naïve solution:

__declspec(noinline)
void *GetCurrentAddress()
{
  return _ReturnAddress();
}

...
void *currentInstruction = GetCurrentAddress();

If you look at the disassembly, you'll get something like this:

GetCurrentAddress:
    mov eax, [esp]
    ret

...
    call GetCurrentAddress
    mov [currentInstruction], eax

"Pah," you say to yourself, "look at how inefficient that is. I can reduce that to two instructions. Watch:

void *currentInstruction;
__asm {
    call L1
L1: pop currentInstruction
}

That's half the instruction count of your bloated girly-code."

But if you sit down and race the two code sequences, you'll find that the function-call version is faster by a factor of two! How can that be?

The reason is the "hidden variables" inside the processor. All modern processors contain much more state than you can see from the instruction sequence. There are TLBs, L1 and L2 caches, all sorts of stuff that you can't see. The hidden variable that is important here is the return address predictor.
The more recent Pentium (and I believe also Athlon) processors maintain an internal stack that is updated by each CALL and RET instruction. When a CALL executes, the return address is pushed both onto the actual stack (the one the ESP register points to) and onto the internal return address predictor stack; a RET pops the top address off the return address predictor stack as well as off the actual stack.

The return address predictor stack is used when the processor decodes a RET instruction. It looks at the top of its predictor stack and says, "I bet that RET is going to return to that address," and it begins speculatively executing the instructions at that address. Since programs rarely tamper with return addresses on the stack, this prediction is nearly always right.
That's why the "optimization" turns out to be slower.
Let's say that at the point of the CALL L1 instruction, the return address predictor stack and the actual stack look like this:

Return address predictor stack:  caller1 -> caller2 -> caller3 -> ...
Actual stack:                    caller1 -> caller2 -> caller3 -> ...

Here, caller1 is the address your code will return to when it is finished, caller2 is the address your caller will return to, and so on. So far the predictor stack agrees perfectly with the actual stack. (I've drawn the actual stack beneath the predictor stack so you can see that they match.)

Now you execute the CALL instruction. The return address L1 gets pushed onto both stacks:

Return address predictor stack:  L1 -> caller1 -> caller2 -> caller3 -> ...
Actual stack:                    L1 -> caller1 -> caller2 -> caller3 -> ...

But instead of executing a RET instruction, you pop the return address off with a POP. That removes L1 from the actual stack, but the return address predictor stack is none the wiser:

Return address predictor stack:  L1 -> caller1 -> caller2 -> caller3 -> ...
Actual stack:                    caller1 -> caller2 -> caller3 -> caller4 -> ...
I think you can see where this is going.
Eventually your function returns. The processor decodes your RET instruction, consults its return address predictor stack, and says, "I bet this RET is going to return to L1," and it starts speculatively executing at L1. But oh no, the value on the top of the actual stack isn't L1 at all; it's caller1. The prediction was wrong, so the processor has to throw away all of that speculative work and start over at the correct address.
The effects of this bad guess don't end there. After the RET instruction, the two stacks look like this:

Return address predictor stack:  caller1 -> caller2 -> caller3 -> caller4 -> ...
Actual stack:                    caller2 -> caller3 -> caller4 -> caller5 -> ...

Eventually your caller returns. Again, the processor consults its return address predictor stack and speculatively executes at caller1, but your caller is really returning to caller2, so once more the speculative work has to be thrown away.

And so on. By mismatching the CALL and RET instructions, you have knocked the return address predictor stack out of step with the actual stack, and every return up the call chain from this point gets mispredicted, each misprediction discarding the work the processor had done ahead of time.
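To make that bookkeeping concrete, here is a toy model of the two stacks. It is my own illustration, not from the article: one array plays the actual stack, another plays the return address predictor stack; CALL pushes onto both, RET pops both and counts a misprediction when they disagree, and the call/pop trick pops only the actual stack.

/* Toy model of the return address predictor stack (illustration only). */
#include <stdio.h>

#define DEPTH 16

static int real_stack[DEPTH], real_top;
static int pred_stack[DEPTH], pred_top;
static int mispredictions;

static void call_insn(int return_address)
{
    real_stack[real_top++] = return_address;   /* CALL pushes onto both */
    pred_stack[pred_top++] = return_address;
}

static int ret_insn(void)
{
    int actual    = real_stack[--real_top];    /* RET pops both */
    int predicted = pred_stack[--pred_top];
    if (actual != predicted) mispredictions++; /* speculation was wrong */
    return actual;
}

static int pop_insn(void)                      /* the "call L1 / pop" trick */
{
    return real_stack[--real_top];             /* predictor stack not popped */
}

int main(void)
{
    /* caller3 calls caller2 calls caller1 calls our clever code */
    call_insn(3); call_insn(2); call_insn(1);

    call_insn(100);   /* call L1 */
    pop_insn();       /* pop currentInstruction */

    ret_insn();       /* return to caller1 - predictor said 100 */
    ret_insn();       /* return to caller2 - predictor said 1   */
    ret_insn();       /* return to caller3 - predictor said 2   */

    printf("mispredicted returns: %d\n", mispredictions);  /* prints 3 */
    return 0;
}

One bad pop, and every return up a three-deep call chain is mispredicted, which is exactly the cascade described above.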
Your peephole optimization has proven to be shortsighted.
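If you want to race the two sequences yourself, a rough harness along the following lines will do. This is a sketch of mine, not the article's original benchmark: it assumes a 32-bit x86 MSVC build (the __asm block is x86-only), and the timer choice and iteration count are arbitrary.

/* Rough harness for racing the paired CALL/RET version against call/pop. */
#include <windows.h>
#include <intrin.h>
#include <stdio.h>

#pragma intrinsic(_ReturnAddress)

__declspec(noinline) void *GetCurrentAddress(void)
{
    return _ReturnAddress();
}

__declspec(noinline) void *ViaFunctionCall(void)
{
    return GetCurrentAddress();      // CALL and RET stay paired
}

__declspec(noinline) void *ViaCallPop(void)
{
    void *address;
    __asm {
        call L1                      // pushes L1 onto the predictor stack too
    L1: pop address                  // pops the actual stack only
    }
    return address;                  // this RET is now mispredicted
}

static double Race(void *(*fn)(void), int iterations)
{
    LARGE_INTEGER freq, start, end;
    int i;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    for (i = 0; i < iterations; i++) fn();
    QueryPerformanceCounter(&end);
    return (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
}

int main(void)
{
    printf("function call: %f s\n", Race(ViaFunctionCall, 10000000));
    printf("call/pop:      %f s\n", Race(ViaCallPop,      10000000));
    return 0;
}

The interesting part is that ViaCallPop's own RET, and its caller's, are the ones that get mispredicted; the call/pop pair itself is cheap.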
Some processors expose this predictor more explicitly. The Alpha AXP, for example, has several types of control flow instructions, all of which have the same logical effect, but which hint to the processor how it should maintain its internal predictor stack. For example, the BR and BSR instructions both branch unconditionally, but BSR additionally pushes the return address onto the predictor stack, while BR leaves the predictor stack alone. Similarly, the JMP, JSR, RET, and JSR_COROUTINE instructions all jump to an address held in a register; they differ only in whether they tell the predictor to push the return address, pop one off, do both, or do neither.
Moral of the story: Just because something looks better doesn't mean that it necessarily is better.
Comments (25)
Comments are closed. |
Second moral of the story: Leave optimization to Raymond Chen and his team of rocket scientists so we, developers, can focus on higher abstraction layers. Thumbs up for .NET and J2EE.
Silly Intel processors. That should be just:
lea (next,pc),a0
next:
Quite why you’d need to do this, though, I don’t know. Normally you can just use absolute addresses and let the linker and loader adjust them as necessary. It might be useful for debug tracing, but then you’d normally want to get the caller’s address, not your own, and it’s probably not performance-critical anyway.
Ben,
http://www.insecure.org/stf/smashstack.txt
Classic article, BTW. There’s a bit where they want to get the address of some data. Since the jmp and call are both pc-relative, the code is position independent, which it needs to be.
jmp call_insn
popl_insn:
popl %esi
; esi points to the data
call_insn:
call popl_insn
data:
My first thought on reading this was "why not just mov currentip, eip". The answer is "you can’t do that on the x86; eip is hidden". Then I ask myself, "self, then what about mov currentip, $?" Then I realize that $ is just a cheap trick of syntax, to explicitly make it use rel form, and probably only works on the LHS of a mov, or inside of brackets.
So, this post boils down to "x86 silly!"
This lesson applies even more to "virtual machine" worlds, because there are way more assumptions that clever machine implementors can make about likely code paths.
For example, "everyone knows" that you should never do this
for i = 1 to array.length
…
next
Because of course that recalculates array.length every time, even if it’s unlikely to change. Instead, you should do
mylength = array.length
for i = 1 to mylength
…
next
But in the .NET world, the compiler that generates the IL and the jitter that jits the IL can be very smart when analyzing the original program, recognizing this common pattern, and emitting code that more optimally enumerates an array.
By trying to prematurely optimize the loop, there are situations in which you can actually make it worse.
T: I was confining my thinking to legitimate uses – and ones that are performance-critical!
Another reason this particular optimisation turns out slow is that you’ve done it in C (or C++) with embedded assembly … AFAIR the compiler won’t generate optimal code around your clever segment there, hence the slowdown.
—
and why does your inline asm use "call" ?
Couldn’t you just do:
__asm {
iwanteip:
lea eax, iwanteip
ret
}
—
Eric is right about optimisation in JIT’d languages though … especially ones like java’s "HotSpot".
I found an interesting issue when I was working on MSLU, where I wanted to make sure that if somebody called the wrapper APIs and they were on NT that it would call right through to the operating system. My initial attempt was:
int __stdcall
AddFontResourceW(LPCWSTR lpszFilename)
{
if(IS_ON_NT())
{
__asm jmp AddFontResourceW
but it turned out that 500+ APIs like this were not only slower but also much fatter in the final DLL size than
int __stdcall
AddFontResourceW(LPCWSTR lpszFilename)
{
if(IS_ON_NT())
{
return AddFontResourceW(lpszFilename);
It made me almost retire from trying to be clever with assembly language, wondering how a jmp could be slower, until I looked at the disassembly and noticed that the plain call was being optimized to the jmp anyway, while the inline ASM caused other optimizations to not happen.
"Couldn’t you just do ‘lea eax, iwanteip’" – Sure you can do that, but the presence of "ret" suggests that you’re writing the implementation of GetReturnAddress(), which returns not the return address but rather &GetReturnAddress itself, which isn’t very useful.
If you meant it to be inlined into the caller function, then you run into the problem of inline assembly disabling a large class of optimizations…
"Hidden Variables". Very nice :-)
Are you now into Quantum mechanics, Raymond?
On x86-64, I think you can do
lea rax, [rip]
Also, RIP relative is the default addressing mode – it’s shorter than encoding 64 bit absolute addresses in the instruction, so presumably the loader won’t have to do many relocations compared to x86.
That’s actually quite an advantage, because it means that pages containing code can be shared between two processes, even if they are loaded at different addresses.
Wow, reading this reminds me of Michael Abrash who once said – ‘Measure first, then optimize’ – The Zen of Code Optimization
Thanks a lot Raymond. I’ve written a 68k emulator which was originally targeted at the 486 and therefore was written in pure assembler with every trick in the books to save a few cycles. Many of those optimizations have been removed in due time, but the main loop still looked something like this:
Loop: PUSH ReturnAddress
Decode command
JMP Command
Commands return with RTS. Most of the time ReturnAddress pointed to the loop, but if e.g. an interrupt had to be simulated it pointed to the IRQ handling code. With your entry you got me thinking and lo and behold, the new version
Loop: Decode command
CALL command
JMP ReturnAddress
increases the overall speed on a P4 by 30%! Pretty amazing for a change of less than 10 bytes…
Gerd Riesselmann » Optimization, Templates And The Rest
Re: Eric Lippert:
The problem is that the optimization isn’t just an optimization, it’s a change in semantics. If the compiler can’t prove that array.length won’t change in the body of the loop, it has to assume it may, and so it cannot make the optimization you’re suggesting.
What’s more interesting is that actually, based on GrantRi’s blog, it’s safer to always do the length capture since the newer optimizers realize when it would be better to re-read it as long as the value can’t be seen to be changed on the current thread.
Really, this is why a "for each" type of construct is best: it makes the question unambiguous.
(What’s interesting is when the optimizer starts to be free to /not/ capture values when asked to; implementing the necessary logic for capturing client-supplied data when crossing security boundaries gets hard.)
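A C analogue of that semantic point (my example, not the commenters’): the length computation can only be hoisted if the compiler can prove the loop body never changes it, and writing through the same buffer is exactly the kind of thing that makes that proof hard.

/* strlen(s) must be re-evaluated each iteration unless the compiler can
   prove the stores in the body never change the string's length. */
#include <string.h>
#include <stdio.h>

static void UppercaseInPlace(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++) {      /* length recomputed (in principle) */
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 'a' - 'A';             /* writes through s */
    }
}

/* Capturing the length is only equivalent if the body cannot shorten s;
   here it cannot, so the hoisted version means the same thing and is
   trivially linear. */
static void UppercaseInPlaceHoisted(char *s)
{
    size_t i, length = strlen(s);
    for (i = 0; i < length; i++) {
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 'a' - 'A';
    }
}

int main(void)
{
    char a[] = "hello", b[] = "world";
    UppercaseInPlace(a);
    UppercaseInPlaceHoisted(b);
    printf("%s %s\n", a, b);
    return 0;
}

In the first version the compiler must either re-read the length or prove the stores cannot move the terminator; in the second, you have made that decision for it, and also changed what the code means if the body ever could shorten the string.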
My first exposure to Raymond’s general principle was back at Digital working on the VAX. Too much time was being spent in the heap allocator so I was trying to optimize the search of the free list when the allocation was too large to be on one of the lookaside lists. I used assembly language to make the function shorter, use fewer registers (and therefore preserve fewer registers on calls) etc. etc. and my implementation ended up being like twice as slow as what the VAX C compiler generated.
I gave up trying to outguess the Digital optimizers then. The MSFT optimizers are almost to that point now and between LTCG and a new generation of optimizers coming up, it’s even more important to specify the semantics you want and trust the optimizer.
(IMO, the next generation of problem is going to be code size optimizations due to proliferation of templates/generics…)
12/18/2004 8:31 PM Michael Grier [MSFT]
> I used assembly language to make the
> function shorter, use fewer registers (and
> therefore preserve fewer registers on calls)
> etc. etc. and my implementation ended up
> being like twice as slow as what the VAX C
> compiler generated.
Did you find out why? I’ve read that some of the emulated instructions were slower to execute using the built-in hardware emulator than using a hand-coded sequence of simpler and faster instructions. If you were coding before that discovery (if that’s possible), and if the VAX C compiler generated simpler code that just happened to be faster because of that reason, then you don’t need to worry about outguessing. Otherwise I’d be studying the generated code to see what I had done wrong.
The second-most-recent time I had to code assembly language was log2 and exp2 functions for a device driver for x86 Linux. gcc can be told to avoid using floating point instructions but Linux kernel experts said that would be a bizarre approach. It turns out that actual floating point operations can be used in the x86 Linux kernel if preceded by some special macro (in C/gcc/Linux kernel) and followed by some other special macro (same situation). So then I could either call C’s built-in functions for log and exp and do the appropriate scaling, or write my own assembly language. But even though I had not told gcc to avoid using floating point instructions or to use an emulator or anything like that, it generated function calls. The x86 assembly language for log2 is a single instruction, though exp2 requires a ton of scaling before and after its central instruction. So I coded asm’s. I sure hope I wasn’t slower than gcc’s function calls.
But then we had to port it to MIPS. Floating point in the MIPS Linux kernel is never allowed. Fortunately we could precompute the results for a known set of values and dispense with run-time floating point entirely.
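For what it’s worth, a minimal sketch of that precompute-and-look-up approach, under an assumption of mine that the driver only ever needs log2 of a small, known set of inputs; the table is in 16.16 fixed point, so nothing in the runtime path touches floating point.

/* Table generated ahead of time (offline or by a host-side tool):
   log2(1..8) in 16.16 fixed point. */
#include <stdio.h>

static const unsigned int log2_fixed[] = {
    0x00000000, /* log2(1) = 0.0      */
    0x00010000, /* log2(2) = 1.0      */
    0x000195C0, /* log2(3) ~ 1.58496  */
    0x00020000, /* log2(4) = 2.0      */
    0x0002526A, /* log2(5) ~ 2.32193  */
    0x000295C0, /* log2(6) ~ 2.58496  */
    0x0002CEAF, /* log2(7) ~ 2.80735  */
    0x00030000, /* log2(8) = 3.0      */
};

/* Runtime lookup: integer-only, so it is safe where FP is forbidden. */
static unsigned int Log2Fixed(unsigned int value)
{
    return log2_fixed[value - 1];
}

int main(void)
{
    printf("log2(5) in 16.16 fixed point: 0x%08X\n", Log2Fixed(5));
    return 0;
}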
I was afraid I was going to have to repeat that with trig functions, but they turned out to be unnecessary.
(As for why that had to be done in the device driver, one famous foreign government prohibits exposing certain functionality to control by end users.)
The third-most-recent time I had to code assembly language was also for x86. That time I grabbed a couple of crutches, unnecessary but very convenient: I looked at the assembly language generated by the compiler, figured out how to cut out about three quarters of the instructions, and the result really was four times as fast. That was about 20 years ago though.
(The absolute most recent time I had to code assembly language was for a Hitachi chip. Hitachi’s C and C++ compilers don’t provide all of the necessary choices regarding which registers to push and pop in interrupt handlers, so I didn’t have any competition for speed on that one ^_^)
Sometimes it’s all about paging.
for i = 1 to array.length
…
next
is only faster for arrays. If you use an ArrayList, then it’s faster to cache the length in a stack variable. (And a for… statement’s better than foreach, because foreach creates a hidden iterator.)
Way back in the ’80s Creative Computing had a column by one of the names in game design. One of his rules was to precompute everything (especially bitmaps that’d have to be shifted) because memory was fast, and processors were slow.
Optimizing C code is a bit like designing a web page – you never know what kind of browser the page will be viewed with. You may compile your C code on a different platform. Same is true for assembly code. Much of the old 8086 code ran slower on the 286 because of "optimizations" that were tuned to the clock cycle counts of the processor the code was originally written for.
The classic optimization problem in C can be demonstrated by comparing recursive code to iterative code. For example, write the factorial function using recursion, then compare that to the factorial function using iteration. The recursive function only has 3 or 4 lines of code. You’d think that it was WAY faster than the iterative code, which is several lines longer and also has a loop variable. What a surprise for the CS freshman!
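A minimal version of that comparison (names and types are mine): the recursive factorial is shorter in source, but every level pays for a CALL/RET pair and a stack frame, so the slightly longer iterative loop usually wins.

#include <stdio.h>

/* Shorter in source, but one call frame per level of recursion. */
unsigned long long FactorialRecursive(unsigned n)
{
    return (n <= 1) ? 1 : n * FactorialRecursive(n - 1);
}

/* Longer in source, but a single frame and a tight loop. */
unsigned long long FactorialIterative(unsigned n)
{
    unsigned long long result = 1;
    unsigned i;
    for (i = 2; i <= n; i++)
        result *= i;
    return result;
}

int main(void)
{
    printf("%llu %llu\n", FactorialRecursive(20), FactorialIterative(20));
    return 0;
}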
"Optimize late"
Why is that so hard? I think sometimes that we coders are more eager to show off our "l33t" CS101 skills and hand-code some ASM than to trust that Someone Else (eg, compiler writers) Might Actually Know What They’re Doing.
Of course, once you’ve written your naive version, profiled it, and determined exactly where the bottlenecks are, that’s a completely different situation. Go to town, with the blessings of all on this thread :)