Date: August 17, 2007 / year-entry #303
Tags: code
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20070817-00/?p=25533
Comments: 31
Summary: Prerequisites: Basic understanding of assembly language. When you debug through some managed code at the assembly level, you may find that there are an awful lot of nop instructions scattered throughout your method. What are they doing there; isn't the JIT smart enough to remove them? Isn't this going to slow down execution of my...
Prerequisites: Basic understanding of assembly language.

When you debug through some managed code at the assembly level, you may find that there are an awful lot of nop instructions scattered throughout your method. What are they doing there; isn't the JIT smart enough to remove them? Isn't this going to slow down execution of my program?

It is my understanding that¹ this is done for the benefit of the debugger: the nops give the debugger an instruction to map a breakpoint onto for source lines (such as a lone curly brace) that wouldn't otherwise generate any code.

Don't worry. If there's no debugger, the JIT won't generate the dummy nops.

Nitpicker's Corner

¹As with all statements of alleged fact, this statement is an interpretation of events based on observation and thought and does not establish a statement of the official position of the CLR JIT compiler team or Microsoft Corporation, and that interpretation may ultimately prove incorrect.
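A hedged illustration of that claim (my own sketch, not from the original post): a line that generates no code of its own, such as an opening or closing brace, still needs an instruction for the debugger to map a breakpoint to, and that is the slot a nop can fill.

// Illustrative sketch only. With a debug build and a debugger attached, the JIT
// can leave a nop at positions like the braces below so that a breakpoint set
// there has a real instruction to land on.
class Sample
{
    int total;

    void Accumulate(int value)
    {                       // breakpoint on the opening brace -> nop
        total += value;     // real code
    }                       // breakpoint on the closing brace -> nop
}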
Comments (31)
Comments are closed.
My guess is it has very little to no impact on execution time. It's been a while since I studied them, but I think most processors have an out-of-order execution scheduler that intercepts all instructions and decides when and where they will be executed. It also throws away any nop it finds and even adds some if it feels the rest of the processor can't handle a particular order of instructions. The impact would then only be on size and loading time.
As it’s not always possible to tell if you’ve turned certain optimizations on, a few extra NOPs would come in handy when generating void (or ignored result) calls with debugging information, so that the stack frame still points to the line of the call.
mccoyn — mostly true, but not entirely. There will be an impact on more than just the size of the code and loading time: the L1 code cache will have to store the NOPs, since (AFAIK anyway) L1 caches the "real" code bytes, not the microcode. Since L1 is storing the NOPs, it’ll have to push out some other useful code.
And when memory is so much slower than the processor, and getting worse, the size of your code (meaning the amount that you can get done without missing in the cache) is getting to be more important than the speed of your code.
Of course these NOPs only happen when you have a debugger attached, so it doesn't matter anyway. But in general, anything that affects size is also going to affect speed, via the cache. Out-of-order and speculative execution or not, the cost of an actual executed NOP is tiny compared to waiting for main memory because you missed in both the L1 and L2 caches.
(An exclusive cache has a slight edge in this case, since there’s a slightly higher chance that the code is still in the L2 cache (because the contents of the L1 cache aren’t also taking up space in L2). But even waiting for the L2 cache will take more than one clock cycle, I think, and a NOP only takes one clock to execute.)
In spite of your disclaimer, I’m still going to assume that this is Microsoft’s official position. In addition, I will be filing a bug report with the CLR JIT compiler team and referencing this blog in it. Good day, sir.
Although this is generally the case, it is not strictly true. On the P4 for instance, the L1 cache (the ‘trace cache’) stored decoded instructions.
But it does not really matter either way I guess, as these NOPs are certainly not even close to being the main performance bottleneck when running your application under a debugger.
For the x64 and ia64 2.0 CLR JIT, it also inserts NOPs to align loops (some processors execute a backwards branch 'faster' if it is on a 16-byte boundary), and to accommodate certain unwind semantics (so you'll see them between calls and EH boundaries, like the end of a try body). Again this is not guaranteed behavior, or the official MS position, etc., it's just an FYI for the curious.
–Grant
FYI, I believe the canonical example for nop operation usage is placing a breakpoint on a curly brace.
It must be CLR week over at The Old New Thing because it’s been non-stop posts about C# lately. Raymond’s
OK, why would you bother to allow someone to put a breakpoint on a curly brace? As far as I can tell, that would have the same effect as having a breakpoint on a blank line. You can’t put breakpoints on blank lines, can you?
Assuming you can't put breakpoints on blank lines, can anyone figure out what advantage putting a breakpoint on a curly brace gives you?
Hmmmm…..given:
if (x) {
foo();
}
would you expect a breakpoint on the "}" to fire if x was true, if x was not true, or both?
I frequently want to set a breakpoint AFTER something happens, not before. How do you do that if the "something happens" is the LAST thing that happens?
void M() {
this.x = Blah();
}
How are you going to inspect the value of this.x unless you can put a breakpoint AFTER the call to Blah()?
That’s why we let you put a breakpoint on the curly.
And that of course answers your question. The breakpoint would be hit iff x is true, because the curly "runs" after the call to foo().
"Don’t worry. If there’s no debugger, the JIT won’t generate the dummy nops."
Sure, and the fridge light really goes out when you shut the door… ;)
But seriously, Raymond, thanks for continuing to blog in the face of a-holes, and flinging a little poo at them while you’re at it.
nop is also emitted when you override a virtual method and provide no implementation.
eg:
.method public hidebysig virtual instance void
Test(class ThunderMain.Tree.Node opNode,
class ThunderMain.Tree.Node opTestNode,
class ThunderMain.Tree.Preferences opPrefs) il managed
{
// Code size 2 (0x2)
.maxstack 8
IL_0000: nop
IL_0001: ret
} // end of method FormatAlgorithm$NullAlgorithm::Test
In reference to my previous comment – I was talking about the nop at the IL level, not the x86 level I suspect you were talking about…
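For reference, here is a minimal C# shape that could produce IL like the listing above. This is a hedged reconstruction, not the commenter's actual source: the ThunderMain.Tree types are replaced with placeholder classes, and it assumes a debug (non-optimized) build.

// Placeholder stand-ins for the ThunderMain.Tree types in the listing.
public class Node { }
public class Preferences { }

public class FormatAlgorithm
{
    public virtual void Test(Node opNode, Node opTestNode, Preferences opPrefs) { }
}

public class NullAlgorithm : FormatAlgorithm
{
    // Empty override: a debug build compiles the body to "nop; ret",
    // matching the two-byte method above; a release build emits just "ret".
    public override void Test(Node opNode, Node opTestNode, Preferences opPrefs)
    {
        // intentionally empty
    }
}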
There are some other places where these NOPs are valuable.
Suppose you have a method call with an assignment:
x = F();
If you step IN to F(), and then step OUT, the current statement marker will be on this statement. That’s because this statement still has work to do – assign the result to x.
If you don’t assign the result:
F();
then upon stepping OUT the current statement marker will be on the next statement. That can be confusing. By adding the NOPs we get consistent behavior between the two.
(Note that this is a generalization – specific cases may vary).
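A small self-contained sketch of the two cases described above (the names are invented for illustration):

// Hypothetical example of the stepping behavior described above.
class StepDemo
{
    int x;

    int F() { return 42; }

    void M()
    {
        x = F();    // step in, then step out: the marker comes back to this
                    // line, because the assignment to x still has to run
        F();        // step in, then step out: without a nop for this statement,
                    // the marker would already sit on the next line
        System.Console.WriteLine(x);
    }

    static void Main() { new StepDemo().M(); }
}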
You can hide a videocam inside the fridge.
Or, you can attach a debugger after the JIT has generated code for a method :)
For our Eiffel implementation we had to do the same, but we learned it by trial and error. Indeed, the PDB format does not like it when you set several breakpoints at the same location. So now we generate a nop for each of those breakpoints and it works.
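To make that concrete: a rough sketch of the same idea using the old .NET Framework Reflection.Emit path with ILGenerator.MarkSequencePoint (this is my guess at the technique, not the Eiffel compiler's actual code, and the API shown exists only on the desktop framework):

// Sketch only: give each source location that needs its own breakpoint a
// distinct sequence point followed by a nop, so no two breakpoints share an
// IL offset in the generated PDB. .NET Framework only.
using System;
using System.Diagnostics.SymbolStore;
using System.Reflection;
using System.Reflection.Emit;

class SequencePointSketch
{
    static void Main()
    {
        var asm = AppDomain.CurrentDomain.DefineDynamicAssembly(
            new AssemblyName("BreakpointDemo"), AssemblyBuilderAccess.RunAndSave);
        // 'true' asks for symbol (PDB) information to be emitted.
        ModuleBuilder module = asm.DefineDynamicModule(
            "BreakpointDemo", "BreakpointDemo.dll", true);
        ISymbolDocumentWriter doc = module.DefineDocument(
            "demo.e", Guid.Empty, Guid.Empty, Guid.Empty);

        TypeBuilder type = module.DefineType("Demo", TypeAttributes.Public);
        MethodBuilder method = type.DefineMethod("Run",
            MethodAttributes.Public | MethodAttributes.Static,
            typeof(void), Type.EmptyTypes);
        ILGenerator il = method.GetILGenerator();

        // Three source lines that produce no code of their own: each gets its
        // own sequence point plus a nop to hang a breakpoint on.
        for (int line = 1; line <= 3; line++)
        {
            il.MarkSequencePoint(doc, line, 1, line, 2);
            il.Emit(OpCodes.Nop);
        }
        il.Emit(OpCodes.Ret);

        type.CreateType();
        asm.Save("BreakpointDemo.dll");
    }
}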
Sometimes people forget how advanced out-of-order execution is in today's chips. For example, this code runs at 1 clock (!!) per loop iteration on a Core 2.
LOOP:
mov eax, [esi+ecx]   ; load (ecx is assumed to hold a negative byte offset)
mov [ecx+edi*1], ebx ; store
add ecx, 4           ; walk the offset up toward zero
js LOOP              ; loop while the offset is still negative
Source: microarchitecture.pdf by Agner Fog (google for it)
My understanding of the way that the debugger works could easily be squiffy; I just don't know in any official capacity how it works. Anyhoo.
Surely any instruction that you execute relates directly to some discretely identifiable description, or line, or part thereof, that one specified to a compiler.
If you’re going to relate an instruction or more likely a group of instructions to a described line of code, then there must be a *mapping* of some sort that links instructions and described lines of code. I assume that the debugger operates on the basis of hardware traps.
I assume that the purpose of .NET is to allow platform independence. That is to say that the code one compiles is translated into native machine code appropriate to the target platform, at run time. For the sake of this text I'll refer to what the compiler produces as CLR, but I'm not sure about that. The runtime process of translation between CLR and platform machine code I'll refer to as JIT, but again I'm not certain what it should be called.
Is this article suggesting that the mapping is generated from the CLR code, rather than the platform machine code?
The only thing that makes sense to me is that the optimisation process is unable to move traps, because if it did, the mapping would no longer be coherent. Presumably the JIT produces code that is smaller than expected at the time the CLR was generated, and consequently the nops have to be filled in to make up the space and maintain the mapping coherence.
The thing that I wonder is what happens when compiled code is retargeted to a different instruction set, which is what I thought .NET was about. What happens, say, when the target platform requires more instructions to complete the line than was specified in the original CLR code?
Surely the JIT must be a strict lookthrough for the debugger, for good design? It appears that the debugger is looking around the JIT. Does this mean that you have to rebuild the CLR for each platform that you wish to debug on?
Wouldn’t it be better for the debugger to sit completely behind the JIT, and have the JIT set real traps and generate mappings into the CLR? The debugger could then operate on its own mapping into the CLR, and pseudo traps that it could receive from the JIT.
It really must be a bit of a boiler. If I am right, is this just a question of time, money and resources, or is it intended to work that way?
If you have one of Raymond’s favourites, an object with a nontrivial destructor going out of scope, then the destructor often appears to execute on the close curly of that scope, thus enabling you to breakpoint on it.
St. Thomas.
Sure, you could put a camera in your fridge… but then you’d only be sure that the light is off *when there is a camera*.
Obviously the solution is to shut yourself into the fridge and observe it directly.
Schroedinger etc.
Maybe there are only NOPs because you observe the code, and there are none when you are not looking for them ;)
(Eric, you being in the fridge would change the experiment, thus invalidating the results ;)
Uhh… IIRC:
K7s (Athlons) and derivatives will kill NOPs at the decoding stage at a max of 3/cycle.
Intel microarchitectures before Core2 process NOPs and they must use one of the Integer/Logic/Float execution ports (P1 or P2), so they "pollute the internal buffers" and have a max throughput of 2/cycle.
Core2 has 3 I/L/F ports so I’d expect it to eat 3 NOPs per cycle.
@Hum:
"I assume that the purpose of .NET is to allow platform independence."
That’s your assumption. I’m pretty sure it’s never been stated as an assumption or goal of the platform. Rather, it is a side effect (the .NET platform can serve as a programming interface that, if implemented on other targets, can provide platform independence).
I would say some of the stated goals include a common type system and easier cross-language compatibility. Also included is a managed runtime intended to improve performance in memory management (and maybe, as a side effect, limit memory-based bugs like buffer overruns). It could be theorized that eventually the JIT, when it knows more about the hardware than the coder, could produce more specialized and targeted (and more performant) code than the native compiler.
Interesting response.
By nature, I’m a bit holistic in my outlook. It’s probably why I’m not eminent or successful – you need to be *really* good if you want to be eminent, successful, *and* holistic. Digression over.
I’m sure though there are more than a few that would agree with my outlook. Certainly my first google hit agreed anyhow.
There are some potentially good things about .NET, but my holistic outlook is telling me that the number one improvement for memory management would be hardware based, specifically content-addressable memory for, say, free list lookups on the heap. I’d wager that such a hardware feature alone would have a massive effect on overall performance.
Clearly that’s not a Microsoft problem. On the other hand, they’ll (you’ll) already be aware of such things. How hard do you keep hitting a problem that can’t be dramatically improved with the only tools you have? Maybe it’s a needle to crack a coconut? Maybe there is another way?
I wasn’t expecting an answer to my original post, but I’m glad you responded. At least I’m not alone in the universe.
What if it’s fluorescent light?
> Sure, you could put a camera in your fridge… but then you’d only be sure that the light is off *when there is a camera*.
Any Microsoft interviewee knows the answer to this one. Shut the fridge for a while, then open it and check whether the light bulb is cold.
Fluorescent lights still get warm, just not as much. Even LED bulbs warm up, but you might need to get a thermometer to detect it.
Do I get the job?
Sure you do — you can come repair my fridge’s light any time you’d like ;)