Date: | July 21, 2006 / year-entry #243 |
Tags: | other |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20060721-06/?p=30433 |
Comments: | 14 |
Summary: | An import library resolves symbols for imported functions, but it isn't consulted until the link phase. Let's consider a naive implementation where the compiler is blissfully unaware of the existence of imported functions. In the 16-bit world, this caused no difficulty at all. The compiler generated a far call instruction and left an external record... |
An import library resolves symbols for imported functions, but it isn't consulted until the link phase.

Let's consider a naive implementation where the compiler is blissfully unaware of the existence of imported functions. In the 16-bit world, this caused no difficulty at all. The compiler generated a far call instruction and left an external record in the object file indicating that the address of the function should be filled in by the linker. At link time, the linker realizes that the external symbol corresponds to an imported function, so it takes all the call targets, threads them together, and creates an import record in the module's import table. At load time, those call entries are fixed up and everybody is happy.

Now let's look at how a naive 32-bit compiler would deal with the same situation. The compiler would generate a normal call instruction, leaving the linker to resolve the external. The linker then sees that the external is really an imported function, and, uh-oh, the direct call needs to be converted to an indirect call. But the linker can't rewrite the code generated by the compiler. What's a linker to do?

The solution is to insert another level of indirection. (Warning: The information below is not literally true, but it's "true enough". We'll dig into the finer details later in this series.)

For each exported function in an import library, two external symbols are generated. The first is for the entry in the imported functions table, which takes the name __imp_FunctionName. But the naive compiler's direct call instruction still needs a plain FunctionName symbol to resolve to. That's what the second symbol is for. The second symbol is the longed-for FunctionName, a one-instruction stub that merely jumps indirectly through the __imp_FunctionName table entry. When the module is loaded, then, the import is resolved to a function pointer and stored in __imp_FunctionName; the naive compiler's direct call lands on the stub, which forwards through that pointer to the real function.

Note that with a naive compiler, if your code tries to take the address of an imported function, it gets the address of the FunctionName stub, since that is the only symbol for the function that the naive compiler knows about.

Next time, we'll look at how a less naive compiler can take advantage of these symbols.
Comments (14)
Comments are closed. |
Well, no one else has commented so I’ll start. Thanks for this article. And for all the others so far in the DLL series. I thought I knew a lot about DLLs but found that I didn’t know as much as I thought. I was especially interested in this article because I’ve been attempting to extend our compiler that we use for our embedded language in our line of battery-powered data loggers. Up to this point, we’ve limited ourselves to allowing customers just one source file and we’ve embedded the "library" of functions they can call in the device itself. I was thinking that we should only load the functions the customer needs for the program and also allow customers to create their own libraries of functions in multiple files. While there are lots of books and classes on compilers, I’m having trouble finding good sources for writing a linker. Ours would be very primitive and you’ve helped me see a direction to take. Keep up the great blogging, Raymond. And, as I write this, it looks like Floyd Landis has wrapped up the Tour – almost.
I don’t understand. Does that mean that direct JMP or CALL instruction DOES NOT cause queue reload? Where is the next instruction loaded from then?
Yes. The processor knows the target of the jump long before it gets there, so it automatically preloads the instructions at the target of the jump.
It’s the same way branch prediction works, except that it knows for sure that the branch will occur.
When PE was designed, this way was the best way to implement it. __declspec(dllimport) makes the compiler do the indirection where possible, and the thunks handle where it isn’t possible.
The problem is that times have changed. An indirect call is extremely slow on modern processors, because it causes a full instruction queue reload. (Keep that in mind next time you are deciding whether you need virtual functions.)
The way that things should have been done is to use the existing trampoline stubs, except make ntdll’s Ldr* stuff modify the jmp’s themselves instead of simply an import table. Then the code becomes "call near thunk_SendMessageW". At thunk_SendMessageW is "jmp near SendMessageW". The PE loader would modify the bytes after the E9 to point to the correct address. For security, Ldr* would mark this region as PAGE_EXECUTE_READ after it’s done modifying.
This is highly specific to x86-32, because x86-64 and PowerPC can’t do a direct jump to anywhere in the address space. I don’t know IA64 so I have no idea with that one.
It’s too late to have this at the ntdll level, but such a system could be implemented with a combination of compiler, linker, and crt0 code. Or ntdll’s Ldr* could have a new option for that kind of import table, and crt0 could do it itself if ntdll didn’t support it.
An indirect jump doesn’t flush the instruction pipe. Where did you hear that nonsense? It will initiate a write-back fifo access (or cache access) to fetch the target address, but this has nothing to do with the instruction queue.
Actually, the queue will get stalled in the case of a cache miss, since main-memory access time is very slow compared to the core clock speed.
Yosi is right; on all x86es since at least the Pentium 1, indirect branches like the ones discussed here are predicted to go to the same address as last time. The Pentium M and newer CPUs (Core) have a more sophisticated mechanism for indirect calls (in order to handle virtual calls that go to different addresses at different times), but that would go unused here.
And as a side note, since there seems to be some confusion here too: jmp, bcc, call, etc. are all branch-predicted, so even an unconditional jmp goes through "prediction" even though it always jumps.
It would be pretty silly of processor-designers to NOT optimize their CPUs for some of the most common cases – virtual functions and DLL-calls.
Optimization (at this level, at least) is a two-way street — software designers optimize their code for the CPU and hardware designers optimize the CPU for whatever code executes on them.
Dean: But "virtual functions" and "DLL calls" are *not* something the processor designers can optimize for. Those are several levels of abstraction above the processor. The code to actually do them is generated by the C++ compiler and library loader, respectively; the choices that Microsoft’s C++ compiler and library loader made are not the only possible choices.
(Nitpicky? Well, yes; why do you ask? :-P)
BryanK: “But “virtual functions” and “DLL calls” are *not* something the processor designers can optimize for.”
Er, why not? In software, we optimize by profiling first, to get an idea of where the software might benefit from optimization.
I’m not too sure that hardware engineers aren’t under similar economic constraints to spend their time efficiently.
I am probably stupid or something, but considering instructions like
jmp/call $addr
jmp/call $[addr]
Do you think CPU even bother to *predict* such jump?
Is call [$addr] *significantly* slower than call $addr, when *both* the addr variable itself and the destination it points to are outside cached memory?
Yes, it will be slower because an extra read takes place, but I doubt it is ten times slower or anything like that.
I think it all comes down to what "significantly" means in this case…
Are DLL calls always slower than ordinary calls inside a binary?
[Trackback, translated from Chinese] Seeing the world differently through assembly: calling DLL exported functions