Date: | June 8, 2016 / year-entry #120 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20160608-00/?p=93615 |
Comments: | 16 |
Summary: | Seeing the proximate cause. |
Today we're going to look at an application compatiblity problem. Actually, today, we'll just look at the crash that is the reason why we have an application compatibility problem. We'll then spend the next few days trying to figure out what went wrong. Today's installment will be rather long because I want to go through the entire process of making the initial diagnosis, so that we won't have to spend each day re-establishing context. I also want to point out that even though this is presented as a straightforward narrative, the actual analysis involved a good number of dead ends and sitting around thinking, "Great, what am I supposed to do now?" We start with the actual crash. eax=00000001 ebx=31410000 ecx=00000000 edx=40000000 esi=001345a8 edi=1fee0a9c eip=314259d3 esp=0013443c ebp=0013444c iopl=0 nv up ei pl nz na pe nc cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010206 contoso!ContosoInitialize+0x4d73: 314259d3 mov eax,dword ptr [ecx+eax*4] ds:0023:00000004=????????
(Obviously, the symbols are wrong.
Which makes sense because we don't have symbols for
What we have here is an array index operation into an array, but the array is a null pointer. Let's see where that null pointer came from. contoso!ContosoInitialize+0x4d40: 314259a0 push ebp 314259a1 mov ebp, esp 314259a3 sub esp, 10h 314259a6 mov dword ptr [ebp-10h], ecx 314259a9 mov eax, dword ptr [ebp+8] 314259ac mov dword ptr [ebp-8], eax 314259af lea ecx, [ebp-0Ch] 314259b2 push ecx 314259b3 lea edx, [ebp-4] 314259b6 push edx 314259b7 mov eax, dword ptr [ebp-8] 314259ba push eax 314259bb call contoso!ContosoInitialize+0x4db0 (31425a10) 314259c0 add esp, 0Ch 314259c3 mov edx, 1 314259c8 mov ecx, dword ptr [ebp-0Ch] 314259cb shl edx, cl 314259cd mov eax, dword ptr [ebp-4] 314259d0 mov ecx, dword ptr [ebp-10h] 314259d3 mov eax, dword ptr [ecx+eax*4]
Ah, it came from the Here's the stack trace, or at least part of it. ChildEBP RetAddr 0013444c 31426201 contoso!ContosoInitialize+0x4d73 00134478 31427401 contoso!ContosoInitialize+0x55a1 00134484 3142b030 contoso!ContosoInitialize+0x67a1 00134494 3142afdd contoso!ContosoInitialize+0xa3d0 001344c4 314279e0 contoso!ContosoInitialize+0xa37d 001344dc 31427e86 contoso!ContosoInitialize+0x6d80 001344f8 31427c2f contoso!ContosoInitialize+0x7226 00134504 31427bd3 contoso!ContosoInitialize+0x6fcf 00134530 314332b8 contoso!ContosoInitialize+0x6f73 00134554 31432278 contoso!ContosoInitialize+0x12658 00134578 3142dd7a contoso!ContosoInitialize+0x11618 001345a8 3142e326 contoso!ContosoInitialize+0xd11a 00134668 3142e4b5 contoso!ContosoInitialize+0xd6c6 0013468c 3142e5b6 contoso!ContosoInitialize+0xd855 001346b0 31424766 contoso!ContosoInitialize+0xd956 001346bc 31421c0d contoso!ContosoInitialize+0x3b06 001346e0 314c8cba contoso!ContosoInitialize+0xfad 00134790 314b5e08 contoso!ContosoInitialize+0xa805a 001347b0 77c60c6e contoso!ContosoInitialize+0x951a8 001347d0 77c12f17 ntdll!RtlAddAuditAccessAceEx+0x5e 00134820 77c09331 ntdll!RtlActivateActivationContextUnsafeFast+0xd7 001348a0 77c091b4 ntdll!RtlLoadString+0x3c1 001348c4 77c07d68 ntdll!RtlLoadString+0x244 001348e0 77c0bd21 ntdll!LdrGetDllHandleByMapping+0xe8 00134928 77c0b941 ntdll!RtlAcquirePebLock+0x391 00134a74 77c0b615 ntdll!LdrLoadDll+0x3a1 00134af8 751a7f94 ntdll!LdrLoadDll+0x75 00134b3c 751aad06 KERNELBASE!LoadLibraryExW+0x124 00134b64 018aa533 KERNELBASE!LoadLibraryExA+0x26 00134c18 018aa410 setup+0x3a533
At this point, you might choose to shudder in horror,
because you can see that this DLL is running a ton of
code during Once you overcome your initial shock, you can pick off the caller (highlighted in the stack trace as the return address) and take a look at the enclosing function. contoso!ContosoInitialize+0x5570: 314261d0 push ebp 314261d1 mov ebp, esp 314261d3 push 0FFFFFFFFh 314261d5 push offset contoso!ContosoInitialize+0xa5798 (314c63f8) 314261da mov eax, dword ptr fs:[00000000h] 314261e0 push eax 314261e1 mov dword ptr fs:[0], esp 314261e8 sub esp, 14h 314261eb mov dword ptr [ebp-20h], ecx // "this" 314261ee mov eax, dword ptr [ebp+8] 314261f1 push eax 314261f2 mov ecx, dword ptr [ebp-20h] // "this" 314261f5 call contoso!ContosoInitialize+0x5620 (31426280) 314261fa mov ecx eax 314261fc call contoso!ContosoInitialize+0x4d40 (314259a0)
The null pointer came from the call to mystery member function contoso!ContosoInitialize+0x5620: 31426280 push ebp 31426281 mov ebp, esp 31426283 push 0FFFFFFFFh 31426285 push offset contoso!ContosoInitialize+0xa57c3 (314c6423) 3142628a mov eax, dword ptr fs:[00000000h] 31426290 push eax 31426291 mov dword ptr fs:[0], esp 31426298 sub esp, 24h 3142629b mov dword ptr [ebp-2Ch], ecx // "this" 3142629e mov eax, dword ptr [ebp-2Ch] // "this" 314262a1 mov ecx, dword ptr [eax+400h] // read member at offset 0x400 314262a7 cmp ecx, dword ptr [contoso!ContosoInitialize+0xe5d3c (3150699c)] 314262ad jne contoso!ContosoInitialize+0x56c0 (31426320)
Was this jump taken? Let's find out.
First, extract the 0:000> dd 00134478-20 L1 00134458 1fee50f8 0:000> dd 1fee50f8+400 L1 1fee54f8 00000041 0:000> dd 3150699c L1 3150699c 00000000 The values do not match, so let's assume the jump was taken. 31426320 mov dword ptr [ebp-10h], 0 31426327 lea eax, [ebp-10h] // bonus parameter 3142632a push eax 3142632b mov ecx, dword ptr [ebp-2Ch] // "this" 3142632e mov edx, dword ptr [ecx+400h] // the value is 0x41 31426334 push edx 31426335 call contoso!ContosoInitialize+0x50b0 (31425d10)
Let's follow the money into mystery
free function
contoso!ContosoInitialize+0x50b0: 31425d10 push ebp 31425d11 mov ebp, esp 31425d13 push 0FFFFFFFFh 31425d15 push offset contoso!ContosoInitialize+0xa5738 (314c6398) 31425d1a mov eax, dword ptr fs:[00000000h] 31425d20 push eax 31425d21 mov dword ptr fs:[0], esp 31425d28 sub esp, 10h 31425d2b call contoso!ContosoInitialize+0x952f3 (314b5f53) 31425d30 push eax 31425d31 lea ecx, [ebp-14h] 31425d34 call contoso!ContosoInitialize+0xfe0 (31421c40) 31425d39 mov dword ptr [ebp-4], 0 31425d40 mov eax, dword ptr [ebp+8] 31425d43 push eax 31425d44 call dword ptr [contoso!ContosoInitialize+0xa8590 (314c91f0)] // TlsGetValue -> goes into the variable 31425d4a mov ecx, dword ptr [ebp+0Ch] 31425d4d mov dword ptr [ecx], eax 31425d4f mov edx, dword ptr [ebp+0Ch] 31425d52 cmp dword ptr [edx], 0 31425d55 jne contoso!ContosoInitialize+0x5125 (31425d85) 31425d57 call dword ptr [contoso!ContosoInitialize+0xa840c (314c906c)] // GetLastError 31425d5d push eax 31425d5e lea ecx, [ebp-14h] 31425d61 call contoso!ContosoInitialize+0x5150 (31425db0) 31425d66 test eax, eax 31425d68 je contoso!ContosoInitialize+0x5125 (31425d85) 31425d6a mov dword ptr [ebp-18h], 0FFFFFFFFh 31425d71 mov dword ptr [ebp-4], 0FFFFFFFFh 31425d78 lea ecx, [ebp-14h] 31425d7b call contoso!ContosoInitialize+0xff0 (31421c50) 31425d80 mov eax, dword ptr [ebp-18h] 31425d83 jmp contoso!ContosoInitialize+0x513e (31425d9e) 31425d85 mov dword ptr [ebp-1Ch], 0 31425d8c mov dword ptr [ebp-4], 0FFFFFFFFh 31425d93 lea ecx, [ebp-14h] 31425d96 call contoso!ContosoInitialize+0xff0 (31421c50) 31425d9b mov eax, dword ptr [ebp-1Ch] 31425d9e mov ecx, dword ptr [ebp-0Ch] 31425da1 mov dword ptr fs:[0], ecx 31425da8 mov esp, ebp 31425daa pop ebp 31425dab ret
There are two exit paths for this function; one returns 31425d39 mov dword ptr [ebp-4], 0 31425d40 mov eax, dword ptr [ebp+8] // Parameter 0 = 0x00000041 31425d43 push eax 31425d44 call dword ptr [contoso!ContosoInitialize+0xa8590 (314c91f0)] // TlsGetValue 31425d4a mov ecx, dword ptr [ebp+0Ch] // Parameter 1 (the bonus parameter) 31425d4d mov dword ptr [ecx], eax // Save the TLS value into parameter 1 31425d4f mov edx, dword ptr [ebp+0Ch] // Parameter 1 31425d52 cmp dword ptr [edx], 0 // Was the TLS value zero? 31425d55 jne contoso!ContosoInitialize+0x5125 (31425d85)
I was able to figure out that the called function was
0:000> ln poi 314c91f0 kernel32!TlsGetValue: 7603a1f0 mov edi,edi
The branch is taken if the value in the TLS slot is nonzero.
So is it?
Let's manually re-execute the call to
KERNELBASE!TlsGetValue:
751b6ed0 mov edi, edi
751b6ed2 push ebp
751b6ed3 mov ebp, esp
751b6ed5 mov ecx, dword ptr fs:[18h] // get the TEB
751b6edc mov eax, dword ptr [ebp+8] // tlsIndex (0x00000041)
751b6edf mov dword ptr [ecx+34h], 0 // SetLastError(NOERROR);
751b6ee6 cmp eax, 40h
751b6ee9 jae KERNELBASE!TlsGetValue+0x26 (751b6ef6) // Jump taken
...
751b6ef6 cmp eax,440h
751b6efb jae KERNELBASE!TlsGetValue+0x42 (751b6f12) // Jump not taken
751b6efd mov ecx,dword ptr [ecx+0F94h]
751b6f03 test ecx,ecx
751b6f05 je KERNELBASE!TlsGetValue+0x4c (751b6f1c)
0:000> dd @$teb+f94 L1
7ffdff94 05f5bf30
The value is nonzero, so the jump is not taken.
751b6f07 mov eax,dword ptr [ecx+eax*4-100h]
751b6f0e pop ebp
751b6f0f ret 4
0:000> dd 05f5bf30+41*4-100 L1
05f5bf34 00000000
The TLS value is zero.
Okay, let's go back to the function that called
31425d52 cmp dword ptr [edx], 0 // Was the TLS value zero? 31425d55 jne contoso!ContosoInitialize+0x5125 (31425d85) // Jump not taken 31425d57 call dword ptr [contoso!ContosoInitialize+0xa840c (314c906c)] // GetLastError 31425d5d push eax // NOERROR 31425d5e lea ecx,[ebp-14h] 31425d61 call contoso!ContosoInitialize+0x5150 (31425db0) 31425d66 test eax,eax 31425d68 je contoso!ContosoInitialize+0x5125 (31425d85)
We don't pay attention to contoso!ContosoInitialize+0x5150: 31425db0 push ebp 31425db1 mov ebp,esp 31425db3 push ecx 31425db4 mov dword ptr [ebp-4],ecx 31425db7 mov eax,dword ptr [ebp-4] 31425dba mov ecx,dword ptr [ebp+8] 31425dbd mov dword ptr [eax+4],ecx 31425dc0 mov eax, dword ptr [ebp+8] 31425dc3 mov esp, ebp 31425dc5 pop ebp 31425dc6 ret 4
The function returns its input parameter.
(It does other stuff, but we aren't interested in that part yet.)
Therefore, since we passed 31425d66 test eax,eax 31425d68 je contoso!ContosoInitialize+0x5125 (31425d85) // jump taken
And we saw from the analysis earlier that once you hit the code path
at
Now we can unwind back to the caller and see what happens when
function 31426335 call contoso!ContosoInitialize+0x50b0 (31425d10) // Returns zero 3142633a add esp, 8 3142633d cmp eax, 0FFFFFFFFh 31426340 jne contoso!ContosoInitialize+0x56e9 (31426349) // jump not taken 31426342 xor eax, eax // return zero 31426344 jmp contoso!ContosoInitialize+0x5783 (314263e3) 314263e3 mov ecx, dword ptr [ebp-0Ch] 314263e6 mov dword ptr fs:[0], ecx 314263ed mov esp, ebp 314263ef pop ebp 314263f0 ret Okay, so now we know where the null pointer came from: The null pointer was stored in the TLS slot, and this function returns whatever is in the TLS slot. That is a superficial analysis of why the program crashed. It doesn't really tell us what happened in the operating system that induced the crash; we'll have to dig in deeper. The story continues next time.
|
Comments (16)
Comments are closed. |
Fhtagn, Raymond, what’s up with your blog‽ It’s covered in canvas!
Good news: the arrows drawn on the canvas align with the text correctly. Which is no small feat.
Bad news: the canvas is opaque with respect to clicking (incl. clicking on the hyperlinks) and text selection (incl. [selection reading][1]). And it blocks the horizontal scroll bar on the widest listing (the one about TlsGetValue -> g… the rest did not fit my half-screen window).
[1]: https://blogs.msdn.microsoft.com/oldnewthing/20150528-00/?p=45014
It worked fine in staging, but when published, the canvas vanished! I hacked it by raising the canvas, but yeah, it messes with clicks. I’ll see what I can do, but I don’t have time right now. Probably some bad interaction with the site’s style sheet.
Raymond, you used to say “I don’t control the blog software”. Now you have been sucked into maintaining something approximating the blog software. :-)
Nope; he’s just maintaining his own HTML.
You could use the css rule pointer-events: none;. According to http://caniuse.com/#search=pointer-events this should work in all major desktop browsers.
Yay! Thanks for the tip.
That canvas overlay is broken for font sizes that are not 100% by the way, and kills text selection.
Apart from that, very interesting read. I’m looking forward to next episode.
Small nit: The crash isn’t the reason we have an appcompat problem; the crash is the notification that we have the problem! The appcompat problem is the cause of the crash.
But, still, very interesting post.
I’m curious: How does one obtain the rights to do something like this? I thought the standard Microsoft EULA banned reverse engineering.
One, I suppose it depends on how far you take the definition of “reverse engineering”. I would consider creating source code from binary code to be reverse engineering, but not debugging executable code.
Two, he mentions dump files a couple times. I’m guessing this is a crash dump from one of the “send crash info to Microsoft” notifications that you can enroll your programs into. I don’t know the agreement for enrollment like that, but I would imagine something in it states Microsoft can examine your crash dumps to improve Windows compatibility.
One, I wonder[0] what Microsoft’s own position is on other people debugging their executable code with respect to the anti-reverse-engineering provisions in their EULAs.
Two, the person sending the crash dump is likely *not* the author of the program, but a user of it, and therefore won’t have the right to authorise Microsoft to reverse-engineer a 3rd-party program.
[0] ’tis but an idle wonder – am not trying to put OGH on the spot.
Lots of folks in the Microsoft support organization (and in other Microsoft groups) do post-mortem crash-dump debugging. For the most part, it is done on customer code, using the standard support agreements. Crash-dump debugging is a well known art at Microsoft, and everything is above board legally (as far as I know – the closest to lawyering I ever get is watching one on TV).
“Two, the person sending the crash dump is likely *not* the author of the program, but a user of it, and therefore won’t have the right to authorise Microsoft to reverse-engineer a 3rd-party program.”
Except the “send crash data to Microsoft” option only appears if the software vendor has enrolled in Microsoft’s crash labs(can’t remember what the actual service is called) and already granted permission to MS to gather the data.
At least, I’m fairly certain that’s the case…that’s why it doesn’t appear on all applications when they crash.
Debugging and reverse-engineering are only tangentially related. You might need a debugger to do reverse-engineering, but they have nothing to do with each other beyond that.
If “debugging” was “reverse-engineering,” then editing configuration files by hand would be as well.
Yea, it should be removed. Fun quote from Volume Licensing Product Terms:
“Licensed User may reverse engineer, decompile or disassemble the Advanced Threat Analytics software, or otherwise attempt to derive the source code for the Advanced Threat Analytics Server software solely to the extent required to debug changes to any libraries licensed under the GNU Lesser General Public License that may be included with and linked to by the software.”
I would greatly appreciate it if you didn’t bring up legal issues in the comments. I don’t want my blog shut down. (Do you post questions like this whenever Mark Russinovich debugs a program?)