What does it mean when I get an access violation at a very low address when entering a critical section?

Date:September 22, 2017 / year-entry #213
Tags:code
Orig Link:https://blogs.msdn.microsoft.com/oldnewthing/20170922-00/?p=97065
Comments:    6
Summary:It probably means that you're entering a critical section that is not initialized.

Warning: This article talks about implementation details which can change at any time. The information provided is for debugging and diagnostic purposes only.

A customer found that their server program occasionally crashes in the internal function RtlpWaitOnCriticalSection trying to dereference the address 0x00000014.

7789dde3 ff4014          inc     dword ptr [eax+14h]

The dereference was due to a null pointer in the EAX register. This was particularly difficult to debug because the problem usually didn't surface until the program had been running continuously for a week or more.
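The arithmetic behind that conclusion can be sketched in a few lines of C. This is only an illustration: the helper names are hypothetical, and the one real input is the displacement 0x14 taken from the faulting instruction `inc dword ptr [eax+14h]`.

```c
#include <stdint.h>

/* Effective address of a [reg + disp] memory operand in a 32-bit process.
   Hypothetical helper, for illustration only. */
uint32_t effective_address(uint32_t reg, uint32_t disp)
{
    return reg + disp;              /* unsigned math wraps modulo 2^32 */
}

/* Work backwards from the faulting address to the register value. */
uint32_t base_from_fault(uint32_t fault_address, uint32_t disp)
{
    return fault_address - disp;
}
```

With the values from the crash, `base_from_fault(0x00000014, 0x14)` comes out as 0, which is the null pointer in EAX. The general pattern is that a fault at a very low address usually means a null base pointer plus a small field offset.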

The customer chased the null pointer backwards and found that it came from the DebugInfo field of the RTL_CRITICAL_SECTION structure.

typedef struct _RTL_CRITICAL_SECTION
{
                                             // value in memory:
     PRTL_CRITICAL_SECTION_DEBUG DebugInfo;  // 0x00000000
     LONG LockCount;                         // 0xFFFFFFFC
     LONG RecursionCount;                    // 0x00000000
     PVOID OwningThread;                     // 0x00000000
     PVOID LockSemaphore;                    // 0x00005CDC
     ULONG SpinCount;                        // 0x00000000
} RTL_CRITICAL_SECTION, *PRTL_CRITICAL_SECTION;

The customer confirmed that, yes, the DebugInfo of the critical section they were trying to enter was indeed null.

Although the customer didn't do it in their application (at least not knowingly), they did try a test application which passed the CRITICAL_SECTION_NO_DEBUG_INFO flag to the InitializeCriticalSectionEx function, in the hopes of inducing a null pointer for the DebugInfo, but it didn't work. When initialized in that way, the DebugInfo was set to 0xFFFFFFFF.

Is it possible that this is a critical section that was initialized with the traditional InitializeCriticalSection function, but the attempt to allocate the debug info failed, so the kernel left it null?

No, that's not why the DebugInfo is null. If a critical section has no debug info (either explicitly requested as such with the CRITICAL_SECTION_NO_DEBUG_INFO flag, or because the system couldn't allocate any debug info), then the DebugInfo is set to the special value 0xFFFFFFFF. The DebugInfo for a valid initialized critical section is never null.

So what does it mean when the DebugInfo is null? The most likely reason is that you are using an uninitialized critical section. Either you never initialized it, or you deleted an initialized critical section (which resets it back to the uninitialized state).
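When reading the field out of a crash dump, the three cases can be told apart mechanically. Here is a minimal sketch, assuming a 32-bit process like the one in this investigation. The enum and function names are hypothetical and not part of any Windows header, and like everything else here, the special values are implementation details that can change.

```c
#include <stdint.h>

/* Hypothetical labels for this sketch; not part of any Windows header. */
typedef enum {
    CS_DEBUGINFO_NULL,     /* 0x00000000: probably uninitialized or deleted */
    CS_DEBUGINFO_NONE,     /* 0xFFFFFFFF: initialized, but has no debug info */
    CS_DEBUGINFO_PRESENT   /* anything else: presumably points at debug info */
} cs_debuginfo_state;

/* Classify the raw DebugInfo value as read from a 32-bit dump. */
cs_debuginfo_state classify_debug_info(uint32_t debug_info)
{
    if (debug_info == 0x00000000u) return CS_DEBUGINFO_NULL;
    if (debug_info == 0xFFFFFFFFu) return CS_DEBUGINFO_NONE;
    return CS_DEBUGINFO_PRESENT;
}
```

The customer's critical section falls into the first bucket, whereas their test application that passed CRITICAL_SECTION_NO_DEBUG_INFO landed in the second.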

Other evidence that you have an uninitialized critical section is that the critical section is locked, yet has no owner. Furthermore, the spin count is zero, which occurs only on uniprocessor systems. I suspect the server they are running the program on has more than one core. (Heck, my phone has more than one core.)
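Those secondary clues can be checked the same way. The sketch below is for diagnostic purposes only and rests on two assumptions that do not come from this article: the `cs_snapshot32` type is a hypothetical mirror of the 32-bit fields shown earlier, and the lowest bit of LockCount is taken to be clear when the critical section is locked, which is the encoding the debugger documentation describes for current versions of Windows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Raw 32-bit field values as read from a dump. Hypothetical type that
   mirrors the field order of the structure shown earlier. */
typedef struct {
    uint32_t debug_info;
    uint32_t lock_count;
    uint32_t recursion_count;
    uint32_t owning_thread;
    uint32_t lock_semaphore;
    uint32_t spin_count;
} cs_snapshot32;

/* ASSUMPTION: lowest bit of LockCount clear means "locked". */
bool cs_is_locked(const cs_snapshot32 *cs)
{
    return (cs->lock_count & 1u) == 0;
}

/* Count how many of the three clues point at an uninitialized section. */
int cs_uninitialized_hints(const cs_snapshot32 *cs)
{
    int hints = 0;
    if (cs->debug_info == 0) hints++;                          /* null DebugInfo */
    if (cs_is_locked(cs) && cs->owning_thread == 0) hints++;   /* locked, no owner */
    if (cs->spin_count == 0) hints++;                          /* zero spin count */
    return hints;
}
```

Feeding in the customer's values (DebugInfo 0, LockCount 0xFFFFFFFC, OwningThread 0, SpinCount 0) trips all three checks. None of them is proof on its own, and the spin count clue in particular is weak if the program really is running on a single processor.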

Bonus reading: Displaying a critical section in the debugger.

Related: I hope you weren't using those undocumented critical section fields.


Comments (6)
  1. Antonio Rodríguez says:

    Well, without knowing the internals of a critical section and just by reading the article’s title (not even the subtitle), my psychic powers told me it was due to a null pointer, possibly because they were accessing an uninitialized critical section, or maybe because of memory corruption. Anyway, debugging is necessary to find the final cause. But if it only happens after running for a week, good luck with that. Another of those fun bugs where you end up reading code in hope of finding something that could cause the observed problem (and you fix some unrelated bugs along the way).

  2. Martin Bonner says:

    The physical hardware they were running on was *vanishingly* unlikely to be uniprocessor – but the VM they were running in might have been.

    I suspect a *lot* of servers these days are running inside VMs – and if you don’t need multi-threaded throughput, a uniprocessor machine eliminates quite a lot of potential problems.

    1. skSdnW says:

      That depends on the meaning of “uniprocessor systems”. It’s been a while since we had special uniprocessor kernel builds. A modern system might support hot swappable CPUs so the core count is not fixed at boot.

      1. Joshua says:

        I discovered a long time ago the uniprocessor kernel has a deadlock bug the multiprocessor kernel doesn’t. If you try to read a handle from one thread and close it from another it will deadlock until the read finishes (which could be never). The multiprocessor kernel won’t deadlock on this.

        1. Antonio Rodríguez says:

          Basically, uniprocessor kernels make a good number of optimizations based on the fact that there is only one processor, and thus, you can make some assumptions, like “read-modify-store instructions are always atomic”, or “there can’t be race conditions without context switches”. This allows you to save some instructions, and in some cases, even avoid switching to/from kernel mode. Which improves performance. But, as with every optimization, you risk introducing new bugs.

          Anyway, the Windows kernel should be the most tested piece of software in history. Are you sure it really was a bug, and not a side effect of some documented behavior?

          1. Joshua says:

            If you can get a thread into an uninterruptible sleep it’s a security bug. I don’t remember trying to terminate the process to unwedge it, but that thread was stuck in the kernel.


*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog (most recent link is here). It may have slight formatting modifications for consistency and to improve readability.
