Friday, March 2, 2012

Debugging Heap Failures

Sometimes heap failures can be mysterious, but they don't have to be.  For instance, today I got this kd (kernel debugger) break.

////////////////////////////////////////////////////////////////////////////
Output of !analyze -v



*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


Loading symbols for 680c0000          component.dll ->    component .dll
Loading symbols for 75fb0000     KERNEL32.DLL ->   KERNEL32.DLL
Force unload of C:\Windows\SYSTEM32\user32.dll
Loading symbols for 76680000       user32.dll ->   user32.dll
ModLoad: 76680000 767a1000   C:\Windows\SYSTEM32\user32.dll
Force unload of C:\Windows\system32\ole32.dll
Loading symbols for 76560000        ole32.dll ->   ole32.dll
ModLoad: 76560000 76672000   C:\Windows\system32\ole32.dll
Debugger Dbgportaldb Connection::Open failed 80040e4d
Database Dbgportaldb not connected


FAULTING_IP:
ntdll!RtlReportCriticalFailure+33
001b:77d912d6 cc              int     3


EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 77d912d6 (ntdll!RtlReportCriticalFailure+0x00000033)
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 3
   Parameter[0]: 00000000
   Parameter[1]: 83b70d40
   Parameter[2]: 0000fffd


ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.


EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid


EXCEPTION_PARAMETER1:  00000000


EXCEPTION_PARAMETER2:  83b70d40


EXCEPTION_PARAMETER3:  0000fffd


NTGLOBALFLAG:  0


APPLICATION_VERIFIER_FLAGS:  0


APP:  ntkrpamp.exe


LAST_CONTROL_TRANSFER:  from 77d924a1 to 77d912d6


FAULTING_THREAD:  ffffffff


BUGCHECK_STR:  APPLICATION_FAULT_ACTIONABLE_HEAP_CORRUPTION_heap_failure_lfh_bitmap_mismatch


PRIMARY_PROBLEM_CLASS:  ACTIONABLE_HEAP_CORRUPTION_heap_failure_lfh_bitmap_mismatch


DEFAULT_BUCKET_ID:  ACTIONABLE_HEAP_CORRUPTION_heap_failure_lfh_bitmap_mismatch


STACK_TEXT:
0225f520 77d924a1 c0000374 77dc0130 0225f564 ntdll!RtlReportCriticalFailure+0x33
0225f530 77d9168d 00000002 934d01f7 00000016 ntdll!RtlpReportHeapFailure+0x21
0225f564 77d6434b 0000000e 00ce0000 00dd22d0 ntdll!RtlpLogHeapFailure+0xa2
0225f604 77cc3561 00000050 00fa6728 00000000 ntdll!RtlpLowFragHeapAllocFromContext+0x2d4
0225f68c 680ca40f 00ce0000 00000000 00000050 ntdll!RtlAllocateHeap+0x105
0225f6d4 680c981a 00fa6728 00d4b774 00000000  component !DoMoreWork+0x37
0225f768 680c8f84 00000001 00d4b774 00000000  component !DoWork+0xce
0225f918 680d0169 00d27c48 02483530 02483590  component !Query+0x12b
0225f92c 77cf5935 00d4d170 00000000 02483530  component !WorkDispatchThreadProc+0x99
0225fa70 77cd9139 0225fad4 02483590 934d08cb ntdll!TppWorkpExecuteCallback+0x338
0225fc58 75fb2a32 00cef980 0225fca4 77d0cdfe ntdll!TppWorkerThread+0x6da
0225fc64 77d0cdfe 00cef980 934d0837 00000000 KERNEL32!BaseThreadInitThunk+0xe
0225fca4 77d0cdaa ffffffff 77d88566 00000000 ntdll!__RtlUserThreadStart+0x4a
0225fcb4 00000000 77d0d633 00cef980 00000000 ntdll!_RtlUserThreadStart+0x1c


...



1: kd> .frame 2
02 0225f564 77d6434b ntdll!RtlpLogHeapFailure+0xa2 [d:\5858\minkernel\ntos\rtl\heaplog.c @ 672]
1: kd> dv
0225f56c              FailureType = heap_failure_lfh_bitmap_mismatch (0n14)
0225f570              HeapAddress = 0x00ce0000
0225f574                  Address = 0x00dd22d0
0225f578                   Param1 = 0x00000000
0225f57c                   Param2 = 0x00000000
0225f580                   Param3 = 0x00000000



////////////////////////////////////////////////////////////////////////////////


My code in question is uninteresting:



    pData = (PDATA)HeapAlloc(GetProcessHeap(), 0, sizeof(DATA) * m_cData);

    if (!pData) {
        hr = E_OUTOFMEMORY;
        goto Exit;
    }

//////////////////////////////////////////////////////////////////////


So what is a HEAP_FAILURE_LFH_BITMAP_MISMATCH?  I wasn't sure, so I had to find out.  First off LFH refers to the Low-Fragmentation Heap which became the default in Windows Vista; you can read more about here.  Basically the LFH uses bitmap masks to track whether blocks are free or busy.  This information is also available in each LFH block's metadata.  This failure indicates that the busy status between these two do not agree and is therefore corrupted.  This still doesn't solve the mystery.  Luckily there is a debugger extension help us find more clues, !heap.

/////////////////////////////////////////////////////////////

1: kd> !heap -triage ce0000
**************************************************************
*                                                            *
*                  HEAP ERROR DETECTED                       *
*                                                            *
**************************************************************


Details:


Heap address:  00ce0000
Error address: 00dd22d0
Error type:    HEAP_FAILURE_LFH_BITMAP_MISMATCH
Details:       The LFH detected a mismatch between an individual
               block's metadata and its corresponding subsegment's
               metadata.
Follow-up:     Enable pageheap.
Error type: Unrecognized failure.
Follow-up:  This may be a bug in the extension. Send a
            remote or dump to ______.




Stack trace:
                77d6434b: ntdll!RtlpLowFragHeapAllocFromContext+0x000002d4
                77cc3561: ntdll!RtlAllocateHeap+0x00000105
                680ca40f: component!DoMoreWork+0x00000037
                680c981a: component!DoWork+0x000000ce
                680c8f84: component!DoWork+0x0000012b
                680d0169: Component!Query+0x00000099
                77cf5935: ntdll!TppWorkpExecuteCallback+0x00000338
                77cd9139: ntdll!TppWorkerThread+0x000006da
                75fb2a32: KERNEL32!BaseThreadInitThunk+0x0000000e
                77d0cdfe: ntdll!__RtlUserThreadStart+0x0000004a
                77d0cdaa: ntdll!_RtlUserThreadStart+0x0000001c


**********************************************************
** !heap: Searching for the heap and segment that
**        contain the specified address. To search
**        for the entry that contains this address,
**        use !heap -x 00ce0000.
**********************************************************


** !heap: Analyzing heap at 00ce0000...


** !heap: The following LFH allocations are missing a flag in their
          'unused bytes' field that identifies them as LFH allocations.
          This is usually caused by entry corruption in the client
          application.
** !heap: To view the state of the invalid blocks:
          !heap -i
          !heap -i


Heap address        Entry address       Unused bytes
----------------------------------------------------------------------------
ce0000              dd1b98              49
ce0000              dd22d0              49
ce0000              dd2590              49
ce0000              dd1b98              49
ce0000              dd22d0              49
ce0000              dd2590              49
ce0000              dd1b98              49
ce0000              dd22d0              49
ce0000              dd2590              49
ce0000              dd1b98              49
ce0000              dd22d0              49
ce0000              dd2590              49




** !heap: If these failures are easily reproducible, they can
          be detected as they occur by enabling pageheap for
          this scenario.


1: kd>
///////////////////////////////////////////////////////////////////////////////////////

Ok, still no luck.  I will try enabling pageheap if this issue is reproducible.

This is how you enable page heap verification.

//////// update /////////

The corruption of the heap is indicative of a buffer overrun.  I think I have been able to pinpoint the source.  I have a RTL linked list that tracks the state of sub-operations within a larger RPC client driven operation.  It turned out there were two latent conditions where this list was not correctly locked: the first one was when a sub-operation failed to initialize and I would remove it out of the list without locking it, and the second was when sub-operations in other threads would send a state update (which causes the lock to be taken) at the same moment when the list was getting torn down.  Obviously removing elements from the list while some other thread is actively traversing it can cause the other thread to be executing on bad Flinks.  Fixing these issues should make this corruption go away.  This was a very seldom repro on x86 and AMD64, but apparently common on ARM.

RTL linked lists

critical sections


No comments:

Post a Comment