Thursday, March 29, 2012

Targeting Only Specific Modules with Appverifier



I have always been a proponent of using Appverifier whenever you are debugging your code with Windbg, but let's say you keep hitting verifier breaks in other modules that you aren't currently trying to debug.  Fixing all verifier breaks is good practice, and it is good to help other teams debug their components, but sometimes you just want those breaks to go away so you can focus on your code.  After enabling Appverifier, you can do just that in the debugger.

Let's say you only care about foo.dll.  This is how you enable verifier on only that module.

0:005> !avrf -skp all
Verifier package version >= 3.00
Exclusion ranges and suspend period have been reset.
0:005> !avrf -trg foo

It is that simple.

From Windbg help on !avrf.
"
-trg [ Start End | dll Module | all ]
Specifies a target range. Start is the beginning address of the target range. End is the ending address of the target range. Module specifies the name (including the .exe or .dll extension, but not including the path) of a module to be targeted. If you enter -trg all, all target ranges are reset. If you enter -trg with no additional parameters, the current target ranges are displayed.
-skp [ Start End | dll Module | all | Time ]
Specifies an exclusion range. Start is the beginning address of the exclusion range. End is the ending address of the exclusion range. Module specifies the name of a module to be targeted or excluded. Module specifies the name (including the .exe or .dll extension, but not including the path) of a module to be excluded. If you enter -skp all, all target ranges or exclusion ranges are reset. If you enter a Time value, all faults are suppressed for Time milliseconds after execution resumes.
"

Tuesday, March 20, 2012

Windows Kernel Function Prefixes


When kernel debugging, you will see a lot of Windows kernel internal function names with short prefixes, usually two letters (assuming you have symbols).  Knowing what the prefixes mean can help you figure out what is going on.  I will give you a quick rundown of some of the basics.

Common Prefixes:
Cc      Cache manager
Cm      Configuration manager
Ex      Executive support routines
FsRtl   File system driver run time lib
Hal     Hardware abstraction layer
Io      IO manager
Ke      Kernel
Lpc     Local procedure call
Lsa     Local security authority
Mm      Memory manager
Nt      System services
Ob      Object manager
Po      Power manager
Pp      PnP manager
Ps      Process support
Rtl     Runtime lib
Se      Security
Wmi     Windows management instrumentation
Zw      Kernel version of Nt functions

Within these prefixes, there are variations to denote internal functions (the second letter is changed to an 'i') or private functions (an extra 'p' is tacked onto the end of the prefix).  For instance, an internal PnP function would have the Pi prefix instead of Pp.
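As a rough illustration of reading these prefixes, here is a minimal C sketch that maps a bare function name (no module! part) to the component names from the list above.  The function and type names are made up for this example.  It only handles the plain prefix and the trailing 'p' private variant, and it assumes the prefix is always followed by an uppercase letter.  The 'i' internal variant is left out on purpose because it would be ambiguous here (Po, Pp, and Ps would all map to Pi).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table built from the prefix list in this post. */
typedef struct { const char *prefix; const char *component; } PrefixEntry;

static const PrefixEntry k_prefixes[] = {
    {"Cc", "Cache manager"},           {"Cm", "Configuration manager"},
    {"Ex", "Executive support routines"},
    {"FsRtl", "File system driver run time lib"},
    {"Hal", "Hardware abstraction layer"}, {"Io", "IO manager"},
    {"Ke", "Kernel"},                  {"Lpc", "Local procedure call"},
    {"Lsa", "Local security authority"}, {"Mm", "Memory manager"},
    {"Nt", "System services"},         {"Ob", "Object manager"},
    {"Po", "Power manager"},           {"Pp", "PnP manager"},
    {"Ps", "Process support"},         {"Rtl", "Runtime lib"},
    {"Se", "Security"},
    {"Wmi", "Windows management instrumentation"},
    {"Zw", "Kernel version of Nt functions"},
};

/* Returns the component for a name such as "IoCallDriver", or NULL if no
 * prefix fits.  *is_private is set to 1 when the prefix carries the extra
 * 'p' (for example "PspCreateThread"), otherwise 0. */
const char *kernel_component(const char *name, int *is_private)
{
    const PrefixEntry *best = NULL;
    size_t best_len = 0;

    /* Pick the longest matching prefix so "FsRtl" and "Rtl" cannot clash. */
    for (size_t i = 0; i < sizeof k_prefixes / sizeof k_prefixes[0]; i++) {
        size_t len = strlen(k_prefixes[i].prefix);
        if (len > best_len && strncmp(name, k_prefixes[i].prefix, len) == 0) {
            best = &k_prefixes[i];
            best_len = len;
        }
    }
    if (best == NULL)
        return NULL;

    const char *rest = name + best_len;
    int priv = 0;
    if (rest[0] == 'p' && isupper((unsigned char)rest[1])) {
        priv = 1;   /* private variant: prefix + 'p' */
        rest++;
    }
    /* Assumption: the routine name proper starts with an uppercase letter. */
    if (!isupper((unsigned char)rest[0]))
        return NULL;

    if (is_private != NULL)
        *is_private = priv;
    return best->component;
}
```

For example, kernel_component("PspCreateThread", &priv) reports the process support component with priv set to 1, while a non-kernel word like "Keyboard" is rejected even though it starts with "Ke".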

Monday, March 19, 2012

Windows System Events + Windbg Debugging

Events are a great way to synchronize threads, processes, and even UM and KM code.  Creating and waiting for events is almost always the best (most efficient) way to synchronize.  Spin waiting is generally bad (i.e. do {/*empty*/} while (!bFlag);), and spinning with a sleep is worse because every sleep yields the CPU, losing at least an order of magnitude more cycles.

What is an event?  The simple answer is it is a kernel object.  At the fundamental level, it is a data structure in the kernel.  In fact, because an event is the simplest kernel object, its structure is the header for all other kernel objects.

Basics:

You create an event with CreateEvent, which, if successful, returns a handle to the corresponding kernel object.  You close an event by closing its handle with CloseHandle, like you would for any other kernel object.  To signal the event, you use SetEvent, and to reset it, you use ResetEvent if it is a manual-reset event.  Nothing too complicated here.  To wait for events, you simply use the normal wait functions like WaitForSingleObject.  After you create an event, you must close the handle when you are done with it.  You must also close each handle you get from duplicating one.  Most of this applies to all kernel objects because, like I mentioned before, the headers of their structures are the same (they are the same as events).


Debugging:


Many common handle bugs are easily found by just doing a code review.  For instance, for each event create, make sure there is a close handle.  Make sure you don't close the handle before you are done using it.  Etc.  If a code review doesn't do it for you, Application Verifier can automatically detect many handle-related bugs.  Use it, love it.


Windbg: regular debugging can help a lot to validate your expectations.  Beyond that, the debugger extension !handle is your friend.  It can show detailed information about your handles.  Some of the information about a handle is only available in KM.

0:000> !handle
Handle 4
  Type          Section
Handle 8
  Type          Event
Handle c
  Type          Event
Handle 10
  Type          Event
Handle 14
  Type          Directory
Handle 5c
  Type          File
6 Handles
Type            Count
Event           3
Section         1
File            1
Directory       1

0:002> !handle 160 7
Handle 160
  Type          Event
  Attributes    0
  GrantedAccess 0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
  HandleCount   2
  PointerCount  65
  Name          


Wednesday, March 7, 2012

How to Set Up a KD (Kernel Debugger) in Windows With 1394 or Over the Network

Let's say you are starting to write drivers and need some kernel mode (KM) debugging, or let's say you've decided that user mode (UM) debugging using windbg on the host is for sissies.  In this post I will show you how to set up a KD.

Assumptions:
You will need two machines: the TARGET machine that you want to debug, and the HOST machine from which you will do the actual debugging.

First thing you need to do is install windbg on both the target and host.  You can find the installer here.

Pick Your KD Method:
Next decide what kind of debugging you want to do.  The options are:
NET (i.e. debugging over a TCP/IP network just using NICs) (supported on Win8+),
1394 (supported on WinXP+),
COM (serial) (supported since the dawn of KD), or
USB (2.0 supported on Vista+, 3.0 in Win8+)

Generally the port you use is decided for you based on what OS you need to debug and what hardware your machines have.  I will make it simple: use 1394 (aka FireWire) if you can, or net if the machines aren't close.

If your two machines are next to each other, favor 1394.  If you are going to be kernel debugging often and don't have 1394 in your machines, buy some cards.  1394 is simple and fast.

If your target isn't close to your debugger machine, use net (short for network) debugging, but note it is a Win8+ feature at the moment.  Net debugging is also great for getting someone else to remote debug something.  Also, in Win8 over 90% of the mainstream NICs are supported for net debugging; most Intel, Broadcom, and Realtek NICs are supported.

COM is slow, but works assuming your machines have serial ports.

USB might be a choice if your USB controllers support kernel debugging.  In my experience, they rarely do.  This is especially true when the machine doesn't have 1394 and you can't net debug.  You are kind of screwed at this point.  The joke is even funnier when you do find a port that does support KD, but it is internally wired to the built-in webcam, or doesn't have an external port.

Setting Up a 1394 KD

TARGET

  1. open a command prompt
  2. bcdedit -debug on
  3. bcdedit -dbgsettings 1394 channel:1
    - you will have to pass bus params if you have more than one 1394 controller
    - channel can be 1-62
  4. reboot


HOST

  1. plug in 1394 cable into target and host
  2. open a command prompt
  3. kd -k 1394:channel=1

    windbg works instead of kd as well

Setting Up a NET KD

TARGET
  1. open a command prompt
  2. bcdedit -dbgsettings net hostip:192.168.1.11 port:50000
    - for hostip, put your machine's IP instead of  192.168.1.11
    - you can pick whatever port you want as long as it is in the range 49152 through 65535.
  3. It will output something like:
    "Key=
    aaaaaaaaaaaaa.vvvvvvvvvvvvv.yyyyyyyyyyyyy.xxxxxxxxxxxxx"
    Save that string in a text file on a thumb drive or network share; you will need it again on the host
  4. bcdedit -debug on
  5. reboot
HOST
  1. open a command prompt
  2. windbg -k net:port=50000,key=aaaaaaaaaaaaa.vvvvvvvvvvvvv.yyyyyyyyyyyyy.xxxxxxxxxxxxx

    you can use kd instead of windbg if you want
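
If you set up targets often, it can be handy to generate the host-side arguments instead of retyping them.  The following is only a small sketch in C: format_net_kd_args is a made-up helper name, the argument layout is copied from the host command above, and the port check uses the range given in the target steps.  It does not validate the key beyond checking that it is non-empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds the "-k net:port=...,key=..." argument string
 * that is passed to windbg or kd on the host.
 * Returns 0 on success, -1 on a bad port, an empty key, or a short buffer. */
int format_net_kd_args(char *out, size_t out_size, unsigned port, const char *key)
{
    if (out == NULL || out_size == 0 || key == NULL || strlen(key) == 0)
        return -1;

    /* Port range taken from the target-side notes in this post. */
    if (port < 49152 || port > 65535)
        return -1;

    int n = snprintf(out, out_size, "-k net:port=%u,key=%s", port, key);
    if (n < 0 || (size_t)n >= out_size)
        return -1;   /* output was truncated or formatting failed */
    return 0;
}
```

The resulting string is what follows windbg (or kd) on the host command line, so a wrapper script only has to prepend the debugger name.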



Friday, March 2, 2012

Debugging Heap Failures

Sometimes heap failures can be mysterious, but they don't have to be.  For instance, today I got this kd (kernel debugger) break.

////////////////////////////////////////////////////////////////////////////
Output of !analyze -v



*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


Loading symbols for 680c0000          component.dll ->    component.dll
Loading symbols for 75fb0000     KERNEL32.DLL ->   KERNEL32.DLL
Force unload of C:\Windows\SYSTEM32\user32.dll
Loading symbols for 76680000       user32.dll ->   user32.dll
ModLoad: 76680000 767a1000   C:\Windows\SYSTEM32\user32.dll
Force unload of C:\Windows\system32\ole32.dll
Loading symbols for 76560000        ole32.dll ->   ole32.dll
ModLoad: 76560000 76672000   C:\Windows\system32\ole32.dll
Debugger Dbgportaldb Connection::Open failed 80040e4d
Database Dbgportaldb not connected


FAULTING_IP:
ntdll!RtlReportCriticalFailure+33
001b:77d912d6 cc              int     3


EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 77d912d6 (ntdll!RtlReportCriticalFailure+0x00000033)
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 3
   Parameter[0]: 00000000
   Parameter[1]: 83b70d40
   Parameter[2]: 0000fffd


ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.


EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid


EXCEPTION_PARAMETER1:  00000000


EXCEPTION_PARAMETER2:  83b70d40


EXCEPTION_PARAMETER3:  0000fffd


NTGLOBALFLAG:  0


APPLICATION_VERIFIER_FLAGS:  0


APP:  ntkrpamp.exe


LAST_CONTROL_TRANSFER:  from 77d924a1 to 77d912d6


FAULTING_THREAD:  ffffffff


BUGCHECK_STR:  APPLICATION_FAULT_ACTIONABLE_HEAP_CORRUPTION_heap_failure_lfh_bitmap_mismatch


PRIMARY_PROBLEM_CLASS:  ACTIONABLE_HEAP_CORRUPTION_heap_failure_lfh_bitmap_mismatch


DEFAULT_BUCKET_ID:  ACTIONABLE_HEAP_CORRUPTION_heap_failure_lfh_bitmap_mismatch


STACK_TEXT:
0225f520 77d924a1 c0000374 77dc0130 0225f564 ntdll!RtlReportCriticalFailure+0x33
0225f530 77d9168d 00000002 934d01f7 00000016 ntdll!RtlpReportHeapFailure+0x21
0225f564 77d6434b 0000000e 00ce0000 00dd22d0 ntdll!RtlpLogHeapFailure+0xa2
0225f604 77cc3561 00000050 00fa6728 00000000 ntdll!RtlpLowFragHeapAllocFromContext+0x2d4
0225f68c 680ca40f 00ce0000 00000000 00000050 ntdll!RtlAllocateHeap+0x105
0225f6d4 680c981a 00fa6728 00d4b774 00000000 component!DoMoreWork+0x37
0225f768 680c8f84 00000001 00d4b774 00000000 component!DoWork+0xce
0225f918 680d0169 00d27c48 02483530 02483590 component!Query+0x12b
0225f92c 77cf5935 00d4d170 00000000 02483530 component!WorkDispatchThreadProc+0x99
0225fa70 77cd9139 0225fad4 02483590 934d08cb ntdll!TppWorkpExecuteCallback+0x338
0225fc58 75fb2a32 00cef980 0225fca4 77d0cdfe ntdll!TppWorkerThread+0x6da
0225fc64 77d0cdfe 00cef980 934d0837 00000000 KERNEL32!BaseThreadInitThunk+0xe
0225fca4 77d0cdaa ffffffff 77d88566 00000000 ntdll!__RtlUserThreadStart+0x4a
0225fcb4 00000000 77d0d633 00cef980 00000000 ntdll!_RtlUserThreadStart+0x1c


...



1: kd> .frame 2
02 0225f564 77d6434b ntdll!RtlpLogHeapFailure+0xa2 [d:\5858\minkernel\ntos\rtl\heaplog.c @ 672]
1: kd> dv
0225f56c              FailureType = heap_failure_lfh_bitmap_mismatch (0n14)
0225f570              HeapAddress = 0x00ce0000
0225f574                  Address = 0x00dd22d0
0225f578                   Param1 = 0x00000000
0225f57c                   Param2 = 0x00000000
0225f580                   Param3 = 0x00000000



////////////////////////////////////////////////////////////////////////////////


My code in question is uninteresting:



    pData = (PDATA)HeapAlloc(GetProcessHeap(), 0, sizeof(DATA) * m_cData);

    if (!pData) {
        hr = E_OUTOFMEMORY;
        goto Exit;
    }

//////////////////////////////////////////////////////////////////////


So what is a HEAP_FAILURE_LFH_BITMAP_MISMATCH?  I wasn't sure, so I had to find out.  First off, LFH refers to the Low-Fragmentation Heap, which became the default in Windows Vista; you can read more about it here.  Basically the LFH uses bitmaps to track whether blocks are free or busy.  This information is also available in each LFH block's metadata.  This failure indicates that the busy status recorded in these two places does not agree, so one of them has been corrupted.  This still doesn't solve the mystery.  Luckily there is a debugger extension to help us find more clues, !heap.

/////////////////////////////////////////////////////////////

1: kd> !heap -triage ce0000
**************************************************************
*                                                            *
*                  HEAP ERROR DETECTED                       *
*                                                            *
**************************************************************


Details:


Heap address:  00ce0000
Error address: 00dd22d0
Error type:    HEAP_FAILURE_LFH_BITMAP_MISMATCH
Details:       The LFH detected a mismatch between an individual
               block's metadata and its corresponding subsegment's
               metadata.
Follow-up:     Enable pageheap.
Error type: Unrecognized failure.
Follow-up:  This may be a bug in the extension. Send a
            remote or dump to ______.




Stack trace:
                77d6434b: ntdll!RtlpLowFragHeapAllocFromContext+0x000002d4
                77cc3561: ntdll!RtlAllocateHeap+0x00000105
                680ca40f: component!DoMoreWork+0x00000037
                680c981a: component!DoWork+0x000000ce
                680c8f84: component!Query+0x0000012b
                680d0169: component!WorkDispatchThreadProc+0x00000099
                77cf5935: ntdll!TppWorkpExecuteCallback+0x00000338
                77cd9139: ntdll!TppWorkerThread+0x000006da
                75fb2a32: KERNEL32!BaseThreadInitThunk+0x0000000e
                77d0cdfe: ntdll!__RtlUserThreadStart+0x0000004a
                77d0cdaa: ntdll!_RtlUserThreadStart+0x0000001c


**********************************************************
** !heap: Searching for the heap and segment that
**        contain the specified address. To search
**        for the entry that contains this address,
**        use !heap -x 00ce0000.
**********************************************************


** !heap: Analyzing heap at 00ce0000...


** !heap: The following LFH allocations are missing a flag in their
          'unused bytes' field that identifies them as LFH allocations.
          This is usually caused by entry corruption in the client
          application.
** !heap: To view the state of the invalid blocks:
          !heap -i
          !heap -i


Heap address        Entry address       Unused bytes
----------------------------------------------------------------------------
ce0000              dd1b98              49
ce0000              dd22d0              49
ce0000              dd2590              49
ce0000              dd1b98              49
ce0000              dd22d0              49
ce0000              dd2590              49
ce0000              dd1b98              49
ce0000              dd22d0              49
ce0000              dd2590              49
ce0000              dd1b98              49
ce0000              dd22d0              49
ce0000              dd2590              49




** !heap: If these failures are easily reproducible, they can
          be detected as they occur by enabling pageheap for
          this scenario.


1: kd>
///////////////////////////////////////////////////////////////////////////////////////

Ok, still no luck.  I will try enabling pageheap if this issue is reproducible.

This is how you enable page heap verification.
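
To make the bitmap mismatch failure above a bit more concrete, here is a toy model in C.  It is emphatically not the real ntdll layout: ToySubsegment and every function name are invented for illustration.  It only shows the idea described earlier, where the busy/free state is kept both in a bitmap and in per-block metadata, and a stray write to a block makes the two disagree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCKS 32

/* Toy stand-in for an LFH subsegment (assumed layout, not the real one). */
typedef struct {
    uint32_t busy_bitmap;             /* bit i set => block i is busy      */
    uint8_t  block_busy[TOY_BLOCKS];  /* per-block copy of the same state  */
} ToySubsegment;

void toy_init(ToySubsegment *s)
{
    memset(s, 0, sizeof *s);
}

/* Hands out the first free block and records it in both places. */
int toy_alloc(ToySubsegment *s)
{
    for (int i = 0; i < TOY_BLOCKS; i++) {
        uint32_t bit = (uint32_t)1 << i;
        if (!(s->busy_bitmap & bit)) {
            s->busy_bitmap |= bit;
            s->block_busy[i] = 1;
            return i;
        }
    }
    return -1;  /* subsegment is full */
}

/* Marks block i free in both places. */
void toy_free(ToySubsegment *s, int i)
{
    assert(i >= 0 && i < TOY_BLOCKS);
    s->busy_bitmap &= ~((uint32_t)1 << i);
    s->block_busy[i] = 0;
}

/* Returns the index of the first block whose bitmap bit and per-block
 * state disagree, or -1 if the two views are consistent. */
int toy_find_mismatch(const ToySubsegment *s)
{
    for (int i = 0; i < TOY_BLOCKS; i++) {
        int bitmap_busy = (int)((s->busy_bitmap >> i) & 1u);
        int block_busy  = s->block_busy[i] != 0;
        if (bitmap_busy != block_busy)
            return i;
    }
    return -1;
}
```

Overwriting block_busy for an allocated block, the way an overrun from a neighboring allocation might, makes toy_find_mismatch return that block's index.  The real heap makes a comparable check on its own metadata, which is why the failure tends to show up in a later, unrelated allocation rather than at the bad write itself.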

//////// update /////////

The corruption of the heap is indicative of a buffer overrun.  I think I have been able to pinpoint the source.  I have an RTL linked list that tracks the state of sub-operations within a larger RPC client driven operation.  It turned out there were two latent conditions where this list was not correctly locked: the first was when a sub-operation failed to initialize and I would remove it from the list without taking the lock, and the second was when sub-operations in other threads would send a state update (which causes the lock to be taken) at the same moment the list was getting torn down.  Obviously, removing elements from the list while some other thread is actively traversing it can leave the other thread following bad Flinks.  Fixing these issues should make this corruption go away.  This was a very rare repro on x86 and AMD64, but apparently common on ARM.
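
The shape of the fix is roughly the following.  This is a sketch, not my actual component code: SubOp and SubOpList are hypothetical stand-ins for a LIST_ENTRY-style circular list, and the lock is a CRITICAL_SECTION on Windows with a pthread mutex fallback so the sketch builds elsewhere too.  The point is simply that every insert, remove, and traversal goes through the same lock, including the failure path that removes a sub-operation.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION list_lock_t;
static void lock_init(list_lock_t *l)    { InitializeCriticalSection(l); }
static void lock_acquire(list_lock_t *l) { EnterCriticalSection(l); }
static void lock_release(list_lock_t *l) { LeaveCriticalSection(l); }
#else
#include <pthread.h>
typedef pthread_mutex_t list_lock_t;
static void lock_init(list_lock_t *l)    { pthread_mutex_init(l, NULL); }
static void lock_acquire(list_lock_t *l) { pthread_mutex_lock(l); }
static void lock_release(list_lock_t *l) { pthread_mutex_unlock(l); }
#endif

/* Hypothetical sub-operation node with Flink/Blink style links. */
typedef struct SubOp {
    struct SubOp *flink;
    struct SubOp *blink;
    int state;
} SubOp;

typedef struct {
    SubOp head;        /* sentinel; an empty list points at itself */
    list_lock_t lock;  /* guards every access to the links */
} SubOpList;

void list_init(SubOpList *l)
{
    l->head.flink = l->head.blink = &l->head;
    l->head.state = 0;
    lock_init(&l->lock);
}

void list_insert_tail(SubOpList *l, SubOp *op)
{
    lock_acquire(&l->lock);
    op->blink = l->head.blink;
    op->flink = &l->head;
    l->head.blink->flink = op;
    l->head.blink = op;
    lock_release(&l->lock);
}

/* Removal takes the lock too; this is the step my failure path skipped. */
void list_remove(SubOpList *l, SubOp *op)
{
    assert(op != &l->head);
    lock_acquire(&l->lock);
    op->blink->flink = op->flink;
    op->flink->blink = op->blink;
    op->flink = op->blink = op;  /* leave the node self-linked */
    lock_release(&l->lock);
}

/* Traversal holds the lock for the whole walk so no Flink changes under it. */
int list_count(SubOpList *l)
{
    int n = 0;
    lock_acquire(&l->lock);
    for (SubOp *p = l->head.flink; p != &l->head; p = p->flink)
        n++;
    lock_release(&l->lock);
    return n;
}
```

Teardown needs the same treatment: drain the list under the lock and make sure no thread can still post a state update before the lock itself is destroyed.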

RTL linked lists

critical sections