Sam's Code: March 2012

Thursday, March 29, 2012

Targeting Only Specific Modules with Appverifier

I have always been a proponent of using Appverifier whenever you are debugging your code with Windbg, but let's say you keep hitting verifier breaks in other modules that you aren't currently trying to debug. Fixing all verifier breaks is good practice, and it is good to help other teams debug their components, but sometimes you just want those breaks to go away so you can focus on your code. After enabling Appverifier, you can do just that in the debugger.

Let's say you only care about foo.dll. This is how you can have verifier enabled only on that module.

0:005> !avrf -skp all
Verifier package version >= 3.00
Exclusion ranges and suspend period have been reset.
0:005> !avrf -trg foo
It is that simple.

From Windbg help on !avrf.

-trg [ Start End | dll Module | all ]

Specifies a target range. Start is the beginning address of the target range. End is the ending address of the target range. Module specifies the name (including the .exe or .dll extension, but not including the path) of a module to be targeted. If you enter -trg all, all target ranges are reset. If you enter -trg with no additional parameters, the current target ranges are displayed.

-skp [ Start End | dll Module | all | Time ]

Specifies an exclusion range. Start is the beginning address of the exclusion range. End is the ending address of the exclusion range. Module specifies the name (including the .exe or .dll extension, but not including the path) of a module to be excluded. If you enter -skp all, all target ranges or exclusion ranges are reset. If you enter a Time value, all faults are suppressed for Time milliseconds after execution resumes.

Tuesday, March 20, 2012

Windows Kernel Function Prefixes

When kernel debugging, you will see a lot of Windows kernel internal function names with short prefixes, usually two letters (assuming you have symbols). Knowing what the prefixes mean can help you figure out what is going on. I will give you a quick rundown of some of the basics.

Common Prefixes:

Cc	Cache manager
Cm	Configuration manager
Ex	Executive support routines
FsRtl	File system driver run time lib
Hal	Hardware abstraction layer
Io	IO manager
Ke	Kernel
Lpc	Local procedure call
Lsa	Local security authority
Mm	Memory manager
Nt	System services
Ob	Object manager
Po	Power manager
Pp	PnP manager
Ps	Process support
Rtl	Runtime lib
Se	Security
Wmi	Windows management instrumentation
Zw	Kernel version of Nt functions

Within these prefixes, there are variations to denote internal functions (second letter changed to an 'i') or private functions (an extra 'p' tacked on the end of the prefix). For instance, an internal PnP function would have the Pi prefix instead of Pp.
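To make the naming rule concrete, here is a minimal sketch that classifies a symbol name using the table above. DescribeSymbol and StartsWord are hypothetical helpers, not part of any debugger API. It assumes the prefix ends where the next capitalized word begins, and it only handles the plain prefix and the private 'p' variant; the 'i' variant is left out because several prefixes would collapse to the same spelling.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

struct PrefixEntry {
    std::string_view prefix;
    std::string_view component;
};

// Prefix table copied from the list in this post.
constexpr std::array<PrefixEntry, 19> kPrefixes = {{
    {"Cc", "Cache manager"},
    {"Cm", "Configuration manager"},
    {"Ex", "Executive support routines"},
    {"FsRtl", "File system driver run time lib"},
    {"Hal", "Hardware abstraction layer"},
    {"Io", "IO manager"},
    {"Ke", "Kernel"},
    {"Lpc", "Local procedure call"},
    {"Lsa", "Local security authority"},
    {"Mm", "Memory manager"},
    {"Nt", "System services"},
    {"Ob", "Object manager"},
    {"Po", "Power manager"},
    {"Pp", "PnP manager"},
    {"Ps", "Process support"},
    {"Rtl", "Runtime lib"},
    {"Se", "Security"},
    {"Wmi", "Windows management instrumentation"},
    {"Zw", "Kernel version of Nt functions"},
}};

// True when position pos starts a capitalized word, which is how names such
// as IoCreateDevice separate the prefix from the rest (an assumption).
static bool StartsWord(std::string_view s, std::size_t pos) {
    return pos < s.size() &&
           std::isupper(static_cast<unsigned char>(s[pos])) != 0;
}

// Hypothetical helper: returns the component for a symbol, marks the private
// 'p' variant, and returns "unknown" when no prefix in the table fits.
std::string DescribeSymbol(std::string_view symbol) {
    assert(!symbol.empty());  // callers pass a real symbol name
    for (const PrefixEntry& e : kPrefixes) {
        const std::size_t n = e.prefix.size();
        if (symbol.substr(0, n) != e.prefix) continue;
        if (StartsWord(symbol, n)) return std::string(e.component);
        if (n < symbol.size() && symbol[n] == 'p' && StartsWord(symbol, n + 1))
            return std::string(e.component) + " (private)";
    }
    return "unknown";
}
```

With this sketch, DescribeSymbol("IoCreateDevice") yields the IO manager entry, and a name such as PspCreateThread is reported as a private Process support function.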

Monday, March 19, 2012

Windows System Events + Windbg Debugging

Events are great ways to synchronize threads, processes, and even UM and KM code. Creating and waiting for events is almost always the best (most efficient) way to synchronize. Spin waiting is generally bad (e.g. do {/*empty*/} while (!bFlag);), and spinning with a sleep is worse because it will surely yield the CPU, losing at least an order of magnitude more cycles.

What is an event? The simple answer is it is a kernel object. At the fundamental level, it is a data structure in the kernel. In fact, because an event is the simplest kernel object, its structure is the header for all other kernel objects.

Basics:

You create an event with CreateEvent. Creating an event, if successful, returns a handle to the corresponding kernel object. You close an event by closing its handle with CloseHandle, like you would for any other kernel object. To signal the event, you use SetEvent, and to reset it, you use ResetEvent if it was created as a manual-reset event. Nothing too complicated here. To wait for events, you simply use the normal wait functions like WaitForSingleObject. Obviously, after you create an event, you must close the handle when you are done with it. You must also close every handle you get by duplicating one. Most of this applies to all kernel objects because, as I mentioned before, the headers of their structures are the same (they are the same as events).
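The set, reset, and wait semantics described above can be sketched in portable C++. This is only a stand-in so the example compiles anywhere: it does not call CreateEvent, SetEvent, ResetEvent, or WaitForSingleObject, and it is not how the kernel implements events. ManualResetEvent and ProduceAndWait are hypothetical names, and std::condition_variable is swapped in for the real kernel object.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Portable stand-in for a manual-reset event (assumption: it mirrors only
// the signaled/non-signaled behavior, nothing about handles or the kernel).
class ManualResetEvent {
public:
    // Plays the role of SetEvent: mark signaled and wake every waiter.
    void Set() {
        {
            std::lock_guard<std::mutex> lock(m_);
            signaled_ = true;
        }
        cv_.notify_all();
    }

    // Plays the role of ResetEvent: back to non-signaled.
    void Reset() {
        std::lock_guard<std::mutex> lock(m_);
        signaled_ = false;
    }

    // Plays the role of WaitForSingleObject with a timeout.
    // Returns true if the event was signaled, false on timeout.
    bool WaitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m_);
        return cv_.wait_for(lock, timeout, [this] { return signaled_; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// The waiter blocks on the event instead of spinning on a flag.
int ProduceAndWait() {
    ManualResetEvent ready;
    int result = 0;
    std::thread worker([&] {
        result = 42;   // publish the data first
        ready.Set();   // then signal the waiter
    });
    const bool signaled = ready.WaitFor(std::chrono::seconds(5));
    worker.join();
    assert(signaled);
    (void)signaled;
    return result;
}
```

The waiting thread sleeps inside WaitFor until Set is called, which is the property that makes events cheaper than a spin loop. In real Win32 code the handle returned by CreateEvent would also need a matching CloseHandle.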

Debugging:

Many common handle bugs are easily found by just doing a code review. For instance, for each event create, make sure there is a close handle. Make sure you don't close the handle before you are done using it. Etc. If a code review doesn't do it for you, application verifier can auto detect many handle related bugs. Use it, love it.

Windbg: regular debugging can help a lot to validate your expectations. Beyond that, the debugger extension !handle is your friend. It can show detailed information about your handles. Some of the information about a handle is only available in KM.

0:000> !handle

Handle 4
Type          Section
Handle 8
Type          Event
Handle c
Type          Event
Handle 10
Type          Event
Handle 14
Type          Directory
Handle 5c
Type          File
6 Handles
Type            Count
Event           3
Section         1
File            1
Directory       1

0:002> !handle 160 7

Handle 160
Type         Event
Attributes   0
GrantedAccess 0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
HandleCount 2
PointerCount 65
Name

Wednesday, March 7, 2012

How to Set Up a KD (Kernel Debugger) in Windows With 1394 or Over the Network

Let's say you are starting to write drivers and need some kernel mode (KM) debugging, or let's say you've decided that user mode (UM) debugging using windbg on the host is for sissies. In this post I will show you how to set up a KD.

Assumptions:
You will need two machines: the TARGET machine that you want to debug, and the HOST machine from which you will be doing the actual debugging.

First thing you need to do is install windbg on both the target and host. You can find the installer here.

Pick Your KD Method:
Next decide what kind of debugging you want to do. The options are:
NET (i.e. debugging over a TCP/IP network just using NICs) (supported on Win8+),
1394 (supported on WinXP+),
COM (serial) (supported since the dawn of KD), or
USB (2.0 supported on Vista+, 3.0 in Win8+)

Generally the port you use is decided for you based on what OS you need to debug and what hardware your machines have. I will make it simple: use 1394 (aka FireWire) if you can, or net if the machines aren't close.

If your two machines are next to each other, favor 1394. If you are going to be kernel debugging often and don't have 1394 in your machines, buy some cards. 1394 is simple and fast.

If your target isn't close to your debugger machine, use net (short for network) debugging, but note it is a Win8+ feature at the moment. Net debugging is also great for getting someone else to remote debug something. Also, in Win8, over 90% of the mainstream NICs are supported for net debugging; most Intel, Broadcom, and Realtek NICs are supported.

COM is slow, but works assuming your machines have serial ports.

USB might be a choice if your USB controllers support kernel debugging. In my experience, they rarely do. This is especially true when the machine doesn't have 1394 and you can't net debug. You are kind of screwed at this point. The joke is even funnier when you do find a port that does support KD, but it is internally wired to the built-in webcam or doesn't have an external port.

Setting Up a 1394 KD

TARGET

open a command prompt
bcdedit -debug on
bcdedit -dbgsettings 1394 channel:1
- you will have to pass bus params if you have more than one 1394 controller
- channel can be 1-62
reboot

HOST

plug in 1394 cable into target and host
open a command prompt
kd -k 1394:channel=1

windbg works instead of kd as well

Setting Up a NET KD

TARGET

open a command prompt
bcdedit -dbgsettings net hostip:192.168.1.11 port:50000
- for hostip, put your machine's IP instead of 192.168.1.11
- you can pick whatever port you want as long as it is between 49152 and 65535.
It will output something like:
"Key=aaaaaaaaaaaaa.vvvvvvvvvvvvv.yyyyyyyyyyyyy.xxxxxxxxxxxxx"
Save that string in a text file on a thumb drive or network share; you will need it again on the host
bcdedit -debug on
reboot

HOST

open a command prompt
windbg -k net:port=50000,key=aaaaaaaaaaaaa.vvvvvvvvvvvvv.yyyyyyyyyyyyy.xxxxxxxxxxxxx

you can use kd instead of windbg if you want

Friday, March 2, 2012

Debugging Heap Failures

Sometimes heap failures can be mysterious, but they don't have to be. For instance, today I got this kd (kernel debugger) break.

////////////////////////////////////////////////////////////////////////////
Output of !analyze -v

*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************

Loading symbols for 680c0000 component.dll -> component .dll
Loading symbols for 75fb0000 KERNEL32.DLL -> KERNEL32.DLL
Force unload of C:\Windows\SYSTEM32\user32.dll
Loading symbols for 76680000 user32.dll -> user32.dll
ModLoad: 76680000 767a1000 C:\Windows\SYSTEM32\user32.dll
Force unload of C:\Windows\system32\ole32.dll
Loading symbols for 76560000 ole32.dll -> ole32.dll
ModLoad: 76560000 76672000 C:\Windows\system32\ole32.dll
Debugger Dbgportaldb Connection::Open failed 80040e4d
Database Dbgportaldb not connected

FAULTING_IP:
ntdll!RtlReportCriticalFailure+33
001b:77d912d6 cc int 3

EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 77d912d6 (ntdll!RtlReportCriticalFailure+0x00000033)
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 3
Parameter[0]: 00000000
Parameter[1]: 83b70d40
Parameter[2]: 0000fffd

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.

EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid

EXCEPTION_PARAMETER1: 00000000

EXCEPTION_PARAMETER2: 83b70d40

EXCEPTION_PARAMETER3: 0000fffd

NTGLOBALFLAG: 0

APPLICATION_VERIFIER_FLAGS: 0

APP: ntkrpamp.exe

LAST_CONTROL_TRANSFER: from 77d924a1 to 77d912d6

FAULTING_THREAD: ffffffff

BUGCHECK_STR: APPLICATION_FAULT_ACTIONABLE_HEAP_CORRUPTION_heap_failure_lfh_bitmap_mismatch

PRIMARY_PROBLEM_CLASS: ACTIONABLE_HEAP_CORRUPTION_heap_failure_lfh_bitmap_mismatch

DEFAULT_BUCKET_ID: ACTIONABLE_HEAP_CORRUPTION_heap_failure_lfh_bitmap_mismatch

STACK_TEXT:
0225f520 77d924a1 c0000374 77dc0130 0225f564 ntdll!RtlReportCriticalFailure+0x33
0225f530 77d9168d 00000002 934d01f7 00000016 ntdll!RtlpReportHeapFailure+0x21
0225f564 77d6434b 0000000e 00ce0000 00dd22d0 ntdll!RtlpLogHeapFailure+0xa2
0225f604 77cc3561 00000050 00fa6728 00000000 ntdll!RtlpLowFragHeapAllocFromContext+0x2d4
0225f68c 680ca40f 00ce0000 00000000 00000050 ntdll!RtlAllocateHeap+0x105
0225f6d4 680c981a 00fa6728 00d4b774 00000000 component !DoMoreWork+0x37
0225f768 680c8f84 00000001 00d4b774 00000000 component !DoWork+0xce
0225f918 680d0169 00d27c48 02483530 02483590 component !Query+0x12b
0225f92c 77cf5935 00d4d170 00000000 02483530 component !WorkDispatchThreadProc+0x99
0225fa70 77cd9139 0225fad4 02483590 934d08cb ntdll!TppWorkpExecuteCallback+0x338
0225fc58 75fb2a32 00cef980 0225fca4 77d0cdfe ntdll!TppWorkerThread+0x6da
0225fc64 77d0cdfe 00cef980 934d0837 00000000 KERNEL32!BaseThreadInitThunk+0xe
0225fca4 77d0cdaa ffffffff 77d88566 00000000 ntdll!__RtlUserThreadStart+0x4a
0225fcb4 00000000 77d0d633 00cef980 00000000 ntdll!_RtlUserThreadStart+0x1c

...

1: kd> .frame 2
02 0225f564 77d6434b ntdll!RtlpLogHeapFailure+0xa2 [d:\5858\minkernel\ntos\rtl\heaplog.c @ 672]
1: kd> dv
0225f56c FailureType = heap_failure_lfh_bitmap_mismatch (0n14)
0225f570 HeapAddress = 0x00ce0000
0225f574 Address = 0x00dd22d0
0225f578 Param1 = 0x00000000
0225f57c Param2 = 0x00000000
0225f580 Param3 = 0x00000000

////////////////////////////////////////////////////////////////////////////////

My code in question is uninteresting:

pData = (PDATA)HeapAlloc(GetProcessHeap(), 0, sizeof(DATA) * m_cData);
if (!pData) {
    hr = E_OUTOFMEMORY;
    goto Exit;
}

//////////////////////////////////////////////////////////////////////

So what is a HEAP_FAILURE_LFH_BITMAP_MISMATCH? I wasn't sure, so I had to find out. First off, LFH refers to the Low-Fragmentation Heap, which became the default in Windows Vista; you can read more about it here. Basically the LFH uses bitmaps to track whether blocks are free or busy. This information is also available in each LFH block's metadata. This failure indicates that the two disagree about a block's busy status, which means the heap is corrupted. This still doesn't solve the mystery. Luckily there is a debugger extension to help us find more clues: !heap.
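Before looking at the !heap output, here is a toy model of the consistency check described above. It is only a sketch under stated assumptions: the real LFH subsegment and block header layouts are not shown in this post, so ToySubsegment, ToyAllocate, FindMismatch, and CorruptAndCheck are all hypothetical names, and a plain bool stands in for each block's metadata.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <optional>

// Toy model only. The block count and field names are made up.
constexpr std::size_t kBlocks = 8;

struct ToySubsegment {
    std::bitset<kBlocks> busyBitmap;         // subsegment view: bit set == busy
    std::array<bool, kBlocks> headerBusy{};  // each block's own metadata
};

// A healthy allocator records "busy" in both places.
void ToyAllocate(ToySubsegment& s, std::size_t i) {
    assert(i < kBlocks);
    s.busyBitmap.set(i);
    s.headerBusy[i] = true;
}

// Returns the index of the first block whose two records disagree, if any.
std::optional<std::size_t> FindMismatch(const ToySubsegment& s) {
    for (std::size_t i = 0; i < kBlocks; ++i) {
        if (s.busyBitmap.test(i) != s.headerBusy[i]) return i;
    }
    return std::nullopt;
}

// Simulates a stray write that clobbers one block's header, then runs the
// check the way the next allocation from that subsegment would.
std::optional<std::size_t> CorruptAndCheck(std::size_t victim) {
    ToySubsegment s;
    ToyAllocate(s, victim);
    s.headerBusy[victim] = false;  // header says "free", bitmap says "busy"
    return FindMismatch(s);
}
```

The point of the model is that the check fires at the next allocation, not at the moment of the bad write, which is why the stack in the break points at innocent allocating code.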

/////////////////////////////////////////////////////////////

1: kd> !heap -triage ce0000
**************************************************************
* *
* HEAP ERROR DETECTED *
* *
**************************************************************

Details:

Heap address: 00ce0000
Error address: 00dd22d0
Error type: HEAP_FAILURE_LFH_BITMAP_MISMATCH
Details: The LFH detected a mismatch between an individual
block's metadata and its corresponding subsegment's
metadata.
Follow-up: Enable pageheap.
Error type: Unrecognized failure.
Follow-up: This may be a bug in the extension. Send a
remote or dump to ______.

Stack trace:
77d6434b: ntdll!RtlpLowFragHeapAllocFromContext+0x000002d4
77cc3561: ntdll!RtlAllocateHeap+0x00000105
680ca40f: component!DoMoreWork+0x00000037
680c981a: component!DoWork+0x000000ce
680c8f84: component!Query+0x0000012b
680d0169: component!WorkDispatchThreadProc+0x00000099
77cf5935: ntdll!TppWorkpExecuteCallback+0x00000338
77cd9139: ntdll!TppWorkerThread+0x000006da
75fb2a32: KERNEL32!BaseThreadInitThunk+0x0000000e
77d0cdfe: ntdll!__RtlUserThreadStart+0x0000004a
77d0cdaa: ntdll!_RtlUserThreadStart+0x0000001c

**********************************************************
** !heap: Searching for the heap and segment that
** contain the specified address. To search
** for the entry that contains this address,
** use !heap -x 00ce0000.
**********************************************************

** !heap: Analyzing heap at 00ce0000...

** !heap: The following LFH allocations are missing a flag in their
'unused bytes' field that identifies them as LFH allocations.
This is usually caused by entry corruption in the client
application.
** !heap: To view the state of the invalid blocks:
!heap -i
!heap -i

Heap address Entry address Unused bytes
----------------------------------------------------------------------------
ce0000 dd1b98 49
ce0000 dd22d0 49
ce0000 dd2590 49
ce0000 dd1b98 49
ce0000 dd22d0 49
ce0000 dd2590 49
ce0000 dd1b98 49
ce0000 dd22d0 49
ce0000 dd2590 49
ce0000 dd1b98 49
ce0000 dd22d0 49
ce0000 dd2590 49

** !heap: If these failures are easily reproducible, they can
be detected as they occur by enabling pageheap for
this scenario.

1: kd>

///////////////////////////////////////////////////////////////////////////////////////

Ok, still no luck. I will try enabling pageheap if this issue is reproducible.

This is how you enable page heap verification.

//////// update /////////

The corruption of the heap is indicative of a buffer overrun. I think I have been able to pinpoint the source. I have an RTL linked list that tracks the state of sub-operations within a larger RPC client driven operation. It turned out there were two latent conditions where this list was not correctly locked: the first was when a sub-operation failed to initialize and I would remove it from the list without taking the lock, and the second was when sub-operations in other threads would send a state update (which causes the lock to be taken) at the same moment the list was getting torn down. Obviously, removing elements from the list while some other thread is actively traversing it can leave the other thread following bad Flinks. Fixing these issues should make this corruption go away. This was a very rare repro on x86 and AMD64, but apparently common on ARM.
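The shape of the fix can be sketched as follows. This is not my actual code: std::list and std::mutex are swapped in for the RTL linked list and its lock, and SubOperationTracker, Op, and LateUpdateIsRejected are hypothetical names. The idea it shows is that every path that touches the list, including the failure path and tear-down, takes the same lock, and that a late state update finds nothing instead of walking freed nodes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical tracker for sub-operations; all list access is serialized.
class SubOperationTracker {
public:
    void Add(int id) {
        std::lock_guard<std::mutex> lock(lock_);
        ops_.push_back(Op{id, 0});
    }

    // Failure path: removal must hold the lock too.
    bool Remove(int id) {
        std::lock_guard<std::mutex> lock(lock_);
        auto it = Find(id);
        if (it == ops_.end()) return false;
        ops_.erase(it);
        return true;
    }

    // State update from another thread: traverses under the same lock and
    // reports failure if the entry is already gone.
    bool UpdateState(int id, int state) {
        std::lock_guard<std::mutex> lock(lock_);
        auto it = Find(id);
        if (it == ops_.end()) return false;
        it->state = state;
        return true;
    }

    // Tear-down empties the list while holding the lock.
    void TearDown() {
        std::lock_guard<std::mutex> lock(lock_);
        ops_.clear();
    }

    std::size_t Count() {
        std::lock_guard<std::mutex> lock(lock_);
        return ops_.size();
    }

private:
    struct Op {
        int id;
        int state;
    };

    // Caller must already hold lock_.
    std::list<Op>::iterator Find(int id) {
        return std::find_if(ops_.begin(), ops_.end(),
                            [id](const Op& op) { return op.id == id; });
    }

    std::mutex lock_;
    std::list<Op> ops_;
};

// Replays the two conditions in a fixed, single-threaded order: a failed
// sub-operation is removed, the list is torn down, then a late update arrives.
bool LateUpdateIsRejected() {
    SubOperationTracker tracker;
    tracker.Add(1);
    tracker.Add(2);
    const bool removed = tracker.Remove(1);
    assert(removed);
    (void)removed;
    tracker.TearDown();
    return !tracker.UpdateState(2, 7) && tracker.Count() == 0;
}
```

The demo runs on one thread to keep it deterministic; in the real scenario the calls arrive from different threads, and the shared lock is what keeps a traversal from overlapping a removal.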

RTL linked lists

critical sections
