[delay load]
Windows has the ability to delay load DLLs, which comes in handy in a number of scenarios. By default, the Windows loader loads all of a DLL's or EXE's dependencies right away: it walks the binary's import table and maps in each dependency, then recursively maps in the dependencies' dependencies until every imported function is resolved in memory.
The first and most obvious use case for delay loading is getting your binary running sooner. Some of its dependencies could be large and might need to be read from disk; if your binary doesn't need them to start, waiting for them to load adds a lot of unnecessary delay. The loader actually does a lot of smart things to mitigate long load times, but it's still good to delay load dependencies as much as possible.
Another useful, less obvious reason is to handle missing components and to break dependencies.
Handling missing components: your code references components but runs in different environments where they may or may not exist. By default, the loader tries to load them up front; in environments where the dependencies don't exist, that fails and your wmain or DllMain is never called. With delay load, your code can handle missing features more gracefully.
For example, I once wrote a service that runs on all versions of Windows. One of the things it does is act a bit like a user-mode bus, creating devnode trees. At certain points I needed to remove devices and clean up the tree. The typical, complete way to do this is DIF_REMOVE through SetupAPI, because it invokes the full Desktop device installation process. The only problem is that SetupAPI is not in every Windows SKU. The CfgMgr32 (CM_*) APIs are, but they don't handle full Desktop device installation. My solution was to delay load setupapi.dll. My code has two implementations for cleaning devnode trees: one using SetupAPI calls and one using CM_* calls. At runtime, I check whether SetupAPI is available (there is public documentation on this). If it is, I am on a Desktop-type SKU and I use it. If it is not, as on a OneCore-based SKU, I use the CM version. My service runs either way.
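A minimal sketch of what that runtime choice can look like. The two wrapper functions are hypothetical stand-ins for the real cleanup code; the probe just asks the loader whether setupapi.dll is present before picking a path:

#include <windows.h>

// Hypothetical wrappers; the real cleanup code is elided.
static void RemoveDevnodeTreeWithSetupApi() { /* SetupDiCallClassInstaller(DIF_REMOVE, ...) */ }
static void RemoveDevnodeTreeWithCmApi()    { /* CM_Query_And_Remove_SubTree(...), etc. */ }

void RemoveDevnodeTree()
{
    // Probe for setupapi.dll. On a Desktop-type SKU this succeeds; on a
    // OneCore-based SKU it fails and we fall back to the CM_* path.
    HMODULE setupApi = LoadLibraryExW(L"setupapi.dll", nullptr,
                                      LOAD_LIBRARY_SEARCH_SYSTEM32);
    if (setupApi != nullptr)
    {
        FreeLibrary(setupApi);  // the delay-loaded import maps it again on first use
        RemoveDevnodeTreeWithSetupApi();
    }
    else
    {
        RemoveDevnodeTreeWithCmApi();
    }
}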
How does delay load work? Normally, the loader has to resolve and load every referenced import when a binary is mapped into memory. For a component to be delay loaded, there is a delay load handler that stands in for its exports. Core OS components do this; the handlers are linked into kernel32, so they still exist even when the actual implementation DLL is not in an OS image, and they typically return E_NOTIMPL or similar. Kernel32 is always loaded, so the delay load handler can depend on it as a fallback implementation. Later, when a binary actually calls one of these functions, the delay load handler tries to load the real DLL and maps in the real function pointer.
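On the consuming side, you typically just tell the linker which DLL to delay load, and you can optionally install a hook to observe or handle failures. A minimal sketch, assuming the module is linked with /DELAYLOAD:setupapi.dll and delayimp.lib; DelayLoadFailureHook is my name for the hook, not a Windows one:

#include <windows.h>
#include <delayimp.h>

// Failure hook: the delay-load helper calls this when it cannot find the DLL
// or the export. Returning nullptr lets the helper raise its normal error; a
// real handler could return a stub FARPROC instead.
static FARPROC WINAPI DelayLoadFailureHook(unsigned dliNotify, PDelayLoadInfo pdli)
{
    if (dliNotify == dliFailLoadLib)
    {
        // pdli->szDll names the DLL that could not be loaded, e.g. "setupapi.dll".
    }
    else if (dliNotify == dliFailGetProc)
    {
        // pdli->dlp describes the export that could not be resolved.
    }
    return nullptr;
}

// Wire the hook into the delay-load helper provided by delayimp.lib.
ExternC const PfnDliHook __pfnDliFailureHook2 = DelayLoadFailureHook;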
[API sets]
API sets work in a similar way and allow the OS to handle versioning of DLLs and their exports. In some cases an API set behaves much like a delay load handler and might just return E_NOTIMPL. In other cases, OS components can use them to handle changes in the OS's ABI behavior. For example, an API might get a fix that corrects some bug, call it foo-1-1, but for compatibility reasons some callers depend on the buggy behavior (yes, it happens often), call it foo-1-0. The current version of foo.dll can implement handling for both clients. An API set allows the OS to detect what the client is expecting and map in the correct behavior.
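One way to see the indirection is to resolve a function through a versioned contract name rather than the host DLL. A small sketch; the particular contract and export here are just a well-known pair used for illustration:

#include <windows.h>
#include <stdio.h>

int main()
{
    // API set contract names resolve like any other DLL name; the loader maps
    // the versioned contract to whatever host DLL implements it on this SKU.
    HMODULE contract = LoadLibraryW(L"api-ms-win-core-processthreads-l1-1-0.dll");
    if (contract == nullptr)
    {
        printf("contract not present on this SKU, error %lu\n", GetLastError());
        return 1;
    }

    typedef DWORD (WINAPI *GetPidFn)(void);
    GetPidFn getPid = (GetPidFn)GetProcAddress(contract, "GetCurrentProcessId");
    if (getPid != nullptr)
    {
        printf("resolved through the contract, pid = %lu\n", getPid());
    }
    FreeLibrary(contract);
    return 0;
}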
[componentization and layering]
A lot of good work has gone into componentizing Windows over the past two decades, but there was a time, say over 15 years ago, when some Windows OS components would assume anything on a Desktop-type SKU was fair game to use. For example, a low-level security API in Advapi32 might have loaded a DLL from the shell (UI) because it had a useful function. As a developer, it doesn't make sense that using a low-level Advapi32 API would also cause a whole mess of seemingly unrelated DLLs to get loaded, just because a shell DLL happened to be used which then depended on a bunch of other things completely unrelated to the security API you were trying to use. Over time this was componentized and split apart, and the OS now does a much better job of avoiding circular dependencies and of keeping low-level components from calling into, or depending on, higher-level components. It is OK for the shell to call into ntdll, but not for ntdll to call into the shell. Things Advapi32 used from higher levels, like the shell, had to be split out and brought down to the lower level. That allows low-level Advapi32 APIs to use those old shell features without calling up to a higher OS layer.

There are still high-level things Advapi32 does that might only work on a full Desktop SKU, but for low-level OS work, the APIs can run on a low-level SKU without a shell or other high-level components. This is made possible by all of the hard refactoring and componentization work, but also by the other OS building blocks I mentioned, like API sets and delay load. On a Desktop SKU, the full Advapi32 gets loaded and you can use all of its APIs. On lower-level SKUs, you can still use the low-level Advapi32 APIs, but the higher-level ones won't work. If your component never actually uses the higher-level features, then it is fine. Depending on how it is implemented, it might fail in different ways at runtime, but generally it can be handled like my SetupAPI example while keeping proper layering.
[umbrella libs]
These days, it is easy to do the right thing with umbrella libs. In the old days, instead of linking the full Advapi32.lib, you'd have to link the lower-level API set version, e.g. foo-onecore-1.lib instead of foo.lib, to avoid loader errors on OneCore SKUs, since linking the full version would try to load the full Desktop version of the component. Now you can simply link onecore.lib and all of the API sets that work on OneCore are linked in. If you then get linker errors, the thing you are calling is not part of OneCore and would have caused a loader error later had you run it on a OneCore SKU.
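In practice that can be as simple as pointing the linker at the umbrella library. A sketch, assuming a program that only uses OneCore-compatible APIs:

#include <windows.h>

// Link against the OneCore umbrella library instead of the individual import
// libs (kernel32.lib, advapi32.lib, ...). Anything not in the OneCore API
// surface then fails at link time instead of as a loader error on a OneCore
// SKU. The usual guidance is not to mix onecore.lib with the classic import
// libraries; setting it in the project's linker inputs works just as well.
#pragma comment(lib, "onecore.lib")

int wmain()
{
    // A plain Win32 call that is part of the OneCore surface.
    OutputDebugStringW(L"hello from a OneCore-compatible binary\n");
    return 0;
}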
[investigating loader failures]
Getting back to why I am writing this: investigating loader failures. What the loader does is complicated, and that is why loader snaps exist. This is how I investigate these kinds of issues.
1. Enable loader snaps. You can do that for your binary with gflags.exe -i foo.dll +sls on the command line, or from the user-mode debugger (windbg), or even KD, with !gflag +sls
2. Run the scenario and watch for the loader error. Note: the output is verbose, so try to keep loader snaps enabled for as short a window as possible, or break in as soon as the failure hits.
3. Analyze the error.
The output will point to the function and DLL the loader is trying to resolve and load.
03a0:0704 @ 11616265 - LdrpLoadDllInternal - RETURN: Status: 0x00000000
03a0:0704 @ 11616265 - LdrpLoadDllInternal - RETURN: 0
03a0:0704 @ 11616265 - LdrLoadDll - RETURN: Status: 0x00000000
03a0:0704 @ 11616265 - LdrLoadDll - RETURN: 0
03a0:0704 @ 11616265 - LdrpGetProcedureAddress - INFO: Locating procedure "AppPolicyGetProcessTerminationMethod" by name
03a0:0704 @ 11616265 - LdrGetDllHandleEx - ENTER: DLL name: mscoree.dll
03a0:0704 @ 11616265 - LdrGetDllHandleEx - ENTER: mscoree.dll
03a0:0704 @ 11616265 - LdrpFindLoadedDllInternal - RETURN: Status: 0xc0000135
03a0:0704 @ 11616265 - LdrpFindLoadedDllInternal - RETURN: c0000135
03a0:0704 @ 11616265 - LdrGetDllHandleEx - RETURN: Status: 0xc0000135
03a0:0704 @ 11616265 - LdrGetDllHandleEx - RETURN: c0000135
4. Resolve the missing dependency. You now know what is missing.
If it is supposed to be there, for example a DLL that you own or that is part of your package, maybe you forgot to install it or copy it to the PC. Or maybe it is not in the loader's search path; the typical search path is the obvious places like c:\windows, .\system32\, or .\, next to the binary trying to load it (see the search-path sketch at the end of this post).
If it is not supposed to be there, why is your code using it?
This might be easy to answer: maybe you linked the full Advapi32 import library, or you are trying to use one of the Desktop-only APIs on a OneCore SKU. If that is the case, you'll have to do something else, like how I used the CM_* APIs instead of SetupAPI on OneCore.
If it is not something you think your code should be using, again, you could be linking the wrong API set, or you might not be using delay load correctly. The new umbrella libs make the "wrong API set" issue easy to fix.
If you think the failure should be avoided by using delay load, make sure your linker is delay loading the correct API set. Search for the procedure name across the API sets, then set the owning DLL to delay load. For example, if "FooFunction" from "foo.dll" is not really used but causes a loader failure as soon as your module loads, it is not being delay loaded. Find which API set it belongs to and delay load that. FooFunction might be part of the foo-1.dll API set; set that to delay load. Then you won't get a loader failure while your module is loading, though it would still fail later if you happen to use the function, even indirectly.
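One more note on the "maybe it is not in the loader's search path" case from step 4: if the missing DLL is one you own, you can make the search explicit instead of relying on the default search order. A minimal sketch; the directory and DLL name here are hypothetical:

#include <windows.h>

HMODULE LoadPrivateDependency()
{
    // Add a private directory to the search list, then load with the
    // LOAD_LIBRARY_SEARCH_* flags so the loader looks there (plus the
    // application directory and System32). Requires Windows 8+, or the
    // KB2533623 update on Windows 7.
    AddDllDirectory(L"C:\\Program Files\\MyService");
    return LoadLibraryExW(L"mydependency.dll", nullptr,
                          LOAD_LIBRARY_SEARCH_DEFAULT_DIRS);
}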