In the broader sense, code injection is a process that allows performing operations from the context of another component, typically without interfering with its functionality. Such techniques are interesting from the security perspective because they allow attackers to blend in additional payloads within existing logic, making it harder to distinguish between the two. In this write-up, we analyze the fundamental mechanisms used during code injection and try to understand the underlying decision-making that might accompany the development of future techniques.

The ability to inject code into running processes can be beneficial or essential, depending on the goals in mind. While a discussion of the reasons to perform code injection is outside the scope of this project, here are a few possible answers to this question:

  • Debugging.
  • Installing user-mode hooks.
  • Subscribing to synchronous in-process callbacks such as thread creation and DLL load notifications.
  • Bypassing per-process restrictions for specific operations. The operating system or security software can enforce constraints on calling various functions, limiting their use to particular processes. Firewall rules are one of the many examples of such restrictions.

Depending on the complexity of the task and other requirements, we might choose one of the two primary formats for hosting our code:

  • Using shellcode provides a relatively stealthy way to execute code without leaving traces on the filesystem and, potentially, without triggering detection. It can come in handy for installing simple hooks or issuing single function calls but doesn’t scale well.
  • Relying on Dynamic Link Libraries (DLLs) provides a simpler workflow for performing complex operations, including those that require bringing additional runtime dependencies. Injecting DLLs, however, usually leaves significantly more traces; we address this topic in the detection write-up.

Where Does The Code Come From?

To begin with, the processor needs a correct sequence of machine instructions available in the target address space before it can execute them. Before the era of Data Execution Prevention, processors did not distinguish code from data, so as long as the bytes in memory had a valid interpretation as machine instructions, they were free to run. On modern systems, however, the memory protection model defines which regions are readable, writable, and executable, preventing attackers from redirecting execution into data areas that might include untrusted input.

Thus, we have two options: either leverage existing code from the program itself or system libraries, or allocate a new executable region. The second approach generally provides more flexibility, though it is significantly easier to detect.

Additionally, it can be challenging (if possible at all) to find a complete piece of code that achieves a specific non-trivial goal. There is, however, a family of advanced techniques that allow building an algorithm of any complexity given just a few simple sequences of instructions called gadgets. The most well-known are Return Oriented Programming (ROP) and Jump Oriented Programming (JOP). Notably, modern versions of Windows running on compatible hardware effectively prevent ROP and significantly complicate JOP, so we will not cover gadget-based techniques in this project.

Reusing Existing Code

Almost all programs depend at a minimum on several system libraries that expose Win32 API functions. To optimize memory usage, Windows utilizes a mechanism called KnownDLLs that includes a few dozen of the most frequently used built-in libraries and ensures that their memory is always shared. While their exact location is still randomized and does not persist between reboots, these libraries are mapped at the same base address across all processes. This fact drastically simplifies injection because when it comes to KnownDLLs, we don’t need to resolve any remote addresses. But keep in mind that before using their code, we still need to verify that the target has the required DLL loaded.
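For illustration, here is a minimal sketch (not the project’s code) of this idea; the helper name is hypothetical, and it assumes a same-bitness Win32 target that already has the module loaded:

```c
/*
 * A minimal sketch: because KnownDLLs share their base addresses system-wide,
 * an export address resolved locally is also valid inside a same-bitness target
 * that has the module loaded. The helper name is hypothetical.
 */
#include <windows.h>

FARPROC ResolveRemoteKnownDllExport(const char *moduleName, const char *exportName)
{
    // The module must already be loaded locally (true for KnownDLLs like kernel32.dll).
    HMODULE localBase = GetModuleHandleA(moduleName);

    if (!localBase)
        return NULL;

    // The same base address applies in the target, so no pointer adjustment is needed.
    return GetProcAddress(localBase, exportName);
}

// Example: resolve LoadLibraryA for later use with CreateRemoteThread against another process.
// FARPROC remoteLoadLibrary = ResolveRemoteKnownDllExport("kernel32.dll", "LoadLibraryA");
```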

It is also possible to use execution primitives from other modules readily available in the target’s address space (including the primary executable itself). Unfortunately, they find limited use in general-purpose solutions because of the specificity and complications related to ASLR.

Other pitfalls that might complicate reusing code are the calling convention and the number of expected parameters. As we will see below, there are several ways to gain execution that differ in the number of parameters they support. Unfortunately, they don’t come even close to covering all possibilities. Hence, some circumstances might require writing custom shellcode into the target merely to call an existing function.

Loading Shellcode

Writing executable code into another process isn’t inherently problematic; you can find an overview of available options in the Managing Remote Memory section below. Preparing this code, however, is a substantially more complex task. Most of the time, it heavily depends on concepts that are out of scope for this write-up. So, instead of talking about assembly and shellcode, we will focus on higher-level languages like C and, therefore, DLLs.

LoadLibrary - The Hybrid Approach

Probably the most widely targeted function during code injection is LoadLibrary. This function is implemented in kernelbase.dll and has a forwarding alias in kernel32.dll - the two baseline libraries for all Win32 (both GUI and Console) applications. As a result, it is available in all Windows subsystem processes. Also, conveniently (as we will see below), it requires a single pointer parameter - the path to the file. The reason for the majority of injection techniques to target this function is simple: DLL loading is a local operation, i.e., there is no API for loading libraries remotely. Hence, most of the time, the sole purpose of a specific technique is to force another process into calling this function with an attacker-controlled parameter. There is, of course, a notable exception that allows avoiding this call altogether; we are going to examine this approach at the end of our discussion.

With that in mind, we can group all DLL injection techniques based on their primary API call:

  • LoadLibrary/LoadLibraryEx-based techniques.
  • LdrLoadDll-based techniques. LdrLoadDll is a Native API function implemented in ntdll.dll that takes on the heavy lifting of module loading and powers LoadLibrary. Using LdrLoadDll in place of LoadLibrary can be inconvenient due to its prototype (shown after this list) but allows injecting DLLs into native-subsystem processes and might help avoid basic runtime detection.
  • Manual mapping techniques. These rely on an alternative path of, essentially, re-implementing LoadLibrary using memory manipulation routines. Writing a manual mapper is considerably trickier than writing other types of injectors because of the many nuances that must be taken into account when deploying the image into the target’s address space.
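For reference, here is the commonly cited (unofficial) prototype of LdrLoadDll, assuming the types from winternl.h; the UNICODE_STRING parameter and NTSTATUS result are what make it awkward to use directly as a remote thread start routine:

```c
#include <windows.h>
#include <winternl.h>

// Unofficial prototype; the function is exported by ntdll.dll but not declared in the SDK.
NTSTATUS NTAPI LdrLoadDll(
    PWSTR DllPath,              // optional search path
    PULONG DllCharacteristics,  // optional flags that affect the load
    PUNICODE_STRING DllName,    // name or full path of the library
    PVOID *DllHandle            // receives the base address of the loaded module
);
```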

Gaining Execution

Remote Threads

The simplest and most reliable way to start executing code in a different process is to create a remote thread. This operation requires PROCESS_CREATE_THREAD access (part of GENERIC_WRITE) to the target and is achieved via CreateRemoteThread, RtlCreateUserThread, or NtCreateThreadEx. The last function provides the most control over the creation flags, which might come in handy for avoiding detection. For example, it can instruct the system to skip attaching to DLLs or to hide the new thread from debuggers.

The payload is expected to follow the stdcall convention, take a single pointer-size parameter, and return a 32-bit value. The return value becomes the thread’s exit status and can be inspected afterward. Additionally, it is easy to synchronize with the code running on the new thread because the thread object becomes signaled upon termination. Some variations of this prototype are naturally possible when targeting specific processor architectures, such as having no parameters or returning a 64-bit value on x64 systems, thanks to binary compatibility. Conveniently, LoadLibrary closely matches this prototype, which allows us to invoke it directly without any stub code. As we will see below, the most basic DLL injection techniques rely on this exact approach.
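To make this concrete, here is a minimal sketch of the classic technique; the function name and DLL path are placeholders, error handling is omitted, and it assumes a same-bitness Win32 target:

```c
/*
 * A minimal sketch of the classic remote-thread injection. Requires a process
 * handle with PROCESS_CREATE_THREAD, PROCESS_VM_OPERATION, and PROCESS_VM_WRITE
 * access; the function name and DLL path are placeholders.
 */
#include <windows.h>
#include <string.h>

BOOL InjectViaRemoteThread(HANDLE hProcess, const char *dllPath) // e.g. "C:\\Test\\payload.dll"
{
    SIZE_T size = strlen(dllPath) + 1;

    // Delivery: place the parameter (the path string) into the target's address space.
    LPVOID remotePath = VirtualAllocEx(hProcess, NULL, size,
                                       MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (!remotePath || !WriteProcessMemory(hProcess, remotePath, dllPath, size, NULL))
        return FALSE;

    // Execution: LoadLibraryA matches the thread start prototype, and kernel32.dll
    // (a KnownDLL) has the same base address in the target.
    LPTHREAD_START_ROUTINE start = (LPTHREAD_START_ROUTINE)
        GetProcAddress(GetModuleHandleA("kernel32.dll"), "LoadLibraryA");

    HANDLE hThread = CreateRemoteThread(hProcess, NULL, 0, start, remotePath, 0, NULL);
    if (!hThread)
        return FALSE;

    WaitForSingleObject(hThread, INFINITE); // the thread object signals on termination
    CloseHandle(hThread);
    return TRUE;
}
```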

The APC Queue

An alternative solution is to run the payload on existing threads. There are many ways to hijack execution and achieve our goals, but probably the most convenient one is Asynchronous Procedure Calls (APCs). APCs are a facility provided out of the box by the OS: essentially, each thread has a queue of callbacks to execute when it is not doing any other work. Queueing new tasks is possible across process boundaries and requires THREAD_SET_CONTEXT access (which is, again, part of the GENERIC_WRITE mask).

One notable difference between APCs and new threads is a more flexible prototype for the callback function. It might sound surprising if you already have experience with the Win32 API wrapper (QueueUserAPC) since it uses a prototype that is effectively equivalent to the one we’ve seen before. Nevertheless, the underlying syscall (NtQueueApcThreadEx) works slightly differently. The callback it uses follows the stdcall convention, has no return value, and takes three pointer-size parameters. Yet again, functions with up to three parameters and a return value should work on x64 thanks to binary compatibility. This gives APCs an unparalleled advantage because they support a significantly broader range of payloads. As for the disadvantages, the code that posts tasks to the queue cannot wait for their completion or determine whether the invocation succeeded.

Additionally, there remains a question of when the payload will execute. Historically, until recent versions of the OS, Windows exposed a single type of user-mode APC that requires an explicit acknowledgment from the program, indicating that it’s safe to run the pending calls from the queue. This approach is advantageous for preventing deadlocks, as it helps to ensure that APCs do not execute while the thread holds locks on synchronization primitives.

There are two ways for a thread to execute normal APCs: either to sleep or wait in an alertable state or to call NtTestAlert. Therefore, most existing implementations of APC-based injection take an indefinite amount of time before the payload executes. They work best against UI applications because those tend to wait for window messages in an alertable state.
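As an illustration of this classic (alertable) path, here is a minimal sketch that queues LoadLibraryA onto every thread of the target via the documented QueueUserAPC wrapper; the function name is hypothetical, and it assumes remotePath already points to the DLL path inside the target (see the remote-thread sketch above):

```c
/*
 * A minimal sketch of classic APC-based injection (error handling omitted).
 * The payload runs only once one of the target's threads enters an alertable state.
 */
#include <windows.h>
#include <tlhelp32.h>

void QueueLoadLibraryApcs(DWORD targetPid, LPVOID remotePath)
{
    PAPCFUNC loadLibrary = (PAPCFUNC)
        GetProcAddress(GetModuleHandleA("kernel32.dll"), "LoadLibraryA");

    HANDLE snapshot = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    THREADENTRY32 entry;
    entry.dwSize = sizeof(entry);

    // Queue the call on every thread of the target; we cannot tell which one
    // (if any) will become alertable first.
    for (BOOL ok = Thread32First(snapshot, &entry); ok; ok = Thread32Next(snapshot, &entry))
    {
        if (entry.th32OwnerProcessID != targetPid)
            continue;

        HANDLE hThread = OpenThread(THREAD_SET_CONTEXT, FALSE, entry.th32ThreadID);

        if (hThread)
        {
            QueueUserAPC(loadLibrary, hThread, (ULONG_PTR)remotePath);
            CloseHandle(hThread);
        }
    }

    CloseHandle(snapshot);
}
```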

Windows 10 RS5, however, added a new type of user-mode APC called special APCs. Overall, they work similarly to the standard counterpart we just discussed, except they execute immediately (i.e., on the next transition from kernel mode to user mode, before any application code). That makes special APCs incredibly useful for code injection. Keep in mind, though, that the payload might get executed at an inconvenient time (such as when the thread is holding locks), so, preferably, it should be as simple as possible. Technically, LoadLibrary doesn’t satisfy this criterion, but it generally works well. The proof-of-concept code provided with the repository makes use of special APCs whenever possible.

Hijacking Execution

Here are a few alternative techniques that we will not cover in detail:

  1. Modifying thread contexts. The system maintains a snapshot of all registers that define the current state of every thread. Two functions - GetThreadContext and SetThreadContext (plus their native counterparts) - provide access to this information and allow changing it arbitrarily. Hijacking execution by modifying contexts requires THREAD_SET_CONTEXT access (plus THREAD_GET_CONTEXT and THREAD_SUSPEND_RESUME in practice) and a specially designed stub (written in assembly) available in the target’s address space. This stub prepares the stack, invokes the payload, and then restores the context to its original state; see the sketch after this list.
  2. Patching existing code or jump tables. There are many variants of this technique that target both specific applications and commonly used system functions from user-mode libraries. This diversity boils down to either changing pointers (such as patching the IAT) or modifying machine instructions in memory. We will cover the memory management API in the next section from the perspective of delivery, but keep in mind that the same API can be used for gaining code execution via patching.
  3. Installing the instrumentation callback. This per-process callback acts somewhat like a persistent special APC in the sense that it gets invoked on kernel-to-user-mode transitions, but this time, repeatedly. This makes it a powerful tool for gaining execution before any application-defined code has a chance to run. As with context-based injection, using the instrumentation callback requires writing a highly architecture-specific stub in assembly.
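For the first of these, a minimal, x64-only sketch might look as follows; remoteStub is a hypothetical address of assembly code already present in the target that is responsible for preserving registers, invoking the payload, and resuming the original execution:

```c
/*
 * A minimal, x64-only sketch of thread-context hijacking (error handling omitted).
 */
#include <windows.h>

void RedirectThread(DWORD threadId, DWORD64 remoteStub)
{
    HANDLE hThread = OpenThread(THREAD_GET_CONTEXT | THREAD_SET_CONTEXT |
                                THREAD_SUSPEND_RESUME, FALSE, threadId);

    SuspendThread(hThread);             // never modify the context of a running thread

    CONTEXT ctx;
    ctx.ContextFlags = CONTEXT_CONTROL; // only the control registers are needed here
    GetThreadContext(hThread, &ctx);

    ctx.Rip = remoteStub;               // resume execution at the stub
    SetThreadContext(hThread, &ctx);

    ResumeThread(hThread);
    CloseHandle(hThread);
}
```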

Delivering Data & Code

Managing Remote Memory

Processes on Windows use private virtual address spaces. Consequently, code running in the context of a process can directly access and modify only that process’s memory. This mechanism provides isolation that is essential for the stability of the OS and its components but can become inconvenient during inter-process operations. The functions we want to invoke remotely might require context information, like the filename string for LoadLibrary, which must be accessible in the target’s address space. Additionally, writing machine instructions requires keeping in mind the various types of memory protection (i.e., the distinction between readable, writable, and executable pages).

To address the issues arising from similar tasks (usually performed by debuggers), Windows provides various functions for manipulating the address space of a different process. These functions are, of course, subject to access checks but, given sufficient permissions, can do effectively everything a process can do to itself.

| Function | Type of Memory | Required Access | Description |
| --- | --- | --- | --- |
| ReadProcessMemory / NtReadVirtualMemory | Any | PROCESS_VM_READ | Reads bytes from the address space of a process. |
| WriteProcessMemory / NtWriteVirtualMemory | Any | PROCESS_VM_WRITE | Writes bytes into the address space of a process. |
| VirtualProtectEx / NtProtectVirtualMemory | Any | PROCESS_VM_OPERATION | Adjusts memory protection of an existing region. |
| VirtualQueryEx / NtQueryVirtualMemory | Any | PROCESS_QUERY_LIMITED_INFORMATION or PROCESS_QUERY_INFORMATION | Retrieves various information about the entire address space or a specific region. Note that the Win32 function exposes only a small portion of the available properties. |
| VirtualAllocEx / NtAllocateVirtualMemory | Private | PROCESS_VM_OPERATION | Allocates or reserves private memory. |
| VirtualFreeEx / NtFreeVirtualMemory | Private | PROCESS_VM_OPERATION | Releases previously allocated private memory. |
| MapViewOfFile2 / NtMapViewOfSection | Shared | PROCESS_VM_OPERATION | Maps a shared memory region (a section object in Native API terms) into the address space of a process. The memory can be backed by the paging file, a regular file, or an executable image. |
| UnmapViewOfFile2 / NtUnmapViewOfSection | Shared | PROCESS_VM_OPERATION | Unmaps a shared memory region. |

Windows supports two types of memory that use slightly different sets of management functions but are otherwise mostly interchangeable: private and shared (also called mapped). Mapped memory can work in several modes, including projecting files from disk for reading/writing and images for execution. We will cover the difference when talking about manually mapping DLLs later. Thus, when allocating general-purpose memory remotely, we effectively have two similar options that require slightly different access to the target. The example projects in the repository demonstrate both approaches, allowing us to test how they hold up against detection. Occasionally, it might also be possible to reuse and modify existing memory regions (such as the PEB) for a more concealed allocation, although this approach is less generic because it depends on the existing memory layout that we need to inspect first. The most notable examples revolve around hiding new code in RWX regions created by Just-In-Time compilation.

You won’t find many injection techniques that rely on shared memory because mapping it remotely was not exposed in the Win32 API until Windows 10 RS2, even though it has existed in the Native API since Windows 2000.
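Here is a minimal sketch of the shared-memory route (error handling omitted; the helper name is hypothetical): prepare the content through a local view, then map a second view of the same section into the target via MapViewOfFile2, which requires Windows 10 RS2+ and PROCESS_VM_OPERATION access.

```c
/*
 * A minimal sketch of delivering a payload through a section object.
 */
#include <windows.h>
#include <string.h>

PVOID DeliverViaSection(HANDLE hProcess, const void *payload, SIZE_T size)
{
    // A pagefile-backed section that allows executable views.
    HANDLE hSection = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                         PAGE_EXECUTE_READWRITE, 0, (DWORD)size, NULL);

    // Local view: deploy the content without touching the target at all.
    void *localView = MapViewOfFile(hSection, FILE_MAP_WRITE, 0, 0, size);
    memcpy(localView, payload, size);
    UnmapViewOfFile(localView);

    // Remote view of the same pages, mapped with the final protection.
    PVOID remoteView = MapViewOfFile2(hSection, hProcess, 0, NULL, 0, 0,
                                      PAGE_EXECUTE_READ);

    CloseHandle(hSection);
    return remoteView;
}
```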

Alternative Means of Delivery

Additionally, there are various code injection techniques that use other means of delivering data into the target process; we do not cover them here:

  • Using the shared desktop heap. UI applications running on the same desktop share a memory region that the graphical subsystem uses to store window-related state. Despite the region being mapped as read-only, various functions like SetWindowLongPtr can effectively write custom data into it. Provided we know the offset where our data landed (which we can obtain by scanning the region locally) and the address at which the buffer is mapped in the target process, we can compute a pointer that is valid remotely. Techniques like PowerLoader use this approach combined with ROP, but it can also serve as a means of regular data delivery.
  • Using non-memory storage, such as the Atom table. This table stores strings matched with 16-bit identifiers and is managed via UI-related functions like RegisterClassEx and RegisterClipboardFormat. The AtomBombing technique uses this structure as a medium for delivering the payload; combined with APCs and ROP, it enables code injection.

“Backdoor”-like Injections

So far, we have distinguished injectors based on how they gain execution and deliver the code or data. There is also a set of techniques that do not fall into these categories because they, effectively, don’t use most of the facilities we’ve seen so far. The mechanisms we discuss in this section were introduced by Microsoft and implemented accordingly, i.e., by extending the functionality of system libraries. When specific conditions are met, they automatically load and invoke user-controlled DLLs on the target’s behalf. This behavior resembles a backdoor, hence the name.

The most widely used technique in this category is SetWindowsHookEx. This API supports installing a wide range of window-message-processing-related hooks on both a global and a per-thread scale. While some of them allow receiving notifications about specific events (such as key presses) on the calling thread, a few must execute within the context of the target process and, thus, require a DLL to perform the actual work. Whenever a window in the scope of the hook receives a specific window message, the system automatically forces its owning thread to load our DLL. Despite the simplicity, there are two moderately significant limitations to this technique. Firstly, the function works only against applications that include GUI threads, preferably ones actively pumping messages. Secondly, and more importantly, it is not suitable for typical DLLs: to work, the library either needs to contain a specifically designed callback function or perform all tasks in DllMain and then always fail the loading. Because of these limitations and the detailed documentation provided by Microsoft, we do not provide a demo project for this technique in the repository.
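For completeness, here is a minimal sketch of what installing such a hook might look like; hook.dll and its HookProc export are hypothetical, and the DLL must follow the limitations described above:

```c
/*
 * A minimal sketch of hook-based DLL loading (error handling omitted).
 */
#include <windows.h>

HHOOK InjectViaWindowsHook(DWORD guiThreadId)
{
    // Load the hook module locally first to obtain the procedure address.
    HMODULE hDll = LoadLibraryW(L"hook.dll");
    HOOKPROC proc = (HOOKPROC)GetProcAddress(hDll, "HookProc");

    // The system maps hook.dll into the target once a message arrives in scope.
    HHOOK hHook = SetWindowsHookExW(WH_GETMESSAGE, proc, hDll, guiThreadId);

    // Nudge the thread so the hook (and, therefore, the DLL load) triggers promptly.
    PostThreadMessageW(guiThreadId, WM_NULL, 0, 0);

    return hHook; // release later via UnhookWindowsHookEx
}
```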

If injecting a DLL on process startup is also a viable option, here are a few facilities that can help:

  • AppInit_DLLs. AppInit_DLLs is a registry value under HKLM that defines a list of libraries that user32.dll - a core library for any UI-related functionality - will automatically load during its initialization. Because of the system-wide scope, this technique is more suitable for persistence than for targeted injection. Besides that, AppInit_DLLs has been deprecated and is now disabled by default; even when enabled, it might be configured to require digitally signed libraries.
  • AppCompat shims. This infrastructure provides an extensible mechanism for applying compatibility fixes to legacy applications. Shims target specific executable files that they recognize based on various properties. Custom shim providers can bring DLL dependencies and get them loaded in the early stages of process initialization. Microsoft EMET used this approach to load its library and apply additional mitigation policies to programs running on older versions of Windows. Keep in mind that installing new shims involves creating and registering an *.sdb (shim database) file and generally requires administrative permissions.
  • Custom Application Verifier providers. These libraries act as plugins for Application Verifier - a built-in OS component and a framework for testing the application’s stability and compatibility. They also get loaded early during process initialization but, unfortunately, require administrative permissions to register.

Manual DLL Mapping

The final approach we will discuss here is truly marvelous because it goes the extra mile to extend the functionality of DLL loading and significantly improves its resilience to detection. Welcome manual mappers - techniques that re-implement LoadLibrary from scratch.

There are two primary reasons to bother with this approach. Firstly, LoadLibrary is a common denominator for all other DLL injectors; consequently, it is heavily monitored by security software. The underlying LdrLoadDll is in a better position from the attacker’s perspective, but only slightly. Secondly, these functions expect a filename as their input, limiting their use to content on the filesystem. As we will see below, manual mapping doesn’t have this limitation and supports loading DLLs from memory.

How Does DLL Loading Work?

The module loader is implemented in user mode in ntdll.dll - the Native subsystem library that is always loaded into all processes. Here is the list of the most essential steps it takes to load a DLL:

  1. Open the file via NtOpenFile for the Read & Execute access.
  2. Create an image section (memory mapping) object via NtCreateSection with the SEC_IMAGE attribute from the file. This step makes the system read the file and validate its structure according to the PE specification. In some cases, the kernel might also apply image relocations (discussed below) at this step.
  3. Map the image section object via NtMapViewOfSection. The new memory region will provide a copy-on-write view of the DLL but with a slightly different (so-called image) layout. Each PE section will get the specified memory protection (R/RW/RX/etc.) set automatically. A Native API sketch of steps 1-3 follows this list.
  4. If necessary, apply relocations in user mode. PE files (both .exe and .dll) are rarely entirely position-independent; instead, they have a preferred image base address (specified in the headers). Whenever the file gets mapped at a different location (either because of mandatory ASLR or already occupied memory), the loader must apply some patches to the memory content.
  5. Resolve imports and recursively load dependencies. Executable files define dependencies via the use of the import table; it becomes the responsibility of the loader to locate necessary functions in the export directory of referenced libraries and bind them together.
  6. Invoke the library’s TLS initialization callbacks and the entry point upon successful loading.
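To make the first three steps concrete, here is a sketch using the Native API; it assumes the declarations come from a Native API header set such as phnt, and error handling is omitted:

```c
/*
 * A sketch of steps 1-3 performed against the current process.
 */
#include <phnt_windows.h>
#include <phnt.h>

PVOID MapImage(PCWSTR ntPath) // e.g. L"\\??\\C:\\Windows\\System32\\dbghelp.dll"
{
    UNICODE_STRING name;
    OBJECT_ATTRIBUTES attributes;
    IO_STATUS_BLOCK ioStatus;
    HANDLE hFile = NULL, hSection = NULL;

    // Step 1: open the file for Read & Execute access.
    RtlInitUnicodeString(&name, ntPath);
    InitializeObjectAttributes(&attributes, &name, OBJ_CASE_INSENSITIVE, NULL, NULL);
    NtOpenFile(&hFile, FILE_READ_DATA | FILE_EXECUTE | SYNCHRONIZE, &attributes,
               &ioStatus, FILE_SHARE_READ, FILE_SYNCHRONOUS_IO_NONALERT);

    // Step 2: create an image section; the kernel parses and validates the PE here.
    NtCreateSection(&hSection, SECTION_MAP_READ | SECTION_MAP_EXECUTE, NULL, NULL,
                    PAGE_READONLY, SEC_IMAGE, hFile);

    // Step 3: map a copy-on-write view with the image layout and per-section protection.
    PVOID base = NULL;
    SIZE_T viewSize = 0;
    NtMapViewOfSection(hSection, NtCurrentProcess(), &base, 0, 0, NULL, &viewSize,
                       ViewShare, 0, PAGE_READONLY);

    NtClose(hSection);
    NtClose(hFile);
    return base;
}
```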

There are, of course, a lot of other implementation details that we won’t cover because they are less relevant to our goals. They include, for example, support for KnownDLLs and the structure of the lists where the loader maintains its database of modules.

Can We Do The Same?

Most importantly, because the module loader is implemented in user mode and, therefore, follows the same rules of interaction with the kernel, it is possible to closely reproduce its behavior. Ultimately, we can even include a few improvements. As long as we deploy the image following the same layout and memory protection and then resolve its imports, relatively simple code should work correctly. Of course, the more complex the DLL gets, the more precisely we need to reproduce the loader’s behavior.

There are three approaches we can take for reproducing the memory layout of a loaded DLL:

  1. Allocate private MEM_COMMIT memory, fill it in remotely via cross-process writing, and then set the necessary memory protection.
  2. Map shared SEC_COMMIT memory. The primary benefit of this approach is that we can map a local view of this region first, deploy the image there (write it, apply relocations, resolve imports, etc.), and only then map a ready-to-run remote view; the relocation sketch after this list illustrates one of these steps.
  3. Map a shared SEC_IMAGE view of the image. This approach is what LoadLibrary does under the hood. It automatically fixes PE section layout and protection and often even relocations. Unfortunately, it requires a file on the disk. Additionally, it notifies kernel-mode drivers about image loading, making it less suitable for stealthy injectors.
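As an example of the kind of work the local staging view enables, here is a sketch of applying base relocations (step 4 of the loader walkthrough) for an x64 image; it assumes the image has already been written into the local view using the image layout, and delta is the remote base address minus the preferred ImageBase from the headers:

```c
/*
 * A sketch of applying base relocations inside a local staging view (x64 images only).
 */
#include <windows.h>

void RelocateImage(BYTE *localBase, ULONG_PTR delta)
{
    IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS *)
        (localBase + ((IMAGE_DOS_HEADER *)localBase)->e_lfanew);
    IMAGE_DATA_DIRECTORY dir =
        nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC];

    BYTE *cursor = localBase + dir.VirtualAddress;
    BYTE *end = cursor + dir.Size;

    while (cursor < end)
    {
        IMAGE_BASE_RELOCATION *block = (IMAGE_BASE_RELOCATION *)cursor;
        WORD *entries = (WORD *)(block + 1);
        DWORD count = (block->SizeOfBlock - sizeof(*block)) / sizeof(WORD);

        for (DWORD i = 0; i < count; i++)
        {
            // Each entry packs the relocation type (top 4 bits) and a page offset.
            if ((entries[i] >> 12) == IMAGE_REL_BASED_DIR64)
                *(ULONG_PTR *)(localBase + block->VirtualAddress +
                               (entries[i] & 0xFFF)) += delta;
        }

        cursor += block->SizeOfBlock;
    }
}
```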

Probably the most widely known implementation of manual mapping is called reflective loading. This technique either relies on injecting shellcode that allocates and deploys the image from an in-memory buffer and then invokes its entry point, or requires a specially crafted DLL. All implementations of reflective loading we’ve seen are based on the first option from our list (private MEM_COMMIT memory) and perform most tasks inside the target process. These design choices simplify some tasks, such as resolving imports, but might draw the suspicion of security software and lead to detection.

Since the primary topic of this discussion is inter-process injection, the demo project for manual mapping included in the repository performs as many steps as possible remotely. It also relies on the second option (shared SEC_COMMIT memory) because of its benefits for image deployment. Compared to reflective loading, only a few steps touch the target process:

  1. Enumerating loaded modules while resolving imports. This action repeatedly queries the target’s address space but doesn’t require reading any memory.
  2. Mapping a region with the prepared content and layout using the RWX/WCX protection.
  3. Optionally applying more restrictive memory protection to specific pages.
  4. Creating a new thread and queueing an APC to execute the module’s DllMain.

Conclusion

The categorization we introduced should cover a large majority of DLL (and, partially, shellcode) injection techniques and their variations, probably even future ones. Of course, the landscape constantly changes, as it has for years, thanks to the work of many dedicated security researchers. Detection methods also evolve competitively over time (as you can read in the corresponding write-up), so keeping your arsenal of tricks up to date is essential. Still, we believe that the material presented here offers valuable insight into the topic, both for beginners and professionals.

Further Reading

Other parts of this series: