Ready to Run Delay Load

Oct 14, 2021

In this series of posts, I will deep dive into ReadyToRun to understand its mechanisms. In my previous post, I briefly explained what is ReadyToRun and what problems do we need to solve. Now we will look into the exact mechanisms.

The example

We will start with this very simple example:

using System;

namespace ReadyToDebug
{
    class Program
    {
        static void Main(string[] args)
        {
            for (int i = 0; i < 3; i++)
            {
                Console.WriteLine("Hello World!");
            }
        }
    }
}

To be precise, I put it in a project targetting .NET 6.0, and compiled using the ready to run option as follow:

dotnet publish -r win-x64 /p:PublishReadyToRun=true

Of course, the program simply prints “Hello World” 3 times.

The tools

We will use ILSpy.ReadyToRun to disassemble the compiled code, and we will use the Time Travel Debugging in WinDBG capability to make debugging easier.

Disassembling the code

Using ILSpy.ReadyToRun, here is the method disassembled.

; void ReadyToDebug.Program.Main(string[])
; Prolog
00000000000017C0 57                   push      rdi
00000000000017C1 56                   push      rsi
00000000000017C2 4883EC28             sub       rsp,28h
; IL_0001
00000000000017C6 33F6                 xor       esi,esi
; IL_0016
00000000000017C8 488B3DD9080000       mov       rdi,[rel 20A8h]
; IL_0006
00000000000017CF 488B0F               mov       rcx,[rdi];
00000000000017D2 FF15C8080000         call      qword [rel 20A0h] ; void [System.Console]System.Console::WriteLine(string)
; IL_0012
00000000000017D8 FFC6                 inc       esi; rsi = Local 0
; IL_0016
00000000000017DA 83FE03               cmp       esi,3
00000000000017DD 0F9CC0               setl      al
00000000000017E0 0FB6C0               movzx     eax,al
; IL_001b
00000000000017E3 85C0                 test      eax,eax
00000000000017E5 75E8                 jne       short 0000`0000`0000`17CFh
; IL_001e
; Epilog
00000000000017E7 4883C428             add       rsp,28h
00000000000017EB 5E                   pop       rsi;
00000000000017EC 5F                   pop       rdi
00000000000017ED C3                   ret

First of all, the address column looks weird. We seldom have instruction pointer pointing to such a small number. They are NOT actual addresses, they are relative virtual addresses. The idea is that they are a relative offset from the base address of the image. Later, we will talk about how to do this simple translation when we dive into debugging.

Second, note that ILSpy.ReadyToRun nicely annotated that the call on 00000000000017D2 is going to Console.WriteLine. Note that the call is in indirect call, therefore the address of the callee is stored on the memory cell 20A0h. Note that it has to be indirect, because by the time we compile the application, we have no idea where will the code for Console.WriteLine lives. We call such a location a redirection cell. Basically, a memory location where we can write to it to redirect execution.

Given that even the compiler don’t know where the function is, how ILSpy.ReadyToRun knows it will lead to Console.WriteLine. This is the real ReadyToRun magic.

Without further ado, let’s dive into the debugging.

Recording the trace

First, we run this command:

c:\debuggers\amd64\ttd\TTTracer.exe c:\ReadyToDebug\bin\Debug\net6.0\win-x64\publish\ReadyToDebug.exe

This will record the execution of the program and allow us to move back-and-forth to understand the execution. Then, we can start inspecting the trace as follow:

c:\debuggers\amd64\windbg -z ReadyToDebug01.run

This will show the normal debugging session, and we will start debugging it.

Analyzing the trace

We would like to know what happened on the indirect calls. So naturally we would like to set a breakpoint to it. Right now, ReadyToDebug.dll is not even loaded in the process yet. So the first thing to do is to run until it is loaded. We can do that using this command

0:000> sxe ld ReadyToDebug
0:000> g

This will run until ReadyToDebug.dll is loaded, and when it does, the debugger shows:

...
ModLoad: 000001e2`f7b50000 000001e2`f7b54000   c:\Garbage\ReadyToDebug\bin\Debug\net6.0\win-x64\publish\ReadyToDebug.dll

The first address, 000001e2f7b50000, is the base address of the module. It may or may not be the same on other machines (and even executions on the same machine). But once we have this address, then we have this very simple formula:

virtual address = base address + relative virtual address

The virtual refers to virtual memory (which is used in practically any OS today). So if we ignore the word virtual, this is basically saying the address is the base address plus the relative address. Therefore, or call site is simply 000001e2f7b50000 + 17d2 = 000001e2f7b517d2. We will set a breakpoint there.

0:000> bp 000001e2f7b517d2
0:000> g
...
Breakpoint 0 hit
Time Travel Position: 171F7:9 [Unindexed] Index
...
0:000> g
...
Breakpoint 0 hit
Time Travel Position: 1A995:16 [Unindexed] Index
...
0:000> g
Breakpoint 0 hit
Time Travel Position: 1A9A1:16 [Unindexed] Index
...
0:000> g
TTD: End of trace reached.
...

As we can see, the breakpoint hit 3 times, and then the process terminates. Of course, that matches our expectation that Console.WriteLine is called 3 times.

The first call

Now we will go back in time to see what happened in the first call.

0:000> !tt 171F7:9
Setting position: 171F7:9
0:000> dq 000001e2`f7b520a0 L1
000001e2`f7b520a0  000001e2`f7b51804
0:000> u 000001e2`f7b51804
000001e2`f7b51804 33c0            xor     eax,eax
000001e2`f7b51806 6a03            push    3
000001e2`f7b51808 ff3572080000    push    qword ptr [ReadyToDebug!COM+_Entry_Point <PERF> (ReadyToDebug+0x2080) (000001e2`f7b52080)]
000001e2`f7b5180e ff2574080000    jmp     qword ptr [ReadyToDebug!COM+_Entry_Point <PERF> (ReadyToDebug+0x2088) (000001e2`f7b52088)]

Remember we said the address to the actual call site is stored in 20a0? We inspected the call site and found this code. This code is weird, we do not setup a frame and just push and jump. This pattern is often called a trunk or a trampoline. Roughly speaking, this is not a function, but a decorator of the function. If we think about the decorator pattern, you can use a decorator to execute something before of after a method. This is exactly what we are trying to do here.

The key thing to notice here is that we set eax to 0 and pushed two more values onto the stack. Overall, the stack remain unchanged. So the argument to the target function and the return address are still there!

Next, we look at the last jump.

0:000> dq 000001e2`f7b52088 L1
000001e2`f7b52088  00007ffd`39f349a0
0:000> uf 00007ffd`39f349a0
coreclr!DelayLoad_MethodCall:
00007ffd`39f349a0 4155            push    r13
00007ffd`39f349a2 4154            push    r12
00007ffd`39f349a4 55              push    rbp
00007ffd`39f349a5 53              push    rbx
00007ffd`39f349a6 56              push    rsi
00007ffd`39f349a7 57              push    rdi
00007ffd`39f349a8 4883ec68        sub     rsp,68h
00007ffd`39f349ac 48898c24b0000000 mov     qword ptr [rsp+0B0h],rcx
00007ffd`39f349b4 48899424b8000000 mov     qword ptr [rsp+0B8h],rdx
00007ffd`39f349bc 4c898424c0000000 mov     qword ptr [rsp+0C0h],r8
00007ffd`39f349c4 4c898c24c8000000 mov     qword ptr [rsp+0C8h],r9
00007ffd`39f349cc 660f7f442420    movdqa  xmmword ptr [rsp+20h],xmm0
00007ffd`39f349d2 660f7f4c2430    movdqa  xmmword ptr [rsp+30h],xmm1
00007ffd`39f349d8 660f7f542440    movdqa  xmmword ptr [rsp+40h],xmm2
00007ffd`39f349de 660f7f5c2450    movdqa  xmmword ptr [rsp+50h],xmm3
00007ffd`39f349e4 4c8b8c2498000000 mov     r9,qword ptr [rsp+98h]
00007ffd`39f349ec 4c89b42498000000 mov     qword ptr [rsp+98h],r14
00007ffd`39f349f4 4c8b8424a0000000 mov     r8,qword ptr [rsp+0A0h]
00007ffd`39f349fc 4c89bc24a0000000 mov     qword ptr [rsp+0A0h],r15
00007ffd`39f34a04 488d4c2468      lea     rcx,[rsp+68h]
00007ffd`39f34a09 488bd0          mov     rdx,rax
00007ffd`39f34a0c e85f4decff      call    coreclr!ExternalMethodFixupWorker (00007ffd`39df9770)
00007ffd`39f34a11 660f6f442420    movdqa  xmm0,xmmword ptr [rsp+20h]
00007ffd`39f34a17 660f6f4c2430    movdqa  xmm1,xmmword ptr [rsp+30h]
00007ffd`39f34a1d 660f6f542440    movdqa  xmm2,xmmword ptr [rsp+40h]
00007ffd`39f34a23 660f6f5c2450    movdqa  xmm3,xmmword ptr [rsp+50h]
00007ffd`39f34a29 488b8c24b0000000 mov     rcx,qword ptr [rsp+0B0h]
00007ffd`39f34a31 488b9424b8000000 mov     rdx,qword ptr [rsp+0B8h]
00007ffd`39f34a39 4c8b8424c0000000 mov     r8,qword ptr [rsp+0C0h]
00007ffd`39f34a41 4c8b8c24c8000000 mov     r9,qword ptr [rsp+0C8h]
00007ffd`39f34a49 4883c468        add     rsp,68h
00007ffd`39f34a4d 5f              pop     rdi
00007ffd`39f34a4e 5e              pop     rsi
00007ffd`39f34a4f 5b              pop     rbx
00007ffd`39f34a50 5d              pop     rbp
00007ffd`39f34a51 415c            pop     r12
00007ffd`39f34a53 415d            pop     r13
00007ffd`39f34a55 415e            pop     r14
00007ffd`39f34a57 415f            pop     r15
00007ffd`39f34a59 e93cffffff      jmp     coreclr!ExternalMethodFixupPatchLabel (00007ffd`39f3499a)  Branch

First of all, note that this is a function in coreclr.dll. That means by the time our application is compiled, we have no idea where is the function. That’s why it must be done through an indirect call.

This functions looks complicated, but all it really does is pushing stuff to the stack, call the ExternalMethodFixupWorker with some arguments, and then pop them back all, and jump somewhere else.

Note that at the end we popped r14 and r15 which is not pushed. This is to make sure we balance out the two additional pushes before we reach DelayLoad_MethodCall. By the time we jump, the stack will be exactly as it was at the callsite.

Looking at just the stack, this is simply another trunk. The key is what ExternalMethodFixupWorker to prepare the ExternalMethodFixupPatchLabel.

Note that the last jump is NOT an indirect jump, all it does is simply jumping to where rax points to:

0:000> u 00007ffd`39f3499a
coreclr!ExternalMethodFixupPatchLabel:
00007ffd`39f3499a 48ffe0          jmp     rax

rax is of course the return value of ExternalMethodFixupWorker.

ExternalMethodFixupWorker

The function ExternalMethodFixupWorker is implemented in C++, so we can simply read the source code. It is a long function so we will only outline what it does. The code uses the address of the indirection cell to find a signature. Using the signature, it asks the rest of the runtime to prepare for that function and return the address of that code.

Why a signature? A signature is basically the name of the method in binary form. When we ask the runtime for the code, we need to give it an identifier of the method, and the signature is such an identifier.

We also notice that there is a way to find the signature from the indirection cell address. That is how ILSpy.ReadyToRun figure out where the call will lead to.

Further downstream, we will simply run the Console.WriteLine implementation, so we will skip that.

The third call

Now you might wonder, why do I call the function 3 times to show you how a method call is done? The reason is that calling ExternalMethodFixupWorker is expensive, and we don’t want want to call it repeatedly after we already know where it should be. Let’s inspect the 3rd call now.

0:000> !tt 1A9A1:16
0:000> dq 000001e2`f7b520a0 L1
000001e2`f7b520a0  00007ffc`da3a2a68
0:000> u 00007ffc`da3a2a68
00007ffc`da3a2a68 e9effaffff      jmp     00007ffc`da3a255c

The code changed! This is why we called it multiple times. At the third call, we no longer do the DelayLoad_MethodCall work and instead just jump directly to the implementation. Something must have changed the code!

0:000> !ttpw 000001e2`f7b520a0
Time Travel Position: 1A995:217 [Unindexed] Index
0:000> k
 # Child-SP          RetAddr               Call Site
00 (Inline Function) --------`--------     coreclr!PatchNonVirtualExternalMethod+0x9c [D:\a\_work\1\s\src\coreclr\vm\prestub.cpp @ 2484] 
01 000000b9`cd77e300 00007ffd`39f34a11     coreclr!ExternalMethodFixupWorker+0xaf3 [D:\a\_work\1\s\src\coreclr\vm\prestub.cpp @ 2793] 
...

Now it is clear that ExternalMethodFixupWorker takes care of changing the indirection cell so that we don’t do the heavy work to find the method again.

The timestamp of this change is 1A995:217, which is a bit later than the second call at 1A995:16. Therefore we are actually doing DelayLoad_MethodCall twice and the patch happen at the second call to ExternalMethodFixupWorker for the same redirection cell. This is why I need to run it 3 times instead of just twice.

The exact reason why we need to call it twice instead of just once is due to ThePreStub and tiered compilation. We will skip that for now.

This conclude our journey to understand how the ready to run code calls an external method.