Debug Info Debugging



Recently, I have found an anomaly in the debug info generated by the ready to run compiler. In this post, I will talk about how I found it and how I debugged it.

Discovery

In a routine update of the ILCompiler.Reflection.ReadyToRun.dll for ILSpy, I turned on the STRESS mode I introduced in a recent commit to test whether or not the new version can disassemble all the methods in System.Private.CoreLib. To my surprise, it failed with this exception.

An item with the same key has already been added.

mscorlib.dll!System.ThrowHelper.ThrowArgumentException()
mscorlib.dll!System.Collections.Generic.Dictionary<...>.Insert(...)	
ILSpy.ReadyToRun.Plugin.dll!ICSharpCode.ILSpy.ReadyToRun.ReadyToRunDisassembler.DebugInfoHelper.Update(...) 
ILSpy.ReadyToRun.Plugin.dll!ICSharpCode.ILSpy.ReadyToRun.ReadyToRunDisassembler.DecorateDebugInfo(...) 
ILSpy.ReadyToRun.Plugin.dll!ICSharpCode.ILSpy.ReadyToRun.ReadyToRunDisassembler.Disassemble(...) 
ILSpy.ReadyToRun.Plugin.dll!ICSharpCode.ILSpy.ReadyToRun.ReadyToRunLanguage.DecompileMethod(...)

This looks like an implementation bug in DebugInfoHelper.Update. Why is it a compiler problem? It makes sense to start with understanding how DebugInfoHelper works.

DebugInfoHelper

The debug info helper is a data structure used to decorate the disassembly with the mapping between register or register relative memory access to the managed ‘variables’ such as parameters or local variables. The data structures stored in the ready to run image are in the NativeVarInfo format.

The key inconvenience here is the fact that the info is valid only between certain code offsets. For each instruction, we would like to check if there is a matching NativeVarInfo, and because of the fact that these objects are valid only within a bound, the naive solution has to check each of them.

To remedy that, I built DebugInfoHelper. The idea is that we can maintain the hash table we wanted. First, we split each NativeVarInfo into two NativeVarInfoRecord. One of them corresponds to the beginning of the code offset range and the other corresponds to the end of the code offset range. Then we sort them by the code offsets, with the caveat that the end record would precede the start record. Then we can process these records as the code offset goes and maintain the hash table we wanted. An experiment shows that this optimization speed up the lookup significantly.

As a side-effect of this change, if two NativeVarInfo objects happen to describe the same location twice in overlapping code offset ranges, the starting of the later record would find an existing record and therefore fail to add it into the hash table, exactly the error we are seeing.

Therefore, the real bug is that we have a single location being described twice in an overlapping code offset range. R2RDump confirms my hypothesis:

void System.Runtime.CompilerServices.NullableAttribute..ctor(byte[])
...
Debug Info
    Bounds:
    Native Offset: 0x0, Prolog, Source Types: StackEmpty
    Native Offset: 0xA, IL Offset: 0x000d, Source Types: StackEmpty
    Native Offset: 0xA, Epilog, Source Types: StackEmpty

    Variable Locations:
    Variable Number: 0
    Start Offset: 0x0
    End Offset: 0x1
    Loc Type: VLT_REG
    Register: RCX

    Variable Number: 1
    Start Offset: 0x0
    End Offset: 0x1
    Loc Type: VLT_REG
    Register: RDX

    Variable Number: 1
    Start Offset: 0x0
    End Offset: 0x4
    Loc Type: VLT_REG
    Register: RDX

These will translate to these operations:

graph TD; A[Add RCX to param0 at offset 0] B[Add RDX to param1 at offset 0] C[Add RDX to param1 at offset 0] D[Remove RCX to param0 at offset 1] E[Remove RDX to param1 at offset 1] F[Remove RDX to param1 at offset 4] A --> B B --> C C --> D D --> E E --> F style A width:400px style B width:400px style C width:400px style D width:400px style E width:400px style F width:400px

And therefore it will fail to insert into the hash table at the third step.

Filtering method to compile

Next, I would like to investigate the compilation to figure out why did we generated these overlapping code offset ranges. To do so, I would like to reproduce the compilation and reduce it to make debugging easier. By capturing the build log, I figure the command to compile System.Private.CoreLib.dll is as follow:

C:\dev\runtime\dotnet.cmd C:\dev\runtime\artifacts\bin\coreclr\windows.x64.Debug\crossgen2\crossgen2.dll -o:C:\dev\runtime\artifacts\bin\coreclr\windows.x64.Release\System.Private.CoreLib.dll -r:C:\dev\runtime\artifacts\bin\coreclr\windows.x64.Release\IL\*.dll --targetarch:x64 -O C:\dev\runtime\artifacts\bin\coreclr\windows.x64.Release\IL\System.Private.CoreLib.dll --pdb --pdb-path:C:\dev\runtime\artifacts\bin\coreclr\windows.x64.Release\PDB

The interesting thing is that the compilation process loads clrjit_win_x64_x64.DLL. This module is the workhorse for compilation. Notice that it also load clrjit.dll. That module is used to support the execution of crossgen2 but not for compiling the target binary. The entry point for compilation is clrjit_win_x64_x64!CILJit::compileMethod. By setting a breakpoint on this method we figure out the following call stack.

clrjit_win_x64_x64!CILJit::compileMethod(...)
jitinterface_x64!JitCompileMethod(...)
System_Runtime_InteropServices_RuntimeInformation!ILStubClass.IL_STUB_PInvoke(...)
ILCompiler_ReadyToRun!Internal.JitInterface.CorInfoImpl.CompileMethodInternal(...)
ILCompiler_ReadyToRun!Internal.JitInterface.CorInfoImpl.CompileMethod(...)
...

The last method above is interesting, it includes a filter to detect whether or not to compile code, which allows us to optimize our debugging by compiling just the offending method.

        public void CompileMethod(MethodWithGCInfo methodCodeNodeNeedingCode)
        {
            bool codeGotPublished = false;
            _methodCodeNode = methodCodeNodeNeedingCode;

            try
            {
                if (!ShouldSkipCompilation(MethodBeingCompiled) && !MethodSignatureIsUnstable(MethodBeingCompiled.Signature, out var _))
                {
                    MethodIL methodIL = _compilation.GetMethodIL(MethodBeingCompiled);
                    if (methodIL != null)
                    {
                        CompileMethodInternal(methodCodeNodeNeedingCode, methodIL);
                        codeGotPublished = true;
                    }
                }
            }
            finally
            {
                if (!codeGotPublished)
                {
                    PublishEmptyCode();
                }
                CompileMethodCleanup();
            }
        }

Also, MethodBeingCompiled has a nice ToString() method implementation that gives a human-readable string, which means I can hack the code to compile just the method I wanted right there. After I hacked the code to compile just that method, I checked the R2RDump, it is only that method, and the debug info still has the overlapping code offset ranges.

JITDump

The next step is to enable JIT dump. It produces a lot of information. It could be enabled by an environment variable.

The key information in the JIT dump is this:

*************** In genSetScopeInfo()
VarLocInfo count is 3
; Variable debug info: 3 live range(s), 2 var(s) for method System.Runtime.CompilerServices.NullableAttribute:.ctor(System.Byte[]):this
  0(   UNKNOWN) : From 00000000h to 00000001h, in rcx
  1(   UNKNOWN) : From 00000000h to 00000001h, in rdx
  1(   UNKNOWN) : From 00000000h to 00000004h, in rdx

This agrees with our observation in R2RDump, confirming that it is not a decoding issue.

This dump serves as a starting point for me to start exploring the code. Here are my high-level learnings:

At the final phase of the compilation, we generate the code for each basic block. Initially, these instructions are generated into instruction groups. During generation, we create something called scopes. These scopes represent the fact that we have a mapping between a variable and a physical location. The scope remembers the begin and the end code offset within the instruction group. When we have done with the generation, these instruction groups are organized and we generate the actual code offset by adding the offset of the instruction group and the code offset within the instruction group.

As I debugged, I figured we have 3 instruction groups representing the prolog, body, and the epilog. To my surprise, the instruction group offset for the body is 0. I further figure out that the prolog actually has size 0. Turn out the wrong entry is not the last one, instead, it is the two entries above. We shouldn’t have generated any debug info for the prolog if the prolog doesn’t have a size, but I found this in genSetScopeInfoUsingsiScope:

        // The range may be 0 if the prolog is empty. For such a case,
        // report the liveness of arguments to span at least the first
        // instruction in the method. This will be incorrect (except on
        // entry to the method) if the very first instruction of the method
        // is part of a loop. However, this should happen
        // very rarely, and the incorrectness is worth being able to look
        // at the argument on entry to the method.
        if (startOffs == endOffs)
        {
            noway_assert(startOffs == 0);
            endOffs++;
        }

When I debugged it, we actually went into the branch, that is why we generated a couple debug variable mappings with just 1 byte. That code was introduced long ago, chances is in my last run of the tests, we don’t have a method with 0 sized prolog.

The fix

Now it is obvious that this anomaly is a conscious choice. Now we have to adjust DebugInfoHelper. I worked on multiple approaches and found more interesting details in the debug info. For example, I found that sometimes the compiler is able to map two variables to the same physical location because they should share the same value. They are overlapping ranges in the above sense, but it is not wrong at all. To accommodate for these cases as well, instead of mapping a single location to a single variable, I mapped a single location to multiple variables (duplication allowed), and that solved all the problems listed here. I have submitted a PR to ILSpy to get this fixed.