Tracking a kernel bug that doesn’t appear in gdb

I have been trying to locate the cause of a seemingly random bug in KernOS, and after a long investigation, I finally pinned it down.

In hindsight, the bug could have been reasoned out much earlier, as the investigation turned up more and more credible clues along the way. It would be a fun detective challenge for you to follow along and see if you can work out the issue using fewer clues than I did.

To do this, I will lay out my investigation process and label the important clues I discovered chronologically.

Table of contents/clues

  1. Bug behavior 1
  2. Bug behavior 2
  3. Stepping through gdb, instrumenting gdb
  4. Ruling out page directory setup, page fault handler, kernel heap allocation
  5. Instrumenting qemu
  6. Ruling out timer interrupt handler
  7. Noticing pattern of interrupt that causes bug
  8. Fixing the bug

Bug behavior 1

The first time the bug was observed was while I was implementing a page fault mechanism, where a memory access on an unmapped virtual memory address causes a page fault exception, which in turn leads to allocating a physical page frame and updating the page directory entry.

For context, the 0 – 8MB physical memory range is identity mapped to virtual memory, except that the first 4KiB page remains unmapped to catch nullptr dereferences. In other words, any memory access outside that mapped range (or inside the first page) will cause an exception. Below is the code to test whether virtual memory works.

  
    void MemoryTest()
    {
       kprintf("Test %i: Page fault and virtual memory mapping\n", ++TestCount);

       volatile uint32_t *Mem = (uint32_t*) 0x500000;  // 0 - 8MB is already mapped
       *Mem;                                           // won't cause Page fault exception

       volatile uint32_t *Mem2 = (uint32_t*) 0x900000;
       *Mem2 = 1;                                      // memory access causes Page Fault
    }

Clue 1: The bug was that about 1 in 5 runs, right after 0x900000 is mapped, I would get another page fault exception at address 0x3

Bug behavior 2

In addition to virtual memory mapping, I also have a custom vector container test. Since this vector container isn’t from the STL, heap allocation of the elements is done (in a crude manner) by reserving a chunk of memory in the kernel’s .text section, which is managed by a linked list (K&R style). The vector test looks like below

 
    void VectorTest()
    {
       kprintf("Test %i: Vector and kmalloc/free\n", ++TestCount);

       KL::Vector V(5);

       V[2] = 5;
       V[4] = 15;

       V.Pushback(30);
       V.Pushback(31);

       kassert(V[2] == 5, "Vector[] failed\n");
       kassert(V[6] == 31, "Vector::Pushback failed\n");
       kassert(V.Back() == V[6], "Vector::Back failed\n");
    }

Any of the asserts above would fail randomly, and once again, with the frequency of once in a few kernel boots.

To further figure out what was going on, I added a series of vector indexing calls, as below.


    void VectorTest()
    {
       kprintf("Test %i: Vector and kmalloc/free\n", ++TestCount);

       KL::Vector V(5);

       V[2] = 5;

    kprintf("%i", V[2]);
    kprintf("%i", V[2]);
    kprintf("%i", V[2]);
    kprintf("%i", V[2]);
    ...
    kprintf("%i", V[2]);
    }

And the output looks like

Clue 2: Once in a while, the indexing would return 0, and in those cases, it would also trigger a page fault at address 0x0.

Stepping through gdb, instrumenting gdb

Given the two random behaviors above, the first thing that came to mind was to use gdb to figure out how something like V[2] could suddenly return a different value. After many attempts, I could never reproduce any of the buggy behavior while single stepping.

Perhaps I just wasn’t running the kernel enough times to catch the bug occurring. Single stepping was also pretty tedious, so next I tried to run the kernel many times and simply log every executed instruction (using gdb on qemu running the kernel), hoping to analyze the executed code later, once the bug occurs.

Here’s how I was running qemu and gdb,

    qemu-system-i386 -s -S kern.bin
    # -s is shorthand for -gdb tcp::1234, letting gdb connect on port 1234
    # -S freezes the CPU at startup, until gdb tells it to continue

And in gdb

   
    target remote :1234
    set pagination off
    set logging on
    display/i $pc
    while 1
    si
    end

Clue 3: No matter how many times I run the kernel and log the instructions, the bug NEVER occurs.

Ruling out page directory setup, page fault handler, kernel heap allocation

Stumped by the fact that gdb couldn’t record the behavior, I decided to set aside the debugger approach for the time being. I nonetheless felt that virtual memory/page fault handling was the culprit.

Thus I carefully inspected all code related to page fault handling.

Clue 4: It is not related to

  • Page fault exception handler
  • Page frame allocation
  • Kernel heap memory allocator
  • Page directory, or page table entry update

Instrumenting qemu

At this point, I decided that I needed to look into qemu itself to understand what was going on, so I compiled qemu and started learning how it works. Poking around the source code, I realized that logging instructions from within qemu, unlike with gdb, still allows the buggy behavior to occur. That was very important progress: as long as I can reproduce and log the sequence of instructions that causes the bug, I can eventually figure this out. Qemu’s source code is still one complex beast, but at least the bug is no longer elusive.

Before this, the bug almost felt supernatural, as if it knew not to manifest while you were looking at it.

After some fumbling, I decided to start by inspecting qemu’s handling of page fault exceptions. The function where exceptions and interrupts are handled in qemu is do_interrupt_all. Logging it, I realized that when the bug occurs, it occurs on the 314th call to do_interrupt_all. This is pretty unscientific, but I decided to log qemu instructions on the 313th and 314th calls and diff them to see where they diverge.

Here’s what I got

You can ignore Break313 and Break314; they’re simply dummy functions I inserted into qemu’s do_interrupt_all function to allow easy breakpoints. By chance, I noticed an important difference between the 313th and 314th calls. Looking at line 18 in the image above, you can see that the intno on the left is 32, while the one on the right is 14.

So it looks like interrupt number 32 occurs before interrupt number 14. I recalled that a page fault exception is 14, and since the bug had always been related to page faults, that wasn’t surprising at all. But what is interrupt number 32? And could the fact that it occurred right before the page fault exception be the cause of our bug?

Looking at my kernel’s source code, I had remapped interrupts on my Programmable Interrupt Controller according to standard ISA IRQ numbers, with an offset of 0x20. Thus interrupt number 32 corresponds to the timer interrupt (IRQ0).

This was a revelation. I had never suspected that the bug was related to the timer interrupt, but a timer interrupt fits the bug’s randomness perfectly.

Now why would a timer interrupt not manifest in gdb? Armed with this information, it didn’t take long to confirm that qemu under gdb by default disables the clock timer and IRQ interrupts. The qemu docs here indicate how to enable them both during gdb debugging. And sure enough, once I enabled both, the bug showed up in gdb as well.

Clue 5: The bug occurs after a timer interrupt.

Ruling out timer interrupt handler

In hindsight, the problem is obvious now, but at the time it still didn’t cross my mind how the timer interrupt could be causing the buggy behaviors we had observed, namely 1) an immediate page fault right after the page fault handler returns, and 2) vector indexing returning different values.

Thus I continued by inspecting the timer interrupt related code. Below is InterruptTimerHandler.

   
    extern "C" void InterruptTimerHandler()
    {
        {
            IRQAcknowledge AckEOI(INTRP::IVT::TIMER);
        }

        if (++timer_ticks >= TICKS_PER_SECOND)
        {
            ++seconds_since_boot;
            timer_ticks = 0;
        }

        if (seconds_since_boot == 3)
        {
            seconds_since_boot = 0;
            PROCESS::Reschedule();
        }
    }

It’s very simple: it acknowledges the IRQ and increments a counter, and when the counter reaches 3 seconds, it calls the process scheduler. I didn’t see any issue with the timer handler.

Clue 6: Interrupt timer handler is correct.

Noticing pattern of interrupt that causes bug

Now I started to suspect that it was qemu’s i386 hardware emulation that was buggy. That it somehow messes up the system’s environment after a timer interrupt. I felt this was unlikely, but I thought of Sherlock Holmes’ quote and went on anyway.

“When you have eliminated the impossible, whatever remains, however improbable, must be the truth”

Stepping through qemu’s source, I noticed the code snippet below,

Looks like qemu has a log mode that prints out assembly instructions. Duh, I should have rtfm. Turns out qemu has some really sweet debugging printouts.

With interrupt logs, and guest assembly logs (-d int,in_asm) on, here’s the output I obtained (modified for clarity)

In line 9, the CPU attempts to dereference memory location 0x900000, which as mentioned earlier isn’t yet mapped to physical memory. This immediately triggers a page fault exception, which can be seen in lines 11 – 21. Lines 18 – 21 are the page fault handler code, which I have omitted since it isn’t the culprit.

We see in line 15 that when the page fault exception occurred, EIP was 0x001059a3, which means that after the page fault handler, the CPU would execute line 9 again. This is expected behavior: the CPU retries the access to memory location 0x900000 once the page has been mapped.

However, before the CPU can execute the instruction in line 9 again, a timer interrupt occurs, as can be seen in line 23. Notice also that EAX in lines 13 and 25 is 0x900000 in both, since that’s the address we were attempting to access when the page fault exception and the timer interrupt occurred, respectively.

After the timer interrupt, the CPU again tries to resume where it left off at EIP 0x001059a3. However, a second page fault exception now happens immediately, in line 33. What stands out is that EAX is now 0x3, meaning the CPU is trying to access memory location 0x3. And since the first 4KiB is intentionally unmapped (to catch nullptr errors, as explained earlier), a page fault exception is triggered.

Clue 7: EAX is changed after the timer interrupt

Fixing the bug

We are finally at the last section! If you are still with me: the bug is that our timer interrupt handler doesn’t restore registers after it is called. Thus whenever a timer interrupt occurs, it leaves the common registers in a changed state; in particular, EAX ends up set to 0x3. Per the System V ABI calling convention, the registers EAX, EDX, and ECX are free for a called function to use and need not be preserved.

The problem is that a timer interrupt can fire anywhere interrupts are enabled, which means that if it fires right before an instruction that expects some value in EAX, that register will contain junk instead.

This explains both buggy behaviors. In 1), the unmapped virtual memory access case, EAX, which is supposed to contain the newly mapped memory location, is now 0x3, leading to a second page fault exception. In 2), the vector indexing case, the indexed element is already loaded into EAX, but it is clobbered by the timer interrupt before being used.

Finally, the fix is to save and restore all the non-preserved registers around the timer interrupt handler.
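As a sketch of what that fix could look like on i386 (the actual KernOS change may differ), the entry stub can save and restore the general-purpose registers with pusha/popa around the handler call:

```cpp
// Sketch only, assuming the INTRP_ENTRY macro style used in KernOS.
// pusha/popa save and restore EAX, ECX, EDX (and the rest) around the
// handler, so a timer interrupt no longer clobbers the interrupted code.
#define INTRP_ENTRY(Type)                     \
  extern "C" void Interrupt##Type##Entry();   \
  extern "C" void Interrupt##Type##Handler(); \
  asm(                                        \
  ".globl Interrupt" #Type "Entry \n"         \
  "Interrupt" #Type "Entry:       \n"         \
  "    pusha                      \n"         \
  "    call Interrupt" #Type "Handler \n"     \
  "    popa                       \n"         \
  "    iret\n");
```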

All in all, the most useful thing I learned is to use qemu’s instruction and interrupt dumps, especially in a kernel building project. The quality of the information produced is just a gold mine.

Infinite Page Fault, Triple Fault, and General Protection Fault

In KernOS’s development, we are at the point of handling page fault exceptions. To do this, I have written the following test code,

volatile uint32_t *Mem = (uint32_t*) 0x400000;
*Mem = 10;

volatile uint32_t *Mem2 = (uint32_t*) 0x400f00;
*Mem2 = 20;

Memory location 0 to 4MB has been identity mapped at kernel initialization (see virtualmemory.cpp). Thus we are going to intentionally access memory beyond 4MB to trigger page fault.

The first *Mem is expected to cause a page fault, while the second *Mem2 being less than 4K apart from the first, should not trigger a page fault; as the latter’s page would have been mapped.

To test that the unmapped memory access can 1) trigger a page fault, and 2) be handled by adding a corresponding page table entry to the address’ page table, I wrote a crude page fault handler as below.

extern "C" void FaultPageHandler()
{
    uint32_t Fault_Address;
    asm volatile("mov %%cr2, %0" : "=r" (Fault_Address));

    if (  Fault_Address >= 4 * MB
       && Fault_Address  < 8 * MB
       )
    {
      VM::MapPageTable(1, VM::kernel_page_directory, VM::pagetable1);

      asm volatile( // flush tlb
      "invlpg %0"
      :
      : "m"(*(char*) Fault_Address)
      : "memory"
      );
    }
    else
    {
        kprintf("Page fault handler only works for 4MB - 8MB range for now\n");
        Hang();
    }

    kprintf("Page fault handler called\n");
}

Booting the kernel, I was expecting to see “Page fault handler called” being logged only once. And sure enough, that’s exactly what did not happen.

Instead, I was greeted with an infinite page fault, where the page fault handler keeps getting called. Initially, I attempted to troubleshoot by haphazardly trying whatever came to mind, like resetting cr3 to the page directory location, flushing (or not flushing) the TLB, and dumping registers.

Whatever I tried didn’t work. But at some point I got a triple fault, and after some more directionless meddling, a general protection fault. It was clear at this point that something was horribly wrong.

However, reasoning about the various types of faults does give us some clues. An infinite page fault indicates that we are repeatedly accessing unmapped memory; a triple fault indicates that we are encountering a second and third fault while handling the first; and a general protection fault indicates that we could be accessing protected memory, or simply executing bytes in memory not designated as instructions.

I thought this could be caused by the page fault handler jumping to an unexpected memory location. Inspecting the exception handler, it appeared innocent, as it’s the exact same wrapper I used for the interrupt handler, and the interrupt handler had been tested to work with the PIT controller.

The interrupt handler macro below simply wraps the handler call, returning with an iret instruction.

#define INTRP_ENTRY(Type)                     \
  extern "C" void Interrupt##Type##Entry();   \
  extern "C" void Interrupt##Type##Handler(); \
  asm(                                        \
  ".globl Interrupt" #Type "Entry \n"         \
  "Interrupt" #Type "Entry:       \n"         \
  "    call Interrupt" #Type "Handler \n"     \
  "    iret\n");

INTRP_ENTRY(Timer)

Finally, after some google-fu, I chanced upon a comment which said that the CPU pushes an additional error code on certain exceptions, which is not the case for interrupts.

Thus the exception handler shouldn’t be the exact same as the interrupt handler: it should pop the error code before returning. Otherwise, iret would be popping the stack contents of [error code, return address, segment registers, flags] into [instruction pointer, segment registers, flags], thoroughly messing up $eip, the segment registers, and the flags. This explains the infinite page faults, triple fault, and general protection fault that we were running into.

Here’s the revised exception handler code that correctly pops the error code.

#define FAULT_ENTRY(Type)                   \
    extern "C" void Fault##Type##Entry();   \
    extern "C" void Fault##Type##Handler(); \
    asm(                                    \
    ".globl Fault" #Type "Entry \n"         \
    "Fault" #Type "Entry:       \n"         \
    "    call Fault" #Type "Handler \n"     \
    "    add $0x4, %esp\n"                  \
    "    iret\n");

FAULT_ENTRY(Page)

The plan next is to get physical memory info from the bootloader and organize it into 4KB pages of available page frames for allocation. With this, we can then enhance our page fault handler to be capable of allocating virtual addresses beyond 8MB.

Planning the KernOS

As initial experimentation, I got a working x86 kernel binary. The key things it accomplishes are that it can be identified by a boot-loader, and subsequently transfers control to print characters to a VGA device [see https://github.com/kernyan/KernOS].

Next, to get better clarity on the design direction, let’s lay out some requirements and constraints,

Properties I want,

  1. Running on a modern processor (i.e. clock cycles in the order of GHz)
  2. Modern memory capacity (i.e. at least 4GB of memory)
  3. OS to be written in modern C++, but not wasteful of memory. Also assumes no library support
  4. Portable to multiple architectures (i.e. all assembly code has to be neatly placed in its own architecture-specific directory)
  5. Initial implementation targets x86

As a starting point, I want reliable and convenient debugging support as I expect this to be a huge time saver especially in debugging assembly code. Thus, I need

  1. Utility function to dump memory content in specified range
  2. Unit tests on modular functions to prevent accidental behavior change
  3. Implication of 2) is that I want to move away from the common technique of using globals for OS data structures to the extent possible; otherwise those globals have to be easily mocked for testing
  4. Pretty printing of data structure contents
  5. Gdb hooks, maybe?

With those constraints and requirements specified, which part of the OS components listed in [https://kernyan.com/2020/01/15/building-an-operating-system-kernos/] do we write first?

Let’s scan the immediate components and see what interfaces/data structures that are repeatedly needed,

  1. Process creation/switching
    • data structure to represent process, and its properties
    • assembly to save stack/register, and jump
    • rescheduling – tracking ready processes, ordered by priority
  2. Dynamic memory allocation
    • Partitioning heap memory into chunks,
    • Finding suitably sized unused chunks, marking as used
    • Returning and coalescing memory,
    • Memory pool for each process (to avoid one process hogging all free memory), maybe?
  3. Coordination of concurrent processes
    • Semaphores that track waiting processes

Of the above, 1) tracking ready processes, 2) tracking memory chunks, and 3) tracking waiting processes all require a similar feature: being able to link together similar data structures. Since we don’t have a heap allocator yet, we need a data structure that describes a collection of homogeneous but arbitrary data types, and that doesn’t require dynamic memory.

One solution is to give our data structure a contiguous memory layout. One can imagine the data structure holding pointers that each point to the next item. However, we cannot point directly at the item we are describing, because we also need to know how to get from that item to the next one. Instead, we introduce an “entry” data type that our data structure contains. Each “entry” holds the location of the actual item (its memory address, or an index into a C array if we assume the items are contiguous in memory as well) AND the location of the next “entry”. Thus, when following a sequence of items, we remain entirely within our own data structure.

Essentially, I am describing a linked-list with the requirement that all list nodes are contiguous.

Things to note about the linked-list that we need,

  1. We want the following atomic functions
    • remove an item from the list (and join the remaining parts of the list)
    • add an item to the list
  2. Ideally the list also accommodates ordering by priority, which is needed for process rescheduling

Let’s build that next.

Building an operating system – KernOS

I am starting a project of building an operating system, motivated purely by education rather than practical application. Still, it should be sufficiently capable to make the learning meaningful: for instance, it should have a filesystem, a scheduler, a GUI, and build tools.

There are many good operating system resources / guides (osdev, small OS book, SerenityOS, Xinu, etc.) out there. But I haven’t come across one that fulfills the following,

  1. Fully reproducible (with both qemu and physical hardware),
  2. Incrementally building the operating system, with explanation of the theory that doesn’t assume concepts not yet elaborated, and, most of all, working code to go along.
  3. Documentation of the exploration, and thought process between going from one step to the next. For example, what should be the next step once the processor reads the boot stub and jumps to the initialization function? Should we set up a stack and free store? Or should we set up interrupt vector table? Why are we choosing one against the other?
  4. Minimal feature so that the actual intent of learning an OS is not obscured. In this sense, studying linux code base is not what I was looking for.
  5. C++/C implementation because of my relative familiarity with those two languages compared to others
  6. Good practice, consistent coding-style, modular components, and unit tested

The OS should have the following components,

  1. boot loader
  2. kernel
    • processor, stack, and free store initialization; kernel data structures; system calls
    • scheduler
    • process creation / fork
  3. peripheral
    1. keyboard interrupt
    2. serial communication
  4. filesystem
  5. virtual memory map
  6. GUI
  7. user application
  8. networking
  9. build tools

Aside from the high level outline, I haven’t much intuition as to which order to proceed, well, that’s part of the learning and exploration process. This is going to take a while.

clang-format style-option listings

While perusing the clang-format documentation today, I noticed some very exciting features such as “AllowAllArgumentsOnNextLine”, “AllowAllConstructorInitializersOnNextLine”, and “AllowShortIfStatementsOnASingleLine (ShortIfStyle)”.

Eager to try out these new features, I set up a .clang-format file with the above style-options, only to find that the clang-format binary on my machine wasn’t up to date. It turns out the clang documentation was for clang 9.0, whose associated binaries hadn’t been officially released yet.

Looking around the clang docs, I also realized that while the style-options supported by each clang version are documented, there isn’t a single changelog-like document.

Such information would be particularly useful since, aside from new style-options being added, existing ones can also 1) be renamed, or 2) have their settings upgraded from boolean to enum.

Perhaps there’s already a style-option listing somewhere; if so, please let me know. In the meantime, I wrote a python script to pull the style-option interface and curated the results [here] (the link tables new additions, changes to interface, and a total listing of the clang-format style-options).

For clang 9.0 in particular,

New for clang-format 9

  • AlignConsecutiveMacros (bool)
  • AllowAllArgumentsOnNextLine (bool)
  • AllowAllConstructorInitializersOnNextLine (bool)
  • AllowShortLambdasOnASingleLine (ShortLambdaStyle)
  • TypenameMacros (std::vector&lt;std::string&gt;)
  • NamespaceMacros (std::vector&lt;std::string&gt;)
  • SpaceAfterLogicalNot (bool)

Changed interface for clang-format 9

  • from: AllowShortIfStatementsOnASingleLine (bool)
  • to: AllowShortIfStatementsOnASingleLine (ShortIfStyle)

Comparison operator for any user defined class in C++

For fast indexing/lookup without having to specify a hash function in C++, we can use memcmp to get a generic comparison operator for an arbitrary class.

When we need a dictionary (or key value pair) structure in C++, a good place to start is to consider std::map, or std::unordered_map.

std::unordered_map requires that the key be hashable (i.e., that std::hash&lt;type&gt; be defined), while std::map requires that the key have a less-than comparison operator defined.

std::unordered_map, implemented as a hash table, provides O(1) lookup, while std::map, implemented as a tree, provides O(log n) lookup. Therefore if lookup time is our main concern, we should prefer unordered_map.

However, when the key is a user defined class, defining a good hash function is not straightforward. A good hash function needs several properties, among others,

  • to be fairly uniform in mapping the domain over the possible range,
  • range has to be large enough to make collision unlikely,
  • hash algorithm has to be fast and account for arbitrary member layout of the user defined class

When the user defined class has uniform members, a hash function similar to java’s hashCode implementation is a good reference. The idea roughly works as follows,

struct UserObj
{
    float a;
    float b;
    float c;
};

namespace std
{
    template <>
    struct hash<UserObj>
    {
        size_t operator() (UserObj const & in) const
        {
            const size_t n = sizeof(UserObj) / sizeof(float);
            const float *p = (const float*) &in;  // const: "in" is a const reference
            size_t key = 17;
            for (size_t i = 0; i < n; ++i) {
                key = 31 * key + hash<float>()(*p++);
            }
            return key;
        }
    };
}

However, the hash function above has two limitations,

  1. size_t is the return type defined by std::hash, and thus our hash template specialization is limited to the range of size_t. On a 32 bit architecture, that gives roughly a 1 in 10’000 probability of collision once our container size reaches around 1’000, which is quite unsatisfactory (and that’s assuming our hash algorithm uniformly spreads the input keys over the 32 bit integer range, which is not always true),
  2. the hash function is generic only when all the members of the user defined class are of the same type; otherwise the pointer-arithmetic hash algorithm has to be reviewed each time a new member is added.

What’s the alternative then?

If we can live with O(log n) lookup, then we can do without the hash function entirely by using std::map as the container. In exchange for the slower lookup, we no longer have to worry about hash collisions. The O(log n) lookup complexity is a consequence of std::map’s implementation, which orders the elements in a binary tree. That’s also why we need to define a less-than operator.

Now, for a less-than operator, all we need is a rule that systematically tells us whether one element is less than another. As long as the rule is systematic, there’s no requirement on what rule we use. Knowing this, the simplest (and also most generic) way of comparing two arbitrary elements is to compare their bits. Our rule could then be: whichever element has the lower value at the first differing byte is the smaller one. It doesn’t matter whether the bytes are ordered little or big endian, because either way the rule is systematic.

This is precisely what the built-in C library function memcmp does, where

memcmp

returns value < 0 when the first byte that does not match in both memory blocks has a lower value in ptr1 than in ptr2 (if evaluated as unsigned char values)

http://www.cplusplus.com/reference/cstring/memcmp/

Thus we can write something like,

bool operator<(UserObj const & lhs, UserObj const & rhs)
{
    return memcmp(&lhs, &rhs, sizeof(lhs)) < 0;
}

A caveat worth mentioning is that this only works on structures that contain no pointers or references, since in those cases memcmp would be comparing the pointer addresses instead of the pointed-to content. A second caveat: memcmp also compares any padding bytes between members, which may hold indeterminate values unless the objects are zero-initialized.

Getting Windows pdb symbols for an air gapped / isolated environment

Occasionally, while debugging in visual studio, we encounter certain MFC functions for which no pdb symbols exist: for instance, the dll functions with gibberish names in visual studio’s call stack window, as seen below.

Those unloaded symbols are due to VS not finding the corresponding pdb for the dll used. pdb files are useful for debugging: they show the actual function names and match functions to their source files (provided that the appropriate source and header files are present in the VS solution).

However, another important reason to want pdbs, unrelated to the dll functions themselves, is how their absence affects visual studio’s line stepping time during debugging. Unloaded symbols severely slow down line by line debugging, which can be alleviated by closing the “Call Stack” window.

An alternative to closing the “Call Stack” window is to simply load the symbols, for MFC symbols, right clicking on the dll functions and selecting “Load Symbols” will get VS to retrieve the missing symbols from Microsoft’s symbol server.

Simple.

What’s tricky is figuring out how to load those symbols on an isolated environment / air gapped machine that can’t connect to Microsoft’s symbol server.

We could manually download the pdb files and transfer them over to the isolated environment. To do that, we first need to know the name and version of the pdb we need. Obviously, the pdb’s version has to match the dll used, which can be found by double clicking on the dll functions in the call stack window (ensure that you uncheck show disassembly in VS’s debug settings beforehand, or you’ll get the disassembly instead of the dll information page below)

While we now know the right pdb version and name, we don’t know where to obtain it, since pdbs aren’t exactly listed for download by Microsoft.

For that, we can use the Windows Debugger Tools’ symchk.exe to download the pdb manually (on a separate machine with internet access) by specifying a manifest file (a format that tells Microsoft’s symbol server which pdb files we need).

The manifest file can be created from the executable using symchk, but that means we have to either transfer the executable or dll out of the isolated environment, or create the manifest file inside the isolated environment and then transfer the manifest file out.

If transferring files out from the isolated environment is possible, we can follow symchk’s instruction here to obtain the pdb.

Some isolated environments have strict policies on transferring files out. In this case, our best bet is to manually construct the manifest file using the GUIDs of the pdb-s we need. The GUID is a unique identifier for the name and version of a pdb.

To obtain the GUID of the pdb-s we need, we can trigger the creation of a symbol folder with the GUID as its path by clicking “Load Symbol” in the “Call Stack” of Visual Studio, and then view the GUID by clicking “Symbol Load Information”.

Copy the GUID down on paper and reconstruct the manifest file manually, as below,

Then use symchk.exe to download the pdb using the manifest file as argument,

>symchk.exe /v /im manifest.txt

By default, the downloaded pdb-s will be placed in C:\Windows\SYMBOLS\

Once you have the pdb-s, we can then transfer and load them on to the isolated environment.

Efficiency of Function Return in C/C++

Function return mechanism in assembly. Impact of Return Value Optimization.

In the C++ code snippet below,

UserObject Foo()
{
    UserObject Local = UserObject();
    return Local;
}

int main()
{
    UserObject O = Foo();
}

Foo() has a class return value, which is the local variable in line 3.

Let’s start with the behavior without Return Value Optimization (RVO). As is typical for local variables, when execution jumps to Foo(), the CPU will 1) allocate the local variable “Local” on the stack, and 2) call “Local”‘s default constructor on the allocated stack space. At the end of the function call, the return value is passed back to the caller by keeping the memory address of “Local” in the EAX/RAX register. Subsequently, with control returning to main, the CPU calls “O”‘s copy constructor (or move constructor if there is one) with the returned value as argument.

The key thing to note here is that “O” is constructed by first having the CPU construct “Local” in Foo(), and then calling “O”‘s copy constructor.

In C++17, the standard now guarantees RVO for a returned temporary (a prvalue); for a named local like “Local”, the elision (NRVO) is still permitted but not guaranteed. With RVO, instead of allocating stack memory for “Local”, the CPU calls UserObject’s constructor in Foo() directly on the caller’s variable location (in this case, the memory address of “O”). With this, there’s no longer any need for the caller (i.e. main) to invoke UserObject’s copy constructor when Foo() returns.

What Passing by Value Actually Means to the CPU

Taking a look at pass-by-value and pass-by-reference from the perspective of the computer

In C++, there are three ways of passing a variable to a function: (1) by value, (2) by reference, and (3) by pointer.

The compiler cares about the distinction between (2) and (3), but the CPU does not. Thus, we are going to treat (2) and (3) as the same case. You can convince yourself that they are indeed the same by inspecting the disassembly below.

Back to the more interesting comparison of pass-by-value vs pass-by-reference. The following code snippet shows two function calls that take the same variable as parameter

struct Four64Bits
{    
    long long i1;
    long long i2;
    long long i3;
    long long i4;  
};  

void ByReference (Four64Bits &in) 
{
}

void ByValue (Four64Bits in) 
{
}

int main()
{
    Four64Bits Var;

    ByReference(Var);    
    ByValue(Var);

    return 0;
}

The two function calls do exactly nothing, and they differ only in the parameter type. You might have heard that passing a parameter by reference instead of by value saves you a copy. To see concretely what “saving a copy” means, let’s take a look at godbolt‘s disassembly again,

On ByReference

Right side’s line 18, “lea rax, [rbp-32]”, is taking the memory location of “Var”. The location is “[rbp-32]” because local variables are allocated on the stack at a negative offset of their size (and Four64Bits is 32 bytes). Line 19, “mov rdi, rax”, moves the memory location of “Var” into register rdi, which you can then see in line 4 being used in the ByReference instructions. What’s most important to notice is that we only needed a single 64-bit memory location to pass a variable of 4*64 bits in size.

On ByValue

On the other hand, the right side’s lines 21 to 24, each of the form “push QWORD PTR [rbp-8]” (with a different offset each time), show that the CPU had to execute 4 instructions just to pass “Var” to the ByValue function. “push QWORD” means that the CPU is copying 8 bytes of data to the stack. Thus we see that four “push QWORD” instructions are required to fully copy a 32 byte variable.

The takeaway is that passing by value means the CPU physically copies all the bits of the variable, instead of just the address of its memory location.