Tracking a kernel bug that doesn’t appear in gdb

I have been trying to locate the cause of a seemingly random bug in KernOS, and after a long investigation, I finally pinned it down.

In hindsight, the bug could have been reasoned out much earlier, as the investigation turned up more and more credible clues along the way. It makes for a fun detective challenge: follow along and see if you can reason out the issue using fewer clues than I needed.

To do this, I will lay out my investigation process and label the important clues I discovered chronologically.

Table of contents/clues

  1. Bug behavior 1
  2. Bug behavior 2
  3. Stepping through gdb, instrumenting gdb
  4. Ruling out page directory setup, page fault handler, kernel heap allocation
  5. Instrumenting qemu
  6. Ruling out timer interrupt handler
  7. Noticing pattern of interrupt that causes bug
  8. Fixing the bug

Bug behavior 1

The first time the bug was observed was while I was implementing a page fault mechanism, where a memory access to an unmapped virtual address causes a page fault exception, which in turn leads to allocating a physical page frame and updating the page directory entry.

For context, the 0 – 8MB physical memory range is identity mapped into virtual memory, except that the first 4KiB page remains unmapped to catch nullptr dereferences. In other words, any reference outside the mapped range (or within the first 4KiB) will cause an exception. Below is the code to test whether virtual memory works.

  
    void MemoryTest()
    {
       kprintf("Test %i: Page fault and virtual memory mapping\n", ++TestCount);

       volatile uint32_t *Mem = (uint32_t*) 0x500000;  // 0 - 8MB is already mapped
       *Mem;                                           // won't cause Page fault exception

       volatile uint32_t *Mem2 = (uint32_t*) 0x900000;
       *Mem2 = 1;                                      // memory access causes Page Fault
    }

Clue 1: The bug was that in about 1 in 5 runs, right after 0x900000 is mapped, I would get another page fault exception at address 0x3.

Bug behavior 2

In addition to virtual memory mapping, I also have a test for a custom vector container. Since this vector container isn't from the STL, heap allocation of its elements is done (in a rather crude manner) by reserving a chunk of memory in the kernel's .text section, which is managed by a K&R-style linked list (a minimal sketch of the idea appears a little further below). The vector test looks like this:

 
    void VectorTest()
    {
       kprintf("Test %i: Vector and kmalloc/free\n", ++TestCount);

       KL::Vector V(5);

       V[2] = 5;
       V[4] = 15;

       V.Pushback(30);
       V.Pushback(31);

       kassert(V[2] == 5, "Vector[] failed\n");
       kassert(V[6] == 31, "Vector::Pushback failed\n");
       kassert(V.Back() == V[6], "Vector::Back failed\n");
    }

Any of the asserts above would fail randomly, once again with a frequency of about once every few kernel boots.
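Here is the promised sketch of the K&R-style free list idea: a first-fit free list carved out of a statically reserved buffer. This is not KernOS's actual allocator; the names and sizes are illustrative, and kfree is omitted for brevity.

    struct Header
    {
       Header  *Next;  // next free block
       uint32_t Size;  // size of this block, in units of sizeof(Header)
    };

    alignas(Header) static uint8_t HeapArea[64 * 1024];  // statically reserved chunk
    static Header *FreeList = nullptr;

    void *kmalloc(uint32_t Bytes)
    {
       if (!FreeList)  // first call: the whole chunk is one free block
       {
          FreeList       = reinterpret_cast<Header*>(HeapArea);
          FreeList->Size = sizeof(HeapArea) / sizeof(Header);
          FreeList->Next = nullptr;
       }

       // round the request up to header-sized units, plus one unit for the header itself
       const uint32_t Units = (Bytes + sizeof(Header) - 1) / sizeof(Header) + 1;

       for (Header **Prev = &FreeList; *Prev; Prev = &(*Prev)->Next)
       {
          Header *Block = *Prev;
          if (Block->Size < Units)
             continue;

          if (Block->Size == Units)  // exact fit: unlink the whole block
          {
             *Prev = Block->Next;
          }
          else                       // otherwise hand out the tail end of the block
          {
             Block->Size -= Units;
             Block       += Block->Size;
             Block->Size  = Units;
          }
          return Block + 1;          // usable memory starts right after the header
       }
       return nullptr;               // out of memory
    }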

To further figure out what’s going on, I added a series of vector indexing calls, as below.


    void VectorTest()
    {
       kprintf("Test %i: Vector and kmalloc/free\n", ++TestCount);

       KL::Vector V(5);

       V[2] = 5;

       kputchar("%i", V[2]);
       kputchar("%i", V[2]);
       kputchar("%i", V[2]);
       kputchar("%i", V[2]);
       ...
       kputchar("%i", V[2]);
    }

And the output looked like this:

Clue 2: Once in a while, the indexing would return 0, and in those cases it would also trigger a page fault at address 0x0.

Stepping through gdb, instrumenting gdb

Given the two random behaviors above, the first thing that came to mind was to use gdb to figure out how something like V[2] could suddenly return a different value. After many attempts, I could never reproduce any of the buggy behavior while single stepping.

Perhaps I just wasn’t running the kernel enough times to catch the bug occurring. Single stepping was also pretty tedious, so next I tried to run the kernel many times and simply log every executed instruction (using gdb on qemu on the kernel), hoping to analyze the executed code later, once the bug occurred.

Here’s how I was running qemu and gdb,

    qemu-system-i386 -s -S kern.bin
    # -s is shorthand for -gdb tcp::1234, i.e. let gdb connect on port 1234
    # -S means freeze the CPU at startup until gdb tells it to continue

And in gdb

   
     target remote :1234
    set pagination off
    set logging on
    display/i $pc
    while 1
    si
    end

Clue 3: No matter how many times I run the kernel and log the instructions, the bug NEVER occurs.

Ruling out page directory setup, page fault handler, kernel heap allocation

Stumped by the fact that gdb wasn’t able to record the behavior, I decided to move on from the debugging approach for the time being. I nonetheless felt that virtual memory/page fault handling was the culprit.

Thus I carefully inspected all code related to page fault handling.

Clue 4: It is not related to

  • Page fault exception handler
  • Page frame allocation
  • Kernel heap memory allocator
  • Page directory, or page table entry update

Instrumenting qemu

At this point, I decided that I needed to look into qemu itself in order to understand what’s going on, so I compiled qemu and started learning how it works. Poking around the source code, I realized that logging instructions from within qemu while it executes the kernel still allows the buggy behavior to occur. That was very important progress: as long as I can reproduce and log the sequence of instructions that causes the bug, I can eventually figure this out. Qemu’s source code is still one complex beast, but at least the bug is no longer elusive.

Before this, the bug almost felt supernatural, as if it knew not to manifest when you’re looking at it.

After some fumbling, I decided to start by inspecting qemu’s handling of page fault exceptions. The function where exceptions and interrupts are handled in qemu is do_interrupt_all. Logging it, I realized that when the bug occurs, it occurs on the 314-th call to do_interrupt_all. This is pretty unscientific, but I decided to log qemu instructions on the 313-th and 314-th calls and diff them to see where they diverge.
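To break on a specific call count, I inserted dummy functions into qemu. As a rough sketch of the idea (this is my own instrumentation, not qemu’s code; the counter and function names are mine), it amounts to something like:

    // Empty, non-static functions give gdb a convenient symbol to break on
    // exactly the Nth call of the function being studied.
    static int interrupt_call_count;

    void Break313(void) {}  // in gdb: break Break313
    void Break314(void) {}  // in gdb: break Break314

    void CountInterruptCall(void)  // called at the top of do_interrupt_all
    {
       ++interrupt_call_count;
       if (interrupt_call_count == 313) Break313();
       if (interrupt_call_count == 314) Break314();
    }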

Here’s what I got

You can ignore Break313 and Break314; they’re simply the dummy functions I inserted into qemu’s do_interrupt_all to allow easy breaking. By chance, I noticed an important difference between the 313-th and 314-th calls. Looking at line 18 in the image above, you can see that the intno on the left is 32, while the one on the right is 14.

So it looks like interrupt number 32 occurs before interrupt number 14. I recalled that the page fault exception is number 14, and since the bug had always been related to page faults, that wasn’t surprising at all. But what is interrupt number 32? And could the fact that it occurred right before the page fault exception be the cause of our bug?

Looking at my kernel’s source code, I had remapped interrupts on the Programmable Interrupt Controller according to the standard ISA IRQ numbers, with an offset of 0x20. Thus interrupt number 32 (0x20) corresponds to the timer interrupt (IRQ 0).
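For readers who haven’t set this up before, a minimal sketch of such a remap is shown below. This is not KernOS’s actual code; outb(port, value) and the constants are assumed names for the kernel’s port-output helper, but the ICW1–ICW4 sequence is the standard 8259 initialization that places IRQ 0–7 at vectors 0x20–0x27 and IRQ 8–15 at 0x28–0x2F.

    constexpr uint16_t PIC1_CMD  = 0x20, PIC1_DATA = 0x21;  // master PIC ports
    constexpr uint16_t PIC2_CMD  = 0xA0, PIC2_DATA = 0xA1;  // slave PIC ports

    void RemapPIC()
    {
       outb(PIC1_CMD,  0x11);  // ICW1: begin initialization, expect ICW4
       outb(PIC2_CMD,  0x11);
       outb(PIC1_DATA, 0x20);  // ICW2: master offset 0x20 -> IRQ0 (timer) becomes interrupt 32
       outb(PIC2_DATA, 0x28);  // ICW2: slave offset 0x28
       outb(PIC1_DATA, 0x04);  // ICW3: slave attached at IRQ2
       outb(PIC2_DATA, 0x02);  // ICW3: slave cascade identity
       outb(PIC1_DATA, 0x01);  // ICW4: 8086/88 mode
       outb(PIC2_DATA, 0x01);
    }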

This was a revelation. I had never suspected that the bug was related to the timer interrupt, but a timer interrupt bug fits the randomness of the behavior perfectly.

Now, why would a timer interrupt not manifest in gdb? Armed with this information, it didn’t take long to confirm that when single stepping through gdb, qemu by default disables the clock timer and IRQ interrupts. The qemu docs here indicate how to enable both during gdb debugging, and sure enough, once I enabled them, the bug showed up in gdb as well.
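For reference, this is controlled through the gdbstub’s single-step mask. The commands below are as described in qemu’s gdb documentation (the exact output may vary by version); keeping only the ENABLE bit lets IRQs and the timer fire while single stepping:

    # query the single-step mask bits and their meaning
    maintenance packet qqemu.sstepbits
    # typically returns: ENABLE=1,NOIRQ=2,NOTIMER=4

    # keep only ENABLE, i.e. allow IRQs and the timer during single stepping
    maintenance packet Qqemu.sstep=0x1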

Clue 5: The bug occurs after a timer interrupt.

Ruling out timer interrupt handler

In hindsight, the problem is obvious now, but at the time it still didn’t cross my mind how the timer interrupt could be causing the buggy behaviors we had observed, namely 1) an immediate page fault right after the page fault handler, and 2) vector indexing returning a different value.

Thus I continued by inspecting the timer-interrupt-related code. Below is the InterruptTimerHandler:

   
    extern "C" void InterruptTimerHandler()
    {
        {
            IRQAcknowledge AckEOI(INTRP::IVT::TIMER);
        }

        if (++timer_ticks >= TICKS_PER_SECOND)
        {
            ++seconds_since_boot;
            timer_ticks = 0;
        }

        if (seconds_since_boot == 3)
        {
            seconds_since_boot = 0;
            PROCESS::Reschedule();
        }
    }

It’s very simple: it acknowledges the IRQ and increments a tick count, and once the count reaches 3 seconds, it calls the process scheduler. I didn’t see any issue with the timer handler.

Clue 6: Interrupt timer handler is correct.

Noticing pattern of interrupt that causes bug

Now I started to suspect that it was qemu’s i386 hardware emulation that was buggy, and that it somehow messes up the system’s state after a timer interrupt. I felt this was unlikely, but I thought of Sherlock Holmes’ quote and went on anyway.

“When you have eliminated the impossible, whatever remains, however improbable, must be the truth”

Stepping through qemu’s source, I noticed the code snippet below,

Looks like qemu has a log mode that prints out the assembly instructions it executes. Duh, I should have rtfm. Turns out qemu has some really sweet debugging printouts.
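For example, an invocation along the lines below (my exact flags may have differed slightly) writes the interrupt and guest-assembly logs to a file instead of stderr:

    qemu-system-i386 -d int,in_asm -D qemu.log kern.bin
    # -d int,in_asm  log interrupts/exceptions and guest assembly as it is translated
    # -D qemu.log    write the log to qemu.log instead of stderr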

With interrupt logs and guest assembly logs (-d int,in_asm) on, here’s the output I obtained (modified for clarity).

In line 9, the CPU is attempting to dereference memory location 0x900000, which as mentioned earlier isn’t yet mapped to physical memory. This immediately triggers a page fault exception, which can be seen in lines 11 – 21. Lines 18 – 21 are the page fault handler code, which I have omitted since it isn’t the culprit.

We see in line 15 that when the page fault exception occurred, EIP was at 0x001059a3, which means that after the page fault handler returns, the CPU will execute line 9 again. This is expected behavior: the CPU retries the access to memory location 0x900000 now that the page has been mapped.

However, before the CPU can execute the instruction in line 9 again, a timer interrupt occurs, as can be seen in line 23. Notice also that EAX in lines 13 and 25 is 0x900000 in both cases, since that’s the address we were attempting to access when the page fault exception and the timer interrupt occurred, respectively.

After the timer interrupt, the CPU again tries to resume where it left off at EIP 0x001059a3. However, we notice that a second page fault exception now happens immediately, in line 33. What stood out is that EAX is now 0x3, meaning the CPU is trying to access memory location 0x3. And since the first 4KiB is intentionally unmapped (to catch nullptr errors, as explained earlier), a page fault exception is triggered.

Clue 7: EAX has been changed after the timer interrupt.

Fixing the bug

We are finally at the last section! If you are still with me, the bug is that our timer interrupt handler doesn’t restore the registers it clobbers. Thus whenever a timer interrupt occurs, it leaves some of the common registers in a changed state; in particular, EAX ends up set to 0x3. This is allowed behavior under the System V ABI calling convention: the registers EAX, EDX, and ECX are caller-saved, so the compiled handler is free to use them without preserving their values.

The problem is that the timer interrupt can fire anywhere interrupts are enabled, which means that if it fires right before an instruction that expects to use a value stored in EAX, that register will contain junk instead.

This explains the two buggy behaviors. In 1), the unmapped virtual memory access case, EAX, which is supposed to contain the now-mapped memory location, is instead 0x3, leading to a second page fault exception. In 2), the vector indexing case, the indexed element has already been loaded into EAX, but it is corrupted by the timer interrupt before it is used.

Finally, the fix is to save all the non-preserved registers before executing the timer interrupt handler and restore them afterwards.
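As a sketch, based on the INTRP_ENTRY macro shown later in this series, the fixed entry stub wraps the call with pusha/popa so that all the general purpose registers are saved and restored around the C handler (saving just EAX, ECX, and EDX would also be enough to satisfy the ABI):

    #define INTRP_ENTRY(Type)                     \
      extern "C" void Interrupt##Type##Entry();   \
      extern "C" void Interrupt##Type##Handler(); \
      asm(                                        \
      ".globl Interrupt" #Type "Entry \n"         \
      "Interrupt" #Type "Entry:       \n"         \
      "    pusha                          \n"     \
      "    call Interrupt" #Type "Handler \n"     \
      "    popa                           \n"     \
      "    iret\n");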

All in all, the most useful thing I learned is to utilize qemu’s instruction and interrupt dumps, especially in a kernel-building project. The information they produce is a gold mine.

Infinite Page Fault, Triple Fault, and General Protection Fault

In KernOS’s development, we are at the point of handling page fault exceptions. To test this, I have written the following code:

volatile uint32_t *Mem = (uint32_t*) 0x400000;
*Mem = 10;

volatile uint32_t *Mem2 = (uint32_t*) 0x400f00;
*Mem2 = 20;

Memory locations 0 to 4MB have been identity mapped at kernel initialization (see virtualmemory.cpp). Thus we are going to intentionally access memory beyond 4MB to trigger a page fault.

The first access, *Mem, is expected to cause a page fault, while the second, *Mem2, being less than 4K away from the first, should not trigger one, as its page will already have been mapped by then.

To test that the unmapped memory access can 1) trigger a page fault, and 2) be handled by adding a corresponding entry to the address’s page table, I wrote a crude page fault handler as below.

extern "C" void FaultPageHandler()
{
    uint32_t Fault_Address;
    asm volatile("mov %%cr2, %0" : "=r" (Fault_Address));

    if (  Fault_Address >= 4 * MB
       && Fault_Address  < 8 * MB
       )
    {
      VM::MapPageTable(1, VM::kernel_page_directory, VM::pagetable1);

      asm volatile( // flush tlb
      "invlpg %0"
      :
      : "m"(*(char*) Fault_Address)
      : "memory"
      );
    }
    else
    {
        kprintf("Page fault handler only works for 4MB - 8MB range for now\n");
        Hang();
    }

    kprintf("Page fault handler called\n");
}

Booting the kernel, I was expecting to see “Page fault handler called” being logged only once. And sure enough, that’s exactly what did not happen.

Instead, I was greeted with an infinite page fault, where the page fault handler kept getting called. Initially, I attempted to troubleshoot it by haphazardly trying whatever came to mind, like resetting cr3 to the page directory location, flushing (or not flushing) the TLB, and dumping registers.

None of it worked. But at some point I got a triple fault, and after some more directionless meddling, a general protection fault. It was clear at this point that something was horribly wrong.

However, reasoning about the various types of fault does give us some clues. An infinite page fault indicates that we are repeatedly accessing unmapped memory; a triple fault indicates that further faults are being raised while the CPU is already trying to handle a fault; and a general protection fault indicates that we could be accessing protected memory, or simply executing bytes in memory not designated as instructions.

I thought this could be coming from the page fault handler jumping to an unexpected memory location. Inspecting the exception handler, it appeared innocent, as it’s the exact same wrapper that I used for the interrupt handler, and the interrupt handler had already been tested and works with the PIT controller.

The interrupt handler macro below simply wraps the handler call with an iret instruction.

#define INTRP_ENTRY(Type)                     \
  extern "C" void Interrupt##Type##Entry();   \
  extern "C" void Interrupt##Type##Handler(); \
  asm(                                        \
  ".globl Interrupt" #Type "Entry \n"         \
  "Interrupt" #Type "Entry:       \n"         \
  "call Interrupt" #Type "Handler \n"         \
  "    iret\n");

INTRP_ENTRY(Timer)

Finally, after some google-fu, I chanced upon a comment which said that the CPU pushes an additional error code for certain exceptions, which is not the case for interrupts.

Thus the exception handler shouldn’t be the exact same as the interrupt handler: it should pop the error code before returning. Otherwise, iret pops the stack contents of [error code, return EIP, CS, EFLAGS] into [EIP, CS, EFLAGS], thoroughly messing up $eip, the code segment, and the flags. This explains the infinite page fault, triple fault, and general protection fault that we were running into.

Here’s the revised exception handler code that correctly pops the error code.

#define FAULT_ENTRY(Type)                   \
    extern "C" void Fault##Type##Entry();   \
    extern "C" void Fault##Type##Handler(); \
    asm(                                    \
    ".globl Fault" #Type "Entry \n"         \
    "Fault" #Type "Entry:       \n"         \
    "    call Fault" #Type "Handler \n"     \
    "    add $0x4, %esp\n"                  \
    "    iret\n");

FAULT_ENTRY(Page)

The plan next is to get physical memory info from the bootloader and organize it into 4KB page frames available for allocation. With this, we can then enhance our page fault handler to be capable of mapping virtual addresses beyond 8MB.
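As a sketch of what that might look like (this is not KernOS’s actual design; the names and the fixed 128MB cap are illustrative), a simple bitmap over the frames reported by the bootloader is enough to get started:

struct FrameAllocator
{
    static constexpr uint32_t FRAME_SIZE = 4096;
    static constexpr uint32_t MAX_FRAMES = 32768;  // covers 128MB of physical memory

    uint32_t Bitmap[MAX_FRAMES / 32] = {};         // 1 bit per 4KB frame, 1 = free

    // Mark the physical range [Start, End) as usable, e.g. from the bootloader's memory map.
    // Frame 0 is never added here, so 0 can double as the "out of frames" return value below.
    void AddRegion(uint32_t Start, uint32_t End)
    {
        for (uint32_t F = Start / FRAME_SIZE; F < End / FRAME_SIZE && F < MAX_FRAMES; ++F)
            Bitmap[F / 32] |= 1u << (F % 32);
    }

    // Return the physical address of a free 4KB frame, or 0 if none are left.
    uint32_t Allocate()
    {
        for (uint32_t F = 0; F < MAX_FRAMES; ++F)
            if (Bitmap[F / 32] & (1u << (F % 32)))
            {
                Bitmap[F / 32] &= ~(1u << (F % 32));
                return F * FRAME_SIZE;
            }
        return 0;
    }
};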

Building an operating system – KernOS

I am starting a project to build an operating system, motivated purely by education rather than practical application. It should nonetheless be sufficiently capable to make the learning meaningful; for instance, it should have a filesystem, a scheduler, a GUI, and build tools.

There are many good operating system resources and guides out there (osdev, small OS book, SerenityOS, Xinu, etc.), but I haven’t come across one that fulfills all of the following:

  1. Fully reproducible (with both qemu and physical hardware).
  2. Builds the operating system incrementally, with explanations of the theory that don’t assume concepts which haven’t been elaborated earlier, and most of all, working code to go along with it.
  3. Documents the exploration and the thought process in going from one step to the next. For example, what should be the next step once the processor reads the boot stub and jumps to the initialization function? Should we set up a stack and free store? Or should we set up the interrupt vector table? Why are we choosing one over the other?
  4. Minimal features, so that the actual intent of learning an OS is not obscured. In this sense, studying the linux code base is not what I was looking for.
  5. C++/C implementation, because of my relative familiarity with those two languages compared to others.
  6. Good practices: consistent coding style, modular components, and unit tests.

The OS should have the following components,

  1. boot loader
  2. kernel
    • processor, stack, and free store initialization; kernel data structures; system calls
    • scheduler
    • process creation / fork
  3. peripherals
    • keyboard interrupt
    • serial communication
  4. filesystem
  5. virtual memory map
  6. GUI
  7. user application
  8. networking
  9. build tools

Aside from this high level outline, I haven’t much intuition as to which order to proceed in; well, that’s part of the learning and exploration process. This is going to take a while.