27 April 2026

This is going to be a controversial post, but hear me out!

Also, this post relates to a previous one, so check it out too! https://www.kamkow1lair.pl/blog/MOP2/mutex-sync.html

Synchronization

A big portion of a working kernel is the synchronization code.

In a fairly modern OS, you’d want to support SMP or Symmetric Multiprocessing (multiple cores), so that the user experience doesn’t feel clogged up. Also you can run more tasks concurrently.

Synchronization (or lack thereof) is also a HUGE source of bugs, known as race conditions. What is a race condition? A race condition occurs when two concurrent tasks work with the same resource at the same time - e.g. one task modifies a variable while another one reads it - then the reader may see a partially-written value, leading to broken state.

So what do you do then?

You implement LOCKS.

There are many types of locks, but today we’ll focus on two: big and fine-grained.

Big vs. fine-grained locks

What’s the difference?

The difference is scope / domain. Big locks don’t care about scope; they just restrict access to entire subsystems or even the entire kernel. An example of this is NetBSD, which ran a big project aiming to "modernize" its networking code (ref: https://wiki.netbsd.org/projects/project/smp_networking/). This is a consequence of building on top of the 4.4 (4.3?) BSD kernel, which was built in times when SMP hardware was quite unpopular, so it took a lot of effort to rewrite the system in an SMP-friendly manner. All BSD systems have suffered from this, and so did Linux before 2.6.39, when the Big Kernel Lock was finally removed.

All of these systems have switched to fine-grained locking, which on paper is better, because you have many tiny locks that span small critical sections. This keeps the kernel from blocking other CPUs/tasks for too long, which makes the system faster / able to handle more throughput.

The main idea here is that big locks are considered "bad". And yes, they are, in the sense that they make critical sections system-wide; but on the other hand, they make concurrent code easier to reason about (there’s nothing to reason about) and maintain. Of course, this comes with a performance penalty, but check this out:

Nobody cares

Or at least nobody does in scenarios like mine.

Why does the big lock suit MOP3 better?

Fine-grained locking, while giving a big performance boost, is hard (really hard), and even pro-sigma-h4x0r-67 Linux kernel devs can’t get it right at times. When you have so many locks, it’s easy to lose your mind and forget things.

A big lock is not an issue for MOP3, because

  1. it’s a hobby OS, so there’s no pressure to squeeze out best performance imaginable

  2. there’s barely a performance penalty, because there isn’t much code to contend over (not enough apps, drivers and so on)

Lock hierarchies

Also one thing to keep in mind about fine-grained locks is the lock hierarchy. What is it?

ABBA deadlocks

Picture this:

  1. Task 1 takes lock A

  2. Task 1 takes lock B

  3. Task 2 takes lock B, but B is taken, so it waits

  4. Task 2 takes lock A, but A is taken, so it waits

Now there’s a circular dependency - Task 1 depends on Task 2 to release its lock and Task 2 depends on Task 1. DEADLOCK!!!

So what do you do? You have to enforce a lock hierarchy - lock B must ALWAYS be taken while holding lock A, never the other way around; doing otherwise will lead to the issue presented above.

Obviously big locks don’t have this issue, because there are no other locks :)

What changes in the code?

Now we need to answer the question: when do we hold the big lock?

  • On kernel entry in: interrupt handlers (only when entered from user mode - more on that below) and syscalls

The syscalls look like this now:

extern void syscall_entry(void);

static uintptr_t syscall_dispatch1(void* stack_ptr) {
  struct saved_regs* regs = stack_ptr;

  struct proc* caller = thiscpu->proc_current;

  int caller_pid = caller->pid;

  memcpy(&caller->pdata.regs, regs, sizeof(struct saved_regs));

  fx_save(caller->pdata.fx_env);

  int syscall_num = regs->rax;
  syscall_handler_func_t func = syscall_find_handler(syscall_num);

  if (func == NULL) {
    return -ST_SYSCALL_NOT_FOUND;
  }

  struct reschedule_ctx rctx;
  memset(&rctx, 0, sizeof(rctx));

  lapic_timer_mask();

  intr_enable();

  uintptr_t r =
      func(caller, regs, &rctx, regs->rdi, regs->rsi, regs->rdx, regs->r10, regs->r8, regs->r9);

  intr_disable();

  lapic_timer_unmask();

  caller = proc_find_pid(caller_pid);

  if (caller != NULL) {
    caller->pdata.regs.rax = r;
  }

  bool do_thiscpu = false;
  for (size_t i = 0; i < lengthof(rctx.cpus); i++) {
    if (rctx.cpus[i] != NULL && rctx.cpus[i] != thiscpu)
      cpu_request_sched(rctx.cpus[i], true);
    else
      do_thiscpu = true;
  }

  if (do_thiscpu)
    cpu_request_sched(thiscpu, true);

  return r;
}

uintptr_t syscall_dispatch(void* stack_ptr) {
  load_kernel_cr3();

  biglock_lock();

  uintptr_t r = syscall_dispatch1(stack_ptr);

  biglock_unlock();

  return r;
}

We lock before entering a syscall and unlock on exit. Don’t worry, cpu_request_sched will unlock before it loses context due to switching. We also must enable interrupts during syscalls, so that device drivers can work properly. Other than that, not much has changed.

/* Handle incoming interrupt, dispatch IRQ handlers. */
void intr_handler(void* stack_ptr) {
  load_kernel_cr3();

  struct saved_regs* regs = stack_ptr;

  if (regs->trap <= 31) {
    intr_exception(regs);
  } else {
    bool user = false;

    if (regs->cs == (GDT_UCODE | 0x3)) {
      biglock_lock();

      user = true;

      struct proc* proc_current = thiscpu->proc_current;
      memcpy(&proc_current->pdata.regs, regs, sizeof(struct saved_regs));
      fx_save(proc_current->pdata.fx_env);
    }

    lapic_eoi();

    struct irq* irq = irq_find(regs->trap);

    if (irq == NULL) {
      if (user)
        biglock_unlock();
      return;
    }

    struct reschedule_ctx rctx;
    memset(&rctx, 0, sizeof(rctx));

    irq->func(irq->arg, stack_ptr, user, &rctx);

    if (user) {
      bool do_thiscpu = false;
      for (size_t i = 0; i < lengthof(rctx.cpus); i++) {
        if (rctx.cpus[i] != NULL && rctx.cpus[i] != thiscpu)
          cpu_request_sched(rctx.cpus[i], user);
        else
          do_thiscpu = true;
      }

      if (do_thiscpu)
        cpu_request_sched(thiscpu, user);

      biglock_unlock();
    }
  }
}

Now here things get a little more interesting. We lock as before, but only when the interrupt arrives while we were previously in user-space. This is because when in kernel-space we’re already under the protection of the Big Lock. Locking in that circumstance would lead to a deadlock, because we would try to retake the same (non-recursive) lock and block ourselves.

Conclusion

In conclusion, MOP3 will from now on use the Big Lock, because it’s easier to maintain as a solo developer and the penalty doesn’t affect the system, given its premises - being a hobby OS and lacking the sophisticated software stacks that would require better performance/control.

Hope you’ve learned something useful about big locks vs. fine-grained locks!

EDIT 1

Funnily enough, after some testing and playing around with the system, I’ve found that the big lock has actually improved performance significantly?!

Why?

I think it’s due to previously having a lot of spin locks, which equate to a huge number of atomic reads/writes. Such operations can hurt performance on x86, because the cache line holding the lock has to be bounced between cores. Having one big lock on entry and one big unlock on exit reduces the number of these operations. That’s just speculation - I don’t have any real numbers to back this up; it’s all based on the anecdotal feel of the system while using it and running test apps.