30 January 2026

Hello!

In this blog post I would like to share my thoughts and some implementation details about the mutex mechanism in MOP3!

WAIT! MOP3?

Yes! MOP2 is undergoing a huge rewrite and redesign right now, because I wasn’t happy with some of the system’s design choices - mostly around writing SMP-first/SMP-friendly code. The rewrite turned out to be so big that it was easier to just start a new project while reusing big portions of the older MOP2 code - and thus MOP3 was born. Expect more interesting updates in the future!

Multitasking is delicate

Preface

This portion of the article will explain critical sections and why they’re so fragile. If you’re an advanced reader, you can easily skip it.

Writing multitasked/parallel code is notoriously difficult and can lead to bizarre, awful and nondeterministic bugs. One should keep in mind that multitasked code is not something debuggable with normal developer tools like gdb or lldb or some other crash analyzers and code inspectors. Instead, such code should be proven to be safe/correct, either by carefully reading the code or by writing formal proofs.

Formal proofs can be written in the Promela modeling language and checked with the SPIN verifier: https://spinroot.com/spin/whatispin.html

The problem

Imagine this problem in a multitasked application:

A disaster happens here
struct my_list_node {
  struct my_list_node* next, *prev;
  int value;
};

/* These are complex operations on multiple pointers */
#define LIST_ADD(list, x) ...
#define LIST_REMOVE(list, x) ...

struct my_list_node* my_list = ...;

void thread_A (void) {
  struct my_list_node* new_node = malloc (sizeof (*new_node));
  new_node->value = 67;

  LIST_ADD (my_list, new_node);
}

void thread_B (void) {
  LIST_REMOVE (my_list, my_list); /* remove head */
}

The issue here is that, while thread_A is adding a new node, thread_B can try to remove it. This will lead to pointer corruption, since LIST_REMOVE and LIST_ADD will try to re-link the list’s nodes at the same time.

Such fragile sections of code are called "critical sections", meaning that one has to be VERY careful about when tasks enter them. So where are they in the code?

void thread_A (void) {
  /* NOT CRITICAL - new_node is not yet visible to other tasks */
  struct my_list_node* new_node = malloc (sizeof (*new_node));
  new_node->value = 67;

  /* CRITICAL */
  LIST_ADD (my_list, new_node);
}

void thread_B (void) {
  /* CRITICAL */
  LIST_REMOVE (my_list, my_list); /* remove head */
}

So how do we solve this?

This specific problem can be easily solved by what’s called "Mutual Exclusion", or "MutEx" for short.

Why "mutual exclusion"? As you can see, to modify the list, a task needs exclusive access to it, ie. nobody else can be using the list at the same time. Kind of like a bathroom - imagine someone happily walks in while you’re sitting on a toilet and tries to fit right beside you.

A poor man’s mutex and atomicity

Let’s park it here and move to a slightly different, but related topic - atomicity. What is it?

To be atomic is to be indivisible. An atomic read or write is an operation that can’t be split up and has to be completed from start to finish without being interrupted.

While atomic operations are mutually exclusive, they work at the level of individual reads/writes of integers of various sizes. One can atomically read/write a byte or a uint64_t. This of course won’t solve our problem, because we’re operating on many integers (here pointers) simultaneously - struct my_list_node’s prev and next pointers - but despite that, we can leverage atomicity to implement higher-level synchronization constructs.

BTW, documentation on atomic operations in C can be found here: https://en.cppreference.com/w/c/atomic.html. The functions/macros are implemented internally by the compiler, e.g. GCC atomics: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html. They have to be implemented internally, because they’re very platform-dependent and each architecture may handle atomicity slightly differently, although the same rules apply everywhere.
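
Just to make atomics concrete before we use them, here’s a tiny standalone sketch (a hypothetical shared counter, not MOP3 code):

#include <stdatomic.h>

/* A plain `counter++` compiles to a read-modify-write that two tasks can
 * interleave, losing increments. The atomic version is indivisible. */
atomic_int counter = 0;

void worker (void) {
  /* atomically: read counter, add 1, write it back - as one operation */
  atomic_fetch_add_explicit (&counter, 1, memory_order_relaxed);
}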

Now that we’ve discussed C atomics, let’s implement the poor man’s mutex!

What needs to be done? We’ve established that only one task can be using the list at a time. We’d probably want some sort of flag to tell us whether it’s our turn to use the list. If the flag is set, someone is currently using the list and we have to wait. If it’s clear, we’re good to go.

Obviously, the flag itself has to be atomic. Otherwise one task could read the flag while it’s being updated, leading to a thread entering the critical section when it’s not its turn.

#include <stdatomic.h>

#define SPIN_LOCK_INIT ATOMIC_FLAG_INIT

typedef atomic_flag spin_lock_t;

void spin_lock (spin_lock_t* sl) {
  while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
    ;
}

void spin_unlock (spin_lock_t* sl) {
  atomic_flag_clear_explicit (sl, memory_order_release);
}

// List same as before...

// the lock
spin_lock_t the_lock = SPIN_LOCK_INIT;

void thread_A (void) {
  struct my_list_node* new_node = malloc (sizeof (*new_node));
  new_node->value = 67;

  spin_lock (&the_lock);
  LIST_ADD (my_list, new_node);
  spin_unlock (&the_lock);
}

void thread_B (void) {
  spin_lock (&the_lock);
  LIST_REMOVE (my_list, my_list); /* remove head */
  spin_unlock (&the_lock);
}

As you can see, if a task wants to modify the list, it must first check whether it can do so, and if not, wait until the operation becomes possible. Pretty easy, eh?

There are some big and small improvements to be made, so let’s pick up a magnifying glass and analyze!

Small inefficiency

When we fail to lock the spin lock, we’re stuck in a spin-loop. This is really bad for performance, because we’re just eating up CPU cycles for nothing. A better spin lock would "relax the CPU", i.e. tell the processor that it’s stuck in a spin-loop, letting it consume less power or let the other hardware thread run.
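
On x86, that hint is the pause instruction (other architectures have their own equivalents, e.g. yield on ARM). A relaxed version of our spin_lock could look like this (the inline asm is of course x86-specific):

void spin_lock (spin_lock_t* sl) {
  while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
    __asm__ volatile ("pause"); /* x86: "I'm spin-waiting, relax" */
}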

Big inefficiency

Despite the pause or yield instructions and such, our spin lock still has an underlying flaw: when we fail to acquire the lock, we just spin, when we could instead hand control over to another task - we can’t have access ourselves, so let someone else take the precious CPU time and get a little bit of work done.
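
To illustrate the idea, here’s a userspace sketch assuming a POSIX-style sched_yield and the same spin_lock_t as above (this is NOT how MOP3 solves it, as we’ll see below):

#include <sched.h> /* sched_yield () - POSIX */

void yield_lock (spin_lock_t* sl) {
  while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
    sched_yield (); /* can't make progress - give the CPU to someone else */
}

This still retries blindly, though - ideally a task would sleep until the lock is actually released, which is exactly what we’ll build next.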

Despite this "inefficiency", spin locks are the most primitive-yet-useful way of locking. This example shows a simple solution to a simple problem, but in the real world one would use spin locks to implement higher-level/smarter ways of providing task synchronization.

The better mutex and its building blocks

So if spinning in place is "bad" for performance, what do we do then? Here come the "suspension queues".

Scheduler states

If we fail to acquire a mutex, it means that we’re stuck waiting for it. If we’re stuck, we shouldn’t be scheduled, i.e. given CPU time (there’s no work to be done). To denote this, we have to introduce another process state:

/* process states */
#define PROC_READY     0
#define PROC_DEAD      1
#define PROC_SUSPENDED 2 /* NEW */

This new state describes a new situation in our scheduler - a process IS alive, but CANNOT be scheduled. Before, we only had alive (schedulable) processes and dead (unschedulable) processes.

Now our scheduler won’t pick up suspended processes:

// ...

static struct proc* proc_find_sched (struct cpu* cpu) {
  if (!cpu->proc_run_q)
    return NULL;

  struct list_node_link *current, *start;

  /* start scanning right after the currently running process (round-robin) */
  if (cpu->proc_current)
    current = cpu->proc_current->cpu_run_q_link.next;
  else
    current = cpu->proc_run_q;

  /* wrap around to the head of the run queue */
  if (!current)
    current = cpu->proc_run_q;

  start = current;

  /* walk the run queue once, skipping everything that isn't PROC_READY
   * (most importantly: suspended processes) */
  do {
    struct proc* proc = list_entry (current, struct proc, cpu_run_q_link);

    if (atomic_load (&proc->state) == PROC_READY)
      return proc;

    current = current->next ? current->next : cpu->proc_run_q;
  } while (current != start);

  /* nothing schedulable on this CPU */
  return NULL;
}

// ...

void proc_sched (void) {
  spin_lock_ctx_t ctxcpu;

  /* every SCHED_REAP_FREQ-th scheduling cycle, reap dead processes */
  int s_cycles = atomic_fetch_add (&sched_cycles, 1);

  if (s_cycles % SCHED_REAP_FREQ == 0)
    proc_reap ();

  struct proc* next = NULL;
  struct cpu* cpu = thiscpu;

  spin_lock (&cpu->lock, &ctxcpu);

  /* pick the next PROC_READY process on this CPU, if any */
  next = proc_find_sched (cpu);

  if (next) {
    cpu->proc_current = next;

    /* switch execution to the chosen process */
    do_sched (next, &cpu->lock, &ctxcpu);
  } else {
    /* nothing runnable - go idle */
    cpu->proc_current = NULL;
    spin_unlock (&cpu->lock, &ctxcpu);

    spin ();
  }
}

Suspension queues

A suspension queue is a little tricky - it’s not just a list of suspended processes. We need to establish a many:many relationship (kinda like in databases). A process can be a member of many suspension queues (imagine a process that waits on many mutexes), and a single suspension queue has many member processes. In order to achieve that, we can do some pointer trickery!

struct proc {
  int pid;
  struct rb_node_link proc_tree_link;
  struct rb_node_link procgroup_memb_tree_link;
  struct list_node_link cpu_run_q_link;
  struct list_node_link reap_link;
  struct list_node_link* sq_entries; /* list of intermediate structs */
  struct procgroup* procgroup;
  struct proc_platformdata pdata;
  uint32_t flags;
  spin_lock_t lock;
  struct cpu* cpu;
  atomic_int state;
  uintptr_t uvaddr_argument;
};

struct proc_suspension_q {
  struct list_node_link* proc_list; /* list of processes via proc_sq_entry */
  spin_lock_t lock;
};

/* the intermediate struct */
struct proc_sq_entry {
  struct list_node_link sq_link; /* list link */
  struct list_node_link proc_link; /* list link */
  struct proc* proc; /* pointer back to the process */
  struct proc_suspension_q* sq; /* pointer back to the suspension queue */
};

Now we can talk about the suspend/resume algorithms.

bool proc_sq_suspend (struct proc* proc, struct proc_suspension_q* sq, spin_lock_t* resource_lock,
                      spin_lock_ctx_t* ctxrl) {
  spin_lock_ctx_t ctxpr, ctxcpu, ctxsq;
  struct cpu* cpu = proc->cpu;

  /* allocate intermediate struct */
  struct proc_sq_entry* sq_entry = malloc (sizeof (*sq_entry));
  if (!sq_entry) {
    spin_unlock (resource_lock, ctxrl);
    return PROC_NO_RESCHEDULE;
  }

  /* set links on both ends */
  sq_entry->proc = proc;
  sq_entry->sq = sq;

  /* lock in compliance with the lock hierarchy */
  spin_lock (&cpu->lock, &ctxcpu);
  spin_lock (&proc->lock, &ctxpr);
  spin_lock (&sq->lock, &ctxsq);

  spin_unlock (resource_lock, ctxrl);

  /* transition the state to PROC_SUSPENDED */
  atomic_store (&proc->state, PROC_SUSPENDED);

  /* append to sq's list */
  list_append (sq->proc_list, &sq_entry->sq_link);

  /* append to proc's list */
  list_append (proc->sq_entries, &sq_entry->proc_link);

  /* remove from CPU's run list and decrement CPU load counter */
  list_remove (cpu->proc_run_q, &proc->cpu_run_q_link);
  atomic_fetch_sub (&cpu->proc_run_q_count, 1);

  if (cpu->proc_current == proc)
    cpu->proc_current = NULL;

  /* a process is now unschedulable, so it doesn't need a CPU */
  proc->cpu = NULL;

  /* release the locks */
  spin_unlock (&sq->lock, &ctxsq);
  spin_unlock (&proc->lock, &ctxpr);
  spin_unlock (&cpu->lock, &ctxcpu);

  return PROC_NEED_RESCHEDULE;
}

bool proc_sq_resume (struct proc* proc, struct proc_sq_entry* sq_entry) {
  spin_lock_ctx_t ctxsq, ctxpr, ctxcpu;
  struct cpu* cpu = cpu_find_lightest (); /* find the least-loaded CPU */
  struct proc_suspension_q* sq = sq_entry->sq;

  /* acquire necessary locks */
  spin_lock (&cpu->lock, &ctxcpu);
  spin_lock (&proc->lock, &ctxpr);
  spin_lock (&sq->lock, &ctxsq);

  /* remove from sq's list */
  list_remove (sq->proc_list, &sq_entry->sq_link);

  /* remove from proc's list */
  list_remove (proc->sq_entries, &sq_entry->proc_link);

  /* Give the CPU to the process */
  proc->cpu = cpu;

  /*
   * update process state to PROC_READY, but only if it's not waiting inside
   * another suspension queue.
   */
  if (proc->sq_entries == NULL)
    atomic_store (&proc->state, PROC_READY);

  /* attach to CPU's run list and increment load counter */
  list_append (cpu->proc_run_q, &proc->cpu_run_q_link);
  atomic_fetch_add (&cpu->proc_run_q_count, 1);

  /* unlock */
  spin_unlock (&sq->lock, &ctxsq);
  spin_unlock (&proc->lock, &ctxpr);
  spin_unlock (&cpu->lock, &ctxcpu);

  /* intermediate struct is no longer needed, so free it */
  free (sq_entry);

  return PROC_NEED_RESCHEDULE;
}

As you can see, this algorithm is actually quite simple - it’s just a matter of safely appending/removing links to/from the right lists.

The mutex itself

The mutex is quite simple too.

struct proc_mutex {
  struct proc_resource* resource; /* pointer back to the resource */

  bool locked; /* taken or free */
  struct proc_suspension_q suspension_q; /* suspension queue of waiting processes */
  struct proc* owner; /* current owner of the mutex / process, which can enter the critical section */
};

And now the lock and unlock algorithms.

bool proc_mutex_lock (struct proc* proc, struct proc_mutex* mutex) {
  spin_lock_ctx_t ctxmt;

  spin_lock (&mutex->resource->lock, &ctxmt);

  /* we can enter if: the mutex is NOT locked or we're already the owner of the mutex */
  if (!mutex->locked || mutex->owner == proc) {
    mutex->locked = true;
    mutex->owner = proc;
    spin_unlock (&mutex->resource->lock, &ctxmt);
    return PROC_NO_RESCHEDULE;
  }

  /* we failed to become an owner, thus we have to free up the CPU and suspend ourselves */
  return proc_sq_suspend (proc, &mutex->suspension_q, &mutex->resource->lock, &ctxmt);
}

bool proc_mutex_unlock (struct proc* proc, struct proc_mutex* mutex) {
  spin_lock_ctx_t ctxmt, ctxsq;

  spin_lock (&mutex->resource->lock, &ctxmt);

  /*
   * we can't unlock a mutex that we don't own,
   * otherwise we could just steal someone else's mutex.
   */
  if (mutex->owner != proc) {
    spin_unlock (&mutex->resource->lock, &ctxmt);
    return PROC_NO_RESCHEDULE;
  }

  spin_lock (&mutex->suspension_q.lock, &ctxsq);

  struct list_node_link* node = mutex->suspension_q.proc_list;

  /*
   * This is a little tricky. What we're doing here is "direct handoff" - we basically
   * transfer ownership, while keeping the mutex locked, instead of just unlocking and
   * resuming everyone.
   */

  if (node) {
    struct proc_sq_entry* sq_entry = list_entry (node, struct proc_sq_entry, sq_link);
    struct proc* resumed_proc = sq_entry->proc;
    /* transfer ownership */
    mutex->owner = resumed_proc;
    mutex->locked = true;

    spin_unlock (&mutex->suspension_q.lock, &ctxsq);
    spin_unlock (&mutex->resource->lock, &ctxmt);

    /* resume new owner - it can enter the critical section now */
    return proc_sq_resume (resumed_proc, sq_entry);
  }

  /* no possible future owner found, so we just unlock */
  mutex->locked = false;
  mutex->owner = NULL;

  spin_unlock (&mutex->suspension_q.lock, &ctxsq);
  spin_unlock (&mutex->resource->lock, &ctxmt);

  return PROC_NEED_RESCHEDULE;
}
Direct handoff and "The Thundering Herd"

(Image: a thundering herd of bison)

Why the direct handoff? Why not just unlock the mutex?

The thundering herd problem is very common in multitasking and synchronization.

We free the mutex, wake up all waiters, and now anyone can take it. The issue here is that ONLY ONE task will eventually get to acquire the mutex and enter the critical section. If only one gets to take the mutex, why bother waking up everyone? The solution is to pass ownership down to the first encountered waiter, then the next one, and so on. Waking everyone would also cause a performance penalty, because we’d be wasting CPU cycles just to be told not to execute any further.
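
For contrast, the naive "wake everyone" unlock would look roughly like this (a heavily simplified sketch against the structures above - locking elided for brevity, do not do this):

/* NAIVE unlock tail - what proc_mutex_unlock would do WITHOUT direct handoff */
static void mutex_wake_all (struct proc_mutex* mutex) {
  mutex->locked = false;
  mutex->owner = NULL;

  struct list_node_link* node = mutex->suspension_q.proc_list;
  while (node) {
    struct proc_sq_entry* sq_entry = list_entry (node, struct proc_sq_entry, sq_link);
    node = node->next; /* grab next before resume frees sq_entry */
    /* every resumed process re-runs proc_mutex_lock; all but one lose the
     * race and suspend again - pure wasted wakeups */
    proc_sq_resume (sq_entry->proc, sq_entry);
  }
}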

Two birds, one stone

Via direct handoff, we’ve also just eliminated a fairness problem.

What does it mean for a mutex to be "fair"?

Fairness in synchronization means that each task at some point will get a chance to work in the critical section.

Fairness example problem:

#include <stdatomic.h>

#define SPIN_LOCK_INIT ATOMIC_FLAG_INIT

typedef atomic_flag spin_lock_t;

void spin_lock (spin_lock_t* sl) {
  while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
    __asm__ volatile ("pause");
}

void spin_unlock (spin_lock_t* sl) {
  atomic_flag_clear_explicit (sl, memory_order_release);
}

// the lock
spin_lock_t the_lock = SPIN_LOCK_INIT;

void thread_A (void) {
  for (;;) {
    spin_lock (&the_lock);
    {
      /* critical section goes here */
    }
    spin_unlock (&the_lock);
  }
}

void thread_B (void) {
  for (;;) {
    spin_lock (&the_lock);
    {
      /* critical section goes here */
    }
    spin_unlock (&the_lock);
  }
}

Let’s take thread_B for example. thread_B locks, does its thing and unlocks. Because it’s in an infinite loop, there’s a chance that it may re-lock too quickly, before thread_A gets a chance. This causes a fairness issue, because thread_A is getting starved - it can’t enter the critical section, even though it plays by the rules, synchronizing properly with thread_B via the spin lock. On real CPUs (i.e. real hardware or QEMU with -enable-kvm), there will always be some tiny delays/"clog-ups" that mitigate this problem, but in theory a task can starve.

Our algorithm solves this by always selecting the next owner of the mutex ourselves, so we never hit the situation where, on unlock, the new owner is whoever happens to grab the mutex first, i.e. effectively random. This makes the approach more deterministic/predictable.
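
By the way, fairness can be retrofitted onto plain spin locks too - the classic construct is the "ticket lock", where each waiter takes a number and waiters are served strictly in order. A minimal sketch (not MOP3 code; the pause is again x86-specific):

#include <stdatomic.h>

typedef struct {
  atomic_uint next_ticket; /* the number handed to the next arriving waiter */
  atomic_uint now_serving; /* the number currently allowed in */
} ticket_lock_t; /* zero-initialize: ticket_lock_t tl = {0}; */

void ticket_lock_acquire (ticket_lock_t* tl) {
  /* take a number... */
  unsigned ticket = atomic_fetch_add_explicit (&tl->next_ticket, 1, memory_order_relaxed);

  /* ...and spin until it's called - strict FIFO order, nobody can starve */
  while (atomic_load_explicit (&tl->now_serving, memory_order_acquire) != ticket)
    __asm__ volatile ("pause");
}

void ticket_lock_release (ticket_lock_t* tl) {
  atomic_fetch_add_explicit (&tl->now_serving, 1, memory_order_release);
}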

Final result

This is the final result of our work!

#include <limits.h>
#include <proc/local.h>
#include <proc/proc.h>
#include <stddef.h>
#include <stdint.h>
#include <string/string.h>

/* ID of the mutex resource */
#define MUTEX 2000

/* TLS (thread-local storage) support was added recently ;)! */
LOCAL volatile char letter = 'x';

void app_proc (void) {
  char arg_letter = (char)(uintptr_t)argument_ptr ();

  letter = arg_letter;

  for (;;) {
    /* enter the critical section, print the _letter_ 3 times and leave */

    mutex_lock (MUTEX);

    for (int i = 0; i < 3; i++)
      test (letter);

    mutex_unlock (MUTEX);
  }

  process_quit ();
}

void app_main (void) {
  mutex_create (MUTEX);

  letter = 'a';

  process_spawn (&app_proc, (void*)'b');
  process_spawn (&app_proc, (void*)'c');
  process_spawn (&app_proc, (void*)'d');

  for (;;) {
    mutex_lock (MUTEX);

    for (int i = 0; i < 3; i++)
      test (letter);

    mutex_unlock (MUTEX);
  }
}