Deep Dive into Linux System Calls: From Legacy int 0x80 to Modern syscall

As we all know, system calls are the backbone of every Linux program.

In this blog, we’ll take a deep dive into Linux system calls:

  • We’ll start by exploring what system calls are and why they exist.
  • Then, we’ll uncover the internals of how a system call is invoked — from the old-school int 0x80 interrupt method to modern fast paths like syscall and sysenter, powered by Model-Specific Registers (MSRs).
  • We’ll look at how the kernel handles a call, step by step, from the moment a user process invokes syscall until the kernel executes the appropriate handler.
  • Finally, we’ll get our hands dirty by adding a custom system call to the Linux kernel, recompiling it, and testing it on QEMU with a BusyBox-based initramfs.

A system call is the controlled gateway through which an unprivileged program asks the Linux kernel to do something it cannot do directly (I/O, memory management, process control, etc.). In practice, most programs call libc functions; those wrappers may (a) do work in user space, (b) call fast vDSO functions that avoid a kernel transition for certain queries, or (c) issue an architecture‑specific trap instruction (syscall, sysenter, int 0x80, svc, ecall, …) that transfers control into a tiny kernel assembly entry stub. The stub validates state, switches stacks/privilege, dispatches via the per‑arch syscall table to the C implementation, runs security hooks (seccomp/LSM/audit), copies arguments between user and kernel space, performs the requested work, encodes success or -errno in a register, and executes a return path back to user space, where libc translates negative values into -1 and sets errno. Modern kernels also provide accelerators (vDSO, io_uring, restartable sequences, Syscall User Dispatch) to reduce the cost of full transitions when possible.

System Call Categories

Linux offers hundreds of system calls, mostly falling into:

  • Process Control: fork(), exec(), exit()
  • File Management: open(), read(), write(), close()
  • Device Management: ioctl()
  • Information Maintenance: getpid(), uname()
  • Communication: pipe(), socket()
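
To make the categories concrete, here is a small sketch that touches three of them: communication (pipe()), file management (write(), read(), close()), and information maintenance (uname()). The helper names pipe_roundtrip and running_on_linux are made up for this post.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Sends msg through a pipe and reads it back into out.
 * Returns the number of bytes read, or -1 on failure. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t out_size)
{
    int fds[2];
    size_t len = strlen(msg);
    ssize_t n = -1;

    if (pipe(fds) == -1)                          /* communication */
        return -1;
    if (write(fds[1], msg, len) == (ssize_t)len)  /* file management */
        n = read(fds[0], out, out_size);
    close(fds[0]);
    close(fds[1]);
    return n;
}

/* Information maintenance: returns 1 if uname() reports "Linux". */
int running_on_linux(void)
{
    struct utsname u;
    return uname(&u) == 0 && strcmp(u.sysname, "Linux") == 0;
}
```

Calling pipe_roundtrip("ping", buf, sizeof buf) from main() should hand back the same four bytes, each step being one trip into the kernel.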

System Call (vs a Library Call)

man 2 syscall

Applications rarely invoke the trap instruction directly; instead they call C library wrapper functions. Those wrappers set up registers, may translate arguments (e.g., open() wrapper calling openat() on newer glibc), and issue the actual kernel transition as needed.

Many standard library functions never enter the kernel (e.g., strcmp()), while others are thin wrappers over syscalls (getpid(), read()), and some are richer abstractions that bundle multiple syscalls and user‑space logic (fopen() wrapping open() + buffering).
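
A rough way to see the difference is to put a pure library call next to a thin wrapper and its raw equivalent. This is only a sketch; the three helper names are invented for illustration, and SYS_getpid comes from <sys/syscall.h>.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Pure user-space work: strcmp() never enters the kernel. */
int same_string(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Thin wrapper: getpid() ends up asking the kernel for the PID. */
pid_t pid_via_wrapper(void)
{
    return getpid();
}

/* Same request, bypassing the wrapper through the generic syscall(). */
pid_t pid_via_raw_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both PID helpers should return the same value. Running the program under strace would show the getpid calls but nothing for strcmp().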

Userspace Side Details

libc Wrappers Aren’t Always Thin

Wrappers sometimes change the actual syscall invoked for portability, feature detection, or security hardening. Examples:

  • exit() wrapper calling exit_group() to terminate all threads.
  • fork() wrapper calling clone() with specific flags.
  • Since glibc 2.26, the open() wrapper issues the openat() syscall rather than the kernel's open() on all architectures.

Because of such indirections, seccomp filters based only on the function name you called may miss the actual syscall issued; you must filter by the underlying syscall numbers that libc emits.

Direct Syscalls From C

When no wrapper exists (new syscall, custom build, or early testing), you can use the generic syscall() function from glibc, passing the numeric SYS_xxx (or __NR_xxx) constant and the arguments. This bypasses higher‑level wrapper behavior. Per the man page, the return value is whatever the invoked system call defines: in general 0 indicates success, and -1 indicates an error with errno set; ENOSYS means the syscall is not implemented.
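
Here is a minimal sketch of that error convention. The helper errno_of_direct_syscall and the constant NR_UNASSIGNED are hypothetical; the number 100000 is simply assumed to be unassigned on the running kernel.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical syscall number, assumed to be unassigned on this kernel. */
#define NR_UNASSIGNED 100000L

/* Issues syscall(nr, arg) and returns the errno it left behind,
 * or 0 if the call did not report an error. */
int errno_of_direct_syscall(long nr, long arg)
{
    errno = 0;
    long ret = syscall(nr, arg);
    return ret == -1 ? errno : 0;
}
```

Closing an invalid descriptor with errno_of_direct_syscall(SYS_close, -1) should yield EBADF. On a plain kernel, errno_of_direct_syscall(NR_UNASSIGNED, 0) is expected to yield ENOSYS, though a sandbox or seccomp policy may report a different error.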

syscall

Each system call has a unique number (e.g., __NR_write for the write system call).

These numbers are defined in <sys/syscall.h>. In this example, we will use the syscall function to invoke the write system call directly. The program writes the string "Hello World!\n" to standard output.

#include <sys/syscall.h> // For system call numbers (e.g., __NR_write)
#include <unistd.h>      // For syscall() function
#include <string.h>      // For strlen() function

int main(void) {
    const char *str = "Hello World!\n"; // String to write
    size_t len = strlen(str);           // Length of the string

    // Make the write system call on fd 1 (stdout); -1 signals an error
    if (syscall(__NR_write, 1, str, len) == -1)
        return 1;

    return 0;
}

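For the syscall calling convention (which registers carry the syscall number, the arguments, and the return value on each architecture), refer to the "Architecture calling conventions" section of man 2 syscall.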
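
To see the x86-64 convention without libc in the way, we can issue the syscall instruction ourselves with GCC/Clang inline assembly. Treat this as a sketch: raw_write is a name made up for this post, and the non-x86-64 branch is only a fallback so the file still builds elsewhere.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issues write(fd, buf, len) and returns the raw kernel result:
 * a byte count on success, or a negative errno value on failure. */
long raw_write(int fd, const void *buf, size_t len)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                 /* result comes back in %rax */
        : "a"((long)SYS_write),     /* syscall number goes in %rax */
          "D"((long)fd),            /* argument 1 in %rdi */
          "S"(buf),                 /* argument 2 in %rsi */
          "d"(len)                  /* argument 3 in %rdx */
        : "rcx", "r11", "memory"    /* syscall overwrites %rcx and %r11 */
    );
    return ret;
#else
    /* Fallback for other architectures: let libc pick the registers,
     * then convert its -1/errno result back to the raw form. */
    long ret = syscall(SYS_write, fd, buf, len);
    return ret == -1 ? -errno : ret;
#endif
}
```

Note that the raw result is not translated: writing to an invalid descriptor returns -EBADF directly instead of -1 with errno set.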

Legacy Entry Mechanisms vs Modern Fast Paths

int $0x80 (Software Interrupt) — The Old Way

Early Linux on x86 used a software interrupt vector (INT 0x80). This goes through the full interrupt descriptor table (IDT) microcode path and is relatively slow: state saves, privilege switches, etc. It remains for backward compatibility (32‑bit code, some 64‑bit compat cases) but is discouraged for performance.

sysenter / sysexit (Intel Fast System Call, 32‑bit)

Intel introduced sysenter to cut overhead (bypassing some IDT work) but required per‑CPU MSR configuration (EIP/ESP/CS). Linux wires this for 32‑bit fast paths; compatibility layers map userspace transitions accordingly.

syscall / sysret (AMD64 Fast Path, Used on x86‑64)

AMD’s syscall/sysret pair (adopted by Intel in long mode) is the standard on 64‑bit Linux. The kernel sets MSR_LSTAR to the entry point (entry_SYSCALL_64), uses MSR_STAR and MSR_SYSCALL_MASK to define segment and flags behavior, and handles the fragile swapgs dance for per‑CPU data. This path is faster than int 0x80, but still not free.

vDSO & (Obsolete) vsyscall

To avoid any trap for high‑frequency trivial queries (time, CPU ID), Linux maps a small shared object—the vDSO—into every process. libc can call functions there as normal user‑space calls; the kernel keeps the data page updated. The older vsyscall page hard‑wired a few fixed addresses but was limited (only a handful of calls) and created security issues (fixed address gadgets); it has been largely superseded by vDSO and may be emulated or disabled.

Inside the Kernel Syscall Entry Path (x86‑64 Walkthrough)

Let’s zoom into what happens in entry_SYSCALL_64 (simplified):

  1. swapgs to switch GS base from user to kernel per‑CPU area.
  2. Switch to kernel stack; save user %rsp.
  3. Push user segments, flags, RIP, and registers to build a pt_regs frame.
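  4. Validate syscall number vs __NR_syscall_max; if valid, move %r10 into %rcx to match the C calling convention (the syscall instruction overwrites %rcx with the return address, so user space passes the fourth argument in %r10) and call through sys_call_table.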
  5. On return, store %rax as result; call trace exit hooks; restore registers.
  6. swapgs back and sysretq to user RIP.

The entry code must be extremely careful with speculative execution mitigations, stack alignment, and nested interrupts; any mistake is a security bug or crash.

Role of MSRs in System Calls

When a user program executes syscall or sysenter, the CPU needs to know:

  • Where in the kernel to jump (entry address)
  • What code segment to switch to
  • Which stack and privilege level to use

This information is configured in MSRs by the Linux kernel during boot. For example:

Key MSRs on x86-64 for syscall

  • MSR_LSTAR : Contains the 64-bit address of the entry point for syscall instructions (e.g., entry_SYSCALL_64 in Linux). When user mode executes syscall, the CPU jumps here.
  • MSR_STAR : Stores segment selector values for switching code segments between user mode and kernel mode.
  • MSR_SYSCALL_MASK : A bitmask of flags (in RFLAGS) that are cleared on syscall entry to disable certain user-level flags (like interrupts or traps).

When user space executes the syscall instruction (for a write, say), the CPU:

  • Reads MSR_LSTAR → gets the address of entry_SYSCALL_64().
  • Switches to ring 0 (kernel mode).
  • Changes CS/SS selectors per MSR_STAR.
  • Masks certain flags (e.g., disables interrupts).
  • Jumps to the kernel address in MSR_LSTAR.

The value in MSR_LSTAR is the start of a small assembly routine (entry_SYSCALL_64 in arch/x86/entry/entry_64.S).

This routine:

  1. Performs the swapgs trick: switches the GS base from the user-space value to the kernel’s per-CPU data region.
  2. Saves the user stack pointer and switches to the kernel stack (defined in the per-CPU TSS).
  3. Pushes the user registers to build a pt_regs structure on the kernel stack, so the kernel doesn’t clobber them.
  4. Keeps interrupts disabled while doing so (they were already masked via MSR_SYSCALL_MASK); they are re-enabled once the entry work is done.

It then dispatches the call:

  • The syscall number (in %rax) is extracted.
  • It is compared against NR_syscalls (one more than the highest valid syscall number).
    • If it’s invalid → return -ENOSYS.
  • If valid, the address of the C syscall handler is fetched from the sys_call_table[].

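For write (syscall number 1 on x86-64), sys_call_table[1] points to the write handler: sys_write, which recent kernels emit under the name __x64_sys_write.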
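
The bounds check plus table lookup can be modeled in a few lines of ordinary C. This is a toy user-space model, not kernel code: fake_regs, demo_table, and demo_dispatch are invented names, and the two handlers only pretend to do work.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the register frame saved by the entry code. */
struct fake_regs { long ax, di, si, dx; };

typedef long (*handler_t)(const struct fake_regs *);

/* Toy handlers: "read" returns 0 bytes, "write" claims all bytes written. */
static long demo_read(const struct fake_regs *r)  { (void)r; return 0; }
static long demo_write(const struct fake_regs *r) { return r->dx; }

/* Index = syscall number, value = handler, like sys_call_table[]. */
static const handler_t demo_table[] = { demo_read, demo_write };
#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/* Looks up the handler for regs->ax and stores the result back in ax. */
long demo_dispatch(struct fake_regs *regs)
{
    unsigned long nr = (unsigned long)regs->ax;

    if (nr >= DEMO_NR_SYSCALLS)
        regs->ax = -ENOSYS;             /* out of range: not implemented */
    else
        regs->ax = demo_table[nr](regs);
    return regs->ax;
}
```

Treating the number as unsigned means a negative %rax value also fails the range check and ends up as -ENOSYS.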

Transition to C Code

  • The kernel uses a small stub (do_syscall_64) to convert from assembly to C calling convention.
  • Up to six arguments (passed by user space in %rdi, %rsi, %rdx, %r10, %r8, %r9) are made available to the handler.
  • The actual sys_write function (in fs/read_write.c) is invoked.
  • Control returns to do_syscall_64, which places the return value into %rax.
  • If it’s an error (negative code in the range -4095..-1), the libc wrapper translates it into -1 and sets errno.
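
The last step, the translation from a raw kernel result to the familiar -1/errno pair, is roughly the following check. This is a sketch of the idea rather than glibc's actual code; finish_syscall and MAX_ERRNO are names chosen for this post.

```c
#include <errno.h>

/* Raw results in the range -4095..-1 are treated as error codes. */
#define MAX_ERRNO 4095

/* Converts a raw kernel return value into the libc convention:
 * the value itself on success, or -1 with errno set on failure. */
long finish_syscall(long raw)
{
    if ((unsigned long)raw >= (unsigned long)-MAX_ERRNO) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

The unsigned comparison keeps large values that merely look negative, such as high addresses returned by mmap(), from being mistaken for errors. Only the top 4095 values count as errno codes.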

The assembly exit path:

  • Restores all saved registers from the pt_regs frame.
  • swapgs again (switch back to user GS).
  • Executes sysretq, which:
    • Restores %rip (user instruction pointer).
    • Returns to user mode with the proper flags and stack.

SUMMARY

[User executes syscall]
         |
         v
CPU jumps to MSR_LSTAR (entry_SYSCALL_64)
         |
         v
[Save regs] -> [swapgs] -> [switch to kernel stack]
         |
         v
   do_syscall_64()
         |
         +--> lookup sys_call_table[%rax]
         |
         +--> call sys_write(fd, buf, len)
                   |
                   +--> vfs_write()
                   +--> copy_from_user()
                   |
         +--> return value in %rax
         |
         v
[restore regs] -> [swapgs] -> sysretq
         |
         v
Back to user-space code

Let’s add a custom syscall to our kernel source tree.

NOTE

We will be using the same kernel that we built in the previous blog.

1. Create a New Syscall Implementation

We’ll create mysyscall.c in the kernel/ directory.

cd /home/fury/Desktop/Blog/Kernel_Lab/linux-5.11.4
cd kernel

Add the following code:

// kernel/mysyscall.c
#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h> // for strncpy_from_user

SYSCALL_DEFINE1(mysyscall, const char __user *, user_msg)
{
    char buf[128];
    long len;

    // Copy at most 127 bytes, stopping at the string's NUL terminator
    len = strncpy_from_user(buf, user_msg, sizeof(buf) - 1);
    if (len < 0)
        return -EFAULT;

    buf[len] = '\0';  // ensure null termination even if the string was truncated
    pr_info("mysyscall: %s\n", buf);
    return 0;
}

2. Add It to the Kernel Build

Open the kernel/Makefile:

nano Makefile

Find the list of obj-y files and add:

obj-y += mysyscall.o

3. Add the Syscall to the Table

On x86-64, syscalls are listed in:

arch/x86/entry/syscalls/syscall_64.tbl

Edit it:

nano ../arch/x86/entry/syscalls/syscall_64.tbl

Add a new line at the end:

# This is the end of the legacy x32 range.  Numbers 548 and above are
# not special and are not to be used for x32-specific syscalls.
548    common   mysyscall      sys_mysyscall

(Use 548 or the next free number; check the last entry number first.)

4. Build Kernel

Rebuild:

cd ..
make -j$(nproc)

After this, you’ll have a new arch/x86/boot/bzImage.

5. Create a Test Program

Outside the kernel source tree, create test_mysyscall.c:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>

#ifndef __NR_mysyscall
#define __NR_mysyscall 548 // must match the number added to syscall_64.tbl
#endif

int main(void) {
    const char *msg = "Hello from user space!";
    long res = syscall(__NR_mysyscall, msg);
    if (res == 0) {
        printf("mysyscall succeeded!\n");
        return 0;
    }
    perror("mysyscall failed"); // e.g. ENOSYS on a kernel without the new syscall
    return 1;
}

Compile it:

gcc test_mysyscall.c -static -o test_mysyscall

6. Add to Initramfs

Copy your test_mysyscall binary to your BusyBox initramfs root directory and rebuild the initramfs.cpio.gz:

cp test_mysyscall busybox-1.36.1/_install/
cd busybox-1.36.1/_install
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../initramfs.cpio.gz

7. Run in QEMU

Boot the new kernel:

qemu-system-x86_64 -kernel ./linux-5.11.4/arch/x86/boot/bzImage -initrd ./busybox-1.36.1/initramfs.cpio.gz -append "root=/dev/ram rw console=ttyS0 quiet" -nographic

Run your program:

./test_mysyscall
dmesg | tail

You should see:

[    6.507706] mysyscall: Hello from user space!
This post is licensed under CC BY 4.0 by the author.