Deep Dive into Linux System Calls: From Legacy int 0x80 to Modern syscall
As we all know, system calls are the backbone of every Linux program.
In this blog, we’ll take a deep dive into Linux system calls:
- We’ll start by exploring what system calls are and why they exist.
- Then, we’ll uncover the internals of how a system call is invoked, from the old-school int 0x80 interrupt method to modern fast paths like syscall and sysenter, powered by Model-Specific Registers (MSRs).
- We’ll look at how the kernel handles a call, step by step, from the moment a user process invokes syscall until the kernel executes the appropriate handler.
- Finally, we’ll get our hands dirty by adding a custom system call to the Linux kernel, recompiling it, and testing it on QEMU with a BusyBox-based initramfs.
A system call is the controlled gateway through which an unprivileged program asks the Linux kernel to do something it cannot do directly (I/O, memory management, process control, etc.). In practice, most programs call libc functions; those wrappers may (a) do work in user space, (b) call fast vDSO functions that avoid a kernel transition for certain queries, or (c) issue an architecture‑specific trap instruction (syscall, sysenter, int 0x80, svc, ecall, …) that transfers control into a tiny kernel assembly entry stub. The stub validates state, switches stacks/privilege, dispatches via the per‑arch syscall table to the C implementation, runs security hooks (seccomp/LSM/audit), copies arguments between user and kernel space, performs the requested work, encodes success or -errno in a register, and executes a return path back to user space, where libc translates negative values into -1 and sets errno. Modern kernels also provide accelerators (vDSO, io_uring, restartable sequences, Syscall User Dispatch) to reduce the cost of full transitions when possible.
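The errno convention described above can be observed from user space. Here is a minimal sketch, assuming a Linux system with a standard libc; the helper name failing_close_errno is made up for illustration.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: issue a syscall that is expected to fail and
 * report what the libc wrapper leaves behind in errno. */
int failing_close_errno(void)
{
    errno = 0;
    /* -1 is never a valid file descriptor, so the kernel rejects the
     * call; the close() wrapper returns -1 and stores the error code
     * in errno. */
    if (close(-1) == -1)
        return errno;
    return 0;
}
```

Calling failing_close_errno() on Linux should return EBADF: the kernel hands back -EBADF, and the wrapper turns that into -1 plus errno.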
System Call Categories
Linux offers hundreds of system calls, mostly falling into:
- Process Control: fork(), exec(), exit()
- File Management: open(), read(), write(), close()
- Device Management: ioctl()
- Information Maintenance: getpid(), uname()
- Communication: pipe(), socket()
System Call (vs a Library Call)
man 2 syscall
Applications rarely invoke the trap instruction directly; instead they call C library wrapper functions. Those wrappers set up registers, may translate arguments (e.g., open() wrapper calling openat() on newer glibc), and issue the actual kernel transition as needed.
Many standard library functions never enter the kernel (e.g., strcmp()), while others are thin wrappers over syscalls (getpid(), read()), and some are richer abstractions that bundle multiple syscalls and user‑space logic (fopen() wrapping open() + buffering).
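To make the "thin wrapper" case concrete, the sketch below compares getpid() with the same request issued through the generic syscall() entry point. The helper name is invented for this example; SYS_getpid comes from sys/syscall.h.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 when the libc wrapper and the raw
 * syscall agree on the process ID, 0 otherwise. */
int getpid_wrapper_matches_raw(void)
{
    pid_t via_wrapper = getpid();            /* thin libc wrapper */
    long via_syscall = syscall(SYS_getpid);  /* generic entry point */

    return via_wrapper == (pid_t)via_syscall;
}
```

Both paths end in the same kernel handler, so the two values should always match.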
Userspace Side Details
libc Wrappers Aren’t Always Thin
Wrappers sometimes change the actual syscall invoked for portability, feature detection, or security hardening. Examples:
- exit() wrapper calling exit_group() to terminate all threads.
- fork() wrapper calling clone() with specific flags.
- Beginning with glibc 2.26, the open() wrapper may call openat() unconditionally to ensure O_LARGEFILE etc. work uniformly.
Because of such indirections, seccomp filters based only on the function name you called may miss the actual syscall issued; you must filter by the underlying syscall numbers that libc emits.
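As a rough illustration of that indirection, the sketch below opens a path by issuing openat directly with AT_FDCWD, which is the form the text says the open() wrapper may emit. The helper name is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open a path read-only through the openat
 * syscall instead of the open() wrapper. */
int open_readonly_via_openat(const char *path)
{
    /* AT_FDCWD makes a relative path resolve against the current
     * working directory, which matches open() semantics. */
    long fd = syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);

    return (int)fd; /* -1 with errno set on failure */
}
```

A seccomp filter that only matches the open syscall number would never see this call, which is why filters have to be written against the numbers that are actually emitted.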
Direct Syscalls From C
When no wrapper exists (new syscall, custom build, or early testing), you can use the generic syscall() function from glibc, passing the numeric __NR_xxx constant and arguments. This bypasses higher‑level wrapper behavior. The man page notes a 0 return indicates success and -1 indicates error with errno set; ENOSYS means the syscall is not implemented.
syscall
Each system call has a unique number (e.g., __NR_write for the write system call). These numbers are defined in <sys/syscall.h>. In this example, we will use the syscall function to invoke the write system call directly. The program writes the string "Hello World!\n" to standard output.
#include <sys/syscall.h> // For system call numbers (e.g., __NR_write)
#include <unistd.h>      // For syscall() function
#include <string.h>      // For strlen() function

int main(void) {
    const char *str = "Hello World!\n"; // String to write
    size_t len = strlen(str);           // Length of the string

    // Make the write system call on fd 1 (standard output)
    if (syscall(__NR_write, 1, str, len) == -1)
        return 1; // write failed; errno holds the reason
    return 0;
}
For the syscall calling convention on each architecture, the syscall(2) man page (man 2 syscall) has a reference table.
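For x86-64 specifically, the convention can be sketched with a few lines of inline assembly: the syscall number goes in rax, the first three arguments in rdi, rsi and rdx, and the instruction itself overwrites rcx and r11. The helper raw_syscall3 is hypothetical, and the non-x86-64 branch is only an assumption added so that the sketch still compiles elsewhere.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: three-argument raw syscall. Returns the kernel's
 * raw result, i.e. a negative errno code on failure, without touching
 * errno on x86-64. */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;

    __asm__ volatile(
        "syscall"
        : "=a"(ret)                           /* result comes back in rax */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)  /* rax, rdi, rsi, rdx */
        : "rcx", "r11", "memory");            /* overwritten by the instruction */
    return ret;
#else
    /* Fallback for other architectures: go through libc and undo its
     * errno translation so the return value has the same shape. */
    long ret = syscall(nr, a1, a2, a3);

    return (ret == -1) ? -errno : ret;
#endif
}
```

For example, raw_syscall3(SYS_close, -1, 0, 0) should return -EBADF rather than -1, because no libc wrapper sits in between to translate it.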
Legacy Entry Mechanisms vs Modern Fast Paths
int $0x80 (Software Interrupt): The Old Way
Early Linux on x86 used a software interrupt vector (INT 0x80). This goes through the full interrupt descriptor table (IDT) microcode path and is relatively slow: state saves, privilege switches, etc. It remains for backward compatibility (32‑bit code, some 64‑bit compat cases) but is discouraged for performance.
sysenter / sysexit (Intel Fast System Call, 32‑bit)
Intel introduced sysenter to cut overhead (bypassing some IDT work) but required per‑CPU MSR configuration (EIP/ESP/CS). Linux wires this for 32‑bit fast paths; compatibility layers map userspace transitions accordingly.
syscall / sysret (AMD64 Fast Path, Used on x86‑64)
AMD’s syscall/sysret pair (adopted by Intel in long mode) is the standard on 64‑bit Linux. The kernel sets MSR_LSTAR to the entry point (entry_SYSCALL_64), uses MSR_STAR and MSR_SYSCALL_MASK to define segment and flags behavior, and handles the fragile swapgs dance for per‑CPU data. Faster than int 0x80, but still not free.
vDSO & (Obsolete) vsyscall
To avoid any trap for high‑frequency trivial queries (time, CPU ID), Linux maps a small shared object—the vDSO—into every process. libc can call functions there as normal user‑space calls; the kernel keeps the data page updated. The older vsyscall page hard‑wired a few fixed addresses but was limited (only a handful of calls) and created security issues (fixed address gadgets); it has been largely superseded by vDSO and may be emulated or disabled.
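The vDSO mapping can be located from user space through the auxiliary vector. The sketch below assumes a libc that provides getauxval() (glibc 2.16 or later, or musl); the helper name is invented here.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Hypothetical helper: returns 1 if the kernel advertised a vDSO and the
 * mapping starts with the ELF magic, 0 if the mapping looks wrong, and
 * -1 if no vDSO was advertised at all. */
int vdso_has_elf_magic(void)
{
    /* AT_SYSINFO_EHDR carries the base address of the vDSO image. */
    unsigned long base = getauxval(AT_SYSINFO_EHDR);

    if (base == 0)
        return -1;
    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

Because the vDSO is an ordinary ELF shared object, libc can resolve symbols in it and call them like any other user-space function.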
Inside the Kernel Syscall Entry Path (x86‑64 Walkthrough)
Let’s zoom into what happens in entry_SYSCALL_64 (simplified):
- swapgs to switch GS base from user to kernel per‑CPU area.
- Switch to kernel stack; save user %rsp.
- Push user segments, flags, RIP, and registers to build a pt_regs frame.
- Validate syscall number vs __NR_syscall_max; if valid, move %r10 → %rcx to match C calling convention and call through sys_call_table. (In recent kernels, including the 5.11 tree used later in this post, the check and the table call happen in the C function do_syscall_64(), and handlers read their arguments from pt_regs.)
- On return, store %rax as result; call trace exit hooks; restore registers.
- swapgs back and sysretq to user RIP.
The entry code must be extremely careful with speculative execution mitigations, stack alignment, and nested interrupts; any mistake is a security bug or crash.
Role of MSRs in System Calls
When a user program executes syscall or sysenter, the CPU needs to know:
- Where in the kernel to jump (entry address)
- What code segment to switch to
- Which stack and privilege level to use
This information is configured in MSRs by the Linux kernel during boot. For example:
Key MSRs on x86-64 for syscall
- MSR_LSTAR: Contains the 64-bit address of the entry point for syscall instructions (e.g., entry_SYSCALL_64 in Linux). When user mode executes syscall, the CPU jumps here.
- MSR_STAR: Stores segment selector values for switching code segments between user mode and kernel mode.
- MSR_SYSCALL_MASK: A bitmask of flags (in RFLAGS) that are cleared on syscall entry to disable certain user-level flags (like interrupts or traps).
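How MSR_STAR encodes the selectors can be sketched as plain bit arithmetic. The decoding rules below follow the syscall/sysret descriptions in the CPU vendor manuals as we understand them; the struct and function names are hypothetical, and nothing here reads a real MSR.

```c
#include <stdint.h>

/* Selectors the CPU derives from the STAR MSR in 64-bit mode. */
struct star_selectors {
    uint16_t syscall_cs; /* kernel CS loaded by syscall */
    uint16_t syscall_ss; /* kernel SS loaded by syscall */
    uint16_t sysret_cs;  /* user CS loaded by sysretq */
    uint16_t sysret_ss;  /* user SS loaded by sysretq */
};

/* Hypothetical helper: split a STAR value into the four selectors. */
struct star_selectors decode_star(uint64_t star)
{
    uint16_t kernel_base = (uint16_t)(star >> 32); /* bits 47:32 */
    uint16_t user_base = (uint16_t)(star >> 48);   /* bits 63:48 */
    struct star_selectors s;

    s.syscall_cs = (uint16_t)(kernel_base & 0xFFFC);   /* RPL forced to 0 */
    s.syscall_ss = (uint16_t)(kernel_base + 8);
    s.sysret_cs = (uint16_t)((user_base + 16) | 3);    /* RPL forced to 3 */
    s.sysret_ss = (uint16_t)((user_base + 8) | 3);
    return s;
}
```

As a usage example, assume the kernel base selector is 0x10 and the user base selector is 0x23 (to the best of our knowledge the layout x86-64 Linux uses, but treat the numbers as an assumption). Then decode_star((0x23ULL << 48) | (0x10ULL << 32)) yields kernel CS/SS of 0x10/0x18 and user CS/SS of 0x33/0x2b.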
When user space executes, say, the write syscall, the CPU:
- Reads MSR_LSTAR → gets the address of entry_SYSCALL_64().
- Saves the user return address in %rcx and the user RFLAGS in %r11 (this is what sysretq later uses to get back).
- Switches to ring 0 (kernel mode).
- Changes CS/SS selectors per MSR_STAR.
- Masks certain flags (e.g., disables interrupts).
- Jumps to the kernel address in MSR_LSTAR.
The value in MSR_LSTAR is the start of a small assembly routine (entry_SYSCALL_64 in arch/x86/entry/entry_64.S).
This routine:
- Saves user registers (so the kernel doesn’t clobber them).
- Switches to the kernel stack (defined in the per-CPU TSS).
- Prepares a pt_regs structure on the stack to hold all user register values.
- Performs the swapgs trick: switches the GS base from the user-space value to the kernel’s percpu data region.
- Disables interrupts (if not already masked).
- The syscall number (in %rax) is extracted.
- It is compared against NR_syscalls (the number of entries in the syscall table).
  - If it’s invalid → return -ENOSYS.
- If valid, the address of the C syscall handler is fetched from the sys_call_table[].
For sys_write, sys_call_table[1] = sys_write.
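The bounds check and table lookup can be modelled in user space with an array of function pointers. This is a toy model only: the toy_ names, the register struct, and the handlers are all invented for illustration and are not the kernel's actual definitions.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for the saved user registers (not the kernel's pt_regs). */
struct toy_regs {
    long ax;         /* syscall number */
    long di, si, dx; /* first three arguments */
};

typedef long (*toy_handler_t)(const struct toy_regs *regs);

/* Toy handlers, invented for this sketch. */
long toy_sys_read(const struct toy_regs *regs)
{
    (void)regs;
    return 0; /* pretend we hit end of file */
}

long toy_sys_write(const struct toy_regs *regs)
{
    return regs->dx; /* pretend every byte was written */
}

/* Index = syscall number, mirroring sys_call_table[1] = sys_write. */
static const toy_handler_t toy_call_table[] = {
    [0] = toy_sys_read,
    [1] = toy_sys_write,
};

#define TOY_NR_SYSCALLS (sizeof(toy_call_table) / sizeof(toy_call_table[0]))

long toy_dispatch(const struct toy_regs *regs)
{
    /* The unsigned comparison also rejects negative numbers. */
    if ((unsigned long)regs->ax >= TOY_NR_SYSCALLS)
        return -ENOSYS;
    return toy_call_table[regs->ax](regs);
}
```

With ax set to 1 and dx set to 13, toy_dispatch() returns 13; any number outside the table comes back as -ENOSYS, which is the same shape of result a real out-of-range syscall number produces.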
Transition to C Code
- The kernel uses a small stub (do_syscall_64) to convert from assembly to C calling convention.
- Up to six arguments (in registers %rdi, %rsi, %rdx, ...) are passed to the handler.
- The actual sys_write function (in fs/read_write.c) is invoked.
- Control returns to do_syscall_64, which places the return value into %rax.
- If it’s an error (negative code in the range -4095..-1), the libc wrapper translates it into -1 and sets errno.
The assembly exit path:
- Restores all saved registers from the pt_regs frame.
- swapgs again (switch back to user GS).
- Executes sysretq, which:
  - Restores %rip (user instruction pointer).
  - Returns to user mode with the proper flags and stack.
SUMMARY
[User executes syscall]
        |
        v
CPU jumps to MSR_LSTAR (entry_SYSCALL_64)
        |
        v
[Save regs] -> [swapgs] -> [switch to kernel stack]
        |
        v
do_syscall_64()
        |
        +--> lookup sys_call_table[%rax]
        |
        +--> call sys_write(fd, buf, len)
        |        |
        |        +--> vfs_write()
        |        +--> copy_from_user()
        |
        +--> return value in %rax
        |
        v
[restore regs] -> [swapgs] -> sysretq
        |
        v
Back to user-space code
Let’s add a custom syscall to our kernel source tree.
NOTE
We will be using the same kernel that we built in the previous blog.
1. Create a New Syscall Implementation
We’ll create mysyscall.c in the kernel/ directory.
cd /home/fury/Desktop/Blog/Kernel_Lab/linux-5.11.4
cd kernel
Add the following code:
// kernel/mysyscall.c
#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h> // for strncpy_from_user

SYSCALL_DEFINE1(mysyscall, const char __user *, user_msg)
{
    char buf[128];
    long len;

    // Copy up to 127 bytes and stop at the NUL terminator, so we never
    // read past the end of a short user string.
    len = strncpy_from_user(buf, user_msg, sizeof(buf) - 1);
    if (len < 0)
        return -EFAULT;
    buf[len] = '\0'; // ensure null termination
    pr_info("mysyscall: %s\n", buf);
    return 0;
}
2. Add It to the Kernel Build
Open kernel/Makefile (we are already inside the kernel/ directory):
nano Makefile
Find the list of obj-y files and add:
obj-y += mysyscall.o
3. Add the Syscall to the Table
On x86-64, syscalls are listed in:
arch/x86/entry/syscalls/syscall_64.tbl
Edit it:
nano ../arch/x86/entry/syscalls/syscall_64.tbl
Add a new line at the end:
# This is the end of the legacy x32 range. Numbers 548 and above are
# not special and are not to be used for x32-specific syscalls.
548 common mysyscall sys_mysyscall
(Use 548 or the next free number; check the last entry number first.)
4. Build Kernel
Rebuild:
cd ..
make -j$(nproc)
After this, you’ll have a new arch/x86/boot/bzImage.
5. Create a Test Program
Outside the kernel source tree, create test_mysyscall.c:
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>

#ifndef __NR_mysyscall
#define __NR_mysyscall 548 // must match the number added to syscall_64.tbl
#endif

int main(void) {
    const char *msg = "Hello from user space!";
    long res = syscall(__NR_mysyscall, msg);

    if (res == 0) {
        printf("mysyscall succeeded!\n");
        return 0;
    }
    perror("mysyscall failed"); // e.g. ENOSYS on a kernel without the new syscall
    return 1;
}
Compile it:
gcc test_mysyscall.c -static -o test_mysyscall
6. Add to Initramfs
Copy your test_mysyscall binary to your BusyBox initramfs root directory and rebuild initramfs.cpio.gz from inside that directory:
cp test_mysyscall busybox-1.36.1/_install/
cd busybox-1.36.1/_install
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../initramfs.cpio.gz
cd ../..
7. Run in QEMU
Boot the new kernel:
qemu-system-x86_64 -kernel ./linux-5.11.4/arch/x86/boot/bzImage -initrd ./busybox-1.36.1/initramfs.cpio.gz -append "root=/dev/ram rw console=ttyS0 quiet" -nographic
Run your program:
./test_mysyscall
dmesg | tail
You should see:
[ 6.507706] mysyscall: Hello from user space!