
2021 day 97

Today I'll be showing what you have to work with if you don't use the 
C standard library. Specifically, on Linux, every program has access to the
Linux kernel's ABI. This ABI is similar to, but different from, the function
interfaces that libc and POSIX provide to C. The most obvious difference of
course is that you have to use assembler instructions to call the kernel and
get it to do things. I'll be describing the ABI on x86_64 but on other
architectures the principle is the same.

Each system call has a number, which is placed in the rax register. It can take
up to 6 arguments, placed in the rdi, rsi, rdx, r10, r8, r9 registers, in that
order. After placing these into the correct registers, the syscall instruction
is executed, and the result comes back in rax. This instruction is faster than
the int 0x80 instruction used on 32 bit x86 machines. (On x86 32 bit the
register for the syscall number is eax and the arguments are in order ebx, ecx,
edx, esi, edi, ebp.)
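
The simplest case makes the convention concrete: getpid takes no arguments, so
only rax is involved. This is just a sketch, assuming x86-64 Linux, where
getpid is syscall number 39. The name raw_getpid is made up for illustration,
and the constraint strings after the colons are explained further down.
(__asm__ is the same keyword as asm, spelled so that it also compiles in
strict -std=c99 mode.)

```c
/* raw_getpid is an illustrative name; 39 is the x86-64 number for getpid. */
long raw_getpid(void) {
        long ret;
        __asm__ volatile(
                "syscall"
                : "=a"(ret)            /* result comes back in rax */
                : "a"(39L)             /* syscall number goes in via rax */
                : "rcx", "r11", "memory");
        return ret;
}
```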

Actually placing these arguments into the correct registers takes a
compiler-specific trick, but the following code works on TCC, GCC, and
clang. We will use the mmap syscall as an example.

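void* mmap(void* addr, unsigned long len, int prot, int flags, int fd, long off) {
        /* rax carries the syscall number (9 = mmap) in and the result out */
        void* a=(void*)9;
        asm volatile(
                "movq %4,%%r10\n"
                "\tmovq %5,%%r8\n"
                "\tmovq %6,%%r9\n"
                "\tsyscall"
                :"+a"(a)
                :"D"(addr),"S"(len),"d"((long)prot),
                 "r"((long)flags),"r"((long)fd),"r"(off)
                :"rcx","r11","r10","r8","r9","memory");
        return a;
}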

The gcc-style asm statement consists of four sections separated by : colons. 
The first section is assembler code in a string literal, with statements 
separated by newlines and tabs. This assembler code can contain references 
to C variables (defined in the next sections), in the format %0, %1, %2, etc.,
numbered from zero across the outputs first and then the inputs. 
Actual % signs to be included in the assembler code are doubled as %%.

The second and third sections of the asm statement contain variable references:
the second section lists output variables and the third lists input values.
Each variable is put in () and preceded by a constraint string. The output
constraints begin with one of the symbols:

+ for a variable that will be read as well as written to
= for a variable that will only be written to.
=& for a variable that is written before all the inputs have been read, so
   it must not share a register with any input (an "early clobber").
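
Here is a small sketch of the three prefixes in use, assuming x86-64 and a
compiler with GCC-style asm. The function names are made up for illustration.

```c
/* "+r": x is read by the instruction and then overwritten with the result. */
long increment(long x) {
        __asm__("incq %0" : "+r"(x) : : "cc");
        return x;
}

/* "=r": y is only written; the value to copy is a separate input, %1. */
long copy_of(long x) {
        long y;
        __asm__("movq %1,%0" : "=r"(y) : "r"(x));
        return y;
}

/* "=&r": sum is written by the first instruction while b (%2) is still
   needed by the second, so sum must not be given the same register as b. */
long add_early(long a, long b) {
        long sum;
        __asm__("movq %1,%0\n"
                "\taddq %2,%0"
                : "=&r"(sum)
                : "r"(a), "r"(b)
                : "cc");
        return sum;
}
```

Without the & in add_early the compiler would be free to put sum and b in the
same register, and the movq would destroy b before the addq reads it.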

The input constraints in the third section do not use these prefixes.
The input and output constraints both contain a set of letters designating
what sort of assembly object the value/variable should be equated to.

r for a register
m for a memory location
i for an immediate (compile-time) constant
rm for a register or memory address
g for any of the above
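
To see what these letters do to the generated code, here is a sketch for
x86-64 with GCC-style asm. The function names are invented for the example.

```c
/* "i": the 5 is baked into the instruction as an immediate ($5). */
int add_five(int x) {
        __asm__("addl %1,%0" : "+r"(x) : "i"(5) : "cc");
        return x;
}

/* "m": the operand is the memory location *p itself, not a register copy. */
void store_zero(int* p) {
        __asm__("movl $0,%0" : "=m"(*p));
}

/* "g": the compiler may choose a register, memory, or an immediate for y. */
int add_any(int x, int y) {
        __asm__("addl %1,%0" : "+r"(x) : "g"(y) : "cc");
        return x;
}
```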

Specific to x86 and x86-64 the following useful constraints are available:

a for rax / eax
b for rbx / ebx
c for rcx / ecx
d for rdx / edx
D for rdi / edi
S for rsi / esi
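
With these letters, a syscall of up to three arguments needs no mov
instructions at all, because a, D, S and d cover rax, rdi, rsi and rdx. Here is
a sketch for write, which is number 1 on x86-64 Linux. raw_write is an
illustrative name, not something the kernel or libc provides.

```c
/* raw_write is an illustrative name; 1 is the x86-64 number for write. */
long raw_write(int fd, const void* buf, unsigned long len) {
        long ret = 1;                  /* rax: syscall number in, result out */
        __asm__ volatile(
                "syscall"
                : "+a"(ret)
                : "D"((long)fd),       /* rdi: first argument */
                  "S"(buf),            /* rsi: second argument */
                  "d"(len)             /* rdx: third argument */
                : "rcx", "r11", "memory");
        return ret;
}
```

Calling raw_write(1, "hi\n", 3) should write to the standard output and return
the number of bytes written. A bad file descriptor gives a small negative
number instead.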

However, note the lack of any constraints for the numbered x86-64 registers
(r8 through r15). That's why I had to do those movq instructions above. Anyway,
the last section of the asm statement lists registers that are clobbered
(modified) by the given asm statement (other than ones used for its inputs and
outputs). The syscall instruction itself clobbers rcx and r11, and the movq
instructions overwrite r10, r8 and r9, so all five belong in that list. Adding
"memory" to the list tells the compiler that the kernel may read or write
memory behind its back.
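
The same pattern generalizes to a helper that takes the syscall number as a
parameter. This is a sketch, assuming x86-64 Linux. raw_syscall6, raw_map_anon
and raw_unmap are made-up names, and the numeric constants (9 for mmap, 11 for
munmap, 3 for PROT_READ|PROT_WRITE, 0x22 for MAP_PRIVATE|MAP_ANONYMOUS) are the
values used by Linux on x86-64.

```c
/* raw_syscall6 is a hypothetical helper name, not a kernel or libc API. */
long raw_syscall6(long n, long a1, long a2, long a3,
                  long a4, long a5, long a6) {
        long ret = n;                  /* rax: syscall number in, result out */
        __asm__ volatile(
                "movq %4,%%r10\n"      /* 4th argument */
                "\tmovq %5,%%r8\n"     /* 5th argument */
                "\tmovq %6,%%r9\n"     /* 6th argument */
                "\tsyscall"
                : "+a"(ret)
                : "D"(a1), "S"(a2), "d"(a3), "r"(a4), "r"(a5), "r"(a6)
                : "rcx", "r11", "r10", "r8", "r9", "memory");
        return ret;
}

/* Map len bytes of zero-filled anonymous memory.
   Returns the address, or a small negative number on failure. */
long raw_map_anon(unsigned long len) {
        return raw_syscall6(9 /* mmap */, 0, (long)len,
                            3    /* PROT_READ|PROT_WRITE */,
                            0x22 /* MAP_PRIVATE|MAP_ANONYMOUS */,
                            -1, 0);
}

/* Give the mapping back to the kernel; returns 0 on success. */
long raw_unmap(long addr, unsigned long len) {
        return raw_syscall6(11 /* munmap */, addr, (long)len, 0, 0, 0, 0);
}
```

Because r10, r8 and r9 are in the clobber list, the compiler will not pick them
for the three "r" inputs, so the movq instructions cannot overwrite an argument
that has not been moved yet.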

Since the result of the syscall is returned in rax, the same register that
carried the syscall number in, I used one variable for both and gave it the
type of the result (void*), even though the value going in is just a 64 bit
number.
If the syscall uses more inputs than the number of registers in the ABI, usually
the rdi register is instead set to a pointer to a structure containing the
arguments.
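
One thing libc normally hides is error reporting: there is no errno variable at
this level. On failure the kernel puts a negated error code, a value from -4095
to -1, in rax. Here is a sketch of how that can be checked, assuming x86-64
Linux. The helper names are made up, and close (number 3) is used only because
it is easy to make it fail.

```c
/* raw_close is an illustrative name; 3 is the x86-64 number for close. */
long raw_close(int fd) {
        long ret = 3;
        __asm__ volatile("syscall"
                : "+a"(ret)
                : "D"((long)fd)
                : "rcx", "r11", "memory");
        return ret;
}

/* A raw result is an error if it lies in [-4095, -1]. Comparing as unsigned
   keeps large valid results, such as mmap addresses, out of that range. */
int raw_is_error(long ret) {
        return (unsigned long)ret >= (unsigned long)-4095L;
}

/* The error code as libc would store it in errno, or 0 if ret is a success. */
int raw_errno(long ret) {
        return raw_is_error(ret) ? (int)-ret : 0;
}
```

For example, raw_errno(raw_close(-1)) gives 9, which is EBADF.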

The mmap and brk syscalls are the low-level interface that underpins the C 
malloc and free functions. Similarly, one can use the open, write, read, close
syscalls instead of stdio.h. By default, the standard input is file descriptor
0, the standard output is 1, and the standard error is 2. Have fun!
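
As a sketch of what replacing stdio.h can look like, here are thin wrappers
over those four syscalls, assuming x86-64 Linux (read is number 0, write 1,
open 2, close 3). All the function names are invented for this example.

```c
/* raw_syscall3 is a hypothetical helper for syscalls of up to 3 arguments. */
long raw_syscall3(long n, long a1, long a2, long a3) {
        long ret = n;
        __asm__ volatile("syscall"
                : "+a"(ret)
                : "D"(a1), "S"(a2), "d"(a3)
                : "rcx", "r11", "memory");
        return ret;
}

long fd_open(const char* path, int flags, int mode) {
        return raw_syscall3(2, (long)path, flags, mode);
}

long fd_read(int fd, void* buf, unsigned long n) {
        return raw_syscall3(0, fd, (long)buf, (long)n);
}

long fd_write(int fd, const void* buf, unsigned long n) {
        return raw_syscall3(1, fd, (long)buf, (long)n);
}

long fd_close(int fd) {
        return raw_syscall3(3, fd, 0, 0);
}

/* Rough stand-in for fputs: write a NUL-terminated string to fd. */
long put_str(int fd, const char* s) {
        unsigned long n = 0;
        while (s[n]) n++;              /* no strlen without libc */
        return fd_write(fd, s, n);
}
```

With these, put_str(1, "hello\n") goes to the standard output and
put_str(2, "oops\n") goes to the standard error.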

-- Oren Watson

Here are useful references to Linux syscalls on x86-64.