Cosmopolitan Libc: build-once run-anywhere C library

Cosmopolitan makes C a build-once run-anywhere language, similar to
Java, except it doesn’t require interpreters or virtual machines be
installed beforehand. Cosmo provides the same portability benefits as
high-level languages like Go and Rust, but it doesn’t invent a new
language and you won’t need to configure a CI system to build separate
binaries for each operating system. What Cosmopolitan focuses on is
fixing C by decoupling it from platforms, so it can be pleasant to use
for writing small unix programs that are easily distributed to a much
broader audience.

Getting Started

Assuming you have GCC on Linux, all you need are five additional
files (ape.lds, cosmopolitan.h, crt.o, ape.o, and cosmopolitan.a),
which the commands below use:

# create simple c program on command line
echo '
  int main() {
    printf("hello world\n");
  }
' >hello.c

# run gcc compiler in freestanding mode
gcc -g -Os -static -fno-pie -nostdlib -nostdinc -o hello.com hello.c \
  -Wl,--oformat=binary -Wl,--gc-sections -Wl,-z,max-page-size=0x1000 \
  -Wl,-T,ape.lds -include cosmopolitan.h crt.o ape.o cosmopolitan.a

# ~40kb static binary (can be ~16kb w/ MODE=tiny)
./hello.com

The above command fixes GCC so it outputs portable binaries that will
run on every Linux distro, as well as Mac OS X, Windows NT, FreeBSD,
and OpenBSD. For details on how this works, please read
the αcτµαlly pδrταblε εxεcµταblε blog post. This
novel binary format is also optional: conventional ELF binaries can be
compiled too by removing the -Wl,--oformat=binary flag.

Your program will also boot on bare metal. In other words, you’ve
written a normal textbook C program, and thanks to Cosmopolitan’s
low-level linker magic, you’ve effectively created your own operating
system which happens to run on all the existing ones as well. Now
that’s something no one’s done before.

Mailing List

Please join the Cosmopolitan Cosmonauts Google Group!

Performance

Cosmopolitan has been optimized by hand for excellent performance on
modern desktops and servers. Compared with glibc, you should expect
Cosmopolitan to be almost as fast, but with an order of magnitude
tinier code size. Compared with Musl or Newlib, you can expect that
Cosmopolitan will generally go much faster, while having roughly the
same code size, if not tinier.

In the case of the most important libc function, memcpy(),
Cosmopolitan outperformed every other open source library tested. The
chart below shows how quickly memory is transferred depending on the
size of the copy. Since it’s log scale, each grid square represents a
2x difference in performance. What makes Cosmopolitan so fast here is
that it uses several different memory copying strategies. For small
sizes it uses an indirect branch with overlapping moves; for medium
sizes it uses SIMD vectors; and for large copies it uses nontemporal
hints, which prevent cache thrashing. Other libraries usually fall short
because they use a one-size-fits-all strategy. For example, Newlib
goes 10x slower for the optimal block size (half L1 cache) because it
always does nontemporal moves.


memcpy() performance for varying n values
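
The size-based dispatch described above can be sketched in portable C. This is only an illustration of the overlapping-moves idea for small sizes, not Cosmopolitan’s actual implementation: the names copy_small and copy_dispatch are hypothetical, a compare chain stands in for the indirect branch, and the platform memcpy() stands in for the SIMD and nontemporal paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not Cosmopolitan's code): copies n <= 16 bytes
   using a head move and a tail move that may overlap in the middle. */
static void copy_small(unsigned char *d, const unsigned char *s, size_t n) {
  if (n >= 8) {
    uint64_t head, tail;
    memcpy(&head, s, 8);         /* bytes [0, 8) */
    memcpy(&tail, s + n - 8, 8); /* bytes [n-8, n), overlaps head if n < 16 */
    memcpy(d, &head, 8);
    memcpy(d + n - 8, &tail, 8);
  } else if (n >= 4) {
    uint32_t head, tail;
    memcpy(&head, s, 4);
    memcpy(&tail, s + n - 4, 4);
    memcpy(d, &head, 4);
    memcpy(d + n - 4, &tail, 4);
  } else if (n >= 2) {
    uint16_t head, tail;
    memcpy(&head, s, 2);
    memcpy(&tail, s + n - 2, 2);
    memcpy(d, &head, 2);
    memcpy(d + n - 2, &tail, 2);
  } else if (n == 1) {
    d[0] = s[0];
  }
}

/* Hypothetical dispatcher: chooses a strategy based on the copy size. */
static void *copy_dispatch(void *dest, const void *src, size_t n) {
  if (n <= 16) {
    copy_small(dest, src, n);
  } else {
    memcpy(dest, src, n); /* stand-in for the SIMD and nontemporal paths */
  }
  return dest;
}

/* Small self-check: a 13-byte copy must match the source exactly. */
static void copy_dispatch_check(void) {
  unsigned char a[13] = "overlapping!", b[13] = {0};
  copy_dispatch(b, a, sizeof a);
  assert(memcmp(a, b, sizeof a) == 0);
}
```

Each fixed-size memcpy() here is typically compiled to a single load or store, so a 13-byte copy becomes two 8-byte loads and two 8-byte stores that overlap in the middle, with no loop.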

Trickle-Down Performance

Performing the best on benchmarks isn’t enough. Cosmopolitan also uses
a second technique that the above benchmark doesn’t measure, which we
call “trickle-down performance”. For an example of how that works,
consider the following fact about C that’s often overlooked.
External function calls such as the following:

memcpy(foo, bar, n);

are roughly equivalent to the following assembly, which leads
compilers to assume that most CPU state is clobbered:

asm volatile("call memcpy"
             : "=a"(rax), "=D"(rdi), "=S"(rsi), "=d"(rdx)
             : "1"(foo), "2"(bar), "3"(n)
             : "rcx", "r8", "r9", "r10", "r11", "memory", "cc",
               "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6");

In other words the compiler assumes that, in calling the function,
fifteen separate registers and all memory will be overwritten. See
the System V ABI for further details. This can be problematic for
frequently-called functions such as memcpy, since it inhibits many
optimizations and it tosses a wrench in the compiler’s register
allocation algorithm, thus causing stack spillage which further
degrades performance while bloating the output binary size.

So what Cosmopolitan does for memcpy() and many other
frequently-called core library leaf functions is define a simple
macro wrapper that tells the compiler the correct subset of the ABI
that’s actually needed, e.g.

#define memcpy(DEST, SRC, N) ({            \
  void *Dest = (DEST);                     \
  const void *Src = (SRC);                 \
  size_t Size = (N);                       \
  asm("call memcpy"                        \
      : "=m"(*(char(*)[Size])(Dest))       \
      : "D"(Dest), "S"(Src), "d"(Size),    \
        "m"(*(const char(*)[Size])(Src))   \
      : "rcx", "xmm3", "xmm4", "cc");      \
    Dest;                                  \
  })

What this means is that Cosmopolitan memcpy() is not simply fast; it
also makes unrelated code in the functions that call it faster as a
side effect. When this technique was first implemented for memcpy()
alone, many of the functions in the Cosmopolitan codebase had their
generated code size reduced by a third.
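
The underlying idea, describing to the compiler exactly what an operation reads and writes, can be tried without Cosmopolitan. The sketch below is an assumption-laden illustration rather than Cosmopolitan’s code: copy_precise is a hypothetical helper that uses the x86-64 rep movsb instruction with GCC-style constraints, and it falls back to the platform memcpy() on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: forward byte copy whose asm constraints name only
   the registers and memory ranges that are actually touched. */
static inline void *copy_precise(void *dest, const void *src, size_t n) {
#if defined(__GNUC__) && defined(__x86_64__)
  if (n) {
    void *d = dest;      /* rdi: advanced by the instruction */
    const void *s = src; /* rsi: advanced by the instruction */
    size_t c = n;        /* rcx: counts down to zero */
    /* Relies on the direction flag being clear, as the ABI guarantees. */
    __asm__("rep movsb"
            : "+D"(d), "+S"(s), "+c"(c),
              "=m"(*(char(*)[n])dest)        /* writes dest[0..n) only */
            : "m"(*(const char(*)[n])src));  /* reads src[0..n) only */
  }
#else
  memcpy(dest, src, n); /* portable fallback on other targets */
#endif
  return dest;
}

/* Small self-check: the helper must agree with a plain comparison. */
static void copy_precise_check(void) {
  char a[8] = "abcdefg", b[8] = {0};
  copy_precise(b, a, sizeof a);
  assert(memcmp(a, b, sizeof a) == 0);
}
```

Because there is no "memory" clobber and no list of call-clobbered registers, the compiler remains free to keep unrelated values in registers across the copy, which is the effect the macro wrapper above aims for.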

For an example of a function that benefits, consider strlcpy,
which is the BSD way of saying strcpy:

/**
 * Copies string, the BSD way.
 *
 * @param d is buffer which needn't be initialized
 * @param s is a NUL-terminated string
 * @param n is byte capacity of d
 * @return strlen(s)
 * @note d and s can't overlap
 * @note we prefer memccpy()
 */
size_t strlcpy(char *d, const char *s, size_t n) {
  size_t slen, actual;
  slen = strlen(s);
  if (n) {
    actual = MIN(n - 1, slen);
    memcpy(d, s, actual);
    d[actual] = '\0';
  }
  return slen;
}
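
As a usage note, returning strlen(s) lets the caller detect truncation by comparing the result with the buffer capacity. The sketch below repeats the logic above under the hypothetical name my_strlcpy, so it cannot clash with a strlcpy the platform may already provide; copy_fits is likewise an illustrative helper, and MIN is defined locally as an assumption about how the original macro behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifndef MIN
#define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
#endif

/* Same logic as the listing above, renamed for this standalone sketch. */
static size_t my_strlcpy(char *d, const char *s, size_t n) {
  size_t slen = strlen(s), actual;
  if (n) {
    actual = MIN(n - 1, slen);
    memcpy(d, s, actual);
    d[actual] = '\0';
  }
  return slen;
}

/* Returns 1 if s fit in a buffer of capacity n, 0 if it was truncated. */
static int copy_fits(char *d, const char *s, size_t n) {
  return my_strlcpy(d, s, n) < n;
}

/* Demonstrates both outcomes with a 6-byte buffer. */
static void strlcpy_demo(void) {
  char buf[6];
  assert(copy_fits(buf, "hello", sizeof buf));        /* 5 chars + NUL fit */
  assert(!copy_fits(buf, "hello world", sizeof buf)); /* truncated */
  assert(strcmp(buf, "hello") == 0);                  /* still terminated */
}
```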

If we compile our strlcpy function, then here’s the
assembly code that the compiler outputs:

/ compiled with traditional libc
strlcpy:
	push	%rbp
	mov	%rsp,%rbp
	push	%r14
	mov	%rsi,%r14
	push	%r13
	mov	%rdi,%r13
	mov	%rsi,%rdi
	push	%r12
	push	%rbx
	mov	%rdx,%rbx
	call	strlen
	mov	%rax,%r12
	test	%rbx,%rbx
	jne	1f
	pop	%rbx
	mov	%r12,%rax
	pop	%r12
	pop	%r13
	pop	%r14
	pop	%rbp
	ret
1:	cmp	%rbx,%rax
	mov	%r14,%rsi
	mov	%r13,%rdi
	cmovbe	%rax,%rbx
	mov	%rbx,%rdx
	call	memcpy
	movb	$0,0(%r13,%rbx)
	mov	%r12,%rax
	pop	%rbx
	pop	%r12
	pop	%r13
	pop	%r14
	pop	%rbp
	ret
	.endfn	strlcpy,globl
/ compiled with cosmopolitan libc
strlcpy:
	mov	%rdx,%r8
	mov	%rdi,%r9
	mov	%rsi,%rdi
	call	strlen
	test	%r8,%r8
	je	1f
	cmp	%r8,%rax
	lea	-1(%r8),%rdx
	mov	%r9,%rdi
	cmova	%rax,%rdx
	call	MemCpy
	movb	$0,(%r9,%rdx)
1:	ret
	.endfn	strlcpy,globl

That’s a huge improvement in generated code size. The above two
compiles used the same gcc flags and no changes to the code needed to
be made. All that changed was we used cosmopolitan.h (instead of the
platform C library’s string.h), which contains ABI specialization macros
for memcpy and strlen. It’s a great example
of how merely choosing a better C library can systemically eliminate
bloat throughout your entire codebase.
