The Missing Cow Call

Linux allows users to allocate memory via mmap(2). Such a mapping can then be shrunk, grown, and moved by mremap(2). The nice feature of mremap(2) is that no data actually gets copied. Instead, the physical memory pages are simply mapped to different virtual pages.

With that, one can even have different virtual addresses point to the same underlying physical memory.
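To make the grow-and-move case concrete, here is a minimal sketch. The helper name grow_demo and the sizes are my own choices for illustration; only mmap(2) and mremap(2) themselves come from the text.

```c
#define _GNU_SOURCE            /* for mremap() and MREMAP_MAYMOVE */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: grows a one-page mapping to 64 pages and reports
 * whether the old contents are still there afterwards.
 * Returns 1 if the data survived, 0 if not, -1 on error. */
int grow_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t old_size = (size_t)page;
    size_t new_size = 64 * (size_t)page;

    char *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    strcpy(p, "data");

    /* The kernel may move the mapping to a new virtual address, but it
     * only rewires page tables; the page contents are not copied. */
    char *q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
    if (q == MAP_FAILED) {
        munmap(p, old_size);
        return -1;
    }

    int survived = strcmp(q, "data") == 0;
    munmap(q, new_size);
    return survived;
}
```

Calling grow_demo() from main should return 1. After a successful mremap(2) the old pointer must no longer be used, since the mapping may have moved.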

void *ptr = mmap((void *)0, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, (size_t)0);
void *ptr2= mremap(ptr, 0, size, MREMAP_MAYMOVE);

Now ptr and ptr2 are two pointers to different addresses that alias the same memory. Thus a write through one will change the data seen through the other. Passing zero as the second parameter (old_size) is not properly documented, but works just fine on Linux for a shared mapping like this one.

The above code allows one to cheaply "copy" big data structures. However, the underlying data is shared. What would be cool instead is to have the two mappings be copy-on-write with page granularity.
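The aliasing can be checked with a small sketch like the following. The helper name alias_demo is hypothetical, and the code assumes a Linux kernel that accepts old_size == 0 for MAP_SHARED mappings.

```c
#define _GNU_SOURCE            /* for mremap() and MREMAP_MAYMOVE */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: returns 1 if a write through the original mapping
 * is visible through the alias, 0 if not, -1 on error. */
int alias_demo(size_t size)
{
    assert(size > 0);

    char *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (ptr == MAP_FAILED)
        return -1;

    /* old_size == 0 on a shared mapping: create a second mapping of the
     * same pages instead of moving the first one. */
    char *ptr2 = mremap(ptr, 0, size, MREMAP_MAYMOVE);
    if (ptr2 == MAP_FAILED) {
        munmap(ptr, size);
        return -1;
    }

    strcpy(ptr, "hello");      /* write through the first pointer */

    /* Different virtual addresses, same physical page. */
    int aliased = (ptr != ptr2) && strcmp(ptr2, "hello") == 0;

    munmap(ptr2, size);
    munmap(ptr, size);
    return aliased;
}
```

With a size of one page, for example alias_demo(4096), the function should return 1 on Linux. Note that both mappings have to be unmapped separately.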

void *ptr = mmap((void *)0, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, (size_t)0);
void *ptr2= mremap(ptr, 0, size, MREMAP_MAYMOVE | MREMAP_COW);

Now the two pointers would share the data. However, when a write happens, the affected physical page gets copied, the write goes to the copy, and the writer's virtual page points to that new page. This is what happens after a fork(2). Unfortunately, there is no MREMAP_COW or similar mechanism that exposes this kind of functionality. The only related facility is kernel same-page merging (KSM), which merges pages containing the same data and copies them again if they get written to. But the pages have to exist as separate copies before they can be merged, so that does not elide the expensive copy in the first place.
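The fork(2) behaviour mentioned above is easy to observe. The following is a minimal sketch with a hypothetical helper fork_cow_demo: the child writes to a private page it inherited, and the parent still sees the old value because the kernel copied the page on that write.

```c
#define _DEFAULT_SOURCE        /* for MAP_ANONYMOUS */
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the value the parent reads after the child
 * has overwritten its own (copy-on-write) view of the page, or -1 on error. */
int fork_cow_demo(void)
{
    int *value = mmap(NULL, sizeof *value, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (value == MAP_FAILED)
        return -1;
    *value = 1;

    pid_t pid = fork();
    if (pid == -1) {
        munmap(value, sizeof *value);
        return -1;
    }
    if (pid == 0) {
        *value = 2;            /* triggers the page copy in the child */
        _exit(*value == 2 ? 0 : 1);
    }

    int status = 0;
    int ok = waitpid(pid, &status, 0) == pid
             && WIFEXITED(status) && WEXITSTATUS(status) == 0;

    int seen = *value;         /* parent's page is untouched */
    assert(munmap(value, sizeof *value) == 0);
    return ok ? seen : -1;
}
```

The parent should get 1 back, not 2. The catch, and the point of this post, is that this only works across a process boundary, not between two mappings within one process.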

The closest I have come to the above semantics is by using the relatively new syscall memfd_create(2). In fact, it is so new that glibc versions before 2.27 do not provide a wrapper for it.

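#include <err.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

// provide wrapper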
int memfd_create(const char *name, unsigned int flags)
{
    return syscall(SYS_memfd_create, name, flags);
}

int main(){
    // create a temporary in-memory file
    int fd = memfd_create("cow_buffer", 0);
    if (fd == -1)  err(errno, "memfd_create failed");

    size_t size = 4096;

    // resize file
    int check = ftruncate(fd, size);
    if (check == -1) err(errno, "ftruncate failed");

    // map file into memory as SHARED
    void *ptr = mmap((void *)0, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, (size_t)0);

    if (ptr == MAP_FAILED) err(errno, "mmap failed");

    // write to ptr
    ...

    // map the same file into memory as PRIVATE
    void *ptr2 = mmap((void *)0, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, (size_t)0);
}

Now the pointers, again, share data. But as soon as we write through ptr2, the affected page is copied (COW), and ptr still sees the old state. However, it does not work the other way round. Writes through ptr pass through to the file and will be visible through ptr2, at least for pages that ptr2 has not written to yet. So this is a unidirectional COW, if you will. I have no idea if that is even useful for anything, or if the COW could be extended to both directions.

It is my feeling that this is a missed optimization opportunity, especially as the functionality already exists in the kernel but there is just no way for a user to explicitly request it. So all that is missing is a COW syscall. Someone just has to create it.
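The one-directional behaviour described above can be checked with a sketch along these lines. The helper one_way_cow_demo and the two-page layout are my own assumptions for illustration. The visibility of shared writes in untouched private pages is what Linux does in practice, not something POSIX guarantees.

```c
#define _GNU_SOURCE            /* for syscall() and SYS_memfd_create */
#include <assert.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: fills seen[] with
 *   seen[0]: what ptr  reads on page 0 after ptr2 wrote 'B' there
 *   seen[1]: what ptr2 reads on page 0 after ptr2 wrote 'B' there
 *   seen[2]: what ptr2 reads on page 0 after ptr then wrote 'C' there
 *   seen[3]: what ptr2 reads on page 1 after ptr wrote 'C' there
 * Returns 0 on success, -1 on error. */
int one_way_cow_demo(char seen[4])
{
    assert(seen != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t size = 2 * (size_t)page;

    int fd = (int)syscall(SYS_memfd_create, "cow_buffer", 0u);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }

    char *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        close(fd);
        return -1;
    }
    char *ptr2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mappings keep the file alive */
    if (ptr2 == MAP_FAILED) {
        munmap(ptr, size);
        return -1;
    }

    ptr[0] = 'A';              /* initial state on both pages */
    ptr[page] = 'A';

    ptr2[0] = 'B';             /* private write: page 0 of ptr2 is copied */
    seen[0] = ptr[0];          /* 'A': the shared side is unaffected */
    seen[1] = ptr2[0];         /* 'B' */

    ptr[0] = 'C';              /* shared writes go through to the file */
    ptr[page] = 'C';
    seen[2] = ptr2[0];         /* 'B': page 0 was already copied */
    seen[3] = ptr2[page];      /* 'C' on Linux: page 1 is still shared */

    munmap(ptr2, size);
    munmap(ptr, size);
    return 0;
}
```

So the COW is not only one-directional but also per page: once ptr2 has written to a page, it stops following ptr for that page, while untouched pages keep tracking the shared mapping.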