C Library Interface

Process Migration With BProc

BProc provides a number of mechanisms for creating processes on remote nodes. It is instructive to think of these mechanisms as moving processes from the front end to the remote node. The rexec mechanism is like a move followed by an exec, but with lower overhead. The rfork mechanism is implemented as an ordinary fork on the front end followed by a move to the remote node before the system call returns. Execmove does an exec and then a move before the exec returns to the new process.

Movement to another machine on the system is voluntary and is not transparent. Once a process has been moved, all its open files are lost except for STDOUT and STDERR. These two are replaced with a single socket (their outputs are combined). An IO daemon forwards data from the other end of that connection to whatever the original STDOUT was connected to. No pseudo tty operations are done.
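The effect of replacing STDOUT and STDERR with a single socket can be illustrated without BProc. The following is only an analogy, assuming a local child process and a Unix socket pair. It does not use BProc or its IO daemon, and the names CHILD_CODE and run_with_combined_output are made up for this sketch. The parent plays the role of the forwarder reading from the other end of the connection.

```python
import socket
import subprocess
import sys

# Code run by the child: one line to each stream, flushed in order.
CHILD_CODE = (
    "import sys\n"
    "sys.stdout.write('to stdout\\n'); sys.stdout.flush()\n"
    "sys.stderr.write('to stderr\\n'); sys.stderr.flush()\n"
)


def run_with_combined_output():
    """Start a child whose stdout and stderr are the same socket."""
    parent_end, child_end = socket.socketpair()
    with parent_end, child_end:
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE],
            stdout=child_end,  # both streams point at one socket,
            stderr=child_end,  # so their output is combined
        )
        # Drop our copy of the child's end so we see EOF when it exits.
        child_end.close()
        chunks = []
        while True:
            data = parent_end.recv(4096)
            if not data:
                break
            chunks.append(data)
        proc.wait()
    return b"".join(chunks)


combined = run_with_combined_output()
print(combined.decode(), end="")
```

The reader on the other end sees one byte stream and can no longer tell which bytes came from which descriptor.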

The move is completely visible to the process after it has moved except for process ID space operations. Process ID space operations include fork, wait, kill, etc. All file operations will operate on files local to the node to which the process has been moved. Memory that was shared on the front end will no longer be shared.

For more information on the details of BProc's process migration, see the section "VMADump: Virtual Memory Area Dumper" below. Since VMADump is BProc's process migration mechanism, understanding how it works will illustrate all the caveats with process migration in BProc.

VMADump: Virtual Memory Area Dumper

VMADump is the system used by BProc to take a running process and copy it to a remote node. VMADump saves or restores a process's memory space to or from a stream. In the case of BProc, the stream is a TCP socket to the remote machine.

Most programs on the system are dynamically linked. At run time, they will use mmap to get copies of various libraries in their memory spaces. Since these mappings are demand paged, the entire library is always mapped even if most of it will never be used. These regions must be included when copying a process's memory space and again when the process is restored. This is expensive since the C library dwarfs many programs in size.

Here is an example memory space for the program sleep. This is taken directly from /proc/pid/maps.

08048000-08049000 r-xp 00000000 03:01 288816     /bin/sleep
08049000-0804a000 rw-p 00000000 03:01 288816     /bin/sleep
40000000-40012000 r-xp 00000000 03:01 911381     /lib/ld-2.1.2.so
40012000-40013000 rw-p 00012000 03:01 911381     /lib/ld-2.1.2.so
40017000-40102000 r-xp 00000000 03:01 911434     /lib/libc-2.1.2.so
40102000-40106000 rw-p 000ea000 03:01 911434     /lib/libc-2.1.2.so
40106000-4010a000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0

The total size of the memory space for this trivial program is 1089536 bytes. All but 32K of that comes from shared libraries, and VMADump takes advantage of this.
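The sizes quoted for the sleep example can be checked by adding up the address ranges in the listing. The sketch below parses the same text. The helper parse_maps is a name chosen for this example, and treating every path under /lib/ as a shared library is an assumption that happens to fit this listing.

```python
MAPS = """\
08048000-08049000 r-xp 00000000 03:01 288816     /bin/sleep
08049000-0804a000 rw-p 00000000 03:01 288816     /bin/sleep
40000000-40012000 r-xp 00000000 03:01 911381     /lib/ld-2.1.2.so
40012000-40013000 rw-p 00012000 03:01 911381     /lib/ld-2.1.2.so
40017000-40102000 r-xp 00000000 03:01 911434     /lib/libc-2.1.2.so
40102000-40106000 rw-p 000ea000 03:01 911434     /lib/libc-2.1.2.so
40106000-4010a000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0
"""


def parse_maps(text):
    """Return (start, end, perms, path) for each line of a maps listing."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start, end = (int(addr, 16) for addr in fields[0].split("-"))
        # Anonymous regions have no pathname column.
        path = fields[5] if len(fields) > 5 else None
        regions.append((start, end, fields[1], path))
    return regions


regions = parse_maps(MAPS)
total_bytes = sum(end - start for start, end, _, _ in regions)
library_bytes = sum(
    end - start
    for start, end, _, path in regions
    if path is not None and path.startswith("/lib/")  # assumption for this listing
)
other_bytes = total_bytes - library_bytes

print("total:    ", total_bytes)
print("libraries:", library_bytes)
print("other:    ", other_bytes)
```

The remainder consists of the two /bin/sleep regions, one anonymous region, and the stack.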

VMADump can avoid copying these memory regions when migrating a process to a remote machine if we are willing to guarantee that the libraries they are mapped from are present on the remote machine. Instead of storing the data contained in each of these regions, it stores a reference to the regions. When the image is restored, those files will be mmapped to the same memory locations.

In order for this optimization to work, VMADump must know which files it can expect to find in the location where they are restored. VMADump has a list of files which it presumes are present on remote systems. The vmadlib utility exists to manage this list.
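The decision described above can be modeled in a few lines. This is a toy model only: it does not reflect VMADump's actual dump format or the format of the list managed by vmadlib, and the names REGIONS, plan_dump and bytes_to_send are hypothetical. It also ignores pages of a library mapping that the process has modified. Each region from the sleep example becomes either a reference (file, offset, address) or a block of data to copy.

```python
# (start, end, file offset, path) taken from the sleep example listing.
REGIONS = [
    (0x08048000, 0x08049000, 0x00000000, "/bin/sleep"),
    (0x08049000, 0x0804A000, 0x00000000, "/bin/sleep"),
    (0x40000000, 0x40012000, 0x00000000, "/lib/ld-2.1.2.so"),
    (0x40012000, 0x40013000, 0x00012000, "/lib/ld-2.1.2.so"),
    (0x40017000, 0x40102000, 0x00000000, "/lib/libc-2.1.2.so"),
    (0x40102000, 0x40106000, 0x000EA000, "/lib/libc-2.1.2.so"),
    (0x40106000, 0x4010A000, 0x00000000, None),
    (0xBFFFE000, 0xC0000000, 0xFFFFF000, None),
]


def plan_dump(regions, known_files):
    """Choose, per region, between a file reference and a data copy."""
    plan = []
    for start, end, offset, path in regions:
        if path in known_files:
            # File is presumed present remotely: record where to re-map it.
            plan.append(("ref", start, end, path, offset))
        else:
            # Anything else has to travel with the dump.
            plan.append(("data", start, end))
    return plan


def bytes_to_send(plan):
    """Total size of the regions whose contents must be copied."""
    return sum(entry[2] - entry[1] for entry in plan if entry[0] == "data")


known = {"/lib/ld-2.1.2.so", "/lib/libc-2.1.2.so"}
with_list = bytes_to_send(plan_dump(REGIONS, known))
without_list = bytes_to_send(plan_dump(REGIONS, set()))

print("with library list:   ", with_list)
print("without library list:", without_list)
```

With both libraries on the list only the 32K of non-library regions is copied. With an empty list the whole memory space is copied.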

Limitations / Important Details

Note that VMADump will correctly handle regions mapped with MAP_PRIVATE that have been written to.
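Written MAP_PRIVATE regions need care because a write to a private file mapping changes only the process's copy of the page and never the file. Re-mapping the file on another machine would therefore not reproduce the modified contents. The sketch below shows this general mmap behaviour on a temporary file. It says nothing about how VMADump itself stores such pages, and the function name is made up for this example.

```python
import mmap
import tempfile


def private_write_stays_in_memory():
    """Write through a MAP_PRIVATE mapping and compare with the file."""
    with tempfile.TemporaryFile() as f:
        f.write(b"original" + b"\0" * (mmap.PAGESIZE - 8))
        f.flush()
        m = mmap.mmap(
            f.fileno(),
            mmap.PAGESIZE,
            flags=mmap.MAP_PRIVATE,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
        m[0:8] = b"modified"      # copy-on-write: only our page changes
        in_memory = bytes(m[0:8])
        f.seek(0)
        on_disk = f.read(8)       # the file still has the old bytes
        m.close()
    return in_memory, on_disk


in_memory, on_disk = private_write_stays_in_memory()
print("mapping:", in_memory)
print("file:   ", on_disk)
```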

VMADump does not specially handle shared memory regions. A copy of the data within the region will be included in the dump. No attempt to re-share the region will be made at restoration time. The process will get a private copy.
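The consequence of getting a private copy can be shown with two shared mappings of one file. This is a sketch of the general effect, not of VMADump. Here the "dump" is simply the bytes of the region and the "restore" is a fresh private anonymous mapping filled with those bytes. The function name and the ping/pong values are made up for this example.

```python
import mmap
import tempfile


def dump_and_restore_loses_sharing():
    """Show that a restored copy of a shared region is no longer shared."""
    size = mmap.PAGESIZE
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * size)
        f.flush()
        ours = mmap.mmap(f.fileno(), size, flags=mmap.MAP_SHARED)
        peer = mmap.mmap(f.fileno(), size, flags=mmap.MAP_SHARED)

        ours[0:4] = b"ping"
        before = bytes(peer[0:4])   # the peer sees the write

        image = bytes(ours)         # "dump": just the data in the region
        restored = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)
        restored[:] = image         # "restore": a private copy

        restored[0:4] = b"pong"
        after = bytes(peer[0:4])    # the peer does not see this write
        kept = bytes(restored[0:4])

        for m in (ours, peer, restored):
            m.close()
    return before, after, kept


before, after, kept = dump_and_restore_loses_sharing()
print("peer before dump:   ", before)
print("peer after restore: ", after)
print("restored region has:", kept)
```

The data survives the copy, but writes made afterwards are visible only to the process that owns the restored region.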

VMADump does not save or restore any information about file descriptors.

VMADump will only dump a single thread of a multi-threaded program. There is currently no way to dump a multi-threaded program in a single dump.