How to find the best VM strategy for Linux

Brain core dump of Martin Devera aka devik (devik@cdi.cz)

Why have I written this? I have some advanced ideas about the VM subsystem but no time to test them. I wrote them down here to let others think about them and to make sure that a potentially good idea will not be lost.

Goals

Create a framework which allows capturing a "VM fingerprint" of a running Linux system. The fingerprint can later be fed to a userspace program which analyzes it and computes the best page frame allocation for the virtual memory pages.

The first goal is to find, for a given system load pattern, the best possible page allocation scheme. Then we can say, for example, that the current AA VM is only 20% slower than the best case and that we are comfortable with it. If we find instead that the best case is 3 times faster, either always or for some load scenarios (see what I mean by speed below), then the second goal applies.

The second goal is to find a cheap algorithm which would approximate the best behaviour.

Definitions

Performance or speed of the system is defined here as the number of hard faults (page reads or writes to/from swap or an mmaped file). This should be OK as long as we can assume that the actual VM algorithm has negligible CPU overhead and that the CPU overhead of the processes doesn't depend on page in/out timing (true for most cases).

A VM action is one of: memory read, memory write, process wakeup, process wait and process sleep.

A VM fingerprint of a system is defined as a parallel run of sequences. Each sequence represents one running process. The sequence elements are the VM actions of that process.
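
To make this concrete, here is one possible in-memory representation of such a fingerprint. It is only a sketch; the class and type names are made up for illustration and are not part of any existing tool.

from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class R:                    # memory read at a (page-granular) address
    addr: int

@dataclass
class W:                    # memory write at a (page-granular) address
    addr: int

@dataclass
class S:                    # simulated CPU work, in relative time units
    ns: int

@dataclass
class U:                    # wake up another sequence (fork or explicit wakeup)
    seq: str

@dataclass
class D:                    # wait until all of these sequences wake us
    seqs: Tuple[str, ...]

@dataclass
class E:                    # process exit: its page frames may be reused
    pass

Action = Union[R, W, S, U, D, E]

@dataclass
class Sequence:             # one running process and its recorded VM actions
    name: str
    actions: List[Action]

Fingerprint = List[Sequence]    # the parallel run of all captured sequences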

Example of fingerprint

Assume a simple scenario represented by this shell script:
# this script is named sortls
ls -lR / | sort -u
When you run the script with the ./sortls command, the fingerprint captured during execution might read:
Seq. name   Actions
sortls      U(sortls) U(ls) S(10) W(0x343) R(0x432) U(sort)  D(ls,sort)
ls          D(sortls)             W(0x123) D(sort)           R(0x34)
sort        D(sortls)                               R(0x234) U(ls)

Key: D(x) - wait for sequence(s) x     W/R(addr) - write/read memory
     U(x) - wakeup sequence x          S(ns) - sleep some time (simulate CPU work)
The "ls" and "sort" sequences start by waiting to be awakened by the sortls script. "sortls" starts by waking itself immediately (it is the "root" of all tasks here) and forks ls (the U action). Then it computes something for 10ns, performs a memory write and a memory read, and spawns sort. Finally it waits for both of them to complete.
The wait/wakeup actions are needed here because VM subsystem requests depend on the process relationships: if we change the paging layout (create a new one), all reads and writes would occur at different times. So we don't depend on wall-clock time but rather record timing in this interdependent-processes way.

Note that there will be as many sequences as there were processes created during the capture.
Of course we would also need an exit event E to say that a process has terminated and its VM pages can be used again.
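
Using the hypothetical classes sketched above, the sortls example (with an explicit E appended to every sequence) could be written down as data like this:

fingerprint = [
    Sequence("sortls", [U("sortls"), U("ls"), S(10), W(0x343), R(0x432),
                        U("sort"), D(("ls", "sort")), E()]),
    Sequence("ls",     [D(("sortls",)), W(0x123), D(("sort",)), R(0x34), E()]),
    Sequence("sort",   [D(("sortls",)), R(0x234), U("ls"), E()]),
]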

How to capture fingerprint under Linux

This part needs to be finished - I have only some ideas.
The R and W events can be obtained by scanning PTEs on each context switch (so we scan only one process at a time), and the addresses recorded must include some identifier allowing us to catch shared memory accesses (including file cache accesses). The scan will be done with finite time granularity, but that is what we want anyway.
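
As a rough userspace approximation of this idea (not the in-kernel PTE scan itself), current kernels let us clear the referenced bits of a process via /proc/<pid>/clear_refs and then read the Referenced field of /proc/<pid>/smaps to see which mappings were touched. This only gives per-mapping granularity, and these interfaces may not have existed when this text was written, so take it purely as an illustration of the sampling loop:

import time

def referenced_mappings(pid, interval=0.1):
    """Return (address_range, referenced_kB) for mappings touched recently."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("1")                       # clear accessed bits (needs privileges)
    time.sleep(interval)                   # the finite time granularity mentioned above
    touched, current = [], None
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if not parts[0].endswith(":"):
                current = parts[0]         # "start-end" of a mapping
            elif parts[0] == "Referenced:" and int(parts[1]) > 0:
                touched.append((current, int(parts[1])))
    return touched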

The D and U events (wait & wakeup) are generated by process forking and by waitpid (and similar calls). They are also generated when one process waits on another (a blocking pipe or Unix socket write/read).
The only problem I see here is related to polling. If process A polls on a pipe to B with a timeout, this can be hard to express because we can't see what the process would do after the timeout.
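
One way to get the D and U events without touching the kernel might be to run the workload under strace and translate the process-management syscalls into wakeup/wait actions. The sketch below is hypothetical, ignores the polling problem entirely and assumes strace's usual "pid syscall(args) = ret" output format:

import re, subprocess

def capture_ud(cmd, trace_file="trace.out"):
    """Run cmd (a list, e.g. ["./sortls"]) under strace and extract U/D events."""
    subprocess.run(["strace", "-f", "-e", "trace=process",
                    "-o", trace_file] + cmd, check=True)
    events = []                                      # (pid, "U"/"D", other_pid)
    for line in open(trace_file):
        m = re.match(r"(\d+)\s+(\w+)\((.*)\)\s+=\s+(-?\d+)", line)
        if not m:
            continue
        pid, syscall, _args, ret = m.groups()
        if syscall in ("fork", "vfork", "clone") and int(ret) > 0:
            events.append((int(pid), "U", int(ret)))   # parent wakes new child
        elif syscall in ("wait4", "waitpid") and int(ret) > 0:
            events.append((int(pid), "D", int(ret)))   # parent waited for child
    return events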

An S (sleep) action is generated just before each R|W|D|U to say how much user time the process needed since the last non-S action. It allows us to account for CPU work later (only relative values are important).
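
The relative user time itself can be sampled from userspace, for example from the utime field of /proc/<pid>/stat; the S value is then simply the delta since the previous non-S action. A minimal helper (field offsets per proc(5)):

def utime_ticks(pid):
    with open(f"/proc/{pid}/stat") as f:
        line = f.read()
    # comm (field 2) may contain spaces, so split after the closing ')'
    rest = line.rsplit(")", 1)[1].split()
    return int(rest[11])                   # field 14 = utime, in clock ticks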

After some time we could have a library of such fingerprints for various loads (Oracle, kernel compile, Mozilla, ...) and developers could use them to create and tune VM models in userspace.

How to compute the best possible page allocation

Given a fingerprint we can compute the best possible allocation onto a fixed number of page frames.
One can employ dynamic programming for this. Because we can look "into the future" (we have the whole fingerprint), our mission is simple. For each W/R which causes a fault we release the frame which causes minimal disk I/O per unit of time. That means we look into the future and for each candidate frame compute:
 io = (is_dirty + 1) / time_to_nearest_future_hit
and we use the frame with minimal io. I didn't attempt to prove that this condition is a valid predicate for dynamic programming, but it seems it should be OK.
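
This is essentially Belady's optimal replacement with an extra weight for dirty pages (a dirty victim costs a write-out as well as the later read). A minimal sketch of the victim selection, assuming we already know each resident frame's dirty flag and distance to its next access in the fingerprint:

def pick_victim(frames):
    """frames: dict page -> (is_dirty, time_to_nearest_future_hit).
    Pages never accessed again should use float('inf'), giving io = 0,
    so they are evicted first."""
    def io(page):
        dirty, dist = frames[page]
        return (dirty + 1) / dist
    return min(frames, key=io)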

As we go through the fingerprint we simulate the behaviour of the system by selecting the process with the smallest S value (when all are in the S state), and we have to handle waits and wakeups.
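
A skeleton of that replay loop, building on the hypothetical classes and pick_victim() above: it always advances the runnable sequence with the smallest accumulated virtual time, blocks it on D and unblocks targets on U. Page-fault handling would be plugged into the R/W branch; this is only an illustration of the bookkeeping.

def replay(fingerprint):
    seqs    = {s.name: s for s in fingerprint}
    pos     = {n: 0 for n in seqs}        # index of the next action
    vtime   = {n: 0 for n in seqs}        # accumulated S time per sequence
    signals = {n: set() for n in seqs}    # wakeups received but not yet consumed
    waiting = {}                          # name -> set of sequences still waited for

    def runnable(n):
        return n not in waiting and pos[n] < len(seqs[n].actions)

    while any(runnable(n) for n in seqs):
        name = min((n for n in seqs if runnable(n)), key=lambda n: vtime[n])
        act = seqs[name].actions[pos[name]]
        pos[name] += 1
        if isinstance(act, S):
            vtime[name] += act.ns
        elif isinstance(act, U):
            signals[act.seq].add(name)
            if act.seq in waiting:
                done = waiting[act.seq] & signals[act.seq]
                waiting[act.seq] -= done
                signals[act.seq] -= done
                if not waiting[act.seq]:
                    del waiting[act.seq]
        elif isinstance(act, D):
            need = set(act.seqs) - signals[name]
            signals[name] -= set(act.seqs)
            if need:
                waiting[name] = need
        elif isinstance(act, (R, W)):
            pass    # look the page up; on a hard fault, call pick_victim()
        elif isinstance(act, E):
            pass    # release this process's page frames for reuse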

Hence for each fingerprint we can compute the cheapest possible paging strategy.

How to develop paging algorithm which implements the best strategy

When we have enough fingerprints and the best allocations computed in the previous step, we can use several methods to synthesize an appropriate function and algorithm.

Right now I can think of the least squares method to find the set of parameters which affect the frame selection and how. Then look at the resulting function and try to create a conforming algorithm.
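
As a toy illustration of that idea, one could collect, at every eviction point of the optimal offline run, a feature vector per candidate frame (age since last access, dirty bit, reference count, ...) together with a target of 1.0 for the frame the optimal strategy evicted and 0.0 for the frames it kept, and fit a linear scoring function to it. The feature names are made up; this only shows the least-squares step:

import numpy as np

def fit_eviction_model(samples):
    """samples: list of (features, target) pairs as described above."""
    X = np.array([f for f, _ in samples], dtype=float)
    y = np.array([t for _, t in samples], dtype=float)
    coef, _residuals, _rank, _sv = np.linalg.lstsq(X, y, rcond=None)
    return coef          # weights of a candidate online eviction score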

This part should be easy.

Problems

The main problem is in the fingerprint representation. I'm not sure whether the wait/wakeup model generalizes all common process interdependencies, and whether it is possible to compute them by hooking syscalls only (and not looking inside machine code).

It would be nice if someone experienced in process synchronization could shed more light on this.

Thanks for reading, and if you have something to say about it, drop me an email.
devik