HTB 3 Linux queuing discipline developement
Martin Devera aka devik (devik@cdi.cz)
Well, what is rationale for new HTB ? Current HTB version is slower than
CBQ - sometimes 2 sometimes 5 times. It is because CBQ algorithm is less
precise in favor of speed. I don't want to make HTB less precise because
people are using it just because of CBQ's impreciseness.
Let's rewiew the goal for simple setup:
A
/ \
B C
/ \
D E
Class Guaranted rate Ceil rate Priority Level
A 100 100 2
B 60 80 Low 0
C 40 80 1
D 10 60 High 0
E 30 60 Low 0
Guaranted rate will be satisfied always when class has enough demand. Ceil
rate will never be exceeded. Higher prio leaves are sent first relative to
other leaves borrowing from the same level. Lower levels are completely
dequeueued before borrowing from higher levels.
Levels are 0 for leaves and each classe's children has the same lower level.
So that F being child of A and not being leaf would have level 1.
CBQ's way
Treat B,D and E as leaves and link them into priority list - so that B and E
are in one low prio circular list and E in hi prio list (it is alone there).
Only active (backlogged) classes are in lists.
CBQ maintains global variable toplevel which tells you which level you
can borrow from. It is needed to handle priority (or hi prio leaf could starve
all other classes).
Assume that C,D are overloaded so that you can't allow D to borrow from A - you
have to allow B and E to use their own rate (if any) to send and only when
all leaves (BDE) need to borrow from A then D will be served first and can consume
whole A.
Toplevel can't be computed precisely (it is O(N) problem) so that CBQ aproximates
it in O(1) time but the aproximation allows some weird errors for some types of
config and data.
Contrary to popular belief CBQ is not O(1) algorithm because whenever toplevel <
max_level some classes have to be skipped in round robin lists - there is at most
L classes to skip (where L is number of active leaves).
Also note that CBQ doesn't do ceiling.
HTB1/2 (now in use) way
Couple classes into prio lists in the same way as CBQ does. Then go thru this list
(in prio order) and test whether leaf can send. When we find such leaf we are done.
It is the same what CBQ does with toplevel == 0.
If we haven't found leaf with enough rate loop again and try whether someone can send
by borrowinf from level 1 classes. If still none continue with level 2 ....
Simple and slow. Actualy I do only one loop (level 0) and remember the minimal level
where is someone who can send. Example:
Try D,B and E in this order - if one can send without borrowing then use it.
If none can we found (by testinf parents) that D can borrow from C (level 1),
B from A (l 2) and E can't because of ceil. Minimal level is 1 so that let's
C to go.
It is more complicated for both CBQ and HTB because we can select leaf to send
from but it may refuse to give us packet (like TBF). So ve have to disable it
and try again.
From performance view while CBQ has worst case complexity O(L) HTB has average
complexity O(L). A measured that for L=3000 classes HTB is 7 times slower than
CBQ. Which can become problem on multimegabyte networks.
HTB3 first attempt - hierarchical stop-trees
The first attempt for quaranted O(log L) algorithm. Each inner class would have
stoplist - list of all leaves which can send by borrowing from the class. I extracted
relatively complex algorithm for shuffling leaves between lists. Each list is actualy
binary tree sorted by DRR quantum. The problem was too high complexity and L1 cache
pollution by frequent moving of big structures. Also problem with DRR.
Three months spent here.
HTB3 second attempt - hierarchical toplevel
Here I tried to compute toplevel (like in CBQ) but precisely. I did it by maintaining
per class and prio counts of leaves depending on class to lend it some bw. Also I maintain
sum of these counters per level. Toplevel value is then first level with nonzero counter.
Medium complexity, relatively fast but only 30-50% benefit. Still 3 times slower than CBQ.
There is speed penalty like in CBQ: we know level at which to dequeue but we still have
to skip many active classes to find the one which caused toplevel to this value.
HTB3 third attempt - hierarchical feeds
Each inner class has one feed-list per priority. There are classes whose want to borrow from
the class on the list. For each possible level/prio there is also global "self" list. All
classes (inner and leaves) which can send itself (actual_rate < class_rate) and are
active are in one of self list. Inner class thus can be either on self list or parent's
feed list or nowhere.
Because send status can change in time (overlimit class was delayed long enough) all classes
which are not on self list are also in per-level timer list (sorted by time of next change
in status).
Dequeuing packet now means only to find lowest nonempty feed list, for non-leaf class
go thru feed list down to leaf and dequeue it. Update leaf's DRR deficit and optionaly
rotate actual class in ancestor feed+self lists (ensures proper sharing of borrowed rate).
Feed, self and delay lists are implemented by rb_trees (these are part of 2.4 kernels).
It seems to be really O(log L) algorithm ;).
TODO:
- traversing of self/feed during dequeue and roatting of various lists
(htb_lookup_leaf and htb_dequeue_tree)
- testing/benchmarking
If you want to hack on it install latest HTB2 and replace sch_htb.c by one below.
The tc tool is also the same.
Code for 2.4.17 (should work now)
Hierarchical feeds aproach rules !
Now the new code seems to work. It seems that all tests agree with my assumptions
and wishes. It is fast, stable and yet more precise. Look
HERE to see graphs and explanations.