HTB 3 Linux queuing discipline developement

Martin Devera aka devik (devik@cdi.cz)
Well, what is rationale for new HTB ? Current HTB version is slower than CBQ - sometimes 2 sometimes 5 times. It is because CBQ algorithm is less precise in favor of speed. I don't want to make HTB less precise because people are using it just because of CBQ's impreciseness.

Let's rewiew the goal for simple setup:

       A
      / \
     B   C
        / \
       D   E
	  
Class   Guaranted rate  Ceil rate  Priority   Level
A       100             100                   2
B       60              80         Low        0
C       40              80                    1
D       10              60         High       0
E       30              60         Low        0
Guaranted rate will be satisfied always when class has enough demand. Ceil rate will never be exceeded. Higher prio leaves are sent first relative to other leaves borrowing from the same level. Lower levels are completely dequeueued before borrowing from higher levels.
Levels are 0 for leaves and each classe's children has the same lower level. So that F being child of A and not being leaf would have level 1.

CBQ's way

Treat B,D and E as leaves and link them into priority list - so that B and E are in one low prio circular list and E in hi prio list (it is alone there). Only active (backlogged) classes are in lists. CBQ maintains global variable toplevel which tells you which level you can borrow from. It is needed to handle priority (or hi prio leaf could starve all other classes).
Assume that C,D are overloaded so that you can't allow D to borrow from A - you have to allow B and E to use their own rate (if any) to send and only when all leaves (BDE) need to borrow from A then D will be served first and can consume whole A.

Toplevel can't be computed precisely (it is O(N) problem) so that CBQ aproximates it in O(1) time but the aproximation allows some weird errors for some types of config and data.
Contrary to popular belief CBQ is not O(1) algorithm because whenever toplevel < max_level some classes have to be skipped in round robin lists - there is at most L classes to skip (where L is number of active leaves).
Also note that CBQ doesn't do ceiling.

HTB1/2 (now in use) way

Couple classes into prio lists in the same way as CBQ does. Then go thru this list (in prio order) and test whether leaf can send. When we find such leaf we are done. It is the same what CBQ does with toplevel == 0.
If we haven't found leaf with enough rate loop again and try whether someone can send by borrowinf from level 1 classes. If still none continue with level 2 .... Simple and slow. Actualy I do only one loop (level 0) and remember the minimal level where is someone who can send. Example:

Try D,B and E in this order - if one can send without borrowing then use it. If none can we found (by testinf parents) that D can borrow from C (level 1), B from A (l 2) and E can't because of ceil. Minimal level is 1 so that let's C to go.

It is more complicated for both CBQ and HTB because we can select leaf to send from but it may refuse to give us packet (like TBF). So ve have to disable it and try again.

From performance view while CBQ has worst case complexity O(L) HTB has average complexity O(L). A measured that for L=3000 classes HTB is 7 times slower than CBQ. Which can become problem on multimegabyte networks.

HTB3 first attempt - hierarchical stop-trees

The first attempt for quaranted O(log L) algorithm. Each inner class would have stoplist - list of all leaves which can send by borrowing from the class. I extracted relatively complex algorithm for shuffling leaves between lists. Each list is actualy binary tree sorted by DRR quantum. The problem was too high complexity and L1 cache pollution by frequent moving of big structures. Also problem with DRR.
Three months spent here.

HTB3 second attempt - hierarchical toplevel

Here I tried to compute toplevel (like in CBQ) but precisely. I did it by maintaining per class and prio counts of leaves depending on class to lend it some bw. Also I maintain sum of these counters per level. Toplevel value is then first level with nonzero counter. Medium complexity, relatively fast but only 30-50% benefit. Still 3 times slower than CBQ. There is speed penalty like in CBQ: we know level at which to dequeue but we still have to skip many active classes to find the one which caused toplevel to this value.

HTB3 third attempt - hierarchical feeds

Each inner class has one feed-list per priority. There are classes whose want to borrow from the class on the list. For each possible level/prio there is also global "self" list. All classes (inner and leaves) which can send itself (actual_rate < class_rate) and are active are in one of self list. Inner class thus can be either on self list or parent's feed list or nowhere.
Because send status can change in time (overlimit class was delayed long enough) all classes which are not on self list are also in per-level timer list (sorted by time of next change in status).
Dequeuing packet now means only to find lowest nonempty feed list, for non-leaf class go thru feed list down to leaf and dequeue it. Update leaf's DRR deficit and optionaly rotate actual class in ancestor feed+self lists (ensures proper sharing of borrowed rate). Feed, self and delay lists are implemented by rb_trees (these are part of 2.4 kernels). It seems to be really O(log L) algorithm ;).

TODO: 
- traversing of self/feed during dequeue and roatting of various lists
  (htb_lookup_leaf and htb_dequeue_tree)
- testing/benchmarking
If you want to hack on it install latest HTB2 and replace sch_htb.c by one below. The tc tool is also the same.
Code for 2.4.17 (should work now)

Hierarchical feeds aproach rules !

Now the new code seems to work. It seems that all tests agree with my assumptions and wishes. It is fast, stable and yet more precise. Look HERE to see graphs and explanations.