What is HTB

I developed Linux qdisc which I call HTB a year ago. It was nothing special ..
Now I have qdisc which has very special and nice propertias and due lack of imagination a call it HTB again (it superseedes old one).

I got the idea while reading Floyd's paper about CBQ. I suddenly realized that I can create algorithm which solves goals of link sharing given in Floyd'd paper (called formal guidelines) and the algorithm is simpler more robust and more precise that top-level algorithm.
It was 1.9.2001.
3.9. I implemented basic idea
4.9. first results are here about 10 oops repaired.
5.9. (today) tested real link sharing which surprisingly behaves better than I ever expected

Why I have written it ?

I created it for one relatively big ISP which uses many linux routers. And CBQ sometimes give them weird results and they also need cross device sharing and class "ceiling".

How does it differ from CBQ

HTB solves the same problem as CBQ. The advantages are:

HTB does use TBF as estimator instead of EWMA of idle time. As result one need to set only rate and burst (no ewma coef, max/minburst ..). It is simpler and more intuitive IMHO.
TBF is more precise in estimating rate. See results bellow.
TBF doesn't need precise end of send time as ewma-idle needs. So that ANK's time integrator is not needed. As result HTB works robustly on all network devices (including loopback and wireless) where CBQ gives weird and wrong rates.
CBQ implements top-level aproximation from Floyd's paper. HTB uses totaly different approach which implements formal sharing (exactly, no aproximation). Yes I did the proof.
HTB allows cross-device bandwidth sharing. Code is in place but not tested. I will finish it in few days.
HTB allows precisely control borrowing. You set minimal rate which class will ALWAYS get (with no or minimal delay) and upper (ceil) rate toward which we can approach by borrowing from parent. These can be set for both leaves and inner nodes.
Proven characteristic of resulting flow. It conforms to TBF with rate equal to sum of rates of all roots and burst which is sum of all bursts over all nodes. It is the worst case. When there are active "ceils" then the total burst is even lower (I will formulate it later).
tc class show shows actual byte and packet rate measured over 10 sec.

Well we are not in fantasy world so that there are disadvantages:

It is new thing. It can crash your router, damage your brain or start atomic war. Testing is needed.
It is slower than CBQ. How much slower ? About N times where N is depth of your sharing structure. But the computations are simpler so that I expect the speed to be nearly the same. I will measure it later.

Principle

Let's recapitulate knowledge from Floyd's paper on CBQ. We have tree of classes, each class has rate and priority. The packets are divided into flows and attached to leaves if tree. We will assume (and HTB mandates it) that tree is complete - each leaf has the same depth.
The goal is to schedule packets in such way that each class is satisfied. We can use abstraction and describe algorithm in this way:
First select all leaves whose rate is not yet reached. Send packets from them starting with high priority continuing to lower prio leaves (use DRR for these with the same priority).
When we rates of all leaves are exceeded then repeat whole cycle but test for each leaf whether it can borrow from its parent instead of testing its own rate.
When no such class remains then repeat cycle again but borrow from grandparent. And so on ..

Some important observations here. When class A has higher priority than B and both are borrowing from C then A gets as much as it can and remainder is served to B (but B still gets its own rate). If B has the same priority as A then borrowed bandwidth from C would be divided between A and B in the same proportion as their own base rates are.

The CBQ solves the goal by maintaining top-level variable and cycles only over leaves. At each leaf try to borrow from ancestors up to level top-level. When you update top-level to be upper level below which there are no leaves which could send without borrowing (unsatisfied in Floyd's terms) then result is essentialy the same. Top-level is faster but less precise way.
The "rate conformness" is measured by measuring time between packet edges. Note that it is very hard to implement precisely.

In HTB I use the "abstraction" above directly. I really cycle over leaves more times. It is slower but is allows me to do thinks more precisely (I implement private DRR values for each level).
It uses TBF estimators which are easy to implement even on hardware without tsc (still with reasonable precision).
I added notion of ceil rate (ceil of rate the class can ever get). So that the class can always use bandwidth between its base and ceil rate values.
It is simply implemeted by disallowing borrowing when class is over this rate (both leaves and interiors can have ceil). Note that when ceil is tha same as base rate it is the same as bounded keyword in CBQ.

Cross device sharing in HTB

I really needed this. We have kernel-wide global set of slots (there is limited number of them, 4 for now). You can select class and mark it as owner of slot (by tc ... slot N). Each slot can have just one owner.
Then you can link any root class of other devices (interfaces) to be "child" of slot. Such root class will then borrow bandwidth from slot's owner. But how to divide borrowed bandwidth ?
This was hard problem to solve. It's because each device can demand packet at any time and so we can't coordinate borrows by DRR. When both (or more) devices have permanent demand the result is that shared bandwidth is stolen by one of it. It is the same problem as with rate policers.
I decided to maintain actual rate for each class (recalculated every 100ms). Now when anyone wants to borrow from slot owner I compute sum of base and actual rates of all slot's active children. Let R be ratio of that sums. Then R tells us "how much times we are actualy overusing base rate". Now multiply requestor base rate by R and you got actual rate which requestor should be using now. Compare with actual requestor's rate and decide whether it can borrow from slot (or slot's ancestors).
It has several drawbacks. It reacts slower than DRR (see charts from 9.9. bellow). The rate guard is applied only to slot owner not to ancestors so that slot owner should be root class.
But is works and is excelent solution for agency sharing router with more ethercards or ppp links.

Measuring and results

It is primary goal of this "paper" to explain you briefly what HTB is and amaze you with results :-) I really want a few testers ..

How I measured

I wrote my own measuring trafic generator. I has taken me 2 hours, code is hairy but works well. I use two 10Mbit ethernet cards in single Pentium II machine. Cards are connected by crossover cable (I have special RJ connector which does loopback on single card but I forgot it in work). Because IP doesn't work in this constelation (routing) my program creates flow of special ethernet frames with protocol set to ETH_P_CUST and filled with packets start time (to measure delay) and flowid (to use qos filters). The same program send and recieve the flows thru single RAW socket and measures various properties of each packet and flow. The generator is controled by simple "script". The one used for graphs bellow is:

0	P	0	0x10002	# set flow's 0 and 1 SO_PRIORITY which qdisc
0	P	1	0x10003 # can use to switch between classes 1:2 and 1:3
0	S	0	1000	# both flows has pasket size 1kB
0	S	1	1000
0	R	0	90000	# start flow 0 generating 90 kBps
4000	R	1	50000	# at time 4s start flow 1 at 50 kBps
#7000	R	1	70000
9000	R	0	0	# at 9s stop flow 0
14000	X	0	0	# at 14s terminate measuring

IMO it is intuitive.
Simulator outputs ascii data suitable to gnuplot. Also because of imature code of simulator, the x axis labeling is off sync from event starts (about 0.5 sec). But it is not problem if you know what you want to see.

CBQ results

Here is code to setup CBQ for first simulation:

tc qdisc add dev eth0 root handle 1: cbq bandwidth 10Mbit avpkt 1000
tc class add dev eth0 parent 1: classid 1:1 cbq bandwidth 10Mbit rate 100kbps maxburst 10 bounded avpkt 1000
tc class add dev eth0 parent 1:1 classid 1:2 cbq bandwidth 10Mbit rate 40kbps maxburst 10 avpkt 1000 weight 4 prio 1
tc class add dev eth0 parent 1:1 classid 1:3 cbq bandwidth 10Mbit rate 20kbps maxburst 10 avpkt 1000 weight 2 prio 1
tc qdisc add dev eth0 parent 1:2 handle 20: pfifo limit 5
tc qdisc add dev eth0 parent 1:3 handle 30: pfifo limit 5

Note the pfifos. These MUST be used because we send skbs from raw socket so that all packets sitting in queues are accounted against the socket. When you leave default 100 packet queues then send() syscall starts to block and measuring is imprecise (my simulator notices it and warns).
Also note that child classes use 60 (20+40) kBps while total is 100 kBps. This is equal to setup where another classes (of 40 kBps total) are present but idle (unused). It is intentional.

Above you see result. First notice that total rate is higher than 100 kBps. It is ewma-idle error. The shared rate is ok. According to weights the class A should get two times more than B:

rA = 40 + 40*2/3 ~= 67 kBps
rB = 20 + 40*1/3 ~= 33 kBps

Well let's look how to use priority value. I changed "prio 1" for 1:3 to "prio 0" to boost the class. Because generator generates 50 kBps for 1:3 and class A (1:2) has quaranted 40 kBps there is room bellow 100 kBps and class A should experience NO delay.

As you see it is almost right. Delay is not zero but is reasonably small.

HTB results

Here is setup:

tc qdisc add dev eth0 root handle 1: htb depth 2
AC="tc class add dev eth0 parent"
$AC 1: classid 1:1 htb rate 100kbps burst 2k
$AC 1:1 classid 1:2 htb rate 40kbps ceil 80kbps burst 2k prio 1
$AC 1:1 classid 1:3 htb rate 20kbps ceil 80kbps burst 2k prio 1

tc qdisc add dev eth0 parent 1:2 handle 20: pfifo limit 5
tc qdisc add dev eth0 parent 1:3 handle 30: pfifo limit 5

This is the same setup. I will debate the ceil parameter bellow.

You can see that traffic looks very similar. Let's discuss ceil setting. At time 1 to 3 only class A sends 90 kBps. The spike at 2 is due to "burst 2k" parameter. Then the flow is limited to 80 kBps mandated by ceil. So that class has its 40 kBps always and can borrow from parent up to 80 kBps. So that neither A nor B can saturate whole link.
At 4 B starts transmit and again as with CBQ the bandwidth is divided correctly but the 100 kBps limit is more precise.
The delay of B at 5 is a bit worse than with CBQ but it is about 15%. I'm not sure about reason.
What about priority:

Nothing interesting only as you can see the B's delay is impressive :-)

9.9.2001

I fixed numerous bugs. One of them is fundamental error in current CBQ implementation (and probably Floyd's too). They use common DRR independly of level they borrow from. This leads to sharing error about 10-20%. I hit the same problem and wanted to repair it by independent DRRs for all borrow levels. But this would lead to a bit complex implementation so that I tried to maintain only separate deficits. There is almost no speed penalty and error is bellow 5% now.

I tested cross device sharing. There are problems with sharing bw borrowed from another device because we can't tell foreign device when it should dequeue. So that I added rate meter (which you can see in tc show) and implemented heuristic weak sharing control. It simply calculates "ideal" rate which class should get (computed from rates of active classes) and deny borrowing from "slot" (see description of cross sharing) if actual rate is higher.
Let's see how it works in real. I have the same testing bed as described above. There is new generator setup:

0	i	0	eth0
0	i	1	eth0
0	i	2	eth1
0	P	0	0x10002
0	P	1	0x10003
0	P	2	0x10002
0	S	0	500
0	S	1	500
0	S	2	500
0	R	0	90000
2000	R	2	70000
4000	R	1	50000
9000	R	0	0
14000	X	0	0

The new "i" command binds output interface to flow. And qdisc setup:

tc qdisc add dev eth0 root handle 1: htb depth 2
tc qdisc add dev eth1 root handle 1: htb depth 2
AC="tc class add dev eth0 parent"
$AC 1: classid 1:1 htb rate 50kbps ceil 100kbps burst 2k pslot 1
$AC 1:1 classid 1:2 htb rate 30kbps ceil 100kbps burst 2k
$AC 1:1 classid 1:3 htb rate 10kbps ceil 100kbps burst 2k
AC="tc class add dev eth1 parent"
$AC 1: classid 1:1 htb rate 800000 burst 2k slot 1
$AC 1:1 classid 1:2 htb rate 30kbps ceil 100kbps burst 2k
tc qdisc add dev eth0 parent 1:2 handle 20: pfifo limit 2
tc qdisc add dev eth0 parent 1:3 handle 30: pfifo limit 2
tc qdisc add dev eth1 parent 1:2 handle 20: pfifo limit 2

You can see we have one "general" super-parent 1:1 on eth1 which offers it's bandwidth to cross device sharing (by "slot 1" parameter). It has 100kBps. It has direct child 1:2.
Eth0's root class can borrow from that eth1 super-parent as designated by parameter "pslot 1".

As you can see I've invented new gnuplot script which renders R events bellow x axis as band:new_rate. The generator's off by 0.5 problem is solved too.
At time 3.5 you can see flow 0 (eth0) and 2 (eth1) getting almost the same share but the should get 50:30 ratio. Why ? The cross device sharing is based on rate estimator with 6 sec averaging interval. It simply did not catch up yet.
At time 6 is starts to estimate well and ratio is correct (yellow vs. blue line).
As you see the ratios and total bandwidth is maintained very well now. I hope that how we need only to catch bugs :).

Conclusion

I hope you enjoy the qdisc and if you like it let me know.

I need someone to consult inter-qdisc locking issues.
I need testers.

Sources

diff against 2.2.17
diff against iproute2 rel 991023
C source of traffic simulator
example shell script to gnuplot results
example control program for simulator

Martin Devera aka devik devik@cdi.cz