Hierarchical token bucket
Martin Devera aka devik (devik@cdi.cz)
Abstract
This paper discusses another implementation of the CBQ packet scheduler [1].
It focuses mainly on scheduling precision and ease of implementation. I'll
show how to generalize the "bounded" parameter, replace the rate estimator
with a token bucket one, and introduce a new algorithm which effectively
implements formal link sharing (as described in [1], section 3). I assume
that the reader is familiar with CBQ theory.
Formal link sharing guidelines revisited
The goals described in [1] remain the same. Let's refresh them:
- Each interior or leaf class should receive roughly
its allocated link-sharing bandwidth over appropriate
time intervals, given sufficient demand.
- If all classes with sufficient demand have received
at least their allocated link-sharing bandwidth, the
distribution of any excess bandwidth should not be arbitrary
but should follow some set of reasonable guidelines.
As in the original paper, the second goal is not discussed here in depth.
This behavior depends on the scheduling algorithm used.
Most CBQ implementations use deficit round-robin (DRR) [4], in which excess
bandwidth is distributed according to the weight parameters specified.
I will address related issues later in this paper.
In [1], formal guidelines are described which implement the goals
above. They are repeated here for your convenience:
A class can continue unregulated if one of the following conditions holds:
- The class is not overlimit, OR
- The class has a not-overlimit ancestor at level i, and there
are no unsatisfied classes in the link-sharing structure at levels
lower than i.
(See [1] for definitions of these terms.)
The second rule ensures that we don't borrow bandwidth from a parent
class while there is a child which can itself send without borrowing.
Iterative approach
I argue that the following algorithm conforms to the formal guidelines.
Let's call it the iterative approach.
First we set L to 1 and use this rule to decide whether a class can send:
A leaf can continue unregulated only if there exists
no overlimit ancestor class at level L.
In the case where all leaves are overlimit (step 1 didn't yield a packet),
we increment L and try again until we either get a packet or L exceeds
the level of the root class.
It is not hard to prove the correctness of the algorithm above. For L=1 the
algorithm behaves exactly as the first rule of the formal guidelines.
To agree with the second rule we have to ensure that we don't
allow a leaf to borrow from a class at level N if another leaf exists
which has demand and can borrow from a class at a lower level.
In the iterative approach we always test lower levels before
higher ones, thus satisfying both formal sharing rules.
The algorithm can be implemented in a faster way. During the first
step (with L=1) we can remember the lowest level we can send at and
use it directly in the second pass (if the first one didn't yield
a packet). I call this the two-phase approach.
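The two-phase lookup described above can be sketched as follows. This is an illustrative model, not the actual implementation: class and function names are mine, leaves sit at level 1, and each class's overlimit flag is assumed to come from its estimator.

```python
class Cls:
    def __init__(self, level, overlimit, parent=None):
        self.level = level          # leaves are level 1, the root is highest
        self.overlimit = overlimit  # would be supplied by the estimator
        self.parent = parent

def send_level(leaf):
    """Lowest level at which the leaf can send: the level of the first
    not-overlimit class on the path from the leaf up to the root."""
    c = leaf
    while c is not None:
        if not c.overlimit:
            return c.level
        c = c.parent
    return float('inf')             # fully regulated, cannot send

def pick_sendable(active_leaves):
    """Two-phase variant: the first scan remembers the lowest usable
    level, then leaves able to send at that level are selected."""
    best = min((send_level(l) for l in active_leaves), default=float('inf'))
    if best == float('inf'):
        return []                   # every leaf is regulated
    return [l for l in active_leaves if send_level(l) == best]
```

Because lower levels are always preferred, a leaf that can send without borrowing (level 1) wins over one that must borrow from an inner class, which is exactly the ordering the second formal rule requires.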
Comparison between top-level and iterative approaches
The iterative approach implements the formal sharing rules exactly, while
top-level (proposed in [1]) uses a heuristic to approximate them. Hence
the iterative approach guarantees the best possible precision.
At first sight it is clear that the iterative approach is more expensive.
Top-level can determine class status immediately, while the two-phase variant
may need up to one full scan over all active leaves. The complexity
increases with the amount of borrowing done. However, my tests
showed that the difference in realistic cases is not so critical.
Precise top-level
While the algorithm described in the previous paragraphs is simple to implement,
it still has a worst-case complexity of O(N), where N is the number of active leaves.
Even an algorithm as complex as WF2Q [5] is faster (O(log N)).
The top-level heuristic as used in [1] has complexity O(1), but it is
inaccurate, especially when more flows are active.
Here I'll outline how to change the top-level heuristic to comply with
the formal guidelines as precisely as possible, with a
worst-case complexity of O(log N).
The key idea is to maintain one sorted list of classes per level. Each
list contains only the active classes from a single level, sorted by the
time at which they become underlimit. When we need
to know the top level, we can simply look at the heads of the sorted lists,
starting from the lowest level. The first level at which we find an
underlimit class becomes the new top level.
The ceiling makes it a bit more complicated, but it is still not hard
to prove the correctness of the idea above.
Note that this idea was not tested and should be treated as such.
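As a concrete (and, like the idea itself, untested) sketch of the per-level sorted lists, one could keep a heap per level ordered by the time each overlimit class becomes underlimit again. All names here are hypothetical; a real implementation would also have to remove or reinsert entries as classes are dequeued or change state.

```python
import heapq

class TopLevelIndex:
    def __init__(self, num_levels):
        # one heap per level (index 0 unused); each entry is
        # (time the class becomes underlimit, class id)
        self.heaps = [[] for _ in range(num_levels + 1)]

    def mark_overlimit(self, level, cls_id, t_underlimit):
        # an active overlimit class records when it will be underlimit again
        heapq.heappush(self.heaps[level], (t_underlimit, cls_id))

    def top_level(self, now):
        # the first (lowest) level whose earliest entry is already
        # underlimit becomes the new top level; only heads are inspected
        for level in range(1, len(self.heaps)):
            h = self.heaps[level]
            if h and h[0][0] <= now:
                return level
        return None
```

Each update is a single heap operation, O(log N), and finding the top level inspects at most one head per level, which is how the O(log N) worst case is obtained.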
Estimators
In [1] the authors state that the overlimit status of a class is determined
by an estimator, and they then discuss the idle-time-based estimator used
in the prototype implementation. The same estimator is used in all
CBQ implementations I'm aware of.
I see two problems with the idle-based estimator. First, it is relatively hard
to implement without a precise clock or without knowing the hardware speed
of the network device (for example, variable-rate modems or wireless adapters).
The second, minor "problem" is that the idle estimator's parameters are hard
to interpret (except for rate, of course).
I replaced the estimator with a classical token bucket one [2]. It is simple
to set up, as most network administrators are familiar with it. It
uses only two parameters (rate and burst), both of which have an unambiguous
physical meaning.
In my view, the major contribution is the fact that the token bucket
algorithm is simple to implement on systems with finite nonzero timer
granularity, and that it does not depend on the hardware rate of the
network device.
The token bucket algorithm has to remember not only positive tokens
(burst) but also negative ones, at least for non-leaf classes. This is
because a parent class can be overlimit while one of its children is
underlimit. Sending from the child causes the overlimit parent to become
even more overlimit, thus forcing its tokens negative.
To ensure that the overall goal of the scheduler is still satisfied, inner
classes must allow for negative tokens up to the sum of the bursts of all
descendant classes. This limit can, however, be set to an arbitrarily large
value, say 2 minutes, because it is unlikely that a user will use bursts
bigger than several seconds. Bursts are in this case specified in the time
domain, whereas they are usually expressed in bytes. These two views can be
converted in either direction using the token bucket's rate.
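A minimal sketch of such an estimator is shown below. It is illustrative only: field names are mine, the overlimit test is simplified to "tokens below zero", and the floor parameter stands for the negative limit discussed above (for an inner class, minus the sum of descendant bursts, convertible from time to bytes by multiplying by the rate).

```python
class TokenBucket:
    def __init__(self, rate, burst, floor=0.0):
        self.rate = rate      # tokens (e.g. bytes) added per second
        self.burst = burst    # positive cap on accumulated tokens
        # inner classes set floor to -(sum of descendant bursts in bytes)
        self.floor = floor
        self.tokens = burst
        self.t_last = 0.0

    def _refill(self, now):
        # accumulate tokens since the last event, capped at burst
        self.tokens = min(self.burst,
                          self.tokens + self.rate * (now - self.t_last))
        self.t_last = now

    def account(self, nbytes, now):
        # charge a transmission; an inner class may go below zero when an
        # underlimit child sends through it, but never below the floor
        self._refill(now)
        self.tokens = max(self.floor, self.tokens - nbytes)

    def overlimit(self, now):
        self._refill(now)
        return self.tokens < 0
```

Note how the negative balance recovers on its own: once enough time passes, refilling at `rate` brings the tokens back above zero and the class stops being overlimit, with no dependence on the device's hardware rate.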
Rate ceiling
The original paper distinguishes at least between bounded and non-bounded
classes. A bounded class can't borrow bandwidth from its parent. I generalized
this a bit and blurred the border between them. Each class has a second
estimator attached to it which I call "ceil". Each class can borrow as long
as it is under its ceil. Setting the ceil and rate estimators to the same
value is equivalent to a bounded class. Setting the ceil to "infinity" (or
another high value, like the link's speed) is equal to a non-bounded class.
The advantages of this approach are probably apparent. One can guarantee
1 Mbit to an agency and allow it to use up to 2 Mbit (the ceil) if the link
is idle. From the implementation side it is trivial: just another estimator
test in the underlimit-parent lookup loop.
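The extra test can be sketched like this, assuming (my names, not the implementation's) that each class carries the verdicts of its two estimators:

```python
class Node:
    def __init__(self, under_rate, under_ceil, parent=None):
        self.under_rate = under_rate   # rate estimator: class is not overlimit
        self.under_ceil = under_ceil   # ceil estimator: class may still borrow
        self.parent = parent

def may_send(leaf):
    # walk toward the root looking for a class to borrow from, but stop
    # as soon as any class on the path has exceeded its ceil
    c = leaf
    while c is not None:
        if not c.under_ceil:
            return False
        if c.under_rate:
            return True
        c = c.parent
    return False
```

With ceil equal to rate, a class that is over its rate is also over its ceil, so the walk stops immediately and the class behaves as bounded; with a ceil at link speed the test never fires and the class borrows freely.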
General scheduler problems
There is one problem with using DRR (as used in [1]) as the
general scheduler. When a class oscillates between being and not being
regulated, it is often skipped in the DRR loop and loses its quantum. This
leads to inadequate sharing of excess bandwidth.
I believe that all CBQ implementations suffer from this problem, but it is
not so visible because the idle estimator uses relatively large time constants.
In HTB it is more visible because its token bucket resolution is very close
to the DRR round time.
The simple solution is to change DRR to use a separate deficit and active
class pointer for each borrowing level. Because the active pointer is
private to a given borrowing level, a class borrowing from another level
(and possibly forcing us to be regulated) can't disturb our DRR position.
Tests of my implementation show rate-ratio improvements ranging from
5 to 50% with this simple change.
Implementation
I tested all the parts described here in a Linux-based implementation [3].
There are no graphs or other results in this paper because they
correspond to those found in [1].
Future directions
It is possible to cache the last scheduler decision. We could keep
it as long as we have enough DRR tokens. This makes it possible to skip
the whole CBQ algorithm for several packets and account for them in the
estimators when the first non-cached packet is dequeued. It could speed
things up by a large factor.
Using a clock-based scheduler like WF2Q together with link sharing could help
to better isolate interactive traffic and provide better delay bounds.
Acknowledgments
I'd like to thank numerous people, including Jamal Hadi Salim for
helpful comments, Bert Hubert who helped me with this paper, and many
others.
[1] Link-sharing and Resource Management models for Packet Networks,
Sally Floyd and Van Jacobson, 1995
[2] Token Bucket ???? - can't remember the paper :(
[3] HTB for Linux, http://luxik.cdi.cz/~devik/qos/htb
[4] Efficient Fair Queuing using Deficit Round Robin, M. Shreedhar and G. Varghese
[5] WF2Q: Worst-case Fair Weighted Fair Queuing, J.C.R. Bennett and Hui Zhang