- Each interior or leaf class should recieve roughly its allocated link-sharing bandwidth over appropriate time intervals, given sufficient demand.
- If all classes with sufficient demand have recieved at least their allocated link-sharing bandwidth, the distribution of any excess bandwidth should not be arbitrary but should follow some set of reasonable guidelines.

As in the original paper the second goal is not discussed here.
This behavior is affected by used scheduler algorithm.
Most CBQ implementations use deficit round-robin (DRR) [4] in which excess
bandwidth is distributed according to the weight parameters specified.
I will address related issues later in this paper.

In [1], **formal guidelines** are described which implement the goals
above. They are repeated here for your convenience:

**A class can continue unregulated if one of the following conditions hold:
**

**The class is not overlimit, OR**- The class has not-overlimit ancestor at level i, and there are no unsatisfied classes in the link-sharing structure at levels lower than i.

(see [1] for terms' declarations).

The second rule ensures that we don't borrow bandwidth from parent class while there is child which itself can send without borrowing.

**The leaf can continue unregulated only if there exists
no overlimit ancestor class at level L.**

In the case where all leaves are overlimit (step 1 didn't yield a packet)
we increment L and try again until we either get packet or L is bigger
than the level of root class.

It is not hard to prove correctness of algorithm above. For L=1 the
algorithm behaves exactly as first rule of formal guidelines.
To agree with the second rule we have to assure that we don't
allow a leaf to borrow from class at level N if other leaf exists
which has demand and can borrow from class at lower level.
In iterative approach we always test lower levels before
higher ones thus we satisfy both formal sharing rules.

The algorithm can be implemented in faster way. During the first
step (with L=1) we can remember the lowest level we can send at and
use it directly in the second pass (if the first one didn't yield
us a packet). I call this **two phase** approach.

At first sight it is clear that the iterative approach is more expensive. Top-level can determine class status immediately while two phase variant can need up to one full scan over all active leaves. The complexity increases with amount of borrowing done. However my tests showed that difference in the real case is not so critical.

Top level heuristic as used in [1] has complexity O(1) but it is inacurate especially if more flows are active.

Here I'll outline how to change top level heuristics to comply with the formal guidelines as precisely as possible and how to do it with the worst case complexity O(log N).

The key idea is to maintain one sorted list of classes per level. Each
list contains only **active** classes from single level and these
are sorted by time which they become underlimit at. When we need
to know top level then we can simply look heads of the sorted lists up
starting from lowest level. The first level we find underlimit class at
becomes new top level.

The ceiling will make it a bit more complicated but still it is not hard
to prove correctness of the idea above.

Note that this idea was not tested and should be treated as such.

I see two problems with idle based estimator. At first it is relatively hard to implement without having precise clock or without knowing hardware speed of network device (for example variable rate modems or wireless adapters). The second "problem" (minor one) is that it is hard to interpret idle estimator parameters (except for rate of course).

I replaced the estimator by a classical token bucket one [2]. It is simple
to setup as the most of network administrators have knowledge about it. It
uses only two parameters (rate and burst) of which both have unambiguous
physical meaning.

In my feeling the major contribution is fact that token bucket algorithm
is simple to implement on systems with finite nonzero granularity of
timer and it doesn't depent on hardware rate of network device.

The token bucket algorithm has to remember not only positive tokens
(burst) but also negative one at least for non leaf classes. It is
because parent class can be overlimit while one of its children is
underlimit. Sending from the child causes overlimit parent to be
yet more overlimit thus forcing tokens to be negative.

To assure that the overal goal of scheduler is still satisfied, inner
classes must allow for negative tokens up to sum of bursts of all descendant
classes. This limit can however be set to arbitrary large value, say 2
minutes because it is unlikely that user will use bursts bigger than several
seconds. Bursts are in this case specified in the time domain, whereas they
are usually expressed in bytes. These two views can be converted
bidirectionaly using the token bucket's rate).

Probably you see the advantages of this approach. One can guarantee 1Mbit to an agency and allow it to use up to 2Mbit (ceil) if the link is idle. From the implementation side it is trivial. Just another estimator test in underlimit parent lookup loop.

The simple solution is to change DRR to use separate deficit and active
class pointer for each borrowing level. Because active pointer is
private for given borrowing level then class borrowing from other level
(and possibly forcing us to be regulated) can't mess our DRR position.

Tests in my implementation shows rate ratios improvement ranging from
5 to 50% when implementing this simple solution.

Use of clock based scheduler like WF2Q with link sharing could be helpful to better isolate interactive traffic and provide better delay bounds.

[1] Link-sharing and Resource Management models for Packet Networks,
Sally Floyd and Van Jacobson, 1995

[2] Token Bucket ???? - can't remember the paper :(

[3] HTB for Linux, http://luxik.cdi.cz/~devik/qos/htb

[4] Efficient Fair Queuing using Deficit Round Robin, M. Shreedhar and G. Varghese

[5] WF2Q: Worst-case Fair Weighted Fair Queuing, J.C.R. Bennet and Hui Zhang