Devik's Linux QoS development

Issues regarding 2.4 kernels

Thanks to Lijie Sheng and Dhiman Barman, who discovered a number of QoS changes in 2.4 kernels; Lijie sent a patch, at least for WRR. Here it is.
Note that in 2.4 there are different return values from enqueue/dequeue and a different way to lock the QoS subsystem.

Ingress loopback fix for masquerading

Thanks to Mr. Borusik from MTC I discovered that I forgot to put here the patch for masquerading, which is needed when using the loop-ingress patch.
Here it is:

diff -rubB /usr/src/linux-2.2.16/net/ipv4/ip_masq.c gatek/net/ipv4/ip_masq.c
--- /usr/src/linux-2.2.16/net/ipv4/ip_masq.c	Thu May  4 02:16:53 2000
+++ gatek/net/ipv4/ip_masq.c	Mon Jul 24 17:10:27 2000
@@ -1982,11 +1982,12 @@
 	 *	... don't know why 1st test DOES NOT include 2nd (?)
 	 */
 
-	if (skb->pkt_type != PACKET_HOST || skb->dev == &loopback_dev) {
-		IP_MASQ_DEBUG(2, "ip_fw_demasquerade(): packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
+	if ((skb->pkt_type != PACKET_HOST || skb->dev == &loopback_dev) && skb->fwmark == 0) {
+		IP_MASQ_DEBUG(2, "ip_fw_demasquerade(): packet type=%d proto=%d sadr=%d.%d.%d.%d daddr=%d.%d.%d.%d ignored"
+				" - dev=%s, fwmark=%d\n",
 			skb->pkt_type,
-			iph->protocol,
-			NIPQUAD(iph->daddr));
+			iph->protocol, NIPQUAD(iph->saddr),
+			NIPQUAD(iph->daddr),skb->dev->name,skb->fwmark);
 		return 0;
 	}

NEW qdisc for Linux: HTB

NOTE: HTB has evolved and is now quite different - see this link.

Old text:
This is the most recent qdisc. It supersedes WRR and classful TBF. It is a prioritized DRR (deficit round robin) plus TBF in one.
It can be used instead of CBQ because the CBQ implementation in Linux has many drawbacks (low accuracy and high complexity).
You create HTB using:

tc qdisc add dev eth0 root handle 1: htb

This creates a qdisc which has no classes yet, so create some:

tc class add dev eth0 classid 1:1 htb rate 100kbit burst 10k prio 1
tc class add dev eth0 classid 1:2 htb rate 100kbit burst 10k prio 0

Now attach filters to 1: and you are done. Each class acts as a small TBF, but classes are prioritized and each class can borrow unused bandwidth from the others (tc -s -d shows it).
It is very accurate, but it has the same rate limit as TBF (read the comments in both the TBF and HTB sources).
CBQ rates can differ wildly from your expectations (because classes borrow from parents and the borrowing is not accounted to the children), but HTB will work exactly how you want.
You need to set only rate and burst. The priority defaults to 0 (highest); you can set it to 0 .. 3. Unused bandwidth is divided by DRR; by default each class has a DRR quantum of 1500. This means that unused bandwidth is divided equally, not proportionally to rates. You can change this using the quantum parameter (not implemented in tc yet).
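
The equal split of unused bandwidth can be illustrated with a little arithmetic. The following is only a sketch, not code from the HTB patch: it assumes all classes are backlogged and have the same priority, in which case the long-run DRR share of each class is proportional to its quantum. The function name drr_excess_share and its parameters are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Long-run split of spare bandwidth among backlogged classes served by DRR.
 * Assumption: each class receives a share proportional to its quantum.
 *   quantum[i]   DRR quantum of class i, in bytes
 *   excess_kbit  spare bandwidth to divide (hypothetical input)
 *   share[i]     resulting share in kbit (integer division, rounds down)
 */
void drr_excess_share(const unsigned *quantum, size_t n,
                      unsigned excess_kbit, unsigned *share)
{
    unsigned long long total = 0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        total += quantum[i];
    assert(total > 0);

    for (i = 0; i < n; i++)
        share[i] = (unsigned)((unsigned long long)excess_kbit
                              * quantum[i] / total);
}
```

With the default quantum of 1500 for both classes, 56 kbit of spare bandwidth splits into 28 and 28 kbit no matter what rates the classes were configured with. With quanta of 3000 and 1000 the same 56 kbit would split into 42 and 14 kbit.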

TODO: I plan to make it hierarchical like CBQ and add bounded and isolated functionality, but only if I see that someone wants it.
I currently use HTB for sharing on a wireless link. Using priorities I can rate limit precisely to 256kbit and still get 1ms RTT for small prioritized packets on a 100% utilized link. Along with SFQ at the leaves, the link goes like a charm ;-)

Here is the patch for 2.2.15 and for tc. Send bugs to devik@cdi.cz.

Have fun!

TBF improvements

Linux has support for the CBQ algorithm, which is great for link bandwidth sharing. It is problematic to use on a shared Ethernet link because the algorithm needs to know the physical bandwidth of the device, and we often don't know it.
Also, the CBQ code is rather huge. I spent several months digging through the Linux CBQ sources and Floyd's papers. It is not a trivial task to set up CBQ correctly.

Because of these problems I was forced to find another way to handle traffic in our company.
Linux has a good implementation of TBF (token bucket filter). Hence one could create a TBF and insert a WRR (weighted round robin) scheduler into it. The TBF is set to the rate which we can use on the shared medium, and the quantum sizes in WRR do the correct division into flows. For the WRR scheduler see below.
But Linux's TBF can't use an inner queuing discipline; it always uses an internal FIFO queue. So here is a patch to sch_tbf.c for the 2.2.15 kernel. With it TBF supports an optional inner qdisc, which makes the whole thing much more flexible.

tbf.diff

The patch still has two TODOs described in the code. It adds support for classes in TBF or, to be more precise, it creates exactly one class named X:1, where X is the qdisc handle.
Unfortunately it triggers a bug in the iproute2/tc code, so here is a diff which fixes it:

--- tc_class.old.c	Tue Jul 11 21:32:21 2000
+++ tc_class.c	Tue Jul 11 11:52:00 2000
@@ -230,7 +230,7 @@
 	}
 	if (t->tcm_info)
 		fprintf(fp, "leaf %x: ", t->tcm_info>>16);
-	if ((q = get_qdisc_kind(RTA_DATA(tb[TCA_KIND]))) != NULL)
+	if ((q = get_qdisc_kind(RTA_DATA(tb[TCA_KIND]))) != NULL && q->print_copt)
 		q->print_copt(q, fp, tb[TCA_OPTIONS]);
 	else
 		fprintf(fp, "[UNKNOWN]");

Because TBF sometimes needs to delay an already dequeued packet, I used the internal queue to hold such a packet until the next dequeue event. This also minimized changes to the original code. The scheduler works the same way as before until a child qdisc is attached.
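
The idea of holding an already dequeued packet can be sketched in a few lines of C. This is a simplified user-space illustration, not the code from tbf.diff: the types pkt, fifo and tbf, the tick-based token refill and the single held slot (instead of the kernel's internal queue) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct pkt { long len; };                /* packet length in bytes */

/* Hypothetical stand-in for the child qdisc: a FIFO over an array. */
struct fifo { struct pkt *items; size_t n, head; };

struct pkt *fifo_dequeue(struct fifo *f)
{
    return f->head < f->n ? &f->items[f->head++] : NULL;
}

struct tbf {
    long tokens;       /* bytes that may be sent right now */
    long burst;        /* bucket depth in bytes */
    long rate;         /* bytes of credit added per tick */
    struct pkt *held;  /* packet taken from the child but not yet sent */
};

/* Returns the next packet allowed out, or NULL if nothing may be sent yet. */
struct pkt *tbf_dequeue(struct tbf *q, struct fifo *child, long ticks)
{
    struct pkt *p;

    assert(ticks >= 0);
    q->tokens += q->rate * ticks;        /* refill, capped at burst */
    if (q->tokens > q->burst)
        q->tokens = q->burst;

    /* A previously held packet goes first; only then ask the child. */
    p = q->held ? q->held : fifo_dequeue(child);
    if (p == NULL)
        return NULL;
    if (p->len > q->tokens) {
        q->held = p;   /* it cannot be given back to the child, so keep it */
        return NULL;
    }
    q->held = NULL;
    q->tokens -= p->len;
    return p;
}
```

The point of the held slot is that a generic child qdisc offers no way to return a packet once it has been dequeued, so the shaper has to park it somewhere until enough tokens have accumulated.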

Note that the original TBF could introduce delays of at most limit/rate. Now, when you use the prio scheduler as the inner qdisc, the average delay for a packet in the high priority band will be avg_packet_len/rate/2.
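
To get a feeling for the two formulas, here is a small worked sketch. The function names and the choice of units (bytes and bytes per second) are assumptions for this example. My reading of the second formula is that a high priority packet waits, on average, for half the transmission time of one average packet.

```c
#include <assert.h>

/* Worst-case delay of a plain TBF in seconds: a full queue of
 * limit_bytes drained at rate_bytes_per_s. */
double tbf_max_delay(double limit_bytes, double rate_bytes_per_s)
{
    assert(rate_bytes_per_s > 0.0);
    return limit_bytes / rate_bytes_per_s;
}

/* Average delay in seconds of a high priority packet when a prio
 * qdisc sits inside the TBF: avg_packet_len / rate / 2. */
double prio_avg_delay(double avg_packet_len, double rate_bytes_per_s)
{
    assert(rate_bytes_per_s > 0.0);
    return avg_packet_len / rate_bytes_per_s / 2.0;
}
```

For example, at 256kbit (32000 bytes per second) a limit of 15000 bytes gives a worst-case delay of about 0.47 s, while with an average packet length of 1000 bytes the high priority band sees an average delay of about 16 ms.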

WRR scheduler

To fulfil our need to divide our Internet link bandwidth between our departments, protocols and customers, I implemented a new qdisc. The qdisc can have an arbitrary number of classes. Each class has one parameter: quantum. When the WRR qdisc needs to dequeue a packet, it scans over all its classes and sends at most quantum bytes from each, then moves to the next class.
When the quantum is smaller than the packet's size, it can take several rounds to dequeue it. Conversely, when the quantum is very large, its class can send several packets at once (if it has them, of course).

So using the classes' quanta we can affect both the bandwidth ratio between classes (by the quantum ratios) and the packet delay (by the quanta's absolute size).
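
The scheduling loop described above can be sketched as follows. This is a user-space illustration with made-up names (wrr_class, wrr_sched, wrr_init, wrr_dequeue) and fixed-size packet arrays; the real sch_wrr.c may keep its credit differently. It assumes that credit not used in one round carries over to the next, which is what lets a packet larger than the quantum get out after several rounds.

```c
#include <assert.h>
#include <stddef.h>

#define WRR_MAX_PKTS 16

struct wrr_class {
    unsigned quantum;                /* bytes credited per round */
    unsigned credit;                 /* bytes this class may still send */
    unsigned pkt_len[WRR_MAX_PKTS];  /* queued packet lengths, FIFO order */
    size_t head, count;              /* next packet to send, packets queued */
};

struct wrr_sched {
    struct wrr_class *cls;
    size_t ncls;
    size_t cur;                      /* class currently being served */
};

void wrr_init(struct wrr_sched *s, struct wrr_class *cls, size_t ncls)
{
    size_t i;

    assert(ncls > 0);
    for (i = 0; i < ncls; i++) {
        assert(cls[i].quantum > 0);  /* zero quantum would never send */
        cls[i].credit = 0;
    }
    cls[0].credit = cls[0].quantum;  /* first class starts its round */
    s->cls = cls;
    s->ncls = ncls;
    s->cur = 0;
}

/* Dequeue one packet. Returns the index of the class that sent it and
 * stores its length in *len, or returns -1 if every class is empty. */
int wrr_dequeue(struct wrr_sched *s, unsigned *len)
{
    size_t i;
    int backlog = 0;

    for (i = 0; i < s->ncls; i++)
        if (s->cls[i].head < s->cls[i].count)
            backlog = 1;
    if (!backlog)
        return -1;

    for (;;) {
        struct wrr_class *c = &s->cls[s->cur];

        if (c->head < c->count && c->pkt_len[c->head] <= c->credit) {
            *len = c->pkt_len[c->head++];
            c->credit -= *len;
            return (int)s->cur;
        }
        if (c->head == c->count)
            c->credit = 0;           /* idle classes do not hoard credit */
        s->cur = (s->cur + 1) % s->ncls;
        s->cls[s->cur].credit += s->cls[s->cur].quantum;
    }
}
```

With two classes holding 1000-byte packets and quanta of 3000 and 1000, this sketch sends three packets from the first class for every one from the second, which is the bandwidth ratio set by the quanta. A single class with quantum 500 and a 1200-byte packet needs three rounds of credit before the packet leaves.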

I hope this qdisc becomes a standard part of the Linux kernel, as it fills the gap between the simple prio scheduler and the complex CBQ one.

sch_wrr.c - to compile it you have to add the appropriate line to net/sched/Makefile. If the sched code maintainer considers my patches/code useful, I will merge the code into the 2.5 branch. q_wrr.c and Makefile are patches for the tc tool (iproute2).

Ingress queue

Linux allows you to attach a qdisc to shape incoming packets. Oh, wait. It can't really shape them. You can only attach a special qdisc as "ingress", but it will not queue incoming packets. Instead it will only drop some incoming packets by policing them.
It is nice for limiting SYN attacks, for example, but in 2.3 kernels this can be done by iptables (yes, they have a TBF-based test in the firewall). In my opinion it would be much better to queue incoming packets into a real qdisc. It is often important.
Consider a Linux router with eth0 connected to the ISP's shared wireless line and eth1,2 to your departments. You have an agreement with the ISP about maximal incoming/outgoing rates. You have to limit the outgoing rate and the ISP limits your incoming rate. But the ISP probably uses a Cisco router and its generic shaper to limit your rate. Cisco's implementation of the shaper is not very good. They shape flows BEFORE they are queued in the WFQ queue - hence the queueing is not effective and the flow is often very bursty and unfair (with respect to different TCP flows).
This can be solved by attaching our own shaper to incoming packets on eth0 (with a slightly lower rate, of course). Note that you can't attach shapers to eth1,2 instead, because then you can't set up sharing of unused bandwidth between eth1 and eth2.

I tried to hack the kernel in this manner, but I lack some important information. For example, I don't understand why there is both dev->qdisc and dev->qdisc_sleeping. I need to understand it before I start to implement a real ingress qdisc.

Temporary hack
Because we need ingress shaping right now, I hacked the 2.2.15 kernel a bit. I added a new field to the device structure (netdevice.h) which controls what to do with incoming packets. The field is controlled using ifconfig IF metric N (because the metric ioctl was unused).
When N is nonzero, all incoming packets from such a device are resent to the loopback device. This of course does not apply to packets which originated from the loopback, and ARP packets are also excluded (because ARP needs to know the original device where the packet arrived).
Now you can attach any qdisc to the lo device and all incoming packets go through it.
To distinguish between locally generated packets and our fictive packets, each fictive packet is marked by setting its fwmark to the value N. One can use the fw filter to assign them to different classes (queues).
There should be no problem with locking, as net_bh in the 2.2 kernel can be interrupted only by HW.
The hack is a bit dirty but it works. I would not expect such a feature in the kernel, but if you need it, here is the patch.
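
The redirect rules above can be summarized in a short sketch. It is not taken from iloop.diff: the structures rx_dev and rx_pkt, the field name ingress_mark and the function iloop_redirect are hypothetical stand-ins for the kernel's device and skb structures. Only the ARP protocol number 0x0806 is a real constant.

```c
#include <assert.h>
#include <stddef.h>

#define PROTO_ARP 0x0806        /* Ethernet protocol number of ARP */

/* Hypothetical, simplified stand-ins for the device and the packet. */
struct rx_dev {
    int is_loopback;            /* nonzero for the lo device */
    unsigned ingress_mark;      /* the N set via "ifconfig IF metric N" */
};

struct rx_pkt {
    unsigned short protocol;    /* Ethernet protocol of the packet */
    unsigned fwmark;            /* firewall mark seen by the fw filter */
    const struct rx_dev *dev;   /* device the packet is attributed to */
};

/* Decide whether an incoming packet is resent through the loopback.
 * Returns 1 and rewrites the packet if so, otherwise returns 0. */
int iloop_redirect(struct rx_pkt *p, const struct rx_dev *lo)
{
    assert(p != NULL && p->dev != NULL && lo != NULL);

    if (p->dev->ingress_mark == 0)   /* feature is off for this device */
        return 0;
    if (p->dev->is_loopback)         /* came from lo already */
        return 0;
    if (p->protocol == PROTO_ARP)    /* ARP needs the original device */
        return 0;

    p->fwmark = p->dev->ingress_mark;  /* mark it as a fictive packet */
    p->dev = lo;                       /* it now passes through lo's qdisc */
    return 1;
}
```

A packet that went through this path later shows up on the loopback device with a nonzero fwmark, which is how a fw filter on lo (and the masquerading fix earlier on this page) can tell it apart from locally generated traffic.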

iloop.diff

Martin Devera <devik@cdi.cz>