All the “reasons” that bufferbloat isn’t a problem

As we take a shot at opening people’s eyes to bufferbloat, we should know some of the “objections” we’ll run up against. Even though there’s terrific technical data to back up our research, people seem especially resistant to thinking that bufferbloat might affect their network, even when they’re seeing problems that sound exactly like bufferbloat symptoms. But first, some history:

The very idea of bufferbloat is simply unbelievable. Jim Gettys in 2011 [1] couldn’t believe it, and he’s a smart networking guy. At the time, it seemed incredible (that is, “not credible” == impossible) that something could induce 1.2 seconds of latency into his home network connection. He called in favors from technical contacts at his ISP and at Bell Labs, who went over everything with a fine-toothed comb. It was all exactly as spec’d. But he still had the latency.

This led Jim and Dave Täht to start the investigation into the phenomenon known today as “bufferbloat” – the undesirable latency that comes from a router or other network equipment buffering too much data. Over several years, a group of smart people made huge improvements: fq_codel was released 14 May 2012 [3] and was incorporated into the Linux kernel shortly afterward. CAKE came in 2015, and the Wi-Fi fixes that minimize bufferbloat in the drivers arrived in 2018. In 2021, cake-autorate [4] arrived to handle varying-capacity ISP links. All these techniques work great: in 2014, my 7 mbps DSL link was quite usable. And when the pandemic hit, fq_codel on my OpenWrt router allowed me to use that same 7 mbps DSL line for two simultaneous Zoom calls.

As one of the authors of [2], I am part of the team that has tried over the years to explain bufferbloat and how to fix it. We’ve spoken with vendors. We’ve spent untold hours responding to posts on assorted boards and forums with the bufferbloat story.

With these technical fixes in hand, we cockily set about telling the world how to fix bufferbloat. We assumed it would be an easy task. Instead, our efforts have been met with skepticism at best, and stony silence at worst. What are the objections?

  • This is just the ordinary behavior: Of course things will be slower when there’s more traffic. (Willfully ignoring an orders-of-magnitude increase in delay.)
  • Besides, I’m the only one using the internet. (Except when my phone uploads photos. Or my computer kicks off some automated process. Or I browse the web. Or …)
  • It only happens some of the time. (Exactly. That’s probably when something’s uploading photos, or your computer is doing stuff in the background.)
  • Those bufferbloat tests you hear about are bogus. They artificially add load, which isn’t a realistic test. (…and what do you experience if you actually are downloading a file?)
  • Bufferbloat only happens when the network is 100% loaded. (True. But when you open a web page, your browser briefly uses 100% of the link. Is this enough to cause momentary lag?)
  • It’s OK. I just tell my kids/spouse not to use the internet when I’m gaming. (Huh?)
  • I have gigabit service from my ISP. (That helps, but if you’re complaining about “slowness” you still need to rule out bufferbloat in your router.)
  • I can’t believe that router manufacturers would ever allow such a thing to happen in their gear. (See the Jim Gettys story above.)
  • I mean… wouldn’t router vendors want to provide the best for their customers? (Not necessarily – implementing this (new-ish) code requires engineering effort. They’re selling plenty of routers using the decade-old software. The Boss says, “would we sell more if we make these changes? Probably not.”)
  • Why would my ISP provision/sell me a router that gave crappy service? They’re a big company, they must know about this stuff. (Maybe. We have reached out to all the vendors. But remember they profit if you decide your network feels too slow and you upgrade to a higher capacity device/plan.)
  • But couldn’t I just tweak the QoS on my router? (Maybe. But see [5])
  • Besides, I just spent $300 on a “gaming router”. I just bought the most expensive/best possible solution on the market. (And you’re concerned that you still have lag?)
  • You’re telling me that a bunch of pointy-headed academics are smarter than commercial router developers – who sold me that $300 router? I can’t believe it. (Well, these fixes do seem to solve the problem when they replace vendor firmware…)
  • And then you say that I should throw away that gaming router and install some “open source firmware”? What the heck is that? And why should I believe you? (Same answer as above.)
  • What if it doesn’t solve the problem? Who will give me support? And how will I get back to a vendor-supported system? (Valid point – the first valid point)
  • Aren’t there any commercial solutions I can just buy? (Not at the moment. IQrouter was a shining light here – available from Amazon, simple setup, worked a treat – but they have gone out of business. And of course, for the skeptic, this is proof that the “fq_codel-stuff” isn’t really a solution – it seems just like snake oil.)

So… all these hurdles make it hard to convince people that bufferbloat could be the problem, or that it’s something they can fix for themselves.

A couple of us have reached out to Consumer Reports, wondering if they would like a story about how vendors would prefer to sell you a shiny new router (or a bigger ISP plan) than fix your bufferbloat. This kind of story seemed to be right up their alley, but we never heard back after an initial contact. Maybe they deserve another call…

The recent latency results from Starlink give me a modicum of hope. They’re a major player. They (and their customers) can point to an order of magnitude reduction in latency over other solutions. It still requires enough “regular customers” to tell their current ISP that they are switching to Starlink (and spend $600 to purchase a Dishy plus $100/month) to provide a market incentive.

Despite all this doom and gloom, I remain hopeful that things will get better. We know the technology exists for people to take control of their network and solve the problem for themselves. We can continue to respond on forums where people express their dismay at the crummy performance and suggest a solution. We can hope that a major vendor will twig to this effect and bring out a mass-market solution.

I think your suggestion of speaking to eSports people is intriguing. They’re highly motivated to make their personal networks better. And actually solving the problem would have a network effect of bringing in others with the same problem. 

Good luck, and thanks for thinking about this.

This is a slightly edited version of a post first published on the bloat and starlink lists.

Rich Brown

[1] https://courses.cs.washington.edu/courses/cse550/21au/papers/bufferbloat.pdf

[2] https://www.bufferbloat.net/projects/bloat/wiki/What_can_I_do_about_Bufferbloat/

[3] https://lists.bufferbloat.net/pipermail/cerowrt-devel/2012-May/000233.html

[4] https://github.com/lynxthecat/cake-autorate

[5] https://www.bufferbloat.net/projects/bloat/wiki/More_about_Bufferbloat/#what-s-wrong-with-simply-configuring-qos

Save money while keeping latency low

I often hear people saying they have bad responsiveness / high latency on their high-speed internet connection. Even though they have a contract for 500 mbps+ service from their ISP, a speedtest shows high ping times when the link is loaded.

The answer is well-known – install a router with SQM that knows how to control latency (“bufferbloat”). But, they counter, a router that can handle my high-speed network is expensive! And that blows out my budget!

I posted a contrarian viewpoint in a post on the OpenWrt forum:

Save money and purchase a 250 mbps to 350 mbps connection and use good SQM (for example, the IQrouter v3, or any reasonable-performance OpenWrt-compatible router).

Unless you’re unusual, and the transfer speeds of your bulk up/downloads are unacceptable, it’s likely that a lower speed to your ISP with a modestly-priced router that controls latency will make you just as happy.

Not only will you save money with a less-expensive router, you save every month with a lower ISP bill.
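To make the arithmetic concrete, here’s a back-of-the-envelope sketch. All the prices are hypothetical – substitute your own ISP’s and router vendor’s numbers:

```python
# Hypothetical prices -- substitute your own ISP's and vendor's rates.
gigabit_per_month = 90.00   # 500 Mbps+ "premium" tier
midtier_per_month = 60.00   # 250-350 Mbps tier
fancy_router = 300.00       # high-end "gaming" router
sqm_router = 100.00         # modest OpenWrt-capable router running SQM

# One-time savings on the router, plus recurring savings on the plan.
first_year = (fancy_router - sqm_router) + 12 * (gigabit_per_month - midtier_per_month)
print(f"${first_year:.2f}")  # $560.00 saved in the first year alone
```

And unlike the one-time router savings, the monthly difference keeps accruing every year after that.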

Using Apple’s RPM tool

macOS Monterey ships with a tool that measures the responsiveness of your network connection. It saturates the network with traffic for 20 seconds, then measures the rate of short transactions to compute “Responses Per Minute.” Big numbers (above 2000) mean your network remains responsive when the network is heavily loaded. Small numbers (under 800 or so) mean your network isn’t responsive – potentially caused by bufferbloat.

There’s an iOS version described at https://support.apple.com/en-gb/HT212313

Stuart Cheshire and Vidhi Goel talked about the RPM tool at WWDC 2021. Apple also published an Internet-Draft that describes the RPM technique

Here’s a sample run from my Mac. The RPM tool displays my download and upload speeds (nominally 25mbps), and the number of simultaneous flows required to saturate the link (12, in this case). It shows the responsiveness as 1995 round-trips per minute. That’s really good: the average latency – even during heavy load – only increases a bit above the baseline (idle) 21 msec.

% /usr/bin/networkQuality -v
==== SUMMARY ====
Upload capacity: 22.657 Mbps
Download capacity: 23.755 Mbps
Upload flows: 12
Download flows: 12
Responsiveness: High (1995 RPM)
Base RTT: 21
Start: 11/7/21, 7:18:37 AM
End: 11/7/21, 7:18:47 AM
OS Version: Version 12.1 (Build 21C5021h)
%
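RPM is just the reciprocal of the round-trip time under load, so it’s easy to sanity-check the report above. A quick sketch (my own arithmetic, not part of Apple’s tool):

```python
def rpm_to_latency_ms(rpm):
    """Round-trips per minute -> average round-trip time in milliseconds."""
    return 60_000 / rpm

print(round(rpm_to_latency_ms(1995)))  # 30 -- about 30 msec under load, vs. 21 msec idle
print(round(rpm_to_latency_ms(240)))   # 250 -- what a badly bloated link looks like
```

So the “High (1995 RPM)” verdict means each loaded round trip took about 30 msec, only 9 msec above the idle Base RTT.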

Here’s a video that shows the tool in operation: https://youtu.be/e9DUTB9okMA

NameD•Tective — mDNS over AppleTalk

[From the Archives of Amusing Technology…] Back in the ’90s, Dave Fisher and I created NameD•tective, a Macintosh control panel that gave any Mac on the Dartmouth network a static DNS name.

It used Name Binding Protocol to let someone create a DNS name based on their name plus their AppleTalk zone. The DNS name had the form: person-name.AppleTalk-zone…dartmouth.edu. The NameD•tective server looked up the NBP name, and returned the computer’s current IP address.

This screen shot shows an example: cd-changer.kiewit.atzone.dartmouth.edu was a server in my office that distributed information to other developers in the Kiewit building. NameD•tective probably didn’t get broad use at Dartmouth, but it was a neat demonstration project. It led Dave and me to develop the MacDNS software that was shipped as part of Apple’s Internet Connection Kit. Here’s the page from Dartmouth’s website archived by the Wayback Machine: https://web.archive.org/web/19961220043011/http://www.dartmouth.edu/pages/softdev/named.html

Today, computer naming is much simpler. Modern operating systems let a computer specify an mDNS (multicast DNS) name that can be directly looked up to find a host that provides the service.

Comcast modems decrease bufferbloat

Last month, Comcast released a paper Improving Latency with Active Queue Management (AQM) During COVID-19 that shows that their PIE AQM dramatically decreases lag/latency (by a factor of 10X — from 250 msec down to 15-30 msec.) From the paper (page 13):

 

… As explained earlier, for two variants of XB6 cable modem gateway, upstream DOCSIS-PIE AQM was enabled on the CGM4140COM (experiment) variant but was not available on the TG3482G (control) variant during the measurement period.

At a high level, when a device had AQM it consistently experienced between 15-30 milliseconds of latency under load. … [The] non-AQM devices experienced in many cases 250 milliseconds or higher latency under load.

See if you can get an XB6 / CGM4140COM cable modem.

Toward a Consumer Responsiveness Metric

At a recent videoconference, I advocated strongly for a consumer-facing measurement of latency/responsiveness. I had not planned to speak, so I gave off-the-cuff comments. This is an organized explanation of my position. I offer these thoughts for consideration at the IAB Workshop “Measuring Network Quality for End-Users, 2021” – Rich Brown

I hunger for a day when vendors (router manufacturers and service providers) compete on the basis of “responsiveness” in the same way that they compete on speed – “Up to X megabits per second, and Y responsiveness!”

I have been working on the “Bufferbloat Project” [1] since 2011, trying to find layman’s terms for what was happening, and what to do about it. [2] [3] The delay goes by the name “lag”, “latency under load”, or “bufferbloat”. At first, the effects seemed mysterious and non-intuitive. Even to knowledgeable individuals, the magnitude of the delay caused by queueing was astonishing. No matter what name you use, it makes people say, “the internet is slow today”.

My router at home has solved this problem. I enjoy the fruits of the intense research from the mid 2010’s that led to well-understood solutions such as fq_codel, cake, PIE, and airtime fairness. Even using 7 mbps DSL, my network was quite usable, and very responsive.

My frustration in 2021 is that this remains a problem for nearly everyone else. The market has not provided solutions. Every day, people purchase brand name equipment that happily queues hundreds of msec of traffic.

I postulate that vendors have not considered responsiveness to be an important characteristic of their offerings. Consequently, they have not prioritized the engineering resources to incorporate the well-tested solutions listed above.

My hope, from this note, and from our on-going efforts, is that we can come up with a test tool that consumers can use to raise awareness of the problem of bad responsiveness.

Characteristics of a Responsiveness Tool

I seek a “responsiveness tool” with these characteristics:

  1. Easy to use. People need an easy way to measure responsiveness so they can give feedback to their vendors.
  2. A single number, so it’s easy to report and compare.
  3. Bigger must be better. High latency means bad responsiveness. People have no intuitive feel for a millisecond: “Is 100 msec bad? Isn’t that really short…?”
  4. An approximate measure is OK. Consumers won’t mind separate runs varying 20% or 30%, especially since poor responsiveness could be an order of magnitude different from good.
  5. Resistant to cheating. Vendors sometimes optimize pings to make latency look lower. But real people’s traffic doesn’t use pings. The responsiveness test must use protocols that match actual traffic patterns.
  6. Vendor and technology independent. People should use and get similar results from their phone, their desktop, on the web, or using an app.
  7. “Good enough”. A widely implemented and promoted metric that substantially matches people’s real experience is vastly superior to a host of competing metrics that muddy the waters in consumers’ minds.

A Proposed Metric – RPM

Apple has produced an Internet Draft “Responsiveness under Working Conditions” [4] and implementation. It defines a procedure for continually making short HTTPS transactions on a path to a server that has been fully loaded in both directions. The number of transactions in a fixed time is expressed as the number of “round-trips per minute”, which is given the name “RPM”, a wink to the “revolutions per minute” that we use for cars.

The RPM measurement satisfies all my concerns.

Non-requirements

It is not a requirement for the responsiveness test to provide:

  • Strict reproducibility. The wider internet has widely varying conditions, with bottlenecks moving around by time of day or adjacent traffic. It is not reasonable/feasible to expect that any measure used by consumers will be exactly reproducible.

  • Detailed statistics or distributions of measurements. This is not a diagnostic tool. A nuanced data set with medians and percentiles may excite techies, but for others, it’s hard to understand the implications.

  • Performance of any particular protocol. The responsiveness tool must measure a broad variety of typical traffic.

  • Data to be used as input for vendors to design solutions. The responsiveness measure needs to be used the same way we say to our mechanic, “The car makes a funny noise when I …”. I expect the specialist to work to reproduce the symptom, using the provided equipment, and come up with an appropriate solution.

Summary

The research of the last decade has developed a wide variety of solutions. There are plenty of corner-cases where these solutions aren’t perfect. I encourage vendors and researchers to study the field and advance our knowledge further. I would be delighted if they found practices even better than the current state of the art.

But “the rest of the internet” (including my neighbors and family members, for whom I’m the support person) would all benefit from a world where off-the-shelf equipment already incorporated well-known, best practice solutions.

References

[1] Bufferbloat Project https://bufferbloat.net

[2] Bufferbloat and the Ski Shop https://randomneuronsfiring.com/bufferbloat-and-the-ski-shop/

[3] Best Bufferbloat Analogy – Ever https://randomneuronsfiring.com/best-bufferbloat-analogy-ever/

[4] Responsiveness under Working Conditions – Internet-Draft at: https://datatracker.ietf.org/doc/draft-cpaasch-ippm-responsiveness/ Full disclosure: I am one of the editors of the “Responsiveness Under Working Conditions I-D”

Best Bufferbloat Analogy – Ever

My friends frequently ask, “Why is my network so slow?” And often, the answer is “latency” or the screwy term, “Bufferbloat” – the “undesirable latency caused when a router buffers too much data.” But what the heck does that mean?

A while back, I attempted a layman’s explanation of Bufferbloat. I compared it to a ski shop. It was pretty unsuccessful: it just didn’t have any intuitive appeal.

That’s why I was delighted that Waveform.com published what I believe is the Best Bufferbloat Analogy – Ever. (I am pleased to have contributed to the final version of their description.) That page also has a well-designed web-based Bufferbloat Tester (on a par with the Ookla Speedtest or the Cloudflare speed test).

They asked, Can you explain bufferbloat like I’m five? and noted that flows of liquids are a lot like flows of packets. The analogy: when a friend dumps a bucket of water into a sink with a narrow drain, it keeps other flows (like a teaspoon of oil) from emptying out quickly. Read the whole description…

This made me think about having a SmartSink™ to give a visual image for understanding how a well-designed router can decrease latency.

What’s a SmartSink™?

Instead of accepting a full bucket of water all at once, a SmartSink controls the bucket of water with a valve. It allows just enough water into the sink to keep the drain full. If the water gets too low, the SmartSink opens the valve a bit; if it gets “too full”, it closes it a bit.

A SmartSink also works when lots of friends have their own buckets, pouring in colored water – pink, blue, etc. The valves on the SmartSink control each color. If the SmartSink notices too much pink water, it closes that valve a bit to bring back balance, so that each color gets its “fair share” of the drain’s capacity. And because there’s never too much water (of any color) in the sink, a small new flow always drains quickly.

Reality check: This is just an analogy. I realize that a SmartSink is a ridiculous idea. But it helps me visualize how small flows can drain quickly while big flows share the drain capacity fairly.

What does this have to do with routers?

The Smart Queue Management (SQM) algorithm in a router works like the SmartSink. When a device starts sending a lot of data (maybe a phone starts uploading photos to the cloud), SQM controls the amount of data queued for each flow (each separate upload, videoconference, voice call, gaming session, Youtube, Bittorrent, etc) to prevent any one flow from using more than its share. Instead of operating valves to control the flow of water, SQM controls the size of each flow’s queue by:

  1. Separating every traffic flow’s arriving packets into their own queue.
  2. Removing a small batch of packets from a queue, round-robin style, and transmitting that batch through the (slow) bottleneck link to the ISP. When each batch has been fully sent, retrieving a batch from the next queue, and so on.
  3. Offering back pressure to flows that are sending “more than their share” of data.

This process provides these desirable effects:

  • Most importantly, SQM provides low latency. Small flows (with just one or a few small packets) get sent right away in their next “round robin” batch.
  • Equal sharing of the bottleneck: If there are multiple senders, each can send an equal amount of data with each round-robin opportunity.
  • No waste of the bottleneck: If there’s only one sender (one queue with data), that one gets the full capacity of the link.
  • Offering backpressure to bulk senders minimizes lost packets and re-transmissions, making the network globally more efficient.
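The round-robin behavior described above can be sketched in a few lines. This is a toy model – the flow names and batch size are my own, and real SQM (fq_codel, CAKE) is far more sophisticated – but it shows why a small flow never waits behind a bulk transfer:

```python
from collections import deque

BATCH = 2  # packets sent per flow per round (illustrative value)

def fair_schedule(flows):
    """flows: dict of flow-name -> list of packets. Returns the send order."""
    queues = {name: deque(pkts) for name, pkts in flows.items()}
    order = []
    while any(queues.values()):
        # Visit each flow's queue in turn, sending at most BATCH packets.
        for name, q in queues.items():
            for _ in range(BATCH):
                if q:
                    order.append((name, q.popleft()))
    return order

# A bulk photo upload and a single DNS query arrive at the same time.
sent = fair_schedule({
    "photo-upload": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "dns-query":    ["q1"],
})
# The lone DNS packet goes out in the very first round,
# not after all six upload packets.
print(sent[:3])  # [('photo-upload', 'p1'), ('photo-upload', 'p2'), ('dns-query', 'q1')]
```

With a single FIFO queue (no SQM), that DNS packet would have sat behind the entire upload – that’s the bufferbloat delay.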

Does SQM work?

YES! Can I get a router with SQM today? YES!

Got questions? Send them to me and I’ll include them in Part 2 (coming soon) of this blog. Thanks.

WireGuard Vanity Keys

A WireGuard VPN provides a fast, secure tunnel between endpoints. It uses public/private key pairs to encrypt the data.

If you have several clients, you have to enter their public keys into your server. Keeping track of those keys gets to be a hassle, since ordinarily, the keys are essentially random numbers.

I found a great project to help this problem: WireGuard Vanity Address. It continually generates WireGuard private/public key pairs, printing keys that contain a desired string in the first 10 characters. For example, I generated this public key for my MacBook Pro (MBP): MBP/DzPRZ05vNZ0XS3P9tlokZPrLy/1lb1Zsm3du4QA= Note the MBP/ at the start – it makes it easy to know that this is my Mac’s key.

To do it, I ran the wireguard-vanity-address program. Here is sample output:


$ ./wireguard-vanity-address MBP/
searching for 'mbp/' in pubkey[0..10], one of every 299593 keys should match
one trial takes 28.7 us, CPU cores available: 2
est yield: 4.3 seconds per key, 232.30e-3 keys/s
hit Ctrl-C to stop
private qMKPNrCMId59XTn5vgDICUh/QzIfhqZdrZ+XQBIJj2w= public zmbP/YEpC8Zl6MacYhcY1lq126tL2UudFjmrwbl2/18=
private HHtPY8IwGBxQ5OTtJY6GcuFpImXtDp9d187zvI0axFo= public qhIiSMbp/extT5irPy4EJfLRPR9jTzQZHlM15Fo/P2E=
private BEnEu1lVdcRI997nj2uPNGsyCZNPhBTCNfgJuYPPJHA= public hZzmBP/8EthWPOFp5wroEGPeJTHGxZ5KENnMiZvniGY=
private 8HRj+YZfSBnYZn38MPE09W2g03JvRJoGbjlDkHQ0Wnk= public mBP/q2dOd+m457PyKTIvI7MDTuXLCneG6MM0ir9rwRc=
...
private dFE8xsDDWNNNY1OjOIlxQiNVbp7Z6tZhXsaOo/5gPH0= public MBP/DzPRZ05vNZ0XS3P9tlokZPrLy/1lb1Zsm3du4QA=
^C
# This last line contains a public key starting with "MBP/"

For more details, read the github page, and also the issue where the author addresses security concerns about decreasing the size of the key space.

Update: I created a Dockerfile to make it even easier to run wireguard-vanity-address. Check out my personal github repo for details.
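Incidentally, the “one of every 299593 keys” estimate in the output above falls out of simple arithmetic. A back-of-the-envelope sketch (my own reasoning, not taken from the tool’s source): the search is case-insensitive, each Base64 character is one of 64 possibilities, and a 4-character pattern can start at any of 7 offsets within the first 10 characters:

```python
# Chance that "mbp/" (case-insensitive) appears in the first 10
# characters of a random Base64 public key.
p_one_offset = (2 / 64) ** 3 * (1 / 64)  # 'm','b','p' each match 2 of 64 chars; '/' matches 1
offsets = 10 - 4 + 1                     # 7 possible starting positions
p_match = p_one_offset * offsets
print(round(1 / p_match))                # 299593 -- matches the tool's estimate
```

That’s also why longer patterns get expensive fast: each extra case-insensitive letter multiplies the search time by about 32.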

WireGuard GUI on macOS

A WireGuard VPN provides a fast, secure tunnel between endpoints. A macOS GUI client is available from the App Store.

It works great. But its documentation is minimal. Even though the required keywords (which you must type manually) are the same as in other clients, the GUI doesn’t give a hint about whether an entry is right until you type it exactly correctly. Consequently, it can be a pain to configure properly.

This screen shot shows a correctly configured (although fictitious) VPN tunnel. To get to this configuration window, use the WireGuard Manage Tunnels menu, click the + button, and choose Add Empty Tunnel… then fill in the resulting window as shown below:

Screen shot of macOS WireGuard GUI

Although there are plenty of guides to explain WireGuard, this summarizes my best understanding of the meaning of these fields. There may be additional ways to configure the VPN, but following this advice will result in a working secure configuration.

[Interface] Section

  • PrivateKey: Private key for this computer. WireGuard uses this key to encrypt data sent to its peer, and decrypt received data. WireGuard displays the corresponding PublicKey (which you’ll enter into the peer) at the top of the window.
  • Address: Address for the VPN tunnel interface on this computer. Use a /32 address chosen from an address range that is not in use in either this network or the peer’s network. (This example uses 10.0.10.2/32 for this end. The peer (not shown) is 10.0.10.1/32. They were chosen because the 10.0.10.0/24 subnet is not in use on either side of the tunnel.)
  • DNS: (Optional) Address(es) of DNS servers to be used by this computer. It’s OK to leave this out – by default, WireGuard will use the underlying OS DNS servers.
  • ListenPort: (Optional) WireGuard listens on this port for traffic from its peer. It’s OK to leave this out – by default, WireGuard will select an unused port.

[Peer] Section

  • PublicKey: The public key of the remote peer. WireGuard uses this key to decrypt the packets sent from the peer, and encrypt packets sent to the peer.
  • PresharedKey: (Optional) A symmetric key, mixed into the handshake in addition to the public/private key pairs, that adds an extra layer of protection. Both peers must specify the same value.
  • AllowedIPs: A comma-separated list of IP (v4 or v6) addresses with CIDR masks which are allowed as destination addresses when sending via this peer and as source addresses when receiving via this peer.
  • Endpoint: (Optional) The address (or DNS name) and port of the remote peer. If specified, this peer will attempt to connect to the endpoint periodically.
  • PersistentKeepalive: (Optional) The number of seconds this peer waits before sending another keep-alive message. These messages “keep the session alive” through NAT.
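Pulling these fields together, the text form of a configuration like the one in the screen shot might look like this. The keys are placeholders; the addresses and subnets follow the fictitious example in this post, peer.example.com:51820 stands in for your peer’s real address (51820 is WireGuard’s customary port), and 25 seconds is a commonly used keepalive interval:

```
[Interface]
PrivateKey = <this-computer's-private-key>
Address = 10.0.10.2/32

[Peer]
PublicKey = <remote-peer's-public-key>
AllowedIPs = 192.168.4.0/24, 172.30.42.0/24
Endpoint = peer.example.com:51820
PersistentKeepalive = 25
```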

I would appreciate comments on these descriptions so I can make them more helpful/useful.

Additional Thoughts

The following thoughts are refinements to the advice shown above.

    • The example above only allows traffic to/from the 192.168.4.0/24 and 172.30.42.0/24 subnets to travel through the tunnel. To send all traffic through the tunnel (say, to avoid prying eyes of your ISP, etc), you can set the AllowedIPs to 0.0.0.0/0. To send all IPv6 traffic through the tunnel, add ::/0
    • It is neither necessary nor recommended to include the peer’s Address in the AllowedIPs list.
    • Although both Endpoint and PersistentKeepalive are listed as optional, you normally set both when using the macOS WireGuard client. Activating the tunnel (from the WireGuard menu) causes WireGuard to begin sending keepalive packets to the Endpoint, which starts up the tunnel.
    • Dealing with NAT. If your ISP requires your remote peer to be behind NAT, you must configure your ISP’s router/modem to pass the WireGuard packets through. The setup varies from ISP to ISP, but in general, you’ll need to set up some kind of “virtual server”, “DMZ”, or “port forwarding” in the ISP router/modem to pass the WireGuard packets (on the port specified in the Endpoint) to the peer device.