This reminds me of one of the most interesting bugs I've faced: I was responsible for developing the component that provided away market data to the core trading system of a major US exchange (which allows the trading system to determine whether an order should be matched in-house or routed to another exchange with a better price).
Throughputs were in the multiple tens of thousands of transactions per second and latencies were in single digit milliseconds (in later years these would drop to double digit microseconds, but that's a different story). Components were written in C++, running on Linux. The machine that ran my component and the trading engine were neighbors in a LAN.
We put my component through a full battery of performance tests, and for a while we seemed to be meeting the numbers. Then one day, with absolutely zero code changes on my end or the trading engine's end, the latency numbers collapsed. We checked the hardware configs and the rate at which the latest test was run. Both identical.
It took, I think, several days to solve the mystery: in the latest test run, we had added one extra away market to a list of 7 or 8 markets for which my component provided market data to the trading system. We had added markets before without an issue. It's a negligible change to the market data message size, because it only adds a few bytes: market ID, best bid price & quantity, best offer price & quantity. In no way should such a small change result in a disproportionate collapse in the latency numbers. It took a while for us to realize that before the addition of these few bytes, our market data message (a binary packed format) neatly fit into a single Ethernet frame. Those extra few bytes pushed it over the 1600 (or 1500?) mark and caused all market data message frames (which were the bulk of messages on the system, next to orders) to fragment. The fragmentation and reassembly overhead was enough to clog up the pipes at the rates we were pumping data.
In the short run, I think we managed to do some tweaks and get the message back under 1600 bytes (by omitting markets that did not have a current bid/offer, rather than sending NULLs). I can't recall what we did in the long run.
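For anyone wanting the arithmetic behind "a few extra bytes caused every message to fragment": it's just the Ethernet payload budget. A minimal sketch, assuming a UDP/IPv4 transport and the standard 1500-byte MTU (neither of which the story above actually specifies), with a made-up per-market entry size:

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative only: assumes IPv4 + UDP with no IP options. */
    #define ETH_MTU       1500                           /* Ethernet payload limit */
    #define IPV4_HDR      20
    #define UDP_HDR       8
    #define ONE_FRAME_MAX (ETH_MTU - IPV4_HDR - UDP_HDR) /* 1472 bytes of application data */

    static bool fits_in_one_frame(size_t msg_bytes) {
        return msg_bytes <= ONE_FRAME_MAX;
    }

    int main(void) {
        size_t msg = 1470;   /* a packed market-data message that just fits */
        printf("%zu bytes -> %s\n", msg,
               fits_in_one_frame(msg) ? "one frame" : "fragments");

        /* Add one more away market: market ID + best bid/offer price and
         * quantity -- say 26 bytes.  Now every message becomes two IP
         * fragments, doubling the packet rate and adding reassembly work. */
        msg += 26;
        printf("%zu bytes -> %s\n", msg,
               fits_in_one_frame(msg) ? "one frame" : "fragments");
        return 0;
    }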
Veserv 12 hours ago [-]
MTU discovery would be so much easier if the default behavior was to truncate and forward when encountering an oversized packet. The endpoints can then just compare the bytes received against the size encoded inside the packet to trivially detect truncation, and thus learn the inbound MTU.
This allows you to do MTU discovery as an endpoint protocol, with all the authentication benefits that provides, and allows you to send a single large probe packet to precisely identify the MTU. It would also allow you to immediately and transparently identify MTU reductions due to route changes or any other cause, instead of packets just randomly blackholing or getting responses from unknown, unauthenticated endpoints.
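A sketch of the receiver-side check this proposal implies: the sender writes the length it intended to transmit into the payload, and the receiver compares that against what actually arrived. The header layout here is purely hypothetical:

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical probe header: first 4 bytes carry the length the sender
     * intended to transmit, in network byte order. */
    struct probe_hdr {
        uint32_t intended_len;
    };

    /* Inspect one (possibly truncated) probe.  If fewer bytes arrived than
     * the header claims, something on the path truncated it, and the size
     * that did arrive is an upper bound on the inbound MTU. */
    static size_t learn_inbound_limit(const uint8_t *buf, size_t received,
                                      int *truncated)
    {
        struct probe_hdr h;
        if (received < sizeof h) {      /* too short to even carry the header */
            *truncated = 1;
            return received;
        }
        memcpy(&h, buf, sizeof h);
        uint32_t intended = ntohl(h.intended_len);

        *truncated = (received < intended);
        return *truncated ? received : intended;
    }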
zamadatix 3 hours ago [-]
Truncation for a dedicated probe packet type: you lose the information that it's a probe when you go through a tunnel of some sort (VPN, L2TP, IPsec, MPLS, VPLS, VXLAN, PBB, q-in-q, whatever). You're also dealing with different layers, e.g. a client could send an L3 probe packet, and now you're expecting a layer 2 PBB/q-in-q node to recognize IP packet types and treat them specially (a layering violation).
Truncation for all packet types: data in transit can occasionally get split for other reasons. Right now that just becomes loss; if we had built every protocol layer on the idea that it should forward anyway, then any instance of this type of loss also becomes an MTU renegotiation, at best. At worst we're forwarding generally corrupted packets, which can cause all sorts of other problems. It'd be another layering violation to require that e.g. an L2 switch adjust the UDP checksum when it intentionally truncates a packet, but that'd be the only way to avoid it. Tunnels (particularly secure ones) are also tricky here (you need to run multiple separate layers of this continuously to avoid the truncation information not propagating to the right endpoints). It also doesn't allow for truly unidirectional protocols, e.g. a UDP video stream, as there is no allowance for out-of-session signaling.
The above is for "if we had started networking on day 1 with this plan in mind". There are of course additional problems given that we didn't. I'm also not sure I follow how allowing any intermediate node to truncate a packet is any more authenticated.
The (still ugly) beauty of a PMTUD-style approach over truncation or probe+notification is that it doesn't try to make assumptions about how anything in the middle could ever work for the rest of time, and that makes it both simple (despite sounding like a lot of work) and reliable. You and your peer just exchange packets until you find the biggest size that fits (or that you care to check for) and you're off! MTU changes due to a path change? No problem, it's just part of your "I had a connection and the other side seems to have stopped responding; how do I attempt to continue" logic (be that retrying a new session or attempting to be smart about it). It also plays nicely with ICMP too-big messages: if they are there you can choose to listen, if they are not it still "just works".
Or, like the article says, safe minimums can be more practical.
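A rough sketch of that probe exchange for a connected UDP socket on Linux, with DF forced so middleboxes can't quietly fragment the probes. "peer_echoed" stands in for whatever acknowledgement the application protocol already has (it's hypothetical), and for brevity a lost probe is treated as "too big" rather than being retried the way real PLPMTUD would:

    #include <netinet/in.h>    /* IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_PROBE */
    #include <stdbool.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Hypothetical: true if the peer acknowledged a probe of this size. */
    bool peer_echoed(int fd, size_t probe_size);

    /* Binary-search the largest datagram the path carries, between a safe
     * floor (e.g. 1200 bytes) and the local interface MTU. */
    static size_t probe_path_mtu(int fd, size_t lo, size_t hi)
    {
        int val = IP_PMTUDISC_PROBE;   /* set DF, ignore the kernel's own PMTU estimate */
        setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof val);

        static char buf[65535] = { 0 };

        while (lo < hi) {
            size_t mid = (lo + hi + 1) / 2;
            (void)send(fd, buf, mid, 0);   /* fd is a connected UDP socket */
            if (peer_echoed(fd, mid))
                lo = mid;                  /* fits: try bigger */
            else
                hi = mid - 1;              /* no ack: assume too big */
        }
        return lo;                         /* largest size that got through */
    }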
ikiris 10 hours ago [-]
And how do you tell the difference between cut-off packets and an MTU drop? What about CRCs / frame checks? Do you regenerate the frames? Do you do this at routed interfaces? What if there's only layer 2 involved?
LegionMammal978 10 hours ago [-]
> And how do you tell the difference between cut-off packets and an MTU drop?
You don't, apart from enforcing a bare-minimum MTU for sanity's sake. If your jumbo-size packets are getting randomly cut off by a middlebox, then they probably aren't stable at that size anyway.
Veserv 10 hours ago [-]
Packets do not get “cut-off” normally. That is kind of the point. Some protocols allow transparent fragmentation, but the fragments need to encode enough information for reconstruction, so you can still detect “less data received than encoded on send”.
You do not need bit error detection because you literally truncated the packet. The data is already lost. But in the process you learned it was due to MTU limits which is very useful. Protocols are already required to be robust to garbage that fails bit error detection anyways, so it is not “required” to always have valid integrity tags. You could transparently re-encode bit error detection on the truncated packet if you so desire to ensure data integrity of the “MTU resulted in truncation” packet that you are now forwarding, but again, not necessary.
Any end-to-end protocol that encodes the intended data size in-band can use this technique across truncating transport layers. And any protocol which does so already requires implementations not to blindly trust the in-band value; otherwise you get trivial buffer overflows. So all non-grossly-insecure client implementations should already be able to safely handle MTU truncation if they received it (they would just not be able to use it for MTU discovery until they are updated). The only thing you need is for routers to truncate instead of drop, and then you can slowly update client implementations to take advantage of the new feature, since this middlebox change should not break any existing implementations unless they are inexcusably insecure.
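Concretely, "do not blindly trust the in-band value" boils down to something like this on the receive path (illustrative, not any specific protocol):

    #include <stddef.h>
    #include <string.h>

    /* Copy at most the bytes that actually arrived, regardless of what the
     * header claims.  The claimed length is only used to *detect* truncation,
     * never to size a read or a copy. */
    static size_t safe_extract(char *dst, size_t dst_cap,
                               const char *pkt, size_t received,
                               size_t claimed_len)
    {
        size_t n = claimed_len;
        if (n > received) n = received;   /* truncated (or lying) packet */
        if (n > dst_cap)  n = dst_cap;    /* never overrun our own buffer */
        memcpy(dst, pkt, n);
        return n;
    }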
ikiris 8 hours ago [-]
I don’t think you understand what normal looks like if you start forwarding damaged frames like this, because you can’t tell the difference. That was the point.
Veserv 8 seconds ago [-]
I literally have no idea what you are talking about. You can send garbage packets that conform to no known protocol on the internet. You can get more bit errors or perfect bit errors that make your bit error detection pass while still forwarding corrupt payloads. Transport protocols and channels must be and are robust to this.
“Damaged” frames and frame integrity only matter if you need the contents of the entire packet to remain intact. Which you explicitly do not when truncating.
The only new problem that arises is that maybe the in-band length information or headers get corrupted resulting in misinterpreting the truncation that actually occurred. And again, you already need to be robust to garbage. And you can just change my proposal to recompute the integrity tag on the truncated data if you think that really matters.
cryptonector 14 hours ago [-]
> Path MTU discovery has not been enthusiastically embraced
Ugh. I don't understand this. Especially passive PMTUD should just be rolled out everywhere. On Linux it still defaults to disabled! https://sourcegraph.com/search?q=context%3Aglobal+repo%3A%5E...
PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages. The second worst offender is, IMO, (data center and ISP) routers that generate ICMP replies in their CPU, meaning large packets hit a rate-limited exception punt path out of the switch ASIC over to the cheapest CPU they could find to put in the box. If too many people are hitting that path at the same time, (maybe) no reply for you.
A rarer case, but really frustrating to debug, was when we had an L2 switch in the path with a lower MTU than the routers it was joining together. Without an IP level stack, there is no generation of ICMP messages and that thing just ate larger packets. The even stranger case was when there was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then tried to send those over a VPNish network interface that absolutely couldn't handle that. Even if the network in the middle bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.
cryptonector 6 minutes ago [-]
> PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages.
Passive PMTUD does NOT depend on ICMP messages.
toast0 1 hour ago [-]
> The even stranger case was when there was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then tried to send those over a VPNish network interface that absolutely couldn't handle that. Even if the network in the middle bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.
This is an old Linux TCP offloading bug: large receive offload smooshes the inbound packets together, and then the result is too big to forward.
I had to track down the other side of this. FreeBSD used to resend the whole send queue if it got a too-big message, even if the size did not change. Sending it all at once made it pretty likely that the broken forwarder would get packets close enough together to do LRO, which resulted in enough large-packet sends to show up as network problems.
I don't remember where the forwarder seemed to be, somewhere far away, IIRC.
Hikikomori 8 hours ago [-]
They recently started supporting PMTUD on TGW. But it wasn't a big deal really, as it adjusted MSS instead.
immibis 7 hours ago [-]
L2 not generating errors is expected behaviour: all ports on the L2 network are supposed to have the same MTU set.
mkj 11 hours ago [-]
Would that help with UDP, or only TCP?
cryptonector 5 minutes ago [-]
You can implement passive PMTUD with UDP if you like. It's more work for you, but it's perfectly doable.
Because UDP is only a very thin layer, each layer on top (e.g., QUIC) has to implement PLPMTUD itself; although, recently the IETF standardised a way to extend UDP to have options, and PLPMTUD is also specified for that: https://datatracker.ietf.org/doc/draft-ietf-tsvwg-udp-option...
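For the UDP case on Linux, a hedged sketch of the socket-level plumbing involved: force DF, read the kernel's current path-MTU estimate via IP_MTU, and treat EMSGSIZE on send as the "too big" signal. The actual probing and loss-detection logic of PLPMTUD still has to live in the application (or in QUIC, etc.):

    #include <errno.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Assumes fd is a *connected* UDP socket; IP_MTU is only meaningful then. */
    static int current_path_mtu(int fd)
    {
        int mode = IP_PMTUDISC_DO;   /* set DF; kernel rejects over-PMTU sends */
        setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof mode);

        int pmtu = 0;
        socklen_t len = sizeof pmtu;
        if (getsockopt(fd, IPPROTO_IP, IP_MTU, &pmtu, &len) < 0)
            return -1;
        return pmtu;                 /* kernel's current estimate for this path */
    }

    /* A datagram bigger than the current estimate fails fast with EMSGSIZE
     * instead of silently fragmenting -- the cue to shrink and retry. */
    static ssize_t send_dgram(int fd, const void *buf, size_t len)
    {
        ssize_t n = send(fd, buf, len, 0);
        if (n < 0 && errno == EMSGSIZE)
            fprintf(stderr, "%zu bytes exceeds path MTU (currently %d)\n",
                    len, current_path_mtu(fd));
        return n;
    }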
2OEH8eoCRo0 8 minutes ago [-]
Do you count the frame preamble?
posnet 14 hours ago [-]
"Jumbogram", an IPv6 packet with the Jumbo Payload option set, allowing for an frame size of up to 2³²-1 bytes.
At 10Gbps it would take 3.4 seconds just to serialize the frame.
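For anyone checking the arithmetic: (2³² − 1) bytes ≈ 3.44 × 10¹⁰ bits, and at 10¹⁰ bit/s that is roughly 3.44 seconds on the wire, ignoring preamble, inter-frame gap, and other framing overhead.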
hugmynutus 14 hours ago [-]
Luckily 400Gb/s NICs are already on the market [1]
[1] https://docs.broadcom.com/doc/957608-PB1
> The speed of light in glass or fiber-optic cable is significantly slower, at approximately 194,865 kilometers per second. The speed of voltage propagation in copper is 224,844 kilometres per second.
If I understand correctly, the speed of light in an electrical cable doesn't depend on the metal that carries current, but instead depends on the dielectric materials (plastic, air, etc.) between the two conductors?
tonyarkles 18 minutes ago [-]
If I’m interpreting what you’re asking correctly, yes. The velocity factor of a cable doesn't depend on the metal it's made of but rather on the insulator material and the geometry of the cable.
For fibre the velocity factor depends on the refraction index of the fibre.
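A rough consistency check against the numbers quoted above, taking $c \approx 299{,}792$ km/s in vacuum:

    v_{\mathrm{fibre}} = \frac{c}{n}, \qquad v_{\mathrm{cable}} = \frac{c}{\sqrt{\varepsilon_r}}

194,865 km/s implies a refractive index $n \approx 1.54$, typical for silica glass; 224,844 km/s implies $\varepsilon_r \approx 1.78$ for the cable's dielectric, i.e. a velocity factor of about 0.75.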
lucb1e 10 hours ago [-]
Huh? Maybe I'm completely misreading the question, but when they say fiber-optic cable, they do mean optic. It's not an "electrical cable"; there is no metal needed in optic communication cables (perhaps for stiffness or whatnot, but not for the communication)
Hikikomori 8 hours ago [-]
>The speed of voltage propagation in copper is 224,844 kilometres per second.
This part?
beeburrt 12 hours ago [-]
That font size is tiny. If this is your site, maybe consider a larger font size
nayuki 11 hours ago [-]
The site specifies a base font size of 12px. The better practice is to not specify a base font size at all, just taking it from the user's web browser instead. Then, the web designer should specify every other font size and box dimension as a scaled version of the base font size, using units like em/rem/%, not px.
Related reading: https://joshcollinsworth.com/blog/never-use-px-for-font-size
It's the same size as HN: 12px. HN looks larger to me for some reason, but I can't figure out why: when I overlay a quote someone posted here over the website with half transparency in GIMP, the text is clearly the same height. Some letters are wider, some narrower, but the final length of the 8 words I sampled is 360px on HN vs. 358px on that website (so differences basically cancel out)
This is on Firefox/Debian, in case that means something for installed fonts. I see that site's CSS specifies Verdana and Arial, names that sound windowsey to me but I have no idea if my system has (analogous versions to) those
tomthecreator 9 hours ago [-]
There's a PDF version linked at the top of the article, it's actually much better typeset.
usefulcat 11 hours ago [-]
Given the subject of TFA, this seems appropriate in a meta sort of way.
jeffbee 3 hours ago [-]
The efficiency argument applies to private flows mostly. In terms of overall network traffic, the huge majority takes place between peers that share a local or private network. Internetworking as such has a relatively small share of total flows. So large frame sizes are beneficial in the context where they are also not problematic, and path MTU discovery is not beneficial in the context where it has many drawbacks. It seems as though the current state is pretty much optimal.
nullc 10 hours ago [-]
Is there any convenient way to tell linux distributions that the local subnet can handle 9k jumbos (or whatever) but that anything routed out must be 1500?
I currently have this solved by just sticking hosts on two vlans, one that has the default route and another that only has the jumbo capable hosts. ... but this seems kinda stupid.
fbouynot 9 hours ago [-]
Yes, you can set your interface MTU to 9000 and assign a 1500 MTU to the routes themselves.
throw0101b 3 hours ago [-]
> […] and assign a 1500 MTU to the routes themselves.
See "mtu" option in ip-route(8):
* https://man.archlinux.org/man/ip-route.8.en#mtu
The BSDs also have an "-mtu" option in route(8):
* https://man.freebsd.org/cgi/man.cgi?route(8)
* https://man.openbsd.org/route
See "mtu" option in ip-route(8):
* https://man.archlinux.org/man/ip-route.8.en#mtu
The BSDs also have an "-mtu" option in route(8):
* https://man.freebsd.org/cgi/man.cgi?route(8)
* https://man.openbsd.org/route