Over the last few weeks I’ve been looking into TSO and GSO, mostly the various issues encountered with offloading in virtualized environments such as KVM and Xen.
The problems seem to span every hypervisor:
- KVM Example 1 : KVM Guest OS does not respect MTU
- KVM Example 2 : KVM Bridged Mode TCP/IP Performance
- KVM Example 3 : KVM Guest with Virtio having network issues
- Xen Example 1 : Turning off NIC Acceleration vastly improves networking
- Xen Example 2 : Improving performance by disabling TCP Offloading
- Cross Example : The mystery of TCP segmentation offload bug (VMware, KVM, Xen)
- GSO Example : GSO being on results in not adhering to MTU or MSS size (larger MTU than normal)
In addition, in my past work we had to turn sg offloading off to get proper internal networking performance on Xen. Otherwise, domU <-> domU traffic on the same dom0 would result in speeds of 20 KB/s.
My favorite blog, ‘Lessons from the trenches’, has even encountered the ‘death packet’ issue, in which networking on CentOS guests broke with TSO and GSO on. They’ve shut both off at the host level, and suggest users do the same to avoid issues.
Finally, Red Hat suggests that if you encounter any kind of performance issue on virtualized guests running VirtIO, you disable TSO and GSO on the host node as a best practice:
# ethtool -k interface
# ethtool -K interface gso off
# ethtool -K interface tso off
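With more than one interface on a host node, the same two commands can be generated in a loop. The following is only a rough sketch: the function name `offload_off_cmds` and the interface name `eth0` are made up for illustration, and nothing is executed, so the commands can be reviewed first.

```shell
#!/bin/sh
# Print the ethtool commands that disable TSO and GSO on each interface given.
# Nothing is executed here; the output is meant to be reviewed first.
# The function name and the interface names below are illustrative assumptions.
offload_off_cmds() {
    for iface in "$@"; do
        for feature in tso gso; do
            printf 'ethtool -K %s %s off\n' "$iface" "$feature"
        done
    done
}

# Example: list the commands for a physical NIC and a bridge.
offload_off_cmds eth0 br0
```

Once the list looks right, it can be piped to a root shell, for example `offload_off_cmds eth0 br0 | sh`. Settings changed with ethtool generally do not survive a reboot, so they would also need to go into whatever network configuration your distribution uses.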
I’ve gotten feedback that GSO is not an issue, and yes, the checksums only appear incorrect because they’re calculated later. But this is still extra strain on the system that isn’t necessary, and once you scale to 1 Gbps or more of traffic, those incorrect checksums add up to poor performance. Add internal networking and an Ethernet adapter or two, and you’re suffering from performance issues at the very least.
Worse, if any of your equipment does not handle MSS or MTU fragmentation properly, then with GSO enabled you will see erratic packet sizes higher than the configured MSS or MTU, along with the network issues that follow, such as the 2900 seen in this article.
In short, turn GSO and TSO off at the host-node level, especially on br0. It’s best practice. Given the bug reports of TSO and GSO (among other offloads, such as sg) causing instability on hypervisors, you should stick with Red Hat’s advice and everyone else’s findings: disable them on the host-node interfaces first, then troubleshoot if you’re still having trouble. At the very least, a feature that is known to cause issues and has no advantages for guests will be gone.
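To confirm the change took effect, the output of `ethtool -k` can be filtered for the two segmentation offloads. This is a minimal sketch under a few assumptions: the helper name `offloads_still_on` is made up, and the sample text is hand-written in the shape of `ethtool -k` output rather than captured from a real host.

```shell
#!/bin/sh
# Read `ethtool -k <iface>` output on stdin and print any segmentation
# offload that is still reported as "on". Prints nothing when both are off.
# The helper name is an illustrative assumption, not part of ethtool.
offloads_still_on() {
    awk -F': *' '
        $1 ~ /^[ \t]*(tcp|generic)-segmentation-offload$/ && $2 ~ /^on/ {
            gsub(/^[ \t]+/, "", $1)
            print $1
        }'
}

# Example with hand-written sample text shaped like ethtool -k output.
printf '%s\n' \
    'Features for br0:' \
    'tcp-segmentation-offload: off' \
    'generic-segmentation-offload: on' \
    'generic-receive-offload: on' | offloads_still_on
```

On a real host this would be run as `ethtool -k br0 | offloads_still_on`. An empty result means both TSO and GSO are reported off for that interface.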
Also, the point isn’t just to hide the UDP checksum errors. If you’re stumped, offloading is likely causing hard-to-reproduce network issues across your host node, be it poor UDP performance or a dying internal network.