On Thu, Aug 8, 2013 at 9:35 PM, John Jasen <[email protected]> wrote:
> You may want to test jumbo frames, just to see what would happen. I
> would expect you to see closer to 10 Gb/s with the same number of
> interrupts.
Results for jumbo frames are below (spoiler: 10 Gbps, same number of
interrupts, 40% CPU0 usage).
> On 08/08/2013 08:26 PM, Maxim Khitrov wrote:
>> Active Processor Cores: All
>
> I would turn that off, or at least make it only dual core.
No effect, results are also below.
>> That's... a bit faster. The CPU in the desktops is Intel i7-3770,
>> which is very similar to the Xeon E3-1275v2. Is this a FreeBSD vs
>> OpenBSD difference?
>
> Could be. It might be worth testing FreeBSD on your packet forwarding
> boxes, just to see if you get similar results.
I installed FreeBSD on a USB flash drive, booted the backup firewall
from that, and ran iperf -c 127.0.0.1 -t 60:
[ 3] 0.0-60.0 sec 373 GBytes 53.4 Gbits/sec
Almost the same as the desktops, so this performance boost is due to
FreeBSD (which keeps all cores at 70% load) and not the hardware.
Now for jumbo frames:
# s1: iperf -s
# c1: iperf -c s1 -t 60 -m
[ 3] 0.0-60.0 sec 69.1 GBytes 9.89 Gbits/sec
[ 3] MSS size 8192 bytes (MTU 8232 bytes, unknown interface)
With MTU set to 9000 along the entire path, a single client can max
out the 10 gigabit link through the firewall. This also settles the
question of PCIe bandwidth: it is not a bottleneck. I just had to double
kern.ipc.nmbjumbo9 to 12800 on all FreeBSD hosts before I could enable
jumbo frames (otherwise I got "ix0: Could not setup receive
structures").
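As a sanity check on the iperf lines above, here is a small sketch that recomputes the bandwidth column from the transfer column. It assumes iperf reports GBytes in units of 2^30 bytes and Gbits/sec in units of 10^9 bits, which appears to match the figures in this thread. The function name is made up for illustration.

```python
# Recompute iperf's "Gbits/sec" from its "GBytes" column.
# Assumption: GBytes are 2**30 bytes, Gbits are 10**9 bits.

def gbits_per_sec(gbytes: float, seconds: float) -> float:
    """Convert a transfer size in GBytes over an interval to Gbits/sec."""
    bits = gbytes * 2**30 * 8
    return bits / seconds / 1e9

# Figures taken from the results above.
print(f"FreeBSD loopback: {gbits_per_sec(373, 60):.2f} Gbits/sec")
print(f"Jumbo frames:     {gbits_per_sec(69.1, 60):.2f} Gbits/sec")
```

The transfer column is rounded to three significant digits, so the recomputed value can differ from the reported one in the last digit.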
Both clients together:
# s1: iperf -s
# s2: iperf -s
# c1: nc gw 1234 ; iperf -c s1 -t 60
# c2: nc gw 1234 ; iperf -c s2 -t 60
[ 3] 0.0-60.0 sec 34.6 GBytes 4.95 Gbits/sec
[ 3] 0.0-60.0 sec 34.5 GBytes 4.94 Gbits/sec
During all of these tests, systat shows 8k interrupts on each
interface, and CPU0 usage is 40% interrupt, 60% idle.
Going back to 1500 MTU, disabling Hardware Prefetcher and Adjacent
Cache Line Prefetch in BIOS has no effect:
# c1->s1
[ 3] 0.0-60.0 sec 29.5 GBytes 4.22 Gbits/sec
# c1->s1, c2->s2
[ 3] 0.0-60.0 sec 14.8 GBytes 2.12 Gbits/sec
[ 3] 0.0-60.0 sec 15.7 GBytes 2.25 Gbits/sec
Same goes for disabling two of the cores:
# c1->s1
[ 3] 0.0-60.0 sec 30.7 GBytes 4.39 Gbits/sec
# c1->s1, c2->s2
[ 3] 0.0-60.0 sec 15.2 GBytes 2.18 Gbits/sec
[ 3] 0.0-60.0 sec 15.2 GBytes 2.17 Gbits/sec
Same with bsd.sp kernel and all but one of the cores disabled:
# c1->s1
[ 3] 0.0-60.0 sec 31.3 GBytes 4.48 Gbits/sec
# c1->s1, c2->s2
[ 3] 0.0-60.0 sec 15.0 GBytes 2.15 Gbits/sec
[ 3] 0.0-60.0 sec 16.1 GBytes 2.30 Gbits/sec
Finally, I went back to all cores enabled, bsd.mp kernel, Hardware
Prefetcher and Adjacent Cache Line Prefetch enabled:
# c1->s1
[ 3] 0.0-60.0 sec 30.9 GBytes 4.43 Gbits/sec
# c1->s1, c2->s2
[ 3] 0.0-60.0 sec 16.8 GBytes 2.40 Gbits/sec
[ 3] 0.0-60.0 sec 14.0 GBytes 2.00 Gbits/sec
As you can see, none of these tweaks had any measurable impact. The
firewall can only handle so many packets per second. To push more
packets through, I need to reduce the per-packet processing overhead.
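To put rough numbers on the packets-per-second limit, here is a back-of-the-envelope sketch. It assumes every packet is a full MTU-sized IP packet and ignores ACKs and Ethernet framing, so the results are estimates only. In the jumbo test iperf reported an MSS of 8192, so the real packet count there is probably somewhat higher than the 9000-byte estimate.

```python
# Rough packets-per-second estimate from iperf throughput and MTU.
# Assumptions: all packets are full-size, Gbits are 10**9 bits,
# ACKs and Ethernet framing overhead are ignored.

def packets_per_second(gbits_per_sec: float, mtu_bytes: int) -> float:
    """Approximate number of full-size IP packets per second."""
    return gbits_per_sec * 1e9 / (mtu_bytes * 8)

# Figures taken from the single-client results above.
default_mtu = packets_per_second(4.43, 1500)
jumbo = packets_per_second(9.89, 9000)
line_rate = packets_per_second(10, 1500)

print(f"MTU 1500 @ 4.43 Gbps: {default_mtu:,.0f} pps")
print(f"MTU 9000 @ 9.89 Gbps: {jumbo:,.0f} pps")
print(f"MTU 1500 @ 10 Gbps:   {line_rate:,.0f} pps")
```

Under these assumptions, the jumbo frame test pushes fewer packets per second than the 1500 MTU test while carrying more than twice the data. That fits with CPU0 being mostly idle in the jumbo test.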
Here's a simple illustration of this fact using just the c1->s1 test:
# pf disabled (set skip on {ix0, ix1}):
[ 3] 0.0-60.0 sec 37.4 GBytes 5.35 Gbits/sec
# pf enabled, no state on ix0:
[ 3] 0.0-60.1 sec 8.28 GBytes 1.18 Gbits/sec
# pf enabled, keep state:
[ 3] 0.0-60.0 sec 30.8 GBytes 4.41 Gbits/sec
# pf enabled, keep state (sloppy):
[ 3] 0.0-60.0 sec 31.2 GBytes 4.46 Gbits/sec
# pf enabled, modulate state:
[ 3] 0.0-60.0 sec 28.3 GBytes 4.05 Gbits/sec
# pf enabled, modulate state scrub (random-id reassemble tcp):
[ 3] 0.0-60.0 sec 25.8 GBytes 3.69 Gbits/sec
The interesting thing about the last test is that systat shows double
the number of interrupts (32k total, 16k per interface) and CPU0 is
about 5% idle instead of the usual 10%. The rest is self-evident: more
work per packet means lower throughput. This is also another
confirmation that the sloppy state tracker has no measurable
performance benefit (4.46 vs 4.41 Gbits/sec).
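For a quick comparison, the sketch below expresses each pf configuration as a percentage of the pf-disabled throughput. The numbers are the c1->s1 results listed above. The dictionary labels and the helper name are just for illustration.

```python
# Throughput of each pf configuration relative to pf disabled,
# using the c1->s1 figures reported above (Gbits/sec).
results = {
    "pf disabled": 5.35,
    "no state on ix0": 1.18,
    "keep state": 4.41,
    "keep state (sloppy)": 4.46,
    "modulate state": 4.05,
    "modulate state + scrub": 3.69,
}

baseline = results["pf disabled"]

def relative_to_baseline(gbps: float) -> float:
    """Return throughput as a percentage of the pf-disabled result."""
    return 100.0 * gbps / baseline

for name, gbps in results.items():
    pct = relative_to_baseline(gbps)
    print(f"{name:<24} {gbps:5.2f} Gbps  {pct:5.1f}%")
```

These are single 60-second runs, so small differences (such as keep state vs sloppy) are likely within run-to-run noise.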
Unless someone has any other ideas on how to reduce the per-packet
processing time, I think ~4.5 Gbps is the most that my hardware can
handle at the default MTU. A bit disappointing, but it was the fastest
CPU that I could get from Lanner and also my first step beyond 1
gigabit.
If OpenBSD starts using multiple cores for interrupt processing in the
future, 10+ Gbps should be easy to achieve. FreeBSD is an option if
performance is critical, but for now I'd rather have all the 4.6+ pf
improvements.