I need to transfer about 40TB from one ZFS box to another because of a problem with the initial setup of box 1 that left it without redundant RAID, and it now has a failing disk. ZFS points this out quite nicely, but we have to get all that data off before the disk fails completely, even though it’s a backup box. (Apparently you can’t replace a disk in a pure stripe without ZFS having conniptions.)
So, what to do? zfs send/recv is a marvelous tool for this, especially combined with a pure network stream like ttcp or netcat. First, start up 20 listening processes on the receiver, each on a different port, piped into zfs recv. Second, take a recursive snapshot of all of the filesystems so you have a consistent starting point: zfs snapshot -r zpool1@mig1. Third, send all of these filesystems over in parallel using a quick bash script that increments the port number, roughly like the sketch below.
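Here’s roughly what that plumbing looks like. This is a reconstruction rather than the exact script I used: the destination pool name (tank2), the starting port (9000), the hostname "receiver", and the filesystems.txt list are placeholders, and netcat flag spellings vary between flavors (some want nc -l port instead of nc -l -p port, and some need -q 0 or -N to close the connection when zfs send hits EOF).

On the receiver:

#!/bin/bash
# One netcat listener per filesystem, each piped into zfs recv.
# filesystems.txt lists the child filesystems of zpool1, one per line,
# generated once with:  zfs list -H -r -o name zpool1 > filesystems.txt
port=9000
while read -r fs; do
  dest="tank2/${fs#zpool1/}"                 # map zpool1/foo -> tank2/foo
  nc -l -p "$port" | zfs recv -F "$dest" &
  port=$((port + 1))
done < filesystems.txt
wait

On the sender:

#!/bin/bash
# One zfs send per filesystem, each to its own port on the receiver.
port=9000
while read -r fs; do
  zfs send "${fs}@mig1" | nc receiver "$port" &
  port=$((port + 1))
done < filesystems.txt
wait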
All of these zfs sends are now running in parallel between two x4500s, each connected with 4x1G LACP links. It’s not going as fast as I would have thought, so I check things out. I look at the receiving-end throughput using my little sunaggrbps script (below) and it shows me something odd:
0 384
1 377033000
2 454
3 384
The first column is the interface index; the second is the number of bytes received over the last 3 seconds. We’re doing very well on e1000g1, but the others are seeing only a small amount of traffic, partly LACP control packets and other minor chatter.
Since LACP is fully dynamic, I have the opportunity to play around a little bit. So I shut down the switch port corresponding to e1000g1 on the switch side. When I do this, I see some different behavior:
1 0
2 219965912
3 112587165
0 1348
1 0
2 383779635
3 283032155
0 384
1 0
2 358813400
3 301547626
0 448
1 0
2 379984136
3 273314922
0 640
1 0
2 227159881
3 121488640
Much better! I’m now getting almost twice the throughput! There’s still only a smattering of traffic on e1000g0, and none on the shut-down port, but 2x is better than 1x in my book. I’ve talked with the switch vendor and unfortunately the LACP hashing algorithms are not adjustable, so there’s not much that I can do. Since this is an L2 link, it’s hashing on source and destination MAC addresses. I’m not sure exactly why the traffic lands the way it does, but you’d think it would even out a little bit more.
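To see why a single pair of hosts tends to end up on one link, consider a toy version of that hash. This isn’t the switch’s actual algorithm, just an illustration of how an XOR of the two MAC addresses behaves: with only one source MAC and one destination MAC on the wire, every one of the 20 streams hashes to the same member link.

# Toy L2 hash: XOR the last byte of the source and destination MACs,
# modulo the 4 links in the aggregate. The MAC bytes here are made up.
src=0x5a; dst=0x3c
echo $(( (src ^ dst) % 4 ))    # prints the same link index for every stream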
Moral: check your traffic on LACP links to make sure you really are getting your best balance and utilization. This is usually not a big deal when you have more than two hosts, but sometimes it’s worth checking.
sunaggrbps
#!/usr/bin/perl
# sunaggrbps: every 3 seconds, print the bytes received on each e1000g
# interface since the last pass, by diffing the rbytes64 kstat counters.
my $pass = 0;
my (@strs, @ostr);
while (my $d = `kstat -m e1000g -n mac -s rbytes64`) {
    @strs = split(/\s+/, $d);
    if ($pass++ > 0) {
        # In the split-up kstat output, tokens 3, 13, 23 and 33 are the
        # instance numbers for e1000g0-3, and the token six places later
        # is that instance's rbytes64 counter.
        foreach my $i (3, 13, 23, 33) {
            print "$strs[$i] " . ($strs[$i+6] - $ostr[$i+6]) . "\n";
        }
    }
    @ostr = @strs;
    sleep 3;
}
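To run it, just save it as sunaggrbps on the box you want to watch and make it executable:

chmod +x sunaggrbps
./sunaggrbps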