===Last time…===
In my [last post|http://lopsa.org/node/1711] I talked about some of
the more common ways to get performance numbers out of your storage:
__hdparm__, __dd__, and __bonnie++__.
These tools are pretty good at what they do: measure the performance of
a single drive, single LUN, or single filesystem. For many sysadmins,
that’s all you need to care about. Now we’ll look at some more advanced
tools for measuring performance in different situations.
===Why we need other tools===
One project I’ve been working on recently is determining how much data we
can push from a single host out to multiple shelves of disk storage.
This host has multiple Fibre Channel adapters with dedicated storage
behind each port. Of course, it would be easy if I could just run __dd__
to a single LUN and then multiply that result by the number of FC ports
I have.
Lots of things would be easier if everything scaled linearly. That’s not
the case, though. There are many factors that can determine how
performance scales: how well the kernel parallelizes I/O requests,
how saturated your PCI-express buses are, how fast your CPU is, how
the kernel’s I/O caching algorithm works, etc.
Beyond the case of multiple I/O streams from a single host, we can also
think about parallel I/O from multiple hosts to single or multiple
chunks of backend storage. The simple case that most sysadmins run
into is NFS server performance: multiple hosts all talking to one NFS server.
Think of a web farm reading its static content from a central NetApp
fileserver. We can use the tools I’ve already discussed to test
performance of a single client, but what if all the clients are connected
with gigabit Ethernet while your server has a 10-gigabit Ethernet NIC and
multiple 4-gigabit Fibre Channel connections to storage? We’d like
a way to examine overall performance of the system.
Parallel and/or clustered filesystems are gaining a lot of popularity
these days. Filesystems such as [Lustre|http://www.lustre.org],
[GPFS|http://www-03.ibm.com/systems/clusters/software/gpfs/index.html],
[GFS|http://www.redhat.com/gfs/], [GlusterFS|http://www.gluster.org/] and
others all enable multiple hosts to write to a single shared filesystem
with multiple backend servers handling the I/O.
With more advanced environments, we need to look at some
more advanced tools. In this post, I’ll be talking about
two that I’ve used: [xdd|http://www.ioperformance.com/] and
[IOR|http://www.cs.sandia.gov/Scalable_IO/ior.html]. I generally use
__xdd__ to test performance of a single host (although it has the ability
to test multi-host performance). I use __IOR__ to test performance across
multiple machines to shared filesystems. It uses MPI and has multiple
different I/O methods intended to emulate various HPC workloads.
===xdd===
__xdd__ is pretty easy to use. You can think of it like __dd__ on
steroids. Like __dd__, you can use it to simply write chunks of data to
disk, varying things like the block size, input and output files, etc.
__xdd__ also allows you to specify multiple targets, which can be block
devices or files on a filesystem. Another advantage __xdd__ has over
__dd__ is the ability to do DirectIO, bypassing the operating system’s
page cache.
I’ll also point out that, like __dd__, __xdd__ does “no” sanity checking
for you. It’s very easy to accidentally overwrite your OS partitions.
I’d always recommend doing your benchmarks on a dedicated test/development
system that won’t (significantly) impact your life should you fat-finger
something.
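One cheap habit that helps: before aiming a write test at a raw device, double-check what the box actually sees and whether your intended target is backing anything mounted. Nothing __xdd__-specific here, just the usual suspects:
{{{
# Two quick checks before writing to a raw device: list the disks the
# kernel knows about, then make sure the target has no mounted filesystems.
root@sif:/tmp# cat /proc/partitions
root@sif:/tmp# mount | grep sda
}}}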
First, let’s run a quick __xdd__ test that should emulate what I did with __dd__ before:
{{{
root@sif:/tmp# ./xdd.linux -blocksize 1M -reqsize 1 -numreqs 2048 -targets 1 /dev/sda
Machine hardware type, i686
Number of processors on this system, 1
Page size in bytes, 4096
Number of physical pages, 258916
Megabytes of physical memory, 1011
Seconds before starting, 0
Target[0] Q[0], /dev/sda
Target directory, “./”
Process ID, 10787
Thread ID, -1242899568
Processor, all/any
Read/write ratio, 0.00, 100.00
Throttle in MB/sec, 0.00
Throttle in MB/sec, 0.00
Per-pass time limit in seconds, 0
Blocksize in bytes, 1048576
Request size, 1, blocks, 1048576, bytes
Number of Requests, 2048
Start offset, 0
Pass Offset in blocks, 0
I/O memory buffer is a normal memory buffer
I/O memory buffer alignment in bytes, 4096
Data pattern in buffer, ‘0x00’
Data buffer verification is disabled.
Direct I/O, disabled
Seek pattern, sequential
Seek range, 1048576
Preallocation, 0
Queue Depth, 1
Timestamping, disabled
Delete file, disabled
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 1 1 2147483648 2048 26.632 80.637 76.90 0.0130 0.00 write 1048576
}}}
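(For reference, the rough __dd__ equivalent of that run would be something like the command below; I’m reconstructing it from memory, so the exact flags in the last post may have differed.)
{{{
# Roughly the same workload: 2048 sequential 1MB writes straight to the
# raw device, going through the page cache just like the xdd run above.
root@sif:/tmp# dd if=/dev/zero of=/dev/sda bs=1M count=2048
}}}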
So, those results match pretty well with the write results we got with __dd__. Let’s check out some read tests next:
{{{
root@sif:/tmp# ./xdd.linux -blocksize 1048576 -reqsize 1 -numreqs 2048 -targets 1 /dev/sda -op read
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 1 1 2147483648 2048 27.445 78.246 74.62 0.0134 0.00 read 1048576
}}}
Like in the __dd__ tests, read speeds are a couple of MB/s slower than
writes, but nothing significant. It’s always nice to verify results
with a different tool as a sanity check.
===Changing your I/O patterns===
Now, I’d like to show another test to illustrate the importance of
knowing how your applications behave. I re-ran __xdd__ with the same
amount of data, but this time using 1kB writes instead of 1MB:
{{{
root@sif:/tmp# ./xdd.linux -blocksize 1024 -reqsize 1 -numreqs 2097152 -targets 1 /dev/sda
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 1 1 2147483648 2097152 97.361 22.057 21540.01 0.0000 0.00 write 1024
}}}
Wow! We only got a little over one-quarter of the performance with small
writes. That’s pretty rotten. But is that “really” how well the disk
can stream small writes?
It turns out the answer is no. The disk is a lot worse at handling these
small transactions than we think. To illustrate this point, I re-ran
__xdd__ with the same parameters, but this time enabling DirectIO
to bypass any caching of writes by the operating system. The results:
{{{
root@sif:/tmp# ./xdd.linux -blocksize 1024 -reqsize 1 -numreqs 2097152 -targets 1 /dev/sda -dio
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 1 1 2147483648 2097152 235.199 9.130 8916.49 0.0001 0.00 write 1024
}}}
Bypassing the OS cache brings the write speeds to under 10MB/s.
That’s less than 13% of our original large-block cached I/O. It turns
out that for large-block writes to this single drive, DirectIO gives
results similar to cached I/O:
{{{
root@sif:/tmp# ./xdd.linux -blocksize 1048576 -reqsize 1 -numreqs 2048 -targets 1 /dev/sda
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 1 1 2147483648 2048 26.632 80.637 76.90 0.0130 0.00 write 1048576
root@sif:/tmp# ./xdd.linux -blocksize 1048576 -reqsize 1 -numreqs 2048 -targets 1 /dev/sda -dio
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 1 1 2147483648 2048 27.403 78.366 74.74 0.0134 0.00 write 1048576
}}}
===Multiple drives===
These results are all well and good for your average small server with
one disk. But what if we want more than one drive? This is another case
where __xdd__ gives us a better picture than __dd__. Let’s check it out:
{{{
root@sif:/tmp# ./xdd.linux -blocksize 1048576 -reqsize 1 -numreqs 2048 -targets 2 /dev/sda /dev/sdc
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 2 2 4294967296 4096 26.937 159.444 152.06 0.0066 0.01 write 1048576
}}}
In this test, I wrote 2GB to “each” target, for a total of 4GB. __xdd__
basically spins up a thread for each target, effectively doing them in
parallel, or at least as parallel as one can get on my single-core system.
Pretty cool. But what if you have applications that, say, write lots of
small blocks to different files on disk? We can easily simulate that too.
For this test, I created two filesystems, mounted at __/mnt/sda__ and
__/mnt/sdc__. I’ll have __xdd__ create 4 files on each one (the __{}__ is
shell brace expansion), for a total of 8 simultaneous writes across two
filesystems.
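If you’re curious, the setup was nothing fancy; something along these lines (the filesystem type here is just an example, it isn’t what we’re measuring). The last command shows what the brace expansion hands to __xdd__:
{{{
# Example setup: one filesystem per drive, mounted under /mnt.
root@sif:/tmp# mkfs.ext3 /dev/sda && mkfs.ext3 /dev/sdc
root@sif:/tmp# mkdir -p /mnt/sda /mnt/sdc
root@sif:/tmp# mount /dev/sda /mnt/sda && mount /dev/sdc /mnt/sdc
# The shell expands the braces into a plain list of target paths:
root@sif:/tmp# echo /mnt/sda/{1,2,3,4}
/mnt/sda/1 /mnt/sda/2 /mnt/sda/3 /mnt/sda/4
}}}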
{{{
root@sif:/tmp# ./xdd.linux -blocksize 1048576 -reqsize 1 -numreqs 1024 -targets 8 /mnt/sda/{1,2,3,4} /mnt/sdc/{1,2,3,4}
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 8 8 8589934592 8192 80.512 106.692 101.75 0.0098 0.02 write 1048576
}}}
Not bad: we only lost ~33% of our raw-disk performance by adding in a
filesystem and simultaneous writes. Maybe we should take a look at some
smaller write sizes…
{{{
root@sif:/tmp# ./xdd.linux -blocksize 1024 -reqsize 1 -numreqs 1048576 -targets 8 /mnt/sda/{1,2,3,4} /mnt/sdc/{1,2,3,4}
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 8 8 8589934592 8388608 362.816 23.676 23120.83 0.0000 0.01 write 1024
}}}
So again, we see that small writes really bring the performance down.
The OS can batch the writes to some extent, but even then you’re paying
for a lot of small copies in and out of kernel space to feed the cache.
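If you’re wondering how much the page cache is still propping up that number, the same run with DirectIO is a one-flag change. I won’t paste another wall of output, but based on the raw-device tests above I’d expect it to drop the same way:
{{{
# Same 8-file, 1kB-block workload as above, with -dio added so the writes
# bypass the page cache (output omitted).
root@sif:/tmp# ./xdd.linux -blocksize 1024 -reqsize 1 -numreqs 1048576 -targets 8 /mnt/sda/{1,2,3,4} /mnt/sdc/{1,2,3,4} -dio
}}}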
One last __xdd__ example. Let’s make this look more like a real-world
application: say, 512k I/Os with a 50/50 mix of reads and writes to
16 files spread across two filesystems:
{{{
root@sif:/tmp# ./xdd.linux -blocksize 524288 -reqsize 1 -numreqs 2048 -targets 16 /mnt/sda/{1,2,3,4,5,6,7,8} /mnt/sdc/{1,2,3,4,5,6,7,8} -rwratio 50
T Q Bytes Ops Time Rate IOPS Latency %CPU OP_Type ReqSize
Combined 16 16 17179869184 32768 281.347 61.063 116.47 0.0086 0.02 mixed 524288
}}}
(Quick note: since we’re reading files, I had to create those files
before setting rwratio… I just did that with the same command line,
but with __-op write__ instead.)
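That pre-creation run looks like this; it’s the same command, with __-op write__ swapped in for the __-rwratio__ flag:
{{{
# Pre-create the 16 files so the mixed read/write run has data to read back.
root@sif:/tmp# ./xdd.linux -blocksize 524288 -reqsize 1 -numreqs 2048 -targets 16 /mnt/sda/{1,2,3,4,5,6,7,8} /mnt/sdc/{1,2,3,4,5,6,7,8} -op write
}}}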
So, we got a pretty respectable 61MB/s with that combined workload.
Thus ends our quick tour of __xdd__. You can run __xdd.linux__ with no
options to see the myriad of other things you can do with it, or check
out the lengthy manual.
===IOR===
I’m going to run through a pretty quick tour of __IOR__. I don’t have a
real distributed environment at home, so all these tests will be run
on a single machine against a filesystem mounted over NFS, which isn’t
very indicative of a clustered filesystem. The goal is just to show how
one would run a tool like this.
Well, that was the intent, anyway. It seems the version of IOR I have
has a bug causing it to segfault, or else it’s simply operator error.
Maybe next week 🙂
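In the meantime, here’s the shape of the run I had in mind. Treat it as a hedged sketch rather than a working example, since I couldn’t get my copy to finish; the path, rank count, and sizes are just placeholders:
{{{
# Hypothetical IOR invocation: 4 MPI ranks doing POSIX I/O, each writing
# and then reading back its own 256MB file in 4MB transfers on an NFS mount.
mpirun -np 4 ./IOR -a POSIX -w -r -F -b 256m -t 4m -o /mnt/nfs/ior-testfile
}}}
The number to watch in IOR’s output is the aggregate bandwidth across all ranks, which is what matters when you’re sizing a shared filesystem.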
===What’s next===
Hopefully, some quick IOR examples.
If not, I’ll likely move on to showing some networking benchmarks I like
to run, and maybe even dive into tuning TCP for high-latency networks.
===Final thoughts===
* I can’t say this enough – know your applications. Benchmarks are silly if you’re not testing the right stuff.
* __xdd__ can also simulate random I/O patterns with the __-randomize__ flag. Play around with it (there’s a starting point sketched below).
* Play around with cached I/O vs. DirectIO. The results can be interesting.
* There are many layers from your application down to the hardware that can affect how your applications run. Running these benchmarks can sometimes help narrow down the cause of slow or erratic performance.
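Since the whole post has been command lines anyway, here’s one more as a starting point for that __-randomize__ experiment. It’s untested on my box, so consider it a sketch; check the manual for the seek-related options if the access pattern doesn’t come out the way you expect:
{{{
# Sketch only: a small-block variant of the earlier single-drive read test
# with -randomize added and DirectIO enabled so the page cache does not
# hide the seeks (4kB requests, 2GB total).
root@sif:/tmp# ./xdd.linux -blocksize 4096 -reqsize 1 -numreqs 524288 -targets 1 /dev/sda -op read -dio -randomize
}}}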