Intro to benchmarking part 1: Disks/storage

===Background===

In my daily work, I’ve noticed that not a lot of attention is given
to benchmarking various subsystems. When a new disk array comes in,
there may be a run or two of bonnie++, but that’s generally it.
Network benchmarks? Generally I don’t see people run them. Maybe you
don’t care; maybe there’s an entire storage team that handles this at
your site. However, if you do care, read on for my experiences.

===Why benchmark at all?===

The answer to this question might seem obvious to some, but maybe not. I’ve often
observed people at work getting requirements of X MB/s for a particular
application or use case. We all know that vendors love to quote theoretical
limits; now try getting their testing methodology out of them. Depending on
the vendor and your relationship with them, that can be a difficult task.
So, we run our own benchmarks and figure out how well the
storage/network/sprocket actually performs in our environment.

Another motivation for running benchmarks is to identify problems.
In one instance, we saw ~4GB/s out of a certain vendor’s storage array
using raw devices, and we bought a number of them to satisfy the
requirements of our application. Once we put a filesystem on top of that
array, we saw a significant performance drop. Running various benchmarks
showed us that the filesystem journal updates behaved essentially like
random small-block I/O, whereas our initial testing had focused on
sequential large-block I/O. Our application performs large sequential
I/Os, but each I/O incurs a journal update, and that killed us.

===Disk benchmarking===
I’ve recently started working on a project that’s all about figuring
out different performance metrics – disk, network, computational, etc.
This has opened up my eyes in a lot of ways.

In this post, I’d like to start talking about disk benchmarking
(since this is what I’m primarily involved in at the moment). Now,
when we’re benchmarking disks, we need to define exactly what we’re
measuring. There are 3 different categories of numbers we want to look at:

* Raw disk I/O with OS caching
* Raw disk I/O without OS caching
* Filesystem I/O

You might wonder why we’d care about raw disk I/O without any OS caching.
The answer is that OS caching can actually kill our performance. One test
I’ve run is I/O across 3 dual-ported FC4 HBAs to 3 different shelves
of disk. Testing showed that running a bunch of simultaneous I/O threads
with OS caching enabled actually brought our throughput down quite a bit.

It also turns out that certain applications and filesystems can bypass the
OS cache entirely by using Direct I/O; Lustre is a good example of this.
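If you want to see the cache effect for yourself, GNU dd can bypass the page
cache with its direct I/O flags. Here’s a minimal sketch; the device name is
just a placeholder, and __iflag=direct__ assumes a reasonably recent coreutils
and a device that supports O_DIRECT:
{{{
# buffered (OS-cached) sequential read from a raw device
dd if=/dev/sdX of=/dev/null bs=1M count=2048

# the same read with the page cache bypassed (O_DIRECT)
dd if=/dev/sdX of=/dev/null bs=1M count=2048 iflag=direct
}}}
Comparing the two runs gives you a quick feel for how much of your throughput
is really the cache talking.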

===General notes===
Some notes before going into specific tools.

* Always keep in mind which of the 3 categories any particular test is showing you
* You should always run tests with no load on the system to get your baselines
** Later on, add a realistic workload and see what happens to your I/O
* Test repeatedly – it’s always possible that something else was going on in the system at the time you ran your numbers
* You should generally always test with a data set at least 2X the size of your RAM
** Especially when you’re testing filesystems – otherwise the data can all go to the OS cache first, and you end up testing memory instead of disk
* It can be neat to have __vmstat 1__ running while you’re doing your tests.
The __bi/bo__ columns show you what’s “actually” being read from/written
to your disks. It can often show you when, e.g., the OS is caching your
writes and then flushing them to disk later. There’s a quick sketch of this
right after the list.
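As a rough sketch of the 2X-RAM rule and the vmstat trick (assuming a Linux
box with /proc/meminfo; adjust to taste):
{{{
# figure out how big the test data set should be (2X physical RAM, in MB)
awk '/MemTotal/ {print int($2 / 1024) * 2 " MB"}' /proc/meminfo

# in another terminal, watch what is actually hitting the disks while you test
vmstat 1
}}}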

===Tools===
Now let’s talk about some of the different tools that can be used to
test the performance of your disk systems.

====hdparm====
__hdparm__ is perhaps the quickest and most basic test you can run. It’s not
really a benchmark, but a utility for viewing and setting drive parameters;
however, it’s got a quick little test mode too, just to give a very
basic idea. For example:
{{{
root@sif:~# hdparm -tT /dev/sda

/dev/sda:
Timing cached reads: 1190 MB in 2.00 seconds = 594.92 MB/sec
Timing buffered disk reads: 224 MB in 3.01 seconds = 74.54 MB/sec
root@sif:~#
}}}

This test is non-destructive, and provides a general idea of read
performance. Obviously, the cached reads are something of a fairy-tale —
you’ll (almost) never see that in real life.
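One more hdparm trick worth knowing, since caching keeps coming up: the drive
itself has a write cache, separate from the OS cache. If I have the flags
right, something like this will show it (read-only as shown, but double-check
the man page before actually setting anything):
{{{
hdparm -W /dev/sda     # show whether the drive's own write cache is enabled
hdparm -I /dev/sda     # dump detailed drive identification info
}}}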

====dd====
Next up on our list is __dd__. dd can again provide some basic numbers
for us. Newer Linux-y versions of dd will give you the transfer speed;
otherwise you’ll have to use __time__ and some basic math skills.
In this example, I’m copying 2048 1MB blocks to and from the disk.
I have 1GB of RAM in this machine, so I use double that for testing.
Note that the write test writes straight over /dev/sda, partition table
and all, so only do this on a disk you don’t mind wiping.

{{{
# read test
root@sif:~# dd if=/dev/sda of=/dev/null bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 27.5104 s, 78.1 MB/s
# write test
root@sif:~# dd if=/dev/zero of=/dev/sda bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 28.2787 s, 75.9 MB/s
root@sif:~#
}}}

These results look fairly consistent with our hdparm results.
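If your dd is too old to print a rate, the __time__ approach I mentioned
looks roughly like this; the arithmetic at the end is just an illustration:
{{{
# older dd: time the run yourself...
time dd if=/dev/sda of=/dev/null bs=1M count=2048
# ...then divide: 2048 MB / elapsed seconds, e.g. 2048 / 27.5 is roughly 74 MB/s
}}}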

====bonnie/bonnie++====

In a very small, very unscientific survey of 2 other sysadmins,
I discovered that [bonnie++|www.coker.com.au/bonnie++/] is the most
well-known disk benchmark among sysadmins.

Unlike __hdparm__ and __dd__, which work great on raw disk devices, __bonnie++__
really insists on having a filesystem there.

Bonnie++ includes 3 basic tests:
* Per-character I/O
* Block I/O
* File ops

For the block I/O tests, it performs write, rewrite, and read tests.

I generally run with the __-f__ flag to disable per-character I/O. It’s
slow and not all that realistic for the apps that run on my systems.
Bonnie++ will automatically size its data to 2X your RAM.

For these tests, I created a basic ext3 filesystem (ext2 plus a journal) using {{{mke2fs -j /dev/sda1}}}
on the drive used in the other tests.
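For completeness, the setup looked roughly like this (partitioning details
omitted, and the mount point is just what I happen to use):
{{{
mke2fs -j /dev/sda1        # ext2 plus a journal, i.e. ext3
mkdir -p /mnt/sda
mount /dev/sda1 /mnt/sda
}}}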

Some results:
{{{
root@sif:~# bonnie++ -u root -d /mnt/sda -f
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sif              2G           63934  10 31543   5           73307   5 203.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
sif,2G,,,63934,10,31543,5,,,73307,5,203.4,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
root@sif:~#
}}}

We again see results similar to our previous two tests. The file
ops tests generally do not provide useful results on today’s machines
until you bump the number of files up way past the default 16k; that’s
what all the “+++++” entries above mean: the operations completed too
quickly for bonnie++ to report a meaningful number. 16k inodes will
generally fit pretty easily into the OS cache, so you need many more
creates to get a reasonable result.
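Bonnie++’s __-n__ flag takes the file count in multiples of 1024, so bumping
it up looks something like this (128 here is an arbitrary starting point,
not a recommendation):
{{{
# 128 * 1024 = 131072 files for the file ops tests
bonnie++ -u root -d /mnt/sda -f -n 128
}}}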

Just for fun, I re-ran my dd tests against a filesystem instead of the raw device:
{{{
root@sif:/mnt/sda# dd if=/dev/zero of=/mnt/sda/testfile bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 33.1636 s, 64.8 MB/s
root@sif:/mnt/sda# dd if=/mnt/sda/testfile of=/dev/zero bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 27.7735 s, 77.3 MB/s
root@sif:/mnt/sda#
}}}

Here we can see that the filesystem definitely adds some overhead for
writes: 64.8 MB/s versus 75.9 MB/s to the raw disk, about 11 MB/s slower.
So, as in the example I mentioned earlier, the filesystem (and its journal)
costs us roughly 15% on writes!

===What’s next===

In my next entry, I’ll look at some other tools for disk benchmarking,
possibly including [XDD|http://www.ioperformance.com/], sgp_dd,
[IOR|http://sourceforge.net/projects/ior-sio/], and
[lmdd|http://www.bitmover.com/lmbench/].

===Final thoughts===

* Know your applications – it makes no sense to benchmark large block writes on a machine that will be serving small, read-mostly web pages
** And of course, your most important benchmark is “your” application!
* Know your benchmarks – what are you really testing?
* You can use __vmstat(8)__ to see if you’re actually hitting disk, or are still in OS-cache land
* Repeat any benchmark you run several times to make sure your results are repeatable – a quick loop like the sketch below does the job
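A minimal sketch of that last point, reusing the dd read test from earlier
(the iteration count is arbitrary):
{{{
# run the read test 5 times and keep just the throughput summary lines
for i in 1 2 3 4 5; do
    # optionally flush the page cache between runs (Linux 2.6.16+):
    # echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/sda of=/dev/null bs=1M count=2048 2>&1 | tail -1
done
}}}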