File System Latency文件系统延迟

On 2011年05月17日, in linux, by netoearth

When considering I/O performance from the perspective of applications, file systems can prove to be a better target for analysis than disks. With modern file systems using more DRAM-based cache and performing more asynchronous disk I/O, what the application experiences can vastly differ to what the disks are doing. I’ll demonstrate this by examining I/O performance of a MySQL database at both the disk and file system level.

In this post I’ll discuss the commonly used approach, disk I/O analysis using iostat(1M), and in the following post I’ll show file system analysis using DTrace.

Disk I/O

Historically there has been a focus on performance of the storage level devices – the disks themselves – when trying to understand how disk I/O affects application performance. This includes the use of tools such as iostat(1M), which prints various I/O statistics for disk devices. Either system administrators are running iostat(1M) directly at the command line, or are using it via another interface. Some monitoring software will use iostat(1M) to fetch the disk statistics which they then archive and plot (eg, munin).

Here is an iostat(1M) screenshot from a Solaris-based system which is running a MySQL database as well as other applications (with some extraneous output lines trimmed):

# iostat -xnz 1 10
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.1   33.8   78.8 1208.1  0.0  1.0    0.0   27.8   0   4 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  175.4    0.0 22449.9    0.0  0.0  1.1    0.0    6.1   0  82 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  106.1  379.2 13576.5 22036.7  0.0  2.4    0.0    4.9   0  85 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  139.9    0.0 17912.6    0.0  0.0  1.0    0.0    6.8   0  82 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  176.1    0.0 22538.0    0.0  0.0  1.0    0.0    5.7   0  85 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26619.9    0.0  0.0  1.9    0.0    9.2   0  99 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26624.4    0.0  0.0  1.7    0.0    8.2   0  95 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  106.0  368.9 13566.1 26881.3  0.0  2.7    0.0    5.7   0  93 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  146.0    0.0 18691.3    0.0  0.0  0.9    0.0    6.4   0  88 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   84.2    0.0 10779.2    0.0  0.0  0.5    0.0    6.1   0  42 c0t1d0

These statistics show an average I/O service time (“asvc_t”) between 4.9 and 9.2 milliseconds and a high percent busy (“%b”) rate, reaching 99% busy in one interval. The MySQL database on this server is suffering a degree of slow queries (queries longer than 1 second), and based on the iostat(1M) output you may be guessing why – the disks. For an application, this looks like an AWFUL system to be running on.

iostat(1M) shows disk level performance.

And I do love iostat(1M), especially the “-x” form of output. I’ll save going into details for a later post. Here, I’ll explain why iostat(1M) can become pretty confusing to interpret when your perspective is an application.

Other Processes

I’ll explain a simple issue to start with. I said earlier that the system above was running MySQL and other applications. So, the heavy disk I/O could be caused by – and largely affecting – some other application. What if MySQL was caching very well in DRAM and hardly using the disks, while a nightly backup processes walked the entire file system rattling the disks? You might see output like the above, with the slow MySQL queries I mentioned caused by something else entirely.

iostat(1M)’s disk I/O is from all processes, which may not be the one you’re looking at.

I’ve worked this issue before, creating psio and later iosnoop and iotop to identify disk I/O by-process (and filename). But they can’t always identify the process and file, as happens with the ZFS file system. Fixing this isn’t easy, and can lead to the question – is looking at the disks the right place to start with? We can get a handle on this by considering the I/O stack.

I/O Stack

Typically, applications are not performing I/O to the disks directly, rather they do so via a file system. And file systems work hard to prevent applications from suffering disk I/O latency directly, such as by using DRAM to buffer writes and to cache and prefetch reads.

File System in the I/O Stack

Here is an example I/O stack showing key components of file system, based loosely on ZFS:

                   +---------------+     +---------------+
                   |  Application  | ... |  Application  |
                   |    mysqld     |     |    (others)   |
                   +---------------+     +---------------+
                    reads ^ | writes     //
                          | |           //
                          | |          //      user-land
       -------------system calls---------------------------
                          | |        //        kernel
          |  File System  | |      //      |
          |               | V     //       |
          |        [ access locks ]        |
          |               | |              |
          |      +------------------+      |
          |      |    DRAM cache    |      |
          |      +------------------+      |
          |          ^    ^ |    |         |
          |          |    | |    V         |
          | +----------+  | |  +---------+ |
          | | prefetch |  | |  | flusher | |
          | +------^---+  | |  +----|----+ |
          |      +-|----------------V-+    |
          |      |   on-disk layout   |    |
          |      |           +--------|    |
          |      |           |metadata|    |
          |      +--------------------+    |
          |        |      | |   |^  |      |
                   |      | |   ||  |
                   |      | V   V|  V
              |   Disk Device Drivers   |      iostat(1M)

This diagram shows that there are other sources of disk I/O other than what the application is directly (synchronously) requesting. For instance, on the write-side, the application may “dirty” buffers in the file system cache and consider the I/O completed, however the file system doesn’t perform the disk I/O until much later – seconds – by batching together dirty data and writing them in bulk. This was evident in the previous iostat(1M) output with the bursts of writes (see the “kw/s” column) – which does not reflect how the application is actually performing writes.

I/O Inflation

Apart from other sources of disk I/O adding to the confusion, there is also what happens to the direct I/O itself – particularly with the on-disk layout layer. This is a big topic I won’t go into much here, but I’ll enumerate a single example of I/O inflation to consider:

  1. An application performs a 1 byte write to an existing file.
  2. The file system identifies the location as part of a 128 Kbyte file system “record”, which is not cached (but the metadata to reference it is).
  3. The file system requests the record be loaded from disk.
  4. The disk device layer breaks the 128 Kbyte read into smaller reads suitable for the device.
  5. The disks perform multiple smaller reads, totaling 128 Kbytes.
  6. The file system now replaces the 1 byte in the record with the new byte.
  7. Sometime later, the file system requests the 128 Kbyte “dirty” record be written back to disk.
  8. The disks write the 128 Kbyte record (broken up if needed).
  9. The file system writes new metadata; eg, references (for Copy-On-Write), or atime (access time).
  10. The disks perform more writes.

So, while the applicaiton performed a single 1 byte write, the disks performed multiple reads (128 Kbytes worth) and even more writes (over 128 Kbytes worth).

Application I/O to the file system != Disk I/O

It can be even worse than that example, such as if the required metadata to reference the location had to be read in the first place, and if the file system (volume manager) employed RAID with a stripe size larger than the record size.

I/O Deflation

Having mentioned inflation, I should mention deflation possibilities in case it isn’t obvious. These include caching in DRAM to satisfy reads, and cancellation of buffered writes (data rewritten before it has been flushed to disk).

I’ve recently been watching production ZFS file systems running with over a 99.9% cache hit rate, meaning that only a trickle of reads are reaching disk.

Considering Disk I/O for understanding Application I/O

To summarize so far: looking at how hard the disks are rattling, like we did above using iostat(1M), is far away from what the target application is actually experiencing. Application I/O can be inflated and deflated by the file system by the time it reaches disks, making it difficult to make a direct connection between disk and application I/O. Disk I/O also includes requests from other file system components, such as prefetch, the background flusher, on-disk layout metadata, as well as other users of the file system (other applications). If you were to identify an issue at the disk level, it’s hard to tell how much it matters to the application of interest.

iostat(1M) includes other file system I/O, which may not directly affect the performance of the target application.

Summarizing the previous issues:

iostat(1M) shows disk level performance, not file system performance.

This is part 2 of File System Latency, a series on storage I/O performance from the application perspective. In the first part I explained why disk I/O is difficult to associate with an application, and why it can be altered from what the application requested. In this part I’ll focus more on the file system, and show why I/O that never reaches disk can be important to study.

File System Latency

What matters most to the application is the latency of its requests to the file system, which can be measured here (see part 1 for the full diagram):

                   |  Application  |
                   |    mysqld     |
           HERE --> reads ^ | writes
                          | |
                          | |                  user-land
       -------------system calls---------------------------
                          | |                  kernel
          |  File System  | |              |
          |               | V              |
          |        [ access locks ]        |
          |               | |              |

Measuring I/O and latency at this level is much more interesting – it is directly affecting the application.

If we are also able to examine application context during the latency – whether it’s during a sensitive code-path – then we can answer with certainty whether there is a file system issue that is affecting the application or not, and whether that’s worth investigating further. Being able to answer this early in the diagnosis phase can be immensely useful, so that the right path is taken sooner.

Apart from having better relevance to the application than disk I/O, file system I/O is also inclusive of other phenomenon that can be worth examining, including cache hits, lock latency, additional queueing and disk-cache flush latency.

DRAM cache hits

Reads and writes may be served from the file system DRAM cache instead of disk (if present, enabled and eligible for the I/O type). For reads:

                   |  Application                |
                   |    mysqld                   |
                    read |||   ^^ cache    ^ cache
                requests |||   || hits     | misses
                         |||   ||          |
                         |||   ||          |               <-- file system I/O
          |  File System |||   ||          |          |
          |              VVV   ||          |          |
          |   +----------------------------------+    |
          |   |     DRAM cache                   |    |
          |   +----------------------------------+    |
          |                         |    ^            |
                               disk |    |
                               read |    |                 <-- disk I/O
                                    V    |
                                  |  disks |

Since these cache hits don’t reach the disk, they are never observable using iostat(1M). They will be visible when tracing at the file system level, and if the file system cache is performing well you may see orders of magnitude more I/O than at the disk level.

If cache hits are “good” I/O, it may not be immediately obvious why we’d even want to see them. Here are three reasons to consider:

  • load analysis: by observing all requested I/O you know exactly how the application is using the file system – the load applied – which may lead to tuning or capacity planning decisions.
  • unnecessary work: identifying I/O that shouldn’t be sent to the file system to start with, whether it’s a cache hit or not.
  • latency: cache hits may not be as fast as you think.

What if the I/O was slow due to file system lock contention, even though no disk I/O was involved?

Lock latency

File systems employ locks for data integrity in a multi-threaded environment, and the latency incurred with lock contention will be included when tracing at this level. The time waiting to acquire the lock could dominate I/O latency as seen by the application:

                   |  Application                  |
                   |    mysqld                     |
                      I/O |    I/O latency     ^ I/O
                  request |  <-------------->  | completion
                          |                    |
                          |                    |
          |  File System  |                    |      |
          |               |   lock wait        |      |
          |               X-------------->.    |      |
          |                               |I/O |      |
          |                               |wait|      |
          |                               V    |      |

In which case, the disks paint a rosier picture than reality, as their latency could be dwarfed by lock latency. While this is unlikely to happen – it could happen – and if you are chasing down mysterious I/O latency you don’t want to leave any stone unturned.

High lock wait (contention) could happen for a number of reasons, including extreme I/O conditions or file system bugs (remembering that the file system is software, and software can have bugs). Lock and other sources of file system latency won’t be visible from iostat(1M).

If you only use iostat(1M), you may be flying blind regarding lock and other file system issues.

There is one other latency source that iostat(1M) does directly show: waiting on one of the I/O queues. I’ll dig into queueing a little here, and explain why we need to return back to file system latency.


I/O can be queued in the kernel before it is issued to disk.

I’ve been trying to describe file systems generically to avoid getting sidetracked into implementation and internals, but will dip into ZFS a little here. On Solaris-based systems an I/O can queue in the ZFS I/O pipeline (“ZIO pipeline”), and then ZFS vdev queues, and finally a SCSI “sd” block device driver queue. iostat(1M)’s “wsvc_t” does show queue latency for the sd driver (and the “wait” and “%w” columns relate to this queue as well), but these don’t reflect ZFS queueing.

         |  ZFS      V                                         |
         |        +---------------------------+                |
         |        |            ARC            |                |
         |        +---------------------------+                |
         |           |     __________________                  |
         |           +--> ()  ZIO pipeline   )---.             |
         |                 `""""""""""""""""'    |             |
         |                       vdev queue      V             |
         |                    +-+-+-+-+-+-+-+-+-+-+            |
         |                    |_|_|_|_|_|_|_|_|_|_|            |
         |                     |     <---                      |
                               V  sd queue
                              +-+-+-+-+-+-+-+-+       wsvc_t
                              |_|_|_|_|_|_|_|_|	    <-------- iostat(1M)
                                     --->    |
                                             V                  /
                                          .------.             /
                                          `------'    asvc_t  /
                                          | disk |  <--------'

So, iostat(1M) gets a brief reprieve – it’s not just disk I/O latency. The block device driver queue latency can be seen as well.

However, similarly to disk I/O latency, queue latency may not matter unless the application is waiting for that I/O to complete. To understand this from the application perspective, we are still best served by measuring latency at the file system level – which will include any queueing latency from any queue that the application I/O has synchronously waited for.

Cache Flush

Some file systems take care to make sure that synchronous write I/O – where the application has requested that the write not complete until it is on stable storage – really is on stable storage. (I think ZFS is currently the only one to do this.) It can work by sending SCSI cache flush commands to the disk devices, and not completing the application I/O until the cache flush command has completed. This ensures that the data really is on stable storage – and not buffered.

                   |  Application                         |
        synchronous write |       I/O latency        ^ synchronous write
                  request |  <-------------------->  | completion
                          |                          |
                          |                          |
          |  File System  |                          |        |
          |               |                          |        |
                     disk |        ^     | SCSI      |
                    write | iostat |     | cache     |
                          | asvc_t |     | flush     |
                          | <----> |     | command   |
                          |        |     |           |
              |  Disk     |        |     |           |      |
              |  Device   V        |     V           |      |
              |   +-------------------------------------+   |
              |   | on-disk DRAM cache                  |   |
              |   +-------------------------------------+   |
              |                          |  actual   ^      |
              |                          |  disk     |      |
              |                          |  latency  |      |
              |                          | <-------> |      |
              |                          V           |      |
              |                         .-------------.     |
              |                         `-------------'     |
              |                         |    disk     |     |
              |                         `-------------'     |

The application is actually waiting for the SCSI flush command to complete, which is not (currently) included in iostat(1M). This means the application can be suffering write latency issues – that really are caused by disk latency – that are invisible via iostat(1M). I’ve wrestled with this issue before, and included scripts in the DTrace book to show the SCSI cache flush latency.

If the latency is measured at the file system interface, this latency will include cache flush commands.

Issues Missing from Disk I/O

Part 1 showed how application storage I/O was confusing to understand from the disk level. In this part I showed some scenarios where issues are just not visible.

This isn’t really a problem with iostat(1M) – it’s a great tool for system administrators to understand the usage of their resources. But the applications are far, far away from the disks – and have a complex file system in-between. For application analysis, iostat(1M) may provide clues that disks could be causing issues, but you really want to measure at the file system level to directly associate latency with the application, and to be inclusive of other file system latency issues.

In the next post I’ll measure file system latency on a running application (MySQL).

As an example of an application to study I picked a busy MySQL server, and I’ll focus on using the DTrace pid provider to examine storage I/O. For an introduction to MySQL analysis with DTrace, see my earlier post on MySQL Query Latency. Here I’ll take that topic further, measuring the file system component of query latency.

File System Latency Distribution

I’ll start by showing some results of measuring this, and then tool used to do it.

This is file system latency within a busy MySQL database, for a 10 second interval:

# ./mysqld_pid_fslatency.d -n 'tick-10s { exit(0); }' -p 7357
Tracing PID 7357... Hit Ctrl-C to end.
MySQL filesystem I/O: 55824; latency (ns):

           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@                               9053
            4096 |@@@@@@@@@@@@@@@@@                        15490
            8192 |@@@@@@@@@@@                              9525
           16384 |@@                                       1982
           32768 |                                         121
           65536 |                                         28
          131072 |                                         6
          262144 |                                         0        

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         1
            8192 |@@@@@@                                   3003
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             13532
           32768 |@@@@@                                    2590
           65536 |@                                        370
          131072 |                                         58
          262144 |                                         27
          524288 |                                         12
         1048576 |                                         1
         2097152 |                                         0
         4194304 |                                         10
         8388608 |                                         14
        16777216 |                                         1
        33554432 |                                         0

This shows the distribution of file system I/O latency in nanoseconds in the left column (“value”), with the number of I/O in that latency range shown in the right column (“count”). Most of the I/O (where the ASCII distribution plot has its spikes) was between 2 and 16 microseconds for the reads, and 8 and 65 microseconds for the writes. That’s fast – and is a strong indication that these reads and writes were to the DRAM cache and not to disk.

The slower time for writes vs reads is probably due to the time to acquire write locks and the buffers to write data to, and to manage the new file system metadata to reference it. I can answer this with more DTrace if needed.

A small handful of the writes, 25 in total, fell in the 4 to 33 millisecond range – the expected time for disk I/O, including a degree of queueing. (If it’s not clear in the above output – 4194304 nanoseconds == 4 milliseconds.) Compared to all the other I/O during tracing shows that the file system cache was running with a hit rate of over 99.9%.

It’s neat to be able to see these system components from the latency distribution, with annotations:

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         1
            8192 |@@@@@@                                   3003
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@            13532   <--- DRAM cache I/O
           32768 |@@@@@                                    2590
           65536 |@                                        370
          131072 |                                         58
          262144 |                                         27
          524288 |                                         12
         1048576 |                                         1
         2097152 |                                         0
         4194304 |                                         10     <--- disk I/O
         8388608 |                                         14
        16777216 |                                         1
        33554432 |                                         0

This is just based on the times; to be certain that the 4+ ms I/O was disk based, more DTrace can confirm.

In summary, this shows that file system I/O is usually lightening fast here, hitting out of DRAM. For an application, this looks like a GREAT system to be running on.

Comparing to iostat(1M)

This is actually the same system examined in part 1 using iostat(1M), and was traced at the same time. I had two shells running, and collected the iostat(1M) output at the exact same time as this DTrace output. As a reminder, the disks were doing this according to iostat(1M):

# iostat -xnz 1 10
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26619.9    0.0  0.0  1.9    0.0    9.2   0  99 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26624.4    0.0  0.0  1.7    0.0    8.2   0  95 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  106.0  368.9 13566.1 26881.3  0.0  2.7    0.0    5.7   0  93 c0t1d0

Which looks awful. But at the file system level, it is great.

During the 10 seconds that both tools were running, this MySQL database experienced multiple slow queries (queries longer than one second). Based on the iostat(1M) output, one may spend a while investigating disk I/O issues, but this would be starting out in the wrong direction. The issue isn’t slow disk I/O. The file system latency distribution show only a trickle reaching disk, and the vast majority of I/O returning at microsecond speeds.

iostat(1M) pointed in the wrong direction for this application issue.

So what are the disks doing? In this example, the reads are mostly from other applications that are running on the same system as this MySQL database. The bursts of writes seen are ZFS transaction group flushes, which are batching writes from MySQL and the other applications for sending to disk later as a group. Some of the disk I/O are other file system types as described in part 1 as well. All of these details were confirmed using more DTrace (which could make for another blog post).

What it isn’t

So what are causing the slow queries? It may seem like we haven’t learned anything about the slow queries yet, but we have – it’s not the disks. We can move onto searching other areas – the database itself, as well as where the time is spent during the slow query (on-CPU or off-CPU, as measured by mysqld_pid_slow.d). The time could be spent waiting on database locks, for example.

Quickly identifying what an issue isn’t helps narrow the search to what it is.

Before you say…

  • There was disk I/O shown in the distribution – couldn’t they all combine to cause a slow query? … Not in this example: the sum of those disk I/O is between 169 and 338 milliseconds, which is a long way from causing a single slow query (over 1 second). If it were a closer call, I’d rewrite the DTrace script to print the sum of file system latency per query (more on this later).
  • Could the cache hits shown in the distribution combine to cause a slow query? … Not in this example, but their sum does get closer. While fast, there were a lot of them. Again, file system latency can be expressed as a sum per query instead of a distribution to identify this with certainty.


The above latency distributions were a neat way of presenting the data, but not the only way. As just mentioned, a different presentation of this data would be needed to really confirm that slow queries were caused by the file system: specifically, a sum of file system latency per query.

It was getting the latency data in the first place that was previously difficult – which DTrace now lets us do. The presentation of that data can be what we need to effectively answer questions, and DTrace lets us present it as totals, averages, min, max, and event-by-event data as well, if needed.

Distribution Script

The previous example traced I/O inside the application using the DTrace pid provider, and printed out a distribution plot of the file system latency. The tool that was executed, mysqld_pid_fslatency.d, is a DTrace script that is about 50 lines long (including comments). I’ve included it here, enumerated:


     1	#!/usr/sbin/dtrace -s
     2	/*
     3	 * mysqld_pid_fslatency.d  Print file system latency distribution.
     4	 *
     5	 * USAGE: ./mysqld_pid_fslatency.d -p mysqld_PID
     6	 *
     7	 * TESTED: these pid-provider probes may only work on some mysqld versions.
     8	 *	5.0.51a: ok
     9	 *
    10	 * 27-Mar-2011
    11	 */
    13	#pragma D option quiet
    15	dtrace:::BEGIN
    16	{
    17		printf("Tracing PID %d... Hit Ctrl-C to end.\n", $target);
    18	}
    20	pid$target::os_file_read:entry,
    21	pid$target::os_file_write:entry,
    22	pid$target::my_read:entry,
    23	pid$target::my_write:entry
    24	{
    25		self->start = timestamp;
    26	}
    28	pid$target::os_file_read:return  { this->dir = "read"; }
    29	pid$target::os_file_write:return { this->dir = "write"; }
    30	pid$target::my_read:return       { this->dir = "read"; }
    31	pid$target::my_write:return      { this->dir = "write"; }
    33	pid$target::os_file_read:return,
    34	pid$target::os_file_write:return,
    35	pid$target::my_read:return,
    36	pid$target::my_write:return
    37	/self->start/
    38	{
    39		@time[this->dir] = quantize(timestamp - self->start);
    40		@num = count();
    41		self->start = 0;
    42	}
    44	dtrace:::END
    45	{
    46		printa("MySQL filesystem I/O: %@d; latency (ns):\n", @num);
    47		printa(@time);
    48		clear(@time); clear(@num);
    49	}

This script traces functions in the mysql and innodb source that perform reads and writes to the file system: os_file_read(), os_file_write(), my_read() and my_write(). A little poking around this version of MySQL (5.0.51a) showed that these were causing all the file system I/O, which along with their function names and use in the code made them attractive locations to trace.

Script Caveats

Since this is pid provider-based, these functions are not considered a stable interface – and this script may not work on other versions of MySQL. This would happen if the function names were changed, removed, or other file system functions added. It shouldn’t be too hard to locate the new locations if that has happened, and fix the script.

Besides the stability of this script, other caveats are:

  • Overhead: there will be additional overhead (extra CPU cycles) for DTrace to instrument MySQL and collect this data. This should be minimal, especially considering the code-path instrumented – file system I/O. See my post on pid provider overhead for more discussion on this type of overhead.
  • CPU latency: the times measured by this script include CPU dispatcher queue latency.

I’ll explain that last note in more detail:

CPU Latency

These MySQL functions use system calls to perform the I/O, which will block and voluntarily context switch the thread off-CPU and put it to sleep until the I/O completes. When the I/O completes, the thread is woken up but it may need to wait its turn on-CPU if there are threads with a higher priority already running (it will probably have had a priority boost due to the scheduler to help its chances at preemption). Waiting its turn is the “CPU dispatcher queue latency”, and if the CPUs are heavily saturated with work that could be milliseconds. This is included in the time that mysqld_pid_fslatency.d prints out.

Showing the time components with CPUs at saturation:

                |           Application          |
                |             mysqld             |
                     |     FS I/O latency     ^
                     | <--------------------> |
                     |                        |   user-land
     -----------------------system calls ----------------------
                     |                        |   kernel
        | File       |                        |     |
        | System     |  I/O wait     CPU wait |     |
        |            V <--------->   <------> |     |
        |            X.............+----------+     |
        |            |             ^                |
        | thread is: | off-CPU     | runnable       |
        |            |             |                |
            disk I/O |             | disk I/O
             request |             | completion
                     V             |
                    |     disk      |

While this could make interpreting the measured file system I/O latency confusing, you don’t want to be running the system in this state to start with. If the CPUs are at saturation, then the application could be affected at random times with involuntary context switches, apart from the additional dispatcher queue latency at the end of I/O.

Identifying CPU saturation is usually straightforward with standard operating system tools (or at least, the lack of CPU %idle); the best tool on Solaris-based systems would be “prstat -mL” to examine the percent of time threads spent waiting on the CPU dispatcher queues (“LAT”). This is a much better measurement, as it also catches other cases that can’t be seen via the lack of %idle (eg, dispatcher queue latency due to processes reaching their CPU caps).

Slow Query Logger

Apart from examining file system latency as a distribution, it may be desirable to express it as a portion of the query time, so that “slow” queries can be clearly attributed to file system latency. The following script does this, by tracing both the query latency and summing file system latency, for any query where the total file system latency is over a threshold:


     1	#!/usr/sbin/dtrace -s
     2	/*
     3	 * mysqld_pid_fslatency_slowlog.d  Print slow filesystem I/O events.
     4	 *
     5	 * USAGE: ./mysql_pid_fslatency_slowlog.d mysqld_PID
     6	 *
     7	 * This traces mysqld filesystem I/O during queries, and prints output when
     8	 * the total I/O time during a query was longer than the MIN_FS_LATENCY_MS
     9	 * tunable.  This requires tracing every query, whether it performs FS I/O
    10	 * or not, which may add a noticable overhead.
    11	 *
    12	 * TESTED: these pid-provider probes may only work on some mysqld versions.
    13	 *	5.0.51a: ok
    14	 *
    15	 * 27-Mar-2011
    16	 */
    18	#pragma D option quiet
    20	inline int MIN_FS_LATENCY_MS = 1000;
    22	dtrace:::BEGIN
    23	{
    24		min_ns = MIN_FS_LATENCY_MS * 1000000;
    25	}
    27	pid$1::*dispatch_command*:entry
    28	{
    29		self->q_start = timestamp;
    30		self->io_count = 0;
    31		self->total_ns = 0;
    32	}
    34	pid$1::os_file_read:entry,
    35	pid$1::os_file_write:entry,
    36	pid$1::my_read:entry,
    37	pid$1::my_write:entry
    38	/self->q_start/
    39	{
    40		self->fs_start = timestamp;
    41	}
    43	pid$1::os_file_read:return,
    44	pid$1::os_file_write:return,
    45	pid$1::my_read:return,
    46	pid$1::my_write:return
    47	/self->fs_start/
    48	{
    49		self->total_ns += timestamp - self->fs_start;
    50		self->io_count++;
    51		self->fs_start = 0;
    52	}
    54	pid$1::*dispatch_command*:return
    55	/self->q_start && (self->total_ns > min_ns)/
    56	{
    57		this->query = timestamp - self->q_start;
    58		printf("%Y filesystem I/O during query > %d ms: ", walltimestamp,
    59		    MIN_FS_LATENCY_MS);
    60		printf("query %d ms, fs %d ms, %d I/O\n", this->query / 1000000,
    61		    self->total_ns / 1000000, self->io_count);
    62	}
    64	pid$1::*dispatch_command*:return
    65	/self->q_start/
    66	{
    67		self->q_start = 0;
    68		self->io_count = 0;
    69		self->total_ns = 0;
    70	}

A key difference with this script is that it only examines file system I/O if they are called during a query, as checked on line 38 (which checks that a thread-local variable q_start was set, which is only true during a query). The previous script, mysqld_pid_fslatency.d, showed all file system I/O latency – whether it was during a query or for another task in the database.

To capture some sample output, I modified line 20 to reduce the threshold to 100 milliseconds:

# ./mysqld_pid_fslatency_slowlog.d 29952
2011 May 16 23:34:00 filesystem I/O during query > 100 ms: query 538 ms, fs 509 ms, 83 I/O
2011 May 16 23:34:11 filesystem I/O during query > 100 ms: query 342 ms, fs 303 ms, 75 I/O
2011 May 16 23:34:38 filesystem I/O during query > 100 ms: query 479 ms, fs 471 ms, 44 I/O
2011 May 16 23:34:58 filesystem I/O during query > 100 ms: query 153 ms, fs 152 ms, 1 I/O
2011 May 16 23:35:09 filesystem I/O during query > 100 ms: query 383 ms, fs 372 ms, 72 I/O
2011 May 16 23:36:09 filesystem I/O during query > 100 ms: query 406 ms, fs 344 ms, 109 I/O
2011 May 16 23:36:44 filesystem I/O during query > 100 ms: query 343 ms, fs 319 ms, 75 I/O
2011 May 16 23:36:54 filesystem I/O during query > 100 ms: query 196 ms, fs 185 ms, 59 I/O
2011 May 16 23:37:10 filesystem I/O during query > 100 ms: query 254 ms, fs 209 ms, 83 I/O

In the few minutes this was running, there were 9 queries longer than 100 milliseconds due to file system I/O. With this output, we can immediately identify the reason for those slow queries – they spent most of their time waiting on the file system. Reaching this conclusion with other tools is much more difficult and time consuming – if possible (and practical) at all.

DTrace can be used to positively identify slow queries caused by file system latency.

But this is about more than DTrace – it’s about the metric itself – file system latency. Since this has been of tremendous use so far, it may make sense to add file system latency to the slow query log (requiring a MySQL source code change). If you are on MySQL 5.6 and later, it looks like you may also be able to get this information from the wait/io events from the new performance schema additions (I haven’t tried). If that isn’t viable, or you are on older MySQL, or a different application entirerly (MySQL was just my example application), you can keep using DTrace to dynamically fetch this information from the running source.

Interpreting Totals

For the queries in the above output, most of the query latency is due to file system latency (eg, 509 ms out of 538 ms = 95%). The last column printed out the I/O count, which helps answer the next question: is the file system latency due to many I/O, or a few slow I/O? For example:

  • The first line of output showed 509 milliseconds of file system latency from 83 I/O, which works out to be about 6 milliseconds on average. Just based on the average, this could mean that most of these were cache misses causing random disk I/O. The next step may be to investigate doing fewer file system I/O in the first place, by caching more in MySQL.
  • The fourth line of output shows 152 milliseconds of file system latency from a single I/O. This line of output is more alarming than any of the others, as it shows the file system returning with very high latency (this system is not at CPU saturation). Fortunately, this may be an isolated event.

If those descriptions sound a little vague, that’s because we’ve lost so much data when summarizing as an I/O count and a total latency. The script has achieved its goal – of identifying the issue – but to investigate the issue further I need to return to the distribution plots like those used by mysqld_pid_fslatency.d. By examining the full distribution, I could confirm if the 152 ms I/O was as isolated as it looks, or if in fact slow I/O are responsible for all of the latency seen above (eg, for the first line, was it 3 slow I/O + 80 fast I/O = 83 I/O?).

Script Caveats

Caveats and notes for this script are:

  • This script has a higher level of overhead than mysqld_pid_fslatency.d, since it is tracing all queries via the dispatch_command() function, and not just file system functions.
  • CPU latency: the file system time measured by this script include CPU dispatcher queue latency (as explained earlier).
  • The dispatch_command() function is matched as “*dispatch_command*”, since the full function name is the C++ signature (eg, “_Z16dispatch_command19enum_server_commandP3THDPcj”, for this build of MySQL).
  • Instead of using “-p PID” at the command line, this script fed in the PID as $1 (macro variable). This was done as the script is intended to be left running for a while (hours), during which it may be desirable to continue investigating MySQL with other DTrace programs. Only one “-p PID” script can be run at a time (“-p” locks the process, and other instances get the error “process is traced”). By using $1 for the long running script, shorter -p based ones can be run in the meantime.

I’ll finish with one final note about these scripts: since they are pid provider-based, they can work from inside Solaris Zones or Joyent SmartMachines. A minor exception is with how I executed the first script: I added “-n ‘tick-10s { exit(0); }’” at the command line to exit after 10 seconds, which currently doesn’t work reliably in those environments due to a bug (tick only sometimes fires). Until that bug is fixed, drop that statement from the command line and it will still work fine, and require a Ctrl-C to end tracing.

Considering File System Latency

By examining latency at the file system, application issues can be immediately identified as file system (and probably disk) based, or not. This can set the investigation down the right path immediately, and not the wrong direction as was being suggested by iostat(1M)’s view of busy disks.

Two pid provider-based DTrace scripts were introduced in this post to do this: mysqld_pid_fslatency.d for summarizing the distribution of file system latency, and mysqld_pid_fslatency_slowlog.d for printing slow queries due to the file system.

The pid provider isn’t the only way to measure file system latency: it’s also possible from the syscall layer and from the file system code in the kernel. I’ll demonstrate those methods in the next post, and discuss how they differ to the pid provider method.

Comments are closed.