Discussion:
[Dump-devel] [patch] Use sys_clone under Linux to share I/O contexts between dump processes
Jeff Moyer
2008-11-17 22:19:21 UTC
Permalink
Hi,

dump performs poorly when run under the CFQ I/O scheduler. The reason
for this is that the dump command interleaves I/O between two (or
three?) cooperating processes. This is about the worst case scenario
you can get for CFQ, as the I/O access pattern within each process is
sequential. Thus, CFQ will idle for a number of milliseconds waiting
for the current process to issue more I/O before switching to the next.

Now, this behaviour can be changed with tuning. However, if the dump
command simply shared I/O contexts between cooperating processes, CFQ
could make more intelligent decisions about I/O scheduling.

So, here are the numbers, running under 2.6.28-rc3.

deadline 82241 kB/s
cfq 34143 kB/s
cfq-shared 82241 kB/s

cfq-shared denotes that the dump utility was patched with the attached
patch to share I/O contexts. As you can see, with a very little bit of
code change, we can drastically increase the performance of dump under
CFQ (which is the default I/O scheduler used in a number of
distributions).

For more information on the underlying problems, you can refer to the
following kernel discussion:
http://lkml.org/lkml/2008/11/9/133

Comments are appreciated.

Cheers,

Jeff

diff -up ./dump/tape.c.orig ./dump/tape.c
--- ./dump/tape.c.orig 2005-08-20 17:00:48.000000000 -0400
+++ ./dump/tape.c 2008-11-17 16:40:42.575792509 -0500
@@ -187,6 +187,40 @@ static sigjmp_buf jmpbuf; /* where to ju
static int gtperr = 0;
#endif

+/*
+ * Determine if we can use Linux' clone system call. If so, call it
+ * with the CLONE_IO flag so that all processes will share the same I/O
+ * context, allowing the I/O schedulers to make better scheduling decisions.
+ */
+#ifdef __linux__
+#include <syscall.h>
+
+#ifndef SYS_clone
+#define fork_clone_io fork
+#else /* SYS_clone */
+#include <linux/version.h>
+
+/*
+ * Kernel 2.5.49 introduced two extra parameters to the clone system call.
+ * Neither is useful in our case, so this is easy to handle.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,49)
+/* clone_flags, child_stack, parent_tidptr, child_tidptr */
+#define CLONE_ARGS SIGCHLD|CLONE_IO, 0, NULL, NULL
+#else
+#define CLONE_ARGS SIGCHLD|CLONE_IO, 0
+#endif /* LINUX_VERSION_CODE */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <unistd.h>
+#undef _GNU_SOURCE
+pid_t fork_clone_io(void);
+#endif /* SYS_clone */
+#else /* __linux__ not defined */
+#define fork_clone_io fork
+#endif /* __linux__ */
+
int
alloctape(void)
{
@@ -755,6 +789,16 @@ rollforward(void)
#endif
}

+#ifdef __linux__
+#ifdef SYS_clone
+pid_t
+fork_clone_io(void)
+{
+ return syscall(SYS_clone, CLONE_ARGS);
+}
+#endif
+#endif
+
/*
* We implement taking and restoring checkpoints on the tape level.
* When each tape is opened, a new process is created by forking; this
@@ -801,7 +845,7 @@ restore_check_point:
/*
* All signals are inherited...
*/
- childpid = fork();
+ childpid = fork_clone_io();
if (childpid < 0) {
msg("Context save fork fails in parent %d\n", parentpid);
Exit(X_ABORT);
@@ -1017,7 +1061,7 @@ enslave(void)
}

if (socketpair(AF_UNIX, SOCK_STREAM, 0, cmd) < 0 ||
- (slaves[i].pid = fork()) < 0)
+ (slaves[i].pid = fork_clone_io()) < 0)
quit("too many slaves, %d (recompile smaller): %s\n",
i, strerror(errno));
Marc Thomas
2009-03-12 00:54:36 UTC
Permalink
Hi Jeff,

Patch applies cleanly to dump-0.4b41, but it doesn't then compile on my
system (Slackware 12.2, kernel 2.6.27.7, CFQ scheduler):

tape.c: In function 'fork_clone_io':
tape.c:797: error: 'CLONE_IO' undeclared (first use in this function)
tape.c:797: error: (Each undeclared identifier is reported only once
tape.c:797: error: for each function it appears in.)

I'm no programmer, but I did find that changing the include for
<sched.h> to <linux/sched.h> fixed that.

Performance is indeed improved, in my case from 11801 kB/s to 13434 kB/s
dumping an LVM snapshot of the ext3 root filesystem on md-raid
(mirrored) to another LVM volume in the same md-raid (same physical
spindles).

I suspect the faster the destination device, the more apparent the
performance improvement.

Stelian - were there any objections to including this patch?

BR,
Marc
Post by Jeff Moyer
Hi,
dump performs poorly when run under the CFQ I/O scheduler. The reason
for this is that the dump command interleaves I/O between two (or
three?) cooperating processes. This is about the worst case scenario
you can get for CFQ, as the I/O access pattern within each process is
sequential. Thus, CFQ will idle for a number of milliseconds waiting
for the current process to issue more I/O before switching to the next.
Now, this behaviour can be changed with tuning. However, if the dump
command simply shared I/O contexts between cooperating processes, CFQ
could make more intelligent decisions about I/O scheduling.
So, here are the numbers, running under 2.6.28-rc3.
deadline 82241 kB/s
cfq 34143 kB/s
cfq-shared 82241 kB/s
cfq-shared denotes that the dump utility was patched with the attached
patch to share I/O contexts. As you can see, with a very little bit of
code change, we can drastically increase the performance of dump under
CFQ (which is the default I/O scheduler used in a number of
distributions).
For more information on the underlying problems, you can refer to the
http://lkml.org/lkml/2008/11/9/133
Comments are appreciated.
Cheers,
Jeff
Jeff Moyer
2009-03-12 13:16:06 UTC
Permalink
Post by Marc Thomas
Hi Jeff,
Patch applies cleanly to dump-0.4b41, but it doesn't then compile on
tape.c:797: error: 'CLONE_IO' undeclared (first use in this function)
tape.c:797: error: (Each undeclared identifier is reported only once
tape.c:797: error: for each function it appears in.)
I'm no programmer, but I did find that changing the include for
<sched.h> to <linux/sched.h> fixed that.
Hi, Marc. Thanks for testing, and for the suggested fix. There is
another problem with compiling the patch on systems that don't have the
CLONE_IO flag at all, and I need to address that.

Cheers,
Jeff
Jeff Moyer
2009-03-23 19:07:01 UTC
Permalink
Post by Jeff Moyer
Post by Marc Thomas
Hi Jeff,
Patch applies cleanly to dump-0.4b41, but it doesn't then compile on
tape.c:797: error: 'CLONE_IO' undeclared (first use in this function)
tape.c:797: error: (Each undeclared identifier is reported only once
tape.c:797: error: for each function it appears in.)
I'm no programmer, but I did find that changing the include for
<sched.h> to <linux/sched.h> fixed that.
Hi, Marc. Thanks for testing, and for the suggested fix. There is
another problem with compiling the patch on systems that don't have the
CLONE_IO flag at all, and I need to address that.
Here is a follow-up patch that should work everywhere. Comments and
suggestions are welcome. I'd love to hear at least *some* feedback from
the dump maintainer on this.

Thanks,
Jeff

Index: dump/tape.c
===================================================================
RCS file: /cvsroot/dump/dump/dump/tape.c,v
retrieving revision 1.90
diff -u -p -r1.90 tape.c
--- dump/tape.c 4 Jun 2008 19:27:48 -0000 1.90
+++ dump/tape.c 23 Mar 2009 19:05:00 -0000
@@ -187,6 +187,41 @@ static sigjmp_buf jmpbuf; /* where to ju
static int gtperr = 0;
#endif

+/*
+ * Determine if we can use Linux' clone system call. If so, call it
+ * with the CLONE_IO flag so that all processes will share the same I/O
+ * context, allowing the I/O schedulers to make better scheduling decisions.
+ */
+#ifdef __linux__
+/* first, pull in the header files that define sys_clone and CLONE_IO */
+#include <syscall.h>
+#define _GNU_SOURCE
+#include <sched.h>
+#include <unistd.h>
+#undef _GNU_SOURCE
+
+/* If either is not present, fall back on the fork behaviour */
+#if ! defined(SYS_clone) || ! defined (CLONE_IO)
+#define fork_clone_io fork
+#else /* SYS_clone */
+/* CLONE_IO is available, determine which version of sys_clone to use */
+#include <linux/version.h>
+/*
+ * Kernel 2.5.49 introduced two extra parameters to the clone system call.
+ * Neither is useful in our case, so this is easy to handle.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,49)
+/* clone_flags, child_stack, parent_tidptr, child_tidptr */
+#define CLONE_ARGS SIGCHLD|CLONE_IO, 0, NULL, NULL
+#else
+#define CLONE_ARGS SIGCHLD|CLONE_IO, 0
+#endif /* LINUX_VERSION_CODE */
+pid_t fork_clone_io(void);
+#endif /* SYS_clone */
+#else /* __linux__ not defined */
+#define fork_clone_io fork
+#endif /* __linux__ */
+
int
alloctape(void)
{
@@ -755,6 +790,16 @@ rollforward(void)
#endif
}

+#ifdef __linux__
+#if defined(SYS_clone) && defined(CLONE_IO)
+pid_t
+fork_clone_io(void)
+{
+ return syscall(SYS_clone, CLONE_ARGS);
+}
+#endif
+#endif
+
/*
* We implement taking and restoring checkpoints on the tape level.
* When each tape is opened, a new process is created by forking; this
@@ -801,7 +846,7 @@ restore_check_point:
/*
* All signals are inherited...
*/
- childpid = fork();
+ childpid = fork_clone_io();
if (childpid < 0) {
msg("Context save fork fails in parent %d\n", parentpid);
Exit(X_ABORT);
@@ -1017,7 +1062,7 @@ enslave(void)
}

if (socketpair(AF_UNIX, SOCK_STREAM, 0, cmd) < 0 ||
- (slaves[i].pid = fork()) < 0)
+ (slaves[i].pid = fork_clone_io()) < 0)
quit("too many slaves, %d (recompile smaller): %s\n",
i, strerror(errno));
Stelian Pop
2009-03-23 21:20:40 UTC
Permalink
Post by Jeff Moyer
Here is a follow-up patch that should work everywhere. Comments and
suggestions are welcome. I'd love to hear at least *some* feedback from
the dump maintainer on this.
Sorry for my silence on this. I have been rather busy lately and not had
much time for dump.

But I have no objections on your patch, and you can consider it applied,
and part of the next version of dump, whenever I get the time to
do a new release (as a matter of fact, there are quite a few pending
changes in CVS so I might just force me to do a release sometime soon).

Thanks for the patch !

Stelian.
--
Stelian Pop <***@popies.net>
Jeff Moyer
2009-03-24 13:04:33 UTC
Permalink
Post by Stelian Pop
Post by Jeff Moyer
Here is a follow-up patch that should work everywhere. Comments and
suggestions are welcome. I'd love to hear at least *some* feedback from
the dump maintainer on this.
Sorry for my silence on this. I have been rather busy lately and not had
much time for dump.
But I have no objections on your patch, and you can consider it applied,
and part of the next version of dump, whenever I get the time to
do a new release (as a matter of fact, there are quite a few pending
changes in CVS so I might just force me to do a release sometime soon).
Thanks for the patch !
Thanks for taking time to look it over, Stelian!

Cheers,
Jeff

Loading...