Discussion: [Dump-devel] Understanding dump's use of multiple processes
Phillip Susi
2010-04-26 15:32:47 UTC
The archives show almost no activity on this list for some time, so I
hope this finds its way to someone. I have been trying to understand
why dump uses multiple processes that seem to burn through a lot of cpu
time when doing a compressed backup with a larger block size ( 512 kb ).

The code is very hard to follow, but it seems like the first process
reads blocks from the disk and writes them to a pipe. The second
process reads from the pipe and compresses the data, then writes it to
another pipe. Finally the third process reads the compressed data from
the pipe, slaps a header on it, and writes it to the tape. Is this
correct? Does the third process attempt to reblock the data so it
always writes fixed size records? If so, how does it do this?
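For what it's worth, the reblocking part is what I imagine looking roughly
like this, though I'm only guessing here (made-up names, nothing taken from
dump's source):

/* My guess at what reblocking would look like, not dump's actual code:
 * buffer arbitrary sized input and only ever write full records. */
#include <string.h>
#include <unistd.h>

#define RECORDSIZE (512 * 1024)         /* e.g. a 512 kb record */

static char record[RECORDSIZE];
static size_t filled;

void reblock_write(int tapefd, const char *buf, size_t len)
{
        while (len > 0) {
                size_t n = RECORDSIZE - filled;

                if (n > len)
                        n = len;
                memcpy(record + filled, buf, n);
                filled += n;
                buf += n;
                len -= n;
                if (filled == RECORDSIZE) {     /* full record: flush it */
                        write(tapefd, record, RECORDSIZE);
                        filled = 0;
                }
        }
}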
Kenneth Porter
2010-04-26 21:18:47 UTC
Post by Phillip Susi
The archives show almost no activity on this list for some time, so I
hope this finds its way to someone.
That's likely an indication that those of us who use dump find it very
stable and have nothing to complain about. (At least, that we're willing to
put effort into fixing. ;))

I'll leave it to Stelian to explain the architecture.
Phillip Susi
2010-04-27 15:44:02 UTC
Post by Kenneth Porter
Post by Phillip Susi
The archives show almost no activity on this list for some time, so I
hope this finds its way to someone.
That's likely an indication that those of us who use dump find it very
stable and have nothing to complain about. (At least, that we're willing to
put effort into fixing. ;))
I'll leave it to Stelian to explain the architecture.
Heheh. I find it quite nice as well, but have been investigating how it
works to see if I can make it faster on modern hardware, since the code
is SO old.

After parsing the code a lot more and walking through it a bit with gdb
( NOT easy when it uses multiple processes... gdb tends to get confused
) it seems that it has a master process and 3 slave processes. The
master walks the inodes to be dumped, adds a 1kb inode record to the
queue, then adds each of the 4kb disk blocks to the queue one at a time
until the queue is full. More specifically, it adds the block number to
the queue rather than actually reading the block. The size of the queue
depends on the -b parameter, which defaults to only 10.
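If I'm reading it right, a queue entry is basically just this (simplified,
with my own names, not the literal definitions from the source):

/* Rough shape of a queued request: either "read count disk blocks
 * starting at dblk", or, when dblk is 0, "take a 1kb special record
 * (inode header etc.) off the socket instead". */
struct req {
        long dblk;      /* starting disk block number, 0 = special record */
        int  count;     /* how many blocks (or special records) */
};

/* The master fills an array of these, sized from -b, and ships the
 * whole array to a slave over the socketpair when it is full. */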

Special blocks, like the inode header, are stored in another buffer, and
the data blocks have their numbers stored in the queue. When the queue
is full it is sent to a slave via a socketpair() as well as any of the
queued special blocks. The slave walks the request list either reading
the 1kb special block from the socket, or reading blocks from disk until
it has assembled the entire -b size record. Then it optionally
compresses the record and writes it to tape.
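So the slave side is conceptually something like this (my paraphrase,
heavily simplified, with the real helpers cut down to plain pread/read):

/* Conceptual slave loop, not the real code: walk the request list and
 * assemble one -b sized record, then compress and write it out. */
#include <unistd.h>

#define TP_BSIZE 1024                   /* dump's 1kb record unit */

struct req { long dblk; int count; };

void fill_record(int diskfd, int cmdfd, struct req *req, int ntrec, char *buf)
{
        int i;

        for (i = 0; i < ntrec; i += req[i].count) {
                char *p = buf + (size_t)i * TP_BSIZE;

                if (req[i].dblk != 0) {
                        /* data: read the blocks straight from the disk */
                        pread(diskfd, p, (size_t)req[i].count * TP_BSIZE,
                              (off_t)req[i].dblk * TP_BSIZE);
                } else {
                        /* special record: the master already pushed the
                         * 1kb header through the socketpair, pull it in */
                        read(cmdfd, p, TP_BSIZE);
                }
        }
        /* ...then optionally compress buf and write it to the tape */
}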

At first I was using -b 64 -z9, and when I tried -b 1024 -z9, both of my
cpu cores sat at 100% for a long time before the dump completed, and
temperatures got rather high. Now that I understand what is going on,
this seems to simply be a result of gzip finally having enough data to do
real compression work. With -z9, gzip searches much harder through the
data it has already seen for repeated phrases. That extra effort doesn't
buy much when you are only compressing 64kb at a time, which is why it ran
rather fast with the smaller records. Once I switched to -b 1024, gzip had
some real data to work with and slowed down, though it also got better
compression in the process. I may just try using -z3 in the future, which
seems to use much less cpu and compress almost as well with -b 1024.
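If anyone wants to see the level trade-off outside of dump, a toy like
this against plain zlib (which I believe is what -z uses underneath) makes
it easy to measure, keeping in mind that real filesystem data shows a much
bigger spread between levels than this synthetic buffer does:

/* Compress the same buffer at each zlib level and print size and time.
 * Plain zlib only; build with "cc test.c -lz". */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

int main(void)
{
        size_t len = 1024 * 1024;               /* pretend -b 1024 record */
        unsigned char *in = malloc(len);
        unsigned char *out = malloc(compressBound(len));
        int level;

        for (size_t i = 0; i < len; i++)        /* mildly compressible filler */
                in[i] = (unsigned char)(i % 251);

        for (level = 1; level <= 9; level++) {
                uLongf outlen = compressBound(len);
                clock_t t0 = clock();

                if (compress2(out, &outlen, in, len, level) != Z_OK)
                        return 1;
                printf("level %d: %zu -> %lu bytes in %.3f s\n", level,
                       len, (unsigned long)outlen,
                       (double)(clock() - t0) / CLOCKS_PER_SEC);
        }
        free(in);
        free(out);
        return 0;
}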

I noticed that the slave was only reading one 4kb block at a time, so I
experimented a bit last night with having dumpblock() try to combine the
new block with the last one in the request queue ( if they are
contiguous ) to try to generate larger block read requests. This seemed
to give me about 5-10% better throughput when dumping my root on an ssd
to /dev/null. I also realized that the slave was leaving the special
records sitting in the socket while reading disk blocks, so I changed
that code to make two passes over the request queue and only read from
the socket in the first pass, then go back and read the disk blocks.
This seems to have gained another 5-10%.
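The coalescing hack itself is nothing clever, basically this (sketched
against the simplified struct req from above, not a real patch):

/* Sketch of the dumpblock() change: if the new block is contiguous
 * with the last queued data request, grow that request instead of
 * appending a new one, so the slave can issue one larger read. */
struct req { long dblk; int count; };

static struct req queue[1024];          /* really sized from -b */
static int nqueued;

void queue_block(long blkno, int nblks)
{
        struct req *last = nqueued ? &queue[nqueued - 1] : 0;

        if (last && last->dblk != 0 && last->dblk + last->count == blkno) {
                last->count += nblks;           /* contiguous: merge */
                return;
        }
        queue[nqueued].dblk = blkno;            /* otherwise a new entry */
        queue[nqueued].count = nblks;
        nqueued++;
}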

Unfortunately, it does not combine nearly as many large requests as it
could, because if two different inodes happen to follow one another on
the disk, they still have their inode special record inserted between
them in the dump record, so the blocks can not be read into the dump
record all at once.

I'm mulling over a few more radical changes to speed things up, as I
feel it could go quite a bit faster than it does now:

First, why are the slaves completely separate processes that have their
commands and any special blocks passed to them via socket? It would be
better if the slaves were just threads that could directly access the
request and special block buffers without the bother of copying them to
kernel socket buffers and back.
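To make that concrete, I'm picturing something along these lines with
pthreads (purely a sketch of the direction, with made-up names, not
working dump code):

/* Sketch: slaves as threads pulling work from shared memory instead
 * of having it copied through a socketpair. */
#include <pthread.h>

struct req { long dblk; int count; };

struct workitem {
        struct req *reqs;       /* request list built by the master   */
        int nreq;
        char *specials;         /* the 1kb special records, in memory */
};

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
static struct workitem *pending;        /* single slot, for simplicity */

void submit_work(struct workitem *w)    /* called by the master */
{
        pthread_mutex_lock(&qlock);
        pending = w;
        pthread_cond_signal(&qcond);
        pthread_mutex_unlock(&qlock);
}

void *slave_thread(void *arg)
{
        (void)arg;
        for (;;) {
                struct workitem *w;

                pthread_mutex_lock(&qlock);
                while (pending == 0)
                        pthread_cond_wait(&qcond, &qlock);
                w = pending;
                pending = 0;
                pthread_mutex_unlock(&qlock);

                /* read, compress and write w here, touching w->specials
                 * directly instead of re-reading them from a socket */
        }
        return 0;
}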

Second, why write the dump output as inode, data, inode, data, etc.
instead of inode, inode, inode, data, data, data? It could be quite a
bit faster to dump all of the inodes first, building block maps of them
on the way, THEN go back and dump the data blocks. This would make
parsing the dump file to find what files are in it much faster, since a
reader would only have to scan the first part with all of the inodes in
it and could skip the rest of the data. Dumping should also be faster,
since it could walk the entire inode table in one go, then all of the
data blocks ( combining sequential ones into larger reads ) in one go.

Finally, this could be combined with O_DIRECT and aio to get rid of the
time spent in the kernel copying data between buffer cache and the dump
record buffer, and to stop kernel readahead from picking up data from the
disk that isn't needed and creating memory pressure in the buffer cache.
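The kind of thing I have in mind is roughly this (a sketch using POSIX
aio, untested, and real code would have to be much more careful about
O_DIRECT's alignment rules and error handling):

/* Kick off an O_DIRECT read with POSIX aio so the disk works on the
 * next chunk while we compress the previous one, bypassing the page
 * cache entirely.  Older glibc needs -lrt for the aio calls. */
#define _GNU_SOURCE
#include <aio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (512 * 1024)

int main(void)
{
        struct aiocb cb;
        void *buf;
        int fd = open("/dev/sda1", O_RDONLY | O_DIRECT);  /* example device */

        if (fd < 0)
                return 1;
        /* O_DIRECT wants the buffer, offset and length nicely aligned */
        if (posix_memalign(&buf, 4096, CHUNK) != 0)
                return 1;

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = 0;

        aio_read(&cb);                  /* start the read...             */
        /* ...compress the previous chunk here while the disk is busy... */
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);     /* ...then wait for it to land   */
        return aio_return(&cb) > 0 ? 0 : 1;
}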

I have just now also been looking at the TS_ADDR record. It looks like
it has one byte per block in a file, set to 1 if the block actually
exists and is present in the dump, or 0 if it was a hole.
Wouldn't it be better if this were instead an array of { blockno, count,
present }? In fact present could probably be encoded in the high order
bit of count, so count would be 100 for 100 actual blocks or -100 for a
hole of 100 missing blocks. That way you would not need an additional
TS_ADDR record for every 2mb of a file that is 2mb or larger, where half
of each record is the redundant information about the entire volume that
every special record repeats, and the other half is just filled with 0x01.
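As a strawman, the map entries could look something like this (my own
invention, obviously not part of the current format):

/* Strawman extent-style map to replace the byte-per-block c_addr[]:
 * one entry per run of present blocks or per hole. */
#include <stdint.h>

struct blkext {
        uint64_t blockno;       /* file-relative block where the run starts */
        int32_t  count;         /* > 0: that many real blocks follow,
                                   < 0: a hole of -count missing blocks */
};

/* e.g. a 300-block file with a 100-block hole in the middle becomes
 *   { 0, 100 }, { 100, -100 }, { 200, 100 }
 * instead of hundreds of 0x01/0x00 bytes spread over TS_ADDR records. */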

Stelian Pop
2010-04-27 14:31:42 UTC
Hi Phillip,
Post by Phillip Susi
The archives show almost no activity on this list for some time,
Yeah, these are quiet times for dump :)
Post by Phillip Susi
so I
hope this finds its way to someone. I have been trying to understand
why dump uses multiple processes that seem to burn through a lot of cpu
time when doing a compressed backup with a larger block size ( 512 kb ).
The choice of using multiple processes in dump was made back in the
(very) old days, when the speed of tape drives exceeded the disk IO (and
CPU) speeds, and when it was important - performance wise - to keep the
tape drives under a constant flow of data.
Post by Phillip Susi
The code is very hard to follow, but it seems like the first process
reads blocks from the disk and writes them to a pipe. The second
process reads from the pipe and compresses the data, then writes it to
another pipe. Finally the third process reads the compressed data from
the pipe, slaps a header on it, and writes it to the tape. Is this
correct? Does the third process attempt to reblock the data so it
always writes fixed size records? If so, how does it do this?
There is one main process (well, more exactly one main process per
volume) which traverses the filesystem, computing the list of disk blocks
to be saved. This list is written into a pipe, which is read by several
(#define SLAVES 3) slaves. Each slave reads a block number from the
pipe, seeks the disk to the wanted offset, reads the block, compresses
the block if asked to, writes it on the tape, then signals the next
slave to do its job.

The slaves do synchronize with one another, serializing access to the
tape drive, while doing the disk reads and compression in parallel.
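If it helps, here is a toy program (not dump's code) showing the kind of
token passing that keeps the tape writes in order while everything else
runs in parallel:

/* Toy demonstration of the slave synchronization, not the real tape.c:
 * three workers "compress" in parallel, but a token passed around a
 * ring of pipes makes the tape writes come out strictly in order. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLAVES 3

int main(void)
{
        int ring[SLAVES][2];    /* ring[i] delivers the token to slave i */
        int i, rounds = 4;

        for (i = 0; i < SLAVES; i++)
                if (pipe(ring[i]) < 0)
                        return 1;

        for (i = 0; i < SLAVES; i++) {
                if (fork() == 0) {
                        char token;
                        int r;

                        for (r = 0; r < rounds; r++) {
                                usleep(100000); /* "read and compress" here,
                                                   concurrently in all slaves */
                                read(ring[i][0], &token, 1);    /* my turn? */
                                printf("slave %d writes record %d\n",
                                       i, r * SLAVES + i);
                                fflush(stdout);
                                /* hand the tape to the next slave */
                                write(ring[(i + 1) % SLAVES][1], &token, 1);
                        }
                        _exit(0);
                }
        }

        write(ring[0][1], "T", 1);      /* start the rotation at slave 0 */
        for (i = 0; i < SLAVES; i++)
                wait(NULL);
        return 0;
}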

Hope this explains things a bit.

Stelian.
--
Stelian Pop <***@popies.net>