Speed up rsync with Simultaneous/Concurrent File Transfers?

Tags: Bash, Shell, Ubuntu 12.04, Rsync, Simultaneous

Bash Problem Overview


We need to transfer 15TB of data from one server to another as fast as we can. We're currently using rsync but we're only getting speeds of around 150Mb/s, when our network is capable of 900+Mb/s (tested with iperf). I've done tests of the disks, network, etc and figured it's just that rsync is only transferring one file at a time which is causing the slowdown.

I found a script to run a different rsync for each folder in a directory tree (allowing you to limit to x number), but I can't get it working, it still just runs one rsync at a time.

I found the script here (copied below).

Our directory tree is like this:

/main
   - /files
      - /1
         - 343
            - 123.wav
            - 76.wav
         - 772
            - 122.wav
         - 55
            - 555.wav
            - 324.wav
            - 1209.wav
         - 43
            - 999.wav
            - 111.wav
            - 222.wav
      - /2
         - 346
            - 9993.wav
         - 4242
            - 827.wav
      - /3
         - 2545
            - 76.wav
            - 199.wav
            - 183.wav
         - 23
            - 33.wav
            - 876.wav
         - 4256
            - 998.wav
            - 1665.wav
            - 332.wav
            - 112.wav
            - 5584.wav

So what I'd like to happen is to create an rsync for each of the directories in /main/files, up to a maximum of, say, 5 at a time. So in this case, 3 rsyncs would run, for /main/files/1, /main/files/2 and /main/files/3.

I tried with it like this, but it just runs 1 rsync at a time for the /main/files/2 folder:

#!/bin/bash

# Define source, target, maxdepth and cd to source
source="/main/files"
target="/main/filesTest"
depth=1
cd "${source}"

# Set the maximum number of concurrent rsync threads
maxthreads=5
# How long to wait before checking the number of rsync threads again
sleeptime=5

# Find all folders in the source directory within the maxdepth level
find . -maxdepth ${depth} -type d | while read dir
do
	# Make sure to ignore the parent folder
	if [ `echo "${dir}" | awk -F'/' '{print NF}'` -gt ${depth} ]
	then
		# Strip leading dot slash
		subfolder=$(echo "${dir}" | sed 's@^\./@@g')
		if [ ! -d "${target}/${subfolder}" ]
		then
			# Create destination folder and set ownership and permissions to match source
			mkdir -p "${target}/${subfolder}"
			chown --reference="${source}/${subfolder}" "${target}/${subfolder}"
			chmod --reference="${source}/${subfolder}" "${target}/${subfolder}"
		fi
		# Make sure the number of rsync threads running is below the threshold
		while [ `ps -ef | grep -c [r]sync` -gt ${maxthreads} ]
		do
			echo "Sleeping ${sleeptime} seconds"
			sleep ${sleeptime}
		done
		# Run rsync in background for the current subfolder and move on to the next one
		nohup rsync -a "${source}/${subfolder}/" "${target}/${subfolder}/" </dev/null >/dev/null 2>&1 &
	fi
done

# Find all files above the maxdepth level and rsync them as well
find . -maxdepth ${depth} -type f -print0 | rsync -a --files-from=- --from0 ./ "${target}/"

Bash Solutions


Solution 1 - Bash

Updated answer (Jan 2020)

xargs is now the recommended tool to achieve parallel execution. It's pre-installed almost everywhere. For running multiple rsync tasks the command would be:

ls /srv/mail | xargs -n1 -P4 -I% rsync -Pa % myserver.com:/srv/mail/

This lists all folders in /srv/mail, pipes them to xargs, which reads them one by one and runs 4 rsync processes at a time. The % character is replaced by each input argument in the command call.
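Note that piping `ls` breaks on names containing whitespace. A NUL-delimited variant using `find -print0` is safer; here is a sketch wrapped in a hypothetical helper function (the name `xargs_rsync` and the local-copy form are mine, for illustration; in the remote case you would pass something like `myserver.com:/srv/mail/` as the destination):

```shell
# xargs_rsync SRC DST JOBS: one rsync per top-level entry of SRC,
# JOBS running at a time; NUL-delimited so names with spaces survive.
xargs_rsync() {
    mkdir -p "$2"
    # find emits each entry NUL-terminated; xargs -0 reads them safely,
    # -P runs that many rsync processes in parallel, -I% substitutes the path.
    find "$1" -mindepth 1 -maxdepth 1 -print0 \
      | xargs -0 -P"$3" -I% rsync -a % "$2"/
}
```

With `-I%`, xargs consumes one argument per command, so a separate `-n1` is unnecessary.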

Original answer using parallel:

ls /srv/mail | parallel -v -j8 rsync -raz --progress {} myserver.com:/srv/mail/{}

Solution 2 - Bash

rsync transfers files as fast as it can over the network. For example, try using it to copy one large file that doesn't exist at all on the destination; that speed is the maximum speed at which rsync can transfer data. Compare it with the speed of scp, for example. rsync is even slower at raw transfer when the destination file already exists, because both sides have to hold a two-way chat about which parts of the file have changed, but it pays for itself by identifying data that doesn't need to be transferred.

A simpler way to run rsync in parallel would be to use parallel. The command below would run up to 5 rsyncs in parallel, each one copying one directory. Be aware that the bottleneck might not be your network, but the speed of your CPUs and disks, and running things in parallel just makes them all slower, not faster.

run_rsync() {
    # e.g. copies /main/files/blah to /main/filesTest/blah
    rsync -av "$1" "/main/filesTest/${1#/main/files/}"
}
export -f run_rsync
parallel -j5 run_rsync ::: /main/files/*

Solution 3 - Bash

You can use xargs which supports running many processes at a time. For your case it will be:

ls -1 /main/files | xargs -I {} -P 5 -n 1 rsync -avh /main/files/{} /main/filesTest/

Solution 4 - Bash

Have you tried using rclone.org?

With rclone you could do something like

rclone copy "${source}/${subfolder}/" "${target}/${subfolder}/" --progress --multi-thread-streams=N

where --multi-thread-streams=N represents the number of threads you wish to spawn.

Solution 5 - Bash

There are a number of alternative tools and approaches for doing this listed around the web. For example:

  • The NCSA Blog has a description of using xargs and find to parallelize rsync without having to install any new software for most *nix systems.

  • And parsync provides a feature rich Perl wrapper for parallel rsync.

Solution 6 - Bash

I've developed a python package called: parallel_sync

https://pythonhosted.org/parallel_sync/pages/examples.html

Here is a sample code how to use it:

from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds)

Parallelism is 10 by default; you can increase it:

from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds, parallelism=20)

However, note that ssh typically has MaxSessions set to 10 by default, so to go beyond 10 you'll have to modify your ssh server settings.
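For reference, raising that limit means editing the SSH daemon configuration on the remote host and reloading the service (a sketch; the file path and service name can vary by distro):

```
# /etc/ssh/sshd_config on the remote host
MaxSessions 20
```

followed by something like `sudo systemctl reload sshd`.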

Solution 7 - Bash

The simplest I've found is using background jobs in the shell:

for d in /main/files/*; do
    rsync -a "$d" remote:/main/files/ &
done

Beware that this doesn't limit the number of jobs! If you're network-bound this is not really a problem, but if you're waiting for spinning rust it will thrash the disk.

You could add

while [ $(jobs | wc -l | xargs) -gt 10 ]; do sleep 1; done

inside the loop for a primitive form of job control.
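Putting the two pieces together, the complete loop with the primitive job limit might look like the sketch below (the helper name `parallel_rsync` and the local source/destination form are mine; swap the destination for `remote:/main/files/` for the network case):

```shell
# parallel_rsync SRC DST MAX: start one background rsync per
# top-level directory of SRC, keeping at most MAX running at once.
parallel_rsync() {
    local src=$1 dst=$2 max=$3 d
    mkdir -p "$dst"
    for d in "$src"/*/; do
        [ -d "$d" ] || continue          # skip if the glob matched nothing
        # Primitive job control: wait while we're at the limit
        while [ "$(jobs -p | wc -l)" -ge "$max" ]; do sleep 1; done
        rsync -a "$d" "$dst/$(basename "$d")/" &
    done
    wait   # block until every background rsync has finished
}

# Example: parallel_rsync /main/files /main/filesTest 10
```

The trailing `wait` matters: without it the script exits while transfers are still running in the background.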

Solution 8 - Bash

The shortest version I found is to use the --cat option of parallel like below. This version avoids using xargs, only relying on features of parallel:

cat files.txt | \
  parallel -n 500 --lb --pipe --cat rsync --files-from={} user@remote:/dir /dir -avPi

#### Arg explainer
# -n 500           :: split input into chunks of 500 entries
#
# --cat            :: create a tmp file referenced by {} containing the 500 
#                     entry content for each process
#
# user@remote:/dir :: the root relative to which entries in files.txt are considered
#
# /dir             :: local root relative to which files are copied

Sample content from files.txt:

/dir/file-1
/dir/subdir/file-2
....

Note that this doesn't use -j 50 for the job count; that didn't work on my end. Instead I've used -n 500 for the record count per job, chosen as a reasonable number given the total number of records.

Solution 9 - Bash

I've found UDR/UDT to be an amazing tool. TL;DR: it's a UDT wrapper for rsync, using multiple UDP connections rather than a single TCP connection.

References: https://udt.sourceforge.io/ & https://github.com/jaystevens/UDR#udr

If you use any RHEL distros, they've pre-compiled it for you... http://hgdownload.soe.ucsc.edu/admin/udr

The ONLY downside I've encountered is that you can't specify a different SSH port, so your remote server must use 22.

Anyway, after installing the rpm, it's literally as simple as:

udr rsync -aP user@IpOrFqdn:/source/files/* /dest/folder/

and your transfer speeds will increase drastically in most cases; depending on the server, I've easily seen a 10x increase in transfer speed.

Side note: if you choose to gzip everything first, then make sure to use --rsyncable arg so that it only updates what has changed.

Solution 10 - Bash

Using parallel rsync on a regular disk only makes the processes compete for I/O, turning what should be a sequential read into an inefficient random read. Instead, you could tar the directory into a stream, pull it over ssh from the destination server, and pipe the stream into tar for extraction.
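The tar-pipe idea can be sketched as a small helper (the function name is mine; for the network case you would wrap the first tar in ssh, e.g. `ssh user@sourcehost 'tar -C /main/files -cf - .'` on the destination side):

```shell
# tar_pipe_copy SRC DST: stream SRC's contents as one sequential
# tar stream and unpack it into DST -- one sequential read on the
# source, one sequential write on the destination.
tar_pipe_copy() {
    mkdir -p "$2"
    tar -C "$1" -cf - . | tar -C "$2" -xf -
}
```

This trades rsync's delta transfer for a single sequential pass, which suits a first-time bulk copy of data that doesn't yet exist on the destination.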

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: BT643 (View Question on Stackoverflow)
Solution 1 - Bash: Manuel Riel (View Answer on Stackoverflow)
Solution 2 - Bash: Stuart Caie (View Answer on Stackoverflow)
Solution 3 - Bash: nickgryg (View Answer on Stackoverflow)
Solution 4 - Bash: dantebarba (View Answer on Stackoverflow)
Solution 5 - Bash: Bryan P (View Answer on Stackoverflow)
Solution 6 - Bash: max (View Answer on Stackoverflow)
Solution 7 - Bash: sba (View Answer on Stackoverflow)
Solution 8 - Bash: Valer (View Answer on Stackoverflow)
Solution 9 - Bash: lemonskunnk (View Answer on Stackoverflow)
Solution 10 - Bash: El Misterio (View Answer on Stackoverflow)