keep rsync from removing unfinished source files


Storage Problem Overview


I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run:

$ rsync -a --remove-source-files speed:/var/crawldir .

but I worry that rsync will unlink a source file that hasn't finished downloading yet. (I looked at the source code and I didn't see anything protecting against this.) Any suggestions?

Storage Solutions


Solution 1 - Storage

It seems to me the problem is transferring a file before it's complete, not that you're deleting it.

On Linux, process B can unlink a file while process A still has it open. Neither process gets an error, but of course A is wasting its time, because anything it writes after the unlink is thrown away once it closes the file. So the deletion by itself does not cause a failure.

The problem is that rsync deletes the source file only after it has copied it, and if the file is still being written to disk at that point, the copy on mass is a partial file.
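The unlink-while-open behaviour described above can be seen in a few lines of Python. This is only a sketch, assuming a POSIX system such as Linux with a local filesystem; the function name `write_after_unlink` is made up for the illustration, and the unlink happens in the same process rather than in a separate one.

```python
import os
import tempfile


def write_after_unlink():
    """Unlink a file that is still open for writing, then keep writing to it.

    Returns (path_still_exists, link_count, bytes_written_after_unlink).
    Assumes POSIX semantics; Windows behaves differently.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"first half ")
        # Stand-in for what rsync --remove-source-files would do to the path.
        os.unlink(path)
        # The write succeeds: the open descriptor keeps the inode alive.
        written = os.write(fd, b"second half")
        # On typical local filesystems this is 0: no directory entry is left.
        links = os.fstat(fd).st_nlink
        return os.path.exists(path), links, written
    finally:
        # Closing the last descriptor discards the data for good.
        os.close(fd)
```

The writer never sees an error, which is why the crawler would not notice that rsync had removed a half-downloaded file.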

How about this: mount mass as a remote file system on speed (NFS would work), then have the crawler write its files directly to that mount.

Solution 2 - Storage

How much control do you have over the download process? If you roll your own, you can have the file being downloaded go to a temp directory, or give it a temporary name until it's finished downloading, and then mv it to the correct name when it's done. If you're using third-party software you don't have as much control, but you still might be able to do the temp directory thing.
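If the crawler is your own code, the temp-then-rename idea might look roughly like the sketch below. The names `save_atomically`, `tmp_dir` and the `.part` suffix are hypothetical. It assumes `tmp_dir` lives outside the directory rsync reads from but on the same filesystem, so that the final rename is a single atomic step and never a copy.

```python
import os
import tempfile


def save_atomically(data_chunks, final_path, tmp_dir):
    """Write chunks under a temporary name, then rename into place when done.

    Assumption: tmp_dir is on the same filesystem as final_path and is not
    part of the tree that rsync transfers.
    """
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in data_chunks:
                tmp.write(chunk)
        # Atomic on POSIX within one filesystem: readers see either no file
        # or the complete file, never a partial one.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave a half-written temp file behind on failure.
        os.unlink(tmp_path)
        raise
```

With this in place, anything rsync finds under the crawl directory is already complete, so `--remove-source-files` can be used without extra checks.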

Solution 3 - Storage

Rsync can exclude files matching certain patterns. Even if you can't modify the crawler to make it download files to a temporary directory, maybe it has a convention of naming files differently during download (for example, foo.downloading while downloading a file named foo). You can then use that naming convention to exclude files which are still being downloaded from being copied.
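As a rough sketch of the exclude approach, the helper below only builds the rsync argument list. The `*.downloading` pattern is the hypothetical naming convention from the example above, not something a particular crawler is known to use, and `-a` is added on the assumption that the source is a directory that needs recursion. `--exclude` and `--remove-source-files` are standard rsync options; excluded files are never transferred, so they are not removed either.

```python
def build_rsync_command(source, dest, in_progress_pattern="*.downloading"):
    """Return an rsync argv that skips files still marked as in progress."""
    return [
        "rsync",
        "-a",                                # recurse and preserve attributes
        "--remove-source-files",             # delete each file once it is transferred
        "--exclude=" + in_progress_pattern,  # leave unfinished downloads alone
        source,
        dest,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on mass, for example with `build_rsync_command("speed:/var/crawldir", ".")`.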

Solution 4 - Storage

If you have control over the crawling process, or it has predictable output, the above solutions (storing in a tempfile until finished, then mv'ing to the completed-downloads place, or ignoring files with a '.downloading' kind of name) might work. If all of that is beyond your control, you can make sure that the file is not opened by any process by doing 'lsof $filename' and checking if there's a result. Clearly if no one has the file open, it's safe to move it over.
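The lsof check could be wrapped like this. It is a sketch under a few assumptions: it has to run on speed (lsof only sees local processes), lsof is installed and allowed to see the crawler's process, and the `run` parameter is a hypothetical hook added only so the function can be exercised without lsof. There is still a window between the check and the transfer, so it narrows the race and does not remove it.

```python
import subprocess


def is_open_by_any_process(path, run=subprocess.run):
    """Return True if lsof lists at least one process holding `path` open."""
    result = run(
        ["lsof", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # lsof exits non-zero when it finds nothing (or fails); when it finds
    # an opener it exits 0 and prints a header plus one line per process.
    return result.returncode == 0 and bool(result.stdout.strip())
```

A transfer script would then skip any file for which this returns True and pick it up on the next pass.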

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | aaronsw | View Question on Stackoverflow
Solution 1 - Storage | Jason Cohen | View Answer on Stackoverflow
Solution 2 - Storage | Paul Tomblin | View Answer on Stackoverflow
Solution 3 - Storage | Grey Panther | View Answer on Stackoverflow
Solution 4 - Storage | pjz | View Answer on Stackoverflow