Utilizing multi core for tar+gzip/bzip compression/decompression

Gzip Problem Overview

I normally compress using tar zcvf and decompress using tar zxvf (using gzip due to habit).

I've recently gotten a quad core CPU with hyperthreading, so I have 8 logical cores, and I notice that many of the cores are unused during compression/decompression.

Is there any way I can utilize the unused cores to make it faster?

Gzip Solutions

Solution 1 - Gzip

You can also use the tar flag "--use-compress-program=" to tell tar what compression program to use.

For example use:

tar -c --use-compress-program=pigz -f tar.file dir_to_zip

Solution 2 - Gzip

You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:

tar cf - paths-to-archive | pigz > archive.tar.gz

By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.

tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
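
The question also asks about decompression, which the commands above do not show. The following is a minimal round-trip sketch, not a definitive recipe: the sample directory and file names are hypothetical, and the fallback to gzip when pigz is not installed is an assumption added so the script runs anywhere.

```shell
#!/bin/sh
# Sketch: round-trip a directory through pigz (falls back to gzip if pigz is absent).
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical sample data standing in for paths-to-archive.
mkdir -p "$work/src/data" "$work/out"
printf 'hello\n' > "$work/src/data/a.txt"

# Assumption: prefer pigz when installed, otherwise plain gzip (same file format).
if command -v pigz >/dev/null 2>&1; then gz=pigz; else gz=gzip; fi

# Compress: tar writes the archive to stdout, the compressor reads stdin.
tar cf - -C "$work/src" data | "$gz" > "$work/archive.tar.gz"

# Decompress: -d decompresses, -c writes to stdout, tar reads the archive from stdin.
"$gz" -dc < "$work/archive.tar.gz" | tar xf - -C "$work/out"

restored=$(cat "$work/out/data/a.txt")
echo "$restored"
```

Keep expectations modest on the decompression side: according to the pigz manual, decompression itself runs in a single thread, with extra threads used only for reading, writing, and check calculation, so the gain there is typically much smaller than for compression.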

Solution 3 - Gzip

Common approach

tar has an option for this:

-I, --use-compress-program PROG
      filter through PROG (must accept -d)

You can pass a multithreaded compressor utility as PROG.

The most popular multithreaded compressors are pigz (a replacement for gzip) and pbzip2 (a replacement for bzip2). For instance:

$ tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 paths_to_archive
$ tar --use-compress-program=pigz -cf OUTPUT_FILE.tar.gz paths_to_archive

The program must accept -d. If your replacement utility does not have this option, or you need to pass additional parameters, use pipes instead (adding parameters as necessary):

$ tar cf - paths_to_archive | pbzip2 > OUTPUT_FILE.tar.bz2
$ tar cf - paths_to_archive | pigz > OUTPUT_FILE.tar.gz

The single-threaded and multithreaded tools read and write the same formats. You can compress with the multithreaded version and decompress with the single-threaded version, and vice versa.
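
The compatibility claim is easy to check on your own machine. The sketch below compresses a hypothetical sample file with each parallel tool and decompresses it with the serial counterpart; pairs whose tools are not installed are skipped, so the count it prints depends on your system.

```shell
#!/bin/sh
# Sketch: compress with the parallel tool, decompress with the serial one, compare.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical sample file.
mkdir -p "$work/src"
printf 'sample\n' > "$work/src/file.txt"

checked=0
# Each pair is "parallel-tool:serial-tool".
for pair in pigz:gzip pbzip2:bzip2; do
  par=${pair%%:*}
  ser=${pair##*:}
  # Skip pairs that are not fully installed on this machine.
  command -v "$par" >/dev/null 2>&1 || continue
  command -v "$ser" >/dev/null 2>&1 || continue

  tar cf - -C "$work/src" file.txt | "$par" > "$work/$par.archive"

  mkdir -p "$work/$par.out"
  "$ser" -dc < "$work/$par.archive" | tar xf - -C "$work/$par.out"

  # cmp exits non-zero (and set -e aborts) if the restored file differs.
  cmp "$work/src/file.txt" "$work/$par.out/file.txt"
  checked=$((checked + 1))
done
echo "pairs verified: $checked"
```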

p7zip

To use p7zip as the compressor, you need a small wrapper script like the following:

#!/bin/sh
case $1 in
  -d) 7za -txz -si -so e;;
   *) 7za -txz -si -so a .;;
esac 2>/dev/null

Save it as 7zhelper.sh, make it executable, and put it somewhere on your PATH (or pass its path to -I). Example usage:

$ tar -I 7zhelper.sh -cf OUTPUT_FILE.tar.7z paths_to_archive
$ tar -I 7zhelper.sh -xf OUTPUT_FILE.tar.7z

xz

Regarding multithreaded XZ support: if you are running XZ Utils 5.2.0 or later, you can use multiple cores for compression by setting -T or --threads to an appropriate value via the environment variable XZ_DEFAULTS (e.g. XZ_DEFAULTS="-T 0").

In older versions the option is accepted but does nothing. This is from the man page for version 5.1.0alpha:

> Multithreaded compression and decompression are not implemented yet, so this
> option has no effect for now.

However, threading will not help when decompressing files that were not themselves compressed with threading enabled. From the man page for version 5.2.2:

> Threaded decompression hasn't been implemented yet. It will only work
> on files that contain multiple blocks with size information in
> block headers. All files compressed in multi-threaded mode meet this
> condition, but files compressed in single-threaded mode don't even if
> --block-size=size is used.
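
As a concrete illustration of the XZ_DEFAULTS approach, here is a minimal sketch. It assumes GNU tar, which runs the external xz program for -J so that xz inherits the variable. The sample file name is hypothetical, and the script skips itself when xz is not installed.

```shell
#!/bin/sh
# Sketch: multithreaded xz compression through tar, using XZ_DEFAULTS.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical sample file.
mkdir -p "$work/src"
printf 'sample\n' > "$work/src/file.txt"

listing=""
if command -v xz >/dev/null 2>&1; then
  # xz reads XZ_DEFAULTS itself; "-T 0" asks for as many threads as there are cores.
  XZ_DEFAULTS="-T 0" tar -cJf "$work/archive.tar.xz" -C "$work/src" file.txt

  # List the archive to confirm it is readable.
  listing=$(tar -tJf "$work/archive.tar.xz")
  echo "$listing"
else
  echo "xz not installed; skipping"
fi
```

On small inputs the threads make no visible difference, because xz splits work by block. As far as I know, newer xz releases (5.4 and later) also add threaded decompression for files that contain multiple blocks, so check the man page of your installed version.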

Recompiling with replacement

If you build tar from source, you can configure it with the following options so that the parallel tools are used by default:

--with-gzip=pigz
--with-bzip2=lbzip2
--with-lzip=plzip

After recompiling tar with these options you can check the output of tar's help:

$ tar --help | grep "lbzip2\|plzip\|pigz"
  -j, --bzip2                filter the archive through lbzip2
      --lzip                 filter the archive through plzip
  -z, --gzip, --gunzip, --ungzip   filter the archive through pigz

Solution 4 - Gzip

You can use the shortcut -I for tar's --use-compress-program switch, and invoke pbzip2 for bzip2 compression on multiple cores:

tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 DIRECTORY_TO_COMPRESS/

Solution 5 - Gzip

If you want to have more flexibility with filenames and compression options, you can use:

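find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec \
tar -P --transform='s@/my/path/@@g' -cf - {} + | \
pigz -9 -p 4 > myarchive.tar.gz

Step 1: `find`

find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec

This command looks for the files you want to archive, in this case /my/path/*.sql and /my/path/*.log. Add as many -o -name "pattern" tests as you want inside the parentheses. The parentheses are required: without them, -o binds more loosely than the implicit "and", so -type f would apply only to the first pattern and -exec only to the last one.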

-exec will execute the next command using the results of find: tar

Step 2: `tar`

tar -P --transform='s@/my/path/@@g' -cf - {} +

--transform takes a sed-style replacement expression. Here it strips the leading path from the file names stored in the archive, so the files are extracted relative to the current directory. Note that you can't use the -C option to change directory instead, because you would lose the benefit of find: every file in the directory would be included.

-P tells tar to use absolute paths, so it doesn't trigger the warning "Removing leading `/' from member names". The leading '/' will be removed by --transform anyway.

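-cf - tells tar to create an archive and write it to standard output, so it can be piped to pigz. The tarball name is specified later, at the end of the pipeline.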

{} + passes all the files that find found to tar as arguments

Step 3: `pigz`

pigz -9 -p 4

Use as many parameters as you want. In this case -9 is the compression level and -p 4 is the number of cores dedicated to compression. If you run this on a heavily loaded web server, you probably don't want to use all available cores.

Step 4: archive name

> myarchive.tar.gz

Finally, the compressed stream is redirected to the archive file.
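
To see the whole pipeline work end to end, here is a self-contained sketch on a throwaway directory tree. The file names are hypothetical, --transform assumes GNU tar, and the fallback to gzip when pigz is missing is an assumption added for portability. Note the escaped parentheses around the -name tests, which make -type f and -exec apply to both patterns.

```shell
#!/bin/sh
# Sketch: find + tar + pigz on a temporary tree, then list the resulting archive.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical tree: two files that match and one that does not.
mkdir -p "$work/tree/sub"
: > "$work/tree/a.sql"
: > "$work/tree/sub/b.log"
: > "$work/tree/c.txt"

# Assumption: prefer pigz when installed, otherwise plain gzip.
if command -v pigz >/dev/null 2>&1; then gz="pigz -9 -p 4"; else gz="gzip -9"; fi

# $gz is intentionally unquoted so its options are split into separate words.
find "$work/tree/" -type f \( -name "*.sql" -o -name "*.log" \) -exec \
  tar -P --transform="s@^$work/tree/@@" -cf - {} + | $gz > "$work/myarchive.tar.gz"

# Member names should be relative, and c.txt should be absent.
listing=$(gzip -dc < "$work/myarchive.tar.gz" | tar -tf - | sort)
echo "$listing"
```

One caveat worth knowing: with a very long file list, find may run the -exec command more than once, which would write several tar archives back to back into the same stream. For large trees, a common alternative is to feed the list to a single tar process through its --files-from option.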

Solution 6 - Gzip

A relatively new (de)compression tool you might want to consider is Zstandard (zstd). It does an excellent job of utilizing spare cores, and it makes good trade-offs between compression ratio and (de)compression time. It is also highly tunable, depending on your compression ratio needs.
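
The answer gives no command, so here is a minimal sketch of how zstd fits the same pipe pattern used with pigz. The sample file name is hypothetical, and the script skips itself when zstd is not installed. -T0 asks zstd to pick the number of worker threads from the detected cores.

```shell
#!/bin/sh
# Sketch: tar piped through multithreaded zstd, then restored again.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical sample file.
mkdir -p "$work/src" "$work/out"
printf 'sample\n' > "$work/src/file.txt"

restored=""
if command -v zstd >/dev/null 2>&1; then
  # -T0: use as many worker threads as detected cores; -q: suppress progress output.
  tar cf - -C "$work/src" file.txt | zstd -T0 -q > "$work/archive.tar.zst"

  # -d decompresses, -c writes to stdout for tar to read.
  zstd -dc < "$work/archive.tar.zst" | tar xf - -C "$work/out"

  restored=$(cat "$work/out/file.txt")
  echo "$restored"
else
  echo "zstd not installed; skipping"
fi
```

With a reasonably recent GNU tar, the same thing can usually be written as tar -I 'zstd -T0' -cf archive.tar.zst paths_to_archive, since -I accepts a program together with its arguments. Treat that as an assumption to verify against your tar version.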

Content Type	Original Author	Original Content on Stackoverflow
Question	user1118764	View Question on Stackoverflow
Solution 1 - Gzip	Jen	View Answer on Stackoverflow
Solution 2 - Gzip	Mark Adler	View Answer on Stackoverflow
Solution 3 - Gzip	Maxim Suslov	View Answer on Stackoverflow
Solution 4 - Gzip	panticz	View Answer on Stackoverflow
Solution 5 - Gzip	Bloops	View Answer on Stackoverflow
Solution 6 - Gzip	pgebhard	View Answer on Stackoverflow
