Computer / Programming · 2 October 2014

Which compression method should we choose for our archives?

We often have to move many files at once from one device to another, and some of them can be very big. For convenience and to save space we use so-called compressed archives. A compressed archive is a single file that contains one or more files in a format that takes fewer bytes than the sum of the sizes of the original files. Many users never stop to think about which format to use, because the most widely used one is ZIP. But ZIP is neither the only format nor the most efficient one. Let’s run some tests and look around for other formats that could be better.

We can say that compressed archives were born with computers: since the first machines, users have looked for ways to optimize storage, because storage media were not as large as today’s. In the Sixties magnetic tapes were very common: big reels, like those in the audio cassettes of the Eighties but larger than a pizza. They had a problem: they did not use the storage space efficiently, because data was saved in blocks, so a block could hold just a few bytes and leave a lot of empty, unusable space. To solve this problem one of the first archiving formats was invented: TAR, for “TApe aRchive”. The files to be archived were collected into a single, big container that was then saved on the tape. The first floppy disks, introduced in 1971, did not hold much more either: despite their large physical size of 8″, they could contain only ~79 KB of data. However, at that time programs were very small, because a typical computer had just a few KB of RAM and programmers could not write huge programs. When UNIX was released, TAR was adapted to work with floppy disks too, because it was convenient to have a single archive in which to store many separate files, and there was not yet a need to save space. However, the data that a program had to store started to grow dramatically (just think of a company’s customer list or invoices, for example). Luckily, storage technology was evolving very quickly, and the users of that time, with their not very powerful computers, had access to disks that were bigger than the data they had to store on them. In the mid-Eighties, 3.5″ floppy disks offered 720/800/880 KB of capacity on PC, Apple, and Amiga respectively.

As computers became more powerful, programs started to grow in size, so users began to feel the need for programs that could compress files so that they could be stored and copied easily. One of the first programs of this kind was ARC (1985), which created compressed archives with the .arc extension. In 1989, when PCs had been widely adopted as the standard business systems, an MS-DOS program appeared that quickly became the de facto standard for managing compressed archives: PKZIP. It was quicker and more efficient than ARC: it needed less time to compress and decompress data, and its algorithm produced smaller archives than the ones created by ARC. Its archives had the extension .ZIP. The ZIP format soon spread all over the world and, over the following years, the most important operating systems added integrated support for it: Windows and Mac OS X can open ZIP files natively. Even though ZIP is the most used format, it’s not the only compression and archive format. Over the years other formats were developed, some of them born on a specific platform and then ported to other systems: for example, the GZIP format was born on UNIX systems, while RAR was born on Windows. The 7Z format is relatively new: it was presented in 2000 and, unlike its predecessors, has a high compression ratio, at the cost of slower compression.

But which is the best format? It’s hard to say, because the performance of a compression program depends not only on the specific qualities of the compression algorithm but also on how well the code is optimized and on the compression level chosen. Another factor is the machine used to compress the files: the CPU (its frequency and number of cores) heavily affects the performance of the compression software. For our tests we chose to compress the folder of the Arduino IDE 1.0.6 for Windows (the version without installer), a folder with 6686 files and a total size of 263,821,413 bytes (~266.3 MB). We measured the time needed to create each archive and the size of the resulting archive, to look for the best performance. The tests were done on an iMac with an Intel Core i7 @ 3.1 GHz and 16 GB of RAM using the KeKa app, a freeware program (if you want to support the developers, you can buy it in the App Store) that can compress to the 7Z, ZIP, TAR, GZIP, and BZIP2 formats (and also to .DMG and .ISO). In the tests we only used the 7Z, ZIP, GZIP, and BZIP2 formats, because TAR doesn’t compress. We also did not use proprietary formats like RAR, because their compression plugins are not freely available and need a commercial license, so you can find them only inside commercial apps.
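The measurement described above (time to create the archive, then size of the result) can be roughly reproduced with a short script. What follows is only a minimal sketch in Python, not the procedure used for the tests in this article: it relies on the standard `zipfile` and `tarfile` modules instead of KeKa, the function names (`folder_size`, `make_zip`, `make_tar`, `benchmark`) are made up for illustration, and the `.tar.xz` job is just a stand-in for 7Z, chosen on the assumption that both rely on the LZMA family of algorithms. Absolute numbers will differ from the table in this article.

```python
import os
import tarfile
import tempfile
import time
import zipfile


def folder_size(folder):
    """Total size in bytes of all regular files under folder."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def make_zip(folder, archive_path, level):
    """Store folder in a DEFLATE-compressed ZIP at the given level (1-9)."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED,
                         compresslevel=level) as zf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, folder))


def make_tar(folder, archive_path, mode):
    """Store folder in a compressed tar; mode is 'w:gz', 'w:bz2' or 'w:xz'."""
    with tarfile.open(archive_path, mode) as tf:
        tf.add(folder, arcname=os.path.basename(os.path.normpath(folder)))


def benchmark(folder, out_dir):
    """Return a list of (label, seconds, archive_bytes, ratio) tuples."""
    original = folder_size(folder)
    jobs = [
        ("ZIP level 1", "test1.zip", lambda p: make_zip(folder, p, 1)),
        ("ZIP level 6", "test6.zip", lambda p: make_zip(folder, p, 6)),
        ("ZIP level 9", "test9.zip", lambda p: make_zip(folder, p, 9)),
        ("GZIP (tar.gz)", "test.tar.gz", lambda p: make_tar(folder, p, "w:gz")),
        ("BZIP2 (tar.bz2)", "test.tar.bz2", lambda p: make_tar(folder, p, "w:bz2")),
        # Assumption: xz (LZMA) used as a rough stand-in for the 7Z format.
        ("XZ (tar.xz)", "test.tar.xz", lambda p: make_tar(folder, p, "w:xz")),
    ]
    results = []
    for label, filename, job in jobs:
        path = os.path.join(out_dir, filename)
        start = time.perf_counter()
        job(path)
        elapsed = time.perf_counter() - start
        size = os.path.getsize(path)
        # ratio = archive size / original size (lower is better)
        ratio = size / original if original else 0.0
        results.append((label, elapsed, size, ratio))
    return results
```

Calling `benchmark("path/to/folder", tempfile.mkdtemp())` returns one tuple per format, which can then be printed or sorted by time or by ratio. Writing the archives to a directory outside the folder being compressed keeps them from ending up inside later archives.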

 

| Format | Compression level | Time | Size of archive | Speed | Archive / original files |
|---|---|---|---|---|---|
| ZIP | minimum | 3 s | 100.2 MB | 33.4 MB/s | 37% |
| ZIP | medium | 9 s | 94.6 MB | 10.5 MB/s | 35% |
| ZIP | maximum | 1 min 8 s | 93 MB | 1.37 MB/s | 34% |
| 7Z | minimum | 14 s | 79.8 MB | 5.7 MB/s | 30% |
| 7Z | medium | 50 s | 55 MB | 1.1 MB/s | 21% |
| 7Z | maximum | 1 min 4 s | 52.6 MB | 0.82 MB/s | 20% |
| GZIP | | 24 s | 93 MB | 3.87 MB/s | 35% |
| BZIP2 | | 40 s | 80 MB | 2 MB/s | 30% |
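The two derived columns of the table can be checked with a little arithmetic. Judging from the figures, the speed appears to be the archive size divided by the time (for example 100.2 MB / 3 s = 33.4 MB/s), and the last column the archive size divided by the original size of about 266.3 MB. Both readings are assumptions inferred from the numbers, not something stated in the article. A small Python sketch, with made-up names `ROWS`, `derive` and `report`, recomputes them:

```python
# Figures copied from the table: (label, seconds, archive size in MB).
ROWS = [
    ("ZIP min", 3, 100.2),
    ("ZIP med", 9, 94.6),
    ("ZIP max", 68, 93.0),   # 1 min 8 s
    ("7Z min", 14, 79.8),
    ("7Z med", 50, 55.0),
    ("7Z max", 64, 52.6),    # 1 min 4 s
    ("GZIP", 24, 93.0),
    ("BZIP2", 40, 80.0),
]

ORIGINAL_MB = 266.3  # approximate original size quoted in the article


def derive(seconds, archive_mb, original_mb=ORIGINAL_MB):
    """Return (archive MB written per second, archive size as % of original)."""
    speed = archive_mb / seconds
    percent = 100.0 * archive_mb / original_mb
    return speed, percent


def report(rows=ROWS):
    """Format one line per row with the recomputed speed and percentage."""
    lines = []
    for label, seconds, archive_mb in rows:
        speed, percent = derive(seconds, archive_mb)
        lines.append(f"{label:8s} {speed:6.2f} MB/s  {percent:5.1f}%")
    return lines
```

Printing `"\n".join(report())` gives values that match the table to within rounding. Note that, under this reading, the speed measures compressed output per second, not how fast the original data is consumed.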

You can see in the table that the ZIP format has the highest compression speed at the minimum and medium compression levels, reaching 33.4 MB/s in the first case. But at the maximum compression level its performance drops dramatically (over a minute instead of 9 seconds), and the gain over the medium level is very small (93 MB instead of 94.6 MB). In general, the ZIP format does not reach a very high compression rate.

The 7Z format is slower, but it achieves the highest compression rates: at its lowest level it already produces an archive smaller than the one ZIP makes at its maximum level, and the archive keeps shrinking as the compression level increases.

The GZIP and BZIP2 formats are less widespread, because they are usually found only on *NIX systems. GZIP is very common on these platforms because it’s widely used to distribute the compressed source code of applications, while BZIP2, the newer of the two, reaches better compression rates.

So, in conclusion: if you can choose, adopt the 7Z format. It’s the best of the formats tested, combining the highest compression rate with a reasonable compression speed.