This blog post is about an incident that emptied lots of files and cost me half a day of searching for the reason. It is about the UNIX tar tool ("tape archiver").
If you pack an archive with tar and later unpack it with a tool that doesn't handle hard links, you may, under certain circumstances, encounter data loss.
Those circumstances arise when the list of files to pack (that you pass to tar) contains a directory that in turn contains a file that was already listed before. In that case tar packs the file a second time, but this time as a hard link with size zero. (It does so although there was no link of any kind in the source file system!) On unpacking with the wrong tool, the link entry overwrites the real file and truncates it to zero size.
Here is the test directory structure:

tar-test
    people.txt/
        mary.txt
        paul.txt
    peter.txt

All of peter.txt, paul.txt and mary.txt are normal files. By accident the name of the directory people.txt is similar to that of the files.
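If you want to reproduce the scenario, a minimal setup sketch could look like this (the file contents are made up, any text will do):

mkdir -p tar-test/people.txt
echo "I am Mary" > tar-test/people.txt/mary.txt
echo "I am Paul" > tar-test/people.txt/paul.txt
echo "I am Peter" > tar-test/peter.txt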
Now we want to pack an archive from these files, and we want all *.txt files to be in the archive, so we apply a find command with a wildcard:

cd tar-test
tar -cvzf people.tgz `find . -name '*.txt'`
The tar -cvzf command creates an archive (z stands for gzip compression), people.tgz is the name of the resulting archive file, and the command substitution `find . -name '*.txt'` generates the names of the files to pack into the archive.
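For reference, the same command can also be written with GNU tar's long options, which makes the flags easier to read (this assumes GNU tar):

tar --create --verbose --gzip --file=people.tgz `find . -name '*.txt'`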
The find command delivers the correct list:

find . -name '*.txt'
./people.txt
./people.txt/mary.txt
./people.txt/paul.txt
./peter.txt

But unfortunately the directory people.txt also matches the pattern '*.txt' and is in that list.
Now when tar processes the directory people.txt, it also packs every file inside it into the archive. Because these files are in the find list too, the second occurrence of each of them is stored as a hard link(!) with size zero inside the archive.
Look at the result:
tar -tvf people.tgz
drwxrwxrwx root/root  0 2019-04-21 19:19 ./people.txt/
-rwxrwxrwx root/root 11 2019-04-21 19:20 ./people.txt/mary.txt
-rwxrwxrwx root/root 10 2019-04-21 19:20 ./people.txt/paul.txt
hrwxrwxrwx root/root  0 2019-04-21 19:20 ./people.txt/mary.txt link to ./people.txt/mary.txt
hrwxrwxrwx root/root  0 2019-04-21 19:20 ./people.txt/paul.txt link to ./people.txt/paul.txt
-rwxrwxrwx root/root 11 2019-04-21 19:20 ./peter.txt
The tar -tvf command lists the contents of the archive people.tgz. All files inside the people.txt directory (which also matched the '*.txt' wildcard) were packed twice. We see that the second occurrences are the entries marked with a leading 'h' (hard links), and they all have size zero.
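A quick way to check whether an existing archive contains such link entries at all is to filter the listing for the words "link to" (a simple sketch that matches on the listing text):

tar -tvf people.tgz | grep ' link to '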
Mind that unpacking such an archive with tar -xvzf people.tgz works fine, there is no trace of any links in the resulting file system. You can check this with the ls -l command, which lists the link count in the 2nd column, and we see that it is just 1:
ls -l
-rwxrwxrwx 1 root root 11 Apr 21 19:20 mary.txt
-rwxrwxrwx 1 root root 10 Apr 21 19:20 paul.txt
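For comparison: if there really were hard links in the unpacked directory, the link count would be 2. A hypothetical illustration using ln (mary-link.txt is an invented name, output shortened):

ln mary.txt mary-link.txt
ls -l
-rwxrwxrwx 2 root root 11 Apr 21 19:20 mary-link.txt
-rwxrwxrwx 2 root root 11 Apr 21 19:20 mary.txt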
But unpacking with a tool that does not recognize links will do the following:
1. unpack the real paul.txt file (first, because the link depends on it)
2. then overwrite it with the link entry paul.txt, which has size zero

Damage done! In case the file is not verified after unpacking, you may not detect the data loss for a long time.
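If you suspect that such an archive has already been unpacked with a link-unaware tool, a quick way to look for the damage is to search for files that were truncated to zero size (a sketch; of course this also lists files that are legitimately empty):

find . -type f -size 0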
Here is a way to fix it:
tar -cvzf people.tgz `find . -type f -name '*.txt'`
The find -type f command finds only regular files, no directories, thus the directory people.txt, although it matches the '*.txt' pattern, would not be in the result list.
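For our example directory, the filtered list would look like this (a sketch):

find . -type f -name '*.txt'
./people.txt/mary.txt
./people.txt/paul.txt
./peter.txt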
Different operating systems have different file systems. Hard and symbolic links are a UNIX concept; on WINDOWS they exist only in a limited form and are rarely used. Many Java tools do not create or handle links at all, because portable Java code traditionally restricts itself to functionality that is available on all platforms.
Here is a fix for Java (it simply ignores links inside a tar archive), using the com.ice.tar library:
TarEntry e = tarInputStream.getNextEntry();
if (e != null) {
    // an entry without a link name is a real file and can be extracted safely
    if (e.getHeader().linkName == null || e.getHeader().linkName.length() <= 0) {
        File created = super.extractEntry(dir, e);
        created.setLastModified(e.getModTime().getTime());
    }
    else {
        // a link entry would truncate the already extracted file, so skip it
        System.err.println("Did not extract link: " + e.getName() + " -> " + e.getHeader().linkName);
    }
}
ɔ⃝ Fritz Ritzberger, 2019-04-21