Snow again…

At the end of last year I migrated from the PowerPC architecture to i386, which made it necessary to reformat/repartition/relabel the old 250 GB disk from an Amiga disk label to a PC disk label. So I backed up all necessary files from the 250 GB disk to a 160 GB disk.
After installing the system on the 250 GB disk, I wanted to copy the backup back from the other disk, but accidentally issued a pvcreate on the backup partition. Big mistake. 😉 Needless to say, lots of files ended up in lost+found: 28 GB in total. Of course it’s troublesome to sort all those files and directories back into their original places.

Because I don’t know of any existing tool for this, I wrote some scripts to address the issue. The first script is run up front to get a list of directories and files, along with a unique identifier for each file.

[code]
#!/bin/bash
#
# Purpose of this file:
# build a file that is parseable for recovering
# a filled /lost+found directory by recording
# file size, checksum, permissions and path+filename
#

# first: get all directories, skipping pseudo filesystems
# and large cache/backup trees
nice -15 find / -path /sys -prune -o \
	-path /proc -prune -o \
	-path /var/lib/backuppc -prune -o \
	-path /var/spool/squid -prune -o \
	-type d -print > /root/ls-md5sum-dirs.txt

# next: get all relevant information about the files
nice -15 find / -path /sys -prune -o \
	-path /proc -prune -o \
	-path /var/lib/backuppc -prune -o \
	-path /var/spool/squid -prune -o \
	-type f -printf "%s %m " \
	-exec openssl sha1 {} \; | sed -e 's/SHA1(//g' -e 's/)=//g' > /root/ls-sha1sum-files.txt
# alternative (first version):
#	-exec nice -15 md5sum {} \; > /root/ls-md5sum-files.txt
[/code]
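
For reference, the lines in those files look roughly like this (the sizes, permissions, hashes and paths below are invented examples). Note that the field order differs between the two variants, which is why the awk calls further down pick field 4 for the sha1 file and field 3 for the md5 file, and that paths containing spaces will shift the fields:

[code]
# ls-sha1sum-files.txt: size, permissions, path, sha1 (example line, values invented)
1952 644 /etc/passwd 3f786850e387550fdab836ed7e6dc881de23001b
# ls-md5sum-files.txt: size, permissions, md5, path (example line, values invented)
1952 644 6d7fce9fee471194aa8b5b6e47267f03  /etc/passwd
[/code]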

This will create two files: /root/ls-md5sum-dirs.txt and /root/ls-sha1sum-files.txt (or /root/ls-md5sum-files.txt when using the commented-out md5sum line). The above is the second version of the script, using openssl sha1 instead of md5sum, because I discovered that md5sum was giving me duplicate hashes:

[code]muaddib:~# wc -l ls-md5sum-files.txt
712251
muaddib:~# cat ls-md5sum-files.txt | awk '{print $3}' | sort -u | wc -l
576539
[/code]

That is a difference of roughly 135,000 files. The same script with openssl sha1 gives a better approximation, but there is still a gap between the two numbers:

[code]
muaddib:~# wc -l ls-sha1sum-files.txt
712137 ls-sha1sum-files.txt
muaddib:~# cat ls-sha1sum-files.txt | awk '{print $4}' | uniq | wc -l
694200
[/code]

The difference is much smaller with openssl, but it takes a lot more time to generate the resulting file.
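
If you want to quantify that speed difference yourself, a rough comparison on a sample tree could look like this (a sketch; /usr/share is just an arbitrary example directory):

[code]
# rough timing comparison on a sample tree; both variants
# produce one checksum per regular file, output is discarded
time find /usr/share -type f -exec md5sum {} \; > /dev/null
time find /usr/share -type f -exec openssl sha1 {} \; > /dev/null
[/code]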

Anyway, my second script assumes that the files can be identified by a unique hash. Currently it looks like this:

[code]
#!/bin/bash

PFAD=$1
dry=$2

# generate md5sums/information for the files in lost+found
echo "Examining ${PFAD}/lost+found..."
find "${PFAD}/lost+found" -type f -printf "%s %m " -exec nice -15 md5sum {} \; > /root/lostfound-files.txt

# create missing directories if necessary by parsing ls-md5sum-dirs.txt
echo "Creating missing directories..."
for dir in `cat /root/ls-md5sum-dirs.txt | sed -e 's/ /,/g'` ; do
	# spaces were turned into commas above so the loop splits on newlines only
	dir=`echo ${dir} | sed -e 's/,/ /g'`
	if [ ! -d "${dir}" ] ; then
		echo "  Missing dir ${dir} - creating..."
		# only act when the second argument is "make_it_happen", otherwise dry run
		if [ "$dry" = "make_it_happen" ]; then
			mkdir -p "${dir}"
		fi
	fi
done

# next, get the md5sum of each file in lost+found and compare it
# against the stored md5sums in ls-md5sum-files.txt
echo "Restoring/moving files..."
for z in `cat /root/lostfound-files.txt | tr -s " " | sed -e 's/ /,/g'` ; do
	z=`echo $z | sed -e 's/,/ /g'`

	#size1=`echo $z | awk '{print $1}'`
	#perm1=`echo $z | awk '{print $2}'`
	md5s1=`echo $z | awk '{print $3}'`
	path1=`echo $z | awk '{print $4}'`

	file=`grep "$md5s1" /root/ls-md5sum-files.txt`
	if [ ! -z "${file}" ] ; then
		file=`echo $file | sed -e 's/,/ /g'`
		#size2=`echo $file | awk '{print $1}'`
		#perm2=`echo $file | awk '{print $2}'`
		#md5s2=`echo $file | awk '{print $3}'`
		path2=`echo $file | awk '{print $4}'`

		echo "$path1 -=> $path2"
		# only act when the second argument is "make_it_happen", otherwise dry run
		if [ "$dry" = "make_it_happen" ]; then
			mv "${path1}" "${path2}"
		fi
	fi
done
[/code]
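
Invocation would look roughly like this (the script name and the mount point are just examples): without the magic second argument the script only reports what it would do, with it the directories and moves are actually performed.

[code]
# dry run: only report missing directories and planned moves
./restore-lostfound.sh /mnt/recovered

# actually create directories and move files back to their original paths
./restore-lostfound.sh /mnt/recovered make_it_happen
[/code]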

This script generates a list similar to the one from the first script, compares each hash against the stored list and tries to move the file back to its original path. Because of the above-mentioned difference between the total number of lines and the number of unique lines, this approach seems a little bit broken.
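
One way to make the matching less ambiguous (a sketch, not part of my scripts; it assumes the "size perm hash path" format of ls-md5sum-files.txt) would be to use file size plus hash as the lookup key and skip keys that occur more than once:

[code]
# build a list of "size_hash" keys that occur more than once in the reference file,
# so the restore loop can skip those ambiguous entries
awk '{print $1"_"$3}' /root/ls-md5sum-files.txt | sort | uniq -d > /root/ambiguous-keys.txt

# example check inside the restore loop (size1/md5s1 as in the script above):
# if grep -q "^${size1}_${md5s1}$" /root/ambiguous-keys.txt ; then continue ; fi
[/code]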

So, dear lazyweb: is there something that can generate unique fingerprints in a reasonable amount of time (comparable to md5sum), or is there perhaps an existing application for this that I’m not aware of? And yes, I know there are backup applications, but that is another topic. In this case the backup medium was corrupted, so the problem remains: how to identify the files in lost+found and move them back automatically, if possible?

I’d appreciate any help, tips, patches, ideas, … 🙂
