Zapp and copy-happy journalists

I already wrote about this two days ago; the problem back then was finding a way to uniquely identify files. Using md5sum and openssl sha1 seemed to deliver multiple files for each hash, which confused me.
A closer look revealed that the same file was in fact present multiple times, so identical hashes were of course correct.
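
A quick way to double-check this is to list the hashes that occur more than once and compare the files behind one of them byte by byte; a minimal sketch (paths are just examples):

[code]
# hashes printed by uniq -d occur more than once, i.e. the same
# content exists under several names
md5sum /some/dir/* | awk '{print $1}' | sort | uniq -d

# confirm two suspected duplicates really are byte-identical
cmp /some/dir/file-a /some/dir/file-b && echo "identical"
[/code]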

Anyway, I'm now using the following two scripts: one to create an index file of my filesystems' contents, and one to restore files from lost+found based on that index.

[code]
#!/bin/bash
#
# Usage: ./make-lsLR.sh
#
# Purpose:
# to make a file that is parseable for recovering
# a filled /lost+found directory by recording
# filesize, md5sum, permissions and path+filename
#
# Author: Ingo Juergensman - http://blog.windfluechter.net
# License: GPL v2, see http://gnu.org for details.
#
# first: get all directories
nice -15 find / -path /sys -prune -o \
    -path /proc -prune -o \
    -path /var/lib/backuppc -prune -o \
    -path /var/spool/squid -prune -o \
    -type d -print > /root/ls-md5sum-dirs.txt

# next: get all relevant information
nice -15 find / -path /sys -prune -o \
    -path /proc -prune -o \
    -path /var/lib/backuppc -prune -o \
    -path /var/spool/squid -prune -o \
    -type f -printf "%s %U:%G %#m " \
    -exec nice -15 md5sum {} \; | tr -s " " > /root/ls-md5sum-files.txt
[/code]
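
For reference, each line of ls-md5sum-files.txt then holds size, UID:GID, octal permissions, md5sum and path, separated by single spaces; roughly like this (values made up):

[code]
1048576 1000:1000 0644 9e107d9d372bb6826bd81d3542a419d6 /home/user/example.dat
[/code]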

[code]
#!/bin/bash
#
# Usage: ./check_lostfound.sh <path> [make_it_happen]
#
# Purpose: find files in <path>/lost+found and try to restore
# the original files by comparing them against ls-md5sum-files.txt
# (generated by make-lsLR.sh).
# The option make_it_happen causes the data to actually be
# written/moved; by default the script runs in dry-run mode.
#
# Author: Ingo Juergensman - http://blog.windfluechter.net
# License: GPL v2, see http://gnu.org for details.
#
PFAD=$1
dry=$2

# generate md5sums/information for lost+found
echo "Examining ${PFAD}/lost+found..."
find "${PFAD}/lost+found" -type f -printf "%s %U:%G %#m " -exec nice -15 md5sum {} \; > /root/lostfound-files.txt

# create missing directories if necessary by parsing ls-md5sum-dirs.txt
echo "Creating missing directories..."
for dir in `cat /root/ls-md5sum-dirs.txt | sed -e 's/ /,/g'` ; do
    dir=`echo ${dir} | sed -e 's/,/ /g'`
    if [ ! -d "/space/check_lostfound/${dir}" ] ; then
        echo "  Missing dir ${dir} - creating..."
        if [ "$dry" = "make_it_happen" ]; then
            mkdir -p "/space/check_lostfound/${dir}"
        fi
    fi
done

# next, get the md5sum of each file in lost+found and compare it
# against the stored md5sums in ls-md5sum-files.txt
echo "Restoring/moving files..."
for z in `cat /root/lostfound-files.txt | tr -s " " | sed -e 's/ /,/g'` ; do
    z=`echo $z | sed -e 's/,/ /g'`

    size1=`echo $z | awk '{print $1}'`
    ugid1=`echo $z | awk '{print $2}'`
    perm1=`echo $z | awk '{print $3}'`
    md5s1=`echo $z | awk '{print $4}'`
    path1=`echo $z | awk '{print $5}'`

    file=`grep -m 1 $md5s1 /root/ls-md5sum-files.txt`
    if [ ! -z "${file}" ] ; then
        size2=`echo $file | awk '{print $1}'`
        ugid2=`echo $file | awk '{print $2}'`
        perm2=`echo $file | awk '{print $3}'`
        #md5s2=`echo $file | awk '{print $4}'`
        path2=`echo $file | awk '{print $5}'`

        if [ ! -e "/space/check_lostfound/${path2}" ]; then
            echo "$path1 -=> $path2"
            if [ "$dry" = "make_it_happen" ]; then
                cp "${path1}" "/space/check_lostfound/${path2}"
            fi
        fi
    fi
done
[/code]
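
A typical run, assuming make-lsLR.sh has already produced the index files in /root, would be a dry run first and the real thing only once the output looks sane:

[code]
# dry run: only report which directories and files would be restored
./check_lostfound.sh /home

# actually create the directories and copy the files
# (the restored tree ends up under /space/check_lostfound)
./check_lostfound.sh /home make_it_happen
[/code]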

They actually work for me, but use them at your own risk, as always! 😉

Anyway, there are some drawbacks with these versions:

  1. It is dog slow. Grepping through the 75 MB ls-md5sum-files.txt for every file found in lost+found is CPU-intensive and slow.
  2. Files with spaces in their names are not handled properly at all. You'll get "file not found" errors for every filename containing spaces (see the sketch after this list).
  3. grep -m 1 stops after the first match, so when several original paths share the same hash, only the first one is ever considered.
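
For the second point, a NUL-delimited pipeline would avoid word-splitting on spaces altogether. A minimal, untested sketch of how the lost+found loop could read its filenames instead (same ${PFAD} as in the script above):

[code]
# let find emit NUL-terminated records so filenames with spaces
# (or even newlines) survive the loop unharmed
find "${PFAD}/lost+found" -type f -print0 |
while IFS= read -r -d '' f ; do
    sum=`md5sum "$f" | awk '{print $1}'`
    # ... look up $sum in /root/ls-md5sum-files.txt as before ...
    echo "$f has md5 $sum"
done
[/code]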

If anyone has an idea how to speed up the script or improve it otherwise, please let me know! I already thought about removing found files from ls-md5sum-files.txt, both to reduce the workload of grepping through all of the 75 MB every time and to work around the last point above. But rewriting the file every time a file is found and moved will put extra workload on disk I/O, although that will of course improve over time as the file shrinks. Another idea was to keep the index in an in-memory array, but then I would prefer to write a Python script for that purpose.
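
Staying in shell for now, the in-memory idea can at least be sketched with awk's associative arrays: load the big index once, keyed by md5sum, and stream the lost+found list against it in a single pass. A rough, untested sketch, assuming the field layout produced by the scripts above (size, UID:GID, permissions, md5sum, path):

[code]
# one pass instead of one grep per file: while reading the first
# file (NR == FNR), remember md5sum -> original path; while reading
# the second file, print the mapping for every known hash
awk 'NR == FNR { idx[$4] = $5 ; next }
     $4 in idx { print $5 " -=> " idx[$4] }' \
    /root/ls-md5sum-files.txt /root/lostfound-files.txt
[/code]

This only prints the old-to-new mapping; the mkdir/cp part would stay in shell, and filenames with spaces would still need the NUL-delimited treatment from the previous sketch.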

So, if you have some ideas, tips or improvements, please comment!
