Compare Two File Systems

Sometimes a developer or sysadmin needs to find the differences between two file systems. I used to do this regularly while updating a large-scale ecommerce site that sold digital products. Now I find it useful when doing server moves.

The idea is simple. You have two file systems that you need to sync. You don’t want the leftovers from the old system, just the really juicy, important files. There’s no way you’re going to manually compare the two systems to figure out what is missing. That would take forever and you’re the good kind of lazy, so you’re looking for a better way.

You want a list of what is in a directory on the old system and a list of what is in the same directory on the new system. Comparing the two lets you collect only the files missing from the new file system so you can move them over, preferably as a compressed archive so you can make quick work of the task.

The method described in this article will take approximately 30 minutes, depending largely on the amount of leftover cruft in the old system that you want to ignore. All the commands you need are included.

Let’s Begin…

This may look intimidating, but don’t be afraid. It’s really easy. For the sake of this article, let’s assume you have a website that you’re moving to a new server. Each server has a folder called ‘htdocs’ in the Document Root of the website.

1. On the new server, move to the htdocs folder and run the command

find . -xtype f -follow > ~/newserver.txt

2. On the old server, move to the htdocs folder and run the same command (name the file differently)

find . -xtype f -follow > ~/oldserver.txt

You now have one file in your home directory on each server, listing the files in that server’s htdocs folder. Now you just need to get both lists onto the same server so you can work with them. It’s easiest to work on the older server, since that is where you will be collecting the missing files from.
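To get a feel for what these lists contain, here is a minimal sketch that runs a similar find against a throwaway directory. The directory layout and file names are made up for illustration, and it uses the more portable -type f rather than the -xtype test from the steps above (which is specific to GNU find).

```shell
# Build a tiny stand-in for htdocs in a scratch directory (hypothetical files).
work=$(mktemp -d)
mkdir -p "$work/htdocs/img"
echo '<html></html>' > "$work/htdocs/index.html"
echo 'png' > "$work/htdocs/img/logo.png"

# List every regular file, relative to the htdocs folder.
cd "$work/htdocs"
find . -type f > "$work/newserver.txt"

# Each line is a path starting with "./", e.g. ./index.html
cat "$work/newserver.txt"
```

Because the paths are relative to htdocs, the same list can later be fed to tar from any directory by pointing tar at the htdocs folder.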

3. Copy the file ‘newserver.txt’ from your home directory on the new server into your home directory on the old server (the one you’re moving from). On the old server, move to your home directory and use the command

scp you@newserver:~/newserver.txt .

A Little Linux Magick

Now you have both the newserver.txt and oldserver.txt files in your home directory on the old server. We’re going to use a few Linux commands, combined with Perl one-liners, to get to the finish line.

4. Sort and get unique lines from each file.

sort oldserver.txt | uniq > oldserver.sort

sort newserver.txt | uniq > newserver.sort

The new .sort files contain alphabetically sorted, unique strings that represent the files in question. Now you can diff the two files to see what needs to be moved and remove any extra files you might not want.
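One caveat, offered as a suggestion rather than part of the original recipe: sort orders lines according to the current locale, so if the two lists are ever sorted under different locale settings, the later diff can report spurious differences. A minimal sketch, assuming a made-up list with a duplicate entry, that forces plain byte ordering with LC_ALL=C and uses sort -u in place of sort | uniq:

```shell
# Work in a scratch directory so no real list is overwritten.
work=$(mktemp -d)
cd "$work"

# Hypothetical file list with a duplicate line and mixed-case names.
printf './b.txt\n./a.txt\n./B.txt\n./a.txt\n' > oldserver.txt

# LC_ALL=C sorts by raw byte value; -u drops duplicate lines.
LC_ALL=C sort -u oldserver.txt > oldserver.sort

# Uppercase sorts before lowercase in byte order:
# ./B.txt, ./a.txt, ./b.txt
cat oldserver.sort
```

Whichever ordering you choose, the point is to sort both lists the same way.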

5. Diff the two files with the command:

diff newserver.sort oldserver.sort > combined.diff

This command creates entries in the new file ‘combined.diff’ that start with either ‘<’ or ‘>’. If a line starts with ‘<’, it appears only in the first file, so it’s on the new server and not on the old one. If it starts with ‘>’, it appears only in the second file, so it’s on the old server and not on the new one. We only care about what is on the old server and not on the new, so we’ll just pull those lines out of combined.diff using the following command.

cat combined.diff | perl -ne 'next unless /^>/; print $_;' > missing.txt
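To see the diff markers in action, here is a small sketch on two throwaway lists. The file contents are invented, and grep stands in for the Perl filter purely to keep the example short.

```shell
# Scratch directory so nothing in the real home directory is touched.
work=$(mktemp -d)
cd "$work"

# Hypothetical sorted lists: the old server has one extra file.
printf './img/logo.png\n./index.html\n' > newserver.sort
printf './downloads/ebook.pdf\n./img/logo.png\n./index.html\n' > oldserver.sort

# diff exits with status 1 when the files differ; that is not an error here.
diff newserver.sort oldserver.sort > combined.diff || true

# Lines starting with ">" exist only in the second file (the old server).
grep '^>' combined.diff > missing.txt

# prints: > ./downloads/ebook.pdf
cat missing.txt
```

The first argument to diff owns the ‘<’ lines and the second owns the ‘>’ lines, so swapping the two file names swaps the meaning of the markers.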

Now you just want to remove any files referenced that you might not want, leaving only the ones that need to be moved. For instance, some other pesky developer made a directory called ‘BAK’, which you know isn’t used and you don’t want. Use another Perl one-liner like so:

cat missing.txt | perl -ne 'next if /BAK/; print $_;' > missing.pass1

I’m only making a new file missing.pass1 because I’d like to make multiple passes on cleaning this file. If something were to go wrong, like a command line typo, I can rerun the command on the last ‘pass’ rather than botch the source file and have to start over.

6. Check out what’s in the file. Is there anything else in there you don’t want? I had a logfiles directory and a bunch of CVS directories I didn’t want, so I removed them like so:

cat missing.pass1 | perl -ne 'next if /logfiles/; print $_;' > missing.pass2

cat missing.pass2 | perl -ne 'next if /CVS/; print $_;' > missing.pass3

7. Now, we want to use the tar command to gather the files in a compressed archive. Tar can read its list of files from a file, but it doesn’t know what to do with the leading ‘>’ marker on each line, so we need to remove it.

cat missing.pass3 | perl -pe 's/^> //' > missing.pass4
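Once you know which patterns you want to drop, the cleanup passes and the marker removal can be collapsed into a single pipeline. This is only a sketch of an alternative, using grep and sed instead of Perl; the sample list and the output name missing.list are made up for illustration.

```shell
# Scratch directory with a hypothetical missing.txt.
work=$(mktemp -d)
cd "$work"
cat > missing.txt <<'EOF'
> ./BAK/old.php
> ./downloads/ebook.pdf
> ./logfiles/access.log
> ./lib/CVS/Entries
> ./img/banner.png
EOF

# Drop every line matching an unwanted pattern, then strip the
# leading "> " marker so only the bare relative path remains.
grep -v -e 'BAK' -e 'logfiles' -e 'CVS' missing.txt | sed 's/^> //' > missing.list

# Leaves ./downloads/ebook.pdf and ./img/banner.png
cat missing.list
```

Separate passes are still handy while you are exploring the list, because each intermediate file gives you a point to go back to.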

Let’s Move…

Now, from your home directory, you can make an archive of the files you need using a few switches on the Linux tar command.

tar -C /var/www/htdocs/ -T missing.pass4 -cvzf myarchive.tgz
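Before copying the archive anywhere, it can be worth confirming that it contains what the list asked for. The following is a minimal sketch that uses a scratch directory in place of /var/www/htdocs, with invented file names; tar -t lists the members of an archive without extracting them.

```shell
# Scratch stand-in for the htdocs folder (hypothetical files).
work=$(mktemp -d)
mkdir -p "$work/htdocs/downloads" "$work/htdocs/img"
echo 'pdf' > "$work/htdocs/downloads/ebook.pdf"
echo 'png' > "$work/htdocs/img/banner.png"
cd "$work"

# The cleaned list of relative paths to collect.
printf './downloads/ebook.pdf\n./img/banner.png\n' > missing.pass4

# Build the archive; the list is given by absolute path so that
# -C cannot change where tar looks for it.
tar -C "$work/htdocs" -T "$work/missing.pass4" -czf myarchive.tgz

# Compare the archive's member list with the requested list.
tar -tzf myarchive.tgz | sort > archived.txt
sort missing.pass4 > expected.txt
diff expected.txt archived.txt && echo "archive matches the list"
```

If diff prints anything here, some file in the list did not make it into the archive (or something extra did), which is much easier to sort out on the old server than after the move.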

Eureka! We have a compressed tar file containing all the files we need on the newer server. Now we just scp the archive over (which is a lot quicker than moving each file individually!).

scp myarchive.tgz you@newserver:

Warning: It’s always wise to back up a directory before making massive edits or changes. You may very well want to back up the htdocs folder on the new server before reading the next paragraph!

Be sure to move to the htdocs folder before uncompressing or, if you’re comfortable, just pass the -C option to tar again, pointing at the htdocs folder, e.g. tar -C /var/www/htdocs/ -xzvf myarchive.tgz
In this case, you also want to check file permissions immediately after extracting the archive because Apache is involved.

You can use this technique any time you want to compare two file systems on Linux and find missing files.
