On 06/10/2013 04:03 PM, bja...@jamesgang.dyndns.org wrote:
Hello I just took a 3 day PyCamp and am working on my first program from
start to finish at the moment and running into a problem.

Not sure how it's supposed to be on this list

Please start by defining your version of Python. I'm going to assume Python 2.7, as in Python 3, you'd have gotten an error from the hashing, logic, after opening that file in text mode.

so I'm going to first
describe what my program is supposed to do and where I'm having the
problems, then post the actual code.  Please don't simply come back with a
working version of the code since I want to know what each step does, but
also explain where I went wrong and why the new version is better, or even
better suggest where I should look to fix it without actually fixing it
for me.

The program is supposed to be given a directory and then look through that
directory and all sub-directories and find duplicate files and output the
list of duplicates to a text file.  This is accomplished by generating a
MD5 hash for the files then comparing that hash to a list of previous
hashes that have been generated.

My problem is the text file that is output seems to contain EVERY file
that the program went through rather than just the duplicates.  I've tried
to mentally step through and figure out where/why it's listing all of them
rather than just duplicates and I seem to be failing to spot it.

First some general comments. Please don't use so many columns for your source file. That may work for you on your own screen, but it's problematic when someone else has to deal with the code. In my case, several of those comments wrapped in the email, and I had to re-edit them before trying things.

Next, try to factor your code into reasonable pieces. You have a single function that does at least 3 things; make it three functions. One reason is that then you can frequently figure out which ones of them are correct, and which ones need work. Another reason is you can reuse the logic you spent time building and testing.

First function gathers filenames that meet a particular criteria. In this case, it's simply all files in a subtree of a starting place. After you get the hang of things, you'll realize this could better be a generator, but no hurry yet. At present, it should generate a list, and RETURN that list. Not just tuck things into some global.

Second function generates md5 checksums of all those file, uses that checksum as a key, and groups the files having the same key together. It should return the dict, rather than modifying some global one.

Third function analyzes the dict created by the second one, and prepares a report (to file, or to print).



here's the code:
---begin code paste---
import os, hashlib
#rootdir = 'c:\Python Test'

I know this is commented out, but notice that the backslash is a big risk here. If a directory happened to start with t, for example, the filename would have a tab in it. Either use forward slashes (which do work in Windows), or use a raw literal. Or escape it with doubling the backslashes.

hashlist = {}  # content signature -> list of filenames
dups = []

def get_hash(rootdir):
#   """goes through directory tree, compares md5 hash of all files,
#   combines files with same hash value into list in hashmap directory"""
     for path, dirs, files in os.walk(rootdir):
         #this section goes through the given directory, and all
subdirectories/files below
         #as part of a loop reading them in
         for filename in files:
             #steps through each file and starts the process of getting the
MD5 hashes for the file
             #comparing that hash to known hashes that have already been
calculated and either merges it
             #with the known hash (which indicates duplicates) or adds it
so that it can be compared to future
             #files
             fullname = os.path.join(path, filename)
             with open(fullname) as f:

You're defaulting to text files, so the checksum will not in general match the standard md5sum which should get the exact same value. And in Python 3, this would try to decode the file as text, using some default encoding, and even if it didn't fail, would then get an error down below, since the hexdigest() stuff doesn't work on characters, but on bytes.

               with open(fullname, "rb") as f:

                 #does the actual hashing
                 md5 = hashlib.md5()
                 while True:
                     d = f.read(4096)
                     if not d:
                         break
                     md5.update(d)
                 h = md5.hexdigest()
                 filelist = hashlist.setdefault(h, [])
                 filelist.append(fullname)

     for x in hashlist:
         currenthash = hashlist[x]

This is confusing both for the names, and for the inefficiency.  Try
       for hash, files in hashlist.items():

hash is what you called x, and files is what you called currenthash.

         #goes through and if has has more than one file listed with it
         #considers it a duplicate and adds it to the output list
         if len(currenthash) > 1:
             dups.append(currenthash)

Now with the renaming, this becomes:

           if len(files) > 1:
                dups.append(files)

Note that if I were you, I'd make it
                dups.append(hash, files)

that way, the output would (by default) show the hash that these files supposedly shared.

     output = open('duplicates.txt','w')
     output.write(str(dups))
     output.close()

Clearly this output could be made prettier, once you're confident the other parts are working. But I imagine you just hadn't gotten to that yet.

Now, the code worked fine for me, and I doublechecked each of the duplicates it found with md5sum (standard on Linux, and available for Windows).

When you look at the output, are there any entries that do NOT have multiple files listed? Have you done a dir /s /b of the same directory tree, redirected it to a file, and compared the size of that file to what this code finds?


--
DaveA
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor

Reply via email to