Re: [Tutor] First program after PyCamp

Dave Angel Mon, 10 Jun 2013 20:29:17 -0700

On 06/10/2013 04:03 PM, bja...@jamesgang.dyndns.org wrote:

Hello I just took a 3 day PyCamp and am working on my first program from
start to finish at the moment and running into a problem.


Not sure how it's supposed to be on this list

Please start by stating your version of Python. I'm going to assume Python 2.7, since in Python 3 you'd have gotten an error from the hashing logic after opening that file in text mode.


so I'm going to first

describe what my program is supposed to do and where I'm having the
problems, then post the actual code.  Please don't simply come back with a
working version of the code since I want to know what each step does, but
also explain where I went wrong and why the new version is better, or even
better suggest where I should look to fix it without actually fixing it
for me.

The program is supposed to be given a directory and then look through that
directory and all sub-directories and find duplicate files and output the
list of duplicates to a text file.  This is accomplished by generating a
MD5 hash for the files then comparing that hash to a list of previous
hashes that have been generated.

My problem is the text file that is output seems to contain EVERY file
that the program went through rather than just the duplicates.  I've tried
to mentally step through and figure out where/why it's listing all of them
rather than just duplicates and I seem to be failing to spot it.

First some general comments. Please don't use so many columns for your source file. That may work for you on your own screen, but it's problematic when someone else has to deal with the code. In my case, several of those comments wrapped in the email, and I had to re-edit them before trying things.

Next, try to factor your code into reasonable pieces. You have a single function that does at least 3 things; make it three functions. One reason is that then you can frequently figure out which ones of them are correct, and which ones need work. Another reason is you can reuse the logic you spent time building and testing.

First function gathers filenames that meet a particular criterion. In this case, it's simply all files in a subtree of a starting place. After you get the hang of things, you'll realize this could better be a generator, but no hurry yet. At present, it should generate a list, and RETURN that list. Not just tuck things into some global.

Second function generates md5 checksums of all those files, uses that checksum as a key, and groups the files having the same key together. It should return the dict, rather than modifying some global one.

Third function analyzes the dict created by the second one, and prepares a report (to file, or to print).
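The three-way split described above might look roughly like the sketch below. It is only one possible layout, not the original poster's code: the function names (gather_files, group_by_md5, report_duplicates) and the blocksize parameter are invented for illustration, and the report function returns lines instead of writing a file so that each piece can be checked on its own.

```python
import hashlib
import os


def gather_files(rootdir):
    """Return a list of full paths for every file below rootdir."""
    found = []
    for path, dirs, files in os.walk(rootdir):
        for filename in files:
            found.append(os.path.join(path, filename))
    return found


def group_by_md5(filenames, blocksize=4096):
    """Return a dict mapping md5 hex digest -> list of filenames."""
    groups = {}
    for fullname in filenames:
        md5 = hashlib.md5()
        with open(fullname, "rb") as f:  # binary mode, so raw bytes are hashed
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                md5.update(block)
        groups.setdefault(md5.hexdigest(), []).append(fullname)
    return groups


def report_duplicates(groups):
    """Return one report line per digest shared by more than one file."""
    lines = []
    for digest, files in sorted(groups.items()):
        if len(files) > 1:
            lines.append("%s: %s" % (digest, ", ".join(sorted(files))))
    return lines
```

Because each function returns its result, they can be chained as report_duplicates(group_by_md5(gather_files(rootdir))), and any one of them can be tested without the other two.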


here's the code:
---begin code paste---
import os, hashlib
#rootdir = 'c:\Python Test'

I know this is commented out, but notice that the backslash is a big risk here. If a directory happened to start with t, for example, the filename would have a tab in it. Either use forward slashes (which do work in Windows), or use a raw literal. Or escape it by doubling the backslashes.
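A small illustration of the backslash risk and the three ways around it. The path c:\temp is made up for the example; it is not taken from the original code.

```python
# Hypothetical paths, for illustration only.
risky = 'c:\temp'      # \t is read as a single tab character
doubled = 'c:\\temp'   # backslash escaped by doubling it
raw = r'c:\temp'       # raw literal: the backslash is kept as written
forward = 'c:/temp'    # forward slash, which Windows also accepts

print(repr(risky))     # shows the tab hiding inside the string
print(doubled == raw)  # the doubled and raw spellings are the same string
```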

hashlist = {}  # content signature -> list of filenames
dups = []

def get_hash(rootdir):
#   """goes through directory tree, compares md5 hash of all files,
#   combines files with same hash value into list in hashmap directory"""
     for path, dirs, files in os.walk(rootdir):
         #this section goes through the given directory, and all
subdirectories/files below
         #as part of a loop reading them in
         for filename in files:
             #steps through each file and starts the process of getting the
MD5 hashes for the file
             #comparing that hash to known hashes that have already been
calculated and either merges it
             #with the known hash (which indicates duplicates) or adds it
so that it can be compared to future
             #files
             fullname = os.path.join(path, filename)
             with open(fullname) as f:

You're defaulting to text mode, so the checksum will not in general match the standard md5sum, which should get the exact same value. And in Python 3, this would try to decode the file as text, using some default encoding, and even if that didn't fail, you would then get an error down below, since md5.update() doesn't work on characters, but on bytes. Use binary mode instead:


               with open(fullname, "rb") as f:

                 #does the actual hashing
                 md5 = hashlib.md5()
                 while True:
                     d = f.read(4096)
                     if not d:
                         break
                     md5.update(d)
                 h = md5.hexdigest()
                 filelist = hashlist.setdefault(h, [])
                 filelist.append(fullname)
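The text-mode versus binary-mode point made earlier can be checked without touching any files. The sketch below assumes the usual Windows text-mode behaviour of translating \r\n to \n on read; the sample content and the helper names md5_hex and can_hash are invented for the example. The two digests it compares come out different, which is why a text-mode checksum would disagree with md5sum.

```python
import hashlib


def md5_hex(data):
    """Return the md5 hex digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


# Made-up sample content with Windows line endings.
on_disk = b"first line\r\nsecond line\r\n"
# What a text-mode read on Windows would hand back: \r\n collapsed to \n.
as_read_in_text_mode = on_disk.replace(b"\r\n", b"\n")

print(md5_hex(on_disk) == md5_hex(as_read_in_text_mode))


def can_hash(value):
    """True if md5.update() accepts value, False if it raises TypeError."""
    try:
        hashlib.md5().update(value)
        return True
    except TypeError:
        return False
```

In Python 3, can_hash(b"abc") is true while can_hash("abc") is false, which is the error a text-mode read would eventually run into.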

     for x in hashlist:
         currenthash = hashlist[x]


This is confusing because of the names, and inefficient because each key gets looked up a second time.  Try
       for hash, files in hashlist.items():

hash is what you called x, and files is what you called currenthash.

         #goes through and if hash has more than one file listed with it
         #considers it a duplicate and adds it to the output list
         if len(currenthash) > 1:
             dups.append(currenthash)


Now with the renaming, this becomes:

           if len(files) > 1:
                dups.append(files)

Note that if I were you, I'd make it
                dups.append((hash, files))

that way, the output would (by default) show the hash that these files supposedly shared.
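Since list.append() takes exactly one argument, the hash and its file list have to go in together as a single tuple. A tiny sketch with made-up digests and filenames (shortened so they fit on a line), using digest as the loop name to avoid shadowing the built-in hash():

```python
# Made-up digests and filenames, for illustration only.
hashlist = {
    "d41d8cd9": ["a.txt", "copy_of_a.txt"],
    "90015098": ["b.txt"],
}

dups = []
for digest, files in hashlist.items():
    if len(files) > 1:
        dups.append((digest, files))  # one argument: a (digest, files) tuple

print(dups)
```

Each entry in dups then carries the shared digest alongside the filenames, so a suspicious group can be rechecked by hand with md5sum.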

     output = open('duplicates.txt','w')
     output.write(str(dups))
     output.close()

Clearly this output could be made prettier, once you're confident the other parts are working. But I imagine you just hadn't gotten to that yet.

Now, the code worked fine for me, and I doublechecked each of the duplicates it found with md5sum (standard on Linux, and available for Windows).

When you look at the output, are there any entries that do NOT have multiple files listed? Have you done a dir /s /b of the same directory tree, redirected it to a file, and compared the size of that file to what this code finds?



--
DaveA
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
