program to generate data helpful in finding duplicate large files

2014-09-18 Thread David Alban
greetings,

i'm a long time perl programmer who is learning python.  i'd be interested
in any comments you might have on my code below.  feel free to respond
privately if you prefer.  i'd like to know if i'm on the right track.  the
program works, and does what i want it to do.  is there a different way a
seasoned python programmer would have done things?  i would like to learn
the culture as well as the language.  am i missing anything?  i know i'm
not doing error checking below.  i suppose comments would help, too.

i wanted a program to scan a tree and for each regular file, print a line
of text to stdout with information about the file.  this will be data for
another program i want to write which finds sets of duplicate files larger
than a parameter size.  that is, using output from this program, the sets
of files i want to find are on the same filesystem on the same host
(obviously, but i include hostname in the data to be sure), and must have
the same md5 sum, but different inode numbers.

the output of the code below is easier for a human to read when paged
through 'less', which on my mac renders the ascii nuls as "^@" in reverse
video.

thanks,
david


usage: dupscan [-h] [--start-directory START_DIRECTORY]

scan files in a tree and print a line of information about each regular
file

optional arguments:
  -h, --help            show this help message and exit
  --start-directory START_DIRECTORY, -d START_DIRECTORY
                        specifies the root of the filesystem tree to be
                        processed




#!/usr/bin/python

import argparse
import hashlib
import os
import re
import socket
import sys

from stat import *

ascii_nul = chr(0)

# from:
# http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
# except that i use hexdigest() rather than digest()
def md5_for_file(f, block_size=2**20):
  md5 = hashlib.md5()
  while True:
    data = f.read(block_size)
    if not data:
      break
    md5.update(data)
  return md5.hexdigest()

thishost = socket.gethostname()

parser = argparse.ArgumentParser(description='scan files in a tree and'
  ' print a line of information about each regular file')
parser.add_argument('--start-directory', '-d', default='.',
  help='specifies the root of the filesystem tree to be processed')
args = parser.parse_args()

start_directory = re.sub( '/+$', '', args.start_directory )

for directory_path, directory_names, file_names in os.walk( start_directory ):
  for file_name in file_names:
    file_path = "%s/%s" % ( directory_path, file_name )

    lstat_info = os.lstat( file_path )

    mode = lstat_info.st_mode

    if not S_ISREG( mode ) or S_ISLNK( mode ):
      continue

    f = open( file_path, 'r' )
    md5sum = md5_for_file( f )

    dev   = lstat_info.st_dev
    ino   = lstat_info.st_ino
    nlink = lstat_info.st_nlink
    size  = lstat_info.st_size

    sep = ascii_nul

    print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
      dev, sep, ino, sep, nlink, sep, size, sep, file_path )

exit( 0 )



-- 
Our decisions are the most important things in our lives.
***
Live in a world of your own, but always welcome visitors.
-- 
https://mail.python.org/mailman/listinfo/python-list


Fwd: program to generate data helpful in finding duplicate large files

2014-09-19 Thread David Alban
here is my reworked code in a plain text email.

-- Forwarded message --
From: 
Date: Thu, Sep 18, 2014 at 3:58 PM
Subject: Re: program to generate data helpful in finding duplicate large
files
To: [email protected]


thanks for the responses.   i'm having quite a good time learning python.

On Thu, Sep 18, 2014 at 11:45 AM, Chris Kaynor wrote:
>
> Additionally, you may want to specify binary mode by using
> open(file_path, 'rb') to ensure platform-independence ('r' uses Universal
> newlines, which means on Windows, Python will convert "\r\n" to "\n" while
> reading the file). Additionally, some platforms will treat binary files
> differently.


would it be good to use 'rb' all the time?
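as a quick self-contained illustration of why the exact bytes matter for hashing (made-up data, not from the thread): if text mode translated line endings, the digest would change.

```python
import hashlib

raw = b"line one\r\nline two\r\n"
# what universal-newlines text mode would hand back on windows
translated = raw.replace(b"\r\n", b"\n")

# different bytes in, different digest out
assert hashlib.md5(raw).hexdigest() != hashlib.md5(translated).hexdigest()
```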

On Thu, Sep 18, 2014 at 11:48 AM, Chris Angelico  wrote:
>
> On Fri, Sep 19, 2014 at 4:11 AM, David Alban  wrote:
> > exit( 0 )
>
> Unnecessary - if you omit this, you'll exit 0 implicitly at the end of
> the script.


aha.  i've been doing this for years even with perl, and apparently it's
not necessary in perl either.  i was influenced by shell.

this shell code:

 if [[ -n $report_mode ]] ; then
   do_report
 fi

 exit 0

is an example of why you want the last normally executed shell statement to
be "exit 0".  if you omit the exit statement in this example, and
$report_mode is not set, your shell program will give a non-zero return
code and appear to have terminated with an error.  in shell, the last
expression evaluated determines the return code to the os.

ok, i don't need to do this in python.

On Thu, Sep 18, 2014 at 1:23 PM, Peter Otten <[email protected]> wrote:
>
> file_path may contain newlines, therefore you should probably use "\0" to
> separate the records.


i chose to stick with ascii nul as the default field separator, but i added
a --field-separator option in case someone wants human readable output.

style question:  if there is only one, possibly short statement in a block,
do folks usually move it up to the line starting the block?

  if not S_ISREG( mode ) or S_ISLNK( mode ):
    return

vs.

  if not S_ISREG( mode ) or S_ISLNK( mode ): return

or even:

  with open( file_path, 'rb' ) as f: md5sum = md5_for_file( file_path )



fyi, here are my changes:


usage: dupscan [-h] [--start-directory START_DIRECTORY]
               [--field-separator FIELD_SEPARATOR]

scan files in a tree and print a line of information about each regular file

optional arguments:
  -h, --help            show this help message and exit
  --start-directory START_DIRECTORY, -d START_DIRECTORY
                        Specify the root of the filesystem tree to be
                        processed. The default is '.'
  --field-separator FIELD_SEPARATOR, -s FIELD_SEPARATOR
                        Specify the string to use as a field separator in
                        output. The default is the ascii nul character.



#!/usr/bin/python

import argparse
import hashlib
import os

from platform import node
from stat import S_ISREG, S_ISLNK

ASCII_NUL = chr(0)

# from:
# http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
# except that i use hexdigest() rather than digest()
def md5_for_file( path, block_size=2**20 ):
  md5 = hashlib.md5()
  with open( path, 'rb' ) as f:
    while True:
      data = f.read(block_size)
      if not data:
        break
      md5.update(data)
  return md5.hexdigest()

def file_info( directory, basename, field_separator=ASCII_NUL ):
  file_path = os.path.join( directory, basename )
  st = os.lstat( file_path )

  mode = st.st_mode
  if not S_ISREG( mode ) or S_ISLNK( mode ):
    return

  with open( file_path, 'rb' ) as f:
    md5sum = md5_for_file( file_path )

  return field_separator.join( [ thishost, md5sum, str( st.st_dev ),
    str( st.st_ino ), str( st.st_nlink ), str( st.st_size ), file_path ] )

if __name__ == "__main__":
  parser = argparse.ArgumentParser(description='scan files in a tree and'
    ' print a line of information about each regular file')
  parser.add_argument('--start-directory', '-d', default='.',
    help='''Specify the root of the filesystem tree to be processed.  The
    default is '.' ''')
  parser.add_argument('--field-separator', '-s', default=ASCII_NUL,
    help='Specify the string to use as a field separator in output.  The'
    ' default is the ascii nul character.')
  args = parser.parse_args()

  start_directory = args.start_directory.rstrip('/')
  field_separator = args.field_separator

  thishost = node()
  if thishost == '':
    thishost = '[UNKNOWN]'

  for directory_path, directory_names, file_names in os.walk( start_directory ):
    for file_name in file_names:
      print file_info( directory_path, file_name, field_separator )



trouble building data structure

2014-09-28 Thread David Alban
greetings,

i'm writing a program to scan a data file.  from each line of the data file
i'd like to add something like below to a dictionary.  my perl background
makes me want python to autovivify, but when i do:

  file_data = {}

  [... as i loop through lines in the file ...]

  file_data[ md5sum ][ inode ] = { 'path' : path, 'size' : size, }

i get:

Traceback (most recent call last):
  File "foo.py", line 45, in 
file_data[ md5sum ][ inode ] = { 'path' : path, 'size' : size, }
KeyError: '91b152ce64af8af91dfe275575a20489'

what is the pythonic way to build a "file_data" structure like the one
above?

on http://en.wikipedia.org/wiki/Autovivification there is a section on how
to do autovivification in python, but i want to learn how a python
programmer would normally build a data structure like this.
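for reference, a sketch of two common idioms i've run across, a collections.defaultdict and plain-dict setdefault (the key and values below are just made up to match the example):

```python
from collections import defaultdict

# a missing md5sum key silently gets a fresh inner dict
file_data = defaultdict(dict)
file_data['91b152ce64af8af91dfe275575a20489']['12345'] = {
    'path': '/tmp/a', 'size': 104857600}

# plain-dict alternative: setdefault creates the inner dict on first use
plain = {}
plain.setdefault('91b152ce64af8af91dfe275575a20489', {})['12345'] = {
    'path': '/tmp/a', 'size': 104857600}
```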

here is the code so far:

#!/usr/bin/python

import argparse
import os

ASCII_NUL = chr(0)

HOSTNAME = 0
MD5SUM   = 1
FSDEV    = 2
INODE    = 3
NLINKS   = 4
SIZE     = 5
PATH     = 6

file_data = {}

if __name__ == "__main__":
  parser = argparse.ArgumentParser(description='scan files in a tree and'
    ' print a line of information about each regular file')
  parser.add_argument('--file', '-f', required=True,
    help='File from which to read data')
  parser.add_argument('--field-separator', '-s', default=ASCII_NUL,
    help='Specify the string to use as a field separator in output.  The'
    ' default is the ascii nul character.')
  args = parser.parse_args()

  file = args.file
  field_separator = args.field_separator

  with open( file, 'rb' ) as f:
    for line in f:
      line = line.rstrip('\n')
      if line == 'None': continue
      fields = line.split( ASCII_NUL )

      hostname = fields[ HOSTNAME ]
      md5sum   = fields[ MD5SUM ]
      fsdev    = fields[ FSDEV ]
      inode    = fields[ INODE ]
      nlinks   = int( fields[ NLINKS ] )
      size     = int( fields[ SIZE ] )
      path     = fields[ PATH ]

      if size < ( 100 * 1024 * 1024 ): continue

      ### print "'%s' '%s' '%s' '%s' '%s' '%s' '%s'" % ( hostname,
      ###   md5sum, fsdev, inode, nlinks, size, path, )

      file_data[ md5sum ][ inode ] = { 'path' : path, 'size' : size, }

thanks,
david