fdups: calling for beta testers

2005-02-25 Thread Patrick Useldinger
Hi all,
I am looking for beta-testers for fdups.
fdups is a program to detect duplicate files on locally mounted 
filesystems. Files are considered equal if their content is identical, 
regardless of their filename. Also, fdups ignores symbolic links and is 
able to detect and ignore hardlinks, where available.

In contrast to similar programs, fdups does not rely on md5 sums or 
other hash functions to detect potentially identical files. Instead, it 
does a direct blockwise comparison and stops reading as soon as
possible, thus keeping file reads to a minimum.
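
To give an idea of the approach, here is a simplified sketch (illustration
only, not fdups' actual code): files of equal size are compared block by
block, and a file drops out of its group as soon as one of its blocks differs.

def compare_group(filenames, blocksize=8192):
    """Split a list of same-sized files into sub-groups of identical files."""
    handles = [(name, open(name, 'rb')) for name in filenames]
    groups = [handles]
    result = []
    while groups:
        group = groups.pop()
        blocks = [(name, f, f.read(blocksize)) for name, f in group]
        if not blocks[0][2]:                 # end of file: survivors are equal
            result.append([name for name, f, block in blocks])
            for name, f, block in blocks:
                f.close()
            continue
        # regroup by block content; singletons are unique and can be dropped
        by_block = {}
        for name, f, block in blocks:
            by_block.setdefault(block, []).append((name, f))
        for subgroup in by_block.values():
            if len(subgroup) > 1:
                groups.append(subgroup)
            else:
                subgroup[0][1].close()       # unique file, stop reading it
    return result

Called as compare_group(['a', 'b', 'c']) on three same-sized files, it returns
the clusters of identical files, e.g. [['a', 'c']].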

fdups has been developed on Linux but should run on all platforms that 
support Python.

fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where 
you'll also find a link to download the tar.

I am primarily interested in feedback on whether it produces correct
results. But as I haven't been programming in Python for a year or so,
I'd also be interested in comments on the code if you happen to look at it
in detail.

Your help is much appreciated.
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: fdups: calling for beta testers

2005-02-26 Thread Patrick Useldinger
John Machin wrote:
(1) It's actually .bz2, not .bz (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on Windows command line doesn't produce a useful result (4)
Haven't you heard of distutils?
(1) Typo, thanks for pointing it out
(2)(3) In the Linux world, it is really popular. I suppose you are a 
Windows user, and I haven't given that much thought. The point was not 
to save space, just to use the "standard" format. What would it be for 
Windows - zip?
(4) Never used them, but that's a very valid point. I will look into it.

(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'
(6) Not much I can do about this. In the beginning, all files of equal 
size are potentially identical. I first need to read a chunk of each, 
and if I want to avoid opening & closing files all the time, I need them 
open together.
What would you suggest?

Once upon a time, max 20 open files was considered as generous as 640KB
of memory. Looks like Bill thinks 512 (open files, that is) is about
right these days.
Bill also thinks it is normal that half of Service Pack 2 lingers twice
on a hard disk. Not sure whether he's my hero ;-)

(7)
Why sort? What's wrong with just two lines:
! for size, file_list in self.compfiles.iteritems():
! self.comparefiles(size, file_list)
(7) I wanted the output to be sorted by file size instead of being
random. It's psychological, but if you're chasing dups, you'd want to
start with the largest ones first. If you have more than a screenful of
info, it's the last lines which are the most interesting. And it will
produce the same info in the same order if you run it twice on the same
folders.
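
Something along these lines, then (a sketch with made-up sizes; compfiles
maps a size to the list of files of that size, as in the snippet you quoted):

compfiles = {512: ['e', 'f'], 1024: ['a', 'b'], 40960: ['c', 'd']}
for size in sorted(compfiles):
    print size, compfiles[size]    # smallest sizes first, largest last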

(8) global
MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES
That doesn't sit very well with the 'everything must be in a class'
religion seemingly espoused by the following:
(8) Agreed. I'll think about that.
(9) Any good reason why the "executables" don't have ".py" extensions
on their names?
(9) Because I am lazy and Linux doesn't care. I suppose Windows does?
All in all, a very poor "out-of-the-box" experience. Bear in mind that
very few Windows users would have even heard of bzip2, let alone have a
bzip2.exe on their machine. They wouldn't even be able to *open* the
box.
As I said, I did not give Windows users much thought. I will improve this.
And what is "chown" -- any relation of Perl's "chomp"?
chown is a Unix command to change the owner or the group of a file. It 
has to do with controlling access to the file. It is not relevant on 
Windows. No relation to Perl's chomp.

Thank you very much for your feedback. Did you actually run it on your 
Windows box?

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: fdups: calling for beta testers

2005-02-26 Thread Patrick Useldinger
John Machin wrote:
Yes. Moreover, "WinZip", the most popular archive-handler, doesn't grok
bzip2.
I've added a zip file. It was made in Linux with the zip command-line
tool; the man pages say it's compatible with the Windows zip tools. I
have also added .py extensions to the 2 programs. I did not, however, use
distutils, because I'm not sure it is really suited to module-less scripts.

You should consider a fall-back method to be used in this case and in
the case of too many files for your 1Mb (default) buffer pool. BTW 1Mb
seems tiny; desktop PCs come with 512MB standard these days, and Bill
does leave a bit more than 1MB available for applications.
I've added it to the TODO list.
The question was rhetorical. Your irony detector must be on the fritz.
:-)
I always find it hard to detect irony in mail from people I do not know...
Did you actually run it on your
Windows box?

Yes, with trepidation, after carefully reading the source. It detected
some highly plausible duplicates, which I haven't verified yet.
I would have been reluctant too. But I've tested it intensively, and 
there's strictly no statement that actually alters the file system.

Thanks for your feedback!
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: fdups: calling for beta testers

2005-02-26 Thread Patrick Useldinger
Serge Orlov wrote:
Or use exemaker, which IMHO is the best way to handle this
problem.
Looks good, but I do not use Windows.
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: fdups: calling for beta testers

2005-02-27 Thread Patrick Useldinger
John Machin wrote:
I've tested it intensively
"Famous Last Words" :-)
;-)
(1) Manic s/w producing lots of files all the same size: the Borland
C[++] compiler produces a debug symbol file (.tds) that's always
384KB; I have 144 of these on my HD, rarely more than 1 in the same
directory.
Not sure what you want me to do about it. I've decreased the minimum
block size once more, to accommodate more files of the same length
without increasing the total amount of memory used.

(2) There appears to be a flaw in your logic such that it will find
duplicates only if they are in the *SAME* directory and only when
there are no other directories with two or more files of the same
size. 
Ooops...
A really stupid mistake on my side. Corrected.
(3) Your fdups-check gadget doesn't work on Windows; the commands
module works only on Unix but is supplied with Python on all
platforms. The results might just confuse a newbie:
Why not use the Python filecmp module?
Done. It's also faster AND it works better. Thanks for the suggestion.
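
For reference, this is the kind of check filecmp makes trivial (a sketch with
made-up file names, not fdups-check's actual code):

import filecmp
# shallow=False forces a byte-for-byte comparison instead of an os.stat() one
print filecmp.cmp('copy1.dat', 'copy2.dat', shallow=False)
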
Please fetch the new version from http://www.homepages.lu/pu/fdups.html.
-pu
--
http://mail.python.org/mailman/listinfo/python-list


os.stat('')[stat.ST_INO] on Windows

2005-02-27 Thread Patrick Useldinger
What does the above yield on Windows? Are inodes supported on Windows 
NTFS, FAT, FAT32?
--
http://mail.python.org/mailman/listinfo/python-list


Re: Wishful thinking : unix to windows script?

2005-03-04 Thread Patrick Useldinger
John Leslie wrote:
Or does anyone have a python script which takes a standard unix
command as an argument and runs the pyton/windows equivalent on
windows?
There's not always an equivalent command.
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: Wishful thinking : unix to windows script?

2005-03-04 Thread Patrick Useldinger
Grant Edwards wrote:
If you install cygwin there almost always is.
If you install cygwin there's no need for what the OP describes.
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: Indexing strings

2005-03-04 Thread Patrick Useldinger
Fred wrote:
I am searching for a possibility to find out what the index of a
certain letter in a string is.
My example:
for x in text:
    if x == ' ':
        list = text[:  # There I need the index of the space the
                       # program found during the loop...
Is there any possibility to find the index of the space???
Thanks for any help!
Fred
Use the index method, e.g.: text.index(' ').
What exactly do you want to do?
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: Indexing strings

2005-03-05 Thread Patrick Useldinger
Fred wrote:
That was exactly what I was searching for. I needed a program that
chopped up a string into its words and then saved them into a list. I
think I got this done...
There's a function for that: text.split().
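
A quick illustration (the sample string is made up):

text = 'the quick brown fox'
words = text.split()    # splits on any run of whitespace
print words             # ['the', 'quick', 'brown', 'fox']
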
You should really have a look at the Python docs. Also, 
http://diveintopython.org/ and http://www.gnosis.cx/TPiP/ are great 
tutorials.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: enum question

2005-03-05 Thread Patrick Useldinger
M.N.A.Smadi wrote:
does python support a C-like enum statement where one can define a
variable with a prespecified range of values?

thanks
m.smadi
>>> BLUE, RED, GREEN = 1,5,8
>>> BLUE
1
>>> RED
5
>>> GREEN
8
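
If you want the values grouped under one name, a class used as a plain
namespace is a common workaround (a sketch; Python itself has no enum
statement):

class Color:
    BLUE = 1
    RED = 5
    GREEN = 8

print Color.RED    # 5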
--
http://mail.python.org/mailman/listinfo/python-list


Re: function with a state

2005-03-06 Thread Patrick Useldinger
Xah Lee wrote:
globe=0;
def myFun():
  globe=globe+1
  return globe
The short answer is to use the global statement:
globe=0
def myFun():
  global globe
  globe=globe+1
  return globe
more elegant is:
globe=0
globe=myfun(globe)
def myFun(var):
  return var+1
and still more elegant is using classes and class attributes instead of 
global variables.
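
For instance, a minimal sketch of the class-based version:

class Counter:
    # the state lives on the instance instead of in a global variable
    def __init__(self):
        self.globe = 0
    def myFun(self):
        self.globe = self.globe + 1
        return self.globe

counter = Counter()
print counter.myFun()    # 1
print counter.myFun()    # 2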

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: function with a state

2005-03-06 Thread Patrick Useldinger
Kent Johnson wrote:
globe=0
globe=myfun(globe)
def myFun(var):
  return var+1

This mystifies me. What is myfun()? What is var intended to be?
myfun is an error ;-) It should be myFun, of course.
var is the parameter of function myFun. If you call myFun with the variable
globe, all references to var will be replaced by globe inside the function.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: Python docs [was: function with a state]

2005-03-10 Thread Patrick Useldinger
You don't understand the "global" statement in Python, but you do 
understand Software industry in general? Smart...
--
http://mail.python.org/mailman/listinfo/python-list


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread Patrick Useldinger
I wrote something similar, have a look at 
http://www.homepages.lu/pu/fdups.html.
--
http://mail.python.org/mailman/listinfo/python-list


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread Patrick Useldinger
Christos TZOTZIOY Georgiou wrote:
On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
st_inum), because you know that they are the same file.
I then have a bug here - I consider all files with the same inode equal,
but according to what you say I need to consider the tuple
(st_dev, st_ino). I'll have to fix that for 0.13.
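
The fix boils down to keying on the pair rather than on the inode alone,
roughly like this (a sketch, not the actual fdups code):

import os

def is_same_file(path1, path2):
    """True if both paths point to the same physical file (same device and inode)."""
    st1, st2 = os.lstat(path1), os.lstat(path2)
    return (st1.st_dev, st1.st_ino) == (st2.st_dev, st2.st_ino)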

Thanks ;-)
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: [perl-python] a program to delete duplicate files

2005-03-10 Thread Patrick Useldinger
Christos TZOTZIOY Georgiou wrote:
That's fast and good.
Nice to hear.
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).
I'll look into that.
Have you found any way to test if two files on NTFS are hard linked without
opening them first to get a file handle?
No. And even then, I wouldn't know how to find out.
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Patrick Useldinger
Christos TZOTZIOY Georgiou wrote:
The relevant parts from this last page:
st_dev <-> dwVolumeSerialNumber
st_ino <-> (nFileIndexHigh, nFileIndexLow)
I see. But if I am not mistaken, that would mean that I
(1) would have to detect NTFS volumes, and
(2) would have to use non-standard libraries to find this information (like
the Python Win extensions).

I am not seriously motivated to do so, but if somebody is interested in
helping, I am open to it.
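
From what I understand, it would look roughly like this (untested sketch,
assuming the pywin32 win32file module; the exact layout of the tuple returned
by GetFileInformationByHandle should be checked against the pywin32 docs):

import win32file

def ntfs_file_id(path):
    """Return a (volume serial, file index) pair, usable like (st_dev, st_ino)."""
    handle = win32file.CreateFile(
        path,
        win32file.GENERIC_READ,
        win32file.FILE_SHARE_READ,
        None,
        win32file.OPEN_EXISTING,
        0,
        None)
    try:
        info = win32file.GetFileInformationByHandle(handle)
        # info[4] is dwVolumeSerialNumber, info[8] and info[9] are assumed to
        # be nFileIndexHigh and nFileIndexLow (see the pywin32 documentation)
        return info[4], (info[8], info[9])
    finally:
        handle.Close()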

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Patrick Useldinger
David Eppstein wrote:
You need do no comparisons between files.  Just use a sufficiently 
strong hash algorithm (SHA-256 maybe?) and compare the hashes.
That's not very efficient. IMO, it only makes sense in network-based 
operations such as rsync.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Patrick Useldinger
Christos TZOTZIOY Georgiou wrote:
A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).
Changed.
--
http://mail.python.org/mailman/listinfo/python-list


Re: [perl-python] a program to delete duplicate files

2005-03-11 Thread Patrick Useldinger
David Eppstein wrote:
Well, but the spec didn't say efficiency was the primary criterion, it 
said minimizing the number of comparisons was.
That's exactly what my program does.
More seriously, the best I can think of that doesn't use a strong slow 
hash would be to group files by (file size, cheap hash) then compare 
each file in a group with a representative of each distinct file found 
among earlier files in the same group -- that leads to an average of 
about three reads per duplicated file copy: one to hash it, and two for 
the comparison between it and its representative (almost all of the 
comparisons will turn out equal but you still need to check unless you 
My point is: forget hashes. If you work with hashes, you do have to
read each file completely, cheap hash or not. My program normally reads
*at most* 100% of the files to analyse, but usually much less. Also, I
do plain comparisons, which are much cheaper than hash calculations.

I'm assuming of course that there are too many files and/or they're too 
large just to keep them all in core.
I assume that there are enough file handles to keep one open per file of
the same size. This led to trouble on Windows installations, but I
guess that's a parameter to change. On Linux, I never had the problem.

Regarding buffer size, I use a maximum which is then split up between
all open files.
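
In other words, roughly this (a sketch with illustrative values; MAX_ALLBUFFERS
and MAX_ONEBUFFER are tunable parameters in fdups, the numbers here are made up):

MAX_ALLBUFFERS = 1024 * 1024    # total buffer budget for all open files
MAX_ONEBUFFER = 64 * 1024       # upper bound for a single file's buffer

def blocksize(nfiles):
    """Block size to read per file when nfiles are compared in parallel."""
    return max(1, min(MAX_ONEBUFFER, MAX_ALLBUFFERS // nfiles))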

Anyone have any data on whether reading files and SHA-256'ing them (or 
whatever other cryptographic hash you think is strong enough) is 
I/O-bound or CPU-bound?  That is, is three reads with very little CPU 
overhead really cheaper than one read with a strong hash?
It also depends on the OS. I found that my program runs much slower on 
Windows, probably due to the way Linux anticipates reads and tries to 
reduce head movement.

I guess it also depends on the number of files you expect to have 
duplicates of.  If most of the files exist in only one copy, it's clear 
that the cheap hash will find them more cheaply than the expensive hash.  
In that case you could combine the (file size, cheap hash) filtering 
with the expensive hash and get only two reads per copy rather than 
three.
Sorry, but I still cannot see a point in using hashes. Maybe you'll have
a look at my program and tell me where a hash could be useful?

It's available at http://www.homepages.lu/pu/fdups.html.
Regards,
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
John Machin wrote:
Just look at the efficiency of processing N files of the same size S,
where they differ after d bytes: [If they don't differ, d = S]
PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
which is important for small N and large d].
Hashing method: O(NS) reading time, O(NS) hash calc time

Shouldn't you add the additional comparison time that has to be spent
after the hash calculation? Hashes do not give a 100% guarantee. If there's a
large number of identical hashes, you'd still need to read all of these
files to make sure.

Just to explain why I appear to be a lawyer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I came up with two possible reasons myself:

1) it's easier to program: you don't compare several files in parallel,
but process them one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. In order to
find this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and you want 
to reduce bandwidth. A hash lets you do that at the cost of local CPU 
and disk usage, which may be OK. That was not my case.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: Adapting code to multiple platforms

2005-03-12 Thread Patrick Useldinger
Jeffrey Barish wrote:
I have a small program that I would like to run on multiple platforms
(at least linux and windows).  My program calls helper programs that
are different depending on the platform.  I think I figured out a way
to structure my program, but I'm wondering whether my solution is good
Python programming practice.
I use something like this in the setup code:
if os.name == 'posix':
  statfunction = os.lstat
else:
  statfunction = os.stat
and then further in the code:
x = statfunction(filename)
So the idea is to have your "own" function names and assign the
OS-specific functions once and for all at the beginning. Afterwards, your
code only uses your own function names and, as long as they behave in
the same way, there's no more if-else stuff.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
Scott David Daniels wrote:
   comparisons.  Using hashes, three file reads and three comparisons
   of hash values.  Without hashes, six file reads; you must read both
   files to do a file comparison, so three comparisons is six files.
That's provided you always compare 2 files at a time. I compare n files
at a time, n being the number of files of the same size. That's quicker
than hashes because I have a fair chance of finding a difference before
the end of the files. Otherwise, it's like hashes without the computation and
without having to have a second pass to *really* compare them.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
François Pinard wrote:
Identical hashes for different files?  The probability of this happening
should be extremely small, or else, your hash function is not a good one.
We're talking about md5, sha1 or similar. They are all known not to be
100% perfect. I agree it's a rare case, but still, why settle for
something "about right" when you can have "right"?

I once was over-cautious about relying on hashes only, without actually
comparing files.  A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray.  So,
my cautious attitude was by far, for all practical means, a waste.
It was not my only argument for not using hashes. My algorithm also does
fewer reads, for example.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: Can't seem to insert rows into a MySQL table

2005-03-12 Thread Patrick Useldinger
grumfish wrote:
connection = MySQLdb.connect(host="localhost", user="root", passwd="pw", 
db="japanese")
cursor = connection.cursor()
cursor.execute("INSERT INTO edict (kanji, kana, meaning) VALUES (%s, %s, 
%s)", ("a", "b", "c") )
connection.close()
Just a guess "in the dark" (I don't use MySQL): is "commit" implicit, or 
do you have to add it yourself?
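
If it is not implicit, something like this should do it (untested sketch; with
a transactional table the row is otherwise rolled back when the connection
closes):

import MySQLdb

connection = MySQLdb.connect(host="localhost", user="root", passwd="pw",
                             db="japanese")
cursor = connection.cursor()
cursor.execute("INSERT INTO edict (kanji, kana, meaning) VALUES (%s, %s, %s)",
               ("a", "b", "c"))
connection.commit()    # make the INSERT permanent
connection.close()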

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
John Machin wrote:
Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there are *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".
Right, that is what I meant.
2. As others have explained, with a decent hash function, the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Duplicate files are usually (in the same directory with
different names or in different-but-related directories with the same
names) and/or (have a plausible explanation for how they were
duplicated) -- the one-in-zillion-chance false-positive should stand
out as implausible.
Still, if you can get it 100% right automatically, why would you bother
checking manually? Why fall back on arguments like "impossible",
"implausible", "can't be" if you can have a simple and correct answer -
yes or no?

Anyway, fdups does not do anything else than report duplicates. 
Deleting, hardlinking or anything else might be an option depending on 
the context in which you use fdups, but then we'd have to discuss the 
context. I never assumed any context, in order to keep it as universal 
as possible.

Different subject: maximum number of files that can be open at once. I
raised this issue with you because I had painful memories of having to
work around max=20 years ago on MS-DOS and was aware that this magic
number was copied blindly from early Unix. I did tell you that
empirically I could get 509 successful opens on Win 2000 [add 3 for
stdin/out/err to get a plausible number] -- this seems high enough to
me compared to the likely number of files with the same size -- but you
might like to consider a fall-back detection method instead of just
quitting immediately if you ran out of handles.
For the time being, the additional files will be ignored and a warning
is issued. fdups does not quit - why are you saying this?

A fallback solution would be to open the file before every _block_ read, 
and close it afterwards. In my mind, it would be a command-line option, 
because it's difficult to determine the number of available file handles 
in a multitasking environment.
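
The fallback would look roughly like this (a sketch with a hypothetical helper
name, not fdups' actual code): only one handle is in use at any time, at the
cost of one open/close per block read.

def read_block(filename, offset, blocksize):
    """Read one block from filename, opening and closing the file each time."""
    f = open(filename, 'rb')
    try:
        f.seek(offset)
        return f.read(blocksize)
    finally:
        f.close()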

Not difficult to implement, but I first wanted to refactor the code so
that it's a proper class that can be used in other Python programs, as
you also asked. That is what I have sent you tonight. It's not that I
don't care about the file handle problem, it's just that I make changes
according to my own priorities.

You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.
Re (a): what evidence do you have?
I've had the case myself on my girlfriend's XP box. It was certainly 
less than 500 files of the same length.

Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?
Sorry, I do not understand what you mean by this.
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
John Machin wrote:
Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.
Of course, but it's good to know that the computer is right, isn't it?
That leaves the human to make decisions instead of double-checking.

I beg your pardon, I was wrong. Bad memory. It's the case of running
out of the minuscule buffer pool that you allocate by default where it
panics and pulls the sys.exit(1) rip-cord.
The buffer pool is a parameter, and the default values allow for 4096 files
of the same size. It's more likely to run out of file handles than out
of buffer space, don't you think?

The pythonic way is to press ahead optimistically and recover if you
get bad news.
You're right, that's what I thought about afterwards. The current idea is to
design a second class that opens/closes/reads the files and handles the
situation independently of the main class.

I didn't "ask"; I suggested. I would never suggest a
class-for-classes-sake. You had already a singleton class; why
another". What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output which contains a
mixture of warning & info messages and data].
That is what I have submitted to you. Are you sure that *I* am the 
lawyer here?

Re (a): what evidence do you have?
See ;-)
Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?
Nothing I started manually, but the usual bunch of local firewall, virus 
scanner (not doing a complete machine check at that time).

Test:
!for k in range(1000):
!open('foo' + str(k), 'w')
I'll try that.
Announce:
"I can open A files at once on box B running os C. The most files of
the same length that I have seen is D. The ratio A/D is small enough
not to worry."
I wouldn't count on that in a multi-tasking environment, as I said. The
class I described earlier seems a cleaner approach.

Regards,
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: a program to delete duplicate files

2005-03-13 Thread Patrick Useldinger
John Machin wrote:
Test:
!for k in range(1000):
!open('foo' + str(k), 'w')
I ran that and watched it open 2 million files and still going strong ...
until I figured out that the files are closed by Python immediately because
no reference to them is kept ;-)

Here's my code:

#!/usr/bin/env python
import os
print 'max number of file handles today is',
n = 0
h = []
try:
    while True:
        filename = 'mfh' + str(n)
        h.append((file(filename, 'w'), filename))
        n = n + 1
except:
    print n
for handle, filename in h:
    handle.close()
    os.remove(filename)

On Slackware 10.1, this yields 1021.
On WinXPSP2, this yields 509.
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: a program to delete duplicate files

2005-03-14 Thread Patrick Useldinger
David Eppstein wrote:
When I've been talking about hashes, I've been assuming very strong 
cryptographic hashes, good enough that you can trust equal results to 
really be equal without having to verify by a comparison.
I am not an expert in this field. All I know is that MD5 and SHA1 can 
create collisions. Are there stronger algorithms that do not? And, more 
importantly, has it been *proved* that they do not?

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: a program to delete duplicate files

2005-03-14 Thread Patrick Useldinger
David Eppstein wrote:
The hard part is verifying that the files that look like duplicates 
really are duplicates.  To do so, for a group of m files that appear to 
be the same, requires 2(m-1) reads through the whole files if you use a 
comparison based method, or m reads if you use a strong hashing method.  
You can't hope to cut the reads off early when using comparisons, 
because the files won't be different.
If you read them in parallel, it's _at most_ m (m is the worst case
here), not 2(m-1). In my tests, it has always been significantly less than m.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: How to create an object instance from a string??

2005-03-19 Thread Patrick Useldinger
Tian wrote:
I have a string:
classname = "Dog"
It's easier without strings:
>>> classname = Dog
>>> classname().bark()
Arf!!!
>>>
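
If you really do start from a string, a minimal sketch (assuming the class is
defined in the current module) is to look the class object up by name first:

class Dog(object):
    def bark(self):
        print 'Arf!!!'

classname = "Dog"
cls = globals()[classname]    # map the string to the class object
cls().bark()                  # then instantiate and use it as usual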
--
http://mail.python.org/mailman/listinfo/python-list


[ann] fdups 0.15

2005-03-20 Thread Patrick Useldinger
I am happy to announce version 0.15 of fdups.
Changes in this version:

- ability to limit the number of file handles used
Download
========
To download, go to: http://www.homepages.lu/pu/fdups.html
What is fdups?
==============
fdups is a Python program to detect duplicate files on locally mounted 
filesystems. Files are considered equal if their content is identical, 
regardless of their filename. Also, fdups is able to detect and ignore 
symbolic links and hard links, where available.

In contrast to similar programs, fdups does not rely on md5 sums or
other hash functions to detect potentially identical files. Instead, it
does a direct blockwise comparison and stops reading as soon as
possible, thus keeping file reads to a minimum.

fdups results can either be processed by a unix-type filter, or directly
by another Python program.

Warning
=======
fdups is BETA software. It is known not to produce false positives if 
the filesystem is static.
I am looking for additional beta-testers, as well as for somebody who 
would be able to implement hard-link detection on NTFS file systems.

All feedback is appreciated.
--
http://mail.python.org/mailman/listinfo/python-list


Re: how to add a string to the beginning of a large binary file?

2005-03-27 Thread Patrick Useldinger
could ildg wrote:
I want to add a string such as "I love you" to the beginning of a binary file,
How to? and how to delete the string if I want to get the original file?
You shouldn't use Python to write a virus :-)
-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: numbering variables

2005-03-28 Thread Patrick Useldinger
remi wrote:
Hello,
I have got a list like mylist = ['item 1', 'item 2', 'item n'] and
I would like to store the string 'item 1' in a variable called s_1,
'item 2' in s_2, ..., 'item i' in s_i, ... The length of mylist is finite ;-)
Any ideas?
Thanks a lot.
Rémi.
Use a dictionary: variable['s_1'] = mylist.pop(), variable['s_2'] =
mylist.pop(), ...
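
For completeness, a sketch that builds the whole mapping in one go and keeps
the original order (pop() with no argument takes items from the end):

mylist = ['item 1', 'item 2', 'item n']
variable = {}
for i, item in enumerate(mylist):
    variable['s_%d' % (i + 1)] = item
# variable['s_1'] == 'item 1', variable['s_2'] == 'item 2', ...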
--
http://mail.python.org/mailman/listinfo/python-list


Re: Which is easier? Translating from C++ or from Java...

2005-03-28 Thread Patrick Useldinger
cjl wrote:
Implementations of what I'm trying to accomplish are available (open
source) in C++ and in Java.
Which would be easier for me to use as a reference?
I'm not looking for automated tools, just trying to gather opinions on
which language is easier to understand / rewrite as python.
Depends on what language you know best. But Java is certainly easier to 
read than C++.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: Which is easier? Translating from C++ or from Java...

2005-03-29 Thread Patrick Useldinger
[EMAIL PROTECTED] wrote:
Patrick Useldinger wrote:
Depends on what language you know best. But Java is certainly easier
to
read than C++.

There's certainly some irony in those last two sentences. However, I
agree with the former. It depends on which you know better, the style
of those who developed each and so forth. Personally, I'd prefer C++.
Not really.
If you know none of the languages perfectly, you are less likely to miss 
something in Java than in C++ (i.e. no &, * and stuff in Java).

However, if you are much more familiar with one of the two, you're less 
likely to miss things there.

-pu
--
http://mail.python.org/mailman/listinfo/python-list


Re: Which is easier? Translating from C++ or from Java...

2005-03-29 Thread Patrick Useldinger
cjl wrote:
I've found a third open source implementation in pascal (delphi), and
was wondering how well that would translate to python?
Being old enough to have programmed in UCSD Pascal on an Apple ][ (with 
a language card, of course), I'd say: go for Pascal!

;-)
--
http://mail.python.org/mailman/listinfo/python-list


filtering DNS proxy

2006-01-14 Thread Patrick Useldinger
Hi all,
I am looking to write a filtering DNS proxy which should
- receive DNS queries
- validate them against an ACL which looks as follows:
  { 'ip1': ['name1', 'name2', ...],
    'ip2': ['name1', 'name3'],
    ...
  }
- if the request is valid (i.e. if the sending IP address is allowed to
ask for the name resolution of 'name'), pass it on to the relevant DNS server
- if not, send the requestor some kind of error message.
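
For illustration, a minimal sketch of the validation step as I picture it
(the IP addresses and names are made up):

ACL = {
    '192.168.0.10': ['www.example.com', 'mail.example.com'],
    '192.168.0.11': ['www.example.com'],
}

def is_allowed(client_ip, queried_name, acl=ACL):
    """Return True if client_ip may ask for the resolution of queried_name."""
    return queried_name in acl.get(client_ip, [])
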
The expected workload is not enormous. The proxy must run on Linux.
What would be the best way to approach this problem:
- implementing it in stock Python with asyncore
- implementing it in stock Python with threads
- using Twisted
- anything else?
My first impression is that I would be most comfortable with stock
Python and threads, because I am not very familiar with event-driven
programming and combining the server and client parts might be more
complicated to do. Twisted seems daunting to me because of its
documentation.
Any suggestion would be appreciated.
Regards,
-pu

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: [OT] why cd ripping on Linux is so slow

2006-08-12 Thread Patrick Useldinger
alf wrote:
> Hi,
> 
> I try to rip some music CDs and later convert them into mp3 for my mp3 
> player, but can not get around one problem: ripping from Linux is 
> extremely slow, like 0.5x of CD speed.
> 
> In contrast, on M$ Windows it takes like a few minutes to have a CD ripped 
> and compressed into wmf, yet I do not know how to get pure wavs...
> 
> Hope I find someone here who can help me with that.

This is really OT, and you might be better off looking in Linux forums 
like http://www.linuxquestions.org/. That said, it's likely that your 
DMA is not switched on. Ask your question in the aforementioned forums, 
and make sure to state which distribution you are using.

-pu
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: setuid root

2006-08-24 Thread Patrick Useldinger
Tiago Simões Batista wrote:
> The sysadmin already set the setuid bit on the script, but it
> still fails when it tries to write to any file that only root has
> write access to.

The setuid bit has no effect on interpreted scripts under Linux, so that
approach cannot work; use sudo instead.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python 2.5 incompatible with Fedora Core 6 - packaging problems again

2007-03-04 Thread Patrick Useldinger
http://www.serpentine.com/blog/2006/12/22/how-to-build-safe-clean-python-25-rpms-for-fedora-core-6/
-- 
http://mail.python.org/mailman/listinfo/python-list