Re: [Tutor] Unexpected result of division by negative integer, was: (no subject)

2012-06-02 Thread Peter Otten
Jason Barrett wrote:

[Jason, please use "reply all" instead of sending a private mail. Your 
answer will then appear on the mailing list and other readers get a chance 
to offer an alternative explanation.]

>> In python, why does 17/-10= -2? Shouldn't it be -1?
> 
>> http://docs.python.org/faq/programming.html#why-does-22-10-return-3

> I just started programming so all this is new to me. I don't really
> understand the explanation. I'm a novice at programming.



___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


[Tutor] Octal confusion, please explain why. Python 3.2

2012-06-02 Thread Jordan
Hello, first off I am using Python 3.2 on Linux Mint 12 64-bit.
I am confused as to why I cannot successfully compare a variable that
was created as an octal to a variable that is converted to an octal in an
if statement, yet print shows that they are the same octal value. I
think it is because they are passed around within Python as an
integer value; is this correct, and if so, why?
Code example 1: This code works because both are explicitly converted to
octals.

import os

PATH = '/home/wolfrage/Documents'
DIR_PERMISSIONS = 0o40740

for dirname, dirnames, filenames in os.walk(PATH):
    for subdirname in dirnames:
        # List each subdirectory in this directory.
        full_path = os.path.join(dirname, subdirname)
        print(full_path)
        stat_info = os.stat(full_path)
        uid = stat_info.st_uid
        gid = stat_info.st_gid
        #TODO: Group and User ID Check
        #TODO: Directories need the right Masks, and then read, write, and
        # execute permissions for User(Owner), Group and Others
        #TODO: UMask is separate from permissions, so handle it separately.
        if oct(stat_info.st_mode) != oct(DIR_PERMISSIONS):
            #TODO: Fix User Read Permissions
            print(str(oct(stat_info.st_mode)) + ' Vs ' +
                  str(oct(DIR_PERMISSIONS)))
            os.chmod(full_path, DIR_PERMISSIONS)
            print(subdirname + ' has bad user permissions.')


Code example 2: This code does not work because we do not use an
explicit conversion.

import os

PATH = '/home/wolfrage/Documents'
DIR_PERMISSIONS = 0o40740

for dirname, dirnames, filenames in os.walk(PATH):
    for subdirname in dirnames:
        # List each subdirectory in this directory.
        full_path = os.path.join(dirname, subdirname)
        print(full_path)
        stat_info = os.stat(full_path)
        uid = stat_info.st_uid
        gid = stat_info.st_gid
        #TODO: Group and User ID Check
        #TODO: Directories need the right Masks, and then read, write, and
        # execute permissions for User(Owner), Group and Others
        #TODO: UMask is separate from permissions, so handle it separately.
        if oct(stat_info.st_mode) != DIR_PERMISSIONS:
            #TODO: Fix User Read Permissions
            print(str(oct(stat_info.st_mode)) + ' Vs ' + str(DIR_PERMISSIONS))
            # The above print statement shows that DIR_PERMISSIONS is
            # printed as an integer. But why does Python convert an
            # explicitly created octal to an integer?
            os.chmod(full_path, DIR_PERMISSIONS)
            print(subdirname + ' has bad user permissions.')




Re: [Tutor] Joining all strings in stringList into one string

2012-06-02 Thread Jordan


On 05/30/2012 06:21 PM, Akeria Timothy wrote:
> Hello all,
>
> I am working on learning Python(on my own) and ran into an exercise
> that I figured out but I wanted to know if there was a different way
> to write the code? I know he wanted a different answer for the body
> because we haven't gotten to the ' '.join() command yet.
>
> This is what I have:
>
> def joinStrings(stringList):
>     string = []
>     for string in stringList:
# Here you never used the string variable so why have a for statement?
>         print ''.join(stringList)
# Another version might look like this:

def join_strings2(string_list):
    final_string = ''
    for string in string_list:
        final_string += string
    print(final_string)
    return final_string
# Tested in Python 3.2

join_strings2(['1', '2', '3', '4', '5'])
>
>
> def main():
>     print joinStrings(['very', 'hot', 'day'])
>     print joinStrings(['this', 'is', 'it'])
>     print joinStrings(['1', '2', '3', '4', '5'])
>
> main()
>
>
> thanks all
>
>


Re: [Tutor] Unexpected result of division by negative integer, was: (no subject)

2012-06-02 Thread Alan Gauld

On 02/06/12 09:52, Peter Otten wrote:


>> I just started programming so all this is new to me. I don't really
>> understand the explanation. I'm a novice at programming.


OK, Here is an attempt to explain the explanation!

Original question:
Why does -17 / 10  => -2 instead of -1

Python FAQ response:
> It’s primarily driven by the desire that i % j have the same
> sign as j.

The % operator returns the "remainder" in integer division.
It turns out that in programming we can often use the
remainder to perform programming "tricks", especially when
dealing with cyclic data - such as the days of the week.

For example, if you want to find out which day of the
week it will be 53 days from now, it is (today's number + 53) % 7.
For those tricks to work we want the % operator to behave
as described in the FAQ above.
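
A minimal sketch of the days-of-the-week trick. The names DAYS and day_after are made up for illustration, and numbering Monday as 0 is an arbitrary assumption:

```python
# Hypothetical helper; Monday is arbitrarily numbered 0.
DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']

def day_after(today, offset):
    """Return the name of the day `offset` days away from day number `today`."""
    # % 7 wraps the running total back into the range 0..6.
    return DAYS[(today + offset) % 7]

print(day_after(0, 53))   # 53 days after a Monday
print(day_after(0, -1))   # the day before a Monday
```

Because the remainder takes the sign of the divisor (7 here), counting backwards with a negative offset still yields an index between 0 and 6.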


> If you want that, and also want:
>
> i == (i // j) * j + (i % j)
>
> then integer division has to return the floor.

This is just saying that the integer division (//) result times the 
original divisor plus the remainder should equal the starting number.

Thus:

17/10 => 1 remainder 7
So (17 // 10) => 1 and (17 % 10) => 7
So 1 x 10 + 7 => 17.

And to achieve those two aims (remainder with the same sign as the 
divisor, and the sum returning the original value) we must define integer 
division to return the "floor" of the division result.


The floor of a real number is the largest integer that is not greater 
than it (the number itself if it is already an integer). Thus:


floor(5.5) -> 5
floor(-5.5) -> -6

17/10 -> 1.7
floor(1.7) -> 1
-17/10 -> -1.7
floor(-1.7) -> -2

So X // Y == floor(X / Y)
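
The properties above can be checked with a few lines of Python. This is only a sketch, assuming Python 3 (where / is true division); check_division is a made-up name:

```python
import math

def check_division(i, j):
    """Return (i // j, i % j) after checking the properties from the FAQ."""
    q, r = i // j, i % j
    assert i == q * j + r                 # the identity quoted from the FAQ
    assert q == math.floor(i / j)         # integer division returns the floor
    assert r == 0 or (r < 0) == (j < 0)   # remainder has the divisor's sign
    return q, r

# Try every combination of signs.
for i, j in [(17, 10), (-17, 10), (17, -10), (-17, -10)]:
    print(i, j, check_division(i, j))
```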

HTH.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/



Re: [Tutor] Octal confusion, please explain why. Python 3.2

2012-06-02 Thread Steven D'Aprano

Jordan wrote:

> Hello, first off I am using Python 3.2 on Linux Mint 12 64-bit.
> I am confused as to why I can not successfully compare a variable that
> was created as an octal to a variable that is converted to an octal in a
> if statement yet print yields that they are the same octal value.


Because the oct() function returns a string, not a number.

py> oct(63)
'0o77'

In Python, strings are never equal to numbers:

py> 42 == '42'
False


Octal literals are just regular integers that you type differently:

py> 0o77
63

"Octals" aren't a different type or object; the difference between the numbers 
0o77 and 63 is cosmetic only. 0o77 is just the base 8 version of the base 10 
number 63 or the binary number 0b111111.



Because octal notation is only used for input, there's no need to convert 
numbers to octal strings to compare them:


py> 0o77 == 63  # no different to "63 == 63"
True


The same applies for hex and bin syntax too:

py> 0x1ff == 511
True
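
Applied to the original permissions check, this suggests comparing the modes directly as integers. The following is only a sketch: mode_matches is a made-up helper, and the use of the stat module to split the mode into its parts is an addition that was not part of the original code:

```python
import stat

DIR_PERMISSIONS = 0o40740   # an ordinary int, the same value as 16864

def mode_matches(st_mode, expected=DIR_PERMISSIONS):
    """Compare two modes as plain integers; no oct() is needed."""
    return st_mode == expected

print(oct(DIR_PERMISSIONS))                     # a string, for display only
print(oct(DIR_PERMISSIONS) == DIR_PERMISSIONS)  # False: a str never equals an int
print(mode_matches(16864))                      # True: same number, typed in base 10

# The stat module can split a mode into file type and permission bits.
print(stat.S_ISDIR(DIR_PERMISSIONS))            # is the "directory" type bit set?
print(oct(stat.S_IMODE(DIR_PERMISSIONS)))       # just the permission bits
```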



--
Steven


Re: [Tutor] Octal confusion, please explain why. Python 3.2

2012-06-02 Thread Jordan
Thank you for the detailed answer. Now I understand that each number
format is just an integer written in a different base, with a different
cosmetic appearance.

On 06/02/2012 01:51 PM, Steven D'Aprano wrote:
> Jordan wrote:
>> Hello, first off I am using Python 3.2 on Linux Mint 12 64-bit.
>> I am confused as to why I can not successfully compare a variable that
>> was created as an octal to a variable that is converted to an octal in a
>> if statement yet print yields that they are the same octal value.
>
> Because the oct() function returns a string, not a number.
Ahh, I see. That makes sense. Next time I will read the documentation on
the functions that I am using.
> py> oct(63)
> '0o77'
>
> In Python, strings are never equal to numbers:
>
> py> 42 == '42'
> False
>
>
> Octal literals are just regular integers that you type differently:
>
> py> 0o77
> 63
>
> "Octals" aren't a different type or object; the difference between the
> numbers 0o77 and 63 is cosmetic only. 0o77 is just the base 8 version
> of the base 10 number 63 or the binary number 0b111111.
>
>
> Because octal notation is only used for input, there's no need to
> convert numbers to octal strings to compare them:
>
> py> 0o77 == 63  # no different to "63 == 63"
> True
>
>
> The same applies for hex and bin syntax too:
>
> py> 0x1ff == 511
> True
>
>
>


Re: [Tutor] Octal confusion, please explain why. Python 3.2

2012-06-02 Thread Alan Gauld

On 02/06/12 12:29, Jordan wrote:


> think it is because they are transported around within python as a
> integer value, is this correct and if so why?


Steven has already answered your question, but just to emphasise
this point:

Python stores all data as binary values in memory. Octal,
decimal, hex are all representations of that raw data.

Python interprets some binary data as strings, some as integers, some as 
booleans etc. and then tries to display these objects in a suitable 
representation. (We can use the struct module to define the way Python 
interprets binary data input.) For integers the default is base 10, but 
we can exercise some control over the format by using functions such as 
oct(), hex() etc.


But these functions only change the representation (they return a string); 
they do not change the underlying data in any way.
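
A short illustration of that point, using only built-in functions (the variable names are arbitrary):

```python
n = 63

# Four textual representations of the same integer.
as_text = [bin(n), oct(n), str(n), hex(n)]
print(as_text)

# int() with an explicit base turns each string back into the same number.
values = [int(bin(n), 2), int(oct(n), 8), int(str(n), 10), int(hex(n), 16)]
print(values)

# n itself is untouched by any of this.
print(n)
```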


HTH,

Alan G.





Re: [Tutor] Division by negative numbers [was: no subject]

2012-06-02 Thread Steven D'Aprano

Jason Barrett wrote:

> In python, why does 17/-10= -2? Shouldn't it be -1?


Integer division with negative numbers can be done in two different ways.

(1) 17/-10 gives -1, with remainder 7.

Proof: -1 times -10 is 10, plus 7 gives 17.


(2) 17/-10 gives -2, with remainder -3.

Proof: -2 times -10 is 20, plus -3 gives 17.

[Aside: technically, integer division with positive numbers also has the same 
choice: 17 = 1*10+7 or 2*10-3.]


Every language has to choose whether to support the first or the second 
version. Python happens to choose the second. Other languages might make the 
same choice, or the other.


For example, older versions of C (C89) leave that decision up to the compiler, 
so different C compilers may give different results, which is not very useful 
in my opinion. (C99 and later require the first version, rounding towards zero.)


Ruby behaves like Python:


steve@orac:/home/steve$ irb
irb(main):001:0> 17/10
=> 1
irb(main):002:0> 17/-10
=> -2



Other languages may be different.
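
To see the two conventions side by side in Python, here is a rough sketch. divmod() is the built-in and follows version (2); trunc_divmod is a made-up helper that imitates version (1), rounding the quotient towards zero:

```python
def trunc_divmod(a, b):
    """Quotient rounded towards zero, plus the matching remainder."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):   # operands have different signs
        q = -q
    return q, a - q * b

print(divmod(17, -10))        # Python's choice: floor the quotient
print(trunc_divmod(17, -10))  # the other choice: truncate the quotient

# Both satisfy quotient * divisor + remainder == dividend.
for q, r in (divmod(17, -10), trunc_divmod(17, -10)):
    assert q * -10 + r == 17
```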

P.S. in future, please try to include a meaningful subject line to your posts.


--
Steven


Re: [Tutor] Joining all strings in stringList into one string

2012-06-02 Thread Steven D'Aprano

Jordan wrote:


> #Another version might look like this:
>
> def join_strings2(string_list):
>     final_string = ''
>     for string in string_list:
>         final_string += string
>     print(final_string)
>     return final_string


Please don't do that. This risks becoming slow. REALLY slow. Painfully slow. 
Like, 10 minutes versus 3 seconds slow. Seriously.


The reason for this is quite technical, and the reason why you might not 
notice is even more complicated, but the short version is this:


Never build up a long string by repeated concatenation of short strings.
Always build up a list of substrings first, then use the join method to
assemble them into one long string.


Why repeated concatenation is slow:

Suppose you want to concatenate two strings, "hello" and "world", to make a 
new string "helloworld". What happens?


Firstly, Python has to count the number of characters needed, which 
fortunately is fast in Python, so we can ignore it. In this case, we need 5+5 
= 10 characters.


Secondly, Python sets aside enough memory for those 10 characters, plus a 
little bit of overhead: ----------


Then it copies the characters from "hello" into the new area: hello-----

followed by the characters of "world": helloworld

and now it is done. Simple, right? Concatenating two strings is pretty fast. 
You can't get much faster.


Ah, but what happens if you do it *repeatedly*? Suppose we have SIX strings we 
want to concatenate, in a loop:


words = ['hello', 'world', 'foo', 'bar', 'spam', 'ham']
result = ''
for word in words:
    result = result + word

How much work does Python have to do?

Step one: add '' + 'hello', giving result 'hello'
Python needs to copy 0+5 = 5 characters.

Step two: add 'hello' + 'world', giving result 'helloworld'
Python needs to copy 5+5 = 10 characters, as shown above.

Step three: add 'helloworld' + 'foo', giving 'helloworldfoo'
Python needs to copy 10+3 = 13 characters.

Step four: add 'helloworldfoo' + 'bar', giving 'helloworldfoobar'
Python needs to copy 13+3 = 16 characters.

Step five: add 'helloworldfoobar' + 'spam', giving 'helloworldfoobarspam'
Python needs to copy 16+4 = 20 characters.

Step six: add 'helloworldfoobarspam' + 'ham', giving 'helloworldfoobarspamham'
Python needs to copy 20+3 = 23 characters.

So in total, Python has to copy 5+10+13+16+20+23 = 87 characters, just to 
build up a 23 character string. And as the number of loops increases, the 
amount of extra work needed just keeps expanding. Even though a single string 
concatenation is fast, repeated concatenation is painfully SLOW.
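
The bookkeeping above can be reproduced with a small function. This is only a model of the naive copying described here (it ignores the optimization discussed further down), and the function name is made up:

```python
def chars_copied_by_concatenation(words):
    """Count the characters copied if every + builds a brand new string."""
    total = 0
    result = ''
    for word in words:
        result = result + word
        total += len(result)   # the whole new string had to be written
    return total

words = ['hello', 'world', 'foo', 'bar', 'spam', 'ham']
print(chars_copied_by_concatenation(words))   # characters copied in the loop
print(len(''.join(words)))                    # characters in the final string
```

With n pieces of similar size, the first number grows roughly with n squared, while the second grows only with n.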


In comparison, ''.join(words) only copies each substring once: it counts out 
that it needs 23 characters, allocates space for 23 characters, then copies 
each substring into the right place instead of making a whole lot of temporary 
strings and redundant copying.
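
A version of the function written along those lines might look like the sketch below. The separator parameter is an addition for illustration, not part of the original exercise:

```python
def join_strings(string_list, separator=''):
    """Collect the pieces in a list, then assemble them with one join()."""
    pieces = []
    for string in string_list:
        pieces.append(string)       # appending to a list is cheap
    return separator.join(pieces)   # the final string is built in one go

print(join_strings(['very', 'hot', 'day']))
print(join_strings(['this', 'is', 'it'], ' '))
```

When the items need no processing inside the loop, ''.join(string_list) on its own does the same job.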


So, join() is much faster than repeated concatenation. But you may never have 
noticed. Why not?


Well, for starters, for small enough pieces of data, everything is fast. The 
difference between copying 87 characters (the slow way) and 23 characters (the 
fast way) is trivial.


But more importantly, some years ago (Python 2.4, about 8 years ago?) the 
Python developers found a really neat trick that they can do to optimize 
string concatenation so it doesn't need to repeatedly copy characters over and 
over and over again. I won't go into details, but the thing is, this trick 
works well enough that repeated concatenation is about as fast as the join 
method MOST of the time.


Except when it fails. Because it is a trick, it doesn't always work. And when 
it does fail, your repeated string concatenation code will suddenly drop from 
running in 0.1 milliseconds to a full second or two; or worse, from 20 seconds 
to over an hour. (Potentially; the actual slow-down depends on the speed of 
your computer, your operating system, how much memory you have, etc.)


Because this is a cunning trick, it doesn't always work, and when it doesn't 
work, you have slow code and no hint as to why.


What can cause it to fail?

- Old versions of Python, before 2.4, will be slow.

- Other implementations of Python, such as Jython and IronPython, will not 
have the trick, and so will be slow.


- The trick is highly dependent on internal details of the memory management 
of Python and the way it interacts with the operating system. So what's fast 
under Linux may be slow under Windows, or the other way around.


- The trick is highly dependent on specific circumstances to do with the 
substrings being added. Without going into details, if those circumstances are 
violated, you will have slow code.


- The trick only works when you are adding strings to the end of the new 
string, not if you are building it up from the beginning.



So even though your function works, you can't rely on it being fast.




--
Steven


[Tutor] How to identify clusters of similar files

2012-06-02 Thread Albert-Jan Roskam
Hi,

I want to use difflib to compare a lot (tens of thousands) of text files. I 
know that many files are quite similar as they are subsequent versions of the 
same document (a primitive kind of version control). What would be a good 
approach to cluster the files based on their likeness? I want to be able to say 
something like: the number of files could be reduced by a factor of ten when 
the number of (near-)duplicates is taken into account.

So let's say I have ten versions of a txt file: 'file0.txt', 'file1.txt', 
'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt', 'file6.txt', 'file7.txt', 
'file8.txt', 'file9.txt'. How could I say, to some degree of certainty, that 
they are related? (I can't rely on the file names, I'm afraid.) file0 may be 
very similar to file1, but no longer to file9. But their likeness is "chained". 
The situation is easier with perfectly identical files.


The crude code below illustrates what I'd like to do, but it's too simplistic. 
I'd appreciate some thoughts or references to theoretical approaches to this 
kind of stuff.


import difflib, glob, os

path = "/home/aj/Destkop/someDir"
extension = ".txt"
cut_off = 0.95
clusters = {}  # maps a file name to the list of files that resemble it

allTheFiles = sorted(glob.glob(os.path.join(path, "*" + extension)))

for f_a in allTheFiles:
    for f_b in allTheFiles:
        if f_a != f_b:
            file_a = open(f_a).readlines()
            file_b = open(f_b).readlines()
            # The first argument is the "junk" test for sequence elements.
            likeness = difflib.SequenceMatcher(lambda x: x == " ", file_a,
                                               file_b).ratio()
            if likeness >= cut_off:
                try:
                    clusters[f_a].append(f_b)
                except KeyError:
                    clusters[f_a] = [f_b]

Thank you in advance!


Regards,
Albert-Jan


~~
All right, but apart from the sanitation, the medicine, education, wine, public 
order, irrigation, roads, a 
fresh water system, and public health, what have the Romans ever done for us?
~~


Re: [Tutor] Joining all strings in stringList into one string

2012-06-02 Thread Jordan
Thanks for the excellent and very informative feedback. I guess I just
did not consider the memory side of things and did not think about just
how much extra copying was having to occur.
Additionally I also found this supporting link:
http://wiki.python.org/moin/PythonSpeed/PerformanceTips#String_Concatenation


On 06/02/2012 04:29 PM, Steven D'Aprano wrote:
> [...]

Re: [Tutor] How to identify clusters of similar files

2012-06-02 Thread Steven D'Aprano

Albert-Jan Roskam wrote:

> Hi,
>
> I want to use difflib to compare a lot (tens of thousands) of text files. I
> know that many files are quite similar as they are subsequent versions of
> the same document (a primitive kind of version control). What would be a
> good approach to cluster the files based on their likeness?


You have already identified the basic tool: difflib. But your question is not 
really about Python, it is more about the algorithm used for clustering data 
according to goodness of fit. That's a hard problem, and you should consider 
asking it on the main Python mailing list or newsgroup too.


Some search terms to get you started:

biopython
nltk  (the Natural Language Tool Kit)
unrooted phylogram
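
As one possible starting point (not necessarily what the tools above do), the "chained" likeness described in the question can be approximated with single-linkage clustering: two items end up in the same cluster if a chain of sufficiently similar pairs connects them. The sketch below is an assumption-laden illustration: cluster_texts is a made-up name, it compares in-memory strings character by character rather than files line by line, and it still performs one comparison per pair, which may be too slow for tens of thousands of files.

```python
import difflib
from itertools import combinations

def cluster_texts(texts, cut_off=0.95):
    """Group names whose texts are linked by a chain of similar pairs.

    texts maps a name to its content (a string).
    """
    parent = {name: name for name in texts}   # union-find forest

    def find(name):
        # Follow parent links to the representative of the cluster.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sorted(texts), 2):
        matcher = difflib.SequenceMatcher(None, texts[a], texts[b],
                                          autojunk=False)
        if matcher.ratio() >= cut_off:
            parent[find(a)] = find(b)          # merge the two clusters

    groups = {}
    for name in texts:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())

# Toy data: v0 resembles v1, v1 resembles v2, but v0 and v2 differ more.
docs = {
    'v0': 'a' * 100,
    'v1': 'a' * 90 + 'b' * 10,
    'v2': 'a' * 80 + 'b' * 20,
    'other': 'z' * 100,
}
print(cluster_texts(docs, cut_off=0.85))
```

The number of clusters relative to the number of files then gives a rough estimate of the reduction factor asked about.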


Good luck!


--
Steven