find_longest_match in SequenceMatcher

2006-07-23 Thread koara
Hello, it might be too late or too hot, but I cannot work out this
behaviour of find_longest_match() in difflib.SequenceMatcher:

string1:
releasenotesforwildmagicversion01thiscdromcontainstheinitialreleaseofthesourcecodethataccompaniesthebook"3dgameenginedesign:apracticalapproachtorealtimecomputergraphics"thereareanumberofknownissuesaboutthecodeastheseissuesareaddressedtheupdatedcodewillbeavailableatthewebsitehttp://wwwmagicsoftwarecom/[EMAIL PROTECTED]

string2:
releasenotesforwildmagicversion02updatefromversion01toversion02ifyourcopyofthebookhasversion01andifyoudownloadedversion02fromthewebsitethenapplythefollowingdirectionsforinstallingtheupdateforalinuxinstallationseethesectionattheendofthisdocumentupdatedirectionsassumingthatthetopleveldirectoryiscalledmagicreplacebyyourtoplevelnameyoushouldhavetheversion01contentsinthislocation1deletethecontentsofmagic\include2deletethesubdirectorymagic\source\mgcapplication3deletetheobsoletefiles:amagic\source\mgc

find_longest_match(0,500,0,500)=(24,43,10)="version01t"

What? O_o Clearly there is a longer match, right at the beginning!
And then, after removal of the last character from each string (I found
the limit of 500 by trial and error -- and it looks suspiciously
rounded):

find_longest_match(0,499,0,499)=(0,0,32)="releasenotesforwildmagicversion0"


Is this the expected behaviour? What's going on?
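For reference, a minimal sketch of the calls in question (the two literal
strings are abbreviated here; the numbers quoted above come from the full
~500-character strings):

# Python 2.4 sketch; s1 and s2 stand in for the two long strings above.
from difflib import SequenceMatcher

s1 = "releasenotesforwildmagicversion01..."   # full string as in the post
s2 = "releasenotesforwildmagicversion02..."   # full string as in the post

sm = SequenceMatcher(None, s1, s2)
i, j, k = sm.find_longest_match(0, len(s1), 0, len(s2))
print (i, j, k), s1[i:i + k]        # reported above: (24, 43, 10) "version01t"

# the same call after dropping the last character of each string
sm2 = SequenceMatcher(None, s1[:-1], s2[:-1])
print sm2.find_longest_match(0, len(s1) - 1, 0, len(s2) - 1)   # reported: (0, 0, 32)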
Thank you for any ideas



Re: find_longest_match in SequenceMatcher

2006-07-24 Thread koara
John Machin wrote:
> --test results snip---
> Looks to me like the problem has nothing at all to do with the length
> of the searched strings, but a bug appeared in 2.3.  What version(s)
> were you using? Can you reproduce your results (500 & 499 giving
> different answers) with the same version?

Hello John, thank you for investigating and responding!

Yes, I can reproduce the behaviour with different results within the
same version -- which is 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310
32 bit (Intel)]

The catch is to remove the last character, as I described in my
original post, as opposed to passing reduced length parameters to
find_longest_match, which is what you did.

It is morning now, but I still fail to see the mistake I am making --
if it is indeed a bug, where do I report it?

Cheers!



Re: find_longest_match in SequenceMatcher

2006-07-25 Thread koara
Hello again John -- your hack/fix seems to work. Thanks a lot; now
let's hope timbot will indeed be here shortly with a proper fix =)



memory efficient set/dictionary

2007-06-10 Thread koara
What is the best way to go about using a large set (or dictionary) that
doesn't fit into main memory? What is Python's (2.5, let's say)
overhead for storing an int in a set, and how much for storing an int ->
int mapping in a dict?
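A rough way to gauge the container overhead empirically is sketched below
(sys.getsizeof only exists from 2.6 on, so this is illustrative rather than
an answer for 2.5, and it counts the container only, not the int objects
themselves):

import sys

n = 1000000
s = set(xrange(n))
d = dict((i, i) for i in xrange(n))
print "set:  %.1f container bytes per entry" % (sys.getsizeof(s) / float(n))
print "dict: %.1f container bytes per entry" % (sys.getsizeof(d) / float(n))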

Please recommend a module that allows persistent set/dict storage with
fast queries, best fits my problem, and is as lightweight as possible.
For queries, the hit ratio is about 10%. Fast updates would be nice,
but I can rewrite the algorithm so that the data is static, so update speed
is not critical.

Or am I better off not using Python here? Cheers.



Re: memory efficient set/dictionary

2007-06-10 Thread koara
Hello Steven,

On Jun 10, 5:29 pm, Steven D'Aprano
<[EMAIL PROTECTED]> wrote:
> > ...
> How do you know it won't fit in main memory if you don't know the
> overhead? A guess? You've tried it and your computer crashed?

exactly

> > Please recommend a module that allows persistent set/dict storage +
> > fast query that best fits my problem,
>
> Usually I love guessing what people's problems are before making a
> recommendation, but I'm feeling whimsical so I think I'll ask first.
>
> What is the problem you are trying to solve? How many keys do you have?

Corpus processing. There are on the order of billions to tens of
billions of keys (64-bit integers).

> Can you group them in some way, e.g. alphabetically? Do you need to search
> on random keys, or can you queue them and do them in the order of your
> choice?


Yes, keys in sets and dictionaries can be grouped in any way; order
doesn't matter. Not sure what you mean.
Yes, I need fast random access (at least I do without having to
rethink and rewrite everything, which is what I'd like to avoid with
the help of this thread :-)

Thanks for the reply!




Re: memory efficient set/dictionary

2007-06-11 Thread koara
> > I would recommend you to use a database since it meets your
> > requirements (off-memory, fast, persistent). The bsddb module
> > (Berkeley DB) even gives you a dictionary-like interface.
> >http://www.python.org/doc/lib/module-bsddb.html
>
> Standard SQL databases can work for this, but generally your
> recommendation of using bsddb works very well for int -> int mappings.
> In particular, I would suggest using a btree, if only because I have had
> troubles in the past with colliding keys in the bsddb.hash (and recno is
> just a flat file, and will attempt to create a file of i*(record size) bytes to
> write to record number i).
>
> As an alternative, there are many search-engine known methods for
> mapping int -> [int, int, ...], which can be implemented as int -> int,
> where the second int is a pointer to an address on disk.  Looking into a
> few of the open source search implementations may be worthwhile.

Thanks guys! I will look into bsddb; hopefully it doesn't keep all
keys in memory -- I couldn't find an answer to that during my (very
brief) look into the documentation.
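For later readers, a sketch of what such an on-disk int -> int mapping can
look like with a bsddb btree (the key packing and file name below are my own
illustrative assumptions, not from the thread; bsddb keys and values must be
byte strings, so the 64-bit integers are packed with struct):

import bsddb, struct

db = bsddb.btopen('int_map.db', 'c')      # on-disk btree, created if missing

def put(key, value):
    db[struct.pack('>q', key)] = struct.pack('>q', value)

def get(key):
    return struct.unpack('>q', db[struct.pack('>q', key)])[0]

put(1234567890123, 42)
print get(1234567890123)                  # -> 42
db.sync()
db.close()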

And how about the extra memory used for set/dict'ing of integers? Is
there a simple answer?



unicode categories -- regex

2007-09-22 Thread koara
Hello all -- my question regards special meta-characters for the re
module. I saw in the re module documentation that '\w' can match any
alphanumeric Unicode character. However, there was no info on
constructing patterns for other Unicode categories, such as purely
alphabetical characters, punctuation symbols, etc.

I found that this category information actually IS available in Python
-- in the standard module unicodedata. For example,
unicodedata.category(u'.') gives 'Po' for 'Punctuation, other', etc.

So how do I include this information in a regular expression pattern? Any
ideas? Thanks.


I'm talking about Python 2.5 here.



Re: unicode categories -- regex

2007-09-22 Thread koara
> At the moment, you have to generate a character class for this yourself,
> e.g.
> ...


Thank you Martin, this is exactly what I wanted to know.
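For later readers, a sketch of the character-class approach referred to
above (my own illustration, not Martin's exact code): scan the code points
with unicodedata.category and build a class for the category you want.

import re, sys, unicodedata

def category_pattern(cat):
    # characters whose Unicode category starts with cat, e.g. 'P' or 'Po'
    chars = [unichr(c) for c in xrange(sys.maxunicode + 1)
             if unicodedata.category(unichr(c)).startswith(cat)]
    return u'[' + u''.join(re.escape(ch) for ch in chars) + u']'

punctuation = re.compile(category_pattern('P'), re.UNICODE)
print punctuation.findall(u'spam, eggs... and ham!')   # [u',', u'.', u'.', u'.', u'!']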



Re: enumerate overflow

2007-10-03 Thread koara
On Oct 3, 7:22 pm, Raymond Hettinger <[EMAIL PROTECTED]> wrote:
> In Py2.6, I will most likely put in an automatic promotion to long
> for both enumerate() and count().  It took a while to figure out how
> to do this without killing the performance for normal cases (ones used
> in real programs, not examples contrived to say, "omg, see what
> *could* happen").
>
> Raymond


Thanks everybody for the replies and suggestions; I'm glad to see the
issue's already been discovered/discussed/almost resolved.

By the way, I do not consider my programs in any way 'unreal'.
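For anyone stuck on a pre-2.6 version in the meantime, a hedged pure-Python
stand-in illustrates the promotion Raymond describes (Python-level
arithmetic promotes int to long automatically, so the index never wraps):

def long_enumerate(iterable, start=0):
    # like enumerate(), but counting in Python so the index promotes to long
    i = start
    for item in iterable:
        yield i, item
        i += 1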



urllib.unquote + unicode

2007-11-13 Thread koara
Hello all,

I am using urllib.unquote_plus to unquote a string. Sometimes I get a
strange string such as "spolu%u017E%E1ci.cz" to unquote. The problem
here is that some application decided to quote a non-ASCII
character as %u directly, instead of using an encoding and quoting
byte by byte.

Python (2.4.1) simply returns 'spolu%u017E\xe1ci.cz', which is likely
not what the application meant.

My question is: is this %u quoting standard (i.e., urllib is in the
wrong), or is it not (i.e., the application is in the wrong and urllib
silently ignores the '%u0' -- why?), and most importantly, is there a
simple workaround to get it working as expected?
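One possible workaround is sketched below (the %uXXXX form is a
non-standard, JavaScript-style escape, and the latin-1 assumption for the
plain %XX bytes is mine):

import re, urllib

def unquote_plus_u(s, encoding='latin-1'):
    # normal pass first: %E1 and friends become raw bytes
    s = urllib.unquote_plus(s).decode(encoding)
    # then translate the non-standard %uXXXX escapes ourselves
    return re.sub(r'%u([0-9a-fA-F]{4})',
                  lambda m: unichr(int(m.group(1), 16)), s)

print repr(unquote_plus_u('spolu%u017E%E1ci.cz'))   # u'spolu\u017e\xe1ci.cz'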

Cheers!



mmap disk performance

2007-11-20 Thread koara
Hello all,

I am using the mmap module (Python 2.4) to access the contents of a file.

My question regards the relative performance of mmap.seek() vs.
mmap.tell(). I have a generator that returns stuff from the file,
piece by piece. Since other things may happen to the mmap object in
between consecutive next() calls (such as another iterator's next()),
I have to store the file position before yield and restore it
afterwards by means of tell() and seek(). Is this correct?

When restoring, is there a penalty for mmap.seek(pos) where the file
position is already at pos (i.e., nothing happened to the file
position in between, a common scenario)? If there is, is it worth
doing

if mmap.tell() != pos:
    mmap.seek(pos)

or such?
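For concreteness, a sketch of the save/restore pattern described above,
including that guarded seek (the record size and names are made up):

import mmap

# usage: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
def records(mm, size=16):
    mm.seek(0)
    while mm.tell() < mm.size():
        chunk = mm.read(size)
        pos = mm.tell()          # remember where this iterator stopped
        yield chunk              # another iterator may move mm meanwhile
        if mm.tell() != pos:     # restore only if someone actually moved it
            mm.seek(pos)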

Cheers!


hidden built-in module

2008-03-07 Thread koara
Hello, is there a way to access a module that is hidden because
another module (of the same name) is found first?

More specifically, I have my own logging.py module, and inside this
module, depending on how initialization goes, I may want to do 'from
logging import *' from the built-in logging.

I hope my description was clear, cheers.

I am using Python 2.4.


Re: hidden built-in module

2008-03-07 Thread koara
On Mar 5, 1:39 pm, gigs <[EMAIL PROTECTED]> wrote:
> koara wrote:
> > Hello, is there a way to access a module that is hidden because
> > another module (of the same name) is found first?
>
> > More specifically, i have my own logging.py module, and inside this
> > module, depending on how initialization goes,  i may want to do 'from
> > logging import *' from the built-in logging.
>
> > I hope my description was clear, cheers.
>
> > I am using python2.4.
>
> you can add your own logging module in an extra directory that has __init__.py
> and import it like: from extradirectory.logging import *
>
> and builtin: from logging import *


Thank you for your reply, gigs. However, the point of this namespace
harakiri is that existing code which uses 'import logging' ...
'logging.info()'... etc. continues working without any change.
Renaming my logging.py file is not an option -- if it were, I wouldn't
bother naming my module the same as a built-in :-)

Cheers.


Re: hidden built-in module

2008-03-07 Thread koara
> You can only try and search the sys-path for the logging-module, using
>
> sys.prefix
>
> and then look for logging.py. Using
>
> __import__(path)
>
> you get a reference to that module.
>
> Diez


Thank you Diez, that's the info I'd been looking for :-)

So the answer is sys module + __import__
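A hedged sketch of what that can look like from inside the local logging.py
(the stdlib directory layout and the load-under-another-name trick below are
my own assumptions, not spelled out in the thread):

import imp, os, sys

# locate the standard library directory under sys.prefix
if os.name == 'nt':
    _stdlib = os.path.join(sys.prefix, 'Lib')
else:
    _stdlib = os.path.join(sys.prefix, 'lib',
                           'python%d.%d' % sys.version_info[:2])

# load the real logging package under a different module name
_f, _path, _desc = imp.find_module('logging', [_stdlib])
_std_logging = imp.load_module('_std_logging', _f, _path, _desc)

# emulate 'from logging import *' for the standard package's public names
for _name in dir(_std_logging):
    if not _name.startswith('_'):
        globals()[_name] = getattr(_std_logging, _name)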

Cheers!