Re: handling very large dictionaries

2009-06-28 Thread Joseph Turian
You could also try using a key-value store. I am using pytc, a Python API for Tokyo Cabinet. It seems to figure out quite nicely when to go to disk, and when to use memory. But I have not done extensive tests. Here is some example code for using pytc: http://github.com/turian/pytc-example/tree/m
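
A minimal sketch of the key-value-store idea, assuming pytc mirrors Tokyo Cabinet's hash-database API (an HDB object with open/put/get and HDBOWRITER/HDBOCREAT flags); the linked example repository is the authoritative reference for the exact names:

    import pytc  # Tokyo Cabinet bindings; names below are assumed from the TC hash-database API

    db = pytc.HDB()
    db.open('counts.tch', pytc.HDBOWRITER | pytc.HDBOCREAT)  # disk-backed hash database
    db.put('some-key', '42')      # keys and values are byte strings
    print db.get('some-key')
    db.close()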

Re: handling very large dictionaries

2009-06-28 Thread alex23
On Jun 29, 9:13 am, mclovin wrote: > Is there something like it that is more flexible? Have you seen the stdlib module 'shelve'? http://docs.python.org/library/shelve.html It creates a persistent file-based dictionary, which can hold any type of object as long as it can be pickled. I really li
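
For reference, a minimal shelve sketch (standard library; keys must be strings, values can be any picklable object):

    import shelve

    db = shelve.open('bigdict.shelf')              # file-backed, dict-like persistence
    db['key1'] = {'count': 3, 'items': [1, 2, 3]}
    print db['key1']
    db.close()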

Re: handling very large dictionaries

2009-06-28 Thread Aahz
In article <9efff087-bd6e-49fb-ad30-a955a64b8...@j32g2000yqh.googlegroups.com>, mclovin wrote: > >I need to have a dictionary of about 8 gigs (well the data it is >processing is around 4gb). so naturally i am running into memory >errors. > >So i looked around and found bsddb which acts like a dic

handling very large dictionaries

2009-06-28 Thread mclovin
Hello all, I need to have a dictionary of about 8 gigs (well the data it is processing is around 4gb). so naturally i am running into memory errors. So i looked around and found bsddb which acts like a dictionary object only offloads the data from the RAM to the HDD, however that only supports st
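
One common workaround for bsddb's strings-only restriction (a sketch of my own, not taken from the thread) is to pickle the values yourself before storing them:

    import bsddb
    import cPickle as pickle

    db = bsddb.btopen('bigdict.db', 'c')     # disk-backed, dict-like, but string keys/values only

    def put(key, value):
        db[str(key)] = pickle.dumps(value, -1)   # serialize arbitrary values to strings

    def get(key):
        return pickle.loads(db[str(key)])

    put(42, {'count': 3, 'label': 'spam'})
    print get(42)
    db.close()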

Re: writing large dictionaries to file using cPickle

2009-04-26 Thread repi
ble.com/writing-large-dictionaries-to-file-using-cPickle-tp21708938p23246693.html Sent from the Python - python-list mailing list archive at Nabble.com. -- http://mail.python.org/mailman/listinfo/python-list

Re: writing large dictionaries to file using cPickle

2009-01-30 Thread Aaron Brady
On Jan 30, 2:44 pm, [email protected] wrote: > On Jan 28, 6:08 pm, Aaron Brady wrote: > > > > > On Jan 28, 4:43 pm, [email protected] wrote: > > > > On Jan 28, 5:14 pm, John Machin wrote: > > > > > On Jan 29, 3:13 am, [email protected] wrote: > > > > > > hello all, > > > > > > i have a large d

Re: writing large dictionaries to file using cPickle

2009-01-30 Thread perfreem
On Jan 28, 6:08 pm, Aaron Brady wrote: > On Jan 28, 4:43 pm, [email protected] wrote: > > > On Jan 28, 5:14 pm, John Machin wrote: > > > > On Jan 29, 3:13 am, [email protected] wrote: > > > > > hello all, > > > > > i have a large dictionary which contains about 10 keys, each key has a > > > > v

Re: writing large dictionaries to file using cPickle

2009-01-28 Thread Gabriel Genellina
En Wed, 28 Jan 2009 14:13:10 -0200, escribió: i have a large dictionary which contains about 10 keys, each key has a value which is a list containing about 1 to 5 million (small) dictionaries. for example, mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d', 3, 'e': 4, 'f': 'world'}, ...],

Re: writing large dictionaries to file using cPickle

2009-01-28 Thread John Machin
ower and slower. it takes almost an hour if not more to > > > write this pickle object to file. > > > > is there any way to speed this up? i dont mind the large file... after > > > all the text file with the data used to make the dictionary was larger > > > (

Re: writing large dictionaries to file using cPickle

2009-01-28 Thread Aaron Brady
On Jan 28, 4:43 pm, [email protected] wrote: > On Jan 28, 5:14 pm, John Machin wrote: > > > > > On Jan 29, 3:13 am, [email protected] wrote: > > > > hello all, > > > > i have a large dictionary which contains about 10 keys, each key has a > > > value which is a list containing about 1 to 5 milli

Re: writing large dictionaries to file using cPickle

2009-01-28 Thread perfreem
large file... after > > all the text file with the data used to make the dictionary was larger > > (~ 800 MB) than the file it eventually creates, which is 300 MB.  but > > i do care about speed... > > > i have tried optimizing this by using this: > > > s = pickle

Re: writing large dictionaries to file using cPickle

2009-01-28 Thread John Machin
do care about speed... > > i have tried optimizing this by using this: > > s = pickle.dumps(mydict, 2) > pfile.write(s) > > but this takes just as long... any ideas ? is there a different module > i could use that's more suitable for large dictionaries ? > thank you

Re: writing large dictionaries to file using cPickle

2009-01-28 Thread Aaron Brady
there a different module > i could use that's more suitable for large dictionaries ? > thank you very much. There is the 'shelve' module. You could create a shelf that tells you the filename of the 5 other ones. A million keys should be no problem, I guess. (It's

Re: writing large dictionaries to file using cPickle

2009-01-28 Thread Christian Heimes
[email protected] schrieb: > but this takes just as long... any ideas ? is there a different module > i could use that's more suitable for large dictionaries ? > thank you very much. Have a look at ZODB. -- http://mail.python.org/mailman/listinfo/python-list
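
A minimal ZODB sketch, assuming the classic FileStorage recipe (for millions of entries a BTrees.OOBTree is usually preferred over a plain dict, since a plain dict is loaded and written as one object):

    from ZODB import FileStorage, DB
    import transaction

    storage = FileStorage.FileStorage('Data.fs')   # persistent object store on disk
    db = DB(storage)
    conn = db.open()
    root = conn.root()
    root['mydict'] = {'a': 1, 'b': 2}              # written to disk on commit
    transaction.commit()
    conn.close()
    db.close()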

Re: writing large dictionaries to file using cPickle

2009-01-28 Thread perfreem
On Jan 28, 11:32 am, [email protected] wrote: > Hi, > > Change: > > pickle.dump(mydict, pfile) > > to: > > pickle.dump(mydict, pfile, -1 ) > > I think you will see a big difference in performance and also a much > smaller file on disk. > > BTW: What type of application are you developing that crea

Re: writing large dictionaries to file using cPickle

2009-01-28 Thread python
Hi, Change: pickle.dump(mydict, pfile) to: pickle.dump(mydict, pfile, -1 ) I think you will see a big difference in performance and also a much smaller file on disk. BTW: What type of application are you developing that creates so many dictionaries? Sounds interesting. Malcolm -- http://mail
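
The suggested change spelled out as a small sketch (protocol -1 selects the highest available binary protocol instead of the slow ASCII default; the sample data is a stand-in):

    import cPickle as pickle

    mydict = {'key1': [{'a': 1, 'b': 2, 'c': 'hello'}]}   # stand-in for the real data

    pfile = open('mydict.pkl', 'wb')
    pickle.dump(mydict, pfile, -1)     # -1 = highest protocol: binary, faster and smaller
    pfile.close()

    restored = pickle.load(open('mydict.pkl', 'rb'))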

writing large dictionaries to file using cPickle

2009-01-28 Thread perfreem
he dictionary was larger (~ 800 MB) than the file it eventually creates, which is 300 MB. but i do care about speed... i have tried optimizing this by using this: s = pickle.dumps(mydict, 2) pfile.write(s) but this takes just as long... any ideas ? is there a different module i could use that's

Re: optimizing large dictionaries

2009-01-16 Thread Luis M . González
On Jan 15, 6:39 pm, Per Freem wrote: > hello > > i have an optimization questions about python. i am iterating through > a file and counting the number of repeated elements. the file has on > the order > of tens of millions elements... > > i create a dictionary that maps elements of the file that

Re: optimizing large dictionaries

2009-01-16 Thread Matthias Julius
Per Freem writes: > the only 'twist' is that my elt is an instance of a class (MyClass) > with 3 fields, all numeric. the class is hashable, and so > my_dict[elt] works well. the __repr__ and __hash__ methods of my > class simply return str() representation of self, which just calls __str__().

Re: optimizing large dictionaries

2009-01-16 Thread pruebauno
On Jan 15, 4:39 pm, Per Freem wrote: > hello > > i have an optimization questions about python. i am iterating through > a file and counting the number of repeated elements. the file has on > the order > of tens of millions elements... > > i create a dictionary that maps elements of the file that

Re: optimizing large dictionaries

2009-01-15 Thread Scott David Daniels
Paul Rubin wrote: Per Freem writes: 2] is there an easy way to have nested defaultdicts? ie i want to say that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact that my_dict is a dictionary, whose values are dictionary that map to ints. but that syntax is not valid. my_dict = def

Re: optimizing large dictionaries

2009-01-15 Thread Paul McGuire
On Jan 15, 5:31 pm, Per Freem wrote: > ...the aKeys are very small (less than 100) where > as the bKeys are the ones that are in the millions.  so in that case, > doing a Try-Except on aKey should be very efficient, since often it > will not fail, ... Do you know the aKeys in advance? If so, the

Re: optimizing large dictionaries

2009-01-15 Thread Christian Heimes
Per Freem schrieb: > 1] is Try-Except really slower? my dict actually has two layers, so > my_dict[aKey][bKeys]. the aKeys are very small (less than 100) where > as the bKeys are the ones that are in the millions. so in that case, > doing a Try-Except on aKey should be very efficient, since often

Re: optimizing large dictionaries

2009-01-15 Thread Steven D'Aprano
On Thu, 15 Jan 2009 14:49:29 -0800, bearophileHUGS wrote: > Matimus, your suggestions are all good. > > Try-except is slower than: > if x in adict: ... else: ... Not according to my tests. >>> def tryexcept(D, key): ... try: ... return D[key] ... except KeyError: ...
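
A sketch of how such a comparison can be timed (the keys and the ifin variant are my own additions; which approach wins depends mostly on the hit/miss ratio):

    import timeit

    setup = """
    D = dict.fromkeys(xrange(1000000))
    def tryexcept(D, key):
        try:
            return D[key]
        except KeyError:
            return None
    def ifin(D, key):
        if key in D:
            return D[key]
        return None
    """

    for func in ('tryexcept', 'ifin'):
        for key in (500, -1):      # 500 is a hit, -1 is a guaranteed miss
            t = timeit.timeit('%s(D, %d)' % (func, key), setup=setup, number=1000000)
            print '%-10s key=%4d  %.3fs' % (func, key, t)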

Re: optimizing large dictionaries

2009-01-15 Thread Paul Rubin
Per Freem writes: > 2] is there an easy way to have nested defaultdicts? ie i want to say > that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact > that my_dict is a dictionary, whose values are dictionary that map to > ints. but that syntax is not valid. my_dict = defaultdict(lambd
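
The truncated reply is presumably the usual lambda trick; a small sketch:

    from collections import defaultdict

    # my_dict[aKey][bKey] += 1 works with no explicit initialisation of either level
    my_dict = defaultdict(lambda: defaultdict(int))
    my_dict['colour']['red'] += 1
    my_dict['colour']['red'] += 1
    print my_dict['colour']['red']    # -> 2
    print my_dict['shape']['cube']    # -> 0 (created on first access)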

Re: optimizing large dictionaries

2009-01-15 Thread Per Freem
thanks to everyone for the excellent suggestions. a few follow up q's: 1] is Try-Except really slower? my dict actually has two layers, so my_dict[aKey][bKeys]. the aKeys are very small (less than 100) where as the bKeys are the ones that are in the millions. so in that case, doing a Try-Except o

Re: optimizing large dictionaries

2009-01-15 Thread Steven D'Aprano
On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote: >> is there anything that can be done to speed up this simply code? right >> now it is taking well over 15 minutes to process, on a 3 Ghz machine >> with lots of RAM (though this is all taking CPU power, not RAM at this >> point.) > > cl

Re: optimizing large dictionaries

2009-01-15 Thread bearophileHUGS
Matimus, your suggestions are all good. Try-except is slower than: if x in adict: ... else: ... A defaultdict is generally faster (there are some conditions when it's not faster, but they aren't much common. I think it's when the ratio of duplicates is really low), creating just a tuple instead of

Re: optimizing large dictionaries

2009-01-15 Thread Christian Heimes
> class MyClass > > def __str__(self): > return "%s-%s-%s" %(self.field1, self.field2, self.field3) > > def __repr__(self): > return str(self) > > def __hash__(self): > return hash(str(self)) > > > is there anything that can be done to speed up this simply code? right > now i
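
One common fix (a sketch of my own, not necessarily what this reply goes on to recommend) is to hash a tuple of the numeric fields instead of formatting a string on every __hash__/__eq__ call:

    class MyClass(object):
        __slots__ = ('field1', 'field2', 'field3')   # optional: also trims per-instance memory

        def __init__(self, field1, field2, field3):
            self.field1 = field1
            self.field2 = field2
            self.field3 = field3

        def _key(self):
            return (self.field1, self.field2, self.field3)

        def __hash__(self):
            return hash(self._key())      # no string formatting per call

        def __eq__(self, other):
            return self._key() == other._key()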

Re: optimizing large dictionaries

2009-01-15 Thread Jervis Whitley
On Fri, Jan 16, 2009 at 8:39 AM, Per Freem wrote: > hello > > i have an optimization questions about python. i am iterating through > a file and counting the number of repeated elements. the file has on > the order > of tens of millions elements... > > > for line in file: > try: >elt = MyCla

Re: optimizing large dictionaries

2009-01-15 Thread Matimus
On Jan 15, 1:39 pm, Per Freem wrote: > hello > > i have an optimization questions about python. i am iterating through > a file and counting the number of repeated elements. the file has on > the order > of tens of millions elements... > > i create a dictionary that maps elements of the file that

optimizing large dictionaries

2009-01-15 Thread Per Freem
hello i have an optimization questions about python. i am iterating through a file and counting the number of repeated elements. the file has on the order of tens of millions elements... i create a dictionary that maps elements of the file that i want to count to their number of occurs. so i iter
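
For comparison, the counting loop itself reduces to a few lines with collections.defaultdict (the file name and element parsing below are placeholders):

    from collections import defaultdict

    counts = defaultdict(int)
    for line in open('elements.txt'):     # hypothetical input file
        elt = line.strip()                # stand-in for building the real element/key
        counts[elt] += 1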

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread John Machin
On Dec 24, 7:04 pm, "Malcolm Greene" wrote: > Hi Roger, > > By very large dictionary, I mean about 25M items per dictionary. Each > item is a simple integer whose value will never exceed 2^15. In Python-speak about dictionaries, an "item" is a tuple (key, value). I presume from the gross differen

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread Terry Reedy
Marc 'BlackJack' Rintsch wrote: On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: collection, I don't see the advantage of using an iterator or a list. I'm sure I'm missing a subtle point here :) `keys()` creates a list in memory, `iterkeys()` does not. With ``set(dict.keys())`` there is a

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread Scott David Daniels
Gabriel Genellina wrote: En Wed, 24 Dec 2008 06:23:00 -0200, escribió: ... k1 = set(dict1.iterkeys()) You've got an excellent explanation from Marc Rintsch. (Note that in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above code is supposed to be written in Python 2.x)

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread bearophileHUGS
Peter Otten: > >>> a = dict(a=1, b=2, c=3) > >>> b = dict(b=2, c=30, d=4) > >>> dict(set(a.iteritems()) ^ set(b.iteritems())) For larger sets this may be better, may avoid the creation of the second set: dict(set(a.iteritems()).symmetric_difference(b.iteritems())) Bye, bearophile -- http://mail.
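
Spelled out with the thread's own sample data (note that this requires the values to be hashable, and when a common key carries two different values only one of them survives the final dict()):

    a = dict(a=1, b=2, c=3)
    b = dict(b=2, c=30, d=4)

    # items that are not identical in both dicts
    diff = dict(set(a.iteritems()).symmetric_difference(b.iteritems()))
    print diff    # e.g. {'a': 1, 'c': 30, 'd': 4} -- 'c' keeps one of its two values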

Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Martin wrote: > I'd think he's talking about in memory SQLite Databases, this way you > should be quite fast _and_ could dump all that to a persistent > storage... I was just talking about regular on disk SQLite databases. In terms of priming the pum
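
For reference, a minimal in-memory sqlite3 sketch (the table layout is purely illustrative):

    import sqlite3

    conn = sqlite3.connect(':memory:')     # lives entirely in RAM, no disk I/O
    conn.execute('CREATE TABLE counts (key TEXT PRIMARY KEY, n INTEGER)')
    conn.executemany('INSERT INTO counts VALUES (?, ?)', [('a', 1), ('b', 2)])
    for row in conn.execute('SELECT key, n FROM counts'):
        print row
    conn.close()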

Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Martin
Hi, 2008/12/24 : > Hi Roger, > >> you may want to consider using SQLite > > Thank you for your suggestion about looking at SQLite. I haven't > compared the performance of SQLite to Python dictionaries, but I'm > skeptical that SQLite would be faster than in-memory Python dictionaries > for the ty

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread Malcolm Greene
08 07:10:16 -0200 Subject: Re: Strategy for determing difference between 2 very large dictionaries En Wed, 24 Dec 2008 06:23:00 -0200, escribió: > Hi Gabriel, > > Thank you very much for your feedback! > >> k1 = set(dict1.iterkeys()) > > I noticed you suggested .iterkeys

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread python
': 3, 'd': 4} That's very cool! Thanks for sharing that technique. Regards, Malcolm - Original message - From: "Peter Otten" <[email protected]> To: [email protected] Date: Wed, 24 Dec 2008 09:46:51 +0100 Subject: Re: Strategy for determing differ

Re: Most efficient way to build very large dictionaries

2008-12-24 Thread python
r your help Roger! Regards, Malcolm - Original message - From: "Roger Binns" To: [email protected] Date: Wed, 24 Dec 2008 00:50:49 -0800 Subject: Re: Most efficient way to build very large dictionaries -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 [email protected] wrote: >

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread Gabriel Genellina
En Wed, 24 Dec 2008 06:23:00 -0200, escribió: Hi Gabriel, Thank you very much for your feedback! k1 = set(dict1.iterkeys()) I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage to using an iterator vs. a list as the basis for creating a set? I You've got an excellent ex

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread python
Hi James, > For the purpose of perpetuating the annoying pedantry that has made > usenet great: > > http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists Great tip! Thank you! Malcolm -- http://mail.python.org/mailman/listinfo/python-list

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread James Stroud
Marc 'BlackJack' Rintsch wrote: On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: Hi Gabriel, Thank you very much for your feedback! k1 = set(dict1.iterkeys()) I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage to using an iterator vs. a list as the basis for creating a

Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 [email protected] wrote: > Thank you for your suggestion about looking at SQLite. I haven't > compared the performance of SQLite to Python dictionaries, but I'm > skeptical that SQLite would be faster than in-memory Python dictionaries > for the type

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread Peter Otten
Gabriel Genellina wrote: > En Wed, 24 Dec 2008 05:16:36 -0200, escribió: [I didn't see the original post] >> I'm looking for suggestions on the best ('Pythonic') way to >> determine the difference between 2 very large dictionaries >> containing simple

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread python
explanation. Thank you! Malcolm - Original message - From: "Marc 'BlackJack' Rintsch" To: [email protected] Date: 24 Dec 2008 08:30:41 GMT Subject: Re: Strategy for determing difference between 2 very large dictionaries On Wed, 24 Dec 2008 03:23:00 -0

Re: Most efficient way to build very large dictionaries

2008-12-24 Thread python
t need SQLite's ability to work with data sets larger than my physical memory. Regards, Malcolm - Original message - From: "Roger Binns" To: [email protected] Date: Wed, 24 Dec 2008 00:19:56 -0800 Subject: Re: Most efficient way to build very large dictionaries -BE

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread Marc 'BlackJack' Rintsch
On Wed, 24 Dec 2008 03:23:00 -0500, python wrote: > Hi Gabriel, > > Thank you very much for your feedback! > >> k1 = set(dict1.iterkeys()) > > I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage > to using an iterator vs. a list as the basis for creating a set? I > understan

Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 [email protected] wrote: > Can I take advantage of this knowledge to optimize You do the optimization last :-) The first thing you need to do is make sure you have a way of validating you got the correct results. With 25M entries it would be very e

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread python
ubject: Re: Strategy for determing difference between 2 very large dictionaries En Wed, 24 Dec 2008 05:16:36 -0200, escribió: > I'm looking for suggestions on the best ('Pythonic') way to > determine the difference between 2 very large dictionaries > containing simp

Re: Most efficient way to build very large dictionaries

2008-12-24 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 [email protected] wrote: > I would appreciate your thoughts on whether there are advantages to > working with a pre-built dictionary and if so, what are the best ways to > create a pre-loaded dictionary. Based on this and your other thread, you may w

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-24 Thread Malcolm Greene
ime. Thank you for sharing your thoughts with me. Regards, Malcolm - Original message - From: "Roger Binns" To: [email protected] Date: Tue, 23 Dec 2008 23:26:49 -0800 Subject: Re: Strategy for determing difference between 2 very large dictionaries -BEGIN PGP SIGNE

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-23 Thread Chris Rebert
On Tue, Dec 23, 2008 at 11:46 PM, Gabriel Genellina wrote: > > Yes; but isn't a dict comprehension more adequate? > > [key: (dict1[key], dict2[key]) for key in common_keys if > dict1[key]!=dict2[key]} That initial [ should be a { of course. Cheers, Chris -- Follow the path of the Iguana...

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-23 Thread Gabriel Genellina
En Wed, 24 Dec 2008 05:16:36 -0200, escribió: I'm looking for suggestions on the best ('Pythonic') way to determine the difference between 2 very large dictionaries containing simple key/value pairs. By difference, I mean a list of keys that are present in the first dictiona

Most efficient way to build very large dictionaries

2008-12-23 Thread python
I'm working with some very large dictionaries (about 25M items per dictionary) that get filled based on data parsed from text based log files. I'm using Python dictionaries to track the frequency of specific combinations of events, eg, I build a key and then add various timing inf

Re: Strategy for determining difference between 2 very large dictionaries

2008-12-23 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 [email protected] wrote: > Feedback on my proposed strategies (or better strategies) would be > greatly appreciated. Both strategies will work but I'd recommend the second approach since it uses already tested code written by other people - the chanc

Strategy for determining difference between 2 very large dictionaries

2008-12-23 Thread python
I'm looking for suggestions on the best ('Pythonic') way to determine the difference between 2 very large dictionaries containing simple key/value pairs. By difference, I mean a list of keys that are present in the first dictionary, but not the second. And vice versa. And a list of
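
The key-level part of that question comes down to a couple of set operations (a sketch with toy data):

    dict1 = {'a': 1, 'b': 2, 'c': 3}
    dict2 = {'b': 2, 'c': 30, 'd': 4}

    k1, k2 = set(dict1), set(dict2)        # iterating a dict yields its keys
    only_in_1 = k1 - k2                    # present in dict1 but not dict2
    only_in_2 = k2 - k1                    # present in dict2 but not dict1
    changed = [k for k in k1 & k2 if dict1[k] != dict2[k]]
    print only_in_1, only_in_2, changed    # set(['a']) set(['d']) ['c']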

Re: Optimizing size of very large dictionaries

2008-07-31 Thread Terry Reedy
Raymond Hettinger wrote: Background: I'm trying to identify duplicate records in very large text based transaction logs. I'm detecting duplicate records by creating a SHA1 checksum of each record and using this checksum as a dictionary key. This works great except for several files whose size

Re: Optimizing size of very large dictionaries

2008-07-31 Thread M.-A. Lemburg
On 2008-07-31 02:29, [EMAIL PROTECTED] wrote: Are there any techniques I can use to strip a dictionary data structure down to the smallest memory overhead possible? I'm working on a project where my available RAM is limited to 2G and I would like to use very large dictionaries vs. a tradit

Re: Optimizing size of very large dictionaries

2008-07-31 Thread Raymond Hettinger
> > Are there any techniques I can use to strip a dictionary data > > structure down to the smallest memory overhead possible? Sure. You can build your own version of a dict using UserDict.DictMixin. The underlying structure can be as space efficient as you want. FWIW, dictionaries automaticall
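
A toy sketch of the UserDict.DictMixin route (purely illustrative: it still keeps a regular dict for the key index and only compacts the values into an array, and it skips deletion):

    import UserDict
    import array

    class SmallIntDict(UserDict.DictMixin):
        """Maps keys to small integers, storing the values in a compact array."""
        def __init__(self):
            self._index = {}                     # key -> slot in the value array
            self._values = array.array('h')      # signed 16-bit ints

        def __getitem__(self, key):
            return self._values[self._index[key]]

        def __setitem__(self, key, value):
            if key in self._index:
                self._values[self._index[key]] = value
            else:
                self._index[key] = len(self._values)
                self._values.append(value)

        def __delitem__(self, key):
            raise NotImplementedError('sketch only')

        def keys(self):
            return self._index.keys()

    d = SmallIntDict()
    d['abc'] = 7
    print d['abc'], len(d)    # DictMixin supplies len(), "in", items(), etc.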

Re: Optimizing size of very large dictionaries

2008-07-30 Thread Miles
On Wed, Jul 30, 2008 at 8:29 PM, <[EMAIL PROTECTED]> wrote: > Background: I'm trying to identify duplicate records in very large text > based transaction logs. I'm detecting duplicate records by creating a SHA1 > checksum of each record and using this checksum as a dictionary key. This > works gre
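
A compact sketch of the duplicate-detection scheme described above; using a set and the 20-byte binary digest (rather than a dict keyed by the 40-character hexdigest) is an assumption of mine about where the memory goes, not necessarily what this reply recommends:

    import hashlib

    seen = set()
    duplicates = []
    for record in open('transactions.log'):          # hypothetical log file
        digest = hashlib.sha1(record).digest()       # 20 bytes vs. 40 for hexdigest()
        if digest in seen:
            duplicates.append(record)
        else:
            seen.add(digest)
    print len(seen), 'unique,', len(duplicates), 'duplicates'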

Re: Optimizing size of very large dictionaries

2008-07-30 Thread Gabriel Genellina
En Wed, 30 Jul 2008 21:29:39 -0300, <[EMAIL PROTECTED]> escribió: Are there any techniques I can use to strip a dictionary data structure down to the smallest memory overhead possible? I'm working on a project where my available RAM is limited to 2G and I would like to use

Optimizing size of very large dictionaries

2008-07-30 Thread python
Are there any techniques I can use to strip a dictionary data structure down to the smallest memory overhead possible? I'm working on a project where my available RAM is limited to 2G and I would like to use very large dictionaries vs. a traditional database. Background: I'm trying t

Re: Large Dictionaries

2006-06-12 Thread Klaas
Thomas Ganss wrote: > Klaas schrieb: > > > 4. Insert your keys in sorted order. > This advice is questionable - > it depends on the at least on the db vendor and probably > sometimes on the sort method, if inserting pre-sorted > values is better. The article I wrote that you quoted named a specifi

Re: Large Dictionaries

2006-06-09 Thread Duncan Booth
Marcin Ciura wrote: > Duncan Booth wrote: >> Marcin Ciura wrote: >>>See Figure 8 in >>>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf >> That isn't what the reference says. It only covers N up to a few >> thousand. Practical values of N need to at least go up into the >> millions. > > Pl

Re: Large Dictionaries

2006-06-09 Thread Marcin Ciura
Duncan Booth wrote: > Marcin Ciura wrote: >>See Figure 8 in >>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf > That isn't what the reference says. It only covers N up to a few thousand. > Practical values of N need to at least go up into the millions. Please look at the performance graph

Re: Large Dictionaries

2006-06-09 Thread Duncan Booth
Marcin Ciura wrote: > Tim Peters wrote: >> shellsort is much more a refinement of insertion-sort than of >> bubblesort, but is not an O(N log N) algorithm regardlesss. > > With a judiciously chosen increment sequence, > the number of comparisons made by shellsort > *is* of the order N log N for p

Re: Large Dictionaries

2006-06-09 Thread Marcin Ciura
Tim Peters wrote: > shellsort is much more a refinement of insertion-sort than of > bubblesort, but is not an O(N log N) algorithm regardlesss. With a judiciously chosen increment sequence, the number of comparisons made by shellsort *is* of the order N log N for practical values of N. See Figure

Re: Large Dictionaries

2006-06-08 Thread Tim Peters
[Tim Peters] >> ... >> O(N log N) sorting algorithms helped by pre-existing order are >> uncommon, unless they do extra work to detect and exploit >> pre-existing order. [Lawrence D'Oliveiro] > Shellsort works well with nearly-sorted data. It's basically a smarter > version of bubblesort with much

Re: Large Dictionaries

2006-06-08 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>, "Iain King" <[EMAIL PROTECTED]> wrote: >Lawrence D'Oliveiro wrote: >> In article <[EMAIL PROTECTED]>, >> Scott David Daniels <[EMAIL PROTECTED]> wrote: >> >> >For example, time timsort (Python's internal sort) on pre-sorted >> >data; you'll find it is handled fast

Re: Large Dictionaries

2006-06-08 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>, "Tim Peters" <[EMAIL PROTECTED]> wrote: >For example, the O(N log N) heapsort is unreasonable compared to the >O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad >case for heapsort, but a good case for bubblesort)? O(N log N) >sorting algorithms he

Re: Large Dictionaries

2006-06-05 Thread Jim Segrave
In article <[EMAIL PROTECTED]>, Tim Peters <[EMAIL PROTECTED]> wrote: >[Jim Segrave] >> Actually, presorted lists are not a bad case for heapsort - it's quite >> immune to any existing order or lack thereof, > >Write a heapsort and time it. It's not a difference in O() behavior, >but more memory m

Re: Large Dictionaries

2006-06-05 Thread Tim Peters
[Jim Segrave] > Actually, presorted lists are not a bad case for heapsort - it's quite > immune to any existing order or lack thereof, Write a heapsort and time it. It's not a difference in O() behavior, but more memory movement is required for a sorted list because transforming the list into a m

Re: Large Dictionaries

2006-06-05 Thread Jim Segrave
In article <[EMAIL PROTECTED]>, Tim Peters <[EMAIL PROTECTED]> wrote: >[Scott David Daniels] >>> For example, time timsort (Python's internal sort) on pre-sorted >>> data; you'll find it is handled faster than random data. > >O(N) vs O(N log N), in fact. > >[Lawrence D'Oliveiro] >> But isn't that h

Re: Large Dictionaries

2006-06-05 Thread Tim Peters
[Scott David Daniels] >> For example, time timsort (Python's internal sort) on pre-sorted >> data; you'll find it is handled faster than random data. O(N) vs O(N log N), in fact. [Lawrence D'Oliveiro] > But isn't that how a reasonable sorting algorithm should behave? Less > work to do if the data
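
Easy to check directly (a rough timing sketch; exact numbers depend on the machine):

    import random
    import time

    N = 1000000
    presorted = range(N)
    shuffled = range(N)
    random.shuffle(shuffled)

    for name, data in (('pre-sorted', presorted), ('random', shuffled)):
        work = list(data)              # fresh copy so each sort starts from the same state
        t0 = time.time()
        work.sort()                    # timsort: ~O(N) on ordered input, O(N log N) otherwise
        print '%-10s %.3fs' % (name, time.time() - t0)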

Re: Large Dictionaries

2006-06-05 Thread Iain King
Lawrence D'Oliveiro wrote: > In article <[EMAIL PROTECTED]>, > Scott David Daniels <[EMAIL PROTECTED]> wrote: > > >For example, time timsort (Python's internal sort) on pre-sorted > >data; you'll find it is handled faster than random data. > > But isn't that how a reasonable sorting algorithm sho

Re: Large Dictionaries

2006-06-05 Thread Aahz
In article <[EMAIL PROTECTED]>, Lawrence D'Oliveiro <[EMAIL PROTECTED]> wrote: >In article <[EMAIL PROTECTED]>, > Scott David Daniels <[EMAIL PROTECTED]> wrote: >> >>For example, time timsort (Python's internal sort) on pre-sorted >>data; you'll find it is handled faster than random data. > >But i

Re: Large Dictionaries

2006-06-05 Thread Steve Holden
Lawrence D'Oliveiro wrote: > In article <[EMAIL PROTECTED]>, > Scott David Daniels <[EMAIL PROTECTED]> wrote: > > >>For example, time timsort (Python's internal sort) on pre-sorted >>data; you'll find it is handled faster than random data. > > > But isn't that how a reasonable sorting algorith

Re: Large Dictionaries

2006-06-05 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>, Scott David Daniels <[EMAIL PROTECTED]> wrote: >For example, time timsort (Python's internal sort) on pre-sorted >data; you'll find it is handled faster than random data. But isn't that how a reasonable sorting algorithm should behave? Less work to do if the data

Re: Large Dictionaries

2006-05-29 Thread Martin v. Löwis
Roy Smith wrote: > My guess would be that each resize grows the table by a constant > factor, which IIRC, works out to amortized O(n). Right. For <50000 entries, the size is multiplied by 4 each time, and doubled each time for large dictionaries. > It's not as good as > c

Re: Large Dictionaries

2006-05-29 Thread Scott David Daniels
Thomas Ganss wrote: > Klaas schrieb: >> 4. Insert your keys in sorted order. > This advice is questionable - > > My gut feeling on this matter is: > IF the insert times of pre-sorted values is far better > than the times of unsorted values, there is a chance > that the resulting tree is unbala

Re: Large Dictionaries

2006-05-28 Thread Thomas Ganss
Klaas schrieb: > 4. Insert your keys in sorted order. This advice is questionable - it depends on the at least on the db vendor and probably sometimes on the sort method, if inserting pre-sorted values is better. My gut feeling on this matter is: IF the insert times of pre-sorted values is far be

Re: Large Dictionaries

2006-05-24 Thread Klaas
Chris: > class StorageBerkeleyDB(StorageTest): >def runtest(self, number_hash): >db = bsddb.hashopen(None, flag='c', cachesize=8192) >for (num, wildcard_digits) in number_hash.keys(): >key = '%d:%d' % (num, wildcard_digits) >db[key] = None >db.clo

Re: Large Dictionaries

2006-05-24 Thread Klaas
Chris: > Berkeley DB is great for accessing data by key for things already > stored on disk (i.e. read access), but write performance for each > key-value pair is slow due to it being careful about flushing > writes to disk by default. This is absolutely false. -Mike -- http://mail.python.org/

Re: Large Dictionaries

2006-05-18 Thread Claudio Grondi
Chris Foote wrote: > Richie Hindle wrote: >> [Chris] >>> Has anyone written a fast hash module which is more optimal for >>> large datasets ? >> >> PyJudy might be what you're looking for, though I've never used it: >> >> http://www.dalkescientific.com/Python/PyJudy.html >> >> "Judy's key benefit

Re: Large Dictionaries

2006-05-18 Thread Claudio Grondi
Chris Foote wrote: > Claudio Grondi wrote: >> Chris Foote wrote: >>> Klaas wrote: >>> > 22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. >>> >>> I've put together some simplified test code, but the bsddb >>> module gives 11m for 1M keys: >>> >> I have run your c

Re: Large Dictionaries

2006-05-17 Thread Chris Foote
Claudio Grondi wrote: > Chris Foote wrote: >> Klaas wrote: >> 22.2s 20m25s[3] >>> >>> 20m to insert 1m keys? You are doing something wrong. >> >> I've put together some simplified test code, but the bsddb >> module gives 11m for 1M keys: >> > I have run your code for the bsddb on my P4 2.8 G

Re: Large Dictionaries

2006-05-17 Thread Chris Foote
Richie Hindle wrote: > [Chris] >> Has anyone written a fast hash module which is more optimal for >> large datasets ? > > PyJudy might be what you're looking for, though I've never used it: > > http://www.dalkescientific.com/Python/PyJudy.html > > "Judy's key benefits are scalability, high per

Re: Large Dictionaries

2006-05-17 Thread Claudio Grondi
Chris Foote wrote: > Klaas wrote: > >>> 22.2s 20m25s[3] >> >> >> 20m to insert 1m keys? You are doing something wrong. > > > Hi Mike. > > I've put together some simplified test code, but the bsddb > module gives 11m for 1M keys: > I have run your code for the bsddb on my P4 2.8 GHz and have

Re: Large Dictionaries

2006-05-17 Thread Chris Foote
Klaas wrote: 22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. Hi Mike. I've put together some simplified test code, but the bsddb module gives 11m for 1M keys: Number generator test for 100 number ranges with a maximum of 3 wildcard digits. Wed May 17 22:1

Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread skip
bob> If you have the same number of entries as buckets, and you have a bob> good hash function, then if you have n buckets your longest chain bob> should have length around ln(n). The average length of a nonempty bob> bucket would be somewhere around 1 1/2. Yes, and it achieves t

Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread skip
Roy> If you're getting long hash chains, you're either using a bad hash Roy> function, or your table isn't big enough. Sure. The original poster said something about 10 million keys I think. Unless the key distribution is amazingly fortuitous and the number of unique values is small, the

Re: Large Dictionaries

2006-05-16 Thread Klaas
>22.2s 20m25s[3] 20m to insert 1m keys? You are doing something wrong. With bdb's it is crucial to insert keys in bytestring-sorted order. Also, be sure to give it a decent amount of cache. -Mike -- http://mail.python.org/mailman/listinfo/python-list
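
A sketch of both suggestions together (sorted insertion plus an explicit cache size; the key set and cache size are placeholders):

    import bsddb

    all_keys = ['banana', 'apple', 'cherry']            # stand-in for the real ~1M keys
    db = bsddb.btopen('keys.db', 'c', cachesize=64 * 1024 * 1024)
    for key in sorted(all_keys):                        # bytestring-sorted insertion order
        db[key] = ''
    db.close()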

Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread bob_jenkins
If you have the same number of entries as buckets, and you have a good hash function, then if you have n buckets your longest chain should have length around ln(n). The average length of a nonempty bucket would be somewhere around 1 1/2. -- http://mail.python.org/mailman/listinfo/python-list

Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread Roy Smith
ame value? For small >dictionaries keeping the max keys that hash to the same value small isn't a >huge problem. For large dictionaries (millions of keys) might you have some >long chains? Or in an effort to reduce the chain length, wind up using so >much virtual memory that you

Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread Graham Fawcett
On 5/16/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > > Graham> Looking up a key in a dictionary is done in constant-time, > Graham> i.e. it doesn't matter how large the dictionary is. > > Doesn't that depend on how many keys hash to the same value? For small > dictionaries keeping th

Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread skip
isn't a huge problem. For large dictionaries (millions of keys) might you have some long chains? Or in an effort to reduce the chain length, wind up using so much virtual memory that you wind up wrecking performance by swapping? Skip -- http://mail.python.org/mailman/listinfo/python-list

Re: For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread Graham Fawcett
Casey Hawthorne wrote: > For Large Dictionaries Could One Use Separate Dictionaries Where Each > Dictionary Covers an Interval of the Input Range? One Could, But Why? :-) You wouldn't see any performance improvements. Looking up a key in a dictionary is done in constant-time, i.e.

For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range?

2006-05-16 Thread Casey Hawthorne
For Large Dictionaries Could One Use Separate Dictionaries Where Each Dictionary Covers an Interval of the Input Range? You would need to know something of the input range and its uniformity! Self-balancing between dictionaries for separate dictionaries? -- Regards, Casey -- http
