You could also try using a key-value store.
I am using pytc, a Python API for Tokyo Cabinet. It seems to figure
out quite nicely when to go to disk, and when to use memory. But I
have not done extensive tests.
Here is some example code for using pytc:
http://github.com/turian/pytc-example/tree/m
On Jun 29, 9:13 am, mclovin wrote:
> Is there something like it that is more flexible?
Have you seen the stdlib module 'shelve'?
http://docs.python.org/library/shelve.html
It creates a persistent file-based dictionary, which can hold any type
of object as long as it can be pickled.
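For anyone trying this, a minimal sketch of the shelve approach (the filename 'mydata.db' is illustrative, not from the thread):

import shelve

# Open (or create) a file-backed mapping; values are pickled on assignment.
db = shelve.open('mydata.db')
try:
    db['key1'] = [{'a': 1, 'b': 2, 'c': 'hello'}]   # any picklable value works
    print(db['key1'])
finally:
    db.close()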
I really li
In article <9efff087-bd6e-49fb-ad30-a955a64b8...@j32g2000yqh.googlegroups.com>,
mclovin wrote:
>
>I need to have a dictionary of about 8 gigs (well the data it is
>processing is around 4gb). so naturally i am running into memory
>errors.
>
>So i looked around and found bsddb which acts like a dic
Hello all,
I need to have a dictionary of about 8 gigs (well the data it is
processing is around 4gb). so naturally i am running into memory
errors.
So i looked around and found bsddb which acts like a dictionary object
only offloads the data from the RAM to the HDD, however that only
supports strings
On Jan 30, 2:44 pm, [email protected] wrote:
> On Jan 28, 6:08 pm, Aaron Brady wrote:
>
>
>
> > On Jan 28, 4:43 pm, [email protected] wrote:
>
> > > On Jan 28, 5:14 pm, John Machin wrote:
>
> > > > On Jan 29, 3:13 am, [email protected] wrote:
>
> > > > > hello all,
>
> > > > > i have a large d
On Jan 28, 6:08 pm, Aaron Brady wrote:
> On Jan 28, 4:43 pm, [email protected] wrote:
>
> > On Jan 28, 5:14 pm, John Machin wrote:
>
> > > On Jan 29, 3:13 am, [email protected] wrote:
>
> > > > hello all,
>
> > > > i have a large dictionary which contains about 10 keys, each key has a
> > > > v
On Wed, 28 Jan 2009 14:13:10 -0200, wrote:
i have a large dictionary which contains about 10 keys, each key has a
value which is a list containing about 1 to 5 million (small)
dictionaries. for example,
mydict = {key1: [{'a': 1, 'b': 2, 'c': 'hello'}, {'d': 3, 'e': 4, 'f':
'world'}, ...],
slower and slower. it takes almost an hour if not more to
> > > write this pickle object to file.
>
> > > is there any way to speed this up? i dont mind the large file... after
> > > all the text file with the data used to make the dictionary was larger
> > > (
On Jan 28, 4:43 pm, [email protected] wrote:
> On Jan 28, 5:14 pm, John Machin wrote:
>
>
>
> > On Jan 29, 3:13 am, [email protected] wrote:
>
> > > hello all,
>
> > > i have a large dictionary which contains about 10 keys, each key has a
> > > value which is a list containing about 1 to 5 milli
large file... after
> > all the text file with the data used to make the dictionary was larger
> > (~ 800 MB) than the file it eventually creates, which is 300 MB. but
> > i do care about speed...
>
> > i have tried optimizing this by using this:
>
> > s = pickle
do care about speed...
>
> i have tried optimizing this by using this:
>
> s = pickle.dumps(mydict, 2)
> pfile.write(s)
>
> but this takes just as long... any ideas ? is there a different module
> i could use that's more suitable for large dictionaries ?
> thank you
there a different module
> i could use that's more suitable for large dictionaries ?
> thank you very much.
There is the 'shelve' module. You could create a shelf that tells you
the filename of the 5 other ones. A million keys should be no
problem, I guess. (It's
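A hedged sketch of that idea: keep one small index shelf that records which file holds each big key, and open only the data file you need (all filenames here are made up for the example):

import shelve

index = shelve.open('index.shelf')
data = shelve.open('key1.shelf')
data['records'] = [{'a': 1, 'b': 2, 'c': 'hello'}]   # the big list for key1
index['key1'] = 'key1.shelf'                         # remember where key1 lives
data.close()

# Later: look up which file holds key1, then open only that shelf.
data = shelve.open(index['key1'])
records = data['records']
data.close()
index.close()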
[email protected] wrote:
> but this takes just as long... any ideas ? is there a different module
> i could use that's more suitable for large dictionaries ?
> thank you very much.
Have a look at ZODB.
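For readers unfamiliar with it, a rough sketch of storing the dictionary in a ZODB FileStorage, following the usual tutorial pattern (treat the import style and the filename as assumptions, not code from this thread):

import transaction
from ZODB import FileStorage, DB

storage = FileStorage.FileStorage('mydata.fs')
db = DB.DB(storage)
conn = db.open()
root = conn.root()
root['mydict'] = {'key1': [{'a': 1, 'b': 2, 'c': 'hello'}]}   # toy data
transaction.commit()
conn.close()
db.close()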
On Jan 28, 11:32 am, [email protected] wrote:
> Hi,
>
> Change:
>
> pickle.dump(mydict, pfile)
>
> to:
>
> pickle.dump(mydict, pfile, -1 )
>
> I think you will see a big difference in performance and also a much
> smaller file on disk.
>
> BTW: What type of application are you developing that crea
Hi,
Change:
pickle.dump(mydict, pfile)
to:
pickle.dump(mydict, pfile, -1 )
I think you will see a big difference in performance and also a much
smaller file on disk.
BTW: What type of application are you developing that creates so many
dictionaries? Sounds interesting.
Malcolm
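To make the suggestion concrete, a small self-contained comparison (toy data; filenames are illustrative):

import pickle

mydict = dict(('key%d' % i, i) for i in range(100000))   # stand-in data

# Default protocol (0 on Python 2): text based, slower and larger on disk.
pfile = open('slow.pkl', 'wb')
pickle.dump(mydict, pfile)
pfile.close()

# Protocol -1 selects the highest protocol available: binary, faster, smaller.
pfile = open('fast.pkl', 'wb')
pickle.dump(mydict, pfile, -1)
pfile.close()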
the text file with the data used to make the dictionary was larger
(~ 800 MB) than the file it eventually creates, which is 300 MB. but
i do care about speed...
i have tried optimizing this by using this:
s = pickle.dumps(mydict, 2)
pfile.write(s)
but this takes just as long... any ideas ? is there a different module
i could use that
On Jan 15, 6:39 pm, Per Freem wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order
> of tens of millions elements...
>
> i create a dictionary that maps elements of the file that
Per Freem writes:
> the only 'twist' is that my elt is an instance of a class (MyClass)
> with 3 fields, all numeric. the class is hashable, and so
> my_dict[elt] works well. the __repr__ and __hash__ methods of my
> class simply return str() representation of self,
which just calls __str__().
On Jan 15, 4:39 pm, Per Freem wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order
> of tens of millions elements...
>
> i create a dictionary that maps elements of the file that
Paul Rubin wrote:
Per Freem writes:
2] is there an easy way to have nested defaultdicts? ie i want to say
that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact
that my_dict is a dictionary, whose values are dictionary that map to
ints. but that syntax is not valid.
my_dict = defaultdict(lambda: defaultdict(int))
On Jan 15, 5:31 pm, Per Freem wrote:
> ...the aKeys are very small (less than 100) where
> as the bKeys are the ones that are in the millions. so in that case,
> doing a Try-Except on aKey should be very efficient, since often it
> will not fail, ...
Do you know the aKeys in advance? If so, the
Per Freem wrote:
> 1] is Try-Except really slower? my dict actually has two layers, so
> my_dict[aKey][bKeys]. the aKeys are very small (less than 100) where
> as the bKeys are the ones that are in the millions. so in that case,
> doing a Try-Except on aKey should be very efficient, since often
On Thu, 15 Jan 2009 14:49:29 -0800, bearophileHUGS wrote:
> Matimus, your suggestions are all good.
>
> Try-except is slower than:
> if x in adict: ... else: ...
Not according to my tests.
>>> def tryexcept(D, key):
...     try:
...         return D[key]
...     except KeyError:
...
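A self-contained version of this kind of comparison (function names and key values are made up for the example):

import timeit

setup = """
D = dict.fromkeys(range(1000))

def tryexcept(D, key):
    try:
        return D[key]
    except KeyError:
        return None

def ifin(D, key):
    if key in D:
        return D[key]
    return None
"""

for call in ("tryexcept(D, 500)", "tryexcept(D, -1)",
             "ifin(D, 500)", "ifin(D, -1)"):
    print("%-18s %.4f" % (call, min(timeit.repeat(call, setup, number=100000))))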
Per Freem writes:
> 2] is there an easy way to have nested defaultdicts? ie i want to say
> that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact
> that my_dict is a dictionary, whose values are dictionary that map to
> ints. but that syntax is not valid.
my_dict = defaultdict(lambda: defaultdict(int))
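In use, both levels are created on demand:

from collections import defaultdict

my_dict = defaultdict(lambda: defaultdict(int))
my_dict['aKey']['bKey'] += 1      # neither key existed before this line
my_dict['aKey']['bKey'] += 1
print(my_dict['aKey']['bKey'])    # 2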
thanks to everyone for the excellent suggestions. a few follow up q's:
1] is Try-Except really slower? my dict actually has two layers, so
my_dict[aKey][bKeys]. the aKeys are very small (less than 100) where
as the bKeys are the ones that are in the millions. so in that case,
doing a Try-Except o
On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote:
>> is there anything that can be done to speed up this simple code? right
>> now it is taking well over 15 minutes to process, on a 3 Ghz machine
>> with lots of RAM (though this is all taking CPU power, not RAM at this
>> point.)
>
> cl
Matimus, your suggestions are all good.
Try-except is slower than:
if x in adict: ... else: ...
A defaultdict is generally faster (there are some conditions when it's
not faster, but they aren't very common. I think it's when the ratio
of duplicates is really low), creating just a tuple instead of
> class MyClass:
>
>     def __str__(self):
>         return "%s-%s-%s" % (self.field1, self.field2, self.field3)
>
>     def __repr__(self):
>         return str(self)
>
>     def __hash__(self):
>         return hash(str(self))
>
>
> is there anything that can be done to speed up this simple code? right
> now i
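A variant that often comes up in these threads (a sketch, not code from this post; the field names are taken from the quoted class) is to hash and compare a tuple of the fields directly instead of formatting a string on every lookup:

class MyClass(object):
    def __init__(self, field1, field2, field3):
        self.field1 = field1
        self.field2 = field2
        self.field3 = field3

    def _key(self):
        return (self.field1, self.field2, self.field3)

    def __hash__(self):
        return hash(self._key())          # no string formatting per lookup

    def __eq__(self, other):
        return self._key() == other._key()

    def __repr__(self):
        return "%s-%s-%s" % self._key()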
On Fri, Jan 16, 2009 at 8:39 AM, Per Freem wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order
> of tens of millions elements...
>
>
> for line in file:
> try:
>elt = MyCla
On Jan 15, 1:39 pm, Per Freem wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order
> of tens of millions elements...
>
> i create a dictionary that maps elements of the file that
hello
i have an optimization questions about python. i am iterating through
a file and counting the number of repeated elements. the file has on
the order
of tens of millions elements...
i create a dictionary that maps elements of the file that i want to
count
to their number of occurrences. so i iter
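A minimal sketch of the counting loop with collections.defaultdict, which several replies suggest (the filename and the way each element is parsed are placeholders):

from collections import defaultdict

counts = defaultdict(int)
f = open('elements.txt')
for line in f:
    elt = line.strip()        # stand-in for building the real element from the line
    counts[elt] += 1          # missing keys start at 0, no try/except needed
f.close()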
On Dec 24, 7:04 pm, "Malcolm Greene" wrote:
> Hi Roger,
>
> By very large dictionary, I mean about 25M items per dictionary. Each
> item is a simple integer whose value will never exceed 2^15.
In Python-speak about dictionaries, an "item" is a tuple (key, value).
I presume from the gross differen
Marc 'BlackJack' Rintsch wrote:
On Wed, 24 Dec 2008 03:23:00 -0500, python wrote:
collection, I don't see the advantage of using an iterator or a list.
I'm sure I'm missing a subtle point here :)
`keys()` creates a list in memory, `iterkeys()` does not. With
``set(dict.keys())`` there is a
Gabriel Genellina wrote:
On Wed, 24 Dec 2008 06:23:00 -0200, wrote:
... k1 = set(dict1.iterkeys())
You've got an excellent explanation from Marc Rintsch. (Note that in
Python 3.0 keys() behaves as iterkeys() in previous versions, so the
above code is supposed to be written in Python 2.x)
Peter Otten:
> >>> a = dict(a=1, b=2, c=3)
> >>> b = dict(b=2, c=30, d=4)
> >>> dict(set(a.iteritems()) ^ set(b.iteritems()))
For larger sets this may be better, may avoid the creation of the
second set:
dict(set(a.iteritems()).symmetric_difference(b.iteritems()))
Bye,
bearophile
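Run on Peter Otten's toy dicts, the one-liner looks like this (a sketch; which of the two conflicting 'c' values ends up in the result depends on set iteration order):

a = dict(a=1, b=2, c=3)
b = dict(b=2, c=30, d=4)

# items that appear in exactly one dict, or whose values differ between the two
diff = dict(set(a.items()).symmetric_difference(b.items()))
print(diff)   # contains 'a' and 'd', plus one of the two 'c' entries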
Martin wrote:
> I'd think he's talking about in memory SQLite Databases, this way you
> should be quite fast _and_ could dump all that to a persistent
> storage...
I was just talking about regular on disk SQLite databases. In terms of
priming the pum
Hi,
2008/12/24 :
> Hi Roger,
>
>> you may want to consider using SQLite
>
> Thank you for your suggestion about looking at SQLite. I haven't
> compared the performance of SQLite to Python dictionaries, but I'm
> skeptical that SQLite would be faster than in-memory Python dictionaries
> for the ty
08 07:10:16 -0200
Subject: Re: Strategy for determining difference between 2 very large
dictionaries
On Wed, 24 Dec 2008 06:23:00 -0200, wrote:
> Hi Gabriel,
>
> Thank you very much for your feedback!
>
>> k1 = set(dict1.iterkeys())
>
> I noticed you suggested .iterkeys
': 3, 'd': 4}
That's very cool! Thanks for sharing that technique.
Regards,
Malcolm
- Original message -
From: "Peter Otten" <[email protected]>
To: [email protected]
Date: Wed, 24 Dec 2008 09:46:51 +0100
Subject: Re: Strategy for determining differ
r your help Roger!
Regards,
Malcolm
- Original message -
From: "Roger Binns"
To: [email protected]
Date: Wed, 24 Dec 2008 00:50:49 -0800
Subject: Re: Most efficient way to build very large dictionaries
[email protected] wrote:
On Wed, 24 Dec 2008 06:23:00 -0200, wrote:
Hi Gabriel,
Thank you very much for your feedback!
k1 = set(dict1.iterkeys())
I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
to using an iterator vs. a list as the basis for creating a set? I
You've got an excellent explanation
Hi James,
> For the purpose of perpetuating the annoying pedantry that has made
> usenet great:
>
> http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists
Great tip! Thank you!
Malcolm
Marc 'BlackJack' Rintsch wrote:
On Wed, 24 Dec 2008 03:23:00 -0500, python wrote:
Hi Gabriel,
Thank you very much for your feedback!
k1 = set(dict1.iterkeys())
I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
to using an iterator vs. a list as the basis for creating a
[email protected] wrote:
> Thank you for your suggestion about looking at SQLite. I haven't
> compared the performance of SQLite to Python dictionaries, but I'm
> skeptical that SQLite would be faster than in-memory Python dictionaries
> for the type
Gabriel Genellina wrote:
> On Wed, 24 Dec 2008 05:16:36 -0200, wrote:
[I didn't see the original post]
>> I'm looking for suggestions on the best ('Pythonic') way to
>> determine the difference between 2 very large dictionaries
>> containing simple
explanation.
Thank you!
Malcolm
- Original message -
From: "Marc 'BlackJack' Rintsch"
To: [email protected]
Date: 24 Dec 2008 08:30:41 GMT
Subject: Re: Strategy for determining difference between 2 very large
dictionaries
On Wed, 24 Dec 2008 03:23:00 -0
I don't need SQLite's
ability to work with data sets larger than my physical memory.
Regards,
Malcolm
- Original message -
From: "Roger Binns"
To: [email protected]
Date: Wed, 24 Dec 2008 00:19:56 -0800
Subject: Re: Most efficient way to build very large dictionaries
On Wed, 24 Dec 2008 03:23:00 -0500, python wrote:
> Hi Gabriel,
>
> Thank you very much for your feedback!
>
>> k1 = set(dict1.iterkeys())
>
> I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
> to using an iterator vs. a list as the basis for creating a set? I
> understan
[email protected] wrote:
> Can I take advantage of this knowledge to optimize
You do the optimization last :-) The first thing you need to do is make
sure you have a way of validating you got the correct results. With 25M
entries it would be very e
Subject: Re: Strategy for determining difference between 2 very large
dictionaries
On Wed, 24 Dec 2008 05:16:36 -0200, wrote:
> I'm looking for suggestions on the best ('Pythonic') way to
> determine the difference between 2 very large dictionaries
> containing simp
[email protected] wrote:
> I would appreciate your thoughts on whether there are advantages to
> working with a pre-built dictionary and if so, what are the best ways to
> create a pre-loaded dictionary.
Based on this and your other thread, you may w
ime.
Thank you for sharing your thoughts with me.
Regards,
Malcolm
- Original message -
From: "Roger Binns"
To: [email protected]
Date: Tue, 23 Dec 2008 23:26:49 -0800
Subject: Re: Strategy for determining difference between 2 very large
dictionaries
On Tue, Dec 23, 2008 at 11:46 PM, Gabriel Genellina
wrote:
>
> Yes; but isn't a dict comprehension more adequate?
>
> [key: (dict1[key], dict2[key]) for key in common_keys if
> dict1[key]!=dict2[key]}
That initial [ should be a { of course.
Cheers,
Chris
--
Follow the path of the Iguana...
On Wed, 24 Dec 2008 05:16:36 -0200, wrote:
I'm looking for suggestions on the best ('Pythonic') way to
determine the difference between 2 very large dictionaries
containing simple key/value pairs.
By difference, I mean a list of keys that are present in the
first dictionary, but not the second.
I'm working with some very large dictionaries (about 25M items
per dictionary) that get filled based on data parsed from text
based log files. I'm using Python dictionaries to track the
frequency of specific combinations of events, eg, I build a key
and then add various timing inf
[email protected] wrote:
> Feedback on my proposed strategies (or better strategies) would be
> greatly appreciated.
Both strategies will work but I'd recommend the second approach since it
uses already tested code written by other people - the chanc
I'm looking for suggestions on the best ('Pythonic') way to
determine the difference between 2 very large dictionaries
containing simple key/value pairs.
By difference, I mean a list of keys that are present in the
first dictionary, but not the second. And vice versa. And a list
of
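A compact sketch of the set-based approach from the replies (dict1, dict2 and the toy values are placeholders, not data from the post):

dict1 = {'a': 1, 'b': 2, 'c': 3}
dict2 = {'b': 2, 'c': 30, 'd': 4}

k1 = set(dict1)            # keys of the first dictionary
k2 = set(dict2)

only_in_1 = k1 - k2        # keys present in dict1 but not in dict2
only_in_2 = k2 - k1        # and vice versa
changed = [k for k in (k1 & k2) if dict1[k] != dict2[k]]   # common keys, differing values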
Raymond Hettinger wrote:
Background: I'm trying to identify duplicate records in very
large text based transaction logs. I'm detecting duplicate
records by creating a SHA1 checksum of each record and using this
checksum as a dictionary key. This works great except for several
files whose size
On 2008-07-31 02:29, [EMAIL PROTECTED] wrote:
Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?
I'm working on a project where my available RAM is limited to 2G
and I would like to use very large dictionaries vs. a tradit
> > Are there any techniques I can use to strip a dictionary data
> > structure down to the smallest memory overhead possible?
Sure. You can build your own version of a dict using
UserDict.DictMixin. The underlying structure can be as space
efficient as you want.
FWIW, dictionaries automaticall
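A minimal illustration of the DictMixin route (Python 2; the class is made up for the example and stores only key presence, which is all a duplicate check needs):

from UserDict import DictMixin

class PresenceDict(DictMixin):
    """Mapping that records only key presence; every stored key maps to None."""
    def __init__(self):
        self._keys = set()
    def __getitem__(self, key):
        if key in self._keys:
            return None
        raise KeyError(key)
    def __setitem__(self, key, value):
        self._keys.add(key)          # the value is deliberately discarded
    def __delitem__(self, key):
        self._keys.remove(key)
    def keys(self):
        return list(self._keys)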
On Wed, Jul 30, 2008 at 8:29 PM, <[EMAIL PROTECTED]> wrote:
> Background: I'm trying to identify duplicate records in very large text
> based transaction logs. I'm detecting duplicate records by creating a SHA1
> checksum of each record and using this checksum as a dictionary key. This
> works gre
On Wed, 30 Jul 2008 21:29:39 -0300, <[EMAIL PROTECTED]> wrote:
Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?
I'm working on a project where my available RAM is limited to 2G
and I would like to use
Are there any techniques I can use to strip a dictionary data
structure down to the smallest memory overhead possible?
I'm working on a project where my available RAM is limited to 2G
and I would like to use very large dictionaries vs. a traditional
database.
Background: I'm trying t
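A minimal sketch of the checksum-keyed duplicate check described above (the filename is illustrative; a set is used since only presence of the digest matters):

import hashlib

seen = set()
duplicates = []
f = open('transactions.log', 'rb')
for record in f:
    digest = hashlib.sha1(record).digest()   # 20-byte binary key, half the size of hexdigest
    if digest in seen:
        duplicates.append(record)
    else:
        seen.add(digest)
f.close()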
Thomas Ganss wrote:
> Klaas wrote:
>
> > 4. Insert your keys in sorted order.
> This advice is questionable -
> it depends on the at least on the db vendor and probably
> sometimes on the sort method, if inserting pre-sorted
> values is better.
The article I wrote that you quoted named a specifi
Marcin Ciura wrote:
> Duncan Booth wrote:
>> Marcin Ciura wrote:
>>>See Figure 8 in
>>>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf
>> That isn't what the reference says. It only covers N up to a few
>> thousand. Practical values of N need to at least go up into the
>> millions.
>
> Pl
Duncan Booth wrote:
> Marcin Ciura wrote:
>>See Figure 8 in
>>http://sun.iinf.polsl.gliwice.pl/~mciura/shellsort.pdf
> That isn't what the reference says. It only covers N up to a few thousand.
> Practical values of N need to at least go up into the millions.
Please look at the performance graph
Marcin Ciura wrote:
> Tim Peters wrote:
>> shellsort is much more a refinement of insertion-sort than of
>> bubblesort, but is not an O(N log N) algorithm regardlesss.
>
> With a judiciously chosen increment sequence,
> the number of comparisons made by shellsort
> *is* of the order N log N for p
Tim Peters wrote:
> shellsort is much more a refinement of insertion-sort than of
> bubblesort, but is not an O(N log N) algorithm regardlesss.
With a judiciously chosen increment sequence,
the number of comparisons made by shellsort
*is* of the order N log N for practical values of N.
See Figure
[Tim Peters]
>> ...
>> O(N log N) sorting algorithms helped by pre-existing order are
>> uncommon, unless they do extra work to detect and exploit
>> pre-existing order.
[Lawrence D'Oliveiro]
> Shellsort works well with nearly-sorted data. It's basically a smarter
> version of bubblesort with much
In article <[EMAIL PROTECTED]>,
"Iain King" <[EMAIL PROTECTED]> wrote:
>Lawrence D'Oliveiro wrote:
>> In article <[EMAIL PROTECTED]>,
>> Scott David Daniels <[EMAIL PROTECTED]> wrote:
>>
>> >For example, time timsort (Python's internal sort) on pre-sorted
>> >data; you'll find it is handled fast
In article <[EMAIL PROTECTED]>,
"Tim Peters" <[EMAIL PROTECTED]> wrote:
>For example, the O(N log N) heapsort is unreasonable compared to the
>O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
>case for heapsort, but a good case for bubblesort)? O(N log N)
>sorting algorithms he
In article <[EMAIL PROTECTED]>,
Tim Peters <[EMAIL PROTECTED]> wrote:
>[Jim Segrave]
>> Actually, presorted lists are not a bad case for heapsort - it's quite
>> immune to any existing order or lack thereof,
>
>Write a heapsort and time it. It's not a difference in O() behavior,
>but more memory m
[Jim Segrave]
> Actually, presorted lists are not a bad case for heapsort - it's quite
> immune to any existing order or lack thereof,
Write a heapsort and time it. It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a m
In article <[EMAIL PROTECTED]>,
Tim Peters <[EMAIL PROTECTED]> wrote:
>[Scott David Daniels]
>>> For example, time timsort (Python's internal sort) on pre-sorted
>>> data; you'll find it is handled faster than random data.
>
>O(N) vs O(N log N), in fact.
>
>[Lawrence D'Oliveiro]
>> But isn't that h
[Scott David Daniels]
>> For example, time timsort (Python's internal sort) on pre-sorted
>> data; you'll find it is handled faster than random data.
O(N) vs O(N log N), in fact.
[Lawrence D'Oliveiro]
> But isn't that how a reasonable sorting algorithm should behave? Less
> work to do if the data
Lawrence D'Oliveiro wrote:
> In article <[EMAIL PROTECTED]>,
> Scott David Daniels <[EMAIL PROTECTED]> wrote:
>
> >For example, time timsort (Python's internal sort) on pre-sorted
> >data; you'll find it is handled faster than random data.
>
> But isn't that how a reasonable sorting algorithm sho
In article <[EMAIL PROTECTED]>,
Lawrence D'Oliveiro <[EMAIL PROTECTED]> wrote:
>In article <[EMAIL PROTECTED]>,
> Scott David Daniels <[EMAIL PROTECTED]> wrote:
>>
>>For example, time timsort (Python's internal sort) on pre-sorted
>>data; you'll find it is handled faster than random data.
>
>But i
Lawrence D'Oliveiro wrote:
> In article <[EMAIL PROTECTED]>,
> Scott David Daniels <[EMAIL PROTECTED]> wrote:
>
>
>>For example, time timsort (Python's internal sort) on pre-sorted
>>data; you'll find it is handled faster than random data.
>
>
> But isn't that how a reasonable sorting algorith
In article <[EMAIL PROTECTED]>,
Scott David Daniels <[EMAIL PROTECTED]> wrote:
>For example, time timsort (Python's internal sort) on pre-sorted
>data; you'll find it is handled faster than random data.
But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data
Roy Smith wrote:
> My guess would be that each resize grows the table by a constant
> factor, which IIRC, works out to amortized O(n).
Right. For dictionaries with fewer than 50000 used entries the size is
multiplied by 4 each time, and doubled each time for larger dictionaries.
> It's not as good as
> c
Thomas Ganss wrote:
> Klaas wrote:
>> 4. Insert your keys in sorted order.
> This advice is questionable -
>
> My gut feeling on this matter is:
> IF the insert times of pre-sorted values is far better
> than the times of unsorted values, there is a chance
> that the resulting tree is unbala
Klaas wrote:
> 4. Insert your keys in sorted order.
This advice is questionable -
it depends on the at least on the db vendor and probably
sometimes on the sort method, if inserting pre-sorted
values is better.
My gut feeling on this matter is:
IF the insert times of pre-sorted values is far be
Chris:
> class StorageBerkeleyDB(StorageTest):
>     def runtest(self, number_hash):
>         db = bsddb.hashopen(None, flag='c', cachesize=8192)
>         for (num, wildcard_digits) in number_hash.keys():
>             key = '%d:%d' % (num, wildcard_digits)
>             db[key] = None
>         db.close()
Chris:
> Berkeley DB is great for accessing data by key for things already
> stored on disk (i.e. read access), but write performance for each
> key-value pair is slow due to it being careful about flushing
> writes to disk by default.
This is absolutely false.
-Mike
Chris Foote wrote:
> Richie Hindle wrote:
>> [Chris]
>>> Has anyone written a fast hash module which is more optimal for
>>> large datasets ?
>>
>> PyJudy might be what you're looking for, though I've never used it:
>>
>> http://www.dalkescientific.com/Python/PyJudy.html
>>
>> "Judy's key benefit
Chris Foote wrote:
> Claudio Grondi wrote:
>> Chris Foote wrote:
>>> Klaas wrote:
>>>
> 22.2s 20m25s[3]
20m to insert 1m keys? You are doing something wrong.
>>>
>>> I've put together some simplified test code, but the bsddb
>>> module gives 11m for 1M keys:
>>>
>> I have run your c
Claudio Grondi wrote:
> Chris Foote wrote:
>> Klaas wrote:
>>
22.2s 20m25s[3]
>>>
>>> 20m to insert 1m keys? You are doing something wrong.
>>
>> I've put together some simplified test code, but the bsddb
>> module gives 11m for 1M keys:
>>
> I have run your code for the bsddb on my P4 2.8 G
Richie Hindle wrote:
> [Chris]
>> Has anyone written a fast hash module which is more optimal for
>> large datasets ?
>
> PyJudy might be what you're looking for, though I've never used it:
>
> http://www.dalkescientific.com/Python/PyJudy.html
>
> "Judy's key benefits are scalability, high per
Chris Foote wrote:
> Klaas wrote:
>
>>> 22.2s 20m25s[3]
>>
>>
>> 20m to insert 1m keys? You are doing something wrong.
>
>
> Hi Mike.
>
> I've put together some simplified test code, but the bsddb
> module gives 11m for 1M keys:
>
I have run your code for the bsddb on my P4 2.8 GHz and have
Klaas wrote:
22.2s 20m25s[3]
20m to insert 1m keys? You are doing something wrong.
Hi Mike.
I've put together some simplified test code, but the bsddb
module gives 11m for 1M keys:
Number generator test for 100 number ranges
with a maximum of 3 wildcard digits.
Wed May 17 22:1
bob> If you have the same number of entries as buckets, and you have a
bob> good hash function, then if you have n buckets your longest chain
bob> should have length around ln(n). The average length of a nonempty
bob> bucket would be somewhere around 1 1/2.
Yes, and it achieves t
Roy> If you're getting long hash chains, you're either using a bad hash
Roy> function, or your table isn't big enough.
Sure. The original poster said something about 10 million keys I think.
Unless the key distribution is amazingly fortuitous and the number of unique
values is small, the
>22.2s 20m25s[3]
20m to insert 1m keys? You are doing something wrong.
With bdb's it is crucial to insert keys in bytestring-sorted order.
Also, be sure to give it a decent amount of cache.
-Mike
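An illustrative sketch of that advice with the bsddb module (the filename, cache size and key format are assumptions, not values from the thread):

import bsddb

db = bsddb.btopen('numbers.db', 'c', cachesize=10 * 1024 * 1024)
keys = ['%09d' % n for n in (42, 7, 100000, 3)]   # fixed width: string sort == numeric sort
for key in sorted(keys):                          # insert in bytestring-sorted order
    db[key] = ''
db.close()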
If you have the same number of entries as buckets, and you have a good
hash function, then if you have n buckets your longest chain should
have length around ln(n). The average length of a nonempty bucket
would be somewhere around 1 1/2.
same value? For small
>dictionaries keeping the max keys that hash to the same value small isn't a
>huge problem. For large dictionaries (millions of keys) might you have some
>long chains? Or in an effort to reduce the chain length, wind up using so
>much virtual memory that you
On 5/16/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Graham> Looking up a key in a dictionary is done in constant-time,
> Graham> i.e. it doesn't matter how large the dictionary is.
>
> Doesn't that depend on how many keys hash to the same value? For small
> dictionaries keeping th
isn't a
huge problem. For large dictionaries (millions of keys) might you have some
long chains? Or in an effort to reduce the chain length, wind up using so
much virtual memory that you wind up wrecking performance by swapping?
Skip
Casey Hawthorne wrote:
> For Large Dictionaries Could One Use Separate Dictionaries Where Each
> Dictionary Covers an Interval of the Input Range?
One Could, But Why? :-) You wouldn't see any performance improvements.
Looking up a key in a dictionary is done in constant-time, i.e. it doesn't
matter how large the dictionary is.
For Large Dictionaries Could One Use Separate Dictionaries Where Each
Dictionary Covers an Interval of the Input Range?
You would need to know something of the input range and its
uniformity!
Self-balancing between dictionaries for separate dictionaries?
--
Regards,
Casey