date:20191223

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Kyle Stanley

Larry Hastings wrote:
> a hypothetical collections.OrderedSet would probably work just fine.

I'm also in agreement with adding a collections.OrderedSet. There certainly
seems to be a practical use case for a Set that preserves insertion order,
but with significant push back to modifying the existing Set implementation
(primarily due to performance concerns).

If later down the road an improvement to OrderedSet is made that gives it
equal or better average performance across the board compared to Set, we
can consider changing the default implementation *if* there's significant
demand for it.

Larry Hastings wrote:
> And he'd probably use it too, as that makes the code clearer / easier to
read.  If you read code using an OrderedSet, you know what the intent was
and what the semantics are.  If you see code using a dict but setting the
values to None, you think "that's crazy" and now you have a puzzle to
solve.  Let's hope those comments are accurate!

This is an excellent point. My first thought if someone were to be using a
Dict with all values set to None would be: why not use a Set here? As Larry
said, the need for preservation of insertion order would have to be
explicitly described in the code comments. Realistically speaking, code
comments are not something that can be consistently relied upon, especially
if it's part of some boilerplate format that gets replicated or if the
container is used in multiple locations. I'd much rather have a dedicated
data structure that clearly describes what it does based on the name alone,
IMO that's a million times better for readability purposes.

Also, this is mostly speculation since I haven't ran any benchmarks for an
OrderedSet implementation, but I strongly suspect that OrderedSet will end
up having better average performance for add, pop, and update than trying
to use a dictionary with None values (assuming it's implemented in C or at
least with a C accelerator).

Not to mention allowing the ability to update (for both addition and
removal) based on intersections and unions across ordered sets, which
currently isn't possible to do with dictionaries (at least not directly
with |=, &=, -=. and ^=). I'm not sure how much of a practical use case
this would have for ordered sets specifically, but it's a nice bonus.

On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings  wrote:

>
> (mixing together messages in the thread, sorry threaded-view readers)
>
> On 12/19/19 3:15 PM, Tim Peters wrote:
>
> [Nick]
>
> I took Larry's request a slightly different way:
>
> [...]
>
> Only Larry can answer whether that would meet his perceived need.  My
> _guess_ is that he wouldn't know OrderedSet existed, and, even if he
> did, he'd use a dict with None values anyway (because it's less hassle
> and does everything he wanted).
>
>
> At last, something I can speak knowledgeably about: Larry's use case!
> Regarding Larry, I'd say
>
>1. his use case was small enough that almost anything maintaining
>order would have worked, even a list,
>2. he often *does have* a pretty good idea what goodies are salted
>away in the Python standard library, and
>3. a hypothetical collections.OrderedSet would probably work just
>fine.  And he'd probably use it too, as that makes the code clearer /
>easier to read.  If you read code using an OrderedSet, you know what the
>intent was and what the semantics are.  If you see code using a dict but
>setting the values to None, you think "that's crazy" and now you have a
>puzzle to solve.  Let's hope those comments are accurate!
>
>
> Also, speaking personally, at least once (maybe twice?) in this thread
> folks have asked "would YOU, Mr Larry, really want ordered sets if it meant
> sets were slightly slower?"
>
> The first thing I'd say is, I'm not sure why people should care about
> what's best for me.  That's sweet of you!  But you really shouldn't.
>
> The second thing is, my needs are modest, so the speed hit we're talking
> about would likely be statistically insignificant, for me.
>
> And the third thing is, I don't really put the set() API through much of a
> workout anyway.  I build sets, I add and remove items, I iterate over them,
> I do the occasional union, and only very rarely do I use anything else.
> Even then, when I write code where I reach for a fancy operation like
> intersection or symmetric_difference--what tends to happen is, I have a
> rethink and a refactor, and after I simplify my code I don't need the fancy
> operations anymore.  I can't remember the last time I used one of these
> fancy operations where the code really stuck around for a long time.
>
> So, personally?  Sure, I'd take that tradeoff.  As already established, I
> like that dict objects maintain their order, and I think it'd be swell if
> sets maintained their order too.  I suspect the performance hit wouldn't
> affect *me* in any meaningful way, and I could benefit from the
> order-maintaining semantics.  I bet it'd have other benefits

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Guido van Rossum

To begin with, please compare performance of dict-with-None-values to that
of a set, for operations that can be expressed by both (e.g. both have
update()).

On Mon, Dec 23, 2019 at 18:08 Kyle Stanley  wrote:

> Larry Hastings wrote:
> > a hypothetical collections.OrderedSet would probably work just fine.
>
> I'm also in agreement with adding a collections.OrderedSet. There
> certainly seems to be a practical use case for a Set that preserves
> insertion order, but with significant push back to modifying the existing
> Set implementation (primarily due to performance concerns).
>
> If later down the road an improvement to OrderedSet is made that gives it
> equal or better average performance across the board compared to Set, we
> can consider changing the default implementation *if* there's significant
> demand for it.
>
> Larry Hastings wrote:
> > And he'd probably use it too, as that makes the code clearer / easier to
> read.  If you read code using an OrderedSet, you know what the intent was
> and what the semantics are.  If you see code using a dict but setting the
> values to None, you think "that's crazy" and now you have a puzzle to
> solve.  Let's hope those comments are accurate!
>
> This is an excellent point. My first thought if someone were to be using a
> Dict with all values set to None would be: why not use a Set here? As Larry
> said, the need for preservation of insertion order would have to be
> explicitly described in the code comments. Realistically speaking, code
> comments are not something that can be consistently relied upon, especially
> if it's part of some boilerplate format that gets replicated or if the
> container is used in multiple locations. I'd much rather have a dedicated
> data structure that clearly describes what it does based on the name alone,
> IMO that's a million times better for readability purposes.
>
> Also, this is mostly speculation since I haven't ran any benchmarks for an
> OrderedSet implementation, but I strongly suspect that OrderedSet will end
> up having better average performance for add, pop, and update than trying
> to use a dictionary with None values (assuming it's implemented in C or at
> least with a C accelerator).
>
> Not to mention allowing the ability to update (for both addition and
> removal) based on intersections and unions across ordered sets, which
> currently isn't possible to do with dictionaries (at least not directly
> with |=, &=, -=. and ^=). I'm not sure how much of a practical use case
> this would have for ordered sets specifically, but it's a nice bonus.
>
> On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings  wrote:
>
>>
>> (mixing together messages in the thread, sorry threaded-view readers)
>>
>> On 12/19/19 3:15 PM, Tim Peters wrote:
>>
>> [Nick]
>>
>> I took Larry's request a slightly different way:
>>
>> [...]
>>
>> Only Larry can answer whether that would meet his perceived need.  My
>> _guess_ is that he wouldn't know OrderedSet existed, and, even if he
>> did, he'd use a dict with None values anyway (because it's less hassle
>> and does everything he wanted).
>>
>>
>> At last, something I can speak knowledgeably about: Larry's use case!
>> Regarding Larry, I'd say
>>
>>1. his use case was small enough that almost anything maintaining
>>order would have worked, even a list,
>>2. he often *does have* a pretty good idea what goodies are salted
>>away in the Python standard library, and
>>3. a hypothetical collections.OrderedSet would probably work just
>>fine.  And he'd probably use it too, as that makes the code clearer /
>>easier to read.  If you read code using an OrderedSet, you know what the
>>intent was and what the semantics are.  If you see code using a dict but
>>setting the values to None, you think "that's crazy" and now you have a
>>puzzle to solve.  Let's hope those comments are accurate!
>>
>>
>> Also, speaking personally, at least once (maybe twice?) in this thread
>> folks have asked "would YOU, Mr Larry, really want ordered sets if it meant
>> sets were slightly slower?"
>>
>> The first thing I'd say is, I'm not sure why people should care about
>> what's best for me.  That's sweet of you!  But you really shouldn't.
>>
>> The second thing is, my needs are modest, so the speed hit we're talking
>> about would likely be statistically insignificant, for me.
>>
>> And the third thing is, I don't really put the set() API through much of
>> a workout anyway.  I build sets, I add and remove items, I iterate over
>> them, I do the occasional union, and only very rarely do I use anything
>> else.  Even then, when I write code where I reach for a fancy operation
>> like intersection or symmetric_difference--what tends to happen is, I have
>> a rethink and a refactor, and after I simplify my code I don't need the
>> fancy operations anymore.  I can't remember the last time I used one of
>> these fancy operations where the code really stuck around for a long time.
>>
>> So, person

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread David Mertz

Even though I was the first person in this thread to suggest
collections.OrderedSet, I'm "meh" about it now. As I read more and played
with the sortedcollections package, it seemed to me that while I might want
a set that iterated in a determinate and meaningful order moderately often,
insertion order would make up a small share of those use cases.

On Mon, Dec 23, 2019, 8:05 PM Kyle Stanley  wrote:

> Larry Hastings wrote:
> > a hypothetical collections.OrderedSet would probably work just fine.
>
> I'm also in agreement with adding a collections.OrderedSet. There
> certainly seems to be a practical use case for a Set that preserves
> insertion order, but with significant push back to modifying the existing
> Set implementation (primarily due to performance concerns).
>
> If later down the road an improvement to OrderedSet is made that gives it
> equal or better average performance across the board compared to Set, we
> can consider changing the default implementation *if* there's significant
> demand for it.
>
> Larry Hastings wrote:
> > And he'd probably use it too, as that makes the code clearer / easier to
> read.  If you read code using an OrderedSet, you know what the intent was
> and what the semantics are.  If you see code using a dict but setting the
> values to None, you think "that's crazy" and now you have a puzzle to
> solve.  Let's hope those comments are accurate!
>
> This is an excellent point. My first thought if someone were to be using a
> Dict with all values set to None would be: why not use a Set here? As Larry
> said, the need for preservation of insertion order would have to be
> explicitly described in the code comments. Realistically speaking, code
> comments are not something that can be consistently relied upon, especially
> if it's part of some boilerplate format that gets replicated or if the
> container is used in multiple locations. I'd much rather have a dedicated
> data structure that clearly describes what it does based on the name alone,
> IMO that's a million times better for readability purposes.
>
> Also, this is mostly speculation since I haven't ran any benchmarks for an
> OrderedSet implementation, but I strongly suspect that OrderedSet will end
> up having better average performance for add, pop, and update than trying
> to use a dictionary with None values (assuming it's implemented in C or at
> least with a C accelerator).
>
> Not to mention allowing the ability to update (for both addition and
> removal) based on intersections and unions across ordered sets, which
> currently isn't possible to do with dictionaries (at least not directly
> with |=, &=, -=. and ^=). I'm not sure how much of a practical use case
> this would have for ordered sets specifically, but it's a nice bonus.
>
> On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings  wrote:
>
>>
>> (mixing together messages in the thread, sorry threaded-view readers)
>>
>> On 12/19/19 3:15 PM, Tim Peters wrote:
>>
>> [Nick]
>>
>> I took Larry's request a slightly different way:
>>
>> [...]
>>
>> Only Larry can answer whether that would meet his perceived need.  My
>> _guess_ is that he wouldn't know OrderedSet existed, and, even if he
>> did, he'd use a dict with None values anyway (because it's less hassle
>> and does everything he wanted).
>>
>>
>> At last, something I can speak knowledgeably about: Larry's use case!
>> Regarding Larry, I'd say
>>
>>1. his use case was small enough that almost anything maintaining
>>order would have worked, even a list,
>>2. he often *does have* a pretty good idea what goodies are salted
>>away in the Python standard library, and
>>3. a hypothetical collections.OrderedSet would probably work just
>>fine.  And he'd probably use it too, as that makes the code clearer /
>>easier to read.  If you read code using an OrderedSet, you know what the
>>intent was and what the semantics are.  If you see code using a dict but
>>setting the values to None, you think "that's crazy" and now you have a
>>puzzle to solve.  Let's hope those comments are accurate!
>>
>>
>> Also, speaking personally, at least once (maybe twice?) in this thread
>> folks have asked "would YOU, Mr Larry, really want ordered sets if it meant
>> sets were slightly slower?"
>>
>> The first thing I'd say is, I'm not sure why people should care about
>> what's best for me.  That's sweet of you!  But you really shouldn't.
>>
>> The second thing is, my needs are modest, so the speed hit we're talking
>> about would likely be statistically insignificant, for me.
>>
>> And the third thing is, I don't really put the set() API through much of
>> a workout anyway.  I build sets, I add and remove items, I iterate over
>> them, I do the occasional union, and only very rarely do I use anything
>> else.  Even then, when I write code where I reach for a fancy operation
>> like intersection or symmetric_difference--what tends to happen is, I have
>> a rethink and a refactor, and after I simplify

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Kyle Stanley

> To begin with, please compare performance of dict-with-None-values to
that of a set, for operations that can be expressed by both (e.g. both have
update()).

Good idea. It would be useful to have a baseline comparison. I emboldened
the comparisons with a notable difference.

Tested on:
Version: Python 3.8.0
OS: Linux kernel 5.4.5
CPU: Intel i5-4460

Initialization (about the same):
>>> timeit.timeit("set(i for i in range(1000))", number=100_000)
4.878508095000143
>>> timeit.timeit("{i: None for i in range(1000)}", number=100_000)
4.9192055170024105

Membership (about the same):
>>> timeit.timeit("random.randint(0, 999) in s", setup="import random; s =
set(i for i in range(1000))", number=10_000_000)
7.992674231001729
>>> timeit.timeit("random.randint(0, 999) in d", setup="import random; d =
{i: None for i in range(1000)}", number=10_000_000)
8.032404395999038

*Add* (much faster for dicts):
>>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
13.330938750001224
>>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
5.788865385999088

*Update* (much faster for updating sets):
>>> timeit.timeit("s.update(other_s)", setup="s = set(); other_s = set(i
for i in range(1000))", number=1_000_000)
6.2428832090008655
>>> timeit.timeit("d.update(other_d)", setup="d = {}; other_d = {i: None
for i in range(1000)}", number=1_000_000)
13.031371458000649

*Create list from keys* (faster for sets):
>>> timeit.timeit("list(s)", setup="s = set(i for i in range(1000))",
number=10_00_000)
5.24364303600305
>>> timeit.timeit("list(d)", setup="d = {i: None for i in range(1000)}",
number=10_00_000)
6.303017043999716

Removal isn't easily comparable, since s.pop(), s.discard(), d.pop(), and
d.popitem() all behave quite differently. Also, I'm sure these benchmarks
could be improved significantly, particularly with the "Add" ones. This
should provide a decent general idea though.

On Mon, Dec 23, 2019 at 8:12 PM Guido van Rossum  wrote:

> To begin with, please compare performance of dict-with-None-values to that
> of a set, for operations that can be expressed by both (e.g. both have
> update()).
>
> On Mon, Dec 23, 2019 at 18:08 Kyle Stanley  wrote:
>
>> Larry Hastings wrote:
>> > a hypothetical collections.OrderedSet would probably work just fine.
>>
>> I'm also in agreement with adding a collections.OrderedSet. There
>> certainly seems to be a practical use case for a Set that preserves
>> insertion order, but with significant push back to modifying the existing
>> Set implementation (primarily due to performance concerns).
>>
>> If later down the road an improvement to OrderedSet is made that gives it
>> equal or better average performance across the board compared to Set, we
>> can consider changing the default implementation *if* there's
>> significant demand for it.
>>
>> Larry Hastings wrote:
>> > And he'd probably use it too, as that makes the code clearer / easier
>> to read.  If you read code using an OrderedSet, you know what the intent
>> was and what the semantics are.  If you see code using a dict but setting
>> the values to None, you think "that's crazy" and now you have a puzzle to
>> solve.  Let's hope those comments are accurate!
>>
>> This is an excellent point. My first thought if someone were to be using
>> a Dict with all values set to None would be: why not use a Set here? As
>> Larry said, the need for preservation of insertion order would have to be
>> explicitly described in the code comments. Realistically speaking, code
>> comments are not something that can be consistently relied upon, especially
>> if it's part of some boilerplate format that gets replicated or if the
>> container is used in multiple locations. I'd much rather have a dedicated
>> data structure that clearly describes what it does based on the name alone,
>> IMO that's a million times better for readability purposes.
>>
>> Also, this is mostly speculation since I haven't ran any benchmarks for
>> an OrderedSet implementation, but I strongly suspect that OrderedSet will
>> end up having better average performance for add, pop, and update than
>> trying to use a dictionary with None values (assuming it's implemented in C
>> or at least with a C accelerator).
>>
>> Not to mention allowing the ability to update (for both addition and
>> removal) based on intersections and unions across ordered sets, which
>> currently isn't possible to do with dictionaries (at least not directly
>> with |=, &=, -=. and ^=). I'm not sure how much of a practical use case
>> this would have for ordered sets specifically, but it's a nice bonus.
>>
>> On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings 
>> wrote:
>>
>>>
>>> (mixing together messages in the thread, sorry threaded-view readers)
>>>
>>> On 12/19/19 3:15 PM, Tim Peters wrote:
>>>
>>> [Nick]
>>>
>>> I took Larry's request a slightly different way:
>>>
>>> [...]
>>>
>>> Only Larry can answer whether that would meet his perceived need.  My
>>> _guess_ is that he wouldn't know OrderedSet exis

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Chris Angelico

On Tue, Dec 24, 2019 at 1:57 PM Kyle Stanley  wrote:
> Add (much faster for dicts):
> >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
> 13.330938750001224
> >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
> 5.788865385999088

I think this is an artifact of Python not having an empty set literal.

>>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
13.275540543720126
>>> timeit.timeit("d = dict(); d[0] = None", number=100_000_000)
13.044076398015022
>>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
6.088695731014013
>>> timeit.timeit("s = {1}; s.add(0)", number=100_000_000)
9.260965215042233
>>> timeit.timeit("d = {1:2}; d[0] = None", number=100_000_000)
8.75433829985559

When both use the constructor call or both use a literal, the
difference is far smaller. I'd call this one a wash.

ChrisA
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ODZYHNI57MFZD3I7TGP3B3HJTRX36KGB/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Stephen J. Turnbull

David Mertz writes:

 > Even though I was the first person in this thread to suggest
 > collections.OrderedSet, I'm "meh" about it now. As I read more and played
 > with the sortedcollections package, it seemed to me that while I might want
 > a set that iterated in a determinate and meaningful order moderately often,
 > insertion order would make up a small share of those use cases.

On the other hand, insertion order is one of the most prominent of the
determinate meaningful orders where you would have to do ugly things
to use "sorted" to get that order.  Any application where you have an
unreliable message bus feeding a queue (so that you might get
duplicate objects but it's bad to process the same object twice) would
be a potential application of insertion-ordered sets.

Steve

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ZGSW6MJ4HGHUG65TLWUDHQMZXCW2GR7Q/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Tim Peters

On Mon, Dec 23, 2019 at 8:56 PM Kyle Stanley  wrote:
>
> > To begin with, please compare performance of dict-with-None-values to that 
> > of a set, for operations that can be expressed by both (e.g. both have 
> > update()).
>
> Good idea. It would be useful to have a baseline comparison. I emboldened the 
> comparisons with a notable difference.
>
> Tested on:
> Version: Python 3.8.0
> OS: Linux kernel 5.4.5
> CPU: Intel i5-4460
>
> Initialization (about the same):
> >>> timeit.timeit("set(i for i in range(1000))", number=100_000)
> 4.878508095000143
> >>> timeit.timeit("{i: None for i in range(1000)}", number=100_000)
> 4.9192055170024105
>
> Membership (about the same):
> >>> timeit.timeit("random.randint(0, 999) in s", setup="import random; s = 
> >>> set(i for i in range(1000))", number=10_000_000)
> 7.992674231001729
> >>> timeit.timeit("random.randint(0, 999) in d", setup="import random; d = 
> >>> {i: None for i in range(1000)}", number=10_000_000)
> 8.032404395999038
>
> Add (much faster for dicts):
> >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
> 13.330938750001224
> >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
> 5.788865385999088
>
> Update (much faster for updating sets):
> >>> timeit.timeit("s.update(other_s)", setup="s = set(); other_s = set(i for 
> >>> i in range(1000))", number=1_000_000)
> 6.2428832090008655
> >>> timeit.timeit("d.update(other_d)", setup="d = {}; other_d = {i: None for 
> >>> i in range(1000)}", number=1_000_000)
> 13.031371458000649
>
> Create list from keys (faster for sets):
> >>> timeit.timeit("list(s)", setup="s = set(i for i in range(1000))", 
> >>> number=10_00_000)
> 5.24364303600305
> >>> timeit.timeit("list(d)", setup="d = {i: None for i in range(1000)}", 
> >>> number=10_00_000)
> 6.303017043999716
>
> Removal isn't easily comparable, since s.pop(), s.discard(), d.pop(), and 
> d.popitem() all behave quite differently. Also, I'm sure these benchmarks 
> could be improved significantly, particularly with the "Add" ones. This 
> should provide a decent general idea though.
>
> On Mon, Dec 23, 2019 at 8:12 PM Guido van Rossum  wrote:
>>
>> To begin with, please compare performance of dict-with-None-values to that 
>> of a set, for operations that can be expressed by both (e.g. both have 
>> update()).
>>
>> On Mon, Dec 23, 2019 at 18:08 Kyle Stanley  wrote:
>>>
>>> Larry Hastings wrote:
>>> > a hypothetical collections.OrderedSet would probably work just fine.
>>>
>>> I'm also in agreement with adding a collections.OrderedSet. There certainly 
>>> seems to be a practical use case for a Set that preserves insertion order, 
>>> but with significant push back to modifying the existing Set implementation 
>>> (primarily due to performance concerns).
>>>
>>> If later down the road an improvement to OrderedSet is made that gives it 
>>> equal or better average performance across the board compared to Set, we 
>>> can consider changing the default implementation if there's significant 
>>> demand for it.
>>>
>>> Larry Hastings wrote:
>>> > And he'd probably use it too, as that makes the code clearer / easier to 
>>> > read.  If you read code using an OrderedSet, you know what the intent was 
>>> > and what the semantics are.  If you see code using a dict but setting the 
>>> > values to None, you think "that's crazy" and now you have a puzzle to 
>>> > solve.  Let's hope those comments are accurate!
>>>
>>> This is an excellent point. My first thought if someone were to be using a 
>>> Dict with all values set to None would be: why not use a Set here? As Larry 
>>> said, the need for preservation of insertion order would have to be 
>>> explicitly described in the code comments. Realistically speaking, code 
>>> comments are not something that can be consistently relied upon, especially 
>>> if it's part of some boilerplate format that gets replicated or if the 
>>> container is used in multiple locations. I'd much rather have a dedicated 
>>> data structure that clearly describes what it does based on the name alone, 
>>> IMO that's a million times better for readability purposes.
>>>
>>> Also, this is mostly speculation since I haven't ran any benchmarks for an 
>>> OrderedSet implementation, but I strongly suspect that OrderedSet will end 
>>> up having better average performance for add, pop, and update than trying 
>>> to use a dictionary with None values (assuming it's implemented in C or at 
>>> least with a C accelerator).
>>>
>>> Not to mention allowing the ability to update (for both addition and 
>>> removal) based on intersections and unions across ordered sets, which 
>>> currently isn't possible to do with dictionaries (at least not directly 
>>> with |=, &=, -=. and ^=). I'm not sure how much of a practical use case 
>>> this would have for ordered sets specifically, but it's a nice bonus.
>>>
>>> On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings  wrote:


 (mixing together messages in the thread, sorry threaded-vi

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Tim Peters

Sorry!  A previous attempt to reply got sent before I typed anything :-(

Very briefly:

> >>> timeit.timeit("set(i for i in range(1000))", number=100_000)

[and other examples using a range of integers]

The collision resolution strategy for sets evolved to be fancier than
for dicts, to reduce cache misses.  This is important for sets because
the _only_ interesting thing about an element wrt a set is whether or
not the set contains it.   Lookup speed is everything for sets.

If you use a contiguous range of "reasonable" integers for keys, the
integer hash function is perfect:  there's never a collision.  So any
such test misses all the work Raymond did to speed set lookups.
String keys have sufficiently "random" hashes to reliably create
collisions, though.  Cache misses can, of course, have massive effects
on timing.

> Add (much faster for dicts):
> >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
> 13.330938750001224
> >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
> 5.788865385999088

In the former case you're primarily measuring the time to look up the
"add" method of sets (that's more expensive than adding 0 to the set).
A better comparison would, e.g., move `add = s.add` to a setup line,
and use plain "add(0)" in the loop.

That's it!
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/5UKYDB36WDRRIYOLRD52QS4I43SCAAJ6/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Kyle Stanley

Chris Angelico wrote:
> I think this is an artifact of Python not having an empty set literal.
> [snip]
> When both use the constructor call or both use a literal, the
> difference is far smaller. I'd call this one a wash.

Ah, good catch. I hadn't considered that it would make a substantial
difference, but that makes sense. Here's an additional comparison between
"{}"  and "dict()" to confirm it:

>>> timeit.timeit("{}", number=100_000_000)
2.1038335599987477
>>> timeit.timeit("dict()", number=100_000_000)
10.22558353268

Some more in-depth comparisons might be required for addition of single
items to the containers. I might experiment further with this at some point
in the next week or so, likely with implementing proper tests that omit the
time to initialize the container.








On Mon, Dec 23, 2019 at 10:09 PM Chris Angelico  wrote:

> On Tue, Dec 24, 2019 at 1:57 PM Kyle Stanley  wrote:
> > Add (much faster for dicts):
> > >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
> > 13.330938750001224
> > >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
> > 5.788865385999088
>
> I think this is an artifact of Python not having an empty set literal.
>
> >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
> 13.275540543720126
> >>> timeit.timeit("d = dict(); d[0] = None", number=100_000_000)
> 13.044076398015022
> >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
> 6.088695731014013
> >>> timeit.timeit("s = {1}; s.add(0)", number=100_000_000)
> 9.260965215042233
> >>> timeit.timeit("d = {1:2}; d[0] = None", number=100_000_000)
> 8.75433829985559
>
> When both use the constructor call or both use a literal, the
> difference is far smaller. I'd call this one a wash.
>
> ChrisA
> ___
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/ODZYHNI57MFZD3I7TGP3B3HJTRX36KGB/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/COUO6PFHUADHYP5KSHMPHPIUCOAXS56L/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Kyle Stanley

Tim Peters wrote:
> Sorry!  A previous attempt to reply got sent before I typed anything :-(

No worries, I only saw the link in the footer to the PSF CoC, and I was
mildly concerned for a moment. My first thought was "Oh no, what did I
do?". Thanks for clearing that up (:

> The collision resolution strategy for sets evolved to be fancier than
> for dicts, to reduce cache misses.  This is important for sets because
> the _only_ interesting thing about an element wrt a set is whether or
> not the set contains it.   Lookup speed is everything for sets.

Huh, that's quite interesting. For some reason, I had assumed in the back
of my head (without giving it much thought) that the average collision rate
would be the same for set items and dict keys. Thanks for the useful
information.

> If you use a contiguous range of "reasonable" integers for keys, the
> integer hash function is perfect:  there's never a collision.  So any
> such test misses all the work Raymond did to speed set lookups.
> String keys have sufficiently "random" hashes to reliably create
> collisions, though.  Cache misses can, of course, have massive effects
> on timing.

Ah, I forgot to consider how the hash function actually works for
continuous integers. A better comparison to demonstrate the collision
differences would likely use random strings.

Also, I believe that max "reasonable" integer range of no collision is
(-2305843009213693951, 2305843009213693951), as defined in Include/pyhash.h
(
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/pyhash.h#L28
).

> In the former case you're primarily measuring the time to look up the
> "add" method of sets (that's more expensive than adding 0 to the set).
> A better comparison would, e.g., move `add = s.add` to a setup line,
> and use plain "add(0)" in the loop.

Good point, there's also what Chris Angelico pointed out as well: dicts
have a significant advantage in this case for having a literal for
initialization. From my understanding, this ends up having a minimal
performance impact in most realistic code (similar to method lookup time),
but it makes a very substantial difference in this case since
initialization of the containers takes up a significant portion of each
iteration.

I suspect my initialization comparison might also be flawed for similar
reasons, but I'm glad that at least 3/5 of my comparisons were mostly
reasonable. So far, the performance difference between set.update() and
dict.update() were the most notable.

I'll likely spend some more time experimenting in the next couple of weeks
or so, and report back if I find any interesting results. In the meantime,
anyone can certainly feel free to improve upon my existing comparisons.

On Mon, Dec 23, 2019 at 11:03 PM Tim Peters  wrote:

> Sorry!  A previous attempt to reply got sent before I typed anything :-(
>
> Very briefly:
>
> > >>> timeit.timeit("set(i for i in range(1000))", number=100_000)
>
> [and other examples using a range of integers]
>
> The collision resolution strategy for sets evolved to be fancier than
> for dicts, to reduce cache misses.  This is important for sets because
> the _only_ interesting thing about an element wrt a set is whether or
> not the set contains it.   Lookup speed is everything for sets.
>
> If you use a contiguous range of "reasonable" integers for keys, the
> integer hash function is perfect:  there's never a collision.  So any
> such test misses all the work Raymond did to speed set lookups.
> String keys have sufficiently "random" hashes to reliably create
> collisions, though.  Cache misses can, of course, have massive effects
> on timing.
>
> > Add (much faster for dicts):
> > >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
> > 13.330938750001224
> > >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
> > 5.788865385999088
>
> In the former case you're primarily measuring the time to look up the
> "add" method of sets (that's more expensive than adding 0 to the set).
> A better comparison would, e.g., move `add = s.add` to a setup line,
> and use plain "add(0)" in the loop.
>
> That's it!
>
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/LDAC4526CZZWFMXORSOKPW2PM6HZE252/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Tim Peters

[Kyle]
> ...
> For some reason, I had assumed in the back of my head (without
> giving it much thought) that the average collision rate would be the
> same for set items and dict keys. Thanks for the useful information.

I know the theoretical number of probes for dicts, but not for sets
anymore.  The latter use a mix of probe strategies now, "randomish"
jumps (same as for dicts) but also purely linear ("up by 1") probing
to try to exploit L1 cache.

It's not _apparent_ to me that the mix actually helps ;-)  I just
trust that Raymond timed stuff until he was sure it does.

> ...
> Ah, I forgot to consider how the hash function actually works for continuous
> integers. A better comparison to demonstrate the collision differences would
> likely use random strings.

And, at an extreme, a class with a __hash__ that always returns the
same integer.

> Also, I believe that max "reasonable" integer range of no collision
> is (-2305843009213693951, 2305843009213693951), ...

Any range  that does _not_ contain both -2 and -1 (-1 is an annoying
special case, with hash(-1) == hash(-2) == -2), and spans no more than
sys.hash_info.modulus integers.  Apart from that, the sign and
magnitude of the start of the range don't matter; e.g.,

>>> len(set(hash(i) for i in range(10**5000, 10**5000 + 100)))
100
>>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 100)))
100
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/FXTRZLOHSWYV5KEGJ4DYNS7HOMUOPVE5/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Tim Peters

>> Also, I believe that max "reasonable" integer range of no collision
>> is (-2305843009213693951, 2305843009213693951), ...

> Any range  that does _not_ contain both -2 and -1 (-1 is an annoying
> special case, with hash(-1) == hash(-2) == -2), and spans no more than
> sys.hash_info.modulus integers.  Apart from that, the sign and
> magnitude of the start of the range don't matter; e.g.,
>
> >>> len(set(hash(i) for i in range(10**5000, 10**5000 + 100)))
> 100
> >>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 100)))
> 100

Sorry again!  That only shows that the hash codes are unique.  Dicts
and sets use only the last N bits to determine the start of the probe
sequence, for some value of N >= 3 (depending on the table size).  So
for a table of size a million, the last 20 bits (100 ~= 2**20) are
interesting.  But same thing:

>>> MASK = (1 << 20) - 1
>>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 100)))
100
>>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 100)))
100
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/BXHOS73YMJDSVZUM74VYXXYN5WW63BZ4/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

[Python-Dev] Re: Should set objects maintain insertion order too?

12 matches

Site Navigation

Mail list logo

Footer information