[Python-Dev] Re: Should set objects maintain insertion order too?
Larry Hastings wrote: > a hypothetical collections.OrderedSet would probably work just fine. I'm also in agreement with adding a collections.OrderedSet. There certainly seems to be a practical use case for a Set that preserves insertion order, but with significant push back to modifying the existing Set implementation (primarily due to performance concerns). If later down the road an improvement to OrderedSet is made that gives it equal or better average performance across the board compared to Set, we can consider changing the default implementation *if* there's significant demand for it. Larry Hastings wrote: > And he'd probably use it too, as that makes the code clearer / easier to read. If you read code using an OrderedSet, you know what the intent was and what the semantics are. If you see code using a dict but setting the values to None, you think "that's crazy" and now you have a puzzle to solve. Let's hope those comments are accurate! This is an excellent point. My first thought if someone were to be using a Dict with all values set to None would be: why not use a Set here? As Larry said, the need for preservation of insertion order would have to be explicitly described in the code comments. Realistically speaking, code comments are not something that can be consistently relied upon, especially if it's part of some boilerplate format that gets replicated or if the container is used in multiple locations. I'd much rather have a dedicated data structure that clearly describes what it does based on the name alone, IMO that's a million times better for readability purposes. Also, this is mostly speculation since I haven't ran any benchmarks for an OrderedSet implementation, but I strongly suspect that OrderedSet will end up having better average performance for add, pop, and update than trying to use a dictionary with None values (assuming it's implemented in C or at least with a C accelerator). Not to mention allowing the ability to update (for both addition and removal) based on intersections and unions across ordered sets, which currently isn't possible to do with dictionaries (at least not directly with |=, &=, -=. and ^=). I'm not sure how much of a practical use case this would have for ordered sets specifically, but it's a nice bonus. On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings wrote: > > (mixing together messages in the thread, sorry threaded-view readers) > > On 12/19/19 3:15 PM, Tim Peters wrote: > > [Nick] > > I took Larry's request a slightly different way: > > [...] > > Only Larry can answer whether that would meet his perceived need. My > _guess_ is that he wouldn't know OrderedSet existed, and, even if he > did, he'd use a dict with None values anyway (because it's less hassle > and does everything he wanted). > > > At last, something I can speak knowledgeably about: Larry's use case! > Regarding Larry, I'd say > >1. his use case was small enough that almost anything maintaining >order would have worked, even a list, >2. he often *does have* a pretty good idea what goodies are salted >away in the Python standard library, and >3. a hypothetical collections.OrderedSet would probably work just >fine. And he'd probably use it too, as that makes the code clearer / >easier to read. If you read code using an OrderedSet, you know what the >intent was and what the semantics are. If you see code using a dict but >setting the values to None, you think "that's crazy" and now you have a >puzzle to solve. Let's hope those comments are accurate! > > > Also, speaking personally, at least once (maybe twice?) in this thread > folks have asked "would YOU, Mr Larry, really want ordered sets if it meant > sets were slightly slower?" > > The first thing I'd say is, I'm not sure why people should care about > what's best for me. That's sweet of you! But you really shouldn't. > > The second thing is, my needs are modest, so the speed hit we're talking > about would likely be statistically insignificant, for me. > > And the third thing is, I don't really put the set() API through much of a > workout anyway. I build sets, I add and remove items, I iterate over them, > I do the occasional union, and only very rarely do I use anything else. > Even then, when I write code where I reach for a fancy operation like > intersection or symmetric_difference--what tends to happen is, I have a > rethink and a refactor, and after I simplify my code I don't need the fancy > operations anymore. I can't remember the last time I used one of these > fancy operations where the code really stuck around for a long time. > > So, personally? Sure, I'd take that tradeoff. As already established, I > like that dict objects maintain their order, and I think it'd be swell if > sets maintained their order too. I suspect the performance hit wouldn't > affect *me* in any meaningful way, and I could benefit from the > order-maintaining semantics. I bet it'd have other benefits
[Python-Dev] Re: Should set objects maintain insertion order too?
To begin with, please compare performance of dict-with-None-values to that of a set, for operations that can be expressed by both (e.g. both have update()). On Mon, Dec 23, 2019 at 18:08 Kyle Stanley wrote: > Larry Hastings wrote: > > a hypothetical collections.OrderedSet would probably work just fine. > > I'm also in agreement with adding a collections.OrderedSet. There > certainly seems to be a practical use case for a Set that preserves > insertion order, but with significant push back to modifying the existing > Set implementation (primarily due to performance concerns). > > If later down the road an improvement to OrderedSet is made that gives it > equal or better average performance across the board compared to Set, we > can consider changing the default implementation *if* there's significant > demand for it. > > Larry Hastings wrote: > > And he'd probably use it too, as that makes the code clearer / easier to > read. If you read code using an OrderedSet, you know what the intent was > and what the semantics are. If you see code using a dict but setting the > values to None, you think "that's crazy" and now you have a puzzle to > solve. Let's hope those comments are accurate! > > This is an excellent point. My first thought if someone were to be using a > Dict with all values set to None would be: why not use a Set here? As Larry > said, the need for preservation of insertion order would have to be > explicitly described in the code comments. Realistically speaking, code > comments are not something that can be consistently relied upon, especially > if it's part of some boilerplate format that gets replicated or if the > container is used in multiple locations. I'd much rather have a dedicated > data structure that clearly describes what it does based on the name alone, > IMO that's a million times better for readability purposes. > > Also, this is mostly speculation since I haven't ran any benchmarks for an > OrderedSet implementation, but I strongly suspect that OrderedSet will end > up having better average performance for add, pop, and update than trying > to use a dictionary with None values (assuming it's implemented in C or at > least with a C accelerator). > > Not to mention allowing the ability to update (for both addition and > removal) based on intersections and unions across ordered sets, which > currently isn't possible to do with dictionaries (at least not directly > with |=, &=, -=. and ^=). I'm not sure how much of a practical use case > this would have for ordered sets specifically, but it's a nice bonus. > > On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings wrote: > >> >> (mixing together messages in the thread, sorry threaded-view readers) >> >> On 12/19/19 3:15 PM, Tim Peters wrote: >> >> [Nick] >> >> I took Larry's request a slightly different way: >> >> [...] >> >> Only Larry can answer whether that would meet his perceived need. My >> _guess_ is that he wouldn't know OrderedSet existed, and, even if he >> did, he'd use a dict with None values anyway (because it's less hassle >> and does everything he wanted). >> >> >> At last, something I can speak knowledgeably about: Larry's use case! >> Regarding Larry, I'd say >> >>1. his use case was small enough that almost anything maintaining >>order would have worked, even a list, >>2. he often *does have* a pretty good idea what goodies are salted >>away in the Python standard library, and >>3. a hypothetical collections.OrderedSet would probably work just >>fine. And he'd probably use it too, as that makes the code clearer / >>easier to read. If you read code using an OrderedSet, you know what the >>intent was and what the semantics are. If you see code using a dict but >>setting the values to None, you think "that's crazy" and now you have a >>puzzle to solve. Let's hope those comments are accurate! >> >> >> Also, speaking personally, at least once (maybe twice?) in this thread >> folks have asked "would YOU, Mr Larry, really want ordered sets if it meant >> sets were slightly slower?" >> >> The first thing I'd say is, I'm not sure why people should care about >> what's best for me. That's sweet of you! But you really shouldn't. >> >> The second thing is, my needs are modest, so the speed hit we're talking >> about would likely be statistically insignificant, for me. >> >> And the third thing is, I don't really put the set() API through much of >> a workout anyway. I build sets, I add and remove items, I iterate over >> them, I do the occasional union, and only very rarely do I use anything >> else. Even then, when I write code where I reach for a fancy operation >> like intersection or symmetric_difference--what tends to happen is, I have >> a rethink and a refactor, and after I simplify my code I don't need the >> fancy operations anymore. I can't remember the last time I used one of >> these fancy operations where the code really stuck around for a long time. >> >> So, person
[Python-Dev] Re: Should set objects maintain insertion order too?
Even though I was the first person in this thread to suggest collections.OrderedSet, I'm "meh" about it now. As I read more and played with the sortedcollections package, it seemed to me that while I might want a set that iterated in a determinate and meaningful order moderately often, insertion order would make up a small share of those use cases. On Mon, Dec 23, 2019, 8:05 PM Kyle Stanley wrote: > Larry Hastings wrote: > > a hypothetical collections.OrderedSet would probably work just fine. > > I'm also in agreement with adding a collections.OrderedSet. There > certainly seems to be a practical use case for a Set that preserves > insertion order, but with significant push back to modifying the existing > Set implementation (primarily due to performance concerns). > > If later down the road an improvement to OrderedSet is made that gives it > equal or better average performance across the board compared to Set, we > can consider changing the default implementation *if* there's significant > demand for it. > > Larry Hastings wrote: > > And he'd probably use it too, as that makes the code clearer / easier to > read. If you read code using an OrderedSet, you know what the intent was > and what the semantics are. If you see code using a dict but setting the > values to None, you think "that's crazy" and now you have a puzzle to > solve. Let's hope those comments are accurate! > > This is an excellent point. My first thought if someone were to be using a > Dict with all values set to None would be: why not use a Set here? As Larry > said, the need for preservation of insertion order would have to be > explicitly described in the code comments. Realistically speaking, code > comments are not something that can be consistently relied upon, especially > if it's part of some boilerplate format that gets replicated or if the > container is used in multiple locations. I'd much rather have a dedicated > data structure that clearly describes what it does based on the name alone, > IMO that's a million times better for readability purposes. > > Also, this is mostly speculation since I haven't ran any benchmarks for an > OrderedSet implementation, but I strongly suspect that OrderedSet will end > up having better average performance for add, pop, and update than trying > to use a dictionary with None values (assuming it's implemented in C or at > least with a C accelerator). > > Not to mention allowing the ability to update (for both addition and > removal) based on intersections and unions across ordered sets, which > currently isn't possible to do with dictionaries (at least not directly > with |=, &=, -=. and ^=). I'm not sure how much of a practical use case > this would have for ordered sets specifically, but it's a nice bonus. > > On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings wrote: > >> >> (mixing together messages in the thread, sorry threaded-view readers) >> >> On 12/19/19 3:15 PM, Tim Peters wrote: >> >> [Nick] >> >> I took Larry's request a slightly different way: >> >> [...] >> >> Only Larry can answer whether that would meet his perceived need. My >> _guess_ is that he wouldn't know OrderedSet existed, and, even if he >> did, he'd use a dict with None values anyway (because it's less hassle >> and does everything he wanted). >> >> >> At last, something I can speak knowledgeably about: Larry's use case! >> Regarding Larry, I'd say >> >>1. his use case was small enough that almost anything maintaining >>order would have worked, even a list, >>2. he often *does have* a pretty good idea what goodies are salted >>away in the Python standard library, and >>3. a hypothetical collections.OrderedSet would probably work just >>fine. And he'd probably use it too, as that makes the code clearer / >>easier to read. If you read code using an OrderedSet, you know what the >>intent was and what the semantics are. If you see code using a dict but >>setting the values to None, you think "that's crazy" and now you have a >>puzzle to solve. Let's hope those comments are accurate! >> >> >> Also, speaking personally, at least once (maybe twice?) in this thread >> folks have asked "would YOU, Mr Larry, really want ordered sets if it meant >> sets were slightly slower?" >> >> The first thing I'd say is, I'm not sure why people should care about >> what's best for me. That's sweet of you! But you really shouldn't. >> >> The second thing is, my needs are modest, so the speed hit we're talking >> about would likely be statistically insignificant, for me. >> >> And the third thing is, I don't really put the set() API through much of >> a workout anyway. I build sets, I add and remove items, I iterate over >> them, I do the occasional union, and only very rarely do I use anything >> else. Even then, when I write code where I reach for a fancy operation >> like intersection or symmetric_difference--what tends to happen is, I have >> a rethink and a refactor, and after I simplify
[Python-Dev] Re: Should set objects maintain insertion order too?
> To begin with, please compare performance of dict-with-None-values to that of a set, for operations that can be expressed by both (e.g. both have update()). Good idea. It would be useful to have a baseline comparison. I emboldened the comparisons with a notable difference. Tested on: Version: Python 3.8.0 OS: Linux kernel 5.4.5 CPU: Intel i5-4460 Initialization (about the same): >>> timeit.timeit("set(i for i in range(1000))", number=100_000) 4.878508095000143 >>> timeit.timeit("{i: None for i in range(1000)}", number=100_000) 4.9192055170024105 Membership (about the same): >>> timeit.timeit("random.randint(0, 999) in s", setup="import random; s = set(i for i in range(1000))", number=10_000_000) 7.992674231001729 >>> timeit.timeit("random.randint(0, 999) in d", setup="import random; d = {i: None for i in range(1000)}", number=10_000_000) 8.032404395999038 *Add* (much faster for dicts): >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000) 13.330938750001224 >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000) 5.788865385999088 *Update* (much faster for updating sets): >>> timeit.timeit("s.update(other_s)", setup="s = set(); other_s = set(i for i in range(1000))", number=1_000_000) 6.2428832090008655 >>> timeit.timeit("d.update(other_d)", setup="d = {}; other_d = {i: None for i in range(1000)}", number=1_000_000) 13.031371458000649 *Create list from keys* (faster for sets): >>> timeit.timeit("list(s)", setup="s = set(i for i in range(1000))", number=10_00_000) 5.24364303600305 >>> timeit.timeit("list(d)", setup="d = {i: None for i in range(1000)}", number=10_00_000) 6.303017043999716 Removal isn't easily comparable, since s.pop(), s.discard(), d.pop(), and d.popitem() all behave quite differently. Also, I'm sure these benchmarks could be improved significantly, particularly with the "Add" ones. This should provide a decent general idea though. On Mon, Dec 23, 2019 at 8:12 PM Guido van Rossum wrote: > To begin with, please compare performance of dict-with-None-values to that > of a set, for operations that can be expressed by both (e.g. both have > update()). > > On Mon, Dec 23, 2019 at 18:08 Kyle Stanley wrote: > >> Larry Hastings wrote: >> > a hypothetical collections.OrderedSet would probably work just fine. >> >> I'm also in agreement with adding a collections.OrderedSet. There >> certainly seems to be a practical use case for a Set that preserves >> insertion order, but with significant push back to modifying the existing >> Set implementation (primarily due to performance concerns). >> >> If later down the road an improvement to OrderedSet is made that gives it >> equal or better average performance across the board compared to Set, we >> can consider changing the default implementation *if* there's >> significant demand for it. >> >> Larry Hastings wrote: >> > And he'd probably use it too, as that makes the code clearer / easier >> to read. If you read code using an OrderedSet, you know what the intent >> was and what the semantics are. If you see code using a dict but setting >> the values to None, you think "that's crazy" and now you have a puzzle to >> solve. Let's hope those comments are accurate! >> >> This is an excellent point. My first thought if someone were to be using >> a Dict with all values set to None would be: why not use a Set here? As >> Larry said, the need for preservation of insertion order would have to be >> explicitly described in the code comments. Realistically speaking, code >> comments are not something that can be consistently relied upon, especially >> if it's part of some boilerplate format that gets replicated or if the >> container is used in multiple locations. I'd much rather have a dedicated >> data structure that clearly describes what it does based on the name alone, >> IMO that's a million times better for readability purposes. >> >> Also, this is mostly speculation since I haven't ran any benchmarks for >> an OrderedSet implementation, but I strongly suspect that OrderedSet will >> end up having better average performance for add, pop, and update than >> trying to use a dictionary with None values (assuming it's implemented in C >> or at least with a C accelerator). >> >> Not to mention allowing the ability to update (for both addition and >> removal) based on intersections and unions across ordered sets, which >> currently isn't possible to do with dictionaries (at least not directly >> with |=, &=, -=. and ^=). I'm not sure how much of a practical use case >> this would have for ordered sets specifically, but it's a nice bonus. >> >> On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings >> wrote: >> >>> >>> (mixing together messages in the thread, sorry threaded-view readers) >>> >>> On 12/19/19 3:15 PM, Tim Peters wrote: >>> >>> [Nick] >>> >>> I took Larry's request a slightly different way: >>> >>> [...] >>> >>> Only Larry can answer whether that would meet his perceived need. My >>> _guess_ is that he wouldn't know OrderedSet exis
[Python-Dev] Re: Should set objects maintain insertion order too?
On Tue, Dec 24, 2019 at 1:57 PM Kyle Stanley wrote: > Add (much faster for dicts): > >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000) > 13.330938750001224 > >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000) > 5.788865385999088 I think this is an artifact of Python not having an empty set literal. >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000) 13.275540543720126 >>> timeit.timeit("d = dict(); d[0] = None", number=100_000_000) 13.044076398015022 >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000) 6.088695731014013 >>> timeit.timeit("s = {1}; s.add(0)", number=100_000_000) 9.260965215042233 >>> timeit.timeit("d = {1:2}; d[0] = None", number=100_000_000) 8.75433829985559 When both use the constructor call or both use a literal, the difference is far smaller. I'd call this one a wash. ChrisA ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ODZYHNI57MFZD3I7TGP3B3HJTRX36KGB/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
David Mertz writes: > Even though I was the first person in this thread to suggest > collections.OrderedSet, I'm "meh" about it now. As I read more and played > with the sortedcollections package, it seemed to me that while I might want > a set that iterated in a determinate and meaningful order moderately often, > insertion order would make up a small share of those use cases. On the other hand, insertion order is one of the most prominent of the determinate meaningful orders where you would have to do ugly things to use "sorted" to get that order. Any application where you have an unreliable message bus feeding a queue (so that you might get duplicate objects but it's bad to process the same object twice) would be a potential application of insertion-ordered sets. Steve ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZGSW6MJ4HGHUG65TLWUDHQMZXCW2GR7Q/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
On Mon, Dec 23, 2019 at 8:56 PM Kyle Stanley wrote: > > > To begin with, please compare performance of dict-with-None-values to that > > of a set, for operations that can be expressed by both (e.g. both have > > update()). > > Good idea. It would be useful to have a baseline comparison. I emboldened the > comparisons with a notable difference. > > Tested on: > Version: Python 3.8.0 > OS: Linux kernel 5.4.5 > CPU: Intel i5-4460 > > Initialization (about the same): > >>> timeit.timeit("set(i for i in range(1000))", number=100_000) > 4.878508095000143 > >>> timeit.timeit("{i: None for i in range(1000)}", number=100_000) > 4.9192055170024105 > > Membership (about the same): > >>> timeit.timeit("random.randint(0, 999) in s", setup="import random; s = > >>> set(i for i in range(1000))", number=10_000_000) > 7.992674231001729 > >>> timeit.timeit("random.randint(0, 999) in d", setup="import random; d = > >>> {i: None for i in range(1000)}", number=10_000_000) > 8.032404395999038 > > Add (much faster for dicts): > >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000) > 13.330938750001224 > >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000) > 5.788865385999088 > > Update (much faster for updating sets): > >>> timeit.timeit("s.update(other_s)", setup="s = set(); other_s = set(i for > >>> i in range(1000))", number=1_000_000) > 6.2428832090008655 > >>> timeit.timeit("d.update(other_d)", setup="d = {}; other_d = {i: None for > >>> i in range(1000)}", number=1_000_000) > 13.031371458000649 > > Create list from keys (faster for sets): > >>> timeit.timeit("list(s)", setup="s = set(i for i in range(1000))", > >>> number=10_00_000) > 5.24364303600305 > >>> timeit.timeit("list(d)", setup="d = {i: None for i in range(1000)}", > >>> number=10_00_000) > 6.303017043999716 > > Removal isn't easily comparable, since s.pop(), s.discard(), d.pop(), and > d.popitem() all behave quite differently. Also, I'm sure these benchmarks > could be improved significantly, particularly with the "Add" ones. This > should provide a decent general idea though. > > On Mon, Dec 23, 2019 at 8:12 PM Guido van Rossum wrote: >> >> To begin with, please compare performance of dict-with-None-values to that >> of a set, for operations that can be expressed by both (e.g. both have >> update()). >> >> On Mon, Dec 23, 2019 at 18:08 Kyle Stanley wrote: >>> >>> Larry Hastings wrote: >>> > a hypothetical collections.OrderedSet would probably work just fine. >>> >>> I'm also in agreement with adding a collections.OrderedSet. There certainly >>> seems to be a practical use case for a Set that preserves insertion order, >>> but with significant push back to modifying the existing Set implementation >>> (primarily due to performance concerns). >>> >>> If later down the road an improvement to OrderedSet is made that gives it >>> equal or better average performance across the board compared to Set, we >>> can consider changing the default implementation if there's significant >>> demand for it. >>> >>> Larry Hastings wrote: >>> > And he'd probably use it too, as that makes the code clearer / easier to >>> > read. If you read code using an OrderedSet, you know what the intent was >>> > and what the semantics are. If you see code using a dict but setting the >>> > values to None, you think "that's crazy" and now you have a puzzle to >>> > solve. Let's hope those comments are accurate! >>> >>> This is an excellent point. My first thought if someone were to be using a >>> Dict with all values set to None would be: why not use a Set here? As Larry >>> said, the need for preservation of insertion order would have to be >>> explicitly described in the code comments. Realistically speaking, code >>> comments are not something that can be consistently relied upon, especially >>> if it's part of some boilerplate format that gets replicated or if the >>> container is used in multiple locations. I'd much rather have a dedicated >>> data structure that clearly describes what it does based on the name alone, >>> IMO that's a million times better for readability purposes. >>> >>> Also, this is mostly speculation since I haven't ran any benchmarks for an >>> OrderedSet implementation, but I strongly suspect that OrderedSet will end >>> up having better average performance for add, pop, and update than trying >>> to use a dictionary with None values (assuming it's implemented in C or at >>> least with a C accelerator). >>> >>> Not to mention allowing the ability to update (for both addition and >>> removal) based on intersections and unions across ordered sets, which >>> currently isn't possible to do with dictionaries (at least not directly >>> with |=, &=, -=. and ^=). I'm not sure how much of a practical use case >>> this would have for ordered sets specifically, but it's a nice bonus. >>> >>> On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings wrote: (mixing together messages in the thread, sorry threaded-vi
[Python-Dev] Re: Should set objects maintain insertion order too?
Sorry! A previous attempt to reply got sent before I typed anything :-( Very briefly: > >>> timeit.timeit("set(i for i in range(1000))", number=100_000) [and other examples using a range of integers] The collision resolution strategy for sets evolved to be fancier than for dicts, to reduce cache misses. This is important for sets because the _only_ interesting thing about an element wrt a set is whether or not the set contains it. Lookup speed is everything for sets. If you use a contiguous range of "reasonable" integers for keys, the integer hash function is perfect: there's never a collision. So any such test misses all the work Raymond did to speed set lookups. String keys have sufficiently "random" hashes to reliably create collisions, though. Cache misses can, of course, have massive effects on timing. > Add (much faster for dicts): > >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000) > 13.330938750001224 > >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000) > 5.788865385999088 In the former case you're primarily measuring the time to look up the "add" method of sets (that's more expensive than adding 0 to the set). A better comparison would, e.g., move `add = s.add` to a setup line, and use plain "add(0)" in the loop. That's it! ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5UKYDB36WDRRIYOLRD52QS4I43SCAAJ6/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
Chris Angelico wrote: > I think this is an artifact of Python not having an empty set literal. > [snip] > When both use the constructor call or both use a literal, the > difference is far smaller. I'd call this one a wash. Ah, good catch. I hadn't considered that it would make a substantial difference, but that makes sense. Here's an additional comparison between "{}" and "dict()" to confirm it: >>> timeit.timeit("{}", number=100_000_000) 2.1038335599987477 >>> timeit.timeit("dict()", number=100_000_000) 10.22558353268 Some more in-depth comparisons might be required for addition of single items to the containers. I might experiment further with this at some point in the next week or so, likely with implementing proper tests that omit the time to initialize the container. On Mon, Dec 23, 2019 at 10:09 PM Chris Angelico wrote: > On Tue, Dec 24, 2019 at 1:57 PM Kyle Stanley wrote: > > Add (much faster for dicts): > > >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000) > > 13.330938750001224 > > >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000) > > 5.788865385999088 > > I think this is an artifact of Python not having an empty set literal. > > >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000) > 13.275540543720126 > >>> timeit.timeit("d = dict(); d[0] = None", number=100_000_000) > 13.044076398015022 > >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000) > 6.088695731014013 > >>> timeit.timeit("s = {1}; s.add(0)", number=100_000_000) > 9.260965215042233 > >>> timeit.timeit("d = {1:2}; d[0] = None", number=100_000_000) > 8.75433829985559 > > When both use the constructor call or both use a literal, the > difference is far smaller. I'd call this one a wash. > > ChrisA > ___ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/ODZYHNI57MFZD3I7TGP3B3HJTRX36KGB/ > Code of Conduct: http://python.org/psf/codeofconduct/ > ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/COUO6PFHUADHYP5KSHMPHPIUCOAXS56L/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
Tim Peters wrote: > Sorry! A previous attempt to reply got sent before I typed anything :-( No worries, I only saw the link in the footer to the PSF CoC, and I was mildly concerned for a moment. My first thought was "Oh no, what did I do?". Thanks for clearing that up (: > The collision resolution strategy for sets evolved to be fancier than > for dicts, to reduce cache misses. This is important for sets because > the _only_ interesting thing about an element wrt a set is whether or > not the set contains it. Lookup speed is everything for sets. Huh, that's quite interesting. For some reason, I had assumed in the back of my head (without giving it much thought) that the average collision rate would be the same for set items and dict keys. Thanks for the useful information. > If you use a contiguous range of "reasonable" integers for keys, the > integer hash function is perfect: there's never a collision. So any > such test misses all the work Raymond did to speed set lookups. > String keys have sufficiently "random" hashes to reliably create > collisions, though. Cache misses can, of course, have massive effects > on timing. Ah, I forgot to consider how the hash function actually works for continuous integers. A better comparison to demonstrate the collision differences would likely use random strings. Also, I believe that max "reasonable" integer range of no collision is (-2305843009213693951, 2305843009213693951), as defined in Include/pyhash.h ( https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/pyhash.h#L28 ). > In the former case you're primarily measuring the time to look up the > "add" method of sets (that's more expensive than adding 0 to the set). > A better comparison would, e.g., move `add = s.add` to a setup line, > and use plain "add(0)" in the loop. Good point, there's also what Chris Angelico pointed out as well: dicts have a significant advantage in this case for having a literal for initialization. From my understanding, this ends up having a minimal performance impact in most realistic code (similar to method lookup time), but it makes a very substantial difference in this case since initialization of the containers takes up a significant portion of each iteration. I suspect my initialization comparison might also be flawed for similar reasons, but I'm glad that at least 3/5 of my comparisons were mostly reasonable. So far, the performance difference between set.update() and dict.update() were the most notable. I'll likely spend some more time experimenting in the next couple of weeks or so, and report back if I find any interesting results. In the meantime, anyone can certainly feel free to improve upon my existing comparisons. On Mon, Dec 23, 2019 at 11:03 PM Tim Peters wrote: > Sorry! A previous attempt to reply got sent before I typed anything :-( > > Very briefly: > > > >>> timeit.timeit("set(i for i in range(1000))", number=100_000) > > [and other examples using a range of integers] > > The collision resolution strategy for sets evolved to be fancier than > for dicts, to reduce cache misses. This is important for sets because > the _only_ interesting thing about an element wrt a set is whether or > not the set contains it. Lookup speed is everything for sets. > > If you use a contiguous range of "reasonable" integers for keys, the > integer hash function is perfect: there's never a collision. So any > such test misses all the work Raymond did to speed set lookups. > String keys have sufficiently "random" hashes to reliably create > collisions, though. Cache misses can, of course, have massive effects > on timing. > > > Add (much faster for dicts): > > >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000) > > 13.330938750001224 > > >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000) > > 5.788865385999088 > > In the former case you're primarily measuring the time to look up the > "add" method of sets (that's more expensive than adding 0 to the set). > A better comparison would, e.g., move `add = s.add` to a setup line, > and use plain "add(0)" in the loop. > > That's it! > ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LDAC4526CZZWFMXORSOKPW2PM6HZE252/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Kyle] > ... > For some reason, I had assumed in the back of my head (without > giving it much thought) that the average collision rate would be the > same for set items and dict keys. Thanks for the useful information. I know the theoretical number of probes for dicts, but not for sets anymore. The latter use a mix of probe strategies now, "randomish" jumps (same as for dicts) but also purely linear ("up by 1") probing to try to exploit L1 cache. It's not _apparent_ to me that the mix actually helps ;-) I just trust that Raymond timed stuff until he was sure it does. > ... > Ah, I forgot to consider how the hash function actually works for continuous > integers. A better comparison to demonstrate the collision differences would > likely use random strings. And, at an extreme, a class with a __hash__ that always returns the same integer. > Also, I believe that max "reasonable" integer range of no collision > is (-2305843009213693951, 2305843009213693951), ... Any range that does _not_ contain both -2 and -1 (-1 is an annoying special case, with hash(-1) == hash(-2) == -2), and spans no more than sys.hash_info.modulus integers. Apart from that, the sign and magnitude of the start of the range don't matter; e.g., >>> len(set(hash(i) for i in range(10**5000, 10**5000 + 100))) 100 >>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 100))) 100 ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FXTRZLOHSWYV5KEGJ4DYNS7HOMUOPVE5/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
>> Also, I believe that max "reasonable" integer range of no collision >> is (-2305843009213693951, 2305843009213693951), ... > Any range that does _not_ contain both -2 and -1 (-1 is an annoying > special case, with hash(-1) == hash(-2) == -2), and spans no more than > sys.hash_info.modulus integers. Apart from that, the sign and > magnitude of the start of the range don't matter; e.g., > > >>> len(set(hash(i) for i in range(10**5000, 10**5000 + 100))) > 100 > >>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 100))) > 100 Sorry again! That only shows that the hash codes are unique. Dicts and sets use only the last N bits to determine the start of the probe sequence, for some value of N >= 3 (depending on the table size). So for a table of size a million, the last 20 bits (100 ~= 2**20) are interesting. But same thing: >>> MASK = (1 << 20) - 1 >>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 100))) 100 >>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 100))) 100 ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BXHOS73YMJDSVZUM74VYXXYN5WW63BZ4/ Code of Conduct: http://python.org/psf/codeofconduct/