[issue47114] random.choice and random.choices have different distributions
New submission from Mark Bell : The docstring for `random.choices` indicates that ``` import random random.choices(population, k=1) ``` should produce a list containing one item, where each item of `population` has equal likelihood of being selected. However `random.choices` draws elements for its sample by doing `population[floor(random() * len(population)]` and so relies on floating point numbers. Therefore not each item is equally likely to be chosen since floats are not uniformly dense in [0, 1] and this problem becomes worse as `population` becomes larger. Note that this issue does not apply to `random.choice(population)` since this uses `random.randint` to choose a random element of `population` and performs exact integer arithmetic. Compare https://github.com/python/cpython/blob/main/Lib/random.py#L371 and https://github.com/python/cpython/blob/main/Lib/random.py#L490 Could `random.choices` fall back to doing `return [choice(population) for _ in _repeat(None, k)]` if no weights are given? Similarly, is it also plausible to only rely on `random.randint` and integer arithmetic if all of the (cumulative) weights given to `random.choices` are integers? -- components: Library (Lib) messages: 415981 nosy: Mark.Bell priority: normal severity: normal status: open title: random.choice and random.choices have different distributions versions: Python 3.11 ___ Python tracker <https://bugs.python.org/issue47114> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue47114] random.choice and random.choices have different distributions
Mark Bell added the comment: To give two more consequences of `random.choices` using floating point arithmetic: 1) When doing `random.choices([A, B, C], weights=[2**55, 1, 1])` the cumulative weight to bisect for is selected using `floor(random() * (2**55 + 1 + 1 + 0.0))`. Since this is always even, it is impossible for `B` to be chosen. 2) When doing `random.choices([A, B], weights=[2**54, 1])` although the `cum_weights` is [18014398509481984, 18014398509481985] the `total` used by `random.choices` is `cum_weights[-1] + 0.0`. Due to floating point rounding this is 18014398509481984.0 and so is actually smaller than `cum_weights[-1]`. Therefore it is impossible for `B` to be chosen. -- ___ Python tracker <https://bugs.python.org/issue47114> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28937] str.split(): allow removing empty strings (when sep is not None)
Mark Bell added the comment: So I have taken a look at the original patch that was provided and I have been able to update it so that it is compatible with the current release. I have also flipped the logic in the wrapping functions so that they take a `keepempty` flag (which is the opposite of the `prune` flag). I had to make a few extra changes since there are now some extra catches in things like PyUnicode_Split which spot that if len(self) > len(sep) then they can just return [self]. However that now needs an extra test since that shortcut can only be used if len(self) > 0. You can find the code here: https://github.com/markcbell/cpython/tree/split-keepempty However in exploring this, I'm not sure that this patch interacts correctly with maxsplit. For example, ' x y z'.split(maxsplit=1, keepempty=True) results in ['', '', 'x', 'y z'] since the first two empty strings items are "free" and don't count towards the maxsplit. I think the length of the result returned must be <= maxsplit + 1, is this right? I'm about to rework the logic to avoid this, but before I go too far could someone double check my test cases to make sure that I have the correct idea about how this is supposed to work please. Only the 8 lines marked "New case" show new behaviour, all the other come from how string.split works currently. Of course the same patterns should apply to bytestrings and bytearrays. ''.split() == [] ''.split(' ') == [''] ''.split(' ', keepempty=False) == []# New case ' '.split(' ') == ['', '', ''] ' '.split(' ', maxsplit=1) == ['', ' '] ' '.split(' ', maxsplit=1, keepempty=False) == [' ']# New case ' a b c '.split() == ['a', 'b', 'c'] ' a b c '.split(maxsplit=0) == ['a b c '] ' a b c '.split(maxsplit=1) == ['a', 'b c '] ' a b c '.split(' ') == ['', '', 'a', 'b', 'c', '', ''] ' a b c '.split(' ', maxsplit=0) == [' a b c '] ' a b c '.split(' ', maxsplit=1) == ['', ' a b c '] ' a b c '.split(' ', maxsplit=2) == ['', '', 'a b c '] ' a b c '.split(' ', maxsplit=3) == ['', '', 'a', 'b c '] ' a b c '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c '] ' a b c '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' '] ' a b c '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', ''] ' a b c '.split(' ', keepempty=False) == ['a', 'b', 'c']# New case ' a b c '.split(' ', maxsplit=0, keepempty=False) == [' a b c ']# New case ' a b c '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c ']# New case ' a b c '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c '] # New case ' a b c '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' ']# New case ' a b c '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c'] # New case -- nosy: +Mark.Bell ___ Python tracker <https://bugs.python.org/issue28937> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28937] str.split(): allow removing empty strings (when sep is not None)
Mark Bell added the comment: > suggests that empty strings don't count towards maxsplit Thank you for the confirmation. Although just to clarify I guess you really mean "empty strings *that are dropped from the output* don't count towards maxsplit". Just to double check this, what do we expect the output of ' x y z'.split(maxsplit=1, keepempty=True) to be? I think it should be ['', ' x y z'] since in this case we are retaining empty strings and they should count towards the maxsplit. (In the current patch this crashes with a core dump since it tries to write to unallocated memory) -- ___ Python tracker <https://bugs.python.org/issue28937> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28937] str.split(): allow removing empty strings (when sep is not None)
Mark Bell added the comment: So I think I agree with you about the difference between .split() and .split(' '). However wouldn't that mean that ' x y z'.split(maxsplit=1, keepempty=False) == ['x', 'y z'] since it should do one split. -- ___ Python tracker <https://bugs.python.org/issue28937> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28937] str.split(): allow removing empty strings (when sep is not None)
Change by Mark Bell : -- pull_requests: +24839 pull_request: https://github.com/python/cpython/pull/26222 ___ Python tracker <https://bugs.python.org/issue28937> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28937] str.split(): allow removing empty strings (when sep is not None)
Mark Bell added the comment: Thank you very much for confirming these test cases. Using these I believe that I have now been able to complete a patch that would implement this feature. The PR is available at https://github.com/python/cpython/pull/26222. As I am a first-time contributor, please could a maintainer approve running the CI workflows so that I can confirm that all the (new) tests pass. -- ___ Python tracker <https://bugs.python.org/issue28937> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28937] str.split(): allow removing empty strings (when sep is not None)
Mark Bell added the comment: Andrei: That is a very interesting observation, thank you for pointing it out. I guess your example / argument also currently applies to whitespace separation too. For example, if we have a whitespace separated string with contents: col1 col2 col3 a b c x y z then using [row.split() for row in contents.splitlines()] results in [['col1', 'col2', 'col3'], ['a', 'b', 'c'], [], ['x', 'y', 'z']] However if later a user appends the row: p q aiming to have p, and empty cell and then q then they will actually get [['col1', 'col2', 'col3'], ['a', 'b', 'c'], [], ['x', 'y', 'z'], ['p', 'q']] So at least this patch results in behaviour that is consistent with how split currently works. Are you suggesting that this is something that could be addressed by clearer documentation or using a different flag name? -- ___ Python tracker <https://bugs.python.org/issue28937> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28937] str.split(): allow removing empty strings (when sep is not None)
Mark Bell added the comment: > Instead, the discussion was focused on removing *all* empty strings from the > result. I imagine that the discussion focussed on this since this is precisely what happens when sep=None. For example, 'a b c '.split() == ['a', 'b', 'c']. I guess that the point was to provide users with explicit, manual control over whether the behaviour of split should drop all empty strings or retain all empty strings instead of this decision just being made on whether sep is None or not. So I wonder whether the "expected" solution for parsing CSV like strings is for you to actually filter out the empty strings yourself and never pass them to split at all. For example by doing something like: [line.split(sep=',') for line in content.splitlines() if line] but if this is the case then this is the kind of thing that would require careful thought about what is the right name for this parameter / right way to express this in the documentation to make sure that users don't fall into the trap that you mentioned. > Sorry that I bring this up only now when the discussion was finished and the > work on PR completed; I wish I had seen the issue sooner. Of course, but the main thing is that you spotted this before the PR was merged :) -- ___ Python tracker <https://bugs.python.org/issue28937> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue39755] Change example of itertools.product
New submission from Mark Bell : The documentation for itertools.product at: https://docs.python.org/3/library/itertools.html#itertools.product currently says that: For example, product(A, B) returns the same as ((x,y) for x in A for y in B) While this is broadly correct, since product first converts its arguments to tuples, this is not true if A or B are infinite iterables. For example, when A = itertools.count() and B = range(2) then the former runs forever using infinite memory, whereas the latter returns the lazy generator immediately for use. Would it be clearer / more correct to instead say: For example, product(A, B) returns the same as ((x,y) for x in tuple(A) for y in tuple(B)) -- assignee: docs@python components: Documentation messages: 362672 nosy: Mark.Bell, docs@python priority: normal severity: normal status: open title: Change example of itertools.product versions: Python 3.5 ___ Python tracker <https://bugs.python.org/issue39755> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue31141] Start should be a keyword argument of the built-in sum
New submission from Mark Bell: The built-in function sum takes an optional argument "start" to specify what value to start adding from (defaults to 0). This argument should be a keyword argument in order to match the other built-in functions such as: enumerate(range(10), start=5) This patch allows users to write: sum(range(10), start=5) which previously raised "TypeError: sum() takes no keyword arguments". Since the only change is making an optional positional argument into a keyword argument, this has no effect on any existing code using the current convention of: sum(range(10), 5) -- components: ctypes messages: 299908 nosy: Mark.Bell priority: normal severity: normal status: open title: Start should be a keyword argument of the built-in sum versions: Python 2.7, Python 3.7 ___ Python tracker <http://bugs.python.org/issue31141> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue31141] Start should be a keyword argument of the built-in sum
Changes by Mark Bell : -- type: -> behavior ___ Python tracker <http://bugs.python.org/issue31141> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue31141] Start should be a keyword argument of the built-in sum
Mark Bell added the comment: I ran some timing tests of the patch I submitted to compare it to the current build of Python. Using timit on the current master branch I got: python.exe -m timeit "sum(())" 1.12 usec per loop python.exe -m timeit "sum((), 0)" 1.22 usec per loop And for the patched version: python.exe -m timeit "sum(())" 1.46 usec per loop python.exe -m timeit "sum((), 0)" 1.57 usec per loop However my patch wasn't just the simple argument clinic change suggested by serhiy.storchaka, so maybe that would be more efficient and easier to understand. -- ___ Python tracker <http://bugs.python.org/issue31141> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21757] Can't reenable menus in Tkinter on Mac
New submission from Mark Bell: The following example is a Tkinter app with a button that toggles the menu being enabled based off of https://mail.python.org/pipermail/tkinter-discuss/2004-September/000204.html #--- from Tkinter import * root=Tk() def hello(): print('hello!') def toggle(): print('I think the menu bar is %s' % menubar.entrycget(0, 'state')) if menubar.entrycget(0, 'state')=='normal': print('disabling') menubar.entryconfig(0,state=DISABLED) print('disbled') else: print('enabling') menubar.entryconfig(0,state=NORMAL) print('enabled') menubar = Menu(root) submenu = Menu(menubar, tearoff=0) submenu.add_command(label='Hello', command=hello) menubar.add_cascade(label='test', menu=submenu) # this cascade will have index 0 in submenu b = Button(root, text='Toggle', command=toggle) b.pack() # display the menu root.config(menu=menubar) root.mainloop() #--- On Windows 7 and Ubuntu 14.04 (using Python 2.7.6 and Tkinter Revision 81008) clicking the button toggles the menu being enabled. However, on Mac 10.9 (again under Python 2.7.6 and Tkinter Revision 81008) the menu becomes stuck disabled. Additionally, the example also prints out what state it thinks the menu has (using entrycget) and it would print out that it thought that the menu was in fact alternating between enabled and disabled. Others have verified this being the case. See: http://stackoverflow.com/questions/24207870/cant-reenable-menus-in-python-tkinter-on-mac -- components: Tkinter files: tkinter_test.py messages: 220553 nosy: Mark.Bell priority: normal severity: normal status: open title: Can't reenable menus in Tkinter on Mac type: behavior versions: Python 2.7 Added file: http://bugs.python.org/file35628/tkinter_test.py ___ Python tracker <http://bugs.python.org/issue21757> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21810] SIGSEGV in PyObject_Malloc when ARENAS_USE_MMAP
New submission from John-Mark Bell: In low-memory scenarios, the Python 2.7 interpreter may crash as a result of failing to correctly check the return value from mmap in new_arena(). This changeset appears to be the point at which this issue was introduced: http://hg.python.org/cpython/rev/4e43e5b3f7fc Looking at the head of the 2.7 branch in Mercurial, we see the issue is still present: http://hg.python.org/cpython/file/cf70f030a744/Objects/obmalloc.c#l595 On failure, mmap will return MAP_FAILED ((void *) -1), whereas malloc will return NULL (0). Thus, the check for allocation failure on line 601 will erroneously decide that the allocation succeeded in the mmap case. The interpreter will subsequently crash once the invalid address is accessed. I've attached a potential fix for this issue. -- components: Interpreter Core files: obmalloc.diff keywords: patch messages: 221013 nosy: John-Mark.Bell priority: normal severity: normal status: open title: SIGSEGV in PyObject_Malloc when ARENAS_USE_MMAP type: crash versions: Python 2.7 Added file: http://bugs.python.org/file35694/obmalloc.diff ___ Python tracker <http://bugs.python.org/issue21810> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com