[issue47114] random.choice and random.choices have different distributions

2022-03-24 Thread Mark Bell


New submission from Mark Bell :

The docstring for `random.choices` indicates that
```
import random
random.choices(population, k=1)
```
should produce a list containing one item, where each item of `population` has 
equal likelihood of being selected. However `random.choices` draws elements for 
its sample by doing `population[floor(random() * len(population)]` and so 
relies on floating point numbers. Therefore not each item is equally likely to 
be chosen since floats are not uniformly dense in [0, 1] and this problem 
becomes worse as `population` becomes larger. 

Note that this issue does not apply to `random.choice(population)` since this 
uses `random.randint` to choose a random element of `population` and performs 
exact integer arithmetic. Compare 
https://github.com/python/cpython/blob/main/Lib/random.py#L371 and 
https://github.com/python/cpython/blob/main/Lib/random.py#L490

Could `random.choices` fall back to doing `return [choice(population) for _ in 
_repeat(None, k)]` if no weights are given? Similarly, is it also plausible to 
only rely on `random.randint` and integer arithmetic if all of the (cumulative) 
weights given to `random.choices` are integers?

--
components: Library (Lib)
messages: 415981
nosy: Mark.Bell
priority: normal
severity: normal
status: open
title: random.choice and random.choices have different distributions
versions: Python 3.11

___
Python tracker 
<https://bugs.python.org/issue47114>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue47114] random.choice and random.choices have different distributions

2022-03-24 Thread Mark Bell


Mark Bell  added the comment:

To give two more consequences of `random.choices` using floating point 
arithmetic:

1) When doing `random.choices([A, B, C], weights=[2**55, 1, 1])` the cumulative 
weight to bisect for is selected using `floor(random() * (2**55 + 1 + 1 + 
0.0))`. Since this is always even, it is impossible for `B` to be chosen.

2) When doing `random.choices([A, B], weights=[2**54, 1])` although the 
`cum_weights` is [18014398509481984, 18014398509481985] the `total` used by 
`random.choices` is `cum_weights[-1] + 0.0`. Due to floating point rounding 
this is 18014398509481984.0 and so is actually smaller than `cum_weights[-1]`. 
Therefore it is impossible for `B` to be chosen.

--

___
Python tracker 
<https://bugs.python.org/issue47114>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Mark Bell

Mark Bell  added the comment:

So I have taken a look at the original patch that was provided and I have been 
able to update it so that it is compatible with the current release. I have 
also flipped the logic in the wrapping functions so that they take a 
`keepempty` flag (which is the opposite of the `prune` flag). 

I had to make a few extra changes since there are now some extra catches in 
things like PyUnicode_Split which spot that if len(self) > len(sep) then they 
can just return [self]. However that now needs an extra test since that 
shortcut can only be used if len(self) > 0. You can find the code here: 
https://github.com/markcbell/cpython/tree/split-keepempty

However in exploring this, I'm not sure that this patch interacts correctly 
with maxsplit. For example, 
'   x y z'.split(maxsplit=1, keepempty=True)
results in
['', '', 'x', 'y z']
since the first two empty strings items are "free" and don't count towards the 
maxsplit. I think the length of the result returned must be <= maxsplit + 1, is 
this right?

I'm about to rework the logic to avoid this, but before I go too far could 
someone double check my test cases to make sure that I have the correct idea 
about how this is supposed to work please. Only the 8 lines marked "New case" 
show new behaviour, all the other come from how string.split works currently. 
Of course the same patterns should apply to bytestrings and bytearrays.

''.split() == []
''.split(' ') == ['']
''.split(' ', keepempty=False) == []# New case

'  '.split(' ') == ['', '', '']
'  '.split(' ', maxsplit=1) == ['', ' ']
'  '.split(' ', maxsplit=1, keepempty=False) == [' ']# New case

'  a b c  '.split() == ['a', 'b', 'c']
​'  a b c  '.split(maxsplit=0) == ['a b c  ']
​'  a b c  '.split(maxsplit=1) == ['a', 'b c  ']

'  a b c  '.split(' ') == ['', '', 'a', 'b', 'c', '', '']
​'  a b c  '.split(' ', maxsplit=0) == ['  a b c  ']
​'  a b c  '.split(' ', maxsplit=1) == ['', ' a b c  ']
​'  a b c  '.split(' ', maxsplit=2) == ['', '', 'a b c  ']
​'  a b c  '.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']
​'  a b c  '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c  ']
​'  a b c  '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' ']
​'  a b c  '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', '']

​'  a b c  '.split(' ', keepempty=False) == ['a', 'b', 'c']# New case
​'  a b c  '.split(' ', maxsplit=0, keepempty=False) == ['  a b c  ']# 
New case
​'  a b c  '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c  ']# 
New case
​'  a b c  '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c  ']   
 # New case
​'  a b c  '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' 
']# New case
​'  a b c  '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c']
# New case

--
nosy: +Mark.Bell

___
Python tracker 
<https://bugs.python.org/issue28937>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Mark Bell


Mark Bell  added the comment:

> suggests that empty strings don't count towards maxsplit

Thank you for the confirmation. Although just to clarify I guess you really 
mean "empty strings *that are dropped from the output* don't count towards 
maxsplit". Just to double check this, what do we expect the output of

'   x y z'.split(maxsplit=1, keepempty=True)

to be?

I think it should be ['', '  x y z'] since in this case we are retaining empty 
strings and they should count towards the maxsplit.

(In the current patch this crashes with a core dump since it tries to write to 
unallocated memory)

--

___
Python tracker 
<https://bugs.python.org/issue28937>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Mark Bell


Mark Bell  added the comment:

So I think I agree with you about the difference between .split() and .split(' 
'). However wouldn't that mean that
'   x y z'.split(maxsplit=1, keepempty=False) == ['x', 'y z']

since it should do one split.

--

___
Python tracker 
<https://bugs.python.org/issue28937>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-18 Thread Mark Bell


Change by Mark Bell :


--
pull_requests: +24839
pull_request: https://github.com/python/cpython/pull/26222

___
Python tracker 
<https://bugs.python.org/issue28937>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-05-20 Thread Mark Bell


Mark Bell  added the comment:

Thank you very much for confirming these test cases. Using these I believe that 
I have now been able to complete a patch that would implement this feature. The 
PR is available at https://github.com/python/cpython/pull/26222. As I am a 
first-time contributor, please could a maintainer approve running the CI 
workflows so that I can confirm that all the (new) tests pass.

--

___
Python tracker 
<https://bugs.python.org/issue28937>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-05 Thread Mark Bell


Mark Bell  added the comment:

Andrei: That is a very interesting observation, thank you for pointing it out. 
I guess your example / argument also currently applies to whitespace separation 
too. For example, if we have a whitespace separated string with contents:

col1 col2 col3
a b c

x y z

then using [row.split() for row in contents.splitlines()] results in
[['col1', 'col2', 'col3'], ['a', 'b', 'c'], [], ['x', 'y', 'z']]

However if later a user appends the row:

p  q

aiming to have p, and empty cell and then q then they will actually get

[['col1', 'col2', 'col3'], ['a', 'b', 'c'], [], ['x', 'y', 'z'], ['p', 'q']]

So at least this patch results in behaviour that is consistent with how split 
currently works. 

Are you suggesting that this is something that could be addressed by clearer 
documentation or using a different flag name?

--

___
Python tracker 
<https://bugs.python.org/issue28937>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28937] str.split(): allow removing empty strings (when sep is not None)

2021-06-05 Thread Mark Bell

Mark Bell  added the comment:

> Instead, the discussion was focused on removing *all* empty strings from the 
> result.

I imagine that the discussion focussed on this since this is precisely what 
happens when sep=None. For example, 'a b   c ​'.split() == ['a', 'b', 
'c']. I guess that the point was to provide users with explicit, manual control 
over whether the behaviour of split should drop all empty strings or retain all 
empty strings instead of this decision just being made on whether sep is None 
or not.

So I wonder whether the "expected" solution for parsing CSV like strings is for 
you to actually filter out the empty strings yourself and never pass them to 
split at all. For example by doing something like:

[line.split(sep=',') for line in content.splitlines() if line]

but if this is the case then this is the kind of thing that would require 
careful thought about what is the right name for this parameter / right way to 
express this in the documentation to make sure that users don't fall into the 
trap that you mentioned.

> Sorry that I bring this up only now when the discussion was finished and the 
> work on PR completed; I wish I had seen the issue sooner.

Of course, but the main thing is that you spotted this before the PR was merged 
:)

--

___
Python tracker 
<https://bugs.python.org/issue28937>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39755] Change example of itertools.product

2020-02-25 Thread Mark Bell


New submission from Mark Bell :

The documentation for itertools.product at: 
https://docs.python.org/3/library/itertools.html#itertools.product
currently says that:

For example, product(A, B) returns the same as ((x,y) for x in A for y in B)

While this is broadly correct, since product first converts its arguments to 
tuples, this is not true if A or B are infinite iterables. For example, when A 
= itertools.count() and B = range(2) then the former runs forever using 
infinite memory, whereas the latter returns the lazy generator immediately for 
use.

Would it be clearer / more correct to instead say:

For example, product(A, B) returns the same as ((x,y) for x in tuple(A) for 
y in tuple(B))

--
assignee: docs@python
components: Documentation
messages: 362672
nosy: Mark.Bell, docs@python
priority: normal
severity: normal
status: open
title: Change example of itertools.product
versions: Python 3.5

___
Python tracker 
<https://bugs.python.org/issue39755>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31141] Start should be a keyword argument of the built-in sum

2017-08-08 Thread Mark Bell

New submission from Mark Bell:

The built-in function sum takes an optional argument "start" to specify what 
value to start adding from (defaults to 0). This argument should be a keyword 
argument in order to match the other built-in functions such as:

enumerate(range(10), start=5)

This patch allows users to write:

sum(range(10), start=5)

which previously raised "TypeError: sum() takes no keyword arguments". Since 
the only change is making an optional positional argument into a keyword 
argument, this has no effect on any existing code using the current convention 
of:

sum(range(10), 5)

--
components: ctypes
messages: 299908
nosy: Mark.Bell
priority: normal
severity: normal
status: open
title: Start should be a keyword argument of the built-in sum
versions: Python 2.7, Python 3.7

___
Python tracker 
<http://bugs.python.org/issue31141>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31141] Start should be a keyword argument of the built-in sum

2017-08-08 Thread Mark Bell

Changes by Mark Bell :


--
type:  -> behavior

___
Python tracker 
<http://bugs.python.org/issue31141>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue31141] Start should be a keyword argument of the built-in sum

2017-08-16 Thread Mark Bell

Mark Bell added the comment:

I ran some timing tests of the patch I submitted to compare it to the current 
build of Python. Using timit on the current master branch I got:

python.exe -m timeit "sum(())" 1.12 usec per loop
python.exe -m timeit "sum((), 0)"  1.22 usec per loop

And for the patched version:

python.exe -m timeit "sum(())" 1.46 usec per loop
python.exe -m timeit "sum((), 0)"  1.57 usec per loop

However my patch wasn't just the simple argument clinic change suggested by 
serhiy.storchaka, so maybe that would be more efficient and easier to 
understand.

--

___
Python tracker 
<http://bugs.python.org/issue31141>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21757] Can't reenable menus in Tkinter on Mac

2014-06-14 Thread Mark Bell

New submission from Mark Bell:

The following example is a Tkinter app with a button that toggles the menu 
being enabled based off of 
https://mail.python.org/pipermail/tkinter-discuss/2004-September/000204.html

#---

from Tkinter import *

root=Tk()

def hello():
print('hello!')

def toggle():
print('I think the menu bar is %s' % menubar.entrycget(0, 'state'))
if menubar.entrycget(0, 'state')=='normal':
print('disabling')
menubar.entryconfig(0,state=DISABLED)
print('disbled')
else:
print('enabling')
menubar.entryconfig(0,state=NORMAL)
print('enabled')

menubar = Menu(root)
submenu = Menu(menubar, tearoff=0)
submenu.add_command(label='Hello', command=hello)
menubar.add_cascade(label='test', menu=submenu)

# this cascade will have index 0 in submenu
b = Button(root, text='Toggle', command=toggle)
b.pack()

# display the menu
root.config(menu=menubar)
root.mainloop()

#---

On Windows 7 and Ubuntu 14.04 (using Python 2.7.6 and Tkinter Revision 81008) 
clicking the button toggles the menu being enabled. However, on Mac 10.9 (again 
under Python 2.7.6 and Tkinter Revision 81008) the menu becomes stuck disabled. 

Additionally, the example also prints out what state it thinks the menu has 
(using entrycget) and it would print out that it thought that the menu was in 
fact alternating between enabled and disabled.

Others have verified this being the case. See: 
http://stackoverflow.com/questions/24207870/cant-reenable-menus-in-python-tkinter-on-mac

--
components: Tkinter
files: tkinter_test.py
messages: 220553
nosy: Mark.Bell
priority: normal
severity: normal
status: open
title: Can't reenable menus in Tkinter on Mac
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file35628/tkinter_test.py

___
Python tracker 
<http://bugs.python.org/issue21757>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21810] SIGSEGV in PyObject_Malloc when ARENAS_USE_MMAP

2014-06-19 Thread John-Mark Bell

New submission from John-Mark Bell:

In low-memory scenarios, the Python 2.7 interpreter may crash as a result of 
failing to correctly check the return value from mmap in new_arena().

This changeset appears to be the point at which this issue was introduced: 
http://hg.python.org/cpython/rev/4e43e5b3f7fc

Looking at the head of the 2.7 branch in Mercurial, we see the issue is still 
present: http://hg.python.org/cpython/file/cf70f030a744/Objects/obmalloc.c#l595

On failure, mmap will return MAP_FAILED ((void *) -1), whereas malloc will 
return NULL (0). Thus, the check for allocation failure on line 601 will 
erroneously decide that the allocation succeeded in the mmap case.

The interpreter will subsequently crash once the invalid address is accessed. 
I've attached a potential fix for this issue.

--
components: Interpreter Core
files: obmalloc.diff
keywords: patch
messages: 221013
nosy: John-Mark.Bell
priority: normal
severity: normal
status: open
title: SIGSEGV in PyObject_Malloc when ARENAS_USE_MMAP
type: crash
versions: Python 2.7
Added file: http://bugs.python.org/file35694/obmalloc.diff

___
Python tracker 
<http://bugs.python.org/issue21810>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com