[Tutor] binning data and calculating means of classes

Wolfgang Maier Wed, 23 Jul 2014 02:29:16 -0700

On 07/23/2014 03:36 AM, LN A-go-go wrote:
>
> with your help.  I have been working the last few days, I am sorry to
> say, unsuccessfully, to calculate the mean (that's easy), split the data
> into sub-groups or secondary means - which are the break values between
> 4 classes.  Create data-sets with incursive values.  I can do it with
> brute force (copy and paste) but need to rise to the pythonic way and
> use a while loop and a nested if-else structure.  My attempts have been
> lame enough that I don't even want to put them here.

A while loop with an if inside is indeed a very plausible solution, so it would be interesting to see your attempts.


> >>> int_list
> [36, 39, 39, 45, 61, 54, 61, 93, 62, 51, 47, 72, 54, 36, 62, 50, 41, 41,
> 40, 62, 62, 58, 57, 54, 49, 43, 47, 50, 45, 41, 54, 57, 57, 55, 62, 51,
> 34, 57, 55, 63, 45, 45, 42, 44, 34, 53, 67, 58, 56, 43, 33]
> >>> int_list.sort()
> >>> int_list
> [33, 34, 34, 36, 36, 39, 39, 40, 41, 41, 41, 42, 43, 43, 44, 45, 45, 45,
> 45, 47, 47, 49, 50, 50, 51, 51, 53, 54, 54, 54, 54, 55, 55, 56, 57, 57,
> 57, 57, 58, 58, 61, 61, 62, 62, 62, 62, 62, 63, 67, 72, 93]
> >>> flo_list = [float(integral) for integral in int_list]

While this last line shows that you've started using list comprehensions, which is a good thing, converting your data to floating point is not a good idea. It is completely unnecessary and (though probably not relevant here) can compromise the accuracy of calculations due to inherent rounding errors.

I guess you are doing this to prevent subsequent rounding of the result of sum(int_list)/len(int_list). This is a Python2-specific issue and, personally, I think that as a beginner you should use Python3, where (among other things) this is not a problem.

If you want to stick to Python2 for whatever reason then do:

from __future__ import division

after which the / operator returns a float even when both operands are integers, just as in Python3.

>>> sum(int_list)/len(int_list)
51.31372549019608
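
To make the difference concrete, here is a minimal sketch, assuming Python3 semantics. The numbers 2617 and 51 are the sum and length of int_list from this thread; the variable names are just for illustration.

```python
# Python3 semantics; on Python2 the same results need
# "from __future__ import division" at the top of the file.
total, count = 2617, 51       # sum(int_list) and len(int_list) from the thread

true_mean = total / count     # true division: the result is a float
floor_mean = total // count   # floor division: what plain "/" gives for ints in Python2

print(true_mean)    # 51.31372549019608
print(floor_mean)   # 51
```

So the integers can stay integers; only the division operator matters.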

> >>> flo_list
> [33.0, 34.0, 34.0, 36.0, 36.0, 39.0, 39.0, 40.0, 41.0, 41.0, 41.0, 42.0,
> 43.0, 43.0, 44.0, 45.0, 45.0, 45.0, 45.0, 47.0, 47.0, 49.0, 50.0, 50.0,
> 51.0, 51.0, 53.0, 54.0, 54.0, 54.0, 54.0, 55.0, 55.0, 56.0, 57.0, 57.0,
> 57.0, 57.0, 58.0, 58.0, 61.0, 61.0, 62.0, 62.0, 62.0, 62.0, 62.0, 63.0,
> 67.0, 72.0, 93.0]
> >>> sum(flo_list)
> 2617.0
> >>> totalnum = sum(flo_list)

Stop creating names (such as totalnum) that you are not going to use later!
It only confuses you and others.

> >>> len(flo_list)
> 51
> >>> mean = sum(flo_list)/len(flo_list)
> >>> mean
> 51.31372549019608

So, you know how to calculate the total mean. For the means of subsamples what you have to do is to apply that same logic to subsamples of the data, which you have to generate.

Without going through the lists of values several times, however, I cannot think of any simple implementation of this that does not involve plenty of novel concepts.

One fairly simple approach would be through a while loop as you suggested, but as said before, for loops are often more elegant in Python. I guess the following code is roughly what you had in mind?


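breakpoints = [your_list_of breakpoints]  # placeholder: bin limits in ascending order
large_value_buffer = []  # holds the value that ended the previous bin
int_list_iter = iter(int_list) # see comment below
for breakpoint in breakpoints:
    # start the new bin with the value (if any) carried over from the last one
    sublist = large_value_buffer
    for value in int_list_iter:
        if value < breakpoint:
            sublist.append(value)
            if large_value_buffer:
                # sublist and the buffer are still the same list object here,
                # so rebind the buffer to a fresh list for the next carry-over
                large_value_buffer = []
        else:
            if sublist:
                print(sum(sublist)/len(sublist))
                large_value_buffer.append(value)
            break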

You should know all elements of this small program except the iter(int_list). This gives you a one-time iterator, which cannot be reused or reset, to use in the inner for loop. Because the iterator remembers its position, the inner loop does not start from the beginning of the list every time.
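
A small sketch of that iterator behaviour, using a short made-up sample rather than the full data (the names values, first and rest are only for illustration):

```python
values = [33, 34, 36, 41, 45, 50]   # small illustrative sample, not the full data
it = iter(values)                   # one-time iterator over the list

first = []
for v in it:
    if v >= 40:
        break          # 41 has already been taken from the iterator here
    first.append(v)

rest = list(it)        # continues after 41, not at the start of the list

print(first)   # [33, 34, 36]
print(rest)    # [45, 50]
print(v)       # 41
```

Note that the value which ends the first loop (41) is gone from the iterator by the time the loop breaks, so it has to be stored somewhere if it is still needed.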

Since this is probably too complicated for you to work out by yourself at this stage, I decided to give you the complete code, but make sure you understand what it does; especially think about what the large_value_buffer is doing.

One problem with this code is that it silently skips empty bins. Maybe that's something for you to work on?
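
For comparison, here is one possible sketch that reports empty bins explicitly. It is a different technique from the loop above: it uses bisect_right from the standard library to pick the bin for each value. The function name bin_means, the sample values and the breakpoints are all made up for illustration, and Python3 division is assumed.

```python
from bisect import bisect_right

def bin_means(values, breakpoints):
    """Return one mean per bin, or None for an empty bin.

    Bin i holds the values v with breakpoints[i-1] <= v < breakpoints[i];
    the last bin holds everything >= breakpoints[-1].
    breakpoints must be sorted in ascending order (an assumption of this sketch).
    """
    bins = [[] for _ in range(len(breakpoints) + 1)]
    for v in values:
        # bisect_right counts the breakpoints that are <= v,
        # which is exactly the index of the bin v belongs to
        bins[bisect_right(breakpoints, v)].append(v)
    return [sum(b) / len(b) if b else None for b in bins]

sample = [34, 36, 41, 45, 62]     # illustrative values, not the full data
limits = [30, 40, 50, 60]         # hypothetical breakpoints
print(bin_means(sample, limits))  # [None, 35.0, 43.0, None, 62.0]
```

Unlike the loop above, this does not need the values to be sorted, and it also covers values at or above the last breakpoint.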


Best,
Wolfgang

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
