On 04-Feb-10 09:58:36, Petr PIKAL wrote:
> Hi
> so do you think I shall fire a bug announcement? I think I rather
> wait to see if there is some reaction from others. Maybe, there
> is some reason behind such behaviour. Those simple statistics tend
> to behave differently when operating on data.frames so median is
> not such a huge surprise.
>
> see
>
> sd(df1), var(df1), mean(df1), max(df1), min(df1), range(df1)
>
> Produced results are usually clearly documented,

Yes, in the case of sd() and mean() it is clearly stated what happens
if the argument is a dataframe: it is the value of the function as
applied to each column separately.

For var() it is also clearly stated that when applied to a matrix
it returns the covariances between columns (with, presumably,
dataframes implicitly converted to matrix).

For max() and min() it is clearly stated "the maximum or minimum
of _all_ the values present in their arguments"; for range() it
is not so clear but is similar: "a vector containing the minimum
and maximum of all the given arguments", and you have to experiment
to verify that it is apparently intended to be the same as
c(min(...),max(...)):

  range(c(1,4,7),c(2,5,8),c(3,6,9))
  # [1] 1 9

However, for median there is no such statement, compared with
what is stated for mean():

'?mean':
  "For a data frame, a named vector with the appropriate method
   being applied column by column."

'?median':
  "The default method returns a length-one object of the same
   type as 'x'"
  (which is a bit cryptic).

It is possible that the behaviour of mean() with dataframes is
intended as an "add-on": If mean() is applied to a matrix, you
get the mean of all the values in the matrix. For dataframes,
there seems to be a special "mean" method which causes the standard
mean() to be applied separately to each column. This is not the
case with any of the other functions above.

Quite why mean() was specially designed in this way in the first
place is another question (presumably to match up with the behaviour
of sd(), so that you can represent each column of a dataframe by
its (mean,sd) pair??); but it was, and there it is, and it is useful.

> however for novice it is rather mysterious why using those functions
> on vector produce easily understandable results but using them on
> data.frame (which is most common structure of data) is far from
> consistent and intuitive.

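For a novice, the documented behaviours described above can be checked directly. The following is only a minimal sketch, assuming a small all-numeric data frame like df1 from Petr's example; the names pooled_range and cov_mat are illustrative, and results on data frames may differ between R versions.

```r
# Small all-numeric data frame, as in Petr's example
mat <- matrix(1:16, 4, 4)
df1 <- data.frame(mat)

# max(), min() and range() pool all the values in their arguments
max(df1)                    # largest of all 16 values
min(df1)                    # smallest of all 16 values
pooled_range <- range(df1)  # c(min, max) over all values
print(pooled_range)

# range() of several vectors agrees with c(min(...), max(...))
r <- range(c(1, 4, 7), c(2, 5, 8), c(3, 6, 9))
print(identical(r, c(min(c(1, 4, 7), c(2, 5, 8), c(3, 6, 9)),
                     max(c(1, 4, 7), c(2, 5, 8), c(3, 6, 9)))))

# var() on the matrix form returns the covariances between columns
cov_mat <- var(as.matrix(df1))
print(dim(cov_mat))         # one row and one column per data-frame column
```

Converting explicitly with as.matrix() before calling var() is a deliberate choice here: it makes the "matrix of covariances" interpretation visible instead of relying on an implicit conversion.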
>
> But I agree with you that mean and median in best case shall give
> similar results regarding results structure.

Absolutely! Mean and median are, from the interpretative point of
view, essentially the same: a "measure of central tendency", albeit
computed in different ways and with somewhat different properties.
But any user will expect that whenever a mean (or a set of means)
can be computed using mean(), a similar median (or set of medians)
would result from using median().

Of course, one way round this gross anomaly between mean() and
median() would be to ignore the special behaviour of mean() when
applied to dataframes, and simply use an appropriate "apply", just
as one would for sd(), var() (if interested in the variance of each
column), max(), min() and range(). And this would then work for
median().

But, despite all that, the fact that median() produces so meaningless
a result for a dataframe is undoubtedly a bug, in my opinion. Either
median() should produce the median of all the values present (like
max(), min()), or it should behave like mean() and sd(). I would
prefer the latter.

However, like you, I prefer to wait for comments from others before
a bug report is filed -- it is just possible that there is an
important reason why median() should behave as it does, though
I cannot imagine what it might be!

Ted.

> Regards
> Petr
>
> r-help-boun...@r-project.org wrote on 04.02.2010 10:28:16:
>
>> Well, I get the same as Petr with R version 2.10.0 (2009-10-26)
>> on Linux.
>>
>> To me, this suggests that median is broken! Any user would,
>> a priori, expect that median() should operate in exactly
>> the same way as mean().
>> To extend Petr's example:
>>
>>   mat <- matrix(1:32, 4,8)
>>   df1 <- data.frame(mat)
>>   mean(df1)
>>   #   X1   X2   X3   X4   X5   X6   X7   X8
>>   #  2.5  6.5 10.5 14.5 18.5 22.5 26.5 30.5
>>   median(df1)
>>   # [1] 14.5 18.5
>>
>> so (as in Petr's original example, but more clearly) median()
>> returns the medians of the two "central" columns X4 and X5 of df1.
>>
>> But that is with an even number of columns. Now look at what
>> happens with an odd number:
>>
>>   mat <- matrix(1:28, 4,7)
>>   df1 <- data.frame(mat)
>>   mean(df1)
>>   #   X1   X2   X3   X4   X5   X6   X7
>>   #  2.5  6.5 10.5 14.5 18.5 22.5 26.5
>>   median(df1)
>>   # structure(c("13", "14", "15", "16"), class = "AsIs")
>>   # 1 13
>>   # 2 14
>>   # 3 15
>>   # 4 16
>>
>> Wow!!!!!!!!!!
>>
>> This does suggest a tie-in with Petr's observation about "As.Is",
>> and there is no doubt at all that the above result is rubbish.
>> It is certainly not what a user would expect, and in the context
>> of Petr's intention to present R lessons to a class, I could
>> foresee students turning their backs on R if they came up with
>> such a result in their early encounters!
>>
>> Ted.
>>
>> On 04-Feb-10 08:59:59, Mario Valle wrote:
>> > Linux 2.9.0 gives:
>> >
>> > > median(df1)
>> > [1] 34
>> >
>> > Ever stranger...
>> > mario
>> >
>> > Petr PIKAL wrote:
>> >> During some experimentation in preparing R lessons I encountered
>> >> this behaviour which I can not explain fully
>> >>
>> >> mat <- matrix(1:16, 4,4)
>> >> df1 <- data.frame(mat)
>> >>
>> >> > mean(df1)
>> >>   X1   X2   X3   X4
>> >>  2.5  6.5 10.5 14.5
>> >>
>> >> Expected, documented
>> >>
>> >> > median(df1)
>> >> [1]  6.5 10.5
>> >>
>> >> Rather weird, AFAIK there shall not be an issue with data frame at
>> >> least I did not find any in help page. I tracked it down probably
>> >> to an As.Is operation with object and subsequent sorting in
>> >> median.default.
>> >>
>> >> I know other (*apply) ways how to compute median for data frames so
>> >> I just would like to hear an opinion about this behaviour from more
>> >> experienced people.
>> >>
>> >> Thank you
>> >> Best regards
>> >>
>> >> Petr
>> >>
>> >> ______________________________________________
>> >> R-help@r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >
>> > --
>> > Ing. Mario Valle
>> > Data Analysis and Visualization Group            | http://www.cscs.ch/~mvalle
>> > Swiss National Supercomputing Centre (CSCS)      | Tel: +41 (91) 610.82.60
>> > v. Cantonale Galleria 2, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82
>>
>> --------------------------------------------------------------------
>> E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk>
>> Fax-to-email: +44 (0)870 094 0861
>> Date: 04-Feb-10                                       Time: 09:28:13
>> ------------------------------ XFMail ------------------------------

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Feb-10                                       Time: 10:53:19
------------------------------ XFMail ------------------------------
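
P.S. For reference, the "apply" route mentioned in this thread can be sketched roughly as follows. This is only an illustration, not a statement of how median() itself is or should be implemented: it assumes all-numeric data frames like those in the examples above, and the names df_even, df_odd, med_even, med_odd, sd_even and med_all are made up for the sketch.

```r
# Data frames as in the examples in this thread
df_even <- data.frame(matrix(1:32, 4, 8))  # even number of columns
df_odd  <- data.frame(matrix(1:28, 4, 7))  # odd number of columns

# Column-wise medians: median() is called on each column vector in turn,
# so the result has one named entry per column
med_even <- sapply(df_even, median)
med_odd  <- sapply(df_odd, median)
print(med_even)
print(med_odd)

# The same pattern gives a per-column sd()
sd_even <- sapply(df_even, sd)
print(sd_even)

# Median of all values pooled together: flatten to a plain vector first
med_all <- median(unlist(df_even, use.names = FALSE))
print(med_all)
```

Calling median() on each column (a plain vector) sidesteps the question of what median() does with a data frame, so the sketch does not depend on the behaviour being debated here.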