[Rd] stringsAsFactors and type.convert()
If read.table() is defaulting to "character" instead of "factor" data type, shouldn't type.convert() also default to "character" in R 4.0.0?

This would seem like a good time to change the default to type.convert(as.is=TRUE), to align it with the new default in read.table() and data.frame(). I think many R >= 4.0.0 users would be happy with as.is=TRUE as the default in type.convert().

I'm happy to work on the patch and run tests if that is helpful.

Cheers,
Arni

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
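[Editor's note: for readers following along, the behavior at issue can be sketched as below. Note that as.is only affects strings that cannot be converted to logical, numeric, or complex; purely numeric strings are converted either way.]

```r
x <- c("apple", "banana", "apple")
type.convert(x, as.is = FALSE)  # factor with levels "apple" "banana"
type.convert(x, as.is = TRUE)   # character vector, unchanged

type.convert(c("1", "2", "3"), as.is = TRUE)  # integer vector 1 2 3
```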
Re: [Rd] stringsAsFactors and type.convert()
Dear Martin,

Thank you for the well-reasoned response. I realized I was rather late to make this suggestion for 4.0.0, changing a somewhat low-level function that can indeed affect packages.

I was just reviewing some R user scripts that were using type.convert(), mainly on data frames. In all cases, people were passing as.is=TRUE, so I was reminded that I would not be the only user who would appreciate it if as.is=TRUE becomes the default at some point. So I am happy to hear that the help page now mentions that as.is=TRUE is planned to become the default at some point in the future.

Looking forward to the 4.0.0 official release - all positive changes!

All the best,
Arni

From: Martin Maechler
Sent: Monday, April 20, 2020 6:23:31 PM
To: Arni Magnusson
Cc: r-devel@r-project.org
Subject: Re: [Rd] stringsAsFactors and type.convert()

>>>>> Arni Magnusson
>>>>>     on Mon, 13 Apr 2020 22:20:19 + writes:

    > If read.table() is defaulting to "character" instead of "factor" data
    > type, shouldn't type.convert() also default to "character" in R 4.0.0?

    > This would seem like a good time to change the default to
    > type.convert(as.is=TRUE), to align it with the new default in
    > read.table and data.frame. I think many R >=4.0.0 users would be happy
    > with as.is=TRUE as the default in type.convert.

    > I'm happy to work on the patch and run tests if that is helpful.

    > Cheers,
    > Arni

Dear Arni,

thank you for the notice, which unfortunately wasn't noticed (Easter break etc) and was too late in any case to fulfill the criterion of a small trivial bug fix for R 4.0.0 beta (very close to becoming RC = "Release Candidate").

Even when type.convert() may not be used much directly (but rather indirectly via read.table(), where there's no problem), we found it too risky to destabilize the R 4.0.0 prereleases. As you all know, there were ( / are?) still package changes needed and a few other important "todo"s, so we had to decide to postpone this (even for R-devel) to after releasing R 4.0.0 coming Friday.
I've committed a change to the help page which does mention that the default for 'as.is' is planned to be changed.

Also, the help page's "Details" section has for a long time ended with:

    Since this is a helper function, the caller should always pass an
    appropriate value of 'as.is'.

If useRs and package authors have followed this advice, they won't be bitten at all.

Best regards,
Martin
[Rd] Drop single-dimensional array
Hi Simon,

Thank you for the concise reply. Do you mean the reported behavior of drop() is not a bug? It looks like a borderline bug to me (see below), but I'm not the judge of that. If this is the intended behavior and serves an actual purpose, then that could be explicitly documented in a \note{} on the help page. Such a note would slightly reduce the surprise of users running into this behavior.

This is related to the oddity that one-dimensional arrays are:

    array(month.abb, dim=c(1,1,12))  # array
    array(month.abb, dim=c(1,12))    # matrix
    array(month.abb, dim=12)         # array

Firstly, one would expect the pattern to be array-matrix-vector. Secondly, it's easy to drop() the three-dimensional and two-dimensional objects, but drop() does nothing to the one-dimensional array. Instead, it takes an unintuitive combination of methods to convert a single-dimensional array to a vector, while retaining its names. Or I may well be missing something obvious.

Best regards,
Arni

On Wed, 8 Sep 2010, Simon Urbanek wrote:

    wrong address - did you mean R-devel?
    Simon

    On Sep 6, 2010, at 8:35 AM, Arni Magnusson wrote:

        Bug or not, I was surprised by this behavior:

            x <- tapply(chickwts$weight, chickwts$feed, median)
            x                   # array ... I'd like to convert to vector
                                # with named elements
            drop(x)             # what, still an array?
            drop(as.matrix(x))  # this works
            drop(t(x))          # this works

        I was expecting drop(x) to return a vector, and I suspect many R
        users would too. The title in help(drop), "Drop Redundant Extent
        Information", suggests that such a simple array would be converted
        to a vector. Reading through the help page, I note that this is
        perhaps not a clear bug, but somewhat unclear behavior.

        The most compact way to break the vector out of its eggshell seems
        to be

            t(x)[,]

        but drop(x) would be much easier to read and write. There's nothing
        particularly matrix about x, so it's not obvious that the conversion
        should involve as.matrix(x).
Thanks,
Arni
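[Editor's note: a minimal reproduction of the behavior discussed in this thread, runnable in a plain R session. drop() only removes dimensions of extent 1, which is why the one-dimensional array of length 6 passes through unchanged.]

```r
x <- tapply(chickwts$weight, chickwts$feed, median)
class(x)                   # "array" - one dimension, names stored in dimnames
class(drop(x))             # still "array": dim is 6, no extent-1 dimension to drop
class(drop(as.matrix(x)))  # "numeric" - a named vector, as desired

## the two workarounds from the thread give the same named vector
identical(t(x)[,], drop(as.matrix(x)))
```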
[Rd] Statistical mode
One descriptive statistic that is conspicuously missing from core R is the statistical mode - the most frequent value in a discrete distribution. I would like to propose adding the attached 'statmode' (or a similar function) to the 'stats' package.

Currently, it can be quite cumbersome to calculate the mode of a distribution in R, both for experts and beginners. The lack of a function to do this is felt both when teaching introductory R courses and when using sapply() or the like.

Looking forward to your feedback,
Arni

statmode <- function(x, all=FALSE, ...)
{
  if(is.list(x))
  {
    output <- sapply(x, statmode, all=all, ...)
  }
  else
  {
    freq <- table(x, ...)
    if(all)
      output <- names(freq)[freq==max(freq)]
    else
      output <- names(freq)[which.max(freq)]
    ## Coerce to original data type, using any() to handle mts, xtabs, etc.
    if(any(class(x) %in% c("integer","numeric","ts","complex","matrix","table")))
      output <- as(output, storage.mode(x))
  }
  return(output)
}

\name{statmode}
\alias{statmode}
\title{Statistical Mode}
\description{
  Compute the statistical mode, the most frequent value in a discrete
  distribution.
}
\usage{
statmode(x, all = FALSE, \dots)
}
\arguments{
  \item{x}{an \R object, usually a vector, matrix, or data frame.}
  \item{all}{whether all statistical modes should be returned.}
  \item{\dots}{further arguments passed to the \code{\link{table}} function.}
}
\details{
  The default is to return only the first statistical mode.
}
\value{
  The most frequent value in \code{x}, possibly a vector or list, depending
  on the class of \code{x} and whether \code{all=TRUE}.
}
\seealso{
  \code{\link{mean}}, \code{\link{median}}, \code{\link{table}}.

  \code{\link{density}} can be used to compute the statistical mode of a
  continuous distribution.
}
\examples{
## Different location statistics
fw <- faithful$waiting
hist(fw)
barplot(table(fw))
mean(fw)
median(fw)
statmode(fw)
plot(density(fw))
with(density(fw), x[which.max(y)])

## Different classes
statmode(chickwts$feed)  # factor
statmode(volcano)        # matrix
statmode(discoveries)    # ts
statmode(mtcars)         # data frame

## Multiple modes
table(mtcars$carb)
statmode(mtcars$carb)
statmode(mtcars$carb, TRUE)
statmode(mtcars, TRUE)
}
\keyword{univar}
Re: [Rd] Statistical mode
Thank you, Kevin, for the feedback.

1. The mode is not so interesting for continuous data. I would much rather use something like density().

Absolutely. The help page for statmode() says it is for discrete data, and points to density() for continuous data.

2. Both the iris and barley data sets are balanced (each factor level appears equally often), and the current output from the statmode function is misleading by only showing one level.

Try statmode(iris,TRUE). It points out that petal lengths 1.4 and 1.5 are equally common in the data. I decided to make all=FALSE the default behavior, but I'd be equally happy with all=TRUE as the default. As for the barley data, statmode(barley,TRUE) is just the honest answer. The yield is continuous, so the discrete mode is not of interest, and the factor levels are all equally common, as you point out.

3. I think the describe() function in the Hmisc package is much more useful and informative, even for introductory stat classes. I always use describe() after importing data into R.

The describe() function is a verbose summary, usually of a data frame. The statmode() function is the discrete mode, usually of a vector. Importantly, describe(faithful$waiting) points out the mean, median, and range, but not the mode.

---

Allow me to include two more valid comments, from Sarah Goslee and David Winsemius, respectively:

4. The 'modeest' package does this and more, see for example mfv().

I think core R should come with a basic function to get the mode of a discrete vector. One option would be to lift mfv() into the 'stats' package, but something like statmode() could also cover factors and strings. Might as well provide all=TRUE/FALSE functionality too, and retain integers as integers. It's common to find rudimentary basic functionality in the 'stats' package and dedicated packages for more details; time series models and robust statistics come to mind. The 'modeest' package is impressive indeed.

5.
Isn't this just table(Vec)[which.max(table(Vec))]?

Yes it is, only less cumbersome. Much like sd(Vec) is less cumbersome than sqrt(var(Vec)). Moreover, I find it confusing to see the count as well,

    table(volcano)[which.max(table(volcano))]
    # 110
    # 177

although this can be debated. Finally, I think the examples

    statmode(mtcars)
    statmode(mtcars, TRUE)

demonstrate practical functionality beyond table(Vec)[which.max(table(Vec))].

The mean, median, and mode are often mentioned together as fundamental descriptive statistics, and I just find it odd that statmode() is not already in core R. Sure, we could get by without the sd() function in core R, but why should we?

All the best,
Arni
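[Editor's note: the base-R idiom debated above, written out for comparison. The second line shows the extra step needed to recover the mode as a plain value of the original type, which is the bookkeeping the proposed statmode() would hide.]

```r
x <- c(3, 7, 7, 2, 7, 3)
tab <- table(x)
tab[which.max(tab)]                # shows the count alongside the value
as.numeric(names(which.max(tab)))  # 7 - the mode itself, coerced back to numeric
```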
[Rd] Proposed diff.character() method
Dear R developers,

Recently, I was busy comparing different versions of several packages. Tired of going back and forth between R and diff, I created a simple file comparison function in R that I found quite useful. For an efficient and familiar interface I called it diff.character() and ran things like:

    diff("old/R/foo.R", "new/R/foo.R")

Before long, I found the need for a directory-wide comparison and added support for:

    diff("old/R", "new/R")

I have now revisited and fine-polished this function to a point where I'd like to humbly suggest that diff.character() could be incorporated into the base package. See the attached files and patch based on the current SVN trunk. It can be tested quickly by sourcing diff.R, or by building R.

The examples in diff.character.html are somewhat contrived, in the absence of good example files to compare. You will probably have better example files to compare from your own work.

Clearly, the functionality differs considerably from the default diff() method that operates on a single x vector, but in the broad sense, they're both about showing differences. For most programmers, calling diff() on two files or directories is already part of muscle memory, both intuitive and efficient.

There are a couple of CRAN packages (diffobj, diffR) that can compare files but not directories. They have package dependencies and return objects that are more complex (S4, HTML) than the plain list returned by diff.character(). This basic utility by no means competes with Meld, Kompare, Emacs ediff, or other feature-rich diff applications, and using setdiff() as a basis for file comparison can be a somewhat simplistic approach. Nevertheless, I think many users may find this a handy tool to quickly compare scripts and data files.

The method could be implemented differently, with fewer or more features, and I'm happy to amend according to the R Core Team decision. In the past, I have proposed additions to core R, some rejected and others accepted.
This proposal fits a useful tool into a currently vacant diff.character() method at a low cost, using relatively few lines of base function calls and no compiled code. Its acceptance will probably depend on whether members of the R Core Team and/or CRAN Team might see it as a useful addition to their toolkit for interactive and scripted workflows, including R and CRAN maintenance.

All the best,
Arni

Compare Files

Description:

     Show differences between files or directories.

Usage:

     ## S3 method for class 'character'
     diff(x, y, file = NULL, ignore = NULL, lines = FALSE,
          short = TRUE, similar = FALSE, simple = TRUE, trimws = FALSE, ...)

Arguments:

       x: a file or directory name.

       y: another file or directory name.

    file: if ‘x’ and ‘y’ are directories, then ‘file’ can be used to
          select a specific file that exists in both directories.

  ignore: patterns (regular expressions) to exclude from the output.

   lines: if ‘x’ and ‘y’ are directories, then ‘lines = TRUE’ compares
          the contents (lines) of files that exist in both directories,
          instead of listing filenames that are different between the
          directories.

   short: whether to produce short file paths for the output.

 similar: whether to show similarities instead of differences.

  simple: whether to replace ‘character(0)’ with ‘NULL’ in output, for
          compact display.

  trimws: whether to trim whitespace and exclude empty strings.

     ...: passed to ‘readLines’.

Details:

     When comparing directories, two kinds of differences can occur:
     (1) filenames existing in one directory and not the other, and
     (2) files containing different lines of text. The purpose of the
     ‘lines’ argument is to select which of those two kinds of
     differences to show.

     If ‘x’ and ‘y’ are files (and not directories), the ‘file’ and
     ‘lines’ arguments are not applicable and will be ignored.

Value:

     List showing differences as strings, or similarities if
     ‘similar = TRUE’.

Note:

     This function uses ‘setdiff’ for the comparison, so line order,
     line numbers, and repeated lines are ignored.
     Subdirectories are excluded when comparing directories.

     This function has very basic features compared to full GUI
     applications such as WinMerge (Windows), Meld (Linux, Windows),
     Kompare (Linux), Ediff (Emacs), or the ‘diff’ shell command. The
     use of full GUI applications is recommended, but what this
     function offers in addition is:

     • a quick diff tool that is handy during an interactive R session,

     • a programmatic interface to analyze file differences as native R
       objects, and

     • a tool that works on all platforms, regardless of what software
       may be installed.

     The ‘short’ and ‘simple’ defaults are designed for interactive
     (human-readable) use, while ‘short = FALSE’ and ‘simple = FALSE’
     produc
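[Editor's note: the core comparison described in the Note above can be sketched in a few lines. This is not the proposed patch - the function name and return-value names below are illustrative only - but it shows the setdiff() approach the help page documents.]

```r
## Minimal sketch of a setdiff-based file comparison (illustrative only)
diff_files <- function(x, y, ...)
{
  a <- readLines(x, ...)
  b <- readLines(y, ...)
  ## setdiff ignores line order, line numbers, and repeated lines,
  ## as stated in the Note above
  list(removed = setdiff(a, b),  # lines in x but not in y
       added   = setdiff(b, a))  # lines in y but not in x
}
```

For example, diff_files("old/R/foo.R", "new/R/foo.R") would return a two-element list of differing lines.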
[Rd] ASCIIfy() - a proposal for package:tools
Hi all,

I would like to propose the attached function ASCIIfy() to be added to the 'tools' package.

Non-ASCII characters in character vectors can be problematic for R packages, but sometimes they cannot be avoided. To make packages portable and build without 'R CMD check' warnings, my solution has been to convert problematic characters in functions and datasets to escaped ASCII, so plot(1,main="São Paulo") becomes plot(1,main="S\u00e3o Paulo").

The showNonASCII() function in package:tools is helpful to identify R source files where characters should be converted to ASCII one way or another, but I could not find a function to actually perform the conversion to ASCII. I have written the function ASCIIfy() to convert character vectors to ASCII.

I imagine other R package developers might be looking for a similar tool, and it seems to me that package:tools is the first place they would look, where the R Core Team has provided a variety of tools for handling non-ASCII characters in package development. I hope the R Core Team will adopt ASCIIfy() into the 'tools' package, to make life easier for package developers outside the English-speaking world. I have of course no problem with them renaming or rewriting the function in any way.

See the attached examples - all in flat ASCII that was prepared using the function itself! The main objective, though, is to ASCIIfy functions and datasets, not help pages.

Arni

ASCIIfy <- function(string, bytes=2, fallback="?")
{
  bytes <- match.arg(as.character(bytes), c("1","2"))
  convert <- function(char)  # convert to ASCII, e.g. "z", "\xfe", or "\u00fe"
  {
    raw <- charToRaw(char)
    if(length(raw)==1 && raw<=127)        # 7-bit
      ascii <- char
    else if(length(raw)==1 && bytes==1)   # 8-bit to \x00
      ascii <- paste0("\\x", raw)
    else if(length(raw)==1 && bytes==2)   # 8-bit to \u0000
      ascii <- paste0("\\u", chartr(" ","0",formatC(as.character(raw),width=4)))
    else if(length(raw)==2 && bytes==1)   # 16-bit to \x00, if possible
    {
      if(utf8ToInt(char) <= 255)
        ascii <- paste0("\\x", format.hexmode(utf8ToInt(char)))
      else
      {
        ascii <- fallback
        warning(char, " could not be converted to 1 byte")
      }
    }
    else if(length(raw)==2 && bytes==2)   # UTF-8 to \u0000
      ascii <- paste0("\\u", format.hexmode(utf8ToInt(char), width=4))
    else
    {
      ascii <- fallback
      warning(char, " could not be converted to ", bytes, " bytes")
    }
    return(ascii)
  }
  if(length(string) > 1)
  {
    sapply(string, ASCIIfy, bytes=bytes, fallback=fallback, USE.NAMES=FALSE)
  }
  else
  {
    input <- unlist(strsplit(string, ""))  # "c" "a" "f" "\u00e9"
    output <- character(length(input))     # ""  ""  ""  ""
    for(i in seq_along(input))
      output[i] <- convert(input[i])       # "c" "a" "f" "\\u00e9"
    output <- paste(output, collapse="")   # "caf\\u00e9"
    return(output)
  }
}

\name{ASCIIfy}
\alias{ASCIIfy}
\title{Convert Characters to ASCII}
\description{
  Convert a character vector to ASCII, replacing non-ASCII characters with
  single-byte (\samp{\x00}) or two-byte (\samp{\u}) codes.
}
\usage{
ASCIIfy(x, bytes = 2, fallback = "?")
}
\arguments{
  \item{x}{a character vector, possibly containing non-ASCII characters.}
  \item{bytes}{either \code{1} or \code{2}, for single-byte (\samp{\x00}) or
    two-byte (\samp{\u}) codes.}
  \item{fallback}{an output character to use, when input characters cannot be
    converted.}
}
\value{
  A character vector like \code{x}, except non-ASCII characters have been
  replaced with \samp{\x00} or \samp{\u} codes.
}
\author{Arni Magnusson.}
\note{
  To render single backslashes, use these or similar techniques:
  \verb{
  write(ASCIIfy(x), "file.txt")
  cat(paste(ASCIIfy(x), collapse="\n"), "\n", sep="")}

  The resulting strings are plain ASCII and can be used in R functions and
  datasets to improve package portability.
}
\seealso{
  \code{\link[tools]{showNonASCII}} identifies non-ASCII characters in a
  character vector.
}
\examples{
cities <- c("S\u00e3o Paulo", "Reykjav\u00edk")
print(cities)
ASCIIfy(cities, 1)
ASCIIfy(cities, 2)

athens <- "\u0391\u03b8\u03ae\u03bd\u03b1"
print(athens)
ASCIIfy(athens)
}
\keyword{}
Re: [Rd] ASCIIfy() - a proposal for package:tools
Thanks Duncan, for considering ASCIIfy. I understand your reasoning.

This is a recurring pattern - I propose functions for core R, and Greg catches them from freefall :) I'm delighted with ASCIIfy being hosted in gtools. The R and Rd should be ready as is.

Cheers,
Arni
Re: [Rd] Model frame when LHS is cbind (PR#14189)
Thank you Prof. Ripley, for examining this issue. I have two more questions on this topic, if I may.

(1) Truncated column names

With your explanations I can see that the problem of missing column names originates in cbind() and the 'deparse.level' bug we have just discovered. I had tried different 'deparse.level' values, only to see that it didn't solve my problem of missing column names.

    attach(mtcars)
    cbind(qsec,log(hp),sqrt(disp), deparse.level=0)             # no column names
    cbind(qsec,log(hp),sqrt(disp), deparse.level=1)             # qsec only
    cbind(qsec,log(hp),sqrt(disp), deparse.level=2)             # no column names
    cbind(qsec=qsec,log(hp),sqrt(disp), deparse.level=2)        # works!
    cbind(qsec=qsec,log(hp),sqrt(abs(disp)), deparse.level=2)   # hmm...

Now a new question arises. The last line generates these truncated column names

    "qsec" "log(hp)" "sqrt(abs(d..."

where the dots are not mine, but something that R decided to do, presumably to keep the column names no longer than 13 characters. I would prefer to retain the full column names, like this,

    as.matrix(data.frame(qsec,log(hp),sqrt(abs(disp)), check.names=FALSE))

where the column names are

    "qsec" "log(hp)" "sqrt(abs(disp))"

Is there some reason why cbind() should truncate column names? Matrices have no problems with very long column names.

(2) Changing the default 'deparse.level' to 2

Furthermore, since many users appreciate the compact model formula syntax in R, it would be great if the formula

    cbind(qsec, log(hp), sqrt(disp)) ~ wt

would result in a model frame with full column names, without sacrificing legibility by adding deparse.level=2 in between the variable names. The simplest way to achieve this would be by changing the default value of 'deparse.level' to 2 in cbind() and probably rbind().

Am I missing some important cases where functions/users rely on some of the column names to be missing, as generated by deparse.level=1? And if so, do these cases outweigh the benefits of clean and compact formula syntax when modelling?
Many thanks,
Arni

On Thu, 21 Jan 2010, Prof Brian Ripley wrote:

    A few points.

    0) This seems a Wishlist item, but it does not say so (see the section on
    BUGS in the FAQ).

    1) A formula does not need to have an lhs, and it is an assumption that
    the response is the first element of 'variables' (an assumption not made a
    couple of lines later when 'resp' is used).

    2) I don't think this is the best way to get names. If I do

        fm <- lm(cbind(a=qsec,b=log(hp),sqrt(disp))~wt, data=mtcars)

    I want a and b as names, but that is not what your code gives. And if I do

        X <- with(mtcars, cbind(a = qsec, b = log(hp), c=sqrt(disp)))
        fm <- lm(X ~ wt, data=mtcars)
        model.frame(fm)[[1]]
             [,1] [,2] [,3]

    You've lost the names that the current code gives. The logic is that if
    you use a lhs which is a matrix with column names, then those names are
    used. If (as you did) you use one with empty column names, that is what
    you get in the model frame. This seems much more in the spirit of R than
    second-guessing that the author actually meant to give column names and
    create them, let alone renaming the columns to be different than the names
    supplied.

    3) It looks to me as if you wanted

        cbind(qsec, log(hp), sqrt(disp), deparse.level=2)

    but that does not give names (despite the description). And that is, I
    think, a bug that can easily be changed. That way we can fulfil your wish
    without breaking other people's code.

    --
    Brian D. Ripley, rip...@stats.ox.ac.uk
    Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
    University of Oxford,     Tel: +44 1865 272861 (self)
    1 South Parks Road,            +44 1865 272866 (PA)
    Oxford OX1 3TG, UK        Fax: +44 1865 272595
[Rd] Rounding multinomial proportions
\eqn{n}--0.4. Favors big parties slightly.
    }
    \item{\code{"SL"}}{
      Sainte-Laguë. Involves the series 1, 3, 5, \ldots, 2\eqn{n}--1. Does
      not favor big or small parties. A reasonable default method for
      rounding multinomial proportions.
    }
  }
}
\value{
  Vector of same length as \code{x}, with rounded numbers whose sum is one
  (or integers whose sum is \code{seats}).
}
\note{
  d'Hondt is used to allocate parliamentary seats in most of Europe and South
  America, East Timor, Israel, Japan, and Turkey. Modified Sainte-Laguë is
  used in Norway and Sweden. Sainte-Laguë is used in Bosnia and Herzegovina,
  Kosovo, Latvia, and New Zealand.
}
\author{Arni Magnusson.}
\references{
  Balinski, M. and V. Ramírez. 1999. Parametric methods of apportionment,
  rounding and production. \emph{Mathematical Social Sciences}
  \bold{37}:107--122.

  Diaconis, P. and D. Freedman. 1979. On rounding percentages. \emph{Journal
  of the American Statistical Association} \bold{74}:359--364.

  Mosteller, F., C. Youtz, and D. Zahn. 1967. The distribution of sums of
  rounded percentages. \emph{Demography} \bold{4}:850--858.

  \url{http://en.wikipedia.org/wiki/Highest_averages_method} (accessed 10 Feb
  2010).
}
\seealso{
  \code{\link{round}} is the standard rounding function in \R.

  \code{\link{rmultinom}} generates random multinomial vectors.
}
\examples{
## Guarantee that rounded multinomial proportions sum to 1:
a <- c(67630, 116558, 207536, 251555, 356721)
p <- a / sum(a)
p1 <- round(p, 3)           # 0.068 0.117 0.208 0.252 0.357
sum(p1)                     # 1.002, no good
p2 <- round_multinom(p, 3)  # 0.068 0.117 0.207 0.251 0.357
sum(p2)                     # 1

## The multinomial "proportions" can also be raw counts (a, not p):
p3 <- round_multinom(a, 3)  # 0.068 0.117 0.207 0.251 0.357
sum(p3)                     # 1

## Allocate 9 seats from 178 votes using different methods:
votes <- c(red=66, green=80, blue=32)
round_multinom(votes, seats=9, method="DH")   # 4 4 1
round_multinom(votes, seats=9, method="MSL")  # 3 4 2
round_multinom(votes, seats=9, method="SL")   # 3 4 2
}
\keyword{arith}
\keyword{distribution}

round_multinom <- function(x, digits=NULL, seats=NULL, method="SL",
                           labels=names(x))
{
  method <- match.arg(toupper(method), c("DH","MSL","SL"))
  if(is.null(digits) && is.null(seats) || !is.null(digits) && !is.null(seats))
    stop("Please pass a value as 'digits' or 'seats', not both")
  if(!is.null(digits))
  {
    if(digits<0 || digits>6)
      stop("Please pass a positive value (0-6) as 'digits'")
    n <- as.integer(10^digits)
  }
  else
    n <- seats
  party <- seq_along(x)
  series <- switch(method,
                   DH  = 1 + 1  *(seq_len(n)-1),  # 1, 2, 3, ..., n
                   MSL = 1 + 1.4*(seq_len(n)-1),  # 1, 2.4, 3.8, ..., 1.4n-0.4
                   SL  = 1 + 2  *(seq_len(n)-1))  # 1, 3, 5, ..., 2n-1
  output <- data.frame(party=rep(party, each=n),
                       score=as.numeric(sapply(x, function(votes) votes/series)))
  output <- factor(output$party)[order(-output$score)][seq_len(n)]
  output <- as.integer(table(output))
  if(!is.null(digits))
    output <- output / n
  names(output) <- labels
  return(output)
}
Re: [Rd] Rounding multinomial proportions
Ugh, I made a typo at the very heart of my message:

    "when I preprocess each line in R as p<-a/sum(a), occasionally a line
    will sum to 0.999, 1.002, or the like"

should be

    "when I preprocess each line in R as p<-round(a/sum(a),3), occasionally
    a line will sum to 0.999, 1.002, or the like"

Also, the first paragraph should end with "where the other multinomial functions reside."

Revision 2,
Arni

On Thu, 11 Feb 2010, Arni Magnusson wrote:

    I present you with a function that solves a problem that has bugged me
    for many years. I think the problem may be general enough to at least
    consider adding this function, or a revamped version of it, to the
    'stats' package, with the other multinomial functions reside.

    I'm using R to export data to text files, which are input data for an
    external model written in C++. Parts of the data are age distributions,
    in the form of relative frequency in each year:

        Year  Age1   Age2   ...  Age10
        1980  0.123  0.234  ...  0.001
        ...   ...    ...    ...  ...

    Each row should sum to exactly 1. The problem is that when I preprocess
    each line in R as p<-a/sum(a), occasionally a line will sum to 0.999,
    1.002, or the like. This could either crash the external model or lead
    to wrong conclusions. I believe similar partitioning is commonly used in
    a wide variety of models, making this a general problem for many
    modellers.

    In the past, I have checked every line manually, and then arbitrarily
    tweaked one or two values up or down to make the row sum to exactly one,
    but two people would tweak differently. Another semi-solution is to
    write the values to the text file in a very long format, but this would
    (1) make it harder to visually check the numbers and (2) the numbers in
    the article or report would no longer match the data files exactly, so
    other scientists could not repeat the analysis and get the same results.

    Once I implemented a quick and dirty solution, simply setting the last
    proportion (Age10 above) as 1 minus the sum of ages 1-9. I quickly
    stopped using that approach when I started seeing negative values.

    After this introduction, the attached round_multinom.html should make
    sense. The algorithm I ended up choosing comes from allocating seats in
    elections, so I was tempted to provide that application as well,
    although it makes the interface and documentation slightly more
    confusing. The working title of this function was a short and catchy
    vote(), but I changed it to round_multinom(), even though it's not
    matrix-oriented like the other *multinom functions. That would probably
    be straightforward to do, but I'll keep it as a vector function during
    the initial discussion.

    I'm curious to hear your impressions and ideas. In the worst case, this
    is a not-so-great solution to a marginal problem. In the best case, this
    might be worth a short note in the Journal of Statistical Software.

    Thanks for your time,
    Arni

    P.S. In case the mailing list doesn't handle attachments, I've placed
    the same files on http://www.hafro.is/~arnima/ for your convenience.
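[Editor's note: the rounding drift described in this thread is easy to reproduce in plain base R, independently of the proposed function.]

```r
a <- c(1, 1, 1)
p <- round(a / sum(a), 3)
p       # 0.333 0.333 0.333
sum(p)  # 0.999 - rounded proportions need not sum to 1
```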
[Rd] update.packages(1)
I'm relaying a question from my institute's sysadmin:

Would it be possible to modify update.packages() and related functions so that 'lib.loc' accepts integer values to specify a library from the .libPaths() vector?

Many Linux users want to update all user packages (inside the R_LIBS_USER directory, e.g. ~/r/library) and none of the system packages (inside the /usr directory, e.g. /usr/lib64/R/library), because they don't have write privileges to update the system packages.

Currently, this can be done by pressing 'y RET' for all the user packages and 'RET' for all the system packages. This is hard work and requires careful reading when there are dozens of packages. Another way is to run

    update.packages(Sys.getenv("R_LIBS_USER"))

or:

    update.packages(.libPaths()[1])

But it would be nicer for the user to type

    update.packages(1)

using a 'pos'-like notation to indicate the first element of the .libPaths() vector.

---

A separate but related issue is that it would be nice if the R_LIBS_USER library would be the first library by default. Currently, my sysadmin must use Rprofile.site to shuffle the .libPaths() to make R_LIBS_USER first, which seems like a sensible default when it comes to install.packages() and remove.packages().

This is R version 2.10.1 (2009-12-14) on x86_64-redhat-linux-gnu (Fedora 11).

Thanks,
Arni
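[Editor's note: until something like this lands in base R, the integer shorthand can be approximated with a thin user-side wrapper. The function name and argument handling below are illustrative only, not part of the proposal.]

```r
## Hypothetical wrapper: allow an integer index into .libPaths()
update_lib <- function(lib = 1, ...)
{
  if(is.numeric(lib))
    lib <- .libPaths()[lib]  # translate position to library path
  update.packages(lib.loc = lib, ...)
}
```

With this, update_lib(1) behaves like update.packages(.libPaths()[1]).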
[Rd] areaplot
a plot, or add polygons to an existing plot.
}
\usage{
areaplot(x, \dots)

\method{areaplot}{default}(x, y = NULL, prop = FALSE, add = FALSE,
         xlab = NULL, ylab = NULL, col = NULL, \dots)

\method{areaplot}{formula}(formula, data, subset, na.action = NULL, \dots)
}
\arguments{
  \item{x}{numeric vector of x values, or if \code{y=NULL} a numeric vector
    of y values. Can also be a 1-dimensional table (x values in names, y
    values in array), matrix or 2-dimensional table (x values in row names
    and y values in columns), a data frame (x values in first column and y
    values in subsequent columns), or a time-series object of class
    \code{ts/mts}.}
  \item{y}{numeric vector of y values, or a matrix containing y values in
    columns.}
  \item{prop}{whether data should be plotted as proportions, so stacked
    areas equal 1.}
  \item{add}{whether polygons should be added to an existing plot.}
  \item{xlab}{label for x axis.}
  \item{ylab}{label for y axis.}
  \item{col}{fill color of polygon(s). The default is a vector of gray
    colors.}
  \item{formula}{a \code{\link{formula}}, such as \code{y ~ x} or
    \code{cbind(y1, y2) ~ x}, specifying x and y values. A dot on the
    left-hand side, \code{formula = . ~ x}, means all variables except the
    one specified on the right-hand side.}
  \item{data}{a data frame (or list) from which the variables in
    \code{formula} should be taken.}
  \item{subset}{an optional vector specifying a subset of observations to be
    used.}
  \item{na.action}{a function which indicates what should happen when the
    data contain \code{NA} values. The default is to ignore missing values
    in the given variables.}
  \item{\dots}{further arguments passed to \code{matplot} and
    \code{polygon}.}
}
\value{
  Matrix of cumulative sums that was used for plotting.
}
\author{
  Arni Magnusson.
}
\seealso{
  \code{\link{barplot}}, \code{\link{polygon}}.
}
\examples{
areaplot(rpois(10,40))
areaplot(rnorm(10))

# formula
areaplot(Armed.Forces~Year, data=longley)
areaplot(cbind(Armed.Forces,Unemployed)~Year, data=longley)

# add=TRUE
plot(1940:1970, 500*runif(31), ylim=c(0,500))
areaplot(Armed.Forces~Year, data=longley, add=TRUE)

# matrix
areaplot(WorldPhones)
areaplot(WorldPhones, prop=TRUE)

# table
require(MASS)
areaplot(table(Aids2$age))
areaplot(table(Aids2$age, Aids2$sex))

# ts/mts
areaplot(austres)
areaplot(Seatbelts[,c("drivers","front","rear")],
         ylab="Killed or seriously injured")
abline(v=1983+1/12, lty=3)
}
\keyword{kwd1}
\keyword{kwd2}