[Rd] stringsAsFactors and type.convert()

2020-04-13 Thread Arni Magnusson
If read.table() is defaulting to "character" instead of "factor" data type, 
shouldn't type.convert() also default to "character" in R 4.0.0?

This would seem like a good time to change the default to 
type.convert(as.is=TRUE), to align it with the new default in read.table and 
data.frame. I think many R >=4.0.0 users would be happy with as.is=TRUE as the 
default in type.convert.
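For readers following along, a minimal illustration of the difference (my example; exact behavior depends on the R version, and as.is=FALSE is spelled out here to avoid version-dependent warnings):

```r
x <- c("a", "b", "a")
class(type.convert(x, as.is = FALSE))  # "factor", the old default behavior
class(type.convert(x, as.is = TRUE))   # "character"
type.convert(c("1", "2", "3"), as.is = TRUE)  # integer vector 1 2 3
```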

I'm happy to work on the patch and run tests if that is helpful.

Cheers,
Arni

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] stringsAsFactors and type.convert()

2020-04-20 Thread Arni Magnusson
Dear Martin,

Thank you for the well-reasoned response. I realized I was rather late to make 
this suggestion for 4.0.0, changing a somewhat low-level function that can 
indeed affect packages.

I was just reviewing some R user scripts that were using type.convert(), mainly 
on data frames. In all cases, people were passing as.is=TRUE, so I was reminded 
that I would not be the only user who would appreciate if as.is=TRUE becomes 
the default at some point.

So I am happy to hear that the help page now mentions that as.is=TRUE is 
planned to become the default at some point in the future. Looking forward 
to the official 4.0.0 release - all positive changes!

All the best,
Arni


From: Martin Maechler 
Sent: Monday, April 20, 2020 6:23:31 PM
To: Arni Magnusson
Cc: r-devel@r-project.org
Subject: Re: [Rd] stringsAsFactors and type.convert()

>>>>> Arni Magnusson
>>>>> on Mon, 13 Apr 2020 22:20:19 + writes:

> If read.table() is defaulting to "character" instead of "factor" data
> type, shouldn't type.convert() also default to "character" in R 4.0.0?
> This would seem like a good time to change the default to
> type.convert(as.is=TRUE), to align it with the new default in read.table
> and data.frame. I think many R >=4.0.0 users would be happy with
> as.is=TRUE as the default in type.convert.

> I'm happy to work on the patch and run tests if that is helpful.

> Cheers,
> Arni

Dear Arni,
thank you for the notice, which unfortunately wasn't noticed
(Easter break etc) and was in any case too late to fulfill the
criterion of a small trivial bug fix for R 4.0.0 beta (very close
to becoming RC = "Release Candidate").

Even though type.convert() may not be used much directly (but
rather indirectly via read.table() where there's no problem), we
found it too risky to destabilize the R 4.0.0 prereleases.
As you all know there were ( / are?) still package changes
needed and a few other important "todo"s, so we had to decide to
postpone this (even for R-devel) to after releasing R 4.0.0
coming Friday.

I've committed a change to the help page which does mention that
the default for 'as.is' is planned to be changed.

Also, the help page's "Details" section has for a long time
ended with

 Since this is a helper function, the caller should always pass an
 appropriate value of 'as.is'.

If useRs and package authors have followed this advice, they
won't be bitten at all.

Best regards,
Martin



[Rd] Drop single-dimensional array

2010-09-08 Thread Arni Magnusson

Hi Simon, thank you for the concise reply.

Do you mean the reported behavior of drop() is not a bug?

It looks like a borderline bug to me (see below), but I'm not the judge of 
that. If this is the intended behavior and serves an actual purpose, then 
that could be explicitly documented in a \note{} on the help page.


Such a note would slightly reduce the surprise of users running into this 
behavior.


This is related to the oddity of how one-dimensional arrays behave:

  array(month.abb, dim=c(1,1,12))  # array
  array(month.abb, dim=c(1,12))# matrix
  array(month.abb, dim=12) # array

Firstly, one would expect the pattern to be array-matrix-vector. Secondly, 
it's easy to drop() the three-dimensional and two-dimensional objects, but 
drop() does nothing to the one-dimensional array. Instead, it takes an 
unintuitive combination of methods to convert a single-dimensional array 
to a vector, while retaining its names. Or I may well be missing something 
obvious.
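To spell out the combinations in question (my illustration, using the tapply() example from the original message):

```r
x <- tapply(chickwts$weight, chickwts$feed, median)
is.array(x)                       # TRUE: tapply() returns a 1-d array
is.array(drop(x))                 # TRUE: drop() leaves it unchanged
names(as.vector(x))               # NULL: as.vector() loses the names
setNames(as.vector(x), names(x))  # named vector, keeping the names
drop(as.matrix(x))                # named vector, via the matrix detour
```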


Best regards,

Arni



On Wed, 8 Sep 2010, Simon Urbanek wrote:


wrong address - did you mean R-devel?
Simon



On Sep 6, 2010, at 8:35 AM, Arni Magnusson wrote:


Bug or not, I was surprised by this behavior:

 x <- tapply(chickwts$weight, chickwts$feed, median)
 x  # array ... I'd like to convert to vector with named elements
 drop(x) # what, still an array?
 drop(as.matrix(x))  # this works
 drop(t(x))  # this works

I was expecting drop(x) to return a vector, and I suspect many R users 
would too. The title in help(drop), "Drop Redundant Extent 
Information", suggests that such a simple array would be converted to a 
vector.


Reading through the help page, I note that this is perhaps not a clear 
bug, but somewhat unclear behavior.


The most compact way to break the vector out of its eggshell seems to 
be


 t(x)[,]

but drop(x) would be much easier to read and write. There's nothing 
particularly matrix about x, so it's not obvious that the conversion 
should involve as.matrix(x).


Thanks,

Arni






[Rd] Statistical mode

2011-05-26 Thread Arni Magnusson
One descriptive statistic that is conspicuously missing from core R is the 
statistical mode - the most frequent value in a discrete distribution.


I would like to propose adding the attached 'statmode' (or a similar 
function) to the 'stats' package.


Currently, it can be quite cumbersome to calculate the mode of a 
distribution in R, both for experts and beginners. The lack of such a 
function is felt both when teaching introductory R courses and when using 
sapply() or the like.
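For reference, the usual base R workaround (the kind of incantation statmode() is meant to replace) looks something like this:

```r
x <- c(3, 1, 4, 1, 5, 9, 1)
freq <- table(x)
as.numeric(names(freq)[which.max(freq)])  # 1, the most frequent value
```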


Looking forward to your feedback,

Arni

statmode <- function(x, all=FALSE, ...)
{
  if(is.list(x))
  {
output <- sapply(x, statmode, all=all, ...)
  }
  else
  {
freq <- table(x, ...)
if(all)
  output <- names(freq)[freq==max(freq)]
else
  output <- names(freq)[which.max(freq)]
## Coerce to original data type, using any() to handle mts, xtabs, etc.
if(any(class(x) %in% c("integer","numeric","ts","complex","matrix","table")))
  output <- as(output, storage.mode(x))
  }
  return(output)
}
\name{statmode}
\alias{statmode}
\title{Statistical Mode}
\description{
  Compute the statistical mode, the most frequent value in a discrete
  distribution.
}
\usage{
statmode(x, all = FALSE, \dots)
}
\arguments{
  \item{x}{an \R object, usually vector, matrix, or data frame.}
  \item{all}{whether all statistical modes should be returned.}
  \item{\dots}{further arguments passed to the \code{\link{table}}
function.}
}
\details{The default is to return only the first statistical mode.}
\value{
  The most frequent value in \code{x}, possibly a vector or list,
  depending on the class of \code{x} and whether \code{all=TRUE}.
}
\seealso{
  \code{\link{mean}}, \code{\link{median}}, \code{\link{table}}.

  \code{\link{density}} can be used to compute the statistical mode of a
  continuous distribution.
}
\examples{
## Different location statistics
fw <- faithful$waiting
hist(fw)
barplot(table(fw))
mean(fw)
median(fw)
statmode(fw)
plot(density(fw))
with(density(fw), x[which.max(y)])

## Different classes
statmode(chickwts$feed)  # factor
statmode(volcano)# matrix
statmode(discoveries)# ts
statmode(mtcars) # data frame

## Multiple modes
table(mtcars$carb)
statmode(mtcars$carb)
statmode(mtcars$carb, TRUE)
statmode(mtcars, TRUE)
}
\keyword{univar}


Re: [Rd] Statistical mode

2011-05-27 Thread Arni Magnusson

Thank you, Kevin, for the feedback.

1. The mode is not so interesting for continuous data. I would much 
rather use something like density().


Absolutely. The help page for statmode() says it is for discrete data, and 
points to density() for continuous data.



2. Both the iris and barley data sets are balanced (each factor level 
appears equally often), and the current output from the statmode 
function is misleading by only showing one level.


Try statmode(iris,TRUE). It points out that petal lengths 1.4 and 1.5 are 
equally common in the data. I decided to make all=FALSE the default 
behavior, but I'd be equally happy with all=TRUE as the default.


As for the barley data, statmode(barley,TRUE) is just the honest answer. 
The yield is continuous, so the discrete mode is not of interest, and the 
factor levels are all equally common, as you point out.



3. I think the describe() function in the Hmisc package is much more 
useful and informative, even for introductory stat classes.  I always 
use describe() after importing data into R.


The describe() function is a verbose summary, usually of a data frame. The 
statmode() function is the discrete mode, usually of a vector. 
Importantly, describe(faithful$waiting) points out the mean, median and 
range, but not the mode.


---

Allow me to include two more valid comments, from Sarah Goslee and David 
Winsemius, respectively:




4. The 'modeest' package does this and more, see for example mfv().


I think core R should come with a basic function to get the mode of a 
discrete vector. One option would be to lift mfv() into the 'stats' 
package, but something like statmode() could also cover factors and 
strings. Might as well provide all=TRUE/FALSE functionality, too, and 
retain integers as integers.


It's common to find rudimentary basic functionality in the 'stats' 
package, and dedicated packages for more details; time series models and 
robust statistics come to mind. The 'modeest' package is impressive 
indeed.




5. Isn't this just table(Vec)[which.max(table(Vec))]?


Yes it is, only less cumbersome. Much like sd(Vec) is less cumbersome than 
sqrt(var(Vec)). Moreover, I find it confusing to see the count as well,


  table(volcano)[which.max(table(volcano))]
  # 110
  # 177

although this can be debated. Finally, I think the examples

  statmode(mtcars)
  statmode(mtcars, TRUE)

demonstrate practical functionality beyond 
table(Vec)[which.max(table(Vec))].


The mean, median, and mode are often mentioned together as fundamental 
descriptive statistics, and I just find it odd that statmode() is not 
already in core R. Sure, we could get by without the sd() function in core 
R, but why should we?


All the best,

Arni



[Rd] Proposed diff.character() method

2022-03-08 Thread Arni Magnusson
Dear R developers,

Recently, I was busy comparing different versions of several packages.
Tired of going back and forth between R and diff, I created a simple
file comparison function in R that I found quite useful. For an
efficient and familiar interface I called it diff.character() and ran
things like:

  diff("old/R/foo.R", "new/R/foo.R")

Before long, I found the need for a directory-wide comparison and
added support for:

  diff("old/R", "new/R")

I have now revisited and fine-polished this function to a point where
I'd like to humbly suggest that diff.character() could be incorporated
into the base package. See attached files and patch based on the
current SVN trunk. It can be tested quickly by sourcing diff.R, or by
building R.

The examples in diff.character.html are somewhat contrived, in the
absence of good example files to compare. You will probably have
better example files to compare from your own work.

Clearly, the functionality differs considerably from the default
diff() method that operates on a single x vector, but in the broad
sense, they're both about showing differences. For most programmers,
calling diff() on two files or directories is already a part of muscle
memory, both intuitive and efficient.

There are a couple of CRAN packages (diffobj, diffR) that can compare
files but not directories. They have package dependencies and return
objects that are more complex (S4, HTML) than the plain list returned
by diff.character().

This basic utility by no means competes with Meld, Kompare, Emacs
ediff, or other feature-rich diff applications, and using setdiff() as
a basis for file comparison is a somewhat simplistic approach.
Nevertheless, I think many users may find this a handy tool to quickly
compare scripts and data files. The method could be implemented
differently, with fewer or more features, and I'm happy to amend
according to the R Core Team decision.
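As a rough sketch of the idea (my own minimal reconstruction, not the attached patch; the function and argument names are assumptions), a setdiff-based file comparison can be as small as:

```r
## minimal sketch: compare two text files, ignoring line order and repeats
diff_files <- function(x, y, ...) {
  a <- readLines(x, ...)
  b <- readLines(y, ...)
  list(removed = setdiff(a, b),  # lines only in x
       added   = setdiff(b, a))  # lines only in y
}
```

Like the proposed method, this ignores line order, line numbers, and repeated lines, since it is based on setdiff().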

In the past, I have proposed additions to core R, some rejected and
others accepted. This proposal fits a useful tool in a currently
vacant diff.character() method at a low cost, using relatively few
lines of base function calls and no compiled code. Its acceptance will
probably depend on whether members of the R Core Team and/or CRAN Team
might see it as a useful addition to their toolkit for interactive and
scripted workflows, including R and CRAN maintenance.

All the best,
Arni
Compare Files

Description:

 Show differences between files or directories.

Usage:

 ## S3 method for class 'character'
 diff(x, y, file = NULL, ignore = NULL, lines = FALSE, short = TRUE,
  similar = FALSE, simple = TRUE, trimws = FALSE, ...)

Arguments:

   x: a file or directory name.

   y: another file or directory name.

file: if ‘x’ and ‘y’ are directories, then ‘file’ can be used to
  select a specific file that exists in both directories.

  ignore: patterns (regular expressions) to exclude from the output.

   lines: if ‘x’ and ‘y’ are directories, then ‘lines = TRUE’ compares
  the contents (lines) of files that exist in both directories,
  instead of listing filenames that are different between the
  directories.

   short: whether to produce short file paths for the output.

 similar: whether to show similarities instead of differences.

  simple: whether to replace ‘character(0)’ with ‘NULL’ in output, for
  compact display.

  trimws: whether to trim whitespace and exclude empty strings.

 ...: passed to ‘readLines’.

Details:

 When comparing directories, two kinds of differences can occur:
 (1) filenames existing in one directory and not the other, and
 (2) files containing different lines of text. The purpose of the
 ‘lines’ argument is to select which of those two kinds of
 differences to show.

 If ‘x’ and ‘y’ are files (and not directories), the ‘file’ and
 ‘lines’ arguments are not applicable and will be ignored.

Value:

 List showing differences as strings, or similarities if
 ‘similar = TRUE’.

Note:

 This function uses ‘setdiff’ for the comparison, so line order,
 line numbers, and repeated lines are ignored. Subdirectories are
 excluded when comparing directories.

 This function has very basic features compared to full GUI
 applications such as WinMerge (Windows), Meld (Linux,
 Windows), Kompare (Linux), Ediff (Emacs), or the ‘diff’ shell
 command. The use of full GUI applications is recommended, but what
 this function offers in addition is:

• a quick diff tool that is handy during an interactive R
  session,

• a programmatic interface to analyze file differences as
  native R objects, and

• a tool that works on all platforms, regardless of what
  software may be installed.

 The ‘short’ and ‘simple’ defaults are designed for interactive
 (human-readable) use, while ‘short = FALSE’ and ‘simple = FALSE’
 produc

[Rd] ASCIIfy() - a proposal for package:tools

2014-04-15 Thread Arni Magnusson

Hi all,

I would like to propose the attached function ASCIIfy() to be added to the 
'tools' package.


Non-ASCII characters in character vectors can be problematic for R 
packages, but sometimes they cannot be avoided. To make packages portable 
and build without 'R CMD check' warnings, my solution has been to convert 
problematic characters in functions and datasets to escaped ASCII, so 
plot(1,main="São Paulo") becomes plot(1,main="S\u00e3o Paulo").
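The underlying conversion can be sketched with base tools (my illustration of the principle, not the proposed implementation):

```r
utf8ToInt("\u00e3")                      # 227, the code point of a-tilde
sprintf("\\u%04x", utf8ToInt("\u00e3"))  # the escaped form "\u00e3"
```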


The showNonASCII() function in package:tools is helpful to identify R 
source files where characters should be converted to ASCII one way or 
another, but I could not find a function to actually perform the 
conversion to ASCII.


I have written the function ASCIIfy() to convert character vectors to 
ASCII. I imagine other R package developers might be looking for a similar 
tool, and it seems to me that package:tools is the first place they would 
look, where the R Core Team has provided a variety of tools for handling 
non-ASCII characters in package development.


I hope the R Core Team will adopt ASCIIfy() into the 'tools' package, to 
make life easier for package developers outside the English-speaking 
world. I have of course no problem with them renaming or rewriting the 
function in any way.


See the attached examples - all in flat ASCII that was prepared using the 
function itself! The main objective, though, is to ASCIIfy functions and 
datasets, not help pages.


Arni

ASCIIfy <- function(string, bytes=2, fallback="?")
{
  bytes <- match.arg(as.character(bytes), 1:2)
  convert <- function(char)  # convert to ASCII, e.g. "z", "\xfe", or "\u00fe"
  {
raw <- charToRaw(char)
if(length(raw)==1 && raw<=127)  # 7-bit
  ascii <- char
else if(length(raw)==1 && bytes==1)  # 8-bit to \x00
  ascii <- paste0("\\x", raw)
else if(length(raw)==1 && bytes==2)  # 8-bit to \u
  ascii <- paste0("\\u", chartr(" ","0",formatC(as.character(raw),width=4)))
else if(length(raw)==2 && bytes==1)  # 16-bit to \x00, if possible
  if(utf8ToInt(char) <= 255)
ascii <- paste0("\\x", format.hexmode(utf8ToInt(char)))
  else {
ascii <- fallback; warning(char, " could not be converted to 1 byte")}
else if(length(raw)==2 && bytes==2)  # UTF-8 to \u
  ascii <- paste0("\\u", format.hexmode(utf8ToInt(char),width=4))
else {
  ascii <- fallback
  warning(char, " could not be converted to ", bytes, " byte")}
return(ascii)
  }

  if(length(string) > 1)
  {
sapply(string, ASCIIfy, bytes=bytes, fallback=fallback, USE.NAMES=FALSE)
  }
  else
  {
input <- unlist(strsplit(string,""))  # "c"  "a"  "f"  "é"
output <- character(length(input))# ""   ""   ""   ""
for(i in seq_along(input))
  output[i] <- convert(input[i])  # "c"  "a"  "f"  "\\u00e9"
output <- paste(output, collapse="")  # "caf\\u00e9"
return(output)
  }
}
\name{ASCIIfy}
\alias{ASCIIfy}
\title{Convert Characters to ASCII}
\description{
  Convert character vector to ASCII, replacing non-ASCII characters with
  single-byte (\samp{\x00}) or two-byte (\samp{\u}) codes.
}
\usage{
ASCIIfy(x, bytes = 2, fallback = "?")
}
\arguments{
  \item{x}{a character vector, possibly containing non-ASCII
characters.}
  \item{bytes}{either \code{1} or \code{2}, for single-byte
(\samp{\x00}) or two-byte (\samp{\u}) codes.}
  \item{fallback}{an output character to use, when input characters
cannot be converted.}
}
\value{
  A character vector like \code{x}, except non-ASCII characters have
  been replaced with \samp{\x00} or \samp{\u} codes.
}
\author{Arni Magnusson.}
\note{
  To render single backslashes, use these or similar techniques:
  \verb{
write(ASCIIfy(x), "file.txt")
cat(paste(ASCIIfy(x), collapse="\n"), "\n", sep="")}

  The resulting strings are plain ASCII and can be used in R functions
  and datasets to improve package portability.
}
\seealso{
  \code{\link[tools]{showNonASCII}} identifies non-ASCII characters in
  a character vector.
}
\examples{
cities <- c("S\u00e3o Paulo", "Reykjav\u00edk")
print(cities)
ASCIIfy(cities, 1)
ASCIIfy(cities, 2)

athens <- "\u0391\u03b8\u03ae\u03bd\u03b1"
print(athens)
ASCIIfy(athens)
}
\keyword{}


Re: [Rd] ASCIIfy() - a proposal for package:tools

2014-04-17 Thread Arni Magnusson

Thanks Duncan, for considering ASCIIfy. I understand your reasoning.

This is a recurring pattern - I propose functions for core R, and Greg 
catches them from freefall :)


I'm delighted with ASCIIfy being hosted in gtools. The R and Rd should be 
ready as is.


Cheers,

Arni



Re: [Rd] Model frame when LHS is cbind (PR#14189)

2010-01-21 Thread Arni Magnusson
Thank you Prof. Ripley, for examining this issue. I have two more 
questions on this topic, if I may.


(1) Truncated column names

With your explanations I can see that the problem of missing column names 
originates in cbind() and the 'deparse.level' bug we have just discovered. 
I had tried different 'deparse.level' values, only to see that it didn't 
solve my problem of missing column names.


  attach(mtcars)
  cbind(qsec,log(hp),sqrt(disp), deparse.level=0)  # no column names
  cbind(qsec,log(hp),sqrt(disp), deparse.level=1)  # qsec only
  cbind(qsec,log(hp),sqrt(disp), deparse.level=2)  # no column names
  cbind(qsec=qsec,log(hp),sqrt(disp), deparse.level=2)  # works!
  cbind(qsec=qsec,log(hp),sqrt(abs(disp)), deparse.level=2)  # hmm...

Now a new question arises. The last line generates these truncated column 
names


  "qsec"  "log(hp)"  "sqrt(abs(d..."

where the dots are not mine, but something that R decided to do, 
presumably to keep the column names no longer than 13 characters. I would 
prefer to retain the full column names, like this,


  as.matrix(data.frame(qsec,log(hp),sqrt(abs(disp)), check.names=FALSE))

where the column names are

  "qsec"  "log(hp)"  "sqrt(abs(disp))"

Is there some reason why cbind() should truncate column names? Matrices 
have no problems with very long column names.



(2) Changing the default 'deparse.level' to 2

Furthermore, since many users appreciate the compact model formula syntax 
in R, it would be great if the formula


  cbind(qsec, log(hp), sqrt(disp)) ~ wt

would result in a model frame with full column names, without sacrificing 
legibility by adding deparse.level=2 in between the variable names. The 
simplest way to achieve this would be by changing the default value of 
'deparse.level' to 2 in cbind() and probably rbind().


Am I missing some important cases where functions/users rely on some of 
the column names to be missing, as generated by deparse.level=1? And if 
so, do these cases outweigh the benefits of clean and compact formula 
syntax when modelling?



Many thanks,

Arni



On Thu, 21 Jan 2010, Prof Brian Ripley wrote:


A few points.

0) This seems a Wishlist item, but it does not say so (see the section 
on BUGS in the FAQ).


1) A formula does not need to have an lhs, and it is an assumption that 
the response is the first element of 'variables' (an assumption not made 
a couple of lines later when 'resp' is used).


2) I don't think this is the best way to get names.  If I do

fm <- lm(cbind(a=qsec,b=log(hp),sqrt(disp))~wt, data=mtcars)

I want a and b as names, but that is not what your code gives. And if I 
do



X <- with(mtcars, cbind(a = qsec, b = log(hp), c=sqrt(disp)))
fm <- lm(X ~ wt, data=mtcars)
model.frame(fm)[[1]]

  [,1] [,2]  [,3]

You've lost the names that the current code gives.

The logic is that if you use a lhs which is a matrix with column names, 
then those names are used.  If (as you did), you use one with empty 
column names, that is what you get in the model frame.  This seems much 
more in the spirit of R than second-guessing that the author actually 
meant to give column names and create them, let alone renaming the 
columns to be different than the names supplied.


3) It looks to me as if you wanted

cbind(qsec, log(hp), sqrt(disp), deparse.level=2)

but that does not give names (despite the description).  And that is I 
think a bug that can easily be changed. That way we can fulfil your 
wish without breaking other people's code.


--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK            Fax:  +44 1865 272595





[Rd] Rounding multinomial proportions

2010-02-11 Thread Arni Magnusson
  1.4\eqn{n}--0.4. Favors big parties slightly.
}
\item{\code{"SL"}}{
  Sainte-Laguë. Involves the series 1, 3, 5, \ldots, 2\eqn{n}--1.
  Does not favor big or small parties. A reasonable default method
  for rounding multinomial proportions.
}
  }
}
\value{
  Vector of same length as \code{x}, with rounded numbers whose sum is
  one (or integers whose sum is \code{seats}).
}
\note{
  d'Hondt is used to allocate parliamentary seats in most of Europe and
  South America, East Timor, Israel, Japan, and Turkey.

  Modified Sainte-Laguë is used in Norway and Sweden.

  Sainte-Laguë is used in Bosnia and Herzegovina, Kosovo, Latvia, and
  New Zealand.
}
\author{Arni Magnusson.}
\references{
  Balinski, M. and V. Ramírez. 1999. Parametric methods of
  apportionment, rounding and production. \emph{Mathematical Social
Sciences} \bold{37}:107--122.

  Diaconis, P. and D. Freedman. 1979. On rounding percentages.
  \emph{Journal of the American Statistical Association}
  \bold{74}:359--364.

  Mosteller, F., C. Youtz, and D. Zahn. 1967. The distribution of sums
  of rounded percentages. \emph{Demography} \bold{4}:850--858.

  \url{http://en.wikipedia.org/wiki/Highest_averages_method} (accessed
  10 Feb 2010).
}
\seealso{
  \code{\link{round}} is the standard rounding function in \R.

  \code{\link{rmultinom}} generates random multinomial vectors.
}
\examples{
## Guarantee that rounded multinomial proportions sum to 1:

a <- c(67630, 116558, 207536, 251555, 356721)
p <- a / sum(a)
p1 <- round(p, 3)   # 0.068  0.117  0.208  0.252  0.357
sum(p1) # 1.002, no good
p2 <- round_multinom(p, 3)  # 0.068  0.117  0.207  0.251  0.357
sum(p2) # 1


## The multinomial "proportions" can also be raw counts (a, not p):

p3 <- round_multinom(a, 3)  # 0.068  0.117  0.207  0.251  0.357
sum(p3) # 1


## Allocate 9 seats from 178 votes using different methods:

votes <- c(red=66, green=80, blue=32)
round_multinom(votes, seats=9, method="DH")   # 4 4 1
round_multinom(votes, seats=9, method="MSL")  # 3 4 2
round_multinom(votes, seats=9, method="SL")   # 3 4 2
}
\keyword{arith}
\keyword{distribution}
round_multinom <- function(x, digits=NULL, seats=NULL, method="SL",
                           labels=names(x))
{
  method <- match.arg(toupper(method), c("DH","MSL","SL"))

  if(is.null(digits) && is.null(seats) || !is.null(digits) && !is.null(seats))
    stop("Please pass a value as 'digits' or 'seats', not both")
  if(!is.null(digits))
  {
    if(digits<0 || digits>6)
      stop("Please pass a value between 0 and 6 as 'digits'")
    n <- as.integer(10^digits)
  }
  else
    n <- seats

  party <- seq_along(x)
  series <- switch(method,
                   DH  = 1 + 1  *(seq_len(n)-1),  # 1, 2,   3,   ..., n
                   MSL = 1 + 1.4*(seq_len(n)-1),  # 1, 2.4, 3.8, ..., 1.4n-0.4
                   SL  = 1 + 2  *(seq_len(n)-1))  # 1, 3,   5,   ..., 2n-1

  ## One row per (party, divisor) pair; scores are votes divided by the series
  output <- data.frame(party=rep(party,each=n),
                       score=as.numeric(sapply(x, function(votes) votes/series)))
  output <- factor(output$party)[order(-output$score)][seq_len(n)]
  output <- as.integer(table(output))

  if(!is.null(digits))
    output <- output / n
  names(output) <- labels

  return(output)
}


Re: [Rd] Rounding multinomial proportions

2010-02-11 Thread Arni Magnusson

Ugh, I made a typo at the very heart of my message:

"when I preprocess each line in R as p<-a/sum(a), occasionally a line will 
sum to 0.999, 1.002, or the like"


should be

"when I preprocess each line in R as p<-round(a/sum(a),3) occasionally a 
line will sum to 0.999, 1.002, or the like"


Also, the first paragraph should end with "where the other multinomial 
functions reside."


Revision 2,

Arni



On Thu, 11 Feb 2010, Arni Magnusson wrote:

I present you with a function that solves a problem that has bugged me 
for many years. I think the problem may be general enough to at least 
consider adding this function, or a revamped version of it, to the 
'stats' package, with the other multinomial functions reside.


I'm using R to export data to text files, which are input data for an 
external model written in C++. Parts of the data are age distributions, 
in the form of relative frequency in each year:


 Year  Age1   Age2   ...  Age10
 1980  0.123  0.234  ...  0.001
 ...   ...    ...    ...  ...

Each row should sum to exactly 1. The problem is that when I preprocess 
each line in R as p<-a/sum(a), occasionally a line will sum to 0.999, 
1.002, or the like. This could either crash the external model or lead 
to wrong conclusions.


I believe similar partitioning is commonly used in a wide variety of 
models, making this a general problem for many modellers.


In the past, I have checked every line manually, and then arbitrarily 
tweaked one or two values up or down to make the row sum to exactly one, 
but two people would tweak differently. Another semi-solution is to 
write the values to the text file in a very long format, but this would 
(1) make it harder to visually check the numbers and (2) the numbers in 
the article or report would no longer match the data files exactly, so 
other scientists could not repeat the analysis and get the same results.


Once I implemented a quick and dirty solution, simply setting the last 
proportion (Age10 above) as 1 minus the sum of ages 1-9. I quickly 
stopped using that approach when I started seeing negative values.


After this introduction, the attached round_multinom.html should make 
sense. The algorithm I ended up choosing comes from allocating seats in 
elections, so I was tempted to provide that application as well, 
although it makes the interface and documentation slightly more 
confusing.


The working title of this function was a short and catchy vote(), but I 
changed it to round_multinom(), even though it's not matrix-oriented 
like the other *multinom functions. That would probably be 
straightforward to do, but I'll keep it as a vector function during the 
initial discussion.


I'm curious to hear your impressions and ideas. In the worst case, this 
is a not-so-great solution to a marginal problem. In the best case, this 
might be worth a short note in the Journal of Statistical Software.


Thanks for your time,

Arni

P.S. In case the mailing list doesn't handle attachments, I've placed 
the same files on http://www.hafro.is/~arnima/ for your convenience.






[Rd] update.packages(1)

2010-03-25 Thread Arni Magnusson

I'm relaying a question from my institute's sysadmin:

Would it be possible to modify update.packages() and related functions so 
that 'lib.loc' accepts integer values to specify a library from the 
.libPaths() vector?


Many Linux users want to update all user packages (inside the R_LIBS_USER 
directory, e.g. ~/r/library) and none of the system packages (inside the 
/usr directory, e.g. /usr/lib64/R/library), because they don't have write 
privileges to update the system packages.


Currently, this can be done by pressing 'y RET' for all the user packages 
and 'RET' for all the system packages. This is hard work and requires 
careful reading when there are dozens of packages. Another way is to run


  update.packages(Sys.getenv("R_LIBS_USER"))

or:

  update.packages(.libPaths()[1])

But it would be nicer for the user to type

  update.packages(1)

using a 'pos' like notation to indicate the first element of the 
.libPaths() vector.
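A hypothetical wrapper along these lines (function and argument names are my own, not an existing R API) shows how little is involved:

```r
## resolve an integer index into .libPaths(); hypothetical helper
resolve_lib <- function(lib)
  if (is.numeric(lib)) .libPaths()[lib] else lib

## hypothetical wrapper around update.packages()
update_lib <- function(lib = NULL, ...)
  update.packages(lib.loc = resolve_lib(lib), ...)

## update_lib(1) would then update only packages in .libPaths()[1]
```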


---

A separate but related issue is that it would be nice if the R_LIBS_USER 
library would be the first library by default. Currently, my sysadmin must 
use Rprofile.site to shuffle the .libPaths() to make R_LIBS_USER first, 
which seems like a sensible default when it comes to install.packages() 
and remove.packages().
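The Rprofile.site workaround might look something like this (an assumed sketch of such a setup, not the actual site file):

```r
## put the user library first in the search path, if it exists
local({
  user <- Sys.getenv("R_LIBS_USER")
  if (nzchar(user) && file.exists(user))
    .libPaths(c(user, .libPaths()))
})
```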


This is R version 2.10.1 (2009-12-14) on x86_64-redhat-linux-gnu (Fedora 
11).


Thanks,

Arni



[Rd] areaplot

2010-06-14 Thread Arni Magnusson
a plot, or add polygons to an existing plot.

}

\usage{
areaplot(x, \dots)

\method{areaplot}{default}(x, y = NULL, prop = FALSE, add = FALSE, xlab = NULL,
  ylab = NULL, col = NULL, \dots)

\method{areaplot}{formula}(formula, data, subset, na.action = NULL, \dots)
}
\arguments{
  \item{x}{numeric vector of x values, or if \code{y=NULL} a numeric
    vector of y values. Can also be a 1-dimensional table (x values in
    names, y values in array), matrix or 2-dimensional table (x values
    in row names and y values in columns), a data frame (x values in
    first column and y values in subsequent columns), or a time-series
    object of class \code{ts/mts}.}
  \item{y}{numeric vector of y values, or a matrix containing y values
    in columns.}
  \item{prop}{whether data should be plotted as proportions, so stacked
    areas equal 1.}
  \item{add}{whether polygons should be added to an existing plot.}
  \item{xlab}{label for x axis.}
  \item{ylab}{label for y axis.}
  \item{col}{fill color of polygon(s). The default is a vector of gray
    colors.}
  \item{formula}{a \code{\link{formula}}, such as \code{y ~ x} or
    \code{cbind(y1, y2) ~ x}, specifying x and y values. A dot on the
    left-hand side, \code{formula = . ~ x}, means all variables except
    the one specified on the right-hand side.}
  \item{data}{a data frame (or list) from which the variables in
    \code{formula} should be taken.}
  \item{subset}{an optional vector specifying a subset of observations
    to be used.}
  \item{na.action}{a function which indicates what should happen when
    the data contain \code{NA} values. The default is to ignore missing
    values in the given variables.}
  \item{\dots}{further arguments passed to \code{matplot} and
    \code{polygon}.}
}
\value{
  Matrix of cumulative sums that was used for plotting.
}
\author{Arni Magnusson.}
\seealso{
  \code{\link{barplot}}, \code{\link{polygon}}.
}
\examples{
areaplot(rpois(10,40))
areaplot(rnorm(10))

# formula
areaplot(Armed.Forces~Year, data=longley)
areaplot(cbind(Armed.Forces,Unemployed)~Year, data=longley)

# add=TRUE
plot(1940:1970, 500*runif(31), ylim=c(0,500))
areaplot(Armed.Forces~Year, data=longley, add=TRUE)

# matrix
areaplot(WorldPhones)
areaplot(WorldPhones, prop=TRUE)

# table
require(MASS)
areaplot(table(Aids2$age))
areaplot(table(Aids2$age, Aids2$sex))

# ts/mts
areaplot(austres)
areaplot(Seatbelts[,c("drivers","front","rear")],
  ylab="Killed or seriously injured")
abline(v=1983+1/12, lty=3)
}
\keyword{kwd1}
\keyword{kwd2}
