[Rd] possible internal (un)tar bug

2018-05-01 Thread Gábor Csárdi
This is a not too old R-devel on Linux, it already fails in R 3.4.4, and on
macOS as well.

The tar file seems valid, external tar can untar it, so maybe an untar()
bug.

setwd(tempdir())
dir.create("pkg")
cat("foobar\n",  file = file.path("pkg", "NAMESPACE"))
cat("this: that\n", file = file.path("pkg", "DESCRIPTION"))

tar("pkg_1.0.tar.gz", "pkg", compression = "gzip", tar = "internal")
unlink("pkg", recursive = TRUE)

con <- file("pkg_1.0.tar.gz", open = "rb")
ex <- tempfile()
untar(con, files = "pkg/DESCRIPTION", exdir = ex)

#> Error in untar2(tarfile, files, list, exdir, restore_times) :
#>   incomplete block on file

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] issue with model.frame()

2018-05-01 Thread Therneau, Terry M., Ph.D. via R-devel
A user sent me an example where coxph fails, and the root of the failure is a case where 
names(mf) is not equal to the term.labels attribute of the formula -- the latter has an 
extraneous newline. Here is an example that does not use the survival library.


# first create a data set with many long names
n <- 30  # number of rows for the dummy data set
vname <- vector("character", 26)
for (i in 1:26) vname[i] <- paste(rep(letters[1:i],2), collapse='')  # long variable names

tdata <- data.frame(y=1:n, matrix(runif(n*26), nrow=n))
names(tdata) <- c('y', vname)

# Use it in a formula
myform <- paste("y ~ cbind(", paste(vname, collapse=", "), ")")
mf <- model.frame(formula(myform), data=tdata)

match(attr(terms(mf), "term.labels"), names(mf))   # gives NA



In the user's case the function is ridge(x1, x2, ) rather than cbind, but the effect 
is the same.

Any ideas for a work around?

Aside: the ridge() function is very simple, it was added as an example to show how a user 
can add their own penalization to coxph.  I never expected serious use of it.  For this 
particular user the best answer is to use glmnet instead.   He/she is trying to apply an 
L2 penalty to a large number of SNP * covariate interactions.


Terry T.



Re: [Rd] possible internal (un)tar bug

2018-05-01 Thread Martin Maechler
> Gábor Csárdi 
> on Tue, 1 May 2018 12:05:32 + writes:

> This is a not too old R-devel on Linux, it already fails
> in R 3.4.4, and on macOS as well.

and fails in considerably older R versions, too.

Basically  untar() seems to fail on a connection, but works fine
on a plain file name.

This is a bug --> Thank you for the report, Gábor !

I'm investigating.
Martin

--- my version of your reprex :

setwd(tempdir())
dir.create("pkg")
cat("this: that\n", file = file.path("pkg", "DESCRIPTION"))
tf <- "pkg_1.0.tar.gz"
tar(tf, "pkg", compression = "gzip", tar = "internal")
unlink("pkg", recursive = TRUE)

## MM: tar *file* is good
stopifnot(identical(untar(tf, list=TRUE), "pkg/DESCRIPTION"))
untar(tf, files = (f <- "pkg/DESCRIPTION")) # no problem
stopifnot(file.exists(f))
unlink("pkg", recursive = TRUE)

## Now with a connection -- "nothing works":
con <- file(tf, open = "rb"); try( untar(con, list = TRUE) ) ## -> Error
con <- file(tf, open = "rb"); try( untar(con, files = "pkg/DESCRIPTION") )
## The error message is the same in both cases:
'
Error in untar2(tarfile, files, list, exdir, restore_times) :
   incomplete block on file
'



Re: [Rd] possible internal (un)tar bug

2018-05-01 Thread Martin Maechler
> Martin Maechler 
> on Tue, 1 May 2018 16:14:43 +0200 writes:

> Gábor Csárdi 
> on Tue, 1 May 2018 12:05:32 + writes:

>> This is a not too old R-devel on Linux, it already fails
>> in R 3.4.4, and on macOS as well.

> and fails in considerably older R versions, too.

> Basically untar() seems to fail on a connection, but works
> fine on a plain file name.

Well, there's an easy workaround:   If you want to use a
connection (instead of a simple filename) with  untar() and want
to use compression (as in the example), you
can currently  do that easily when you ensure the connection is
a "gzcon" one :

##=>  Workaround for now:

## Create :
setwd(tempdir()) ; dir.create("pkg")
cat("this: that\n", file = file.path("pkg", "DESCRIPTION"))
tf <- "pkg_1.0.tar.gz"
tar(tf, "pkg", compression = "gzip", tar = "internal")
unlink("pkg", recursive = TRUE)

## As it is a compressed tar file, use it via a gzcon() connection,
## and both cases work fine:
con <- gzcon(file(tf, open = "rb")) ; (f <- untar(con, list = TRUE))
## ~
con <- gzcon(file(tf, open = "rb")) ; untar(con, files = f)
stopifnot(identical(f, "pkg/DESCRIPTION"),
  file.exists(f))
unlink(c(tf,"pkg"), recursive = TRUE) # clean after me




Of course, ideally untar() should do that for us and I'm testing
a simple patch to do that.

Martin



Re: [Rd] issue with model.frame()

2018-05-01 Thread Berry, Charles


> On May 1, 2018, at 6:11 AM, Therneau, Terry M., Ph.D. via R-devel 
>  wrote:
> 
> A user sent me an example where coxph fails, and the root of the failure is a 
> case where names(mf) is not equal to the term.labels attribute of the formula 
> -- the latter has an extraneous newline. Here is an example that does not use 
> the survival library.
> 
> # first create a data set with many long names
> n <- 30  # number of rows for the dummy data set
> vname <- vector("character", 26)
> for (i in 1:26) vname[i] <- paste(rep(letters[1:i],2), collapse='')  # long variable names
> 
> tdata <- data.frame(y=1:n, matrix(runif(n*26), nrow=n))
> names(tdata) <- c('y', vname)
> 
> # Use it in a formula
> myform <- paste("y ~ cbind(", paste(vname, collapse=", "), ")")
> mf <- model.frame(formula(myform), data=tdata)
> 
> match(attr(terms(mf), "term.labels"), names(mf))   # gives NA
> 
> 
> 
> In the user's case the function is ridge(x1, x2, ) rather than cbind, but 
> the effect is the same.
> Any ideas for a work around?

Maybe add a `yourclass' class to mf and dispatch to a model.frame.yourclass 
method where the width cutoff arg here (around lines 57-58 of 
model.frame.default) is made larger:

varnames <- sapply(vars, function(x) paste(deparse(x, width.cutoff = 500), 
collapse = " "))[-1L]

??

> 
> Aside: the ridge() function is very simple, it was added as an example to 
> show how a user can add their own penalization to coxph.  I never expected 
> serious use of it.  For this particular user the best answer is to use glmnet 
> instead.   He/she is trying to apply an L2 penalty to a large number of SNP * 
> covariate interactions.
> 
> Terry T.
> 


HTH,

Chuck


Re: [Rd] possible internal (un)tar bug

2018-05-01 Thread Martin Maechler
TLDR:  Use  gzfile(), not file()  .. and you have no problems.

> Martin Maechler 
> on Tue, 1 May 2018 16:39:57 +0200 writes:

> Martin Maechler 
> on Tue, 1 May 2018 16:14:43 +0200 writes:

> Gábor Csárdi 
> on Tue, 1 May 2018 12:05:32 + writes:

>>> This is a not too old R-devel on Linux, it already fails
>>> in R 3.4.4, and on macOS as well.

>> and fails in considerably older R versions, too.

>> Basically untar() seems to fail on a connection, but works
>> fine on a plain file name.

> Well, there's an easy workaround:   If you want to use a
> connection (instead of a simple filename) with  untar() and want
> to use compression (as in the example), you
> can currently  do that easily when you ensure the connection is
> a "gzcon" one :

> ##=>  Workaround for now:

> ## Create :
> setwd(tempdir()) ; dir.create("pkg")
> cat("this: that\n", file = file.path("pkg", "DESCRIPTION"))
> tf <- "pkg_1.0.tar.gz"
> tar(tf, "pkg", compression = "gzip", tar = "internal")
> unlink("pkg", recursive = TRUE)

> ## As it is a compressed tar file, use it via a gzcon() connection,
> ## and both cases work fine:
> con <- gzcon(file(tf, open = "rb")) ; (f <- untar(con, list = TRUE))
> ## ~
> con <- gzcon(file(tf, open = "rb")) ; untar(con, files = f)
> stopifnot(identical(f, "pkg/DESCRIPTION"),
> file.exists(f))
> unlink(c(tf,"pkg"), recursive = TRUE) # clean after me

Actually, much better than  gzcon(file())  is  gzfile()
The latter works for all compression types that are supported by
tar(), not just for  gzip compression.
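For illustration, a minimal sketch of the gzfile() variant, reusing the
reprex from earlier in this thread. The scratch directory and the "out"
extraction directory are assumptions made for this sketch only; a fresh
connection is opened for each untar() call, as in the gzcon() examples.

```r
## Work in a throwaway directory (an assumption of this sketch).
wd <- tempfile("untar-demo-")
dir.create(file.path(wd, "pkg"), recursive = TRUE)
old <- setwd(wd)
cat("this: that\n", file = file.path("pkg", "DESCRIPTION"))
tf <- "pkg_1.0.tar.gz"
tar(tf, "pkg", compression = "gzip", tar = "internal")
unlink("pkg", recursive = TRUE)

## gzfile() decompresses while reading, so untar() sees plain tar blocks.
con <- gzfile(tf, open = "rb")
listed <- untar(con, list = TRUE)
close(con)

## Open a new connection for the extraction step.
con <- gzfile(tf, open = "rb")
untar(con, files = "pkg/DESCRIPTION", exdir = "out")
close(con)

extracted <- file.exists(file.path("out", "pkg", "DESCRIPTION"))
setwd(old)
```

Here `listed` should contain "pkg/DESCRIPTION" and `extracted` should be TRUE.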

In the end, I'd conclude for now that the bug is mostly in the
documentation and the unhelpful error message.

We could try to "fix" your use case by wrapping the connection
by  gzcon(.) and that is okay also for uncompressed tar
files. However it fails for the newer compression schemes which
are all supported via gzfile().
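A small sketch of why the kind of connection matters, under the same
assumptions as the reprex (the scratch directory is hypothetical): a plain
file() connection hands over the compressed bytes as they are on disk,
whereas gzfile() hands over the decompressed tar stream, which is made of
512-byte blocks starting with the member name.

```r
## Build the same small gzip-compressed tar file in a scratch directory.
wd <- tempfile("tar-bytes-")
dir.create(file.path(wd, "pkg"), recursive = TRUE)
old <- setwd(wd)
cat("this: that\n", file = file.path("pkg", "DESCRIPTION"))
tf <- "pkg_1.0.tar.gz"
tar(tf, "pkg", compression = "gzip", tar = "internal")

## file(): the first two bytes are the gzip magic number, not a tar header.
con <- file(tf, open = "rb")
raw_head <- readBin(con, what = "raw", n = 2L)
close(con)

## gzfile(): the first 512-byte block is a tar header whose name field
## begins with the archived path.
con <- gzfile(tf, open = "rb")
tar_head <- readBin(con, what = "raw", n = 512L)
close(con)
setwd(old)

raw_head                   # the gzip magic number
rawToChar(tar_head[1:3])   # start of the first member name
```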

I propose to commit the following change :

1) change the documentation of untar() to say that a connection
   to a compressed tar file should be created by gzfile().
2) in the case of a connection which gave the "block error",
   the error would newly be more helpful, mentioning gzfile().

Currently (with this change applied locally):

> con <- file(tf, open = "rb"); try( untar(con, list = TRUE) ) ## -> Error
Error in untar2(tarfile, files, list, exdir, restore_times) : 
  incomplete block: rather use gzfile(.) created connection?
> 

Feedback (by anyone)  ??

Martin



Re: [Rd] debugonce() functions are not considered as debugged

2018-05-01 Thread Gabe Becker
Gabor,

Others can speak to the origins of this more directly, but from what I
recall this has been true at least since I was working in this space on the
debugcall stuff a couple years ago. I imagine the reasoning  is what you
would expect: a single bit of course can't tell R both that a function is
debugged AND that it should undebug after the first call.  I don't know of
any R-facing way to check for debugonce status, though it's possible I
missed it.

That said, it would be possible to alter how the two bits are used so that
debugonce sets both of them, and debug (not once) only sets one, rather
than them being treated as mutually exclusive. This would alter the behavior so
that debugonce'ed functions that haven't been called yet are considered
debugged, e.g., by isdebugged.

This would not, strictly speaking, be backwards compatible, but by the very
nature of what debugging means, it would not break any existing script
code. It could, and likely would, affect code implementing GUIs, however.

R-core - is this a patch that you are interested in and would consider
incorporating? If so I can volunteer to work on it.

Best,
~G

On Sat, Apr 28, 2018 at 4:57 AM, Gábor Csárdi 
wrote:

> debugonce() sets a different flag (RSTEP), and this is not queried by
> isdebugged(), and it is also not unset by undebug().
>
> Is this expected? If yes, is there a way to query and unset the RSTEP flag
> from R code?
>
> ❯ f <- function() { }
> ❯ debugonce(f)
> ❯ isdebugged(f)
> [1] FALSE
>
> ❯ undebug(f)
> Warning message:
> In undebug(f) : argument is not being debugged
>
> ❯ f()
> debugging in: f()
> debug at #1: {
> }
> Browse[2]>
>
>
>


-- 
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research



Re: [Rd] [EXTERNAL] Re: issue with model.frame()

2018-05-01 Thread Therneau, Terry M., Ph.D. via R-devel
Great catch.  I'm very reluctant to use my own model.frame, since that locks me into 
tracking all the base R changes, potentially breaking survival in a bad way if I miss one.


But, this shows me clearly what the issue is and will allow me to think about 
it.

Another solution for the user is to use multiple ridge() calls to break it up; since 
he/she was using a fixed tuning parameter the result is the same.
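A rough sketch of the "break it up" idea, using cbind() as a stand-in for
ridge() just as the original example does. The group size of 7 and the
set.seed() call are arbitrary choices for this illustration, not something
taken from the user's analysis.

```r
set.seed(1)   # only to make the dummy data reproducible
n <- 30
vname <- vapply(1:26,
                function(i) paste(rep(letters[1:i], 2), collapse = ""),
                "")
tdata <- data.frame(y = 1:n, matrix(runif(n * 26), nrow = n))
names(tdata) <- c("y", vname)

## Split the 26 variables into groups of at most 7 and build one
## cbind() term per group, so that each term stays well below 500 characters.
groups <- split(vname, ceiling(seq_along(vname) / 7))
pieces <- vapply(groups,
                 function(v) paste0("cbind(", paste(v, collapse = ", "), ")"),
                 "")
myform <- paste("y ~", paste(pieces, collapse = " + "))
mf <- model.frame(formula(myform), data = tdata)

## With short terms, every term label should be found among names(mf).
idx <- match(attr(terms(mf), "term.labels"), names(mf))
idx
```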


Terry T.


On 05/01/2018 11:43 AM, Berry, Charles wrote:




On May 1, 2018, at 6:11 AM, Therneau, Terry M., Ph.D. via R-devel 
 wrote:

A user sent me an example where coxph fails, and the root of the failure is a 
case where names(mf) is not equal to the term.labels attribute of the formula 
-- the latter has an extraneous newline. Here is an example that does not use 
the survival library.

# first create a data set with many long names
n <- 30  # number of rows for the dummy data set
vname <- vector("character", 26)
for (i in 1:26) vname[i] <- paste(rep(letters[1:i],2), collapse='')  # long variable names

tdata <- data.frame(y=1:n, matrix(runif(n*26), nrow=n))
names(tdata) <- c('y', vname)

# Use it in a formula
myform <- paste("y ~ cbind(", paste(vname, collapse=", "), ")")
mf <- model.frame(formula(myform), data=tdata)

match(attr(terms(mf), "term.labels"), names(mf))   # gives NA



In the user's case the function is ridge(x1, x2, ) rather than cbind, but 
the effect is the same.
Any ideas for a work around?


Maybe add a `yourclass' class to mf and dispatch to a model.frame.yourclass 
method where the width cutoff arg here (around lines 57-58 of 
model.frame.default) is made larger:

varnames <- sapply(vars, function(x) paste(deparse(x, width.cutoff = 500),
 collapse = " "))[-1L]

??



Aside: the ridge() function is very simple, it was added as an example to show 
how a user can add their own penalization to coxph.  I never expected serious 
use of it.  For this particular user the best answer is to use glmnet instead.  
 He/she is trying to apply an L2 penalty to a large number of SNP * covariate 
interactions.

Terry T.




HTH,

Chuck





Re: [Rd] issue with model.frame()

2018-05-01 Thread William Dunlap via R-devel
You run into the same problem when using 'non-syntactical' names:

> mfB <- model.frame(y ~ `Temp(C)` + `Pres(mb)`,
data=data.frame(check.names=FALSE, y=1:10, `Temp(C)`=21:30,
`Pres(mb)`=991:1000))
> match(attr(terms(mfB), "term.labels"), names(mfB))   # gives NA's
[1] NA NA
> attr(terms(mfB), "term.labels")
[1] "`Temp(C)`"  "`Pres(mb)`"
> names(mfB)
[1] "y"        "Temp(C)"  "Pres(mb)"

Note that names(mfB) does not give a hint as to whether they represent R
expressions or not (in this case they do not).  When they do represent R
expressions then one could parse() them and compare them to
as.list(attr(terms(mfB), "variables"))[-1].
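A sketch of that comparison, applied to the long-name example from the
start of this thread, where the names of the model frame are R expressions.
It would not help for the back-quoted names above, which are not. The
helper names `vars`, `parsed` and `same` are made up for this illustration.

```r
n <- 30
vname <- vapply(1:26,
                function(i) paste(rep(letters[1:i], 2), collapse = ""),
                "")
tdata <- data.frame(y = 1:n, matrix(runif(n * 26), nrow = n))
names(tdata) <- c("y", vname)
myform <- paste("y ~ cbind(", paste(vname, collapse = ", "), ")")
mf <- model.frame(formula(myform), data = tdata)

## The unevaluated variables of the model, without the leading `list` symbol.
vars <- as.list(attr(terms(mf), "variables"))[-1]

## Parse each column name back into a language object and compare
## it with the corresponding variable, instead of comparing strings.
parsed <- lapply(names(mf), function(nm) parse(text = nm)[[1]])
same <- unlist(Map(identical, parsed, vars))
same
```

Comparing language objects rather than deparsed strings avoids any
dependence on where the deparser happens to break a long line.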


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, May 1, 2018 at 6:11 AM, Therneau, Terry M., Ph.D. via R-devel <
r-devel@r-project.org> wrote:

> A user sent me an example where coxph fails, and the root of the failure
> is a case where names(mf) is not equal to the term.labels attribute of the
> formula -- the latter has an extraneous newline. Here is an example that
> does not use the survival library.
>
> # first create a data set with many long names
> n <- 30  # number of rows for the dummy data set
> vname <- vector("character", 26)
> for (i in 1:26) vname[i] <- paste(rep(letters[1:i],2), collapse='')  # long variable names
>
> tdata <- data.frame(y=1:n, matrix(runif(n*26), nrow=n))
> names(tdata) <- c('y', vname)
>
> # Use it in a formula
> myform <- paste("y ~ cbind(", paste(vname, collapse=", "), ")")
> mf <- model.frame(formula(myform), data=tdata)
>
> match(attr(terms(mf), "term.labels"), names(mf))   # gives NA
>
> 
>
> In the user's case the function is ridge(x1, x2, ) rather than cbind,
> but the effect is the same.
> Any ideas for a work around?
>
> Aside: the ridge() function is very simple, it was added as an example to
> show how a user can add their own penalization to coxph.  I never expected
> serious use of it.  For this particular user the best answer is to use
> glmnet instead.   He/she is trying to apply an L2 penalty to a large number
> of SNP * covariate interactions.
>
> Terry T.
>
>



Re: [Rd] [EXTERNAL] Re: issue with model.frame()

2018-05-01 Thread Berry, Charles
Unfortunately, I spoke too soon.

model.frame calls formula <- terms(formula, data = data) if formula does not 
inherit from class "terms" as in your case.

And that is where the bad term.labels attribute comes from.

So, the fix I suggested won't work.

But maybe you can just supply a terms object to model.frame that has correct 
term.labels.

Chuck


> On May 1, 2018, at 10:55 AM, Therneau, Terry M., Ph.D. via R-devel 
>  wrote:
> 
> Great catch.  I'm very reluctant to use my own model.frame, since that locks 
> me into tracking all the base R changes, potentially breaking survival in a 
> bad way if I miss one.



[Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error

2018-05-01 Thread Scott Kostyshak
I have very little knowledge about file encodings and would like to
learn more.

I've read the following pages to learn more:

  
  http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
  https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
  https://developer.r-project.org/Encodings_and_R.html

The last one, in particular, has been very helpful. I would be
interested in any further references that you suggest.

I attach a file that reproduces the issue I would like to learn more
about. I do not know if the file encoding will be correctly preserved
through email, so I also provide the file (temporarily) on Dropbox here:

  
  https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0

The file gives an error when using "source()" with the
argument echo = TRUE:

  > source("encoding_export_issue.R", echo = TRUE)
  Error in nchar(dep, "c") : invalid multibyte string, element 1
  In addition: Warning message:
  In grepl("^[[:blank:]]*$", dep[1L]) :
    input string 1 is invalid in this locale

The problem comes from the "á" character in the .R file. The file
appears to be encoded as "iso-8859-1":

  $ file --mime-encoding encoding_export_issue.R 
  encoding_export_issue.R: iso-8859-1

Note that for me:

  > getOption("encoding")
  [1] "native.enc"

so "native.enc" is used for the "encoding" argument of source().

The following two calls succeed:

  > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
  > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")

Is this file a valid "iso-8859-1" encoded file?  Why does source() fail
when encoding is set to "native.enc"? Is it because of the
UTF-8 settings in my locale (see info on my system at the bottom of
this email)?
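As a small illustration of what is probably going on (a sketch, not a
diagnosis of source() itself): in ISO-8859-1 the "á" is the single byte
0xE1, which is not a valid sequence on its own in UTF-8. The byte string
below is constructed by hand to mimic the comment line of the attached
file.

```r
## "# Ch" + 0xE1 + "vez": the comment line as ISO-8859-1 bytes.
bytes <- c(charToRaw("# Ch"), as.raw(0xe1), charToRaw("vez"))
x <- rawToChar(bytes)

## Interpreted as UTF-8, the lone 0xE1 byte is invalid.
valid_before <- validUTF8(x)

## Declaring the input as latin1 lets iconv() re-encode it to UTF-8,
## where "á" becomes the two bytes 0xC3 0xA1.
y <- iconv(x, from = "latin1", to = "UTF-8")
valid_after <- validUTF8(y)
a_bytes <- charToRaw(y)[5:6]
```

This is consistent with the call that passes encoding = "iso-8859-1"
succeeding in a UTF-8 locale.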

I'm guessing it would be a bad idea to put

  options(encoding = "unknown")

in my .Rprofile, because it is difficult to always correctly guess the
encoding of files? Is there a reason why setting it to "unknown" would
lead to more problems than leaving it set to "native.enc"?

I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
is my session info and locale info for my system with the 3.4.3 version:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3

> Sys.getlocale()
[1] 
"LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

Thanks for your time,

Scott


-- 
Scott Kostyshak
Assistant Professor of Economics
University of Florida
https://people.clas.ufl.edu/skostyshak/

# Chávez
quantile_type <- 4



Re: [Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error

2018-05-01 Thread Ista Zahn
Hi Scott,

This question is appropriate for the r-help mailing list, but probably
off-topic here on r-devel.

Best,
Ista

On Tue, May 1, 2018 at 2:57 PM, Scott Kostyshak  wrote:
> I have very little knowledge about file encodings and would like to
> learn more.
>
> I've read the following pages to learn more:
>
>   
>   http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
>   https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
>   https://developer.r-project.org/Encodings_and_R.html
>
> The last one, in particular, has been very helpful. I would be
> interested in any further references that you suggest.
>
> I attach a file that reproduces the issue I would like to learn more
> about. I do not know if the file encoding will be correctly preserved
> through email, so I also provide the file (temporarily) on Dropbox here:
>
>   
>   https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0
>
> The file gives an error when using "source()" with the
> argument echo = TRUE:
>
>   > source("encoding_export_issue.R", echo = TRUE)
>   Error in nchar(dep, "c") : invalid multibyte string, element 1
>   In addition: Warning message:
>   In grepl("^[[:blank:]]*$", dep[1L]) :
> input string 1 is invalid in this locale
>
> The problem comes from the "á" character in the .R file. The file
> appears to be encoded as "iso-8859-1":
>
>   $ file --mime-encoding encoding_export_issue.R
>   encoding_export_issue.R: iso-8859-1
>
> Note that for me:
>
>   > getOption("encoding")
>   [1] "native.enc"
>
> so "native.enc" is used for the "encoding" argument of source().
>
> The following two calls succeed:
>
>   > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
>   > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")
>
> Is this file a valid "iso-8859-1" encoded file?  Why does source() fail
> in the case of encoding set to "native.enc"? Is it because of the
> settings to UTF-8 in my locale (see info on my system at the bottom of
> this email).
>
> I'm guessing it would be a bad idea to put
>
>   options(encoding = "unknown")
>
> in my .Rprofile, because it is difficult to always correctly guess the
> encoding of files? Is there a reason why setting it to "unknown" would
> lead to more problems than leaving it set to "native.enc"?
>
> I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
> is my session info and locale info for my system with the 3.4.3 version:
>
>> sessionInfo()
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.3 LTS
>
> Matrix products: default
> BLAS: /usr/lib/libblas/libblas.so.3.6.0
> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.3
>
>> Sys.getlocale()
> [1] 
> "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
>
> Thanks for your time,
>
> Scott
>
>
> --
> Scott Kostyshak
> Assistant Professor of Economics
> University of Florida
> https://people.clas.ufl.edu/skostyshak/
>
>
>



Re: [Rd] issue with model.frame()

2018-05-01 Thread Martin Maechler
> Berry, Charles 
> on Tue, 1 May 2018 16:43:18 + writes:

>> On May 1, 2018, at 6:11 AM, Therneau, Terry M., Ph.D. via R-devel 
 wrote:
>> 
>> A user sent me an example where coxph fails, and the root of the failure
>> is a case where names(mf) is not equal to the term.labels attribute of the
>> formula -- the latter has an extraneous newline. Here is an example that does
>> not use the survival library.
>> 
>> # first create a data set with many long names
>> n <- 30  # number of rows for the dummy data set
>> vname <- vector("character", 26)
>> for (i in 1:26) vname[i] <- paste(rep(letters[1:i],2), collapse='')  # long variable names
>> 
>> tdata <- data.frame(y=1:n, matrix(runif(n*26), nrow=n))
>> names(tdata) <- c('y', vname)
>> 
>> # Use it in a formula
>> myform <- paste("y ~ cbind(", paste(vname, collapse=", "), ")")
>> mf <- model.frame(formula(myform), data=tdata)
>> 
>> match(attr(terms(mf), "term.labels"), names(mf))   # gives NA
>> 
>> 
>> 
>> In the user's case the function is ridge(x1, x2, ) rather than
>> cbind, but the effect is the same.
>> Any ideas for a work around?

> Maybe add a `yourclass' class to mf and dispatch to a
> model.frame.yourclass method where the width cutoff arg here (around lines
> 57-58 of model.frame.default) is made larger:

> varnames <- sapply(vars, function(x) paste(deparse(x, width.cutoff = 500),
> collapse = " "))[-1L]

What version of R is that ?  In current versions it is

varnames <- vapply(vars, deparse2, " ")[-1L]

and deparse2() is a slightly enhanced version of the above
function, again with  'width.cutoff = 500'

*BUT* if you read  help(deparse)  you will learn that 500 is the
upper bound allowed currently (and yes, one could consider
increasing that, as it has been unchanged in R since the very
beginning; I have checked R version 0.49 from 1997).

On the other hand, deparse2 (and your older code above) do paste
all the parts together  via  collapse = " "  so I don't see
quite yet ...
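To make the deparse() point concrete, here is a small sketch using the long
cbind() call from the original example. The call is roughly 760 characters
long, so even at the maximal width.cutoff of 500 it is split into more than
one piece, and pasting the pieces with collapse = " " gives a single string
without a newline. The variable names are chosen for this illustration only.

```r
vname <- vapply(1:26,
                function(i) paste(rep(letters[1:i], 2), collapse = ""),
                "")
long_call <- parse(text = paste0("cbind(",
                                 paste(vname, collapse = ", "),
                                 ")"))[[1]]

## deparse() returns a character vector with one element per output line.
pieces <- deparse(long_call, width.cutoff = 500L)
length(pieces)
nchar(pieces)

## Joining the pieces with a space, as deparse2() does, leaves no newline.
one_line <- paste(pieces, collapse = " ")
has_newline <- grepl("\n", one_line, fixed = TRUE)
has_newline
```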

Martin


>> Aside: the ridge() function is very simple, it was added as an example
>> to show how a user can add their own penalization to coxph.  I never expected
>> serious use of it.  For this particular user the best answer is to use glmnet
>> instead.   He/she is trying to apply an L2 penalty to a large number of SNP *
>> covariate interactions.
>> 
>> Terry T.



> HTH,

> Chuck



Re: [Rd] issue with model.frame()

2018-05-01 Thread Berry, Charles


> On May 1, 2018, at 1:15 PM, Martin Maechler  
> wrote:
> 
> What version of R is that ?

Sorry. It was 3.4.2. But it doesn't matter, because my diagnosis was wrong even 
there.  I think (based on my reading of my outdated version) the problem is a 
bit upstream in terms() as I noted in a follow-up to Terry.

>  In current versions it is
> 
>varnames <- vapply(vars, deparse2, " ")[-1L]
> 
> and deparse2() is a slightly enhanced version of the above
> function, again with  'width.cutoff = 500'
> 
> *BUT* if you read  help(deparse)  you will learn that 500 is the
> upper bound allowed currently.  (and yes, one could consider
> increasing that as it has been unchanged in R since the very
> beginning (I have checked R version 0.49 from 1997).
> 
> On the other hand, deparse2 (and your older code above) do paste
> all the parts together  via  collapse = " "  so I don't see
> quite yet ...
> 

Again, due to my bad diagnosis, I guess.

Chuck


Re: [Rd] issue with model.frame()

2018-05-01 Thread Therneau, Terry M., Ph.D. via R-devel
I want to add that the priority for this is rather low, since we have a couple of 
workarounds for the user/data set in question.  I have some ideas about changing the way in 
which ridge() works, which might make the problem moot.  The important short-term result 
was finding that it wasn't an error of mine in the survival package. :-)


Add it to your "think about it" list.

Terry


On 05/01/2018 03:15 PM, Martin Maechler wrote:

What version of R is that ?  In current versions it is

 varnames <- vapply(vars, deparse2, " ")[-1L]

and deparse2() is a slightly enhanced version of the above
function, again with  'width.cutoff = 500'

*BUT*  if you read  help(deparse)  you will learn that 500 is the
upper bound allowed currently.  (and yes, one could consider
increasing that as it has been unchanged in R since the very
beginning (I have checked R version 0.49 from 1997).

On the other hand, deparse2 (and your older code above) do paste
all the parts together  via  collapse = " "  so I don't see
quite yet ...

Martin

