[Rd] ?symbols man page (PR#8713)
Full_Name: TOBY MARTHEWS
Version: 2.2.1
OS: Windows XP
Submission from: (NULL) (139.133.7.37)

Just a small one, but it did trip me up today. On the ?symbols man page, the following paragraph:

inches: If 'inches' is 'FALSE', the units are taken to be those of the x axis...

should say:

inches: If 'inches' is 'FALSE', the units are taken to be those of the y axis...

Thanks, Toby

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] not a problem: a submission (PR#8714)
Full_Name: TOBY MARTHEWS
Version: 2.1.1
OS: Windows XP
Submission from: (NULL) (139.133.7.38)

I think you should have better examples on the ?plot man page, if you don't mind me saying. Can I suggest something like the following? It would probably prevent many of the emails to R-help about how to put on error bars.

Cheers, Toby M

**

xvalsT1=1:3
yvalsT1=c(10,20,30)
errplusT1=c(2,3,4)
errminusT1=c(9,8,7)
xvalsT2=1:3
yvalsT2=c(30,35,40)
errplusT2=c(12,13,14)
errminusT2=c(1,2,3)
treatment=c(rep("T1",times=length(xvalsT1)),rep("T2",times=length(xvalsT2)))
xvals=c(xvalsT1,xvalsT2)
yvals=c(yvalsT1,yvalsT2)
errplus=c(errplusT1,errplusT2)
errminus=c(errminusT1,errminusT2)
RGR=data.frame(treatment,xvals,yvals,errplus,errminus)
minx=min(RGR$xvals); maxx=max(RGR$xvals)
miny=min(RGR$yvals-RGR$errminus); maxy=max(RGR$yvals+RGR$errplus)
plot(x=0,y=0,type="n",xlim=c(minx,maxx),ylim=c(miny,maxy),lab=c(2,4,0),
     xlab="month",ylab="Relative Growth Rate",axes=FALSE)
axis(1,at=1:3,month.abb[1:3]) # axis(1,at=1:3,labels=c("Jan","Feb","Mar")) has the same effect
axis(2)
trts=c("T1","T2"); syms=c(21,24)
for (i in 1:2) {
  A=subset(RGR,treatment==trts[i])
  points(x=A$xvals,y=A$yvals,pch=syms[i])
  segments(A$xvals,A$yvals-A$errminus,A$xvals,A$yvals+A$errplus)
  # similar to symbols(x=A$xvals,y=A$yvals,boxplots=cbind(0,0,A$errminus,A$errplus,0.5),inches=FALSE,add=TRUE)
  errwidth=0.015
  segments(A$xvals-errwidth,A$yvals+A$errplus,A$xvals+errwidth,A$yvals+A$errplus)
  segments(A$xvals-errwidth,A$yvals-A$errminus,A$xvals+errwidth,A$yvals-A$errminus)
  lines(x=A$xvals,y=A$yvals,lty=syms[i])
}
# PS - this is a bit of an inelegant way to put on error bars, but to do better you
# have to use commands like plotCI {gplots} or xYplot {Hmisc} - to learn more look at RSiteSearch("error bars")
legend(x=2.7,y=9,trts,pch=syms) # in same units as the axes

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] configure on solaris 2.9 with non GNU compilers (PR#8300)
Full_Name: Toby Muhlhofer
Version: 2.2.0, 2.1.1
OS: Solaris 2.9
Submission from: (NULL) (128.83.62.46)

I'm trying to compile R on a Solaris machine. The default C compiler is cc (although gcc is available) and the default Fortran compiler is f95 (although g77 is available).

Without defining the F77 environment variable, configure defaults to f95 as the Fortran compiler and eventually fails with the following output:

checking whether mixed C/Fortran code can be run... configure: WARNING: cannot run mixed C/Fortan code
configure: error: Maybe check LDFLAGS for paths to Fortran libraries?

Setting LDFLAGS to the path where the Fortran libraries sit makes the C compiler complain. If I give the value g77 (or the full path to g77) to F77, there are two interesting issues:

1) After defining F77 to be g77:

checking whether we are using the GNU Fortran 77 compiler... no
checking whether g77 accepts -g... yes

Why does configure think we are not using the GNU Fortran 77 compiler? But more importantly:

2)

checking how to get verbose linking output from g77... configure: WARNING: compilation failed
checking for Fortran libraries of g77...
checking how to get verbose linking output from cc... -###
checking for C libraries of cc... -L/usr/local/lib -lthread
checking for dummy main to link with Fortran libraries... none
checking for Fortran name-mangling scheme... configure: error: cannot compile a simple Fortran program

I tried to compile a simple "Hello World" program with either Fortran compiler and both work without a problem.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] patch for gregexpr(perl=TRUE)
Hi all,

Several people have noticed that gregexpr is very slow for large subject strings when perl=TRUE is specified.
- https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
- http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
- https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html

I figured out the issue, which is fixed by changing one line of code in src/main/grep.c -- there is a strlen function call which is currently inside of the while loop over matches, and the patch moves it before the loop.
https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch

I made some figures that show the quadratic time complexity before applying the patch, and the linear time complexity after applying the patch:
https://github.com/tdhock/namedCapture-article#19-feb-2019

I would have posted a bug report on bugs.r-project.org but I do not have an account. So can an R-devel person please either (1) accept this patch, or (2) give me an account so I can post the patch on the bug tracker?

Finally, I would like to mention that Bill Dunlap noticed a similar problem (time complexity which is quadratic in subject size) for strsplit with perl=TRUE. My patch does NOT fix that, but I suspect that a similar fix could be accomplished (because I see that strlen is being called in a while loop in do_strsplit as well).

Thanks
Toby Dylan Hocking

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases
Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent yesterday)

I believe that I have found another bug, this time in the substring function. The use case that I am concerned with is when there is a single (character scalar) text/subject, and many substrings to extract. For example

substring("", 1:4, 1:4)

or more generally,

N=1000
substring(paste(rep("A", N), collapse=""), 1:N, 1:N)

The problem I observe is that the time complexity is quadratic in N, as shown on this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source: https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

I expected the time complexity to be linear in N.

The example above may seem contrived/trivial, but it is indeed relevant to a number of packages (rex, rematch2, namedCapture) which provide functions that use gregexpr and then substring to extract the text in the captured sub-patterns. The figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
shows the issue: these packages have quadratic time complexity, whereas other packages (and the gregexpr function with perl=TRUE, after applying the patch discussed yesterday) have linear time complexity. I believe the problem is the substring function. Source for this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R

I suspect that a fix can be accomplished by optimizing the implementation of substring, for the special case when the text/subject is a single element (character scalar). Right now I notice that the substring R code uses rep_len so that the text/subject which is passed to the C code is a character vector with the same length as the number of substrings to extract. Maybe the C code is calling strlen for each of these (identical) text/subject elements?

Anyway, it would be useful to have some feedback to make sure this is indeed a bug before I post on bugzilla.
(btw thanks Martin for signing me up for an account)

Toby

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases
Update: I have observed that stringi::stri_sub has linear time complexity, and it computes the same thing as base::substring.

figure: https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source: https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

To me this is a clear indication of a bug in substring, but again it would be nice to have some feedback/confirmation before posting on bugzilla. Also this suggests a fix -- just need to copy whatever stringi::stri_sub is doing.

On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking wrote:

> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
> yesterday)
>
> I believe that I have found another bug, this time in the substring
> function. The use case that I am concerned with is when there is a single
> (character scalar) text/subject, and many substrings to extract. For example
>
> substring("", 1:4, 1:4)
>
> or more generally,
>
> N=1000
> substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
>
> The problem I observe is that the time complexity is quadratic in N, as
> shown on this figure
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
> source:
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>
> I expected the time complexity to be linear in N.
>
> The example above may seem contrived/trivial, but it is indeed relevant to
> a number of packages (rex, rematch2, namedCapture) which provide functions
> that use gregexpr and then substring to extract the text in the captured
> sub-patterns. The figure
> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
> shows the issue: these packages have quadratic time complexity, whereas
> other packages (and the gregexpr function with perl=TRUE after applying the
> patch discussed yesterday) have linear time complexity. I believe the
> problem is the substring function. Source for this figure:
> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
>
> I suspect that a fix can be accomplished by optimizing the implementation
> of substring, for the special case when the text/subject is a single
> element (character scalar). Right now I notice that the substring R code
> uses rep_len so that the text/subject which is passed to the C code is a
> character vector with the same length as the number of substrings to
> extract. Maybe the C code is calling strlen for each of these (identical)
> text/subject elements?
>
> Anyway, it would be useful to have some feedback to make sure this is
> indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
> up for an account)
>
> Toby

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] R pkg install should fail for unsuccessful DLL copy on windows?
Hi all,

I am having an issue related to installing packages on windows with R-3.6.0. When installing a package that is in use, I expected R to stop with an error. However I am getting a warning that the DLL copy was not successful, but the overall package installation IS successful. This is quite dangerous because the old DLL and the new R code could be incompatible.

I am definitely not the first person to have this issue.
* Matt Dowle reported https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17478 which was never addressed.
* Jim Hester reported https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17453 which was apparently addressed in R-3.5.1, via https://github.com/wch/r-source/commit/828a04f9c428403e476620b1905a1d8ca41d0bcd

But I am now having the same issue in R-3.6.0 -- is this a regression in R? Or is there another fix that I can use?

Below is the minimal R code that I used to reproduce the issue. Essentially,
* I start R with --vanilla and set options repos=cloud and warn=2 (which I expect should convert warnings to errors).
* I do library(penaltyLearning) and then install the package from source, which results in the warnings. I expected there should be an error.

th798@cmp2986 MINGW64 ~/R
$ R --vanilla -e "options(repos='https://cloud.r-project.org', warn=2);library(penaltyLearning);install.packages('penaltyLearning', type='source');getOption('warn');sessionInfo()"

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> options(repos='https://cloud.r-project.org', warn=2);library(penaltyLearning);install.packages('penaltyLearning', type='source');getOption('warn');sessionInfo()
Loading required package: data.table
Registered S3 methods overwritten by 'ggplot2':
  method         from
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
trying URL 'https://cloud.r-project.org/src/contrib/penaltyLearning_2018.09.04.tar.gz'
Content type 'application/x-gzip' length 2837289 bytes (2.7 MB)
downloaded 2.7 MB

* installing *source* package 'penaltyLearning' ...
** package 'penaltyLearning' successfully unpacked and MD5 sums checked
** using staged installation
** libs
c:/Rtools/mingw_64/bin/g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-36~1.0/include" -DNDEBUG -O2 -Wall -mtune=generic -c interface.cpp -o interface.o
c:/Rtools/mingw_64/bin/g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-36~1.0/include" -DNDEBUG -O2 -Wall -mtune=generic -c largestContinuousMinimum.cpp -o largestContinuousMinimum.o
largestContinuousMinimum.cpp: In function 'int largestContinuousMinimum(int, double*, double*, int*)':
largestContinuousMinimum.cpp:38:27: warning: 'start' may be used uninitialized in this function [-Wmaybe-uninitialized]
   index_vec[0] = start;
                           ^
c:/Rtools/mingw_64/bin/g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-36~1.0/include" -DNDEBUG -O2 -Wall -mtune=generic -c modelSelection.cpp -o modelSelection.o
/usr/bin/sed: -e expression #1, char 1: unknown command: `C'
c:/Rtools/mingw_64/bin/g++ -shared -s -static-libgcc -o penaltyLearning.dll tmp.def interface.o largestContinuousMinimum.o modelSelection.o -LC:/PROGRA~1/R/R-36~1.0/bin/x64 -lR
installing to C:/Program Files/R/R-3.6.0/library/00LOCK-penaltyLearning/00new/penaltyLearning/libs/x64
** R
** data
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
converting help for package 'penaltyLearning'
    finding HTML links ... done
    GeomTallRect                            html
    IntervalRegressionCV                    html
    IntervalRegressionCVmargin              html
    IntervalRegressionInternal              html
    IntervalRegressionRegularized           html
    IntervalRegressionUnregularized         html
    ROChange                                html
    change.colors                           html
    change.labels                           html
    changeLabel                             html
    check_features_targets                  html
    check_target_pred                       html
    coef.IntervalRegression                 html
    demo8                                   html
    featureMatrix                           html
    featureVector                           html
    geom_tallrect                           html
    labelError                              html
    largestContinuousMinimumC               html
Re: [Rd] R pkg install should fail for unsuccessful DLL copy on windows?
    modelSelectionR                         html
    neuroblastomaProcessed                  html
    oneSkip                                 html
    plot.IntervalRegression                 html
    predict.IntervalRegression              html
    print.IntervalRegression                html
    squared.hinge                           html
    targetIntervalROC                       html
    targetIntervalResidual                  html
    targetIntervals                         html
    theme_no_space                          html
** building package indices
** testing if installed package can be loaded
* DONE (penaltyLearning)

The downloaded source packages are in
        'C:\Users\th798\AppData\Local\Temp\RtmpkVV0sH\downloaded_packages'
[1] 2
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] penaltyLearning_2018.09.04 data.table_1.12.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       assertthat_0.2.1 dplyr_0.8.1      crayon_1.3.4
 [5] R6_2.4.0         grid_3.6.0       plyr_1.8.4       magic_1.5-9
 [9] gtable_0.3.0     magrittr_1.5     scales_1.0.0     ggplot2_3.1.1
[13] pillar_1.4.0     rlang_0.3.4      lazyeval_0.2.2   geometry_0.4.1
[17] tools_3.6.0      glue_1.3.1       purrr_0.3.2      munsell_0.5.0
[21] abind_1.4-7      compiler_3.6.0   pkgconfig_2.0.2  colorspace_1.4-1
[25] tidyselect_0.2.5 tibble_2.1.1
>
th798@cmp2986 MINGW64 ~/projects/max-generalized-auc (master)
$

On Wed, May 29, 2019 at 8:15 PM Jan Gorecki wrote:

> Hi Toby,
> AFAIK it has not been addressed in R. You can handle the problem on
> your package side, see
> https://github.com/Rdatatable/data.table/pull/3237
> Regards,
> Jan
Re: [Rd] R pkg install should fail for unsuccessful DLL copy on windows?
Thanks for your input Hervé. Glad to hear I'm not the only one still having this issue.

In my opinion install.packages should stop with an error (instead of a warning) if this happens. However even if you want to keep the warning, at least make it so that users can set options(warn=2) to get an error if they want one. I tried setting options(warn=2) but for some reason I still get a warning. I believe that is a bug in install.packages -- if I specify options(warn=2) it should convert that warning to an error (but it currently does not).

Toby

On Thu, May 30, 2019 at 4:50 PM Pages, Herve wrote:

> Also note that this can lead to people not being able to load the
> package if the set of .Call entry points has changed between the old
> and new versions of the package. We strongly suspect that this is what
> happened to this Bioconductor user:
>
>   https://support.bioconductor.org/p/121228/
>
> Note that she's installing the binary and in this case no warning
> is issued. All we see is:
>
>   package 'S4Vectors' successfully unpacked and MD5 sums checked
>
> but the old DLL apparently didn't get replaced with the new one.
> Hence the
>
>   error: "make_RAW_from_NA_LLINT" not available for .Call() for package "S4Vectors"
>
> later on when trying to load the package.
>
> Cheers,
> H.
>
> On 5/30/19 16:31, Toby Hocking wrote:
> > thanks for the tip Jan.
> >
> > However it would be nice if I didn't have to handle this myself for all of
> > my packages. (and teach my students how to do that)
> >
> > BTW I tried to disable staged installation, and the issue still happens:
> >
> > th798@cmp2986 MINGW64 ~/projects/max-generalized-auc (master)
> > $ R_INSTALL_STAGED=FALSE R --vanilla -e
> > ".libPaths('~/R/library');.libPaths();options(repos='https://cloud.r-project.org',
> > warn=2);library(penaltyLearning);install.packages('penaltyLearning',
> > type='source');getOption('warn');sessionInfo()"
> >
> > > .libPaths('~/R/library');.libPaths();options(repos='https://cloud.r-project.org',
> > warn=2);library(penaltyLearning);install.packages('penaltyLearning',
> > type='source');getOption('warn');sessionInfo()
> > [1] "C:/Users/th798/R/library"  "C:/Program Files/R/R-3.6.0/library"
> > Loading required package: data.table
> > Registered S3 methods overwritten by 'ggplot2':
> >   method         from
> >   [.quosures     rlang
> >   c.quosures     rlang
> >   print.quosures rlang
> > Installing package into 'C:/Users/th798/R/library'
> > (as 'lib' is unspecified)
> > trying URL 'https://cloud.r-project.org/src/contrib/penaltyLearning_2018.09.04.tar.gz'
> > Content type 'application/x-gzip' length 2837289 bytes (2.7 MB)
> > downloaded 2.7 MB
> >
> > * installing *source* p
Re: [Rd] R pkg install should fail for unsuccessful DLL copy on windows?
If anybody else has this issue, please add a comment on https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17478 so we are more likely to get R-core to address this.

Thanks
Toby

On Tue, Jun 4, 2019 at 2:58 PM Pages, Herve wrote:

> On 5/31/19 08:41, Toby Hocking wrote: ...
> > In my opinion install.packages should stop with an error (instead of a
> > warning) if this happens.
>
> Totally agree with that.
>
> Best,
> H.
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Feature request: non-dropping regmatches/strextract
If you want "to extract regex matches into a new column in a data.frame" then there are some package functions which do exactly that. Three examples are namedCapture::df_match_variable, rematch2::bind_re_match, and tidyr::extract.

For a more detailed discussion see my R Journal submission (under review) about regular expression packages:
https://raw.githubusercontent.com/tdhock/namedCapture-article/master/RJwrapper.pdf
Comments/suggestions welcome.

On Thu, Aug 15, 2019 at 12:15 AM Cyclic Group Z_1 via R-devel <r-devel@r-project.org> wrote:

> A very common use case for regmatches is to extract regex matches into a
> new column in a data.frame (or data.table, etc.) or otherwise use the
> extracted strings alongside the input. However, the default behavior is to
> drop empty matches, which results in mismatches in column length if
> reassignment is done without subsetting.
>
> For consistency with other R functions and compatibility with this use
> case, it would be nice if regmatches did not automatically drop empty
> matches and would instead insert an NA_character_ value (similar to
> stringr::str_extract). This alternative regmatches could be implemented
> through an optional drop argument, a new function, or mentioned in the
> documentation (a la resample in ?sample).
>
> Alternatively, at the moment, there is a non-exported function strextract
> in utils which is very similar to stringr::str_extract. It would be great
> if this function, once exported, were to include a drop argument to prevent
> dropping positions with no matches.
>
> An example solution (last option):
>
> strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE, drop = TRUE) {
>   m <- regexec(pattern, x, perl=perl, useBytes=useBytes)
>   result <- regmatches(x, m)
>   if (isTRUE(drop)) {
>     unlist(result)
>   } else if (isFALSE(drop)) {
>     unlist({result[lengths(result)==0] <- NA_character_; result})
>   } else {
>     stop("Invalid argument for `drop`")
>   }
> }
>
> Based on Ricardo Saporta's response to "How to prevent regmatches drop non
> matches?"
>
> --CG

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] R CMD build should fail early for old package versions?
Hi all,

Today I had an R CMD build that failed while building a vignette, because the vignette needs tidyr (>= 1.0, declared in DESCRIPTION Suggests) but my system had a previous version installed. It did not take me too long to figure out the issue (solved by upgrading tidyr), but it would have been even faster/easier if R CMD build had failed early, with an error message that says something like

"Cannot build package XXX because it Suggests: tidyr (>= 1.0) but tidyr 0.8.3 is installed"

Is that possible?

Toby

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] stats::reshape quadratic in number of input columns
Hi R-core,

I have been performance testing R packages for wide-to-tall data reshaping, and for the most part I see that they differ by constant factors. However, in one test, which involves converting into multiple output columns, I see that stats::reshape is in fact quadratic in the number of input columns.

For example take the iris data, which has 4 input columns to reshape, and the desired output has columns named Species, Sepal, Petal, dimension (where dimension is either Length or Width). Of course there is no performance issue with N=4 input columns in the original iris data, but I made larger versions of this reshaping problem by making copies of the input columns.

The results https://github.com/tdhock/nc-article#28-oct-2019 show that the quadratic time complexity results in significant slowdowns after about N=10,000 input columns to reshape (e.g. several minutes for stats::reshape versus several seconds for data.table::melt).

For a fix, I would suggest looking into how the same operation is implemented in the data.table package, which in my test shows computation times that seem to be linear.

Toby

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] add jsslogo.jpg to R sources?
Hi R-core,

I was wondering if somebody could please add jsslogo.jpg to the R sources? (As I reported yesterday in this bug:)
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17687

R already includes jss.cls, which is the document class file for the Journal of Statistical Software. Actually, for the jss.cls file to be useful, it also requires jsslogo.jpg in order to compile JSS articles without error.

This is an issue for me because I am writing a JSS paper that includes figures created using tikzDevice, which I am telling to use the jss class for computing metrics. On debian/ubuntu the R-src/share/texmf directory is copied to /usr/share/texmf/tex/latex/R, so tikzDevice is finding jss.cls in /usr/share/texmf/tex/latex/R/tex/latex/jss.cls but it is failing with a 'jsslogo not found' error -- the fix is to also include jsslogo.jpg in the R sources (in the same directory as jss.cls).

Thanks and happy new year,
Toby

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] add jsslogo.jpg to R sources?
Hi there, thanks for the feedback, sorry about the cross-posting, and that makes sense given the nojss option, which I was not aware of.

On Wed, Jan 8, 2020 at 9:16 AM Achim Zeileis wrote:

> On Wed, 8 Jan 2020, Iñaki Ucar wrote:
>
> > On Wed, 8 Jan 2020 at 19:21, Toby Hocking wrote:
> >>
> >> Hi R-core, I was wondering if somebody could please add jsslogo.jpg to the
> >> R sources? (as I reported yesterday in this bug)
> >>
> >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17687
> >>
> >> R already includes jss.cls which is the document class file for Journal of
> >> Statistical Software. Actually, for the jss.cls file to be useful, it also
> >> requires jsslogo.jpg in order to compile JSS articles without error.
> >>
> >> This is an issue for me because I am writing a JSS paper that includes
> >> figures created using tikzDevice, which I am telling to use the jss class
> >> for computing metrics. On debian/ubuntu the R-src/share/texmf directory is
> >> copied to /usr/share/texmf/tex/latex/R, so tikzDevice is finding jss.cls in
> >> /usr/share/texmf/tex/latex/R/tex/latex/jss.cls but it is failing with a
> >> 'jsslogo not found' error -- the fix is to also include jsslogo.jpg in the
> >> R sources (in the same directory as jss.cls).
> >
> > Why don't you just include jsslogo.jpg in your working directory?
> > jss.cls is included in the R sources because there are many vignettes
> > with the JSS style, but always *without* the logo. The logo should
> > only be used for actual JSS publication, so I think that the R sources
> > are no place for it.
>
> Thanks, Iñaki, you are right. The motivation for including jss.cls and
> jss.bst in the R sources was to facilitate turning JSS papers into
> vignettes (see the FAQ at https://www.jstatsoft.org/pages/view/style)
> with \documentclass[nojss]{jss}. Before jss.cls/bst were shipped along with
> base R many packages shipped with their own copy, which seemed like a waste
> of resources and a source of confusion.
>
> When preparing new papers for submission to JSS you can also use the
> "nojss" option; this is also accepted by the journal.
>
> Hope that helps,
> Achim
>
> P.S.: Toby, if you plan on discussing such an issue anyway, I would
> recommend waiting with the bug report. Cross-posting on different channels
> is always a bit of a nuisance.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] docs about _R_CHECK_FORCE_SUGGESTS_ ?
Can someone please add documentation for that environment variable to Writing R Extensions? An appropriate place would be section https://cloud.r-project.org/doc/manuals/r-release/R-exts.html#Suggested-packages which already discusses _R_CHECK_DEPENDS_ONLY_=true [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] docs about _R_CHECK_FORCE_SUGGESTS_ ?
I agree with the doc updates Gabe proposes; they would be helpful. On Wed, May 13, 2020 at 12:56 PM Gabriel Becker wrote: > Hi Toby, > > As Gabor pointed out, the place where the various levers R CMD check > supports are documented is the R-internals manual, but there is a link directly to that > section in > https://cloud.r-project.org/doc/manuals/r-release/R-exts.html#Checking-packages > > It could perhaps be more prominent, for example by moving the paragraph that > the link appears in to before the detailed list of exact tests that are performed? > I'm happy to put a patch for that together if there is a) interest, and b) > a patch is preferable to someone on R-core simply doing that migration > themselves. > > I do also agree that given that _R_CHECK_DEPENDS_ONLY_ and > _R_CHECK_SUGGESTS_ONLY_ > are mentioned in the section you link, it would perhaps make sense to > mention _R_CHECK_FORCE_SUGGESTS_ as well. I can put that in the patch as > well, if there is agreement from R-core that one or both of these changes > make sense. > > Best, > ~G > > On Wed, May 13, 2020 at 11:07 AM Gábor Csárdi > wrote: > >> See at https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Tools >> >> Gabor >> >> On Wed, May 13, 2020 at 7:05 PM Toby Hocking wrote: >> > >> > Can someone please add documentation for that environment variable to >> > Writing R Extensions? An appropriate place would be section >> > >> https://cloud.r-project.org/doc/manuals/r-release/R-exts.html#Suggested-packages >> > which already discusses _R_CHECK_DEPENDS_ONLY_=true >> > >> > [[alternative HTML version deleted]] >> > >> > __ >> > R-devel@r-project.org mailing list >> > https://stat.ethz.ch/mailman/listinfo/r-devel >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
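[Editorial sketch, not part of the original thread] For concreteness, a hypothetical example of how these two variables are typically combined when checking a package; the tarball name mypkg_1.0.tar.gz is a placeholder. Setting _R_CHECK_FORCE_SUGGESTS_ to false tells check not to fail merely because a Suggests package is unavailable:

```r
# Set the check environment variables in the current session; they are
# inherited by the child process started with system2().
Sys.setenv("_R_CHECK_FORCE_SUGGESTS_" = "false",
           "_R_CHECK_DEPENDS_ONLY_"   = "true")
# Run R CMD check on a (placeholder) package tarball.
system2(file.path(R.home("bin"), "R"),
        c("CMD", "check", "mypkg_1.0.tar.gz"))
```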
[Rd] mclapply memory leak?
Dear R-devel, I am running mclapply with many iterations over a function that modifies nothing and makes no copies of anything. It is taking up a lot of memory, so it seems to me like this is a bug. Should I post this to bugs.r-project.org?

A minimal reproducible example can be obtained by first starting a memory monitoring program such as htop, and then executing the following code while looking at how much memory is being used by the system:

library(parallel)
seconds <- 5
N <- 10
result.list <- mclapply(1:N, function(i)Sys.sleep(1/N*seconds))

On my system, memory usage goes up about 60MB on this example. But it does not go up at all if I change mclapply to lapply. Is this a bug?

For a more detailed discussion with a figure that shows that the memory overhead is linear in N, please see https://github.com/tdhock/mclapply-memory

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  graphics  utils  datasets  stats  grDevices  methods
[8] base

other attached packages:
[1] ggplot2_1.0.1      RColorBrewer_1.0-5 lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6             digest_0.6.4            MASS_7.3-43
 [4] grid_3.2.2              plyr_1.8.1              gtable_0.1.2
 [7] scales_0.2.3            reshape2_1.2.2          proto_1.0.0
[10] labeling_0.2            tools_3.2.2             stringr_0.6.2
[13] dichromat_2.0-0         munsell_0.4.2           PeakSegJoint_2015.08.06
[16] compiler_3.2.2          colorspace_1.2-4

[[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] mclapply memory leak?
right, it is not a memory leak, sorry for the misleading subject line. the problem is the fact that the memory usage goes up, linearly with the length of the first argument to mclapply. in practice with large data sets this can cause the machine to start swapping, or to have my cluster jobs killed due to using too much memory. On Wed, Sep 2, 2015 at 2:35 PM, Gabriel Becker wrote: > Well it's only a leak if you don't get the memory back after it returns, > right? > > Anyway, one (untested by me) possibility is the copying of memory pages > when the garbage collector touches objects, as pointed out by Radford Neal > here: > http://r.789695.n4.nabble.com/Re-R-devel-Digest-Vol-149-Issue-22-td4710367.html > > If so, I don't think this would be easily avoidable, but there may be > mitigation strategies. > > ~G > > On Wed, Sep 2, 2015 at 10:12 AM, Toby Hocking wrote: > >> Dear R-devel, >> >> I am running mclapply with many iterations over a function that modifies >> nothing and makes no copies of anything. It is taking up a lot of memory, >> so it seems to me like this is a bug. Should I post this to >> bugs.r-project.org? >> >> A minimal reproducible example can be obtained by first starting a memory >> monitoring program such as htop, and then executing the following code >> while looking at how much memory is being used by the system >> >> library(parallel) >> seconds <- 5 >> N <- 10 >> result.list <- mclapply(1:N, function(i)Sys.sleep(1/N*seconds)) >> >> On my system, memory usage goes up about 60MB on this example. But it does >> not go up at all if I change mclapply to lapply. Is this a bug? 
>> >> For a more detailed discussion with a figure that shows that the memory >> overhead is linear in N, please see >> https://github.com/tdhock/mclapply-memory >> >> > sessionInfo() >> R version 3.2.2 (2015-08-14) >> Platform: x86_64-pc-linux-gnu (64-bit) >> Running under: Ubuntu precise (12.04.5 LTS) >> >> locale: >> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_CA.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_CA.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel graphics utils datasets stats grDevices methods >> [8] base >> >> other attached packages: >> [1] ggplot2_1.0.1 RColorBrewer_1.0-5 lattice_0.20-33 >> >> loaded via a namespace (and not attached): >> [1] Rcpp_0.11.6 digest_0.6.4MASS_7.3-43 >> [4] grid_3.2.2 plyr_1.8.1 gtable_0.1.2 >> [7] scales_0.2.3reshape2_1.2.2 proto_1.0.0 >> [10] labeling_0.2tools_3.2.2 stringr_0.6.2 >> [13] dichromat_2.0-0 munsell_0.4.2 >> PeakSegJoint_2015.08.06 >> [16] compiler_3.2.2 colorspace_1.2-4 >> >> [[alternative HTML version deleted]] >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > > > > -- > Gabriel Becker, PhD > Computational Biologist > Bioinformatics and Computational Biology > Genentech, Inc. > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] mclapply memory leak?
Thanks for the detailed analysis Simon. I figured out a workaround that seems to be working in my real application. By limiting the length of the first argument to mclapply (to the number of cores), I get speedups while limiting the memory overhead.

### Run mclapply inside of a for loop, ensuring that it never receives
### a first argument with a length more than maxjobs. This avoids some
### memory problems (swapping, or getting jobs killed on the cluster)
### when using mclapply(1:N, FUN) where N is large.
maxjobs.mclapply <- function(X, FUN, maxjobs=getOption("mc.cores")){
  N <- length(X)
  i.list <- splitIndices(N, N/maxjobs)
  result.list <- list()
  for(i in seq_along(i.list)){
    i.vec <- i.list[[i]]
    result.list[i.vec] <- mclapply(X[i.vec], FUN)
  }
  result.list
}

On Thu, Sep 3, 2015 at 5:27 PM, Simon Urbanek wrote: > Toby, > > > On Sep 2, 2015, at 1:12 PM, Toby Hocking wrote: > > > > Dear R-devel, > > > > I am running mclapply with many iterations over a function that modifies > > nothing and makes no copies of anything. It is taking up a lot of memory, > > so it seems to me like this is a bug. Should I post this to > > bugs.r-project.org? > > > > A minimal reproducible example can be obtained by first starting a memory > > monitoring program such as htop, and then executing the following code > > while looking at how much memory is being used by the system > > > > library(parallel) > > seconds <- 5 > > N <- 10 > > result.list <- mclapply(1:N, function(i)Sys.sleep(1/N*seconds)) > > > > On my system, memory usage goes up about 60MB on this example. But it > does > > not go up at all if I change mclapply to lapply. Is this a bug? > > > > For a more detailed discussion with a figure that shows that the memory > > overhead is linear in N, please see > > https://github.com/tdhock/mclapply-memory > > > > > I'm not quite sure what is supposed to be the issue here. 
One would expect > the memory used will be linear in the number of elements you process - by > definition of the task, since you'll be creating linearly many more objects. > > Also using top doesn't actually measure the memory used by R itself (see > FAQ 7.42). > > That said, I re-ran your script and it didn't look anything like what you > have on your webpage. For the NULL result you end up dealing with all the > objects you create in your test that overshadow any memory usage and > stabilizes after garbage-collection. As you would expect, any output of top > is essentially bogus up to a gc. How much memory R will use is essentially > governed by the level at which you set the gc trigger. In the real world you > actually want that to be fairly high if you can afford it (in gigabytes, > not megabytes), because you often get much higher performance by delaying > gcs if you don't have low total memory (essentially using the memory as a > buffer). Given that the usage is so negligible, it won't trigger any gc on > its own, so you're just measuring accumulated objects - which will > always be higher for mclapply because of the bookkeeping and serialization > involved in the communication. > > The real difference is only in the df case. The reason for it is that your > lapply() there is simply a no-op, because R is smart enough to realize that > you are always returning the same object so it won't actually create > anything and just return a reference back to df - thus using no memory at > all. However, once you split the inputs, your main session can no longer > perform this optimization because the processing is now in a separate > process, so it has no way of knowing that you are returning the object > unmodified. So what you are measuring is a special case that is arguably > not really relevant in real applications. 
> > Cheers, > Simon > > > > >> sessionInfo() > > R version 3.2.2 (2015-08-14) > > Platform: x86_64-pc-linux-gnu (64-bit) > > Running under: Ubuntu precise (12.04.5 LTS) > > > > locale: > > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_CA.UTF-8 > > [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_CA.UTF-8 > > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] parallel graphics utils datasets stats grDevices methods > > [8] base > > > > other attached packages: > > [1] ggplot2_1.0.1 RColorBrewer_1.0-5 lattice_0.20-33 > > > > loaded via a
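[Editorial sketch, not part of the original thread] Simon's point that top is "essentially bogus up to a gc" can be checked from inside R: compare what the OS monitor reports with what R's own garbage collector reports. The interpretation in the comments is mine, and the exact numbers are machine-dependent; mclapply() forks, so this sketch is for a Unix-alike, not Windows.

```r
library(parallel)

# Reset R's own memory statistics, run the example, then ask the R
# allocator (not the OS) how much was really used.
invisible(gc(reset = TRUE))
res <- mclapply(1:10, function(i) Sys.sleep(0.01), mc.cores = 2)
after <- gc()
# The "max used" column reports peak R-managed allocation since the
# reset; pages still shown by top beyond this are allocator/fork
# bookkeeping the OS has not reclaimed, not an R-level leak.
print(after)
```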
[Rd] Un-informative Error in re-building vignettes
I am getting the following on CRAN windows and winbuilder https://www.r-project.org/nosvn/R.check/r-devel-windows-ix86+x86_64/penaltyLearning-00check.html Apparently there is an error in re-building vignettes, but I do not have any idea what it is, because all that is listed is three dots (...). Is this a bug in R CMD check? If not, the only solution I can think of is removing the vignette entirely. Any other ideas? - checking re-building of vignette outputs ... [11s] WARNING Error in re-building vignettes: ... - checking PDF version of manual ... OK [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
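[Editorial sketch, not part of the original post] Not an authoritative fix, but one way to see the underlying error locally is to re-build the vignettes directly, which prints the full vignette-engine output instead of the truncated check log. Run from the package source directory:

```r
# Re-build all vignettes of the package in the current directory ".";
# any error from the vignette engine (Sweave/knitr/LaTeX) is shown in
# full on the console rather than being summarized by R CMD check.
tools::buildVignettes(dir = ".")
```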
[Rd] valgrind false positive on R startup?
Hi all, I'm on Ubuntu 18.04, running R-4.0.0 which I compiled from source, and using valgrind I am always seeing the following message. Does anybody else see that? Is that a known false positive? Any ideas how to fix/suppress? Seems related to TRE, do I need to upgrade that? (base) tdhock@maude-MacBookPro:~/R/binsegRcpp$ R --vanilla -d valgrind -e 'extSoftVersion()' ==9565== Memcheck, a memory error detector ==9565== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==9565== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==9565== Command: /home/tdhock/lib/R/bin/exec/R --vanilla -e extSoftVersion() ==9565== ==9565== Conditional jump or move depends on uninitialised value(s) ==9565==at 0x55AB9E0: __wcsnlen_sse4_1 (strlen.S:147) ==9565==by 0x5598EC1: wcsrtombs (wcsrtombs.c:104) ==9565==by 0x551EB20: wcstombs (wcstombs.c:34) ==9565==by 0x50BAA07: wcstombs (stdlib.h:154) ==9565==by 0x50BAA07: tre_parse_bracket_items (tre-parse.c:336) ==9565==by 0x50BAA07: tre_parse_bracket (tre-parse.c:453) ==9565==by 0x50BAA07: tre_parse (tre-parse.c:1380) ==9565==by 0x50B2498: tre_compile (tre-compile.c:1920) ==9565==by 0x50AFBE0: tre_regcompb (regcomp.c:150) ==9565==by 0x4FA9F42: do_gsub (grep.c:2023) ==9565==by 0x4F79045: bcEval (eval.c:7090) ==9565==by 0x4F8572F: Rf_eval (eval.c:723) ==9565==by 0x4F8754E: R_execClosure (eval.c:1888) ==9565==by 0x4F88316: Rf_applyClosure (eval.c:1814) ==9565==by 0x4F85902: Rf_eval (eval.c:846) ==9565== R version 4.0.0 (2020-04-24) -- "Arbor Day" Copyright (C) 2020 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. 
Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > extSoftVersion() zlibbzlib "1.2.11" "1.0.6, 6-Sept-2010" xz PCRE "5.2.2" "10.31 2018-02-12" ICU TRE "60.2""TRE 0.8.0 R_fixes (BSD)" iconv readline "glibc 2.27""7.0" BLAS "/home/tdhock/lib/R/lib/libRblas.so" > > ==9565== ==9565== HEAP SUMMARY: ==9565== in use at exit: 40,492,919 bytes in 9,170 blocks ==9565== total heap usage: 19,784 allocs, 10,614 frees, 62,951,535 bytes allocated ==9565== ==9565== LEAK SUMMARY: ==9565==definitely lost: 0 bytes in 0 blocks ==9565==indirectly lost: 0 bytes in 0 blocks ==9565== possibly lost: 0 bytes in 0 blocks ==9565==still reachable: 40,492,919 bytes in 9,170 blocks ==9565== of which reachable via heuristic: ==9565== newarray : 4,264 bytes in 1 blocks ==9565== suppressed: 0 bytes in 0 blocks ==9565== Rerun with --leak-check=full to see details of leaked memory ==9565== ==9565== For counts of detected and suppressed errors, rerun with: -v ==9565== Use --track-origins=yes to see where uninitialised values come from ==9565== ERROR SUMMARY: 46 errors from 1 contexts (suppressed: 0 from 0) (base) tdhock@maude-MacBookPro:~/R/binsegRcpp$ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] valgrind false positive on R startup?
Great, that works for me, thanks! If this is an issue with glibc, has it been reported to the glibc devs, and are they working on a fix? On Tue, Jun 9, 2020 at 9:12 PM Prof Brian Ripley wrote: > It is known, with a known workaround, see e.g. > https://www.stats.ox.ac.uk/pub/bdr/memtests/README.txt . Set > suppressions in ~/.valgrindrc, e.g. the CRAN check machine has > > --suppressions=/data/blackswan/ripley/wcsrtombs.supp > > It is an issue in your OS (glibc), not TRE nor R. > > On 10/06/2020 00:21, Toby Hocking wrote: > > Hi all, > > > > I'm on Ubuntu 18.04, running R-4.0.0 which I compiled from source, and > > using valgrind I am always seeing the following message. Does anybody > > else see that? Is that a known false positive? Any ideas how to > > fix/suppress? Seems related to TRE, do I need to upgrade that? > > > > (base) tdhock@maude-MacBookPro:~/R/binsegRcpp$ R --vanilla -d valgrind > > -e 'extSoftVersion()' > > ==9565== Memcheck, a memory error detector > > ==9565== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. 
> > ==9565== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright > info > > ==9565== Command: /home/tdhock/lib/R/bin/exec/R --vanilla -e > extSoftVersion() > > ==9565== > > ==9565== Conditional jump or move depends on uninitialised value(s) > > ==9565==at 0x55AB9E0: __wcsnlen_sse4_1 (strlen.S:147) > > ==9565==by 0x5598EC1: wcsrtombs (wcsrtombs.c:104) > > ==9565==by 0x551EB20: wcstombs (wcstombs.c:34) > > ==9565==by 0x50BAA07: wcstombs (stdlib.h:154) > > ==9565==by 0x50BAA07: tre_parse_bracket_items (tre-parse.c:336) > > ==9565==by 0x50BAA07: tre_parse_bracket (tre-parse.c:453) > > ==9565==by 0x50BAA07: tre_parse (tre-parse.c:1380) > > ==9565==by 0x50B2498: tre_compile (tre-compile.c:1920) > > ==9565==by 0x50AFBE0: tre_regcompb (regcomp.c:150) > > ==9565==by 0x4FA9F42: do_gsub (grep.c:2023) > > ==9565==by 0x4F79045: bcEval (eval.c:7090) > > ==9565==by 0x4F8572F: Rf_eval (eval.c:723) > > ==9565==by 0x4F8754E: R_execClosure (eval.c:1888) > > ==9565==by 0x4F88316: Rf_applyClosure (eval.c:1814) > > ==9565==by 0x4F85902: Rf_eval (eval.c:846) > > ==9565== > > > > R version 4.0.0 (2020-04-24) -- "Arbor Day" > > Copyright (C) 2020 The R Foundation for Statistical Computing > > Platform: x86_64-pc-linux-gnu (64-bit) > > > > R is free software and comes with ABSOLUTELY NO WARRANTY. > > You are welcome to redistribute it under certain conditions. > > Type 'license()' or 'licence()' for distribution details. > > > >Natural language support but running in an English locale > > > > R is a collaborative project with many contributors. > > Type 'contributors()' for more information and > > 'citation()' on how to cite R or R packages in publications. > > > > Type 'demo()' for some demos, 'help()' for on-line help, or > > 'help.start()' for an HTML browser interface to help. > > Type 'q()' to quit R. 
> > > >> extSoftVersion() > > zlib > bzlib > > "1.2.11" "1.0.6, > 6-Sept-2010" > >xz > PCRE > > "5.2.2" "10.31 > 2018-02-12" > > ICU > TRE > >"60.2""TRE 0.8.0 R_fixes > (BSD)" > > iconv > readline > > "glibc 2.27" > "7.0" > > BLAS > > "/home/tdhock/lib/R/lib/libRblas.so" > >> > >> > > ==9565== > > ==9565== HEAP SUMMARY: > > ==9565== in use at exit: 40,492,919 bytes in 9,170 blocks > > ==9565== total heap usage: 19,784 allocs, 10,614 frees, 62,951,535 > > bytes allocated > > ==9565== > > ==9565== LEAK SUMMARY: > > ==9565==definitely lost: 0 bytes in 0 blocks > > ==9565==indirectly lost: 0 bytes in 0 blocks > > ==9565== possibly lost: 0 bytes in 0 blocks > > ==9565==still reachable: 40,492,919 bytes in 9,170 blocks > > ==9565== of which reachable via heuristic: > > ==9565== newarray : 4,264 bytes in 1 > blocks > > ==9565== suppressed: 0 bytes in 0 blocks > > ==9565== Rerun w
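[Editorial sketch, not part of the original thread] The workaround Brian Ripley describes takes the form of a valgrind suppression file. A sketch of what ~/.valgrindrc plus a wcsrtombs suppression might contain, with frame names taken from the backtrace above; the exact contents of the CRAN file at the linked URL may differ, and the paths are hypothetical:

```
# ~/.valgrindrc (one option per line):
#   --suppressions=/home/tdhock/wcsrtombs.supp
#
# /home/tdhock/wcsrtombs.supp:
{
   glibc-wcsrtombs-uninit
   Memcheck:Cond
   fun:__wcsnlen_sse4_1
   fun:wcsrtombs
}
```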
[Rd] Error in substring: invalid multibyte string
Hi all, I'm getting the following error from substring:

> substr("Jens Oehlschl\xe4gel-Akiyoshi", 1, 100)
Error in substr("Jens Oehlschl\xe4gel-Akiyoshi", 1, 100) :
  invalid multibyte string at 'gel-A<6b>iyoshi'

Is that normal / intended? I've tried setting the Encoding/locale to Latin-1/UTF-8 but that does not help. nchar gives me something similar

> nchar("Jens Oehlschl\xe4gel-Akiyoshi")
Error in nchar("Jens Oehlschl\xe4gel-Akiyoshi") :
  invalid multibyte string, element 1

I find it strange that substr/nchar give an error but regexpr works for telling me the length:

> regexpr(".*", "Jens Oehlschl\xe4gel-Akiyoshi")
[1] 1
attr(,"match.length")
[1] 29

Is that inconsistency normal/intended? btw this example comes from our very own list:

> readLines("https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html")[28]
[1] "Jens Oehlschl\xe4gel-Akiyoshi"

Best, Toby [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Error in substring: invalid multibyte string
Thanks for the quick response Ivan. readLines with encoding='latin1' works for me (on Ubuntu). However I was more concerned with the inconsistency in results between substr and regexpr. I was expecting that if one of them errors because of an unknown encoding then the other should as well. Even better, if regexpr works, why shouldn't substr work as well? Incidentally the analogous stringi function stri_sub works fine in this case: > stringi::stri_sub("Jens Oehlschl\xe4gel-Akiyoshi", 1, 100) [1] "Jens Oehlschl\xe4gel-Akiyoshi" But the stringi analog to nchar gives a similar warning: > stringi::stri_length("Jens Oehlschl\xe4gel-Akiyoshi") [1] NA Warning message: In stringi::stri_length("Jens Oehlschl\xe4gel-Akiyoshi") : invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8() On Sat, Jun 27, 2020 at 2:12 AM Ivan Krylov wrote: > On Fri, 26 Jun 2020 15:57:06 -0700 > Toby Hocking wrote: > > >invalid multibyte string at 'gel-A<6b>iyoshi' > > >https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html > > The server says that the text is UTF-8: > > curl -sI \ > https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html | \ > grep Content-Type > # Content-Type: text/html; charset=UTF-8 > > But it's not, at least not all of it. If you ask readLines to mark > the text as Latin-1, you get Jens Oehlschlägel-Akiyoshi without the > mojibake and invalid multi-byte characters: > > x <- readLines( > 'https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html', > encoding = 'latin1' > )[28] > substr(x, 1, 100) > # [1] "Jens Oehlschlägel-Akiyoshi" > > The behaviour we observe when encoding = 'latin1' is not specified > results from returned lines having "unknown" encoding. The substr() > implementation tries to interpret such strings according to multi-byte C > locale rules (using mbrtowc(3)). 
On my system (yours too, probably, if > it's GNU/Linux or macOS), the multi-byte C locale encoding is UTF-8, > and this Latin-1 string does not result in valid code points when > decoded as UTF-8. > > -- > Best regards, > Ivan > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
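[Editorial sketch, not part of the original thread] Ivan's diagnosis suggests a fix at the string level as well as at readLines(): declaring the bytes' actual encoding lets the string functions decode them correctly. A minimal sketch, assuming a UTF-8 session locale:

```r
x <- "Jens Oehlschl\xe4gel-Akiyoshi"
# The bytes are Latin-1; mark them as such instead of leaving the
# encoding "unknown", so substr()/nchar() no longer try to decode
# them with the multi-byte (UTF-8) C locale rules.
Encoding(x) <- "latin1"
nchar(x)           # character count, no error
substr(x, 1, 13)   # "Jens Oehlschl"
```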
Re: [Rd] HELPWANTED keyword in bugs.r-project.org
Hi Luke, I just wanted to say thanks for taking the time to add this tag. It is very helpful to know which bugs are worth working on and need help. Keep up the good work! Toby On Wed, Aug 5, 2020 at 7:23 AM wrote: > Just a quick note to mention that we have added a HELPWANTED keyword > on bugs.r-project.org for tagging bugs and issues where a good > well-tested patch would be particularly appreciated. You can find the > HELPWANTED issues by selecting the keyword in the search interface or at > > https://bugs.r-project.org/bugzilla/buglist.cgi?keywords=HELPWANTED > > This URL shows both open and resolved HELPWANTED issues. > > At the moment only a handful of issues have been tagged, but there > will be more over time. One of these may be a good place to start if > you are looking for ways to contribute. The technical level varies; > some might be resolved with a small amount of R code; others might > need more extensive changes at the C level. > > Best, > > luke > > > -- > Luke Tierney > Ralph E. Wareham Professor of Mathematical Sciences > University of Iowa Phone: 319-335-3386 > Department of Statistics and Fax: 319-335-3017 > Actuarial Science > 241 Schaeffer Hall email: luke-tier...@uiowa.edu > Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Stale link from ?check to R Internals
Hi the reference to R Internals https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Tools in ?check (PkgUtils.Rd in utils package) is stale. Here is my proposed patch (use a named reference rather than a numeric reference to avoid any similar broken links in the future).

Index: src/library/utils/man/PkgUtils.Rd
===
--- src/library/utils/man/PkgUtils.Rd (revision 79049)
+++ src/library/utils/man/PkgUtils.Rd (working copy)
@@ -40,7 +40,7 @@
   set by environment variables \env{_R_BUILD_RESAVE_DATA_} and
   \env{_R_BUILD_COMPACT_VIGNETTES_}: see \sQuote{Writing \R Extensions}.
   Many of the checks in \command{R CMD check} can be turned off or on by
-  environment variables: see Chapter 6 of the \sQuote{R Internals} manual.
+  environment variables: see Chapter "Tools" of the \sQuote{R Internals} manual.
 
   By default \command{R CMD build} uses the \code{"internal"} option to
   \code{\link{tar}} to prepare the tarball. An external \command{tar}

[[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] r-project.org SSL certificate issues
Hi win-builder certificate expired on Aug 15. My student on the other side of the world is also seeing this problem so I think it needs to be fixed...

> download.file("https://win-builder.r-project.org", "/tmp/wb.html")
trying URL 'https://win-builder.r-project.org'
Error in download.file("https://win-builder.r-project.org", "/tmp/wb.html") :
  cannot open URL 'https://win-builder.r-project.org'
In addition: Warning message:
In download.file("https://win-builder.r-project.org", "/tmp/wb.html") :
  URL 'https://win-builder.r-project.org/': status was 'Peer certificate cannot be authenticated with given CA certificates'
>

On Wed, Jun 10, 2020 at 2:40 AM Gábor Csárdi wrote: > My (also not expert) understanding is that there is nothing insecure about > alternative certificate chains at all. All browsers and macOS's built in > SSL library (secure transport) support them properly. OpenSSL and LibreSSL > were/are simply broken. This was not such a big issue so far, but now that > some old long lived certificates are expiring, it is increasingly an issue. > > FWIW it is possible to build libcurl on macOS without any external SSL > library, so OpenSSL and LibreSSL are not needed at all. (Unfortunately the > libcurl build that comes with most (all?) macOS versions does use > LibreSSL.) The R installer could link to such a static libcurl library on > macOS, and that would solve the issue for macOS. Whether it should, that's > another question. > > Gabor > > On Wed, Jun 10, 2020 at 9:56 AM peter dalgaard wrote: > > > As I said, there is stuff that I don't understand in here (including > > why browsers apparently do trust alternative chains) > > > > -pd > > > > > On 10 Jun 2020, at 01:53 , Simon Urbanek > > wrote: > > > > > > You are making a very strong assumption that finding an alternative > > chain of trust is safe. I'd argue it's not - it means that an adversary > > could manipulate the chain in a way to trust it instead of the declared > > chain and thus subverting it. 
In fact switching to OpenSSL would create a > > serious security hole here - in particular since it installs a separate > > trust store which it is far more easily attacked and subverted. By your > > argument we should disable all SSL checks as that produces error with > > incorrectly configured servers so not performing checks is better. It is > > true that R is likely not used for sensitive transactions, but I would > > rather it warned me about situations where the communication may be > > compromised instead of just silently going along. > > > > > > Cheers, > > > Simon > > > > > > > > > > > >> On Jun 10, 2020, at 11:39 AM, peter dalgaard > wrote: > > >> > > >> Yes and no... At least as I understand it (Disclaimer: There are > things > > I am pretty sure that I don't understand properly, somewhere in the > Bermuda > > triangle beween CA bundles, TLS protocols, and Server-side settings), > there > > are two sided to this: > > >> > > >> One is that various *.r-project.org servers got hit by a fumble where > > a higher-up certificate in the chain of trust expired before the *. > > r-project.org one. This was fixed by changing the certificate chain on > > each server. > > >> > > >> The other side is that this situation hit Mac users harder than > others, > > because Apple's LibreSSL doesn't have the same feature that openSSL has > to > > detect a secondary chain of trust when the primary one expired. This was > > not unique to R - svn also failed from the command line - but it did > affect > > download.file() inside R. > > >> > > >> The upshot is that there might be 3rd party servers with a similar > > certificate setup which have not been updated like *.r-project.org. This > > is not too unlikely since web browsers do not have trouble accessing > them, > > and the whole matter may go undetected. For such servers, download.file() > > would still fail. > > >> > > >> I.e., there is a case to be made that we might want to link openSSL > > rather than LibreSSL. 
On the other hand, I gather that newer versions of > > LibreSSL contain the relevant protocol upgrade, so maybe one can just > wait > > for Apple to update it. Or maybe we do want to link R against openSSL, > but > > almost certainly not for a hotfix release. > > >> > > >> Best > > >> -pd > > >> > > >>> On 10 Jun 2020, at 00:50 , Simon Urbanek > > > wrote: > > >>> > > >>> To be clear, this not an issue in the libraries nor R, the > > certificates on the server were simply wrong. So, no, this has nothing to > > do with R. > > >>> > > >>> Cheers, > > >>> Simon > > >>> > > >>> > > On Jun 10, 2020, at 10:45 AM, Henrik Bengtsson < > > henrik.bengts...@gmail.com> wrote: > > > > Was this resolved upstream or is this something that R should/could > > fix? If the latter, could this also go into the "emergency release" > R > > 4.0.2 that is scheduled for 2020-06-22? > > > > My $.02 > > > > /Henrik > > > > > > On Sun, May 31, 2020 at
Re: [Rd] Specifying C Standard in Package's Makevars File
WRE explains how to specify C++11/14/etc. standards, but I don't know about C: https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Using-C_002b_002b11-code BTW I believe this question would be more appropriate for R-package-devel. On Mon, Sep 28, 2020 at 4:44 AM Andreas Kersting wrote: > Hi, > > what is the correct way to specify a C standard in a package's Makevars > file? > > Building a package with e.g. PKG_CFLAGS = -std=gnu11 does work but R CMD > check issues a warning: > > * checking compilation flags in Makevars ... WARNING > Non-portable flags in variable 'PKG_CFLAGS': > -std=gnu11 > > (Same for -std=c11.) > > Thanks! Regards, > Andreas Kersting > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] NEWS item for bugfix in normalizePath and file.exists?
Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed in R-devel already. I checked on https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there is no mention of these changes, so I'm wondering if they are intentional? If so, could someone please add a mention of the bugfix in the NEWS?

The problem involves file.exists, on windows, when a long/strange input file name Encoding is unknown, in C locale. I expected that FALSE should be returned (and it is on R-devel), but I got an error in R-4.0.5. Code to reproduce is:

x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
Encoding(x) <- "unknown"
Sys.setlocale(locale="C")
sessionInfo()
file.exists(x)

Output I got from R-4.0.5 was

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] C
system code page: 1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.0.5

> file.exists(x)
Error in file.exists(x) : file name conversion problem -- name too long?
Execution halted

Output I got from R-devel was

> sessionInfo()
R Under development (unstable) (2021-04-26 r80229)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.2.0

> file.exists(x)
[1] FALSE

I also observed similar results when using normalizePath instead of file.exists (error in R-4.0.5, no error in R-devel). 
> normalizePath(x) #R-4.0.5 Error in path.expand(path) : unable to translate 'p' | p'p; | p'p< | p'p= | p'p> | p'p ' to UTF-8 Calls: normalizePath -> path.expand Execution halted > normalizePath(x) #R-devel [1] "C:\\Users\\th798\\R\\\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n" Warning message: In normalizePath(path.expand(path), winslash, mustWork) : path[1]="🧒 | 🧒🏻 | 🧒🏼 | 🧒🏽 | 🧒🏾 | 🧒🏿 ": The filename, directory name, or volume label syntax is incorrect [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] NEWS item for bugfix in normalizePath and file.exists?
Hi Tomas, thanks for the thoughtful reply. That makes sense about the problems with C locale on windows. Actually I did not choose to use C locale, but instead it was invoked automatically during a package check. To be clear, I do NOT have a file with that name, but I do want file.exists to return a reasonable value, FALSE (with no error). If that behavior is unspecified, then should I use something like tryCatch(file.exists(x), error=function(e)FALSE) instead of assuming that file.exists will always return a logical vector without error? For my particular application that work-around should probably be sufficient, but one may imagine a situation where you want to do x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n" Encoding(x) <- "unknown" Sys.setlocale(locale="C") f <- tempfile() cat("", file = f) two <- c(x, f) file.exists(two) and in that case the correct response from R, in my opinion, would be c(FALSE, TRUE) -- not an error. Toby On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera wrote: > Hi Toby, > > a defensive, portable approach would be to use only file names regarded > portable by POSIX, so characters including ASCII letters, digits, > underscore, dot, hyphen (but hyphen should not be the first character). > That would always work on all systems and this is what I would use. > > Individual operating systems and file systems and their configurations > differ in which additional characters they support and how. On some, > file names are just sequences of bytes, on some, they have to be valid > strings in certain encoding (and then with certain exceptions). > > On Windows, file names are at the lowest level in UTF-16LE encoding (and > admitting unpaired surrogates for historical reasons). 
R stores strings > in other encodings (UTF-8, native, Latin-1), so file names have to be > translated to/from UTF-16LE, either directly by R or by Windows. > > But, there is no way to convert (non-ASCII) strings in "C" encoding to > UTF16-LE, so the examples cannot be made to work on Windows. > > When the translation is left on Windows, it assumes the non-UTF-16LE > strings are in the Active Code Page encoding (shown as "system encoding" > in sessionInfo() in R, Latin-1 in your example) instead of the current C > library encoding ("C" in your example). So, file names coming from > Windows will be either the bytes of their UTF-16LE representation or the > bytes of their Latin-1 representation, but which one is subject to the > implementation details, so the result is really unusable. > > I would say using "C" as encoding in R is not a good idea, and > particularly not on Windows. > > I would say that what happens with such file names in "C" encoding is > unspecified behavior, which is subject to change at any time without > notice, and that both the R 4.0.5 and R-devel behavior you are observing > are acceptable. I don't think it should be mentioned in the NEWS. > Personally, I would prefer some stricter checks of strings validity and > perhaps disallowing the "C" encoding in R, so yet another behavior where > it would be clearer that this cannot really work, but that would require > more thought and effort. > > Best > Tomas > > > On 4/27/21 9:53 PM, Toby Hocking wrote: > > > Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed in > > R-devel already. I checked on > > https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there is no > > mention of these changes, so I'm wondering if they are intentional? If > so, > > could someone please add a mention of the bugfix in the NEWS? > > > > The problem involves file.exists, on windows, when a long/strange input > > file name Encoding is unknown, in C locale. 
I expected that FALSE should > be > > returned (and it is on R-devel), but I got an error in R-4.0.5. Code to > > reproduce is: > > > > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| > > \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| > > \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n" > > Encoding(x) <- "unknown" > > Sys.setlocale(locale="C") > > sessionInfo() > > file.exists(x) > > > > Output I got from R-4.0.5 was > > > >> sessionInfo() > > R version 4.0.5 (2021-03-31) > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > Running under: Windows 10 x64 (build 19042) > > > > Matrix products: default > > > > loc
Re: [Rd] NEWS item for bugfix in normalizePath and file.exists?
+1 for Martin's proposal, that makes sense to me too. About Tomas' idea to immediately stop with an error when the user tries to create a string which is invalid in its declared encoding, that sounds great. I'm just wondering if that would break my application. My package is running an example during a check, in which the unicode/emoji is read into R using readLines from a file under inst/extdata, so presumably it should work as long as readLines handles the encoding correctly and/or the locale during package check is changed to something more reasonable on windows? On Wed, Apr 28, 2021 at 9:04 AM Tomas Kalibera wrote: > > On 4/28/21 5:22 PM, Martin Maechler wrote: > >>>>>> Toby Hocking > >>>>>> on Wed, 28 Apr 2021 07:21:05 -0700 writes: > > > Hi Tomas, thanks for the thoughtful reply. That makes sense about > the > > > problems with C locale on windows. Actually I did not choose to > use C > > > locale, but instead it was invoked automatically during a package > check. > > > To be clear, I do NOT have a file with that name, but I do want > file.exists > > > to return a reasonable value, FALSE (with no error). If that > behavior is > > > unspecified, then should I use something like > tryCatch(file.exists(x), > > > error=function(e)FALSE) instead of assuming that file.exists will > always > > > return a logical vector without error? 
For my particular > application that > > > work-around should probably be sufficient, but one may imagine a > situation > > > where you want to do > > > > > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| > > > \360\237\247\222\360\237\217\274\n| > \360\237\247\222\360\237\217\275\n| > > > \360\237\247\222\360\237\217\276\n| > \360\237\247\222\360\237\217\277\n" > > > Encoding(x) <- "unknown" > > > Sys.setlocale(locale="C") > > > f <- tempfile() > > > cat("", file = f) > > > two <- c(x, f) > > > file.exists(two) > > > > > and in that case the correct response from R, in my opinion, > would be > > > c(FALSE, TRUE) -- not an error. > > > Toby > > > > Indeed, thanks a lot to Tomas! > > > > # A remark > > We *could* -- and according to my taste should -- try to have > file.exists() > > return a logical vector in almost all cases, namely, e.g., still give an > > error for file.exists(pi) : > > Notably if `c(...)` {for the `...` arguments of file.exists() } > > is a character vector, always return a logical vector of the same > > length, *and* we could notably make use of the fact that R's > > logical type is not binary but ternary, and hence that return > > value could contain values from {TRUE, NA, FALSE} and interpret NA > > as "don't know" in all cases where the corresponding string in > > the input had an Encoding(.) that was "fishy" in some sense > > given the "context" (OS, locale, OS_version, ICU-presence, ...). > > > > In particular, when the underlying code sees encoding-translation issues > > for a string, NA would be returned instead of an error. > > Yes, I agree with Toby and you that there is benefit in allowing > per-element, vectorized use of file.exists(), and well it is the case > now, we just fall back to FALSE. NA might be be better in case of error > that prevents the function from deciding whether the file exists or not > (e.g. an invalid name in form that make is clear such file cannot exist > might be a different case...). 
> > But, the only way to get a translation error is by passing a string to > file.exists() which is invalid in its declared encoding (or which is in > "C" encoding). I would hope that we could get to the point where such > situation is prevented (we only allow creation of strings that can be > translated to Unicode). If we get there, the example would fail with > error (yet, right, before getting to file.exists()). > > My point that I would not write tests of this behavior stands. One > should not use such file names, and after the change Toby reported from > ERROR to FALSE, Martin's proposal would change to NA, mine eventually to > ERROR, etc. So it is best for now to leave it unspecified and not > trigger it, I think. > > Tomas > > > > > Martin > > > > > On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera < > tomas.kalib...@gmail.com> > > > wrote: > > >
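The per-element, never-error behavior discussed above can be approximated in user code today with the tryCatch work-around mentioned at the start of the thread. A sketch, where `file_exists_na` is a hypothetical helper name (not part of base R), returning NA per Martin's proposal when the check itself fails:

```r
# Hypothetical wrapper: one logical per path; NA (rather than an error)
# for paths that cannot be checked, e.g. due to encoding translation errors.
file_exists_na <- function(paths) {
  vapply(paths, function(p) {
    tryCatch(file.exists(p), error = function(e) NA)
  }, logical(1), USE.NAMES = FALSE)
}

# e.g. file_exists_na(c(tempfile(), tempdir()))
# tempfile() names a file that has not been created -> FALSE;
# tempdir() exists -> TRUE
```

Note this checks each path in its own file.exists() call, trading speed for the guarantee that one untranslatable element cannot abort the whole vectorized check.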
Re: [Rd] [External] Possible ALTREP bug
By the way, where is the documentation for INTEGER_ELT, REAL_ELT, etc? I looked in Writing R Extensions and R Internals but I did not see any mention. REAL_ELT is briefly mentioned on https://svn.r-project.org/R/branches/ALTREP/ALTREP.html Would it be possible to please add some mention of them to Writing R Extensions? - how many of these _ELT functions are there? INTEGER, REAL, ... ? - in what version of R were they introduced? - I guess input types are always SEXP and int? - What are the output types for each? On Fri, May 28, 2021 at 5:16 PM wrote: > Since the INTEGER_ELT, REAL_ELT, etc, functions are fairly new it may > be possible to check that places where they are used allow for them to > allocate. I have fixed the one that got caught by Gabor's example, and > a rchk run might be able to pick up others if rchk knows these could > allocate. (I may also be forgetting other places where the _ELt > methods are used.) Fixing all call sites for REAL, INTEGER, etc, was > never realistic so there GC has to be suspended during the method > call, and that is done in the dispatch mechanism. > > The bigger problem is jumps from inside things that existing code > assumes will not do that. Catching those jumps is possible but > expensive; doing anything sensible if one is caught is really not > possible. > > Best, > > luke > > On Fri, 28 May 2021, Gabriel Becker wrote: > > > Hi Jim et al, > > Just to hopefully add a bit to what Luke already answered, from what I am > > recalling looking back at that bioconductor thread Elt methods are used > in > > places where there are hard implicit assumptions that no garbage > collection > > will occur (ie they are called on things that aren't PROTECTed), and > beyond > > that, in places where there are hard assumptions that no error (longjmp) > > will occur. I could be wrong, but I don't know that suspending garbage > > collection would protect from the second one. 
Ie it is possible that an > > error *ever* being raised from R code that implements an elt method could > > cause all hell to break loose. > > > > Luke or Tomas Kalibera would know more. > > > > I was disappointed that implementing ALTREPs in R code was not in the > cards > > (it was in my original proposal back in 2016 to the DSC) but I trust Luke > > that there are important reasons we can't safely allow that. > > > > Best, > > ~G > > > > On Fri, May 28, 2021 at 8:31 AM Jim Hester > wrote: > > From reading the discussion on the Bioconductor issue tracker it > > seems like > > the reason the GC is not suspended for the non-string ALTREP Elt > > methods is > > primarily due to performance concerns. > > > > If this is the case perhaps an additional flag could be added to > > the > > `R_set_altrep_*()` functions so ALTREP authors could indicate if > > GC should > > be halted when that particular method is called for that > > particular ALTREP > > class. > > > > This would avoid the performance hit (other than a boolean > > check) for the > > standard case when no allocations are expected, but allow > > authors to > > indicate that R should pause GC if needed for methods in their > > class. > > > > On Fri, May 28, 2021 at 9:42 AM wrote: > > > > > integer and real Elt methods are not expected to allocate. You > > would > > > have to suspend GC to be able to do that. This currently can't > > be done > > > from package code. > > > > > > Best, > > > > > > luke > > > > > > On Fri, 28 May 2021, Gábor Csárdi wrote: > > > > > > > I have found some weird SEXP corruption behavior with > > ALTREP, which > > > > could be a bug. (Or I could be doing something wrong.) > > > > > > > > I have an integer ALTREP vector that calls back to R from > > the Elt > > > > method. When this vector is indexed in a lapply(), its first > > element > > > > gets corrupted. Sometimes it's just a type change to > > logical, but > > > > sometimes the corruption causes a crash. 
> > > > > > > > I saw this on macOS from R 3.5.3 to 4.2.0. I created a small > > package > > > > that demonstrates this: > > https://github.com/gaborcsardi/redfish > > > > > > > > The R callback in this package calls > > `loadNamespace("Matrix")`, but > > > > the same crash happens for other packages as well, and > > sometimes it > > > > also happens if I don't load any packages at all. (But that > > example > > > > was much more complicated, so I went with the package > > loading.) > > > > > > > > It is somewhat random, and sometimes turning off the JIT > > avoids the > > > > crash, but not always. > > > > > > > > Hopefully I am just doing something wrong in the ALTREP code > > (see > > > > > >
Re: [Rd] [External] Possible ALTREP bug
Oliver, for clarification that section in writing R extensions mentions VECTOR_ELT and REAL but not REAL_ELT nor any other *_ELT functions. I was looking for an explanation of all the *_ELT functions (which are apparently new), not just VECTOR_ELT. Thanks Simon that response was very helpful. One more question: are there any circumstances in which one should use REAL_ELT(x,i) rather than REAL(x)[i] or vice versa? Or can they be used interchangeably? On Wed, Jun 16, 2021 at 4:29 PM Simon Urbanek wrote: > The usual quote applies: "use the source, Luke": > > $ grep _ELT *.h | sort > Rdefines.h:#define SET_ELEMENT(x, i, val) SET_VECTOR_ELT(x, i, val) > Rinternals.h: The function STRING_ELT is used as an argument to > arrayAssign even > Rinternals.h:#define VECTOR_ELT(x,i)((SEXP *) DATAPTR(x))[i] > Rinternals.h://SEXP (STRING_ELT)(SEXP x, R_xlen_t i); > Rinternals.h:Rbyte (RAW_ELT)(SEXP x, R_xlen_t i); > Rinternals.h:Rbyte ALTRAW_ELT(SEXP x, R_xlen_t i); > Rinternals.h:Rcomplex (COMPLEX_ELT)(SEXP x, R_xlen_t i); > Rinternals.h:Rcomplex ALTCOMPLEX_ELT(SEXP x, R_xlen_t i); > Rinternals.h:SEXP (STRING_ELT)(SEXP x, R_xlen_t i); > Rinternals.h:SEXP (VECTOR_ELT)(SEXP x, R_xlen_t i); > Rinternals.h:SEXP ALTSTRING_ELT(SEXP, R_xlen_t); > Rinternals.h:SEXP SET_VECTOR_ELT(SEXP x, R_xlen_t i, SEXP v); > Rinternals.h:double (REAL_ELT)(SEXP x, R_xlen_t i); > Rinternals.h:double ALTREAL_ELT(SEXP x, R_xlen_t i); > Rinternals.h:int (INTEGER_ELT)(SEXP x, R_xlen_t i); > Rinternals.h:int (LOGICAL_ELT)(SEXP x, R_xlen_t i); > Rinternals.h:int ALTINTEGER_ELT(SEXP x, R_xlen_t i); > Rinternals.h:int ALTLOGICAL_ELT(SEXP x, R_xlen_t i); > Rinternals.h:void ALTCOMPLEX_SET_ELT(SEXP x, R_xlen_t i, Rcomplex v); > Rinternals.h:void ALTINTEGER_SET_ELT(SEXP x, R_xlen_t i, int v); > Rinternals.h:void ALTLOGICAL_SET_ELT(SEXP x, R_xlen_t i, int v); > Rinternals.h:void ALTRAW_SET_ELT(SEXP x, R_xlen_t i, Rbyte v); > Rinternals.h:void ALTREAL_SET_ELT(SEXP x, R_xlen_t i, double v); > Rinternals.h:void 
ALTSTRING_SET_ELT(SEXP, R_xlen_t, SEXP); > Rinternals.h:void SET_INTEGER_ELT(SEXP x, R_xlen_t i, int v); > Rinternals.h:void SET_LOGICAL_ELT(SEXP x, R_xlen_t i, int v); > Rinternals.h:void SET_REAL_ELT(SEXP x, R_xlen_t i, double v); > Rinternals.h:void SET_STRING_ELT(SEXP x, R_xlen_t i, SEXP v); > > So the indexing is with R_xlen_t and they return the value itself as one > would expect. > > Cheers, > Simon > > > > On Jun 17, 2021, at 2:22 AM, Toby Hocking wrote: > > > > By the way, where is the documentation for INTEGER_ELT, REAL_ELT, etc? I > > looked in Writing R Extensions and R Internals but I did not see any > > mention. > > REAL_ELT is briefly mentioned on > > https://svn.r-project.org/R/branches/ALTREP/ALTREP.html > > Would it be possible to please add some mention of them to Writing R > > Extensions? > > - how many of these _ELT functions are there? INTEGER, REAL, ... ? > > - in what version of R were they introduced? > > - I guess input types are always SEXP and int? > > - What are the output types for each? > > > > On Fri, May 28, 2021 at 5:16 PM wrote: > > > >> Since the INTEGER_ELT, REAL_ELT, etc, functions are fairly new it may > >> be possible to check that places where they are used allow for them to > >> allocate. I have fixed the one that got caught by Gabor's example, and > >> a rchk run might be able to pick up others if rchk knows these could > >> allocate. (I may also be forgetting other places where the _ELt > >> methods are used.) Fixing all call sites for REAL, INTEGER, etc, was > >> never realistic so there GC has to be suspended during the method > >> call, and that is done in the dispatch mechanism. > >> > >> The bigger problem is jumps from inside things that existing code > >> assumes will not do that. Catching those jumps is possible but > >> expensive; doing anything sensible if one is caught is really not > >> possible. 
> >> > >> Best, > >> > >> luke > >> > >> On Fri, 28 May 2021, Gabriel Becker wrote: > >> > >>> Hi Jim et al, > >>> Just to hopefully add a bit to what Luke already answered, from what I > am > >>> recalling looking back at that bioconductor thread Elt methods are used > >> in > >>> places where there are hard implicit assumptions that no garbage > >> collection > >>> will occur (ie they are called on things that aren't PROTECTed), and > >> beyond > >>> that, in places where there are hard assumptions that no error > (longjmp) > >>> will occur. I could be wron
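On the REAL_ELT(x, i) versus REAL(x)[i] question, a hedged C sketch of the trade-off as I understand it from the ALTREP design notes (illustrative only, not official guidance): for ordinary vectors the two are interchangeable, but for ALTREP vectors they can behave very differently.

```c
#include <Rinternals.h>

/* Two ways to read a double vector from C code (illustrative sketch).
 * REAL(x) returns one contiguous data pointer up front; for an ALTREP
 * vector this may force the whole vector to be materialized in memory.
 * REAL_ELT(x, i) dispatches per element, so an ALTREP class can serve
 * elements lazily without materializing anything. */
double sum_with_dataptr(SEXP x)
{
    double *p = REAL(x);               /* may materialize an ALTREP vector */
    double s = 0;
    for (R_xlen_t i = 0; i < XLENGTH(x); i++)
        s += p[i];                     /* plain pointer access, no dispatch */
    return s;
}

double sum_with_elt(SEXP x)
{
    double s = 0;
    for (R_xlen_t i = 0; i < XLENGTH(x); i++)
        s += REAL_ELT(x, i);           /* per-element dispatch */
    return s;
}
```

In a tight loop over a known-ordinary vector, the pointer form avoids per-element dispatch overhead; the _ELT form is the safer default when the input might be ALTREP and materialization should be avoided.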
[Rd] na.omit inconsistent with is.na on list
na.omit is documented as "na.omit returns the object with incomplete cases removed." and "At present these will handle vectors," so I expected that when it is used on a list, it should return the same thing as if we subset via is.na; however I observed the following, > L <- list(NULL, NA, 0) > str(L[!is.na(L)]) List of 2 $ : NULL $ : num 0 > str(na.omit(L)) List of 3 $ : NULL $ : logi NA $ : num 0 Should na.omit be fixed so that it returns a result that is consistent with is.na? I assume that is.na is the canonical definition of what should be considered a missing value in R. [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
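If na.omit were to treat is.na as the canonical definition of missingness, a list method could look like the following sketch (hypothetical: stats provides no na.omit method for plain lists, and whether it should is exactly the question of this thread):

```r
# Hypothetical na.omit method for plain lists, using is.na()'s documented
# rule: a list element is "missing" iff it is a length-one atomic vector
# whose single value is NA or NaN.
na.omit.list <- function(object, ...) {
  omit <- which(is.na(object))
  if (length(omit) == 0L) return(object)
  out <- object[-omit]
  # mirror na.omit.default's bookkeeping of what was dropped
  attr(out, "na.action") <- structure(omit, class = "omit")
  out
}

# str(na.omit.list(list(NULL, NA, 0)))
# -> List of 2: NULL and num 0, matching L[!is.na(L)]
```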
Re: [Rd] na.omit inconsistent with is.na on list
Also, the na.omit method for data.frame with list column seems to be inconsistent with is.na, > L <- list(NULL, NA, 0) > str(f <- data.frame(I(L))) 'data.frame': 3 obs. of 1 variable: $ L:List of 3 ..$ : NULL ..$ : logi NA ..$ : num 0 ..- attr(*, "class")= chr "AsIs" > is.na(f) L [1,] FALSE [2,] TRUE [3,] FALSE > na.omit(f) L 1 2 NA 3 0 On Wed, Aug 11, 2021 at 9:58 PM Toby Hocking wrote: > na.omit is documented as "na.omit returns the object with incomplete cases > removed." and "At present these will handle vectors," so I expected that > when it is used on a list, it should return the same thing as if we subset > via is.na; however I observed the following, > > > L <- list(NULL, NA, 0) > > str(L[!is.na(L)]) > List of 2 > $ : NULL > $ : num 0 > > str(na.omit(L)) > List of 3 > $ : NULL > $ : logi NA > $ : num 0 > > Should na.omit be fixed so that it returns a result that is consistent > with is.na? I assume that is.na is the canonical definition of what > should be considered a missing value in R. > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] na.omit inconsistent with is.na on list
Hi Gabe thanks for the feedback. On Thu, Aug 12, 2021 at 1:19 PM Gabriel Becker wrote: > Hi Toby, > > This definitely appears intentional, the first expression of > stats:::na.omit.default is > >if (!is.atomic(object)) > > return(object) > > Based on this code it does seem that the documentation could be clarified to say atomic vectors. > > So it is explicitly just returning the object in non-atomic cases, which > includes lists. I was not involved in this decision (obviously) but my > guess is that it is due to the fact that what constitutes an observation > "being complete" is unclear in the list case. What should > > na.omit(list(5, NA, c(NA, 5))) > > return? Just the first element, or the first and the last? It seems, at > least to me, unclear. > I agree in principle/theory that it is unclear, but in practice is.na has an unambiguous answer (if a list element is scalar NA then it is considered missing, otherwise not). > A small change to the documentation to add "atomic (in the sense of > is.atomic returning \code{TRUE})" in front of "vectors" or similar where > what types of objects are supported seems justified, though, imho, as the > current documentation is either ambiguous or technically incorrect, > depending on what we take "vector" to mean. > > Best, > ~G > > On Wed, Aug 11, 2021 at 10:16 PM Toby Hocking wrote: > >> Also, the na.omit method for data.frame with list column seems to be >> inconsistent with is.na, >> >> > L <- list(NULL, NA, 0) >> > str(f <- data.frame(I(L))) >> 'data.frame': 3 obs. of 1 variable: >> $ L:List of 3 >> ..$ : NULL >> ..$ : logi NA >> ..$ : num 0 >> ..- attr(*, "class")= chr "AsIs" >> > is.na(f) >> L >> [1,] FALSE >> [2,] TRUE >> [3,] FALSE >> > na.omit(f) >>L >> 1 >> 2 NA >> 3 0 >> >> On Wed, Aug 11, 2021 at 9:58 PM Toby Hocking wrote: >> >> > na.omit is documented as "na.omit returns the object with incomplete >> cases >> > removed." 
and "At present these will handle vectors," so I expected that >> > when it is used on a list, it should return the same thing as if we >> subset >> > via is.na; however I observed the following, >> > >> > > L <- list(NULL, NA, 0) >> > > str(L[!is.na(L)]) >> > List of 2 >> > $ : NULL >> > $ : num 0 >> > > str(na.omit(L)) >> > List of 3 >> > $ : NULL >> > $ : logi NA >> > $ : num 0 >> > >> > Should na.omit be fixed so that it returns a result that is consistent >> > with is.na? I assume that is.na is the canonical definition of what >> > should be considered a missing value in R. >> > >> >> [[alternative HTML version deleted]] >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] na.omit inconsistent with is.na on list
Some relevant information from ?is.na: the behavior for lists is documented, For is.na, elementwise the result is false unless that element is a length-one atomic vector and the single element of that vector is regarded as NA or NaN (note that any is.na method for the class of the element is ignored). Also there are other functions anyNA and is.na<- which are consistent with is.na. That is, anyNA only returns TRUE if the list has an element which is a scalar NA. And is.na<- sets list elements to logical NA to indicate missingness. On Fri, Aug 13, 2021 at 1:10 AM Hugh Parsonage wrote: > The data.frame method deliberately skips non-atomic columns before > invoking is.na(x) so I think it is fair to assume this behaviour is > intentional and assumed. > > Not so clear to me that there is a sensible answer for list columns. > (List columns seem to collide with the expectation that in each > variable every observation will be of the same type) > > Consider your list L as > > L <- list(NULL, NA, c(NA, NA)) > > Seems like every observation could have a claim to be 'missing' here. > Concretely, if a data.frame had a list column representing the lat-lon > of an observation, we might only be able to represent missing values > like c(NA, NA). > > On Fri, 13 Aug 2021 at 17:27, Iñaki Ucar wrote: > > > > On Thu, 12 Aug 2021 at 22:20, Gabriel Becker > wrote: > > > > > > Hi Toby, > > > > > > This definitely appears intentional, the first expression of > > > stats:::na.omit.default is > > > > > >if (!is.atomic(object)) > > > > > > return(object) > > > > I don't follow your point. This only means that the *default* method > > is not intended for non-atomic cases, but it doesn't mean it shouldn't > > exist a method for lists. > > > > > So it is explicitly just returning the object in non-atomic cases, > which > > > includes lists. 
I was not involved in this decision (obviously) but my > > > guess is that it is due to the fact that what constitutes an > observation > > > "being complete" in unclear in the list case. What should > > > > > > na.omit(list(5, NA, c(NA, 5))) > > > > > > return? Just the first element, or the first and the last? It seems, at > > > least to me, unclear. A small change to the documentation to to add > "atomic > > > > > is.na(list(5, NA, c(NA, 5))) > > [1] FALSE TRUE FALSE > > > > Following Toby's argument, it's clear to me: the first and the last. > > > > Iñaki > > > > > (in the sense of is.atomic returning \code{TRUE})" in front of > "vectors" > > > or similar where what types of objects are supported seems justified, > > > though, imho, as the current documentation is either ambiguous or > > > technically incorrect, depending on what we take "vector" to mean. > > > > > > Best, > > > ~G > > > > > > On Wed, Aug 11, 2021 at 10:16 PM Toby Hocking > wrote: > > > > > > > Also, the na.omit method for data.frame with list column seems to be > > > > inconsistent with is.na, > > > > > > > > > L <- list(NULL, NA, 0) > > > > > str(f <- data.frame(I(L))) > > > > 'data.frame': 3 obs. of 1 variable: > > > > $ L:List of 3 > > > > ..$ : NULL > > > > ..$ : logi NA > > > > ..$ : num 0 > > > > ..- attr(*, "class")= chr "AsIs" > > > > > is.na(f) > > > > L > > > > [1,] FALSE > > > > [2,] TRUE > > > > [3,] FALSE > > > > > na.omit(f) > > > >L > > > > 1 > > > > 2 NA > > > > 3 0 > > > > > > > > On Wed, Aug 11, 2021 at 9:58 PM Toby Hocking > wrote: > > > > > > > > > na.omit is documented as "na.omit returns the object with > incomplete > > > > cases > > > > > removed." 
and "At present these will handle vectors," so I > expected that > > > > > when it is used on a list, it should return the same thing as if we > > > > subset > > > > > via is.na; however I observed the following, > > > > > > > > > > > L <- list(NULL, NA, 0) > > > > > > str(L[!is.na(L)]) > > > > > List of 2 > > > > > $ : NULL > > > > > $ : num 0 > >
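The documented is.na rule quoted earlier in the thread, and the anyNA consistency claim, can be checked directly (expected results follow from the ?is.na documentation quoted above):

```r
# Hugh's example list: one NULL, one scalar NA, one length-two NA vector.
L <- list(NULL, NA, c(NA, NA))

# Only the length-one element whose single value is NA is "missing"
# under is.na()'s documented rule for lists:
is.na(L)   # FALSE TRUE FALSE

# anyNA is consistent: TRUE because the list has an element that is a
# scalar NA (the length-two c(NA, NA) element alone would not count).
anyNA(L)   # TRUE
```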
Re: [Rd] na.omit inconsistent with is.na on list
To clarify, ?na.omit docs say that 'na.omit' returns the object with incomplete cases removed. If we take is.na to be the definition of "incomplete cases" then a list element with scalar NA is incomplete. About the data.frame method, in my opinion it is highly confusing/inconsistent for na.omit to keep rows with incomplete cases in list columns, but not in columns which are atomic vectors, > (f.num <- data.frame(num=c(1,NA,2))) num 1 1 2 NA 3 2 > is.na(f.num) num [1,] FALSE [2,] TRUE [3,] FALSE > na.omit(f.num) num 1 1 3 2 > (f.list <- data.frame(list=I(list(1,NA,2)))) list 1 1 2 NA 3 2 > is.na(f.list) list [1,] FALSE [2,] TRUE [3,] FALSE > na.omit(f.list) list 1 1 2 NA 3 2 On Sat, Aug 14, 2021 at 5:15 PM Gabriel Becker wrote: > I understand what is.na does, the issue I have is that its task is not > equivalent to the conceptual task na.omit is doing, in my opinion, as > illustrated by what the data.frame method does. > > Thus what i was getting at above about it not being clear that lst[is.na(lst)] > being the correct thing for na.omit to do > > ~G > > ~G > > On Sat, Aug 14, 2021, 1:49 PM Toby Hocking wrote: > >> Some relevant information from ?is.na: the behavior for lists is >> documented, >> >> For is.na, elementwise the result is false unless that element >> is a length-one atomic vector and the single element of that >> vector is regarded as NA or NaN (note that any is.na method >> for the class of the element is ignored). >> >> Also there are other functions anyNA and is.na<- which are consistent >> with >> is.na. That is, anyNA only returns TRUE if the list has an element which >> is >> a scalar NA. And is.na<- sets list elements to logical NA to indicate >> missingness. >> >> On Fri, Aug 13, 2021 at 1:10 AM Hugh Parsonage >> wrote: >> >> > The data.frame method deliberately skips non-atomic columns before >> > invoking is.na(x) so I think it is fair to assume this behaviour is >> > intentional and assumed. 
>> > >> > Not so clear to me that there is a sensible answer for list columns. >> > (List columns seem to collide with the expectation that in each >> > variable every observation will be of the same type) >> > >> > Consider your list L as >> > >> > L <- list(NULL, NA, c(NA, NA)) >> > >> > Seems like every observation could have a claim to be 'missing' here. >> > Concretely, if a data.frame had a list column representing the lat-lon >> > of an observation, we might only be able to represent missing values >> > like c(NA, NA). >> > >> > On Fri, 13 Aug 2021 at 17:27, Iñaki Ucar >> wrote: >> > > >> > > On Thu, 12 Aug 2021 at 22:20, Gabriel Becker >> > wrote: >> > > > >> > > > Hi Toby, >> > > > >> > > > This definitely appears intentional, the first expression of >> > > > stats:::na.omit.default is >> > > > >> > > >if (!is.atomic(object)) >> > > > >> > > > return(object) >> > > >> > > I don't follow your point. This only means that the *default* method >> > > is not intended for non-atomic cases, but it doesn't mean it shouldn't >> > > exist a method for lists. >> > > >> > > > So it is explicitly just returning the object in non-atomic cases, >> > which >> > > > includes lists. I was not involved in this decision (obviously) but >> my >> > > > guess is that it is due to the fact that what constitutes an >> > observation >> > > > "being complete" in unclear in the list case. What should >> > > > >> > > > na.omit(list(5, NA, c(NA, 5))) >> > > > >> > > > return? Just the first element, or the first and the last? It >> seems, at >> > > > least to me, unclear. A small change to the documentation to to add >> > "atomic >> > > >> > > > is.na(list(5, NA, c(NA, 5))) >> > > [1] FALSE TRUE FALSE >> > > >> > > Following Toby's argument, it's clear to me: the first and the last. >> > > >> > > Iñaki >> > > >> > > > (in the sense of is.atomic returning \code{TRUE})" in front of >> > "vectors" >> > > > or similar where what types of objects are suppo
Re: [Rd] Problem with accessibility in R 4.2.0 and 4.2.1.
Another option is to use https://emacspeak.sourceforge.net/ (version of emacs editor/ide which can speak letters/words/lines -- has a blind maintainer) with https://ess.r-project.org/ (interface for editing and running R code from within emacs) On Thu, Sep 22, 2022 at 9:42 AM Duncan Murdoch wrote: > On 22/09/2022 9:48 a.m., Andrew Hart via R-devel wrote: > > Hi. I'm having an issue with R 4.2.1 on Windows but I'm not sure if this > > is the right place to ask about it. If it's not, I'm hoping someone can > > point me in the right direction. > > > > I'm blind and have been using R for about 11 years now. The base build > > available on CRAN is quite accessible and works pretty well with > > screen-reading software such as JAWS for Windows and NVDA. R-studio is > > not accessible which appears to have something to do with the version of > > QT it uses, but that's not relevant as I don't use it. > > I believe RStudio is in the process of moving away from QT to Electron. > I don't know when the non-QT version will be released (if not > already), but you might want to investigate that if Rgui doesn't work out. > > Duncan Murdoch > > > > Recently I installed R 4.2.1 (I tend to upgrade two or three times a > > year and this time I was jumping from R 4.1.2 to 4.2.1). > > However, I've encountered a serious problem which makes the latest > > version more or less unusable for doing any kind of serious work. > > The issue is that the screen-reading software is unable to locate the R > > cursor and behaves as though the cursor is near the top left of the R > > application window. Practically, this means I can't tell what characters > > I'm passing over when cursoring left and right, nor can I hear what > > character is being deleted when the backspace is pressed. Most > > importantly, I can't tell where the insertion point is. This is a major > > regression in the ability to work with and edit the command line in the > > R console. 
There are ways of actually viewing the command line but the > > way I work is frequently calling up a previous command and making a > > change so as to not have to type the whole command again. > > > > I went and installed R 4.1.3 and R 4.2.0 in an attempt to find out > > exactly when things went awry and the issue first appeared in R 4.2.0. > > Looking through the release notes, the only things mentioned that seem > > likely to be relevant are the following: > > > > • R uses a new 64-bit Tcl/Tk bundle. The previous 32-bit/64-bit bundle > > had a different layout and can no longer be used. > > > > and > > > > • R uses UTF-8 as the native encoding on recent Windows systems (at > > least Windows 10 version 1903, Windows Server 2022 or Windows Server > > 1903). As a part > > of this change, R uses UCRT as the C runtime. UCRT should be installed > > manually on systems older than Windows 10 or Windows Server 2016 before > > installing > > R. > > > > I can't really see how changing to UTF-8 as the native encoding would > > produce the behaviour I'm seeing, so I am guessing that the change in > > Tcl/Tk might be the culprit. > > > > I'm hoping that someone will be able to help shed some light on what's > > going on here. > > > > Thanks a lot, > > Andrew. > > > > __ > > R-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]]
Re: [Rd] `dendrapply` Enhancements
Hi Aidan, I think you are on the right email list. I'm not R-core, but this looks like an interesting/meaningful/significant contribution to base R. I'm not sure what the original dendrapply looks like in terms of code style (variable names/white space formatting/etc) but in my experience it is important that your code contribution makes minimal changes in that area. Did you hear about the R project sprint 2023? https://contributor.r-project.org/r-project-sprint-2023/ Your work falls into the "new developments" category so I think you could apply for that funding to participate. Toby On Fri, Feb 24, 2023 at 3:47 AM Lakshman, Aidan H wrote: > Hi everyone, > > My apologies if this isn’t the right place to submit this—I’m new to the > R-devel community and still figuring out what is where. > > If people want to skip my writeup and just look at the code, I’ve made a > repository for it here: > https://github.com/ahl27/new_dendrapply/tree/master. I’m not quite sure > how to integrate it into a fork of R-devel; the package structure is > different from what I’m used to. > > I had written a slightly improved version of dendrapply for one of my > research projects, and my advisor encouraged me to submit it to the R > project. It took me longer than I expected, but I’ve finally gotten my > implementation to be a drop-in replacement for `stats::dendrapply`. The man > page for `stats::dendrapply` says “The implementation is somewhat > experimental and suggestions for enhancements (or nice examples of usage) > are very welcome,” so I figured this had the potential to be a worthwhile > contribution. I wanted to send it out to R-devel to see if this was > something worth pursuing as an enhancement to R. > > The implementation I have is based in C, which I understand implies an > increased burden of maintenance over pure R code. 
However, it does come > with the following benefits: > > - Completely eliminates recursion, so no memory overhead from function > calls or possibility of stack overflows (this was a major issue reported on > some of the functions in one of our Bioconductor packages that previously > used `dendrapply`). > - Modest runtime improvement, around 2x on my computer (2021 MBP, 32GB > RAM). I’m relatively confident this could be optimized more. > - Seemingly significant reduction in memory usage, still working on a > robust benchmark. Suggestions for the best way to do that are welcome. > - Support for applying functions with an inorder traversal (as in > `stats::dendrapply`) as well as using a postorder traversal. > > This implementation was tested manually as well as running all the unit > tests in `dendextend`, which comprises a lot of applications of > `dendrapply`. > > The postorder traversal would be a significant new functionality for > dendrapply, as it would allow functions that use the child nodes to > execute correctly. A toy example of this is something like: > ``` > exFunc <- function(x){ > attr(x, 'newA') <- 'a' > if(is.null(attr(x, 'leaf'))){ > cat(attr(x[[1]], 'newA'), attr(x[[2]], 'newA')) > cat('\n') > } > x > } > > dendrapply(dend, exFunc) > ``` > > With the current version of dendrapply, this prints nothing, but the > postorder traversal version will print ‘a’ twice for each internal branch. > If this would be a worthwhile addition, I can refactor the code for brevity > and add a `how=c("in.order", "post.order")`, with the default value > “in.order” to maintain backwards compatibility. A preorder traversal > version should also be possible, I just haven’t gotten to it yet. > > I think the runtime could be optimized more as well. > > Thank you in advance for looking at my code and offering feedback; I’m > excited at the possibility of helping contribute to the R project! 
I’m > happy to discuss more either here, on GitHub, or on the R Contributors > Slack. > > Sincerely, > Aidan Lakshman > > --- > Aidan Lakshman (he/him)<https://www.ahl27.com/> > Doctoral Candidate, Wright Lab<https://www.wrightlabscience.com/> > University of Pittsburgh School of Medicine > Department of Biomedical Informatics > ah...@pitt.edu > (724) 612-9940 > > > [[alternative HTML version deleted]] > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]]
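For readers wondering how recursion elimination works in principle, here is a small illustrative R sketch (not Aidan's C implementation, and `postorder_apply` is a hypothetical name) of a postorder traversal driven by an explicit stack — the general technique that avoids deep call stacks on large dendrograms:

```r
## Illustration only: an explicit-stack postorder walk over a nested list,
## visiting children before their parents (the "post.order" idea above).
postorder_apply <- function(tree, FUN) {
  stack <- list(list(node = tree, expanded = FALSE))
  result <- list()
  while (length(stack)) {
    top <- stack[[length(stack)]]
    stack[[length(stack)]] <- NULL            # pop
    if (top$expanded || !is.list(top$node)) {
      result[[length(result) + 1L]] <- FUN(top$node)
    } else {
      ## revisit this node only after all of its children have been popped
      stack[[length(stack) + 1L]] <- list(node = top$node, expanded = TRUE)
      for (child in rev(top$node))
        stack[[length(stack) + 1L]] <- list(node = child, expanded = FALSE)
    }
  }
  result
}
## leaves 1 and 2 are visited before the branch that contains them
postorder_apply(list(list(1, 2), 3),
                function(x) if (is.list(x)) "branch" else x)
```

Because the stack lives on the heap rather than the C call stack, depth is limited only by available memory, not by the stack-size limit that makes recursive traversals overflow.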
[Rd] read.csv quadratic time in number of columns
Dear R-devel, A number of people have observed anecdotally that read.csv is slow for large numbers of columns, for example: https://stackoverflow.com/questions/7327851/read-csv-is-extremely-slow-in-reading-csv-files-with-large-numbers-of-columns I did a systematic comparison of read.csv with similar functions, and observed that read.csv is quadratic time (N^2) in the number of columns N, whereas the others are linear (N). Can read.csv be improved to use a linear time algorithm, so it can handle CSV files with larger numbers of columns? For more details including figures and session info, please see https://github.com/tdhock/atime/issues/8 Sincerely, Toby Dylan Hocking
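A rough sketch of the kind of measurement behind this report (column counts chosen arbitrarily; only the growth rate matters): with a linear-time parser, doubling N should roughly double the elapsed time, while quadratic behaviour roughly quadruples it.

```r
## Time read.csv on files that differ only in the number of columns N.
time_read_csv <- function(N, nrow = 2) {
  f <- tempfile(fileext = ".csv")
  on.exit(unlink(f))
  write.csv(as.data.frame(matrix(1L, nrow, N)), f, row.names = FALSE)
  system.time(read.csv(f))[["elapsed"]]
}
## compare successive ratios: ~2x suggests O(N), ~4x suggests O(N^2)
sapply(c(2500, 5000, 10000), time_read_csv)
```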
[Rd] write.csv performance improvements?
Dear R-devel, I did a systematic comparison of write.csv with similar functions, and observed two asymptotic inefficiencies that could be improved. 1. write.csv is quadratic time (N^2) in the number of columns N. Can write.csv be improved to use a linear time algorithm, so it can handle CSV files with larger numbers of columns? For more details including figures and session info, please see https://github.com/tdhock/atime/issues/9 2. write.csv uses memory that is linear in the number of rows, whereas similar R functions for writing CSV use only constant memory. This is a less important issue to fix, because linear memory is used anyway to store the data in R. But since the other functions use constant memory, could write.csv also? Is there some copying happening that could be avoided? (this memory measurement uses bench::mark, which in turn uses utils::Rprofmem) https://github.com/tdhock/atime/issues/10 Sincerely, Toby Dylan Hocking
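The memory measurement in point 2 can be reproduced in outline with utils::Rprofmem (which is what bench::mark uses underneath). Note this sketch assumes an R built with memory profiling enabled, and the object sizes are arbitrary.

```r
## Log every allocation made while write.csv runs.
df <- data.frame(x = runif(1e5), y = runif(1e5))
out <- tempfile(fileext = ".csv")
log <- tempfile()
Rprofmem(log)
write.csv(df, out, row.names = FALSE)
Rprofmem(NULL)
## each log line records one allocation; totals much larger than
## object.size(df) would indicate copying that might be avoidable
length(readLines(log))
```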
Re: [Rd] Bug in PCRE interface code
BTW this is documented here http://pcre.org/current/doc/html/pcre2api.html#infoaboutpattern with a helpful example, copied below. As a simple example of the name/number table, consider the following pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white space - including newlines - is ignored): (?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) ) There are four named capture groups, so the table has four entries, and each entry in the table is eight bytes long. The table is as follows, with non-printing bytes shown in hexadecimal, and undefined bytes shown as ??:

00 01 d  a  t  e  00 ??
00 05 d  a  y  00 ?? ??
00 04 m  o  n  t  h  00
00 02 y  e  a  r  00 ??

On Mon, Sep 4, 2023 at 3:02 AM Duncan Murdoch wrote: > > This Stackoverflow question https://stackoverflow.com/q/77036362 turned > up a bug in the R PCRE interface. > > The example (currently in an edit to the original question) tried to use > named capture with more than 127 named groups. Here's the code: > > append_unique_id <- function(x) { >for (i in seq_along(x)) { > x[i] <- paste0("<", paste(sample(letters, 10), collapse = ""), ">", > x[i]) >} >x > } > > list_regexes <- sample(letters, 128, TRUE) # <<< change this to > # 127 and it works > regex2 <- append_unique_id(list_regexes) > regex2 <- paste0("(?", regex2, ")") > regex2 <- paste(regex2, collapse = "|") > > out <- gregexpr(regex2, "Cyprus", perl = TRUE, ignore.case = TRUE) > #> Error in gregexpr(regex2, "Cyprus", perl = TRUE, ignore.case = TRUE): > attempt to set index -129/128 in SET_STRING_ELT > > I think the bug is in R, here: > https://github.com/wch/r-source/blob/57d15d68235dd9bcfaa51fce83aaa71163a020e1/src/main/grep.c#L3079 > > This is the line > > int capture_num = (entry[0]<<8) + entry[1] - 1; > > where entry is declared as a pointer to a char. What this is doing is > extracting a 16 bit number from the first two bytes of a character > string holding the name of the capture group. 
Since char is a signed > type, the conversion of bytes to integer gets messed up and the value > comes out wrong. > > Duncan Murdoch > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] FR: valid_regex() to test string validity as a regular expression
Hi Michael, it sounds like you don't want to use a CRAN package for this, but you may try re2, see below. > grepl("(invalid","subject",perl=TRUE) Error in grepl("(invalid", "subject", perl = TRUE) : invalid regular expression '(invalid' In addition: Warning message: In grepl("(invalid", "subject", perl = TRUE) : PCRE pattern compilation error 'missing closing parenthesis' at '' > grepl("(invalid","subject",perl=FALSE) Error in grepl("(invalid", "subject", perl = FALSE) : invalid regular expression '(invalid', reason 'Missing ')'' In addition: Warning message: In grepl("(invalid", "subject", perl = FALSE) : TRE pattern compilation error 'Missing ')'' > re2::re2_regexp("(invalid") Error: missing ): (invalid On Tue, Oct 10, 2023 at 7:57 AM Michael Chirico via R-devel wrote: > > > Grepping an empty string might work in many cases... > > That's precisely why a base R offering is important, as a surer way of > validating in all cases. To be clear I am trying to directly access the > results of tre_regcomp(). > > > it is probably more portable to simply be prepared to propagate such > errors from the actual use on real inputs > > That works best in self-contained calls -- foo(re) and we execute re inside > foo(). > > But the specific context where I found myself looking for a regex validator > is more complicated (https://github.com/r-lib/lintr/pull/2225). User > supplies a regular expression in a configuration file, only "later" is it > actually supplied to grepl(). > > Till now, we've done your suggestion -- just surface the regex error at run > time. But our goal is to make it friendlier and fail earlier at "compile > time" as the config is loaded, "long" before any regex is actually executed. > > At a bare minimum this is a good place to return a classed warning (say > invalid_regex_warning) to allow finer control than tryCatch(condition=). 
> > On Mon, Oct 9, 2023, 11:30 PM Tomas Kalibera > wrote: > > > > > On 10/10/23 01:57, Michael Chirico via R-devel wrote: > > > > It will be useful to package authors trying to validate input which is > > supposed to be a valid regular expression. > > > > As near as I can tell, the only way we can do so now is to run any > > regex function and check for the warning and/or condition to bubble > > up: > > > > valid_regex <- function(str) { > > stopifnot(is.character(str), length(str) == 1L) > > !inherits(tryCatch(grepl(str, ""), condition = identity), "condition") > > } > > > > That's pretty hefty/inscrutable for such a simple validation. I see a > > variety of similar approaches in CRAN packages [1], all slightly > > different. It would be good for R to expose a "canonical" way to run > > this validation. > > > > At root, the problem is that R does not expose the regex compilation > > routines like 'tre_regcomp', so from the R side we have to resort to > > hacky approaches. > > > > Hi Michael, > > > > I don't think you need compilation functions for that. If a regular > > expression is found invalid by a specific third party library R uses, the > > library should return and error to R and R should return an error to you, > > and you should probably propagate that to your users. Grepping an empty > > string might work in many cases as a test, but it is probably more portable > > to simply be prepared to propagate such errors from the actual use on real > > inputs. In theory, there could be some optimization for a particular case, > > the checking may not be the same - but that is the same say for compilation > > and checking. > > > > Things get slightly complicated by encoding/useBytes modes > > (tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb, > > tre_regncompb; all in tre.h), but all are already present in other > > regex routines, so this is doable. > > > > Re encodings, simply R strings should be valid in their encoding. 
This is > > not just for regular expressions but also for anything else. You shouldn't > > assume that R can handle invalid strings in any reasonable way. Definitely > > you shouldn't try adding invalid strings in tests - behavior with invalid > > strings is unspecified. To test whether a string is valid, there is > > validEnc() (or validUTF8()). But, again, it is probably safest to propagate > > errors from the regular expression R functions (in case the checks differ, > > particularly for non-UTF-8), also, duplicating the encoding checks can be a > > non-trivial overhead. > > > > If there was a strong need to have an automated way to somehow classify > > specifically errors from the regex libraries, perhaps R could attach some > > classes to them when the library tells. > > > > Tomas > > > > Exposing a function to compile regular expressions is common in other > > languages, e.g. Go [2], Python [3], JavaScript [4]. > > > > [1] > > https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code > > [2] https://pkg.go.dev/regexp#Compile > > [3] h
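For reference, the tryCatch-based work-around under discussion, written out with example calls (the perl argument is an addition here, not from the thread):

```r
## Validate a regex at "compile time" by attempting a match and turning
## any error or warning condition into FALSE.
valid_regex <- function(str, perl = FALSE) {
  stopifnot(is.character(str), length(str) == 1L, !is.na(str))
  !inherits(tryCatch(grepl(str, "", perl = perl), condition = identity),
            "condition")
}
valid_regex("[a-z]+")   # TRUE
valid_regex("(invalid") # FALSE: missing closing parenthesis
```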
Re: [Rd] Partial matching performance in data frame rownames using [
Hi Hilmar and Ivan, I have used your code examples to write a blog post about this topic, which has figures that show the asymptotic time complexity of the various approaches, https://tdhock.github.io/blog/2023/df-partial-match/ The asymptotic complexity of partial matching appears to be quadratic O(N^2) whereas the other approaches are asymptotically faster: linear O(N) or log-linear O(N log N). I think that accepting Ivan's pmatch.rows patch would add unnecessary complexity to base R, since base R already provides an efficient work-around, d1[match(q1,rownames(d1)),] I do think the CheckUserInterrupt patch is a good idea, though. Best, Toby On Sat, Dec 16, 2023 at 2:49 AM Ivan Krylov wrote: > > On Wed, 13 Dec 2023 09:04:18 +0100 > Hilmar Berger via R-devel wrote: > > > Still, I feel that default partial matching cripples the functionality > > of data.frame for larger tables. > > Changing the default now would require a long deprecation cycle to give > everyone who uses `[.data.frame` and relies on partial matching > (whether they know it or not) enough time to adjust. > > Still, adding an argument feels like a small change: edit > https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and > add a condition before calling pmatch(). Adjust the warning() for named > arguments. 
Don't forget to document the new argument in the man page at > https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd > > Index: src/library/base/R/dataframe.R > === > --- src/library/base/R/dataframe.R (revision 85664) > +++ src/library/base/R/dataframe.R (working copy) > @@ -591,14 +591,14 @@ > ### These are a little less general than S > > `[.data.frame` <- > -function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1) > +function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, > pmatch.rows = TRUE) > { > mdrop <- missing(drop) > Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified > has.j <- !missing(j) > -if(!all(names(sys.call()) %in% c("", "drop")) > +if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows")) > && !isS4(x)) # at least don't warn for callNextMethod! > -warning("named arguments other than 'drop' are discouraged") > +warning("named arguments other than 'drop', 'pmatch.rows' are > discouraged") > > if(Narg < 3L) { # list-like indexing or matrix indexing > if(!mdrop) warning("'drop' argument will be ignored") > @@ -679,7 +679,11 @@ > ## for consistency with [, ] > if(is.character(i)) { > rows <- attr(xx, "row.names") > -i <- pmatch(i, rows, duplicates.ok = TRUE) > +i <- if (pmatch.rows) { > +pmatch(i, rows, duplicates.ok = TRUE) > +} else { > +match(i, rows) > +} > } > ## need to figure which col was selected: > ## cannot use .subset2 directly as that may > @@ -699,7 +703,11 @@ > # as this can be expensive. 
> if(is.character(i)) { > rows <- attr(xx, "row.names") > -i <- pmatch(i, rows, duplicates.ok = TRUE) > +i <- if (pmatch.rows) { > +pmatch(i, rows, duplicates.ok = TRUE) > +} else { > +match(i, rows) > +} > } > for(j in seq_along(x)) { > xj <- xx[[ sxx[j] ]] > Index: src/library/base/man/Extract.data.frame.Rd > === > --- src/library/base/man/Extract.data.frame.Rd (revision 85664) > +++ src/library/base/man/Extract.data.frame.Rd (working copy) > @@ -15,7 +15,7 @@ >Extract or replace subsets of data frames. > } > \usage{ > -\method{[}{data.frame}(x, i, j, drop = ) > +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE) > \method{[}{data.frame}(x, i, j) <- value > \method{[[}{data.frame}(x, ..., exact = TRUE) > \method{[[}{data.frame}(x, i, j) <- value > @@ -45,6 +45,9 @@ > column is selected.} > > \item{exact}{logical: see \code{\link{[}}, and applies to column names.} > + > + \item{pmatch.rows}{logical: whether to perform partial matching on > + row names in case \code{i} is a character vector.} > } > \details{ >Data frames can be indexed in several modes. When \code{[} and > > > system.time({r <- d1[q2,, drop=
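The exact-matching work-around mentioned above can be seen on a small example (data invented for illustration):

```r
## match() does exact lookups, avoiding the quadratic pmatch() scan that
## `[.data.frame` performs on character row indices.
d1 <- data.frame(value = 1:3, row.names = c("alpha", "beta", "gamma"))
d1["alp", "value"]                        # partial match: returns 1
d1[match("alp", rownames(d1)), "value"]   # exact match: NA, no such row
d1[match("alpha", rownames(d1)), "value"] # exact match: returns 1
```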
Re: [Rd] [External] readChar() could read the whole file by default?
My opinion is that the proposed feature would be greatly appreciated by users. I had always wondered if I was the only one doing paste(readLines(f), collapse="\n") all the time. It would be great to have the proposed, more straightforward way to read the whole file as a string: readChar("my_file.txt", -1) or even better readChar("my_file.txt") Thanks for your detailed analysis Michael. On Fri, Jan 26, 2024 at 2:05 PM luke-tierney--- via R-devel wrote: > > On Fri, 26 Jan 2024, Michael Chirico wrote: > > > I am curious why readLines() has a default (n=-1L) to read the full > > file while readChar() has no default for nchars= (i.e., readChar(file) > > is an error). Is there a technical reason for this? > > > > I often[1] see code like paste(readLines(f), collapse="\n") which > > would be better served by readChar(), especially given issues with the > > global string cache I've come across[2]. But lacking the default, the > > replacement might come across less clean. > > The string cache seems like a very dark pink herring to me. The fact > that the lines are allocated on the heap might create an issue; the > cache isn't likely to add much to that. In any case I would need to > see a realistic example to convince me this is worth addressing on > performance grounds. > > I don't see any reason in principle not to have readChar and readBin > read the entire file if n = -1 (others might) but someone would need > to write a patch to implement that. > > Best, > > luke > > > For my own purposes the incantation readChar(file, file.size(file)) is > > ubiquitous. Taking CRAN code[3] as a sample[4], 41% of readChar() > > calls use either readChar(f, file.info(f)$size) or readChar(f, > > file.size(f))[5]. > > > > Thanks for the consideration and feedback, > > Mike C > > > > [1] e.g. 
a quick search shows O(100) usages in CRAN packages: > > https://github.com/search?q=org%3Acran+%2Fpaste%5B%28%5D%5Cs*readLines%5B%28%5D.*%5B%29%5D%2C%5Cs*collapse%5Cs*%3D%5Cs*%5B%27%22%5D%5B%5C%5C%5D%2F+lang%3AR&type=code, > > and O(1000) usages generally on GitHub: > > https://github.com/search?q=lang%3AR+%2Fpaste%5B%28%5D%5Cs*readLines%5B%28%5D.*%5B%29%5D%2C%5Cs*collapse%5Cs*%3D%5Cs*%5B%27%22%5D%5B%5C%5C%5D%2F+lang%3AR&type=code > > [2] AIUI the readLines() approach "pollutes" the global string cache > > with potentially 1000s/1s of strings for each line, only to get > > them gc()'d after combining everything with paste(collapse="\n") > > [3] The mirror on GitHub, which includes archived packages as well as > > current (well, eventually-consistent) versions. > > [4] Note that usage in packages is likely not representative of usage > > in scripts, e.g. I often saw readChar(f, 1), or eol-finders like > > readChar(f, 500) + grep("[\n\r]"), which makes more sense to me as > > something to find in package internals than in analysis scripts. FWIW > > I searched an internal codebase (scripts and packages) and found 70% > > of usages reading the full file. > > [5] repro: > > https://gist.github.com/MichaelChirico/247ea9500460dca239f031e74bdcf76b > > requires GitHub PAT in env GITHUB_PAT for API permissions. > > > > __ > > R-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > > > > -- > Luke Tierney > Ralph E. Wareham Professor of Mathematical Sciences > University of Iowa Phone: 319-335-3386 > Department of Statistics andFax: 319-335-3017 > Actuarial Science > 241 Schaeffer Hall email: luke-tier...@uiowa.edu > Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
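The two idioms being compared can be put side by side (a sketch using a temporary file):

```r
## Slurp a whole file as one string: the readChar() incantation the thread
## wants a default for, vs. the readLines()+paste() pattern it replaces.
f <- tempfile()
writeLines(c("line one", "line two"), f)
via_readChar  <- readChar(f, file.size(f))
via_readLines <- paste(readLines(f), collapse = "\n")
## readChar() keeps the trailing newline that writeLines() wrote
identical(via_readChar, paste0(via_readLines, "\n"))  # TRUE
unlink(f)
```

Under the proposal, `readChar(f, -1)` or simply `readChar(f)` would replace the `file.size(f)` incantation.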
Re: [Rd] strcapture performance when perl = TRUE
Thanks Tim. I confirm the proposed solution is over 10x faster, see https://github.com/tdhock/atime/issues/29#issuecomment-1943037753 for figure and source code. On Mon, Jan 29, 2024 at 7:05 AM Tim Taylor wrote: > > I wanted to raise the possibility of improving strcapture performance in > cases where perl = TRUE. I believe we can do this in a non-breaking way > by calling regexpr instead of regexec (conditionally when perl = TRUE). > To illustrate this I've put together a 'proof of concept' function called > strcapture2 that utilises output from regexpr directly (following a very > nice substring approach that I've seen implemented by Toby Hocking > in the nc package - nc::capture_first_vec). > > strcapture2 <- function(pattern, x, proto, perl = FALSE, useBytes = FALSE) { > if (isTRUE(perl)) { > m <- regexpr(pattern = pattern, text = x, perl = TRUE, useBytes = > useBytes) > nomatch <- is.na(m) | m == -1L > ntokens <- length(proto) > if (any(!nomatch)) { > start <- attr(m, "capture.start") > length <- attr(m, "capture.length") > end <- start + length - 1L > end[nomatch, ] <- start[nomatch, ] <- NA > res <- substring(x, start, end) > out <- matrix(res, length(m)) > if (ncol(out) != ntokens) { > stop("The number of captures in 'pattern' != 'length(proto)'") > } > } else { > out <- matrix(NA_character_, length(m), ntokens) > } > utils:::conformToProto(out,proto) > } else { > strcapture(pattern,x,proto,perl,useBytes) > } > } > > Now comparing with strcapture we can expand the named capture example > from the grep documentation: > > notables <- c( > " Ben Franklin and Jefferson Davis", > "\tMillard Fillmore", > "Bob", > NA_character_ > ) > > regex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)" > proto = data.frame("", "") > > (strcapture(regex, notables, proto, perl = TRUE)) > X..X...1 > 1 Ben Franklin > 2 Millard Fillmore > 3 > 4 > > (strcapture2(regex, notables, proto, perl = TRUE)) > X..X...1 > 1 Ben Franklin > 2 
Millard Fillmore > 3 > 4 > > Now to compare timings over multiple reps: > > lengths <- sort(outer(c(1, 2, 5), 10^(1:4))) > reps <- 20 > > time_strcapture <- function(text, length, regex, proto, reps) { > text <- rep_len(text, length) > str <- system.time(for (i in seq_len(reps)) strcapture(regex, text, > proto, perl = TRUE)) > str2 <- system.time(for (i in seq_len(reps)) strcapture2(regex, text, > proto, perl = TRUE)) > c(strcapture = str[["user.self"]], strcapture2 = str2[["user.self"]]) > } > timings <- sapply( > lengths, > time_strcapture, > text = notables, regex = regex, reps = reps, proto = proto > ) > cbind(lengths, t(timings)) > lengths strcapture strcapture2 > [1,] 10 0.005 0.003 > [2,] 20 0.005 0.002 > [3,] 50 0.008 0.003 > [4,] 100 0.012 0.002 > [5,] 200 0.021 0.003 > [6,] 500 0.051 0.003 > [7,] 1000 0.097 0.004 > [8,] 2000 0.171 0.005 > [9,] 5000 0.517 0.011 > [10,] 10000 1.203 0.018 > [11,] 20000 2.563 0.037 > [12,] 50000 7.276 0.090 > > I've attached a plot of these timings in case helpful. > > I appreciate that changing strcapture in this way does make it more > complicated but I think the performance improvements make it worth > considering. Note that I've not thoroughly tested the above implementation > as I wanted to get feedback from the list before proceeding further. > > Hope all this makes sense. Cheers > > Tim > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
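The attributes that strcapture2 relies on come from regexpr(perl = TRUE) with named capture groups; a quick look at them:

```r
## With perl = TRUE and named groups, regexpr() attaches capture positions
## as attributes, enabling the single vectorised substring() call above.
m <- regexpr("(?<first>\\w+) (?<last>\\w+)", "Ben Franklin", perl = TRUE)
attr(m, "capture.names")   # "first" "last"
attr(m, "capture.start")   # start position of each group
attr(m, "capture.length")  # length of each group
```

This is what lets the perl = TRUE path avoid regexec(), which re-runs the match per string.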
Re: [Rd] Minor inconsistencies in tools:::funAPI()
Hi Ivan,

Can you please clarify what input files should be used with your proposed function? I tried a few files in r-svn/src/include and one of them gave me an error.

> getdecl("~/R/r-svn/src/include/R.h")
[1] "R_FlushConsole"  "R_ProcessEvents" "R_WaitEvent"
> getdecl("~/R/r-svn/src/include/Rdefines.h")
Error in regmatches(lines, gregexec(rx, lines, perl = TRUE))[[1]][3, ] :
  incorrect number of dimensions

On Mon, Jul 15, 2024 at 10:32 AM Ivan Krylov via R-devel wrote:
>
> Hi all,
>
> I've noticed some peculiarities in the tools:::funAPI output that
> complicate its programmatic use a bit.
>
> - Is it for remapped symbol names (with Rf_ or the Fortran
>   underscore), or for unmapped names (without Rf_ or the underscore)?
>
> I see that the functions marked in WRE are almost all (except
> Rf_installChar and Rf_installTrChar) unmapped. This makes a lot of
> sense because some of those interfaces (e.g. CONS(), CHAR(),
> NOT_SHARED()) are C preprocessor macros, not functions. I also see that
> installTrChar is not explicitly marked.
>
> Are we allowed to call tools:::unmap(tools:::funAPI()$name) and
> consider the return value to be the list of all unmapped APIs, despite,
> e.g., installTrChar not being explicitly marked?
>
> - Should R_PV be an @apifun if it's currently caught by checks in
>   sotools.R?
>
> - Should R_FindSymbol be commented /* Not API */ if it's marked as
>   @apifun in WRE and not caught by sotools.R? It is currently used by 8
>   CRAN packages.
>
> - The names 'select', 'delztg' from R_ext/Lapack.h are function
>   pointer arguments, not functions or type declarations. They are
>   being found because funcRegexp is written to match incomplete
>   function declarations (e.g. when they end up being split over
>   multiple lines, like in R_ext/Lapack.h), and function pointer
>   argument declarations look sufficiently similar.
>
> A relatively compact (but still brittle) way to match function
> declarations in C header files is shown at the end of this message. I
> have confirmed that compared to tools:::getFunsHdr, the only extraneous
> symbols that it finds in preprocessed headers are "R_SetWin32",
> "user_unif_rand", "user_unif_init", "user_unif_nseed",
> "user_unif_seedloc", "user_norm_rand", which are special-cased in
> tools:::getFunsHdr, and the only symbols it doesn't find are "select"
> and "delztg" in R_ext/Lapack.h, which we should not be finding.
>
> # "Bird's eye" view, gives unmapped names on non-preprocessed headers
> getdecl <- function(file, lines = readLines(file)) {
>     # have to combine to perform multi-line matches
>     lines <- paste(c(lines, ''), collapse = '\n')
>     # first eat the C comments, dotall but non-greedy match
>     lines <- gsub('(?s)/\\*.*?\\*/', '', lines, perl = TRUE)
>     # C++-style comments too, multiline not dotall
>     lines <- gsub('(?m)//.*$', '', lines, perl = TRUE)
>     # drop all preprocessor directives
>     lines <- gsub('(?m)^\\s*#.*$', '', lines, perl = TRUE)
>
>     rx <- r"{(?xs)
>         (?!typedef)(? # return type with attributes
>         (
>             # words followed by whitespace or stars
>             (?: \w+ (?:\s+ | \*)+)+
>         )
>         # function name, assumes no extra whitespace
>         (
>             \w+\(\w+\) # macro call
>             | \(\w+\)  # in parentheses
>             | \w+      # a plain name
>         )
>         # arguments: non-greedy match inside parentheses
>         \s* \( (.*?) \) \s* # using dotall here
>         # will include R_PRINTF_FORMAT(1,2 but we don't care
>         # finally terminated by semicolon
>         ;
>     }"
>
>     regmatches(lines, gregexec(rx, lines, perl = TRUE))[[1]][3, ]
> }
>
> # Preprocess then extract remapped function names like getFunsHdr
> getdecl2 <- function(file)
>     file |>
>         readLines() |>
>         grep('^\\s*#\\s*error', x = _, value = TRUE, invert = TRUE) |>
>         tools:::ccE() |>
>         getdecl(lines = _)
>
> --
> Best regards,
> Ivan
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] numerical issue with t.test
Hi!

I expected that t.test should report a very large p-value (close to 1), even when using paired=TRUE, for the data below (which are very similar). However, I observe p-value = 0.02503, which indicates a significant difference, even though there is none. Can this be fixed please? This is with R-4.4.1. For reference, below I use paired=FALSE with the same data, and I get p-value = 1 as expected.

> err1 = c(-1.6076199373862132, -1.658521185520103, -1.6549424312339873,
> -1.5887767975086149, -1.634129577540383, -1.7442711937982249)
> err2 = c(-1.6076199373862132, -1.6585211855201032, -1.6549424312339875,
> -1.5887767975086149, -1.6341295775403832, -1.7442711937982252)
> t.test(err1, err2, paired=TRUE)

        Paired t-test

data:  err1 and err2
t = 3.1623, df = 5, p-value = 0.02503
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 2.769794e-17 2.683615e-16
sample estimates:
mean difference
   1.480297e-16

> t.test(err1, err2, paired=FALSE)

        Welch Two Sample t-test

data:  err1 and err2
t = 0, df = 10, p-value = 1
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.06988771  0.06988771
sample estimates:
mean of x mean of y
-1.648044 -1.648044

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
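A note on what the paired test is actually comparing: it operates on the six componentwise differences err1 - err2, and several of those are one unit in the last place of the double representation rather than exactly zero, all with the same sign. A quick check (a sketch; the exact printed values may vary by platform) makes this visible:

```r
## The paired t-test sees these differences, not the original vectors:
d <- err1 - err2
print(d)   # a few entries are ~2.2e-16 (one ulp); none are negative
mean(d)    # ~1.48e-16, the "mean difference" reported by t.test()
```

Because the nonzero differences all share one sign, the paired test reports a small p-value for a systematic offset that is numerically negligible, while the unpaired test compares the two (equal) means directly.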
[Rd] Print output during long tests?
I am the author of R package animint which uses testthat for unit tests. This means that there is a single test file (animint/tests/testthat.R) and during R CMD check we will see the following output * checking tests ... Running ‘testthat.R’ I run these tests on Travis, which has a policy that if no output is received after 10 minutes, it will kill the check. Because animint's testthat tests take a total of over 10 minutes, Travis kills the R CMD check job before it has finished all the tests. This is a problem since we would like to run animint tests on Travis. One solution to this problem would be if R CMD check could output more lines other than just Running testthat.R. Can I give some command line switch to R CMD check or set some environment variable, so that some more verbose test output could be shown on R CMD check? [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
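One possible workaround at the package level (a sketch; the reporter name and whether its output actually reaches the Travis log during R CMD check depend on the testthat version and check setup) is to ask testthat for a reporter that prints as each test file runs, rather than staying silent until the end:

```r
## tests/testthat.R -- print per-file progress while tests run (sketch)
library(testthat)
library(animint)
test_check("animint", reporter = "summary")
```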
Re: [Rd] Could .Primitive("[") stop forcing R_Visible = TRUE?
Thanks for the detailed analysis and proposition, Ivan. The patch you are proposing to base R is https://github.com/Rdatatable/data.table/issues/6566#issuecomment-2428912338 right?

On Thu, Oct 24, 2024 at 8:48 AM Ivan Krylov via R-devel wrote:
>
> Hello,
>
> The "[" primitive operator currently has the 'eval' flag set to 0 in
> src/main/names.c. This means that the result of subsetting, whether
> R-native or implemented by a method, will never be invisible().
>
> This is a very reasonable default: if the user goes as far as to subset
> a value, they probably want to see the result. Unfortunately, there
> also exists at least one counter-example to that: data.table's
> modification by reference using the `:=` operator from inside the `[`
> operator.
>
> If a user creates a data.table object `x` and evaluates x[,foo := bar],
> the desired outcome is to return x invisibly, both to allow chained
> updates by reference (x[,foo := bar][,bar := baz]) and to avoid
> cluttering the screen by printing the whole object after updating a few
> columns. Since .Primitive("[") forces visibility on, the data.table
> developers had to come up with their own visibility flag [1] and check
> it from inside the print() method when it looks like it originates from
> auto-printing [2]. Since the auto-printing detection works by looking
> at the call stack, this recently broke after a knitr update (but can
> be reliably repaired [3]) and doesn't work for sub-classes of
> data.table [4].
>
> Is it feasible for R to consider allowing methods for `[` to set their
> own visibility flag at this point? The change is deceptively small: set
> 'eval' to 200 in names.c and R_Visible = TRUE before returning from the
> non-method-dispatch branch in do_subset(). This results in one change
> in the saved output of R's own tests/reg-S4.R [5]. Or is the potential
> breakage for existing code too prohibitive?
>
> --
> Best regards,
> Ivan
>
> [1]
> https://github.com/Rdatatable/data.table/blob/e5b845e5cbc6be826558d11d601243240abe7a72/R/print.data.table.R#L164-L169
>
> [2]
> https://github.com/Rdatatable/data.table/blob/e5b845e5cbc6be826558d11d601243240abe7a72/R/print.data.table.R#L24-L41
>
> [3]
> https://github.com/Rdatatable/data.table/pull/6589
>
> [4]
> https://github.com/Rdatatable/data.table/issues/3029
>
> [5] A method for `[` that runs cat() used to return NULL visibly.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
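The effect described above is easy to reproduce with a toy class (a sketch; the `quiet` class and its method are made up for illustration): a `[` method that returns its result with invisible() still auto-prints, because the primitive forces visibility back on after dispatch.

```r
## A [ method that tries to return invisibly...
x <- structure(list(), class = "quiet")
"[.quiet" <- function(x, ...) invisible(x)
x[1]  # ...still auto-prints, since .Primitive("[") has eval = 0 in names.c
```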
Re: [Rd] Will the R Project be a Mentoring Organization for GSOC 2025?
Hi Greg, thanks for your interest! I plan to submit an application next week on behalf of the R Project. Our wiki with a timeline is https://github.com/rstats-gsoc/gsoc2025/wiki#status-and-timeline

If you would like to mentor, please add your project idea to https://github.com/rstats-gsoc/gsoc2025/wiki/table%20of%20proposed%20coding%20projects and, instead of using R-devel, the more appropriate list for GSOC-related info is https://groups.google.com/forum/?pli=1#!forum/gsoc-r

Thanks!
Toby

On Wed, Jan 22, 2025 at 1:19 AM Simon Urbanek wrote:
>
> Please see the GSoC schedule - for 2025 the organisation applications won't
> close until February 12 so we may not know the answer until possibly sometime
> in March (that's assuming there are proposed projects).
>
> Cheers,
> Simon
>
>
> > On Jan 22, 2025, at 6:23 AM, Greg Forkutza wrote:
> >
> > Hi there,
> >
> > I just wanted to confirm if this will be true this year.
> >
> > Best,
> >
> > Greg Forkutza
> >
> > [[alternative HTML version deleted]]
> >
> > __
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] R does not build with conda libcurl
Hi all,

I'm not sure if this is an issue with conda or R. I expected that I should be able to build R from source with a conda environment active. However, I observe that with conda 23.9.0, in an environment with the libcurl package installed, I get a link error.

The configure works fine:

(base) hoct2726@dinf-thock-02i:~/R/R-4.5.0$ ./configure --prefix=$HOME --with-cairo --with-blas --with-lapack --enable-R-shlib --with-valgrind-instrumentation=2 --enable-memory-profiling
...
R is now configured for x86_64-pc-linux-gnu

  Source directory:            .
  Installation directory:      /home/local/USHERBROOKE/hoct2726
  C compiler:                  gcc -std=gnu2x -g -O2
  Fortran fixed-form compiler: gfortran -g -O2
  Default C++ compiler:        g++ -std=gnu++17 -g -O2
  Fortran free-form compiler:  gfortran -g -O2
  Obj-C compiler:
  Interfaces supported:        X11, tcltk
  External libraries:          pcre2, readline, BLAS(generic), LAPACK(generic), curl, libdeflate
  Additional capabilities:     PNG, JPEG, TIFF, NLS, cairo, ICU
  Options enabled:             shared R library, R profiling, memory profiling, libdeflate for lazyload
  Capabilities skipped:
  Options not enabled:         shared BLAS
  Recommended packages:        yes

but I got an error from "make" (output translated from a French locale):

(base) hoct2726@dinf-thock-02i:~/R/R-4.5.0$ make
...
make[3]: Entering directory '/home/local/USHERBROOKE/hoct2726/R/R-4.5.0/src/main'
gcc -std=gnu2x -I../../src/extra -I../../src/extra/xdr -I. -I../../src/include -I../../src/include -I/home/local/USHERBROOKE/hoct2726/miniconda3/include -I/usr/local/include -I../../src/nmath -DHAVE_CONFIG_H -fopenmp -fpic -g -O2 -c Rmain.c -o Rmain.o
gcc -std=gnu2x -Wl,--export-dynamic -fopenmp -L"../../lib" -L/usr/local/lib -o R.bin Rmain.o -lR
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'ucol_setAttribute_73'
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'ucol_close_73'
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'ucol_open_73'
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'uiter_setUTF8_73'
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'ucol_getLocaleByType_73'
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'ucol_setStrength_73'
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'u_versionToString_73'
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'ucol_strcollIter_73'
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'uloc_setDefault_73'
/usr/bin/ld: ../../lib/libR.so: undefined reference to 'u_getVersion_73'
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:150: R.bin] Error 1
make[3]: Leaving directory '/home/local/USHERBROOKE/hoct2726/R/R-4.5.0/src/main'
make[2]: *** [Makefile:141: R] Error 2
make[2]: Leaving directory '/home/local/USHERBROOKE/hoct2726/R/R-4.5.0/src/main'
make[1]: *** [Makefile:28: R] Error 1
make[1]: Leaving directory '/home/local/USHERBROOKE/hoct2726/R/R-4.5.0/src'
make: *** [Makefile:61: R] Error 1

It seems that the libcurl package in conda provides the curl-config command line program, which R is using to get this flag: -I/home/local/USHERBROOKE/hoct2726/miniconda3/include (it goes into the CURL_CPPFLAGS variable in config.status). To fix the build, I did "conda remove libcurl", then "make clean", and then "configure" and "make" worked.
It would be more user-friendly if the R build could "just work" even when the user has activated a conda environment with libcurl package installed. Is this an issue that R could fix? Thanks Toby [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
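Besides removing libcurl from the environment, one workaround (a sketch, untested; the exact PATH entries depend on the system) is to make sure the system's curl-config, not conda's, is found first when configuring R:

```shell
# Either leave the conda environment entirely...
conda deactivate
# ...or prefer system tools over miniconda's for this build:
export PATH=/usr/bin:$PATH
./configure --prefix=$HOME --with-cairo --with-blas --with-lapack \
    --enable-R-shlib --with-valgrind-instrumentation=2 --enable-memory-profiling
make
```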
Re: [Rd] Bug in prettyNum
Thanks for the contribution, Mikko! For testing future patches, you can actually do it right in the web browser, thanks to Heather Turner's R Dev Container; see instructions here: https://contributor.r-project.org/r-dev-env/container_setup/

Best,
Toby

On Mon, May 26, 2025 at 6:28 PM Martin Maechler wrote:
> Thank you, Marttila and Ivan,
>
> As the original author of prettyNum() {etc ..},
> I will commit such a bug fix to R-devel (and probably port it to
> R 4.5.0 patched) quite soon
> (but not yet today).
>
> Best regards,
>
> Martin Maechler
>
> >>>>> Ivan Krylov via R-devel
> >>>>> on Fri, 23 May 2025 17:14:57 +0300 writes:
>
> > On Fri, 23 May 2025 11:47:33 +
> > Marttila Mikko via R-devel writes:
>
> >> When called with a numeric vector, the `replace.zero` argument is
> >> disregarded.
> >>
> >> > prettyNum(0, zero.print = "- ", replace.zero = TRUE)
> >> [1] "-"
> >> Warning message:
> >> In .format.zeros(x, zero.print, replace = replace.zero) :
> >> 'zero.print' is truncated to fit into formatted zeros; consider
> >> 'replace=TRUE'
>
> >> Please see below a patch which I believe would fix this.
>
> > Surprisingly, it's not enough. The 'replace' argument to .format.zeros
> > needs to be "threaded" through both the call to vapply(x, format, ...)
> > and the internal call from format.default(...) to prettyNum(...):
>
> R> options(warn = 2, error = recover)
> R> prettyNum(0, zero.print = "--", replace.zero = TRUE)
> > Error in .format.zeros(x, zero.print, replace = replace.zero) :
> >   (converted from warning) 'zero.print' is truncated to fit into formatted zeros; consider 'replace=TRUE'
>
> > Enter a frame number, or 0 to exit
>
> > 1: prettyNum(0, zero.print = "--", replace.zero = TRUE)
> > 2: vapply(x, format, "", big.mark = big.mark, big.interval = big.interval, sma
> > 3: FUN(X[[i]], ...)
> > 4: format.default(X[[i]], ...)
> > 5: prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3, na.encode, sc
> > 6: .format.zeros(x, zero.print, replace = replace.zero)
> > 7: warning("'zero.print' is truncated to fit into formatted zeros; consider 'r
> <...omitted...>
> Selection: 6
> <...>
> Browse[1]> ls.str()
> i0 :  logi TRUE
> ind0 :  int 1
> nc :  int 1
> nx :  num 0
> nz :  int 2
> replace :  logi FALSE
> warn.non.fitting :  logi TRUE
> x :  chr "0"
> zero.print :  chr "--"
>
> > Since prettyNum() accepts ... and thus ignores unknown arguments, it
> > seems to be safe to forward the ellipsis from format.default() to
> > prettyNum(). The patch survives LANGUAGE=en TZ=UTC make check-devel.
>
> > Index: src/library/base/R/format.R
> > ===
> > --- src/library/base/R/format.R (revision 88229)
> > +++ src/library/base/R/format.R (working copy)
> > @@ -73,7 +73,7 @@
> >              decimal.mark = decimal.mark, input.d.mark = decimal.mark,
> >              zero.print = zero.print, drop0trailing = drop0trailing,
> >              is.cmplx = is.complex(x),
> > -            preserve.width = if (trim) "individual" else "common"),
> > +            preserve.width = if (trim) "individual" else "common", ...),
> >     ## all others (for now):
> >     stop(gettextf("Found no format() method for class \"%s\"",
> >                   class(x)), domain = NA))
> > @@ -338,7 +338,8 @@
> >          big.mark=big.mark, big.interval=big.interval,
> >          small.mark=small.mark, small.interval=small.interval,
> >          decimal.mark=decimal.mark, zero.print=zero.print,
> > -        drop0trailing=drop0trailing, ...)
> > +        drop0trailing=drop0trailing, replace.zero=replace.zero,
> > +        ...)
> > }
> > ## be fast in trivial case, when all options have their default, or "match"
> > nMark <- big.mark == "" && small.mark == "" && (notChar || decimal.mark == input.d.mark)
>
> > --
> > Best regards,
> > Ivan
>
> > __
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

[[alternative HTML version deleted]]

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
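For reference, the behaviour the patch aims for (a sketch of the intended result; until the fix is committed, the call warns and truncates instead):

```r
## After the fix, replace.zero = TRUE should substitute the whole
## zero.print string for each formatted zero, with no truncation warning:
prettyNum(c(0, 3141.59), big.mark = ",", zero.print = "--",
          replace.zero = TRUE)
## expected, with the patch applied: "--" alongside "3,141.59"
```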
[Rd] Named capture in regexp
Dear R core developers,

One feature from Python that I have been wanting in R is the ability to capture groups in regular expressions using names. Consider the following example in R.

> notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
> name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
> (parsed <- regexpr(name.rex, notables, perl=TRUE))
[1] 3 2
attr(,"match.length")
[1] 12 16
attr(,"capture.start")
     [,1] [,2]
[1,]    3    7
[2,]    2   10
attr(,"capture.length")
     [,1] [,2]
[1,]    3    8
[2,]    7    8
attr(,"capture.names")
[1] "first" "last"
> parse.one(notables, parsed)
     first     last
[1,] "Ben"     "Franklin"
[2,] "Millard" "Fillmore"
> parse.one(notables, parsed)[,"last"]
[1] "Franklin" "Fillmore"

The advantage to this approach is that you can tag groups by name, and then use the names later in the code to extract the matched substrings. I realized this is possible by using the PCRE library which ships with R, so in the last couple days I hacked a bit in src/main/grep.c in the R source code. I managed to get named capture to work with the standard gregexpr and regexpr functions. For backwards-compatibility, my strategy was to just add more attributes to the results of these functions, as shown above. Attached is the patch and some R code for testing the new features. It works fine for me with no memory problems. However, I noticed that there is some UTF8 handling code, which I did not touch (use_UTF8 is false on my machine). I presume we will need to make some small modifications to get it to work with unicode, but I'm not sure how to do them. Would you consider integrating this patch into the R source code for future releases, so the larger R community can take advantage of this feature? If there's anything else I can do to help please let me know.
Sincerely,
Toby Dylan Hocking
http://cbio.ensmp.fr/~thocking/

Index: ../r-devel/src/main/grep.c
===
--- ../r-devel/src/main/grep.c (revision 54562)
+++ ../r-devel/src/main/grep.c (working copy)
@@ -1635,28 +1635,38 @@
 static SEXP
 gregexpr_perl(const char *pattern, const char *string,
               pcre *re_pcre, pcre_extra *re_pe,
-              Rboolean useBytes, Rboolean use_UTF8)
+              Rboolean useBytes, Rboolean use_UTF8,
+              int *ovector, int ovector_size,
+              int capture_count)
 {
-    int matchIndex = -1, st = 0, foundAll = 0, foundAny = 0, j, start=0;
+    int matchIndex = -1, st = 0, foundAll = 0, foundAny = 0, i, j, start=0;
     SEXP ans, matchlen; /* return vect and its attribute */
+    SEXP capture, capturelen, capturebuf, capturelenbuf;
     SEXP matchbuf, matchlenbuf; /* buffers for storing multiple matches */
     int bufsize = 1024; /* starting size for buffers */
+    PROTECT(capturelenbuf = allocVector(INTSXP, bufsize*capture_count));
+    PROTECT(capturebuf = allocVector(INTSXP, bufsize*capture_count));
     PROTECT(matchbuf = allocVector(INTSXP, bufsize));
     PROTECT(matchlenbuf = allocVector(INTSXP, bufsize));
     while (!foundAll) {
-        int rc, ovector[3], slen = strlen(string);
-        rc = pcre_exec(re_pcre, re_pe, string, slen, start, 0, ovector, 3);
+        int rc, slen = strlen(string);
+        rc = pcre_exec(re_pcre, re_pe, string, slen, start, 0,
+                       ovector, ovector_size);
         if (rc >= 0) {
             if ((matchIndex + 1) == bufsize) {
-                /* Reallocate match buffers */
+                /* Reallocate match buffers
+                   TODO: need to update this for new args
+                 */
                 int newbufsize = bufsize * 2;
                 SEXP tmp;
+
                 tmp = allocVector(INTSXP, 2 * bufsize);
                 for (j = 0; j < bufsize; j++)
                     INTEGER(tmp)[j] = INTEGER(matchlenbuf)[j];
                 UNPROTECT(1);
                 matchlenbuf = tmp;
                 PROTECT(matchlenbuf);
+
                 tmp = allocVector(INTSXP, 2 * bufsize);
                 for (j = 0; j < bufsize; j++)
                     INTEGER(tmp)[j] = INTEGER(matchbuf)[j];
@@ -1664,6 +1674,28 @@
                 UNPROTECT(2);
                 PROTECT(matchbuf);
                 PROTECT(matchlenbuf);
+
+                tmp = allocVector(INTSXP, 2 * bufsize*capture_count);
+                for(j=0;j 0) {
-            INTEGER(matchbuf)[matchIndex] = 1 + getNc(string, st);
-            if (INTEGER(matchbuf)[matchIndex] <= 0) { /* an invalid string */
-                INTEGER(matchbuf)[matchIndex] = NA_INTEGER;
-                foundAll = 1; /* if we get here, we are done */
-            }
+            int mlen = ovector[1] - st;
+            /* Unfortunately these are in bytes */
+            if (st > 0) {
+                INTEGER(matchbuf)[matchIndex] = 1 + getNc(string, st);
+                if (INTEGER(matchbuf)[matchIndex] <= 0) { /* an invalid string */
+                    INTEGER(matchbuf)[matchIndex] = NA_INTEGER;
+                    foundAll = 1; /* if we get here, we are done */
                }
            }
-            INTEGER(matchlenbuf)[matchIndex] = getNc(string+st, mlen);
-            if (INTEGER(matchlenbuf)[matchIndex] < 0) {/* an invalid string */
-                INTEGER(matchlenbuf)[matchIndex]
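The parse.one() helper used in the example above is not included in the message body (it was in the attached test code). A small implementation consistent with the capture.start/capture.length/capture.names attributes shown there could look like this (a sketch; named capture with perl = TRUE is available in released versions of R, so this runs today):

```r
## Extract named capture groups from a regexpr(..., perl = TRUE) match
parse.one <- function(strings, result) {
  starts <- attr(result, "capture.start")    # one column per group
  lengths <- attr(result, "capture.length")
  out <- substring(rep(strings, ncol(starts)), starts, starts + lengths - 1)
  m <- matrix(out, nrow = length(strings))
  colnames(m) <- attr(result, "capture.names")
  m
}

notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
parsed <- regexpr(name.rex, notables, perl = TRUE)
parse.one(notables, parsed)  # columns "first" and "last"
```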
[Rd] Request for a crop option on R's standard plot context menu
Dear R Dev,

I hope you don't mind a request for a feature from a long-time R user (happily using R since 2005). I use R for lots of plotting for my work; e.g., for my current project I have a script that generates 300 plots. I generally copy these all as bitmaps and put them directly into a Word document for my report (see e.g. attached screenshot). I am always required to maximise the plots to get the best resolution, but that always means I get lots of white space when I copy the plot into Word. Cropping them takes a reasonable amount of time (e.g. an entire day to crop all 300 for this project).

I'm wondering whether it would be possible to have an option added to the plot window context menu for "Cropped bitmap" or "Crop whitespace"? I think this would be VERY useful because all the online tools I've seen for batch-cropping a set of images require the same crop to be applied to all images (e.g. https://www.youtube.com/watch?v=icbpS0OH9a0 ) and that's not the case for my plots (some of my plots are panel plots, some simpler plots).

I know it's a bit cheeky to ask for something that I have no idea how to code up myself: I am just hoping that this feedback might go somewhere useful. R can really do 99% of things kind of perfectly so it seems churlish to point out the 1%, but if this is an easy thing to add in then at least I would use it pretty much every project I work on!

Many thanks and best regards,
Toby

PS. I know I could modify my script so that it creates windows of exactly the right size (dev.new), but doing that would mean I would have to recalculate the sizes if I changed anything on the plots at all (e.g. aspect, x label), so I would lose probably more time pursuing that option. Also, I've found that using lots of dev.new commands makes it difficult for colleagues who use RStudio to use my scripts.
I am also aware I could export these images as pdfs, but then I would have to open 300 pdfs, extract the images and crop them all in 3rd party software. Again, perfectly possible but I am searching for a slightly quicker solution (!).

Dr Toby Marthews
UKCEH Band 6 Researcher in Global Surface Science (Hydro-Climate Risks)
Mob: +44 753 2168305, web: www.tobymarthews.com

This email and any attachments are intended solely for the named recipients and are confidential. If you are not the intended recipient please reply to the email to highlight the error and delete this email from your system; you must not use, disclose, copy or distribute this email or any of its attachments. UKCEH has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKCEH does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses. Opinions, conclusions or other information in this message and attachments that are not related directly to UKCEH business are solely those of the author and do not represent the views of UKCEH. We process your personal data in accordance with our Privacy Notice, available on the UKCEH website. https://www.ceh.ac.uk/privacy-notice

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
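In the meantime, the batch cropping itself can be scripted from R (a sketch; it assumes the 'magick' CRAN package is installed and that the plots have been saved as PNG files in a hypothetical "plots" directory; each image gets its own trim, so panel plots and simple plots are handled independently):

```r
## Batch-trim white margins from saved plot bitmaps (sketch)
library(magick)
for (f in list.files("plots", pattern = "\\.png$", full.names = TRUE)) {
  img <- image_trim(image_read(f))  # removes the uniform border colour
  image_write(img, f)               # overwrite with the cropped image
}
```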