[Rd] Creating a Factor Object in C code?

2012-12-27 Thread Rory Winston
Hi guys

I am currently working on a small bit of bridging code between a database 
system and R. The database system has the concept of varchars, a la factors in 
R, as distinct from plain character strings. What I would like to do is when I 
receive a list of character strings from the remote database system that are of 
type varchar, turn these into a factor variable. This would ideally need to be 
done in C code, where the rest of the datatype translation is occurring. 

My first attempt was a bit naive: setting the factor class attribute on a 
vector of character strings, which obviously results in an error. Looking at 
the R factor() implementation, I can see the core logic for factor conversion 
is:

 y <- unique(x)
 ind <- sort.list(y)
 y <- as.character(y)
 levels <- unique(y[ind])

So I am guessing this would need to be replicated in C? My question is: is it 
possible to create a fully formed factor variable in C code (I've struggled to 
find any examples), or should this be done in R when the call returns? I 
would like to make it seamless to the end user, so an automatic conversion to 
factors would be preferable.

Cheers
-- Rory
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] reinterpreting externalptr in R

2012-12-27 Thread andre__
Thanks a lot ... 







[Rd] Doc patch for Sys.time and system.time

2012-12-27 Thread Ken Williams
Here’s a patch that adds ‘seealso’ entries to Sys.time and system.time
docs, to help people who forget what the distinction is between them.

Patch was made against https://svn.r-project.org/R/trunk@61454 .

 -Ken


Re: [Rd] Creating a Factor Object in C code?

2012-12-27 Thread Simon Urbanek
Rory,

On Dec 27, 2012, at 3:14 AM, Rory Winston wrote:

> Hi guys
> 
> I am currently working on a small bit of bridging code between a database 
> system and R. The database system has the concept of varchars, a la factors 
> in R, as distinct from plain character strings.

varchars are character strings. A factor consists of an index and a level set, 
so if your DB doesn't keep those separate, it is not a factor (and below you 
suggest it doesn't). Even if the DB supports ordered and unordered sets, the 
drivers typically only return the strings anyway, so you don't get at the set 
(without querying the schema). To illustrate: with a factor you can have a 
column consisting of values A,A,B,B and a level set of A,B,C (i.e. C is not 
used, so it is extra information that you cannot express in a character 
string). If you have neither the levels information nor the order, then it's 
just a character vector.
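
As a rough illustration of the A,A,B,B example (a sketch only: the factor_view 
struct and level_counts() are made-up names, not part of R's C API), a factor 
can be pictured in plain C as 1-based integer codes plus a separate level table:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a factor: 1-based codes plus a level table
   that may hold levels no value actually uses. */
typedef struct {
    const int *codes;            /* one code per value, 1..nlevels */
    size_t n;                    /* number of values */
    const char *const *levels;   /* the level set */
    size_t nlevels;              /* size of the level set */
} factor_view;

/* Count how many values carry each level; counts must have room for
   nlevels entries. Unused levels end up with a count of zero. */
static void level_counts(const factor_view *f, size_t *counts)
{
    for (size_t j = 0; j < f->nlevels; j++)
        counts[j] = 0;
    for (size_t i = 0; i < f->n; i++) {
        assert(f->codes[i] >= 1 && (size_t)f->codes[i] <= f->nlevels);
        counts[f->codes[i] - 1]++;
    }
}
```

With codes 1,1,2,2 and levels A,B,C the counts come out as 2, 2 and 0. The 
unused level C survives only because the level table is stored separately from 
the values, which is exactly what a plain character vector cannot express.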


> What I would like to do is when I receive a list of character strings from 
> the remote database system that are of type varchar, turn these into a factor 
> variable. This would ideally need to be done in C code, where the rest of the 
> datatype translation is occuring. 
> 

It really depends on what you want to get out and what your input really is. If 
your DB will be delivering results in rows, probably the most efficient way to 
construct a factor from string input is to simply create the index as you go 
and keep a hash of the levels. Then at the end you just put the two together 
into one factor object. Note that if your DB doesn't pre-specify the levels, 
the order is undefined.
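
A minimal sketch of that "index as you go, hash of the levels" idea in plain C 
follows. Everything here is an assumption made for illustration: the 
factor_builder struct, the fb_* functions, the djb2 hash and the open-addressing 
table are not from any R or DB driver API. Levels come out in first-seen order, 
which is the "undefined order" case mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical builder; none of these names come from the R API. */
typedef struct {
    int *codes;        /* 1-based level index for each value seen so far */
    size_t n, cap;     /* values stored / allocated */
    char **levels;     /* distinct strings in first-seen order */
    size_t nlevels;
    int *slots;        /* open-addressing table: 0 = empty, else level index */
    size_t nslots;     /* always a power of two */
} factor_builder;

static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;                      /* djb2 string hash */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static int fb_init(factor_builder *b)
{
    memset(b, 0, sizeof *b);
    b->nslots = 16;
    b->slots = calloc(b->nslots, sizeof *b->slots);
    return b->slots ? 0 : -1;
}

/* Double the hash table and re-insert every known level. */
static int fb_grow_table(factor_builder *b)
{
    size_t nslots = b->nslots * 2;
    int *slots = calloc(nslots, sizeof *slots);
    if (!slots) return -1;
    for (size_t j = 0; j < b->nlevels; j++) {
        size_t k = hash_str(b->levels[j]) & (nslots - 1);
        while (slots[k]) k = (k + 1) & (nslots - 1);
        slots[k] = (int)j + 1;
    }
    free(b->slots);
    b->slots = slots;
    b->nslots = nslots;
    return 0;
}

/* Append one value; returns its 1-based code, or -1 if memory runs out. */
static int fb_add(factor_builder *b, const char *s)
{
    assert(s != NULL);
    size_t k = hash_str(s) & (b->nslots - 1);
    while (b->slots[k] && strcmp(b->levels[b->slots[k] - 1], s) != 0)
        k = (k + 1) & (b->nslots - 1);           /* linear probing */
    int code = b->slots[k];
    if (code == 0) {                             /* unseen string: new level */
        if (2 * (b->nlevels + 1) > b->nslots) {  /* keep table at most half full */
            if (fb_grow_table(b) != 0) return -1;
            k = hash_str(s) & (b->nslots - 1);
            while (b->slots[k]) k = (k + 1) & (b->nslots - 1);
        }
        char **lv = realloc(b->levels, (b->nlevels + 1) * sizeof *lv);
        if (!lv) return -1;
        b->levels = lv;
        size_t len = strlen(s) + 1;
        char *copy = malloc(len);
        if (!copy) return -1;
        memcpy(copy, s, len);
        b->levels[b->nlevels++] = copy;
        code = (int)b->nlevels;
        b->slots[k] = code;
    }
    if (b->n == b->cap) {                        /* grow the index vector */
        size_t cap = b->cap ? b->cap * 2 : 16;
        int *c = realloc(b->codes, cap * sizeof *c);
        if (!c) return -1;
        b->codes = c;
        b->cap = cap;
    }
    b->codes[b->n++] = code;
    return code;
}

static void fb_free(factor_builder *b)
{
    for (size_t j = 0; j < b->nlevels; j++) free(b->levels[j]);
    free(b->levels);
    free(b->codes);
    free(b->slots);
}
```

Calling fb_add() once per incoming row leaves the integer index in codes and 
the level set in levels. "Putting the two together" in R would then mean 
copying codes into an integer vector, attaching the level strings as its 
"levels" attribute and setting its class to "factor".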

If you are collecting the whole character vector first anyway, then I see no 
real point in not using as.factor() - even from C code.
Note, however, that in such a case you should really give the user an option 
not to do that - dealing with factors is very painful and they are bad for data 
manipulation, so many users (including me) prefer to set the stringsAsFactors 
default to FALSE because it's much more efficient and less error-prone to deal 
with character vectors. Having to convert factors back to strings is very 
inefficient (in particular with large data) and superfluous since you already 
had strings to start with.


> My first attempt was a bit naive (setting the factor class attribute on a 
> vector of character strings, which obviously results in an error), looking at 
> the R factor() implementation, I can see the core logic for factor conversion 
> is:
> 
> y <- unique(x)
> ind <- sort.list(y)
> y <- as.character(y)
> levels <- unique(y[ind])
> 
> So I am guessing this would need to be replicated in C? My question is - is 
> it possible to create a fully-formed factor variable in C code (Ive struggled 
> to find many / any examples), or should this be done in R when the call 
> returns? I would like to make it seamless to the end user, so an automatic 
> conversion to factors would be preferable..
> 

It would not, for the reasons above, which is why it's typically done at R 
level as an optional post-processing step. That doesn't mean you can't do it in 
C, but it is somewhat painful as you'll have to hash the levels - it's more 
convenient to have R do that for you.
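
For what the sorting half of the quoted factor() logic might look like in C, 
here is a hedged sketch. It assumes levels were collected in first-seen order 
with 1-based codes, and it sorts by strcmp() byte order, which will not always 
match the locale-aware ordering R uses. The names level_ref, cmp_level and 
sort_levels are invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Scratch pair used while sorting: a level and its old 1-based index. */
typedef struct {
    const char *s;
    int old;
} level_ref;

static int cmp_level(const void *a, const void *b)
{
    return strcmp(((const level_ref *)a)->s, ((const level_ref *)b)->s);
}

/* Sort levels in place (strcmp byte order) and rewrite the 1-based codes
   so that every value still refers to the same string as before.
   Returns 0 on success, -1 if memory runs out. */
static int sort_levels(const char **levels, size_t nlevels,
                       int *codes, size_t n)
{
    if (nlevels == 0) return 0;
    level_ref *refs = malloc(nlevels * sizeof *refs);
    int *remap = malloc(nlevels * sizeof *remap);
    if (!refs || !remap) {
        free(refs);
        free(remap);
        return -1;
    }
    for (size_t j = 0; j < nlevels; j++) {
        refs[j].s = levels[j];
        refs[j].old = (int)j + 1;
    }
    qsort(refs, nlevels, sizeof *refs, cmp_level);
    for (size_t j = 0; j < nlevels; j++) {
        levels[j] = refs[j].s;                 /* new position j+1 */
        remap[refs[j].old - 1] = (int)j + 1;   /* old index -> new index */
    }
    for (size_t i = 0; i < n; i++) {
        assert(codes[i] >= 1 && (size_t)codes[i] <= nlevels);
        codes[i] = remap[codes[i] - 1];
    }
    free(refs);
    free(remap);
    return 0;
}
```

Starting from levels B,A,C with codes 1,2,1,3,2, this yields levels A,B,C with 
codes 2,1,2,3,1, so the values still read B,A,B,C,A.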

Cheers,
Simon



> Cheers
> -- Rory


Re: [Rd] How to ensure -O3 on Win64

2012-12-27 Thread Simon Urbanek

On Dec 23, 2012, at 9:22 PM, Matthew Dowle wrote:

> 
> Hi,
> 
> Similar questions have come up before on the list and elsewhere but I haven't 
> found a solution yet.
> 
> winbuilder's install.out shows data.table's .c files compiled with -O3 on 
> Win32 but -O2 on Win64. The same happens on R-Forge. I gather that some 
> packages don't work with -O3 so the default is -O2.
> 
> I've tried this in data.table's Makevars (entire contents) :
> 
> 
> MAKEFLAGS="CFLAGS=-O3"# added
> CFLAGS=-O3# added
> PKG_CFLAGS=-O3# added
> all: $(SHLIB) # no change
>   mv $(SHLIB) datatable$(SHLIB_EXT) # no change
> 
> 
> but -O2 still appears in winbuilder's install.out (after -O3, and I believe 
> the last -O is the one that counts) :
> 
> gcc -m64 -I"D:/RCompile/recent/R-2.15.2/include" -DNDEBUG 
> -I"d:/Rcompile/CRANpkg/extralibs215/local215/include"  -O3   -O2 -Wall  
> -std=gnu99 -mtune=core2 -c dogroups.c -o dogroups.o
> 
> How can I ensure that data.table is compiled with -O3 on Win64?
> 

You can't - at least not in a way that doesn't circumvent the R build system. 
Also it's not portable, so you don't want to mess with optimization flags and 
hard-code them in your package: it's the user's choice how they set up R and 
its flags. You can certainly set up your R to compile with -O3, you just can't 
impose that on others.

Cheers,
Simon



Re: [Rd] Doc patch for Sys.time and system.time

2012-12-27 Thread Ken Williams
Duncan noticed that either the sending server (Gmail - shouldn't be the
case) or receiving server stripped out the attachment.  Here it is again,
inline.

 -Ken

===
From 99766dd8f16804ecddc73f6169be3e42b916b8fa Mon Sep 17 00:00:00 2001
From: Ken Williams 
Date: Thu, 27 Dec 2012 09:58:21 -0600
Subject: [PATCH] Add system.time link to Sys.time documentation, and vice
 versa.


diff --git a/src/library/base/man/Sys.time.Rd b/src/library/base/man/Sys.time.Rd
index d34571b..f0b0c50 100644
--- a/src/library/base/man/Sys.time.Rd
+++ b/src/library/base/man/Sys.time.Rd
@@ -41,6 +41,8 @@ Sys.Date()
   string.

   \code{\link{Sys.timezone}}.
+
+  \code{\link{system.time}} for measuring elapsed/CPU time of expressions.
 }
 \examples{\donttest{
 Sys.time()
diff --git a/src/library/base/man/system.time.Rd b/src/library/base/man/system.time.Rd
index 5cd79b7..ad21267 100644
--- a/src/library/base/man/system.time.Rd
+++ b/src/library/base/man/system.time.Rd
@@ -38,6 +38,8 @@ unix.time(expr, gcFirst = TRUE)
 }
 \seealso{
   \code{\link{proc.time}}, \code{\link{time}} which is for time series.
+
+  \code{\link{Sys.time}} to get the current date & time.
 }
 \examples{
 require(stats)
-- 
1.7.9
===



On Thu, Dec 27, 2012 at 10:08 AM, Ken Williams  wrote:

> Here’s a patch that adds ‘seealso’ entries to Sys.time and system.time
> docs, to help people who forget what the distinction is between them.
>
> Patch was made against https://svn.r-project.org/R/trunk@61454 .
>
>  -Ken



[Rd] Suggestion: 'method' slot for expand.grid() (incl. diffs)

2012-12-27 Thread Marius Hofert
Dear expeRts,

The order in which the variables vary in expand.grid() is often unintuitive. I
would like to suggest a 'method' slot for expand.grid() which requires only very
few changes (100% backward compatible) and which allows one to control this
order. Please find attached diffs against R-devel.

Cheers,

Marius



### ./src/library/base/R/expand.grid.R #

--- expand.grid.R   2012-12-27 22:37:29.0 +0100
+++ expand.grid2.R  2012-12-27 22:41:00.331979950 +0100
@@ -16,7 +16,8 @@
 #  A copy of the GNU General Public License is available at
 #  http://www.r-project.org/Licenses/

-expand.grid <- function(..., KEEP.OUT.ATTRS = TRUE, stringsAsFactors = TRUE)
+expand.grid <- function(..., KEEP.OUT.ATTRS = TRUE, stringsAsFactors = TRUE,
+method = c("decreasing", "increasing"))
 {
 ## x should either be a list or a set of vectors or factors
 nargs <- length(args <- list(...))
@@ -26,7 +27,9 @@
 if(nargs == 0L) return(as.data.frame(list()))
 ## avoid classed args such as data frames: cargs <- args
 cargs <- vector("list", nargs)
-iArgs <- seq_len(nargs)
+seqArgs <- seq_len(nargs)
+method <- match.arg(method)
+iArgs <- if(method=="decreasing") seqArgs else rev(seqArgs)
 nmc <- paste0("Var", iArgs)
 nm <- names(args)
 if(is.null(nm))


### ./src/library/base/man/expand.grid.Rd ##

--- expand.grid.Rd  2012-12-27 22:38:13.0 +0100
+++ expand.grid2.Rd 2012-12-27 22:46:53.103964121 +0100
@@ -6,7 +6,8 @@
 \name{expand.grid}
 \title{Create a Data Frame from All Combinations of Factors}
 \usage{
-expand.grid(\dots, KEEP.OUT.ATTRS = TRUE, stringsAsFactors = TRUE)
+expand.grid(\dots, KEEP.OUT.ATTRS = TRUE, stringsAsFactors = TRUE,
+method = c("decreasing", "increasing"))
 }
 \alias{expand.grid}
 \arguments{
@@ -15,6 +16,15 @@
 attribute (see below) should be computed and returned.}
   \item{stringsAsFactors}{logical specifying if character vectors are
 converted to factors.}
+  \item{method}{method slot for how the resulting data frame is
+presented. Available are:
+\describe{
+  \item{"decreasing"}{the default; the variability of the variables
+   is decreasing in the column number.}
+  \item{"increasing"}{the variability of the variables
+   is increasing in the column number.}
+}
+  }
 }
 \description{
   Create a data frame from all combinations of the supplied vectors or
@@ -52,6 +62,8 @@

 expand.grid(height = seq(60, 80, 5), weight = seq(100, 300, 50),
 sex = c("Male","Female"))
+expand.grid(height = seq(60, 80, 5), weight = seq(100, 300, 50),
+sex = c("Male","Female"), method = "increasing")

 x <- seq(0, 10, length.out = 100)
 y <- seq(-1, 1, length.out = 20)



Re: [Rd] Creating a Factor Object in C code?

2012-12-27 Thread Rory Winston
Hi Simon

Thanks for the clarification - makes sense and I now think you're right - 
probably better to avoid an automatic factor conversion and let the user 
explicitly convert if necessary. And you are right, I did abuse the term factor 
when referring to varchar - instead of factor, I really meant something like 
'internalized strings' a la Java (i.e. like a factor but with no ordering or 
distinct levels attributes).

Many thanks
-- Rory


On 27/12/2012, at 5:47 PM, Simon Urbanek  wrote:

> varchars are character strings. Factors consists of index and level set, so 
> if your DB doesn't keep those separate, it is not a factor (and below you 
> suggest it doesn't). Even if the DB supports ordered and unordered sets, the 
> drivers typically only return the strings anyway, so you don't get at the set 
> (without querying the schema). To make a point - a factor is if you can have 
> a column consisting of values A,A,B,B and a level set of A,B,C (i.e. C is not 
> used so it is extra information that you cannot express in a character 
> string). if you don't have levels information nor the order then it's just a 
> character vector.




Re: [Rd] How to ensure -O3 on Win64

2012-12-27 Thread Matthew Dowle

On 27.12.2012 17:53, Simon Urbanek wrote:
> On Dec 23, 2012, at 9:22 PM, Matthew Dowle wrote:
>
>> Hi,
>>
>> Similar questions have come up before on the list and elsewhere but 
>> I haven't found a solution yet.
>>
>> winbuilder's install.out shows data.table's .c files compiled with 
>> -O3 on Win32 but -O2 on Win64. The same happens on R-Forge. I gather 
>> that some packages don't work with -O3 so the default is -O2.
>>
>> I've tried this in data.table's Makevars (entire contents) :
>>
>> MAKEFLAGS="CFLAGS=-O3"# added
>> CFLAGS=-O3# added
>> PKG_CFLAGS=-O3# added
>> all: $(SHLIB) # no change
>> mv $(SHLIB) datatable$(SHLIB_EXT) # no change
>>
>> but -O2 still appears in winbuilder's install.out (after -O3, and I 
>> believe the last -O is the one that counts) :
>>
>> gcc -m64 -I"D:/RCompile/recent/R-2.15.2/include" -DNDEBUG 
>> -I"d:/Rcompile/CRANpkg/extralibs215/local215/include"  -O3   -O2 -Wall 
>> -std=gnu99 -mtune=core2 -c dogroups.c -o dogroups.o
>>
>> How can I ensure that data.table is compiled with -O3 on Win64?
>
> You can't - at least not in a way that doesn't circumvent the R build
> system. Also it's not portable so you don't want to mess with
> optimization flags and hard-code it in your package as it's user's
> choice how they setup R and its flags. You can certainly setup your R
> to compile with -O3, you just can't impose that on others.
>
> Cheers,
> Simon


Thanks Simon. This makes complete sense where users compile packages on 
install (Unix and Mac, and I'd better check my settings then). My concern is 
Windows, where it's more common for the user to install the pre-compiled 
.zip from CRAN. This came up because the new fread function in data.table 
wasn't showing as much of a speedup on Win64 as on Linux. I'm not 100% sure 
that the lack of -O3 is the cause, but there are some function calls which get 
iterated a lot (e.g. isspace) and I'd seen that inlining was something -O3 
did and not -O2.


In general, why wouldn't a user of a package want the best performance 
from -O3?  By non portable do you mean the executable produced by 
winbuilder (or by CRAN) might not run on all Windows machines it's 
installed on (because -O3 (over) optimizes for the machine it's built 
on), or do you mean that -O3 itself might not be available on some 
compilers (and if so which compilers don't have -O3?).


Thanks, Matthew



Re: [Rd] Doc patch for Sys.time and system.time

2012-12-27 Thread Duncan Murdoch

These are now in R-devel.  Thanks!

Duncan Murdoch

On 12-12-27 1:34 PM, Ken Williams wrote:
> Duncan noticed that either the sending server (Gmail - shouldn't be the
> case) or receiving server stripped out the attachment.  Here it is again,
> inline.
>
>   -Ken
>
> ===
> From 99766dd8f16804ecddc73f6169be3e42b916b8fa Mon Sep 17 00:00:00 2001
> From: Ken Williams 
> Date: Thu, 27 Dec 2012 09:58:21 -0600
> Subject: [PATCH] Add system.time link to Sys.time documentation, and vice
>   versa.
>
>
> diff --git a/src/library/base/man/Sys.time.Rd b/src/library/base/man/Sys.time.Rd
> index d34571b..f0b0c50 100644
> --- a/src/library/base/man/Sys.time.Rd
> +++ b/src/library/base/man/Sys.time.Rd
> @@ -41,6 +41,8 @@ Sys.Date()
> string.
>
> \code{\link{Sys.timezone}}.
> +
> +  \code{\link{system.time}} for measuring elapsed/CPU time of expressions.
>   }
>   \examples{\donttest{
>   Sys.time()
> diff --git a/src/library/base/man/system.time.Rd b/src/library/base/man/system.time.Rd
> index 5cd79b7..ad21267 100644
> --- a/src/library/base/man/system.time.Rd
> +++ b/src/library/base/man/system.time.Rd
> @@ -38,6 +38,8 @@ unix.time(expr, gcFirst = TRUE)
>   }
>   \seealso{
> \code{\link{proc.time}}, \code{\link{time}} which is for time series.
> +
> +  \code{\link{Sys.time}} to get the current date & time.
>   }
>   \examples{
>   require(stats)





Re: [Rd] How to ensure -O3 on Win64

2012-12-27 Thread Simon Urbanek

On Dec 27, 2012, at 6:08 PM, Matthew Dowle wrote:

> On 27.12.2012 17:53, Simon Urbanek wrote:
>> On Dec 23, 2012, at 9:22 PM, Matthew Dowle wrote:
>> 
>>> 
>>> Hi,
>>> 
>>> Similar questions have come up before on the list and elsewhere but I 
>>> haven't found a solution yet.
>>> 
>>> winbuilder's install.out shows data.table's .c files compiled with -O3 on 
>>> Win32 but -O2 on Win64. The same happens on R-Forge. I gather that some 
>>> packages don't work with -O3 so the default is -O2.
>>> 
>>> I've tried this in data.table's Makevars (entire contents) :
>>> 
>>> 
>>> MAKEFLAGS="CFLAGS=-O3"# added
>>> CFLAGS=-O3# added
>>> PKG_CFLAGS=-O3# added
>>> all: $(SHLIB) # no change
>>> mv $(SHLIB) datatable$(SHLIB_EXT) # no change
>>> 
>>> 
>>> but -O2 still appears in winbuilder's install.out (after -O3, and I believe 
>>> the last -O is the one that counts) :
>>> 
>>> gcc -m64 -I"D:/RCompile/recent/R-2.15.2/include" -DNDEBUG 
>>> -I"d:/Rcompile/CRANpkg/extralibs215/local215/include"  -O3   -O2 -Wall 
>>> -std=gnu99 -mtune=core2 -c dogroups.c -o dogroups.o
>>> 
>>> How can I ensure that data.table is compiled with -O3 on Win64?
>>> 
>> 
>> You can't - at least not in a way that doesn't circumvent the R build
>> system. Also it's not portable so you don't want to mess with
>> optimization flags and hard-code it in your package as it's user's
>> choice how they setup R and its flags. You can certainly setup your R
>> to compile with -O3, you just can't impose that on others.
>> 
>> Cheers,
>> Simon
> 
> Thanks Simon. This makes complete sense where users compile packages on 
> install (Unix and Mac, and I better check my settings then), but on Windows 
> where it's more common for the user to install the pre-compiled .zip from 
> CRAN is my concern. This came up because the new fread function in data.table 
> wasn't showing as much of a speedup on Win64 as on Linux. I'm not 100% sure 
> that non -O3 is the cause, but there are some function calls which get 
> iterated a lot (e.g. isspace) and I'd seen that inlining was something -O3 
> did and not -O2.
> 
> In general, why wouldn't a user of a package want the best performance from 
> -O3?

Because it doesn't work? I don't know, you said yourself that -O2 may be there 
since -O3 breaks - that was not the question, though. (If you are curious about 
that, ask on CRAN, I don't remember the answer -- note that Win64 compiler 
support is relatively recent).


>  By non portable do you mean the executable produced by winbuilder (or by 
> CRAN) might not run on all Windows machines it's installed on (because -O3 
> (over) optimizes for the machine it's built on), or do you mean that -O3 
> itself might not be available on some compilers (and if so which compilers 
> don't have -O3?).
> 

Non-portable as in -O3 may not be supported or may break (we have seen -O3 
trigger bugs in gcc before). If you hard-code it, there is no way around it. 
The point is that you cannot make decisions for the user in advance, because 
you don't know the setup the user may use. I agree that Windows is a bit of a 
special-case in that there are very few choices so the risk of breaking things 
is lower, but if -O2 is really such a big deal, it is not just your problem and 
so you may want to investigate it further.

Cheers,
Simon



Re: [Rd] Creating a Factor Object in C code?

2012-12-27 Thread Simon Urbanek
On Dec 27, 2012, at 5:43 PM, Rory Winston wrote:

> Hi Simon
> 
> Thanks for the clarification - makes sense and I now think youre right - 
> probably better to avoid an automatic factor conversion and let the user 
> explicitly convert if necessary. And you are right, I did abuse the term 
> factor when referring to varchar - instead of factor, I really meant 
> something like 'internalized strings' a la Java (ie like a factor but with no 
> ordering or distinct levels attributes.
> 

FWIW all strings are internalized in R (for some years now) - hence character 
vectors are very memory-efficient and essentially what you were looking for.

Cheers,
Simon

> 
> 
> On 27/12/2012, at 5:47 PM, Simon Urbanek  wrote:
> 
>> varchars are character strings. Factors consists of index and level set, so 
>> if your DB doesn't keep those separate, it is not a factor (and below you 
>> suggest it doesn't). Even if the DB supports ordered and unordered sets, the 
>> drivers typically only return the strings anyway, so you don't get at the 
>> set (without querying the schema). To make a point - a factor is if you can 
>> have a column consisting of values A,A,B,B and a level set of A,B,C (i.e. C 
>> is not used so it is extra information that you cannot express in a 
>> character string). if you don't have levels information nor the order then 
>> it's just a character vector.
> 
