Re: [Rd] How x[, 'colname1'] is implemented?

2010-01-01 Thread Barry Rowlingson
On Thu, Dec 31, 2009 at 11:27 PM, Peng Yu  wrote:
> I don't see where the implementation of '[]' is described.
>
> For example, if x is a matrix or a data.frame, how is the lookup of
> 'colname1' in x[, 'colname1'] executed? Does R perform a lookup in
> a hash of the colnames? Is the reference O(1) or O(n), where n is the
> second dim of x?

 Where have you looked? I doubt this kind of implementation detail is
in the .Rd documentation since a regular user doesn't care for it.

 As Obi-wan Kenobi may have said in Star Wars: "Use the source, Luke!":

 Line 450 of subscript.c of the source code of R 2.10 is the
stringSubscript function. It has this comment:

/* The original code (pre 2.0.0) used a ns x nx loop that was too
 * slow.  So now we hash.  Hashing is expensive on memory (up to 32nx
 * bytes) so it is only worth doing if ns * nx is large.  If nx is
 * large, then it will be too slow unless ns is very small.
 */

The definition of "large" and "small" here appears to be such that:

457: Rboolean usehashing = in && ( ((ns > 1000 && nx) || (nx > 1000 &&
ns)) || (ns * nx > 15*nx + ns) );

  Barry

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] R Wish List

2010-01-01 Thread Gabor Grothendieck
This is my 2010 Wish list for R.  Most of these have been discussed on
r-help or r-devel already so this is more of a wrap-up.  The first 4
relate to R itself, the next 2 to the R environment and the last 4
relate to using R with other languages:

R

1. Strings. Some way of placing backslashes in literal strings without
escaping them.  This is useful for latex, regular expressions and
Windows filepaths.  C# and scripting languages such as Ruby, python,
perl, etc. have various ways to handle this which might be used as a
model.  Also related to this is a way of having both single and double
quotes within a string without escaping either.

2. read.table
  a. read.table allows double quotes around numerics if colClasses is
not specified but if colClasses = "numeric" then it does not.  There
should be some easy way to specify the default behavior when
colClasses is specified.
  b. an argument to specify how many rows read.table uses to establish
the column classes.  All rows, i.e. Inf, should be a possible option
(i.e. entire file passed through twice).

3. support png(con), pdf(con), etc. where con is a connection to be written to.

4. Cross Platform Consistency. Improve consistency of R across platforms:
  a. provide an analog of Windows R gui Paste Commands Only menu on other platforms
  b. support "clipboard" as in readLines("clipboard") on Mac
  c. shell command works on Windows, check if it's available on other platforms
  d. other?

R ENVIRONMENT

5. Installation.  When installing R version x.y.z, a new directory
should be created only if x or y changes but if z changes then the
default action should be to overwrite the existing x.y.z version. This
would simplify the configuration of R on disk by having fewer nearly
identical versions and is also consistent with how win-library already
works.  The current scheme of a separate tree for each z level version
would still be possible by selecting a custom installation but would
not be the default.

6. Documentation. demo files and non-Sweave pdf documents should be listed in
   - help(package = myPkg)
   - CRAN page for myPkg
These would be useful even if they are not in the form of links so the
reader could get a global view of what is available.

SCRIPTING/OTHER LANGUAGES

7. An -x argument which works similarly to the same argument in
Perl/Python/Ruby to allow a batch file and an R file to be combined.
For example:
  python -x prog.py
      skips the first line of prog.py, allowing non-UNIX forms of #!cmd
  perl -x prog.pl  or  perl -xdir prog.pl
      strips off text before the #!perl line and possibly cd's to the
      directory (if one is given).  ruby works the same way.
Thus one could write:
  Rscript -x myfile.bat
and combine a Windows batch file and an R file into myfile.bat, say.

8. tcltk
  a. some easy way to run R minimized but have a tcltk GUI not
minimized (i.e. without having to write a frontend in C)
  b. ability to load tcl without loading tk

9. provide packages like tcltk for the other popular scripting
languages: perl, python, ruby.  It would be sufficient that there is a
package that contains the scripting language executable so an R
package that uses perl, say, could simply list such a package for perl
in Depends: in DESCRIPTION file and thereby be sure that perl were
available.  They could then access it via:
  # "perl" here is the proposed (not yet existing) package bundling perl
  perlcmd <- if (.Platform$OS.type == "windows") "perl.exe" else "perl"
  perl <- system.file(perlcmd, package = "perl")
  cmd <- paste(perl, my_command)
  system(cmd)
An actual interface to R, as in tcltk, while nice to have is not an
essential part of this and could be omitted to make this easier to
accomplish.

10. eliminate dependence of R on perl (this seems to be occurring or
maybe it's already happened). Packages needing perl could rely on #9
if it were available.



Re: [Rd] How x[, 'colname1'] is implemented?

2010-01-01 Thread Peng Yu
On Fri, Jan 1, 2010 at 6:52 AM, Barry Rowlingson
 wrote:
> On Thu, Dec 31, 2009 at 11:27 PM, Peng Yu  wrote:
>> I don't see where the implementation of '[]' is described.
>>
>> For example, if x is a matrix or a data.frame, how is the lookup of
>> 'colname1' in x[, 'colname1'] executed? Does R perform a lookup in
>> a hash of the colnames? Is the reference O(1) or O(n), where n is the
>> second dim of x?
>
>  Where have you looked? I doubt this kind of implementation detail is
> in the .Rd documentation since a regular user doesn't care for it.

I'm not complaining that it is not documented.

>  As Obi-wan Kenobi may have said in Star Wars: "Use the source, Luke!":
>
>  Line 450 of subscript.c of the source code of R 2.10 is the
> stringSubscript function. It has this comment:
>
> /* The original code (pre 2.0.0) used a ns x nx loop that was too
>  * slow.  So now we hash.  Hashing is expensive on memory (up to 32nx
>  * bytes) so it is only worth doing if ns * nx is large.  If nx is
>  * large, then it will be too slow unless ns is very small.
>  */

Could you explain what ns and nx represent?

> The definition of "large" and "small" here appears to be such that:
>
> 457: Rboolean usehashing = in && ( ((ns > 1000 && nx) || (nx > 1000 &&
> ns)) || (ns * nx > 15*nx + ns) );



Re: [Rd] How x[, 'colname1'] is implemented?

2010-01-01 Thread Seth Falcon
On 1/1/10 1:40 PM, Peng Yu wrote:
> On Fri, Jan 1, 2010 at 6:52 AM, Barry Rowlingson
>  wrote:
>> On Thu, Dec 31, 2009 at 11:27 PM, Peng Yu  wrote:
>>> I don't see where the implementation of '[]' is described.
>>>
>>> For example, if x is a matrix or a data.frame, how is the lookup of
>>> 'colname1' in x[, 'colname1'] executed? Does R perform a lookup in
>>> a hash of the colnames? Is the reference O(1) or O(n), where n is the
>>> second dim of x?
>>
>>  Where have you looked? I doubt this kind of implementation detail is
>> in the .Rd documentation since a regular user doesn't care for it.
> 
> I'm not complaining that it is not documented.
> 
>>  As Obi-wan Kenobi may have said in Star Wars: "Use the source, Luke!":
>>
>>  Line 450 of subscript.c of the source code of R 2.10 is the
>> stringSubscript function. It has this comment:
>>
>> /* The original code (pre 2.0.0) used a ns x nx loop that was too
>>  * slow.  So now we hash.  Hashing is expensive on memory (up to 32nx
>>  * bytes) so it is only worth doing if ns * nx is large.  If nx is
>>  * large, then it will be too slow unless ns is very small.
>>  */
> 
> Could you explain what ns and nx represent?

integers :-)

Consider a 5x5 matrix m and a call like m[ , c("C", "D")], then
in the call to stringSubscript:

  s - The character vector of subscripts, here c("C", "D")

  ns - length of s, here 2

  nx - length of the dimension being subscripted, here 5

  names - the dimnames being subscripted.  Here, perhaps
  c("A", "B", "C", "D", "E")
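
To make the roles of s, ns, nx and names concrete, here is a minimal
sketch of the kind of "ns x nx loop" the source comment says was used
before hashing.  It is not R's stringSubscript: the function name
match_names_linear, the plain C string arrays and the size_t types are
assumptions made for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not R's stringSubscript: for each of the ns
 * subscripts in s, scan the nx names linearly and record the 1-based
 * position of the first match, or 0 if there is none.  The worst case
 * costs ns * nx string comparisons. */
void match_names_linear(const char *const *s, size_t ns,
                        const char *const *names, size_t nx,
                        size_t *out)
{
    for (size_t i = 0; i < ns; i++) {
        out[i] = 0;                      /* 0 means "no such name" */
        for (size_t j = 0; j < nx; j++) {
            if (strcmp(s[i], names[j]) == 0) {
                out[i] = j + 1;          /* R indices are 1-based */
                break;
            }
        }
    }
}
```

With names = {"A", "B", "C", "D", "E"} and s = {"C", "D"} this fills
out with 3 and 4.  A hashed variant would build a table over names once
and replace the inner scan with a table lookup, trading memory for time.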

>> The definition of "large" and "small" here appears to be such that:
>>
>> 457: Rboolean usehashing = in && ( ((ns > 1000 && nx) || (nx > 1000 &&
>> ns)) || (ns * nx > 15*nx + ns) );

The 'in' argument is always TRUE AFAICS so this boils down to:

Use hashing for x[i] if either length(x) > 1000 or length(i) > 1000 (and
we aren't in the trivial case where either length(x) == 0 or length(i) == 0)

OR use hashing if (ns * nx > 15*nx + ns)
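
The rule above can be written out as a small standalone predicate.
This is only a sketch under stated assumptions: use_hashing is a
hypothetical name, the 'in' flag is taken to be TRUE as described, and
size_t is used here rather than whatever integer types the R source
uses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the quoted usehashing test with 'in' assumed TRUE.
 * ns = number of subscripts, nx = length of the dimension. */
bool use_hashing(size_t ns, size_t nx)
{
    /* Either side is large and neither side is empty. */
    bool big = (ns > 1000 && nx > 0) || (nx > 1000 && ns > 0);
    /* Otherwise hash only when the ns x nx loop would cost more. */
    return big || (ns * nx > 15 * nx + ns);
}
```

For the 5x5 example with ns = 2 the predicate is false (10 is not
greater than 77), so no hashing is used.  Rearranging the last clause
gives ns * (nx - 1) > 15 * nx, so for nx > 1 it only becomes true once
ns is somewhat above 15: with nx = 5 that means ns >= 19, and with
nx = 100 it means ns >= 16.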


+ seth
