[Rd] Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b", : cannot open the connection

2016-01-15 Thread Soumen Pal via R-devel
Dear All

I have successfully created a cluster of four nodes using localhost on my 
local machine by executing the following command:

> cl<-makePSOCKcluster(c(rep("localhost",4)),outfile='',homogeneous=FALSE,port=11001)
starting worker pid=4271 on localhost:11001 at 12:12:26.164
starting worker pid=4280 on localhost:11001 at 12:12:26.309
starting worker pid=4289 on localhost:11001 at 12:12:26.456
starting worker pid=4298 on localhost:11001 at 12:12:26.604
> 
> stopCluster(cl)

Now I am trying to create a cluster of two nodes (one on my local machine and 
another on a remote machine) using the "makePSOCKcluster" command. Both 
machines have identical settings and are connected by SSH. The OS is Ubuntu 
14.04 LTS and the R version is 3.2.1. I executed the following command to 
create the cluster, but I get the error message below and the R session hangs.

cl<-makePSOCKcluster(c(rep("soumen@10.10.2.32",1)),outfile='',homogeneous=FALSE,port=11001)
soumen@10.10.2.32's password: 
starting worker pid=2324 on soumen-HP-ProBook-440-G2:11001 at 12:11:59.349
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  
: 
  cannot open the connection
Calls:  ... doTryCatch -> recvData -> makeSOCKmaster -> 
socketConnection
In addition: Warning message:
In socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
  soumen-HP-ProBook-440-G2:11001 cannot be opened
Execution halted


My sessionInfo() is as follows

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.1 LTS

locale:
 [1] LC_CTYPE=en_IN       LC_NUMERIC=C         LC_TIME=en_IN
 [4] LC_COLLATE=en_IN     LC_MONETARY=en_IN    LC_MESSAGES=en_IN
 [7] LC_PAPER=en_IN       LC_NAME=C            LC_ADDRESS=C
[10] LC_TELEPHONE=C       LC_MEASUREMENT=en_IN LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base


I don't know how to solve this problem. Please help me solve it.

Thanks

Soumen Pal


[Rd] Multiple cores are used in simple for loop

2016-01-15 Thread Daniel Kaschek

Dear all,

I run different R versions (3.2.1, 3.2.2 and 3.2.3) on different 
platforms (Arch, Ubuntu, Debian) with different numbers of available 
cores (24, 4, 24). The following line produces very different behavior 
on the three machines:


for(i in 1:1e6) {n <- 100; M <- matrix(rnorm(n^2), n, n); M %*% M}

On the Ubuntu and Arch machines one core is used, but on the Debian 
machine ALL cores are used, with heavy "kernel time" vs. "normal time" 
(red vs. green in htop). The number of cores used on Debian seems to be 
related to the size of the matrix: reducing n from 100 to 4 results in 
four cores being used.


A similar problem persists with the parallel package and mclapply():

library(parallel)
out <- mclapply(1:1e6, function(i) { n <- 100; M <- matrix(rnorm(n^2), 
n, n); M %*% M }, mc.cores = 24)


On Arch and Debian all 24 cores run and show a high kernel time vs. 
normal time (all CPU bars in htop are 80% red). With mc.cores = 4 on 
the Ubuntu system, however, all four cores run at full load with almost 
no kernel time but full normal time (all bars are green).


Have you seen this problem before? Does anybody know how to fix it?

Cheers,
Daniel



Re: [Rd] Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b", : cannot open the connection

2016-01-15 Thread Morgan, Martin
Arrange to make the ssh connection passwordless. Do this by copying your 
'public key' to the machine that you are trying to connect to. Google will be 
your friend in accomplishing this.

It might be that a firewall stands between you and the other machine, or that 
the other machine does not allow connections to port 11001. Either way, the 
direction toward a solution is to speak with your system administrator. If it 
is a firewall, then they are unlikely to accommodate you; the strategy is to run 
your cluster exclusively on one side of the firewall.
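
For illustration, a minimal sketch combining both suggestions; one 
additional guess (not from the advice above) is that the worker may be 
failing to resolve the master's hostname (soumen-HP-ProBook-440-G2), in 
which case passing an explicit master address can help. The IP 
10.10.2.31 below is hypothetical.

## Make ssh passwordless first (run once in a shell on the master):
##   ssh-keygen -t rsa              # accept the defaults
##   ssh-copy-id soumen@10.10.2.32
library(parallel)

## 'master' gives the worker an address it can reach back to;
## "10.10.2.31" is a hypothetical IP for the local machine.
cl <- makePSOCKcluster("soumen@10.10.2.32",
                       master = "10.10.2.31",
                       port = 11001,
                       outfile = "",
                       homogeneous = FALSE)
stopCluster(cl)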

Martin Morgan



Re: [Rd] Multiple cores are used in simple for loop

2016-01-15 Thread Martyn Plummer
On Fri, 2016-01-15 at 15:03 +0100, Daniel Kaschek wrote:
> Dear all,
> 
> I run different R versions (3.2.1, 3.2.2 and 3.2.3) on different 
> platforms (Arch, Ubuntu, Debian) with different numbers of available 
> cores (24, 4, 24). The following line produces very different behavior 
> on the three machines:
> 
> for(i in 1:1e6) {n <- 100; M <- matrix(rnorm(n^2), n, n); M %*% M}
> 
> On the Ubuntu and Arch machines one core is used, but on the Debian 
> machine ALL cores are used, with heavy "kernel time" vs. "normal time" 
> (red vs. green in htop). The number of cores used on Debian seems to be 
> related to the size of the matrix: reducing n from 100 to 4 results in 
> four cores being used.

It depends on what backend R is using for linear algebra. Some backends
will split large matrix calculations over multiple threads. On Debian,
you can set the BLAS and LAPACK libraries to the implementation of your
choice.

https://wiki.debian.org/DebianScience/LinearAlgebraLibraries

As far as I know, the reference BLAS and LAPACK are still single-threaded.

Alternatively, you may be able to control the maximum number of threads
by setting and exporting an appropriate environment variable depending
on what backend you are using, e.g. OPENBLAS_NUM_THREADS or
MKL_NUM_THREADS.
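
For illustration, a minimal sketch of checking and capping BLAS threads 
from inside R, using the CRAN package RhpcBLASctl (an assumption on my 
part; it is not part of base R and works only for backends it knows about):

library(RhpcBLASctl)        # assumed installed from CRAN

blas_get_num_procs()        # how many threads the BLAS may use
blas_set_num_threads(1)     # cap the BLAS at a single thread

## This product should now stay on one core:
n <- 100
M <- matrix(rnorm(n^2), n, n)
invisible(M %*% M)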

Martyn



Re: [Rd] JDataFrame API

2016-01-15 Thread Simon Urbanek
Tom,

this may be good for embedding small data sets, but for practical purposes it 
doesn't seem like the most efficient solution.

Since you didn't provide any code, I built a test case using the built-in Java 
JSON API to build a medium-sized dataset (1e6 rows) and read it in just to get 
a ballpark (see
https://gist.github.com/s-u/4efb284e3c15c6a2db16 ).

# generate:
time java -cp .:javax.json-api-1.0.jar:javax.json-1.0.4.jar A > 1e6

real    0m2.764s
user    0m20.356s
sys     0m0.962s

# read:
> system.time(temp <- RJSONIO::fromJSON("1e6"))
   user  system elapsed 
  3.484   0.279   3.834 
> str(temp)
List of 2
 $ V1: num [1:1000000] 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 ...
 $ V2: chr [1:1000000] "X0" "X1" "X2" "X3" ...

For comparison using Java directly (includes both generation and reading into 
R):

> system.time(temp <- lapply(J("A")$direct(), .jevalArray))
   user  system elapsed 
  0.962   0.186   0.494 

So the JSON route is very roughly ~13x slower than using Java directly. 
Obviously, this will vary by data set type etc., since there is R overhead 
involved as well: for example, if you have only numeric variables, the JSON 
route is 30x slower on reading alone [50x total]. String variables slow down 
both approaches equally. Interestingly, the JSON encoding uses all 16 cores, 
so the 2.7s real time adds up to over 20s of CPU time; on smaller machines 
you may see more overhead.

If you need process separation, it may be a different story - in principle it 
is faster to use a more native serialization than JSON, since parsing is the 
slowest part for big datasets.

Cheers,
Simon


> On Jan 14, 2016, at 4:52 PM, Thomas Fuller  
> wrote:
> 
> Hi Folks,
> 
> If you need to send data from Java to R you may consider using the
> JDataFrame API -- which is used to convert data into JSON which then
> can be converted into a data frame in R.
> 
> Here's the project page:
> 
> https://coherentlogic.com/middleware-development/jdataframe/
> 
> and here's a partial example which demonstrates what the API looks like:
> 
> String result = new JDataFrameBuilder()
>.addColumn("Code", new Object[] {"WV", "VA", })
>.addColumn("Description", new Object[] {"West Virginia", "Virginia"})
>.toJson();
> 
> and in R script we would need to do this:
> 
> temp <- RJSONIO::fromJSON(json)
> tempDF <- as.data.frame(temp)
> 
> which yields a data frame that looks like this:
> 
>> tempDF
>Description Code
> 1 West Virginia   WV
> 2  Virginia   VA
> 
> It is my intention to deploy this project to Maven Central this week,
> time permitting.
> 
> Questions and comments are welcomed.
> 
> Tom
> 



Re: [Rd] JDataFrame API

2016-01-15 Thread Thomas Fuller
Hi Simon,

Thanks for your feedback -- this is an observation I wasn't considering
when I wrote this, mainly because I am, in fact, working with rather
small data sets. BTW, there is code there; it's under the bitbucket
link -- here's the direct link if you'd still like to look at it:
https://bitbucket.org/CoherentLogic/jdataframe

Re "for practical purposes is doesn't seem like the most efficient
solution" and "So the JSON route is very roughly ~13x slower than
using Java directly."

I've not benchmarked this and will take a closer look at what you have
today -- in fact I may include these details on the JDataFrame page.
The JDataFrame targets the use case where there's significant
development being done in Java and data is exported into R and,
additionally, the developer intends to keep the two separated as much
as possible. I could work with Java directly, but then I potentially
end up with quite a bit of Java code taking up space in R and I don't
like this because if I need to refactor something I have to do it in
two places.

There's another use case for the JDataFrame as well and that's in an
enterprise application (you may have alluded to this when you said
"[i]f you need process separation..."). Consider a business where
users are working with R and the application that produces the data is
actually running in Tomcat. Shipping large amounts of data over the
wire in this example would be a performance destroyer, but for small
data sets it certainly would be helpful from a development perspective
to expose JSON-based web services where the R script would be able to
convert a result into a data frame gracefully.

Tom



Re: [Rd] Multiple cores are used in simple for loop

2016-01-15 Thread Daniel Kaschek

Dear Martyn,


On Fri, Jan 15, 2016 at 4:01 , Martyn Plummer  wrote:

> Alternatively, you may be able to control the maximum number of threads
> by setting and exporting an appropriate environment variable depending
> on what backend you are using, e.g. OPENBLAS_NUM_THREADS or
> MKL_NUM_THREADS.

Thanks a lot. Running

export OPENBLAS_NUM_THREADS=1

in the shell before starting R solves both problems!


Cheers,
Daniel



Re: [Rd] JDataFrame API

2016-01-15 Thread Simon Urbanek

> On Jan 15, 2016, at 12:35 PM, Thomas Fuller  
> wrote:
> 
> Hi Simon,
> 
> Thanks for your feedback. -- this is an observation that I wasn't
> considering when I wrote this mainly because I am, in fact, working
> with rather small data sets. BTW: There is code there, it's under the
> bitbucket link -- here's the direct link if you'd still like to look
> at it:
> 
> https://bitbucket.org/CoherentLogic/jdataframe
> 

Ah, sorry, all links just send you back to the page, so I missed the little 
field that tells you how to check it out.


> Re "for practical purposes is doesn't seem like the most efficient
> solution" and "So the JSON route is very roughly ~13x slower than
> using Java directly."
> 
> I've not benchmarked this and will take a closer look at what you have
> today -- in fact I may include these details on the JDataFrame page.
> The JDataFrame targets the use case where there's significant
> development being done in Java and data is exported into R and,
> additionally, the developer intends to keep the two separated as much
> as possible. I could work with Java directly, but then I potentially
> end up with quite a bit of Java code taking up space in R and I don't
> like this because if I need to refactor something I have to do it in
> two places.
> 

No, the code is the same - it makes no difference. The R code is only one call 
to fetch what you need by calling your Java method. The nice thing is that you 
in fact save some code, since there is no reason to serialize: you can simply 
access all Java objects directly.
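
For illustration, a minimal rJava sketch of that "one call" pattern; the 
jar, class, and method names below are hypothetical:

library(rJava)
.jinit(classpath = "myapp.jar")               # hypothetical jar

## One call into Java; the method returns an Object[] of column arrays.
cols <- J("com.example.DataSource")$fetch()   # hypothetical class/method
temp <- lapply(cols, .jevalArray)             # unwrap each Java array
tempDF <- as.data.frame(temp)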


> There's another use case for the JDataFrame as well and that's in an
> enterprise application (you may have alluded to this when you said
> "[i]f you need process separation..."). Consider a business where
> users are working with R and the application that produces the data is
> actually running in Tomcat. Shipping large amounts of data over the
> wire in this example would be a performance destroyer, but for small
> data sets it certainly would be helpful from a development perspective
> to expose JSON-based web services where the R script would be able to
> convert a result into a data frame gracefully.
> 

Yes, sure, that makes sense. Like I said, I would probably use some native 
format in that case if I were worried about performance. Some candidates that 
come to mind are ProtoBuf and QAP (the serialization used by Rserve). If you 
have arrays, you can always serialize them directly, which may be the most 
efficient, but you'd probably have to write the wrapper for that yourself 
(annoyingly, the default Java methods use big-endian format, which is slower 
on most machines). But then, you're right that for Tomcat applications the 
sizes are small enough that JSON has the benefit that you can inspect the 
payload by eye and/or with other tools very easily.
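
As a rough illustration of the raw-array route on the R side (a sketch 
only: it assumes the Java side writes an int32 length followed by 
little-endian doubles, and the file name is hypothetical):

## Read <int32 length><length x float64>, little-endian, as written by
## a hypothetical Java producer.
con <- file("payload.bin", "rb")
len <- readBin(con, "integer", n = 1, size = 4, endian = "little")
x   <- readBin(con, "double",  n = len, size = 8, endian = "little")
close(con)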

Cheers,
Simon



Re: [Rd] Multiple cores are used in simple for loop

2016-01-15 Thread Henrik Bengtsson
On Fri, Jan 15, 2016 at 10:15 AM, Daniel Kaschek
 wrote:
> Dear Martyn,
>
>
> On Fri, Jan 15, 2016 at 4:01 , Martyn Plummer  wrote:
>>
>> Alternatively, you may be able to control the maximum number of threads
>> by setting and exporting an appropriate environment variable depending
>> on what backend you are using, e.g. OPENBLAS_NUM_THREADS or
>> MKL_NUM_THREADS.
>
> Thanks a lot. Running
>
> export OPENBLAS_NUM_THREADS=1
>
> in the shell before starting R solves both problems!

I don't have builds I can try this on myself, but as an alternative, is
it possible to set this environment variable in ~/.Renviron, or is that
too late in the R startup process? What about
Sys.setenv(OPENBLAS_NUM_THREADS=1) in ~/.Rprofile?

/Henrik



Re: [Rd] JDataFrame API

2016-01-15 Thread Thomas Fuller
Hi Simon,

Aha! I re-read your message and noticed this line:

lapply(J("A")$direct(), .jevalArray)

which I had overlooked earlier. I wrote an example that is very
similar to yours and see what you mean now regarding how we can do
this directly.

Many thanks,

T

groovyScript <- paste(
  "def stringList = [] as java.util.List",
  "def numberList = [] as java.util.List",
  "for (def ctr in 0..99) { stringList << new String(\"TGIF $ctr\"); numberList << ctr; }",
  "def strings = stringList.toArray()",
  "def numbers = numberList.toArray()",
  "def result = [strings, numbers]",
  "return (Object[]) result",
  sep = "\n")

result <- Evaluate(groovyScript = groovyScript)

temp <- lapply(result, .jevalArray)
