Hi!

Santosh Srinivas wrote:
Dear R-helpers,

Considering that a substantial part of analysis is related to data
manipulation, I'm just wondering whether I should do the basic data part in a
database server (currently I have the data in a .txt file).
For this purpose, I am planning to use MySQL. Is MySQL a good way to go
about it? Are there any anticipated problems that I need to be aware of?

I'm afraid I have no real answers to your questions, only more questions and, perhaps, another point of view from a different world!
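
That said, for what it's worth: talking to MySQL from R is painless through the DBI interface plus the RMySQL package, so choosing MySQL would not lock you out of R at all. A minimal sketch, where the connection details and the table and column names are invented for illustration:

## Hypothetical MySQL connection; adjust names to your own setup
library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "markets", user = "analyst",
                 password = "secret", host = "localhost")

## Pull only the rows you need into R, not the whole table
prices <- dbGetQuery(con,
    "SELECT day, ticker, close FROM daily_prices
     WHERE day >= '2010-01-01'")

dbDisconnect(con)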


Considering that many users here use large datasets: do you typically store
the data in databases and query relevant portions for your analysis?
Does it speed up the entire process? Is it neater to do things in a
database? (For example, errors could be corrected at the data import stage
itself, by constraints defined on the data in the database, rather than being
discovered only when you do the analysis in R and realize something is
wrong in the output.)

Please, what do you mean by "large datasets"?

I wouldn't consider only processing speed, but also how the global repository of data is constructed. I mean, the problem is not only access speed now, but being able to identify any given set of data in the future. For that, an RDBMS could be as useful as a hierarchical folder structure plus well-designed file names.
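
For instance, with a disciplined naming scheme such as <instrument>_<project>_<date>.txt (purely an invented example), plain R can already locate any dataset without a database:

## Hypothetical layout: data/<year>/<instrument>_<project>_<YYYY-MM-DD>.txt
files <- list.files("data",
                    pattern = "^hplc_proj42_2010-11-[0-9]{2}\\.txt$",
                    recursive = TRUE, full.names = TRUE)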


This is vis-à-vis using the built-in SQLite, indexing, etc., capabilities in
R itself? Does performance work better with a database backend (especially
for simple but large datasets)?

As you said, R itself has powerful tools for data filtering and rearrangement. I have only seen problems in genomics analyses, where an external tool was required to manage some huge matrices; and that was some time ago, with patches already on their way to solve the problem within R.
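
For moderate sizes the base tools go a long way, and the RSQLite package gives a file-backed database with no server to administer if you want to try the database route cheaply first. A rough sketch of both approaches, with invented file and column names:

## Filtering and rearranging in plain R
dat <- read.table("prices.txt", header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)
## ISO dates ("YYYY-MM-DD") sort and compare correctly as strings
recent <- subset(dat, day >= "2010-01-01" & volume > 0)
recent <- recent[order(recent$day), ]

## The same selection through a file-backed SQLite database
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "prices.db")
dbWriteTable(con, "prices", dat, overwrite = TRUE)
recent2 <- dbGetQuery(con,
    "SELECT * FROM prices
     WHERE day >= '2010-01-01' AND volume > 0
     ORDER BY day")
dbDisconnect(con)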

What I see here, with data from experimental designs pouring into Excel spreadsheets and analytical facilities generating big (around 1 GB/day) plain-text files, is that we have so much variability in model structure that it would be quite expensive to program interfaces for all these processes to store their data in a central repository managed by an RDBMS.
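
For those big plain-text dumps, at least, one trick that keeps R usable is to read and reduce the file in chunks through a connection, so the whole gigabyte never sits in memory at once. A sketch, assuming a headerless tab-separated file with an invented name:

## Read a large plain-text file in chunks of 100,000 lines
con <- file("instrument_dump.txt", open = "r")
while (length(lines <- readLines(con, n = 100000)) > 0) {
    tc <- textConnection(lines)
    chunk <- read.table(tc, sep = "\t", stringsAsFactors = FALSE)
    close(tc)
    ## ... summarise or store this chunk, keeping only the reduction ...
}
close(con)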

Even worse! Over the last few years some of our information has been moved to object-oriented databases, so the problem is becoming even more complex.


The financial applications that I am thinking of are not exactly real-time,
but quick response and fast performance would definitely help.

As an aside, I want to take things to a cloud environment at some point,
just because it will be easier and cheaper to deliver.

Kind of an open question, but any input will help.

As you see, there are no answers here, only more doubts, as I'm in a similar situation. So any ideas will be extremely welcome to us!

Thanks!



--
Ricardo Rodríguez
Your XEN ICT Team

