Re: [R] RJDBC vs RMySQL vs ???

James W. MacDonald Thu, 24 Jun 2010 07:33:05 -0700

Hi Ralf,

Ralf B wrote:

Sorry for the lack of details. Since I run the same SQL first directly
on MySQL (using the MySQL Query Browser) and then again using R
through the RJDBC interface, I assume that I won't simply have a badly
constructed SQL query. However, just to clear possible objection, here
the SQL:



# Extracts vector of data points
getData <- function(connection) {
        queryStart <- "SELECT id1, id2, x, y FROM `mytable` "
        queryEnd <- ";"
        query <- paste(queryStart, " WHERE id1 IN(", id1s, ") AND id2 IN(",
id2s, ") AND subtype='TYPE1'", queryEnd)
        # execute query
        data =  dbGetQuery(connection, query)
        return(data)
}

When running this method using either RGUI or the command line, I have
a runtime that reaches an incredible 10 minutes (!) for selecting
about 50k - 80k data points (which I consider not much) based on the
range of IDs I choose. The table size is about 5-8 million data points
total. The same SQL query directly executed in MySQL Query Browser
takes about 20 seconds which I would consider fine. There are no
indices created for any of the fields but since the query runs a lot
faster in the query browser I don't suspect this to be the main
reason.

Any ideas?


Well, the RJDBC rforge page has this note:

Note: The current implementation of RJDBC is done entirely in R, no Javacode is used. This means that it may not be extremely efficient andcould be potentially sped up by using Java native code. However, it wassufficient for most tasks we tested. If you have performance issues withRJDBC, please let us know and tell us more details about your test case.

And from my quick peek at the page, it appears RJDBC is designed toallow one to query any DBMS. Since RMySQL is MySQL-specific, it may bemore efficient. Anyway, why don't you just try it and see?


Best,

Jim


Best,
Ralf




On Wed, Jun 23, 2010 at 4:36 PM, James W. MacDonald
<[email protected]> wrote:

Hi Ralf,

Ralf B wrote:

I am running a simple SQL SELECT statement that involvs 50k + data
points using R and the RJDBC interface. I am facing very slow response
times in both the RGUI and the R console. When running this SQL
statement directly in a SQL client I have processing times that are a
lot lot faster (which means that the SQL statement itself is not the
problem).

Did any of you compare RJDBC vs RMySQL or is there a better, more
efficient way to extract large data from databases using R? Would you
recommend dumping data out completely into flat files and working with
flat files instead? I expected that this would not be such a problem
given that businesses maintain their data in DBs and R is supposed to
be good in shifting around data. Am I doing something wrong?

Well, if you don't show people what you have done, how can anybody tell if
you are doing something wrong or not?

I have no experience with RJDBC, so cannot say anything about that. However,
I have always found RMySQL to be speedy enough. As an example:

library(RMySQL)

Loading required package: DBI

con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user =
"genome", dbname = "hg18")
system.time(a <- dbGetQuery(con, "select name, chromEnd from snp129 where
chrom='chr1' and chromStart between 1 and 1e8;")

+ )
  user  system elapsed
  7.95    0.06   38.59

dim(a)

[1] 508676      2

So 40 seconds to get half a million records. Since this is via the internet,
I have to imagine things would be much faster querying a local DB.

But then you never say what constitutes 'slow' for you, so maybe this is
slow as well?

Best,

Jim

Ralf

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be
used for urgent or sensitive issues


--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************

Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] RJDBC vs RMySQL vs ???

Reply via email to