At first, I thought that RDF and SDMX were two competing standards and
was disheartened, but it appears that there is collaboration between
them, yay!
http://groups.google.com/group/publishing-statistical-data/browse_thread/thread/531b1b5a73397c1c?pli=1
On 01/15/2012 08:31 AM, Benjamin Weber wrote:
Yes, R-devel would be the right mailing list for this discussion.
As some people pointed out, the problem definition is vague. This was
deliberate, to encourage people to share their *different* perceptions
of the problem and to reach, to some extent, a consensus.
My starting point was my own experience, so in that sense I may indeed
be an egocentric person. I agree on that.
There are a lot of other egocentric people who download R and just
want their results ASAP. That's reality.
The same holds for each and every special interest group (where
each and every member has a special interest).
Everyone cares only about their own needs. That is the systematic issue
we have to overcome by working together to simplify everyone's
individual situation. Ultimately we should reach a win-win situation
for all. That is my notion.
What I wanted to point out was, more or less, the process of
statistical research:
1. Set up your research objective
2. Find the right data (time intensive)
3. Download the right format
4. Import it, make it compatible, clean it up
5. Work with it
6. Get your results
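A minimal sketch in R of what steps 3 to 5 typically look like today; the file contents and column names here are invented for illustration, and the local CSV stands in for a download from a publisher:

```r
# Steps 3/4: in practice you would download.file() from a publisher's URL;
# here we write an illustrative CSV locally so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,gdp",
             "DE,2010,3417",
             "DE,2011,3758",
             "FR,2010,2647"), csv_path)

# Step 4: import and clean up (coerce types explicitly)
raw <- read.csv(csv_path, stringsAsFactors = FALSE)
raw$year <- as.integer(raw$year)
raw$gdp  <- as.numeric(raw$gdp)

# Steps 5/6: work with it and get a result
mean_gdp <- tapply(raw$gdp, raw$country, mean)
```

Every one of these lines is boilerplate that varies per publisher and per format, which is exactly the friction steps 2 to 4 describe.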
The more integrative your research objective, the more time you spend
on steps 1 to 3, and in most cases those steps take up most of the
time. Some people will give up due to lack of time or simply because
the data is inaccessible.
I highly appreciate that a lot of people participated in this
discussion, that the publishers themselves now address the problem
(just take a look at [1]), and that some people are working on it in
the R world (e.g. TSdbi).
Reality is better than I initially perceived, but it is not yet as it should be.
Benjamin
[1] http://sdmx.org/wp-content/uploads/2011/10/SDMX-Action-Plan-2011_2015.pdf
On 15 January 2012 13:15, Prof Brian Ripley<rip...@stats.ox.ac.uk> wrote:
On 14/01/2012 18:51, Joshua Wiley wrote:
I have been following this thread, but there are many aspects of it
which are unclear to me. Who are the publishers? Who are the users?
What is the problem? I have a vague sense for some of these, but it
seems to me like one valuable starting place would be creating a
document that clarifies everything. It is easier to tackle a concrete
problem (e.g., agree on a standard numerical representation of dates
and times a la ISO 8601) than something diffuse (e.g., information
overload).
Let alone something as vague as 'the future of R' (for which the R-devel
list is the appropriate one). I believe the original poster is being
egocentric: as someone said earlier, she has never had need of this concept,
and I believe that is true of the vast majority of R users.
The development of R per se is primarily driven by the needs of the core
developers and those around them. Other R communities have set up their
own special-interest groups and sets of packages, and that would seem the
way forward here.
Good luck,
Josh
On Sat, Jan 14, 2012 at 10:02 AM, Benjamin Weber<m...@bwe.im> wrote:
Mike
We see that the publishers are aware of the problem. They don't think
that the raw data is usable for the user, and they acknowledge this
fact with their proprietary formats. Yes, they have resigned
themselves to the information overload. That's pathetic.
It is not a question of *which* data format, it is a question about
the general concept. Where do publisher and user meet? There has to be
one *defined* point which all parties agree on. I disagree with your
statement that the publisher should just publish csv or cook up his own
API. That leads to fragmentation and inaccessibility of data. We want
data to be accessible.
A more pragmatic approach is needed to revolutionize the way we go
about raw data.
Benjamin
On 14 January 2012 22:17, Mike Marchywka<marchy...@hotmail.com> wrote:
LOL, I remember posting about this in the past. The US gov agencies vary
but most are quite good. The big problem appears to be people who push
proprietary or commercial "standards" for which only one effective source
exists. Some formats, like Excel and PDF, come to mind, and there is a
disturbing trend towards their adoption in some places where raw data is
needed by many. The best thing to do is contact the information provider and
let them know you want raw data, not images or stuff that works in limited
commercial software packages. Often data sources are valuable and the revenue
model impacts availability.
If you are just arguing over different open formats, it is usually easy
for someone to write some conversion code and publish it - CSV to JSON would
not be a problem, for example. Data of course are quite variable and there is
nothing wrong with giving the provider his choice.
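As a rough illustration of how small such conversion code can be, here is a base-R sketch that turns a data frame (as read from CSV) into a JSON array of records; it is deliberately minimal and assumes plain numeric and character columns with no special characters needing escaping:

```r
# Convert a data frame (e.g. from read.csv) to a JSON array of objects.
# Minimal sketch: assumes character/numeric columns and no quote escaping.
df_to_json <- function(df) {
  rows <- vapply(seq_len(nrow(df)), function(i) {
    fields <- vapply(names(df), function(col) {
      v <- df[[col]][i]
      val <- if (is.numeric(v)) format(v) else paste0('"', v, '"')
      paste0('"', col, '":', val)
    }, character(1))
    paste0("{", paste(fields, collapse = ","), "}")
  }, character(1))
  paste0("[", paste(rows, collapse = ","), "]")
}

df <- data.frame(id = c(1, 2), name = c("a", "b"), stringsAsFactors = FALSE)
json <- df_to_json(df)
# json is now '[{"id":1,"name":"a"},{"id":2,"name":"b"}]'
```

A production version would of course use a tested package rather than hand-rolled string building, but the point stands: open-to-open format conversion is cheap.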
----------------------------------------
Date: Sat, 14 Jan 2012 10:21:23 -0500
From: ja...@rampaginggeek.com
To: r-help@r-project.org
Subject: Re: [R] The Future of R | API to Public Databases
Web services are only part of the problem. In essence, there are at
least two facets:
1. downloading the data using some protocol
2. mapping the data to a common model
Having #1 makes the import/download easier, but it really becomes useful
when both are included. I think #2 is the harder problem to address.
Software can usually be written to handle #1 by making a useful
abstraction layer. #2 means that data has consistent names and meanings,
and this requires people to agree on common definitions and a common
naming convention.
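A small sketch of what facet #2 looks like in practice: two publishers name the same concepts differently, and a shared rename map brings both into one common model. All the source and column names below are invented for illustration:

```r
# Common model: every source must end up with columns country, year, value.
# The per-source rename maps are hypothetical examples.
rename_maps <- list(
  source_a = c(iso = "country", yr = "year", obs_value = "value"),
  source_b = c(REF_AREA = "country", TIME_PERIOD = "year", OBS_VALUE = "value")
)

to_common_model <- function(df, source) {
  map <- rename_maps[[source]]
  names(df) <- ifelse(names(df) %in% names(map), map[names(df)], names(df))
  df[, c("country", "year", "value")]
}

a <- data.frame(iso = "DE", yr = 2011, obs_value = 3758)
b <- data.frame(REF_AREA = "DE", TIME_PERIOD = 2011, OBS_VALUE = 3758)
```

The mechanics are trivial; the hard part, as said above, is getting publishers to agree on what the common names and meanings should be in the first place.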
RDF (Resource Description Framework) and its related technologies
(SPARQL, OWL, etc) are one of the many attempts to try to address this.
While this effort would benefit R, I think it's best if it's part of a
larger effort.
Services such as DBpedia and Freebase are trying to unify many data
sets using RDF.
The task view and package ideas are great ideas. I'm just adding
another perspective.
Jason
On 01/13/2012 05:18 PM, Roy Mendelssohn wrote:
HI Benjamin:
What would make this easier is if these sites used standardized web
services, so the code would only need to be written once. data.gov is
the worst example; they spun up their own, weak service.
There is a lot of environmental data available through OPeNDAP, and
that is supported in the ncdf4 package. My own group has a service called
ERDDAP that is entirely RESTful, see:
http://coastwatch.pfel.noaa.gov/erddap
and
http://upwell.pfeg.noaa.gov/erddap
We provide R (and Matlab) scripts that automate the extraction for
certain cases, see:
http://coastwatch.pfeg.noaa.gov/xtracto/
We also have a tool called the Environmental Data Connector (EDC) that
provides a GUI from within R (and ArcGIS, Matlab and Excel) that allows you to
subset data that is served by OPeNDAP, ERDDAP, certain Sensor Observation
Service (SOS) servers, and have it read directly into R. It is freely
available at:
http://www.pfeg.noaa.gov/products/EDC/
We can write such tools because the service is either standardized
(OPeNDAP, SOS) or is easy to implement (ERDDAP).
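Because ERDDAP is plain REST, a request is just a URL you can assemble and hand to read.csv(). A sketch of building such a tabledap-style request; the dataset id, variable names and constraint below are placeholders, not a real dataset:

```r
# Build a tabledap-style request URL for an ERDDAP server.
# Dataset id, variables and constraint are placeholders for illustration.
erddap_url <- function(base, dataset, vars, constraint) {
  paste0(base, "/tabledap/", dataset, ".csv?",
         paste(vars, collapse = ","), "&", constraint)
}

u <- erddap_url("http://upwell.pfeg.noaa.gov/erddap", "someDatasetID",
                c("time", "latitude", "sst"), "time>=2012-01-01")
# read.csv(u) would then fetch the subset directly into R
```

That the whole request fits in one string-building helper is exactly what a standardized or easy-to-implement service buys you.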
-Roy
On Jan 13, 2012, at 1:14 PM, Benjamin Weber wrote:
Dear R Users -
R is a wonderful software package. CRAN provides a variety of tools to
work on your data. But R is not apt to utilize all the public
databases in an efficient manner.
I observed that the most tedious part of working with R is searching
for and downloading the data from public databases and putting it into
the right format. I could not find a package on CRAN which offers
exactly this fundamental capability.
Imagine R as the unified interface to access (and analyze) all public
data in the easiest way possible. That would create a real impact,
would move R a big leap forward and would enable us to see the world
with different eyes.
There is a lack of direct connections to the APIs of these databases,
to name a few:
- Eurostat
- OECD
- IMF
- Worldbank
- UN
- FAO
- data.gov
- ...
Ease of access to the data is the key to information processing with
R. How can we handle the flow of information noise? R has to answer
that with an extensive API to public databases.
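One way to picture such a unified layer: a single front-end function that dispatches to per-publisher back-ends. Everything below, the registry names, the URLs, the request shape, is hypothetical, sketched only to show the idea, not any real publisher's API:

```r
# Hypothetical registry of publisher back-ends; the URLs are illustrative only.
backends <- list(
  worldbank = list(base = "http://example.org/worldbank-api"),
  oecd      = list(base = "http://example.org/oecd-api")
)

# Unified entry point: the user names a source and a series, and the layer
# builds the request. (A real version would also fetch and parse the
# response into a common data model.)
public_data_url <- function(source, series) {
  b <- backends[[source]]
  if (is.null(b)) stop("unknown source: ", source)
  paste0(b$base, "/series/", series)
}

u <- public_data_url("worldbank", "gdp")
```

The user-facing call stays identical across publishers; only the registry grows as new back-ends are contributed, which is where a special-interest group could do its work.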
I would love your comments and ideas as contributions to a vital
discussion.
Benjamin
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
**********************
"The contents of this message do not reflect any position of the U.S.
Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
1352 Lighthouse Avenue
Pacific Grove, CA 93950-2097
e-mail: roy.mendelss...@noaa.gov (Note new e-mail address)
voice: (831)-648-9029
fax: (831)-648-8440
www: http://www.pfeg.noaa.gov/
"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected"
"the arc of the moral universe is long, but it bends toward justice"
-MLK Jr.
--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595