Hi!

Thanks for the code examples. I'll try to elaborate a bit here.

If you paste this:

http://scholar.google.fi/scholar?hl=fi&oe=ASCII&q=Frank+Harrell

 into your browser, you'll get the citations, and each citation lists a link
("Import into EndNote") for exporting the citation in EndNote format.

And if you save this HTML file, you'll find the links (they contain the string
"info:") pointing to these EndNote files.  For example:

<a href="/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&amp;output=citation&amp;hl=en&amp;oe=ASCII&amp;oe=ASCII&amp;ct=citation&amp;cd=0">Import into EndNote</a>
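For anyone scripting this step, the "info:" links can be pulled out of a saved
results page with base R regular expressions. A rough sketch (the file name
scholar.html and the exact href pattern are my assumptions from the example
above; regmatches() requires R >= 2.14):

```r
# Read the saved search-results page as one string
page <- paste(readLines("scholar.html", warn = FALSE), collapse = "\n")

# Extract every href that points at an EndNote export
# (pattern assumed from the example link above)
m <- gregexpr("/scholar\\.enw\\?q=info:[^\"]+", page)
enw_links <- regmatches(page, m)[[1]]
enw_links
```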

Now, if you do this in R:

library(RCurl)
curl = getCurlHandle()
z = getForm("http://scholar.google.com/scholar", q = 'Frank Harrell', hl = 'en',
            btnG = 'Search', oe = "ASCII", .opts = list(verbose = TRUE),
            curl = curl)

the resulting object z does not contain any "info:" links:

grep("info:", z)
integer(0)

Fortunately there is a "related:" link that gives us the same ID
(U6Gfb4QPVFMJ) as the "info:" link above:

substr(z, gregexpr("related:", z)[[1]]+8, gregexpr("related:", z)[[1]]+19)
[1] "U6Gfb4QPVFMJ"
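Note that gregexpr() returns the positions of all matches, while substr() only
uses the first start/stop pair, so the line above silently extracts just the
first ID. To collect every ID on the page, substring() (which is vectorized
over its start/stop arguments) works better; a sketch, assuming the IDs are
always 12 characters long:

```r
# Positions of the first character after each "related:" occurrence
starts <- gregexpr("related:", z)[[1]] + 8
# Pull out each 12-character ID and drop duplicates
ids <- unique(substring(z, starts, starts + 11))
ids
```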

Now, checking against the Google Scholar page, the correct format for the
EndNote query appears to be:

http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0

You can copy and paste this link into your browser and save the EndNote
reference as a file.

Yet, when this link is constructed in R:

getURL(paste("http://scholar.google.fi/scholar.enw?q=info:",
             substr(z, gregexpr("related:", z)[[1]] + 8,
                    gregexpr("related:", z)[[1]] + 19),
             ":scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0",
             sep = ""),
       curl = curl)

the result is an HTML page containing a "403 Forbidden" error.
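One guess (not verified): Scholar may reject requests that don't look like they
come from a browser, in which case the User-Agent and Referer headers could
matter as much as the cookies. Something like this might be worth trying,
reusing the same curl handle that performed the search (the enwURL variable and
the header values are assumptions on my part):

```r
library(RCurl)

# The EndNote export URL constructed above (assumed)
enwURL <- "http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=en&oe=ASCII&ct=citation&cd=0"

# Send browser-like headers and reuse the handle from the initial query,
# so any cookies collected during the search are sent along too
ref <- getURL(enwURL, curl = curl,
              .opts = list(useragent = "Mozilla/5.0",
                           referer   = "http://scholar.google.com/scholar",
                           verbose   = TRUE))
```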

However, this type of functionality seems to be missing from the Google API
(thanks to Peter Konings for the link):

http://code.google.com/p/google-ajax-apis/issues/detail?id=109

- Jarno

2009/9/18 Duncan Temple Lang <dun...@wald.ucdavis.edu>

>
> Hi Jarno
>
> You've only told us half the story. You didn't show how you
> i) performed the original query
> ii) retrieved the URL you used in subsequent queries
>
>
> But I can suggest two possible problems.
>
> a) specifying the cookiejar option tells libcurl where to write the
>   cookies that the particular curl handle has collected during its life.
>   These are written when the curl handle is destroyed.
>   So that wouldn't change the getURL() operation, just change what happens
>   when the curl handle is destroyed.
>
> b) You probably mean to use cookiefile rather than cookiejar so that
>   the curl request would read existing cookies from a file.
>   But in that case, how did that file get created with the correct cookies.
>
> c) libcurl will collect cookies in a curl handle as it receives them from a
> server
>   as part of a response. And it will use these in subsequent requests to
> that server.
>   But you must be using the same curl handle.  Different curl handles are
> entirely
>   independent (unless one is copied from another).
>   So a possible solution may be that you need to do the initial query with
> the same
>   curl handle
>
>
> So I would try something like
>
> curl = getCurlHandle()
> z = getForm("http://scholar.google.com/scholar", q ='Frank Harrell', hl =
> 'en', btnG = 'Search',
>              .opts = list(verbose = TRUE), curl = curl)
>
> dd = htmlParse(z)
> links = getNodeSet(dd, "//a[@href]")
>
> # do something to identify the link you want
>
> tmp = getURL(linkIWant, curl = curl)
>
>
> Note that we are using the same curl object in both requests.
>
>
> This may not do what you want, but if you let us know the details
> about how you are doing the preceding steps, we should be able to sort
> things out.
>
>  D.
>
>
> Jarno Tuimala wrote:
> > Hi!
> >
> > I've performed a Google Scholar Search using a query, let's say "Frank
> > Harrell", and parsed the links to the EndNote references from the
> resulting
> > HTML code. Now I'd like to download all the references automatically. For
> > this, I have tried to use RCurl, but I can't seem to get it working: I
> > always get error code "403 Forbidden" from the web server.
> >
> > Initially I tried to do this without using cookies:
> >
> > library(RCurl)
> > getURL("
> >
> http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0
> > ")
> >
> > or
> >
> > getURLContent("
> >
> http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0
> > ")
> > Error: Forbidden
> > and then with cookies:
> >
> >  getURL("
> >
> http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0
> ",
> > .opts=list(cookiejar="cookiejar.txt"))
> >
> > But they both consistently fail the same way. What am I doing wrong?
> >
> > sessionInfo()
> > R version 2.9.0 (2009-04-17)
> > i386-pc-mingw32
> > locale:
> >
> LC_COLLATE=Finnish_Finland.1252;LC_CTYPE=Finnish_Finland.1252;LC_MONETARY=Finnish_Finland.1252;LC_NUMERIC=C;LC_TIME=Finnish_Finland.1252
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> > other attached packages:
> > [1] RCurl_0.98-1   bitops_1.0-4.1
> >
> > Thanks!
> > Jarno
> >
>  >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
