Hi! Thanks for the code examples. I'll try to elaborate a bit here.
If you paste this:

http://scholar.google.fi/scholar?hl=fi&oe=ASCII&q=Frank+Harrell

into your browser, you'll get the citations, and each citation has a link ("Import into EndNote") for exporting the citation in EndNote format. If you save this HTML file, you'll get the links (they contain the string "info:") pointing to these EndNote files. For example:

<a href="/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0">Import into EndNote</a>

Now, if you do this in R:

library(RCurl)
curl = getCurlHandle()
z = getForm("http://scholar.google.com/scholar", q ='Frank Harrell', hl = 'en', btnG = 'Search', oe="ASCII",
            .opts = list(verbose = TRUE), curl = curl)

the object z does not contain any links with "info:" in them:

grep("info:", z)
integer(0)

Fortunately there is a "related:" link that gives us the same ID (U6Gfb4QPVFMJ) as the "info:" link above:

substr(z, gregexpr("related:", z)[[1]]+8, gregexpr("related:", z)[[1]]+19)
[1] "U6Gfb4QPVFMJ"

Checking against the Google Scholar page, the correct format for the EndNote query would appear to be:

http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0

You can copy and paste this link into your browser and save the EndNote reference as a file. Yet, when this link is constructed in R:

getURL(paste("http://scholar.google.fi/scholar.enw?q=info:",
             substr(z, gregexpr("related:", z)[[1]]+8, gregexpr("related:", z)[[1]]+19),
             ":scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0", sep=""),
       curl=curl)

the result is an HTML file containing a "403 Forbidden" error.

Unfortunately, this kind of functionality also seems to be missing from the Google API (thanks to Peter Konings for the link):

http://code.google.com/p/google-ajax-apis/issues/detail?id=109

- Jarno

2009/9/18 Duncan Temple Lang <dun...@wald.ucdavis.edu>

>
> Hi Jarno
>
> You've only told us half the story.
> You didn't show how you
>  i) performed the original query
>  ii) retrieved the URL you used in subsequent queries
>
> But I can suggest two possible problems.
>
> a) specifying the cookiejar option tells libcurl where to write the
> cookies that the particular curl handle has collected during its life.
> These are written when the curl handle is destroyed.
> So that wouldn't change the getURL() operation, just change what happens
> when the curl handle is destroyed.
>
> b) You probably mean to use cookiefile rather than cookiejar so that
> the curl request would read existing cookies from a file.
> But in that case, how did that file get created with the correct cookies?
>
> c) libcurl will collect cookies in a curl handle as it receives them from a
> server as part of a response. And it will use these in subsequent requests
> to that server.
> But you must be using the same curl handle. Different curl handles are
> entirely independent (unless one is copied from another).
> So a possible solution may be that you need to do the initial query with
> the same curl handle.
>
> So I would try something like
>
> curl = getCurlHandle()
> z = getForm("http://scholar.google.com/scholar", q ='Frank Harrell', hl = 'en', btnG = 'Search',
>             .opts = list(verbose = TRUE), curl = curl)
>
> dd = htmlParse(z)
> links = getNodeSet(dd, "//a[@href]")
>
> # do something to identify the link you want
>
> tmp = getURL(linkIWant, curl = curl)
>
> Note that we are using the same curl object in both requests.
>
> This may not do what you want, but if you let us know the details
> about how you are doing the preceding steps, we should be able to sort
> things out.
>
> D.
>
> Jarno Tuimala wrote:
> > Hi!
> >
> > I've performed a Google Scholar Search using a query, let's say "Frank
> > Harrell", and parsed the links to the EndNote references from the resulting
> > HTML code. Now I'd like to download all the references automatically.
> > For this, I have tried to use RCurl, but I can't seem to get it working: I
> > always get error code "403 Forbidden" from the web server.
> >
> > Initially I tried to do this without using cookies:
> >
> > library(RCurl)
> > getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
> >
> > or
> >
> > getURLContent("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
> > Error: Forbidden
> >
> > and then with cookies:
> >
> > getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0",
> >        .opts=list(cookiejar="cookiejar.txt"))
> >
> > But they both consistently fail the same way. What am I doing wrong?
> >
> > sessionInfo()
> > R version 2.9.0 (2009-04-17)
> > i386-pc-mingw32
> >
> > locale:
> > LC_COLLATE=Finnish_Finland.1252;LC_CTYPE=Finnish_Finland.1252;LC_MONETARY=Finnish_Finland.1252;LC_NUMERIC=C;LC_TIME=Finnish_Finland.1252
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] RCurl_0.98-1 bitops_1.0-4.1
> >
> > Thanks!
> > Jarno
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
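
As a side note, the ID extraction and URL construction from my reply at the top can be sketched in base R without touching the network. This is only a rough sketch: the function names extract_related_ids() and build_endnote_url() are made up for illustration, sample_html is an invented fragment standing in for z, and I am assuming that the IDs are always 12 characters long, as in the one example I have. It does nothing about the 403 response itself.

```r
# Hypothetical helper: return every ID that follows "related:" in a string of
# HTML. Assumes IDs are exactly 12 characters from [A-Za-z0-9_-].
extract_related_ids <- function(html) {
  hits <- gregexpr("related:[A-Za-z0-9_-]{12}", html)
  found <- regmatches(html, hits)[[1]]
  unique(sub("^related:", "", found))
}

# Hypothetical helper: build the EndNote export URL in the format copied from
# the Google Scholar page. The host and query string are taken from that one
# example and may not hold in general.
build_endnote_url <- function(id, host = "http://scholar.google.fi") {
  if (length(id) == 0) return(character(0))
  paste0(host, "/scholar.enw?q=info:", id,
         ":scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0")
}

# Invented HTML fragment standing in for the search result `z`.
sample_html <- '<a href="/scholar?q=related:U6Gfb4QPVFMJ:scholar.google.com/&hl=en">Related articles</a>'

ids <- extract_related_ids(sample_html)
urls <- build_endnote_url(ids)
print(urls)
```

With a real search result, extract_related_ids(z) would take the place of the substr()/gregexpr() call, and each element of urls could be passed to getURL() with the same curl handle that performed the search.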