On 11/10/2016 7:59 AM, Ryan Utz wrote:
Bob/Duncan,

Thanks for writing. I think some of the things Bob mentioned might work,
but I'm still not quite getting there. Below is the example I'm working
with:


It worked for me when I replaced the browseURL call with a readLines call, as I suggested the other day. What went wrong for you?
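
A minimal sketch of that replacement, for anyone following along. It assumes the server builds the file as soon as the first page is requested and does not depend on cookies. The helper name fetch_after_trigger and the two-second pause are illustrative assumptions, not something taken from the site.

```r
# Hypothetical helper: request the trigger page without opening a browser,
# then read the file that request is assumed to create.
fetch_after_trigger <- function(trigger_url, data_url, wait = 2) {
  # Fetching the page is what matters here; the returned HTML is discarded.
  invisible(readLines(trigger_url, warn = FALSE))
  # Assumed pause so the server can finish writing the file.
  Sys.sleep(wait)
  read.delim(data_url)
}

trigger_url <- "http://pick18.discoverlife.org/mp/20m?plot=2&kind=Hypoprepia+fucosa&site=33.9+-83.3&date1=2011,2012,2013&flags=build_txt:"
data_url <- "http://pick18.discoverlife.org/tmp/Hypoprepia_fucosa_33.9_-83.3_2011,2012,2013.txt"

# Uncomment to run against the live site:
# moth <- fetch_after_trigger(trigger_url, data_url)
```

If the second read still fails, the site is probably tracking state (a cookie, for instance), and a plain readLines call will not be enough.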

Duncan Murdoch

#1
browseURL('http://pick18.discoverlife.org/mp/20m?plot=2&kind=Hypoprepia+fucosa&site=33.9+-83.3&date1=2011,2012,2013&flags=build_txt:')
# This opens the URL and creates a link to machine-readable data on the
# page, which I can then download by simply doing this:

#2
read.delim('http://pick18.discoverlife.org/tmp/Hypoprepia_fucosa_33.9_-83.3_2011,2012,2013.txt')
# This is what I need to read in terms of data, but this URL only exists
# if the URL run above is activated first

So, for example, try running line #2 without the first line: it won't
work. Next run #1 and then #2: works fine.
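
For the batch case with hundreds of species, the two URLs could be generated from the species name. This is only a sketch: the mapping between URL #1 and URL #2 is inferred from the single example in this thread (spaces become '+' in the query and '_' in the file name), and build_urls is a hypothetical helper, so other species or sites may not follow the same pattern.

```r
# Hypothetical helper: build the trigger URL (#1) and the data URL (#2)
# for one species, using the pattern inferred from the example above.
build_urls <- function(species, lat, lon, years,
                       host = "http://pick18.discoverlife.org") {
  yrs <- paste(years, collapse = ",")
  kind_query <- gsub(" ", "+", species, fixed = TRUE)  # e.g. Hypoprepia+fucosa
  kind_file  <- gsub(" ", "_", species, fixed = TRUE)  # e.g. Hypoprepia_fucosa
  list(
    trigger = paste0(host, "/mp/20m?plot=2&kind=", kind_query,
                     "&site=", lat, "+", lon,
                     "&date1=", yrs, "&flags=build_txt:"),
    data = paste0(host, "/tmp/", kind_file, "_", lat, "_", lon, "_",
                  yrs, ".txt")
  )
}

u <- build_urls("Hypoprepia fucosa", 33.9, -83.3, 2011:2013)
u$trigger
u$data
```

Coordinates are pasted as R prints them, so a value such as 34 becomes "34", not "34.0". Passing lat and lon as character strings avoids that if the site expects a fixed format.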

See what I mean?


On Thu, Sep 29, 2016 at 5:09 PM, Bob Rudis <b...@rud.is> wrote:

    The rvest/httr/curl trio can do the cookie management pretty well.
    Make the initial connection via rvest::html_session(); with luck you
    can then keep using other rvest function calls, but in any case curl
    and httr calls will use the cached in-memory handle info seamlessly.
    You'd need to store and retrieve cookies if you need them preserved
    between R sessions.

    Failing the above and assuming this would not need to be lightning
    fast, use the phantomjs or firefox web driver (either with RSelenium
    or some new stuff rOpenSci is cooking up) which will then do what
    browsers do best and maintain all this state for you. You can still
    slurp the page contents up with xml2::read_html() and use the super
    handy processing idioms in the scraping tidyverse (it needs its own
    name).

    A concrete example (assuming the URLs aren't sensitive) would enable
    me or someone else to mock up something for you.


    On Thu, Sep 29, 2016 at 4:59 PM, Duncan Murdoch
    <murdoch.dun...@gmail.com> wrote:

        On 29/09/2016 3:29 PM, Ryan Utz wrote:

            Hi all,

            I've got a situation that involves activating a URL so that a
            link to some data becomes available for download. I can easily
            use 'browseURL' to do so, but I'm hoping to make this
            batch-process-able, and I would prefer to not have 100s of
            browser windows open when I go to download multiple data sets.

            Here's the example:

            #1
            browseURL('http://pick18.discoverlife.org/mp/20m?plot=2&kind=Hypoprepia+fucosa&site=33.9+-83.3&date1=2011,2012,2013&flags=build_txt:')
            # This opens the URL and creates a link to machine-readable
            # data on the page, which I can then download by simply doing
            # this:

            #2
            read.delim('http://pick18.discoverlife.org/tmp/Hypoprepia_fucosa_33.9_-83.3_2011,2012,2013.txt')

            However, I can only get the second line above to work if the
            thing in line #1 has been opened in a browser already. Is there
            any way to allow me to either 1) close the browser after it's
            been opened or 2) execute the line #2 above without having to
            open a browser? We have hundreds of species that you can see
            after the '&kind=' bit of the URL, so I'm trying to keep the
            browsing situation sane.

            Thanks!
            R


        You'll need to figure out what happens when you open the first
        page. Does it set a cookie?  Does it record your IP address?
        Does it just build the file but record nothing about you?

        If it's one of the simpler versions, you can just read the first
        page, wait a bit, then read the second one.

        If you need to manage cookies, you'll need something more
        complicated. I don't know the easiest way to do that.

        Duncan Murdoch


        ______________________________________________
        R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
        https://stat.ethz.ch/mailman/listinfo/r-help
        PLEASE do read the posting guide
        http://www.R-project.org/posting-guide.html
        and provide commented, minimal, self-contained, reproducible code.





--

Ryan Utz, Ph.D.
Assistant professor of water resources
*chatham**UNIVERSITY*
Home/Cell: (724) 272-7769

