Scraping multiple web pages help

2019-02-18 Thread Drake Gossi
Hello everyone,

For a research project, I need to scrape a lot of comments from
regulations.gov

https://www.regulations.gov/docketBrowser?rpp=25&so=DESC&sb=commentDueDate&po=0&dct=PS&D=ED-2018-OCR-0064

But part of what's throwing me is the URL addresses of the comments. They
aren't consistent. There is some consistency insofar as the numbers that
differentiate the pages all begin after that 0064 number in the URL listed
above, but the differentiating numbers aren't even all the same length.
Some are 4 digits (say, 4019) whereas others are 5 (say, 50343). I don't
think they go over 5, though. So this is a problem: I don't know how to
write the code to access the multiple pages.
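
For what it's worth, the loop itself is the easy part once the numbers are
known. Here is a minimal sketch, assuming the comment pages follow the
document?D=<docket>-<number> pattern; the suffixes below are made up for
illustration:

import requests

docket = "ED-2018-OCR-0064"
# Made-up suffixes; since they aren't a contiguous range, the real
# ones would have to come from a listing page or the API.
suffixes = [4019, 50343]

for n in suffixes:
    url = f"https://www.regulations.gov/document?D={docket}-{n}"
    response = requests.get(url)
    print(url, response.status_code)

The hard part is enumerating the real suffixes, which is exactly what the
API mentioned below is for.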

I should also mention I'm new to programming, so that's also a problem (if
you can't already tell by the way I'm describing my problem).


I should also mention that, I think, there's an API on regulations.gov, but
I'm such a beginner that I don't even really know where to find it, or even
what to do with it once I do. That's how helpless I am right now.

Any help anyone could offer would be much appreciated.

D
-- 
https://mail.python.org/mailman/listinfo/python-list


trying to begin a code for web scraping

2019-02-18 Thread Drake Gossi
Hi everyone,

I'm trying to write code to scrape this website (regulations.gov) of its
comments, but I'm having trouble figuring out what to latch onto in the
Inspect pane (what I see when I right-click and choose Inspect).

Although I need to write code to scrape all 11,000ish of the comments
related to this event (by putting the code in a loop?), I'm still at the
stage of looking at individual comments. So, for example, with this
comment, I know enough to right-click, choose Inspect, and look at the
XML? (This is how much of a beginner I am: what am I looking at when I
right-click Inspect?) Then I Ctrl-F to find where the comment sits in the
code. For that comment, the word I searched for was "troubling," and I
found the comment buried in the XML.

But my issue is this: I don't know what to latch onto to scrape the comment
(and I assume that this same sequence of letters would apply to scraping
all of the comments in general). I assume what I grab is GIY1LSJISD. I'm
watching a video in which the person keys on "tr" and "td" tags, but mine
is not that easy. In other words, what is the most essential bit of markup
(XML? code?) that, copied into my code, would allow me to extract not only
this comment but all of the comments? ... soup.find_all('?')

In sum, what I need to know is: how do I tell my Python code to ignore all
of the surrounding markup and go straight in and grab the comment? Of
course, I need to grab other things too, like the name, category, date, and
so on, but I haven't gotten that far yet. Right now I'm just trying to
figure out what to put in my code so that I can get the comment.

Help! I'm trying to learn to code on the fly. I'm an experienced researcher
but am new to coding. Any help you could give me would be tremendously
awesome.

Best,
Drake
-- 
https://mail.python.org/mailman/listinfo/python-list


trying to retrieve comments with activated API key

2019-03-08 Thread Drake Gossi
Hi everyone,

I'm further along than I was last time. I've installed Python and am
running this in Spyder. This is the code I'm working with:

import requests
import csv
import time
import sys

api_key = 'my api key'          # the actual key goes here
docket_id = 'ED-2018-OCR-0064'  # the docket whose comments I want
total_docs = 32068              # total comments in the docket
docs_per_page = 1000            # results per page

Where the "my api key" is, is actually my api key. I was just told not to
give it out. But I put this into spyder and clicked run and nothing
happened. I went to right before "import requests and clicked run." I
think. I'm better in R. But I'm horrible at both. If you can't already
tell. Well, this might have happened:

runfile('/Users/susan/.spyder-py3/temp.py', wdir='/Users/susan/.spyder-py3')

but I don't know what to do with it, if it actually happened.

But I feel like I'm missing something super important. Like, for instance,
how is Python being told to go to the right website? Again, I'm trying to
retrieve these comments off of regulations.gov. I don't know if this helps,
but the interactive API console is here:
<https://regulationsgov.github.io/developers/console/#!/documents.json/documents_get_0>
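
The snippet above only defines variables, which is why nothing visible
happened: nothing tells Python to go to the website until a request is
actually made and its result printed. A minimal sketch of that step,
assuming the v3 endpoint and the parameter names listed in the interactive
console ("rpp" being results per page):

import requests

api_key = 'my api key'  # the real key goes here
docket_id = 'ED-2018-OCR-0064'

r = requests.get(
    'https://api.data.gov/regulations/v3/documents.json',
    params={'api_key': api_key, 'dktid': docket_id, 'rpp': 10},
)
print(r.status_code)   # 200 means the request succeeded
print(list(r.json()))  # top-level keys of the JSON response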

I've installed Anaconda. I was told to get Postman. Should I get Postman?

Help! At the end of the day, I'm trying to use Python to get the comments
from regulations.gov into a CSV file so that I can analyze them in R.

And then I think that I only need the name, comment, date, and category
from the JSON dictionary. I can't send a picture through the list, but I
have one of what I'm talking about.

Drake
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to retrieve comments with activated API key

2019-03-08 Thread Drake Gossi
Still having issues:

7. import requests
8. import csv
9. import time
10. import sys
11. api_key = 'PUT API KEY HERE'
12. docket_id = 'ED-2018-OCR-0064'
13. total_docs = 32068
14. docs_per_page = 1000
15.
16. r = requests.get("https://api.data.gov/regulations/v3/documents.json",
17.     params={
18.         "api_key": api_key,
19.         "dktid": docket_id,
20.         "rrp": docs_per_page,
    })

I'm getting an "imported but unused" warning on lines 8, 9, and 10.

But then also, the interactive API console
<https://regulationsgov.github.io/developers/console/#!/documents.json/documents_get_0>
has something called "po", i.e., page offset. I was able to figure out
that "rrp" means results per page, but now I'm unsure how to get all 32,000
comments into a response object. Do I have to add an "ro" on line 21? Or
something else?
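
For reference, a sketch of the paging loop that would eventually collect
all ~32,000 records, reusing the variables from lines 11-14. It assumes
"po" is a result offset as the console suggests, and that the response
carries a "documents" list; note the console spells results per page "rpp":

import time
import requests

documents = []
for offset in range(0, total_docs, docs_per_page):
    r = requests.get(
        "https://api.data.gov/regulations/v3/documents.json",
        params={
            "api_key": api_key,
            "dktid": docket_id,
            "rpp": docs_per_page,
            "po": offset,
        },
    )
    # "documents" is assumed to be the list field in the JSON response.
    documents.extend(r.json().get("documents", []))
    time.sleep(1)  # stay under the api.data.gov rate limit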

Also, I ran the code as is, starting on line 7, and got this:

runfile('/Users/susan/.spyder-py3/temp.py', wdir='/Users/susan/.spyder-py3')

This was in the terminal, I think. Since nothing popped up in what I assume
is the environment (I'm running this in Spyder), I assume it didn't work.

Drake

On Fri, Mar 8, 2019 at 12:29 PM Chris Angelico  wrote:

> On Sat, Mar 9, 2019 at 7:26 AM Drake Gossi  wrote:
> >
> > Yea, it looks like the request url is:
> >
> > import requests
> > import csv    <--- on these three, I have an "imported but unused" warning
> > import time   <---
> > import sys    <---
> > api_key = 'PUT API KEY HERE'
> > docket_id = 'ED-2018-OCR-0064'
> > total_docs = 32068    <--- but then also, what happens here? Does it have
> >                            to do with po (page offset)? How do I get all
> >                            32000 instead of a 1000?
> > docs_per_page = 1000
> >
>
> Can you post on the list, please? Also - leave the original text as
> is, and add your responses underneath, like this; don't use colour to
> indicate what's your text and what you're replying to, as not everyone
> can see colour. Text is the only form that truly communicates.
>
> ChrisA
>
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to retrieve comments with activated API key

2019-03-08 Thread Drake Gossi
OK, I think it worked. Embarrassingly, in my environment pane, I had
clicked on "help" rather than "variable explorer." This is what is
represented:

Name           Type    Size  Value
api_key        string  1     DEMO KEY
docket_id      string  1     ED-2018-OCR-0064
docs_per_page  int     1     1000
total_docs     int     1     32068

Does this mean I can add on the loop? That is, to get all 32,000?

And is this in JSON format? It has to be, right? Eventually I'd like it to
be in CSV, but that's because I assume I have to manipulate it with R
later...
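
Assuming the request from the earlier message populated r, r.json() turns
the JSON response body into an ordinary Python dict, and the CSV step might
look like this sketch (the per-document field names are guesses to check
against the actual response):

import csv

docs = r.json().get("documents", [])  # assumed list field in the response

with open("comments.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "comment", "date", "category"])
    for d in docs:
        # Key names are guesses; print one record's keys to see what
        # the API actually returns, then adjust.
        writer.writerow([
            d.get("title"),
            d.get("commentText"),
            d.get("postedDate"),
            d.get("category"),
        ])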

D

On Fri, Mar 8, 2019 at 12:54 PM Chris Angelico  wrote:

> On Sat, Mar 9, 2019 at 7:47 AM Drake Gossi  wrote:
> >
> > Still having issues:
> >
> > 7. import requests
> > 8. import csv
> > 9. import time
> > 10. import sys
> > 11. api_key = 'PUT API KEY HERE'
> > 12. docket_id = 'ED-2018-OCR-0064'
> > 13. total_docs = 32068
> > 14. docs_per_page = 1000
> > 15.
> > 16. r = requests.get("https://api.data.gov/regulations/v3/documents.json",
> > 17.     params={
> > 18.         "api_key": api_key,
> > 19.         "dktid": docket_id,
> > 20.         "rrp": docs_per_page,
> >     })
> >
> > I'm getting an "imported but unused" warning on lines 8, 9, and 10.
>
> Doesn't matter - may as well leave them in.
>
> > But then also, the interactive API console
> > <https://regulationsgov.github.io/developers/console/#!/documents.json/documents_get_0>
> > has something called "po", i.e., page offset. I was able to figure out
> > that "rrp" means results per page, but now I'm unsure how to get all
> > 32,000 comments into a response object. Do I have to add an "ro" on line
> > 21? Or something else?
>
> Probably you'll need a loop. For now, don't worry about it, and
> concentrate on getting the first page.
>
> > Also, I ran the code as is, starting on line 7, and got this:
> >
> > runfile('/Users/susan/.spyder-py3/temp.py', wdir='/Users/susan/.spyder-py3')
> >
> > This was in the terminal, I think. Since nothing popped up in what I
> > assume is the environment (I'm running this in Spyder), I assume it
> > didn't work.
>
> Ahh, that might be making things confusing.
>
> My recommendation is to avoid Spyder altogether. Just run your script
> directly and let it do its downloading. Keep it simple!
>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
>
-- 
https://mail.python.org/mailman/listinfo/python-list