Hello everyone, I will be using R to manipulate this data <https://www.regulations.gov/docketBrowser?rpp=25&so=DESC&sb=commentDueDate&po=0&dct=PS&D=ED-2018-OCR-0064>. Specifically, it's the proposed changes to Title IX -- over 11,000 publicly available comments. The end goal is to tabulate each of those 11,000 comments in a CSV file so I can begin to manipulate and visualize the data.
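To make that end goal concrete, here is roughly the shape of file I'm hoping to produce (both rows are invented placeholders, and the column names are just my guesses):

    # Hoped-for end product: one row per comment, saved as a CSV.
    # Placeholder rows only; real IDs and text would come from the site.
    comments <- data.frame(
      id           = c("comment-id-1", "comment-id-2"),
      posted_date  = c("2018-11-29", "2018-11-30"),
      comment_text = c("First comment...", "Second comment..."),
      stringsAsFactors = FALSE
    )
    write.csv(comments, "titleix_comments.csv", row.names = FALSE)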
But I'm not there yet. I just applied for an API key and, while I have one, I'm waiting for it to be activated. After that, though, I'm a little lost. Do I need to scrape the comments from the site, or does having the API make that unnecessary? There is an interactive console that works with the API <https://regulationsgov.github.io/developers/console/>, but I don't know whether I can get the data I need through it. I'm also still trying to figure out what JSON is.

Or, if I do have to scrape the comments, can I do that with R? I can't get a straight answer from the Python people; I can't tell whether I need Beautiful Soup or Scrapy (or, as I said, whether I need to scrape at all). The trouble with the comments is that each one lives at its own URL, so -- again assuming I have to scrape them -- I don't know how to write code that grabs all of the comments from all of the URLs. I'm also trying to figure out how to isolate the text of each comment within the HTML. (My untested attempts at all of this are in the P.S. below.)

From the Python people, I've heard the following:

> "scrapy fetch 'url'" will download the raw page you are interested in,
> and you can look at the raw source code. It's important to appreciate
> that what you see in the browser has often been processed in your
> browser before you see it. A scraper can do the same processing, but
> it's complicated, so start by looking at the raw source code. Maybe you
> can grab what you need with simple parsing, like Beautiful Soup does;
> maybe you need to do more, in which case Scrapy is your friend.

and:

> Beautiful Soup is your friend here. It can analyze the data within the
> HTML tags on your scraped page. But JavaScript is often used on
> "modern" web pages, so the page is actually not just HTML but
> JavaScript that changes the HTML. For that you need another tool -- I
> think one is called Scrapy. Others here probably have experience with
> that.

I think part of my problem relates to that second point about JavaScript (the part I had highlighted in yellow). I was saying things like "I think what I might be looking for is a div with class GIY1LSJIXD," since that's where the hierarchy seems to taper off in the HTML for the comment I'm looking to scrape. What I'm trying to do is locate the comment in the HTML so I can tell the request function to extract it.

Any help anyone could offer here would be much appreciated. I'm very lost.

Drake
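P.S. Here are the untested sketches I mentioned, in case they help show where I'm going wrong. First, on JSON: as far as I can tell it's just structured text, and the jsonlite package turns it into ordinary R objects, something like:

    library(jsonlite)

    # JSON is structured text: name/value pairs and arrays. fromJSON()
    # converts it into ordinary R objects (here, a one-row data frame).
    txt <- '{"documents":[{"documentId":"ABC-0001","title":"A comment"}]}'
    parsed <- fromJSON(txt)
    parsed$documents$documentId   # "ABC-0001"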
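Second, my reading of the API console is that there is a documents.json endpoint that can search a whole docket at once, which would make scraping unnecessary. The endpoint and parameter names below are only my assumptions from the console, and I can't test anything until my key is active:

    library(httr)
    library(jsonlite)

    # Assumed endpoint and parameters: dktid is the docket ID, dct = "PS"
    # limits results to public submissions, rpp is results per page, and
    # po is the page offset.
    res <- GET("https://api.data.gov/regulations/v3/documents.json",
               query = list(api_key = "MY_KEY_HERE",   # placeholder key
                            dktid   = "ED-2018-OCR-0064",
                            dct     = "PS",
                            rpp     = 100,
                            po      = 0))
    stop_for_status(res)
    page <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
    names(page$documents)   # hoping the comment text, or at least an ID, is in here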
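Third, if that works, I'd guess the way to get all 11,000 comments is to loop over the page offset rather than visiting each comment's URL one by one (again untested; this continues from the sketch above):

    # Pull 100 results at a time by bumping the page offset, stack the
    # pages, and write everything out to CSV. Uses httr and jsonlite as
    # loaded above; the endpoint and parameters are the same assumptions.
    get_page <- function(offset) {
      res <- GET("https://api.data.gov/regulations/v3/documents.json",
                 query = list(api_key = "MY_KEY_HERE",
                              dktid   = "ED-2018-OCR-0064",
                              dct     = "PS",
                              rpp     = 100,
                              po      = offset))
      stop_for_status(res)
      Sys.sleep(1)   # be polite to the server between requests
      fromJSON(content(res, as = "text", encoding = "UTF-8"))$documents
    }

    pages    <- lapply(seq(0, 11000, by = 100), get_page)
    all_docs <- do.call(rbind, pages)   # assumes every page has the same columns
    write.csv(all_docs, "titleix_comments.csv", row.names = FALSE)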
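And finally, the scraping route I was attempting, in R rather than Python. This is where the GIY1LSJIXD div came from; the document ID in the URL is made up. If the page really is assembled by JavaScript, I suspect this returns nothing, because read_html() only sees the raw HTML and not what the browser builds afterwards:

    library(rvest)

    # Try to fetch one comment page and pull out the div class I found
    # while inspecting the page source. May come back empty if the text
    # is filled in by JavaScript after the page loads.
    url  <- "https://www.regulations.gov/document?D=ED-2018-OCR-0064-0001"
    page <- read_html(url)
    page %>% html_nodes("div.GIY1LSJIXD") %>% html_text()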