[Tutor] Using Python to access .txt files stored behind a firewall as .exe files
I've got a Python project that I'd love some help with from a Python developer who is well versed in web scraping or requests.

I work for a supplier, and we use a distributor to sell our products to retailers. The distributor has a reporting website that requires a login. From that home / login page, you land on a page with one link for each state in which we do business (12 states / links in total). From the 'state' page, you click a state link and are taken to a page with many data files for that state. The data files are neatly arranged .txt files displayed as links, with logical naming conventions.

The problem is, when you click a link for a particular file, an .exe downloads to your local machine. Then you have to run the .exe, which produces a zipped file, and inside the zipped file is the .txt, which is what I really want.

There's no way the distributor will change anything about how they store files on their website for me. I've written a script using the requests module, but I think a web scraper like Scrapy, Beautiful Soup or Selenium may be required.

What would you do?

Thanks for your time.
-Ian

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
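For illustration, the "page of links" part of the workflow described above can be handled with the standard library alone. This is only a minimal sketch: the names `LinkCollector` and `extract_links` and the `distributor.example` URLs are made up for the example, the real site's HTML is unknown, and the login step is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link targets found in an HTML page, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, standing in for the 'state' page.
page = '<a href="/reports/CA/">California</a> <a href="TX/index.html">Texas</a>'
for link in extract_links(page, "https://distributor.example/reports/"):
    print(link)
```

The same function could be applied first to the state page and then to each per-state page to build the list of file URLs to download.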
Re: [Tutor] Using Python to access .txt files stored behind a firewall as .exe files
Hi Alan, thanks for the reply.

My goal is to use Python to automatically download the .exe, unzip it, and place the new .txt in a folder on my OneDrive. Then I have another visualization program that loads all the .txt files in that folder and displays them in a web dashboard. My sales team has access to the dashboard through SharePoint. So, I'm trying to automate the input to the dashboard so the team is always updated, without taking any of my time.

Thanks for your time and thoughts
-Ian

On Mon, May 1, 2017 at 2:44 PM, Alan Gauld via Tutor wrote:
> On 01/05/17 18:20, Ian Monat wrote:
> > ... I've written a script using the requests module but I
> > think a web scraper like Scrapy, Beautiful Soup or Selenium may be
> > required.
>
> I'm not sure what you are looking for. Scrapy, BS etc will
> help you read the HTML but not to fetch the file. Also do
> you want to process the file (extract the text) in Python
> too, or is it enough to just fetch the file?
>
> If the problem is with reading the HTML then you need to
> give us more detail about the problem areas and HTML
> format.
>
> If the problem is fetching the file, it sounds like you
> have already done that and it should be a case of fine
> tuning/tidying up the code you've written.
>
> What kind of help exactly are you asking for?
>
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.amazon.com/author/alan_gauld
> Follow my photo-blog on Flickr at:
> http://www.flickr.com/photos/alangauldphotos
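As a rough sketch of the "download the .exe" step of that goal: the function below streams one file into a folder. It assumes `session` behaves like a logged-in `requests.Session` (a `.get(url, stream=True, timeout=...)` call whose response offers `.raise_for_status()` and `.iter_content(chunk_size=...)`). The name `download_file` is made up for the example, and how the login itself works on the distributor's site is unknown, so it is not shown.

```python
from pathlib import Path


def download_file(session, url, dest_dir, chunk_size=65536):
    """Stream the file at `url` into `dest_dir` and return the path written.

    `session` is assumed to be a requests.Session (or anything with the
    same .get() interface) that is already logged in.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last path segment of the URL.
    # Assumption: the URL has no query string; adjust if the site uses one.
    target = dest_dir / url.rstrip("/").rsplit("/", 1)[-1]

    response = session.get(url, stream=True, timeout=60)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses

    with open(target, "wb") as fh:
        for chunk in response.iter_content(chunk_size=chunk_size):
            fh.write(chunk)
    return target
```

Pointing `dest_dir` at a working folder rather than directly at the synced OneDrive folder would keep half-downloaded files out of the dashboard's input.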
Re: [Tutor] Using Python to access .txt files stored behind a firewall as .exe files
Thank you for the reply Mats.

I agree the fact that the files are wrapped in an .exe is ridiculous. We're talking about a $15B company that is doing this, by the way, not a mom-and-pop shop. Anyways...

If I understand you correctly, you're saying I can:

1) Use Python to download the file from the web (but not by using a web scraper, according to Alan)
2) Simply ignore the .exe wrapper and use, maybe Windows Task Manager, to unzip the file and place the .txt file in the desired folder

Am I understanding you correctly?

Thank you
-Ian

On Mon, May 1, 2017 at 4:14 PM, Mats Wichmann wrote:
> On 05/01/2017 03:44 PM, Alan Gauld via Tutor wrote:
> > On 01/05/17 18:20, Ian Monat wrote:
> >> ... I've written a script using the requests module but I
> >> think a web scraper like Scrapy, Beautiful Soup or Selenium may be
> >> required.
> >
> > I'm not sure what you are looking for. Scrapy, BS etc will
> > help you read the HTML but not to fetch the file. Also do
> > you want to process the file (extract the text) in Python
> > too, or is it enough to just fetch the file?
> >
> > If the problem is with reading the HTML then you need to
> > give us more detail about the problem areas and HTML
> > format.
> >
> > If the problem is fetching the file, it sounds like you
> > have already done that and it should be a case of fine
> > tuning/tidying up the code you've written.
> >
> > What kind of help exactly are you asking for?
> >
>
> This is a completely non-Python, non-Tutor response to part of this:
>
> The self-extracting archive. Convenience, at a price: running
> executables of unverified reliability is just a terrible idea.
>
> I know you said your disty won't change their website, but you should
> tell them they should: a tremendous number of organizations have
> policies that don't just allow pulling down and running an exe file from
> a website. Even if that's not currently the case for you, you could say
> that you're not allowed, and get someone in your management chain to
> promise to support that if there's a question - should not be hard. It
> may be wired into the distributor's content delivery system, but that's
> a stupid choice on their part.
>
> "Then you have to run the .exe which produces a zipped file"
>
> Don't do this ("run"), unless there's a way you trust to be able to
> verify the security of what is offered. Just about any payload could be
> buried in the exe, especially if someone broke in to the distributor's
> site.
>
> Possibly slightly pythonic:
>
> if it is really just a wrapper for a zipfile (i.e. the aforementioned
> self-extracting archive), you should be able to open it in 7zip or
> similar, and extract the zipfile, without ever "running" it. And if
> that is the case, you should be able to script extracting the zipfile
> from the .exe, and then extracting the text file from the zipfile, using
> Python (or other scripting languages: that's not particularly
> Python-specific).
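If the .exe really is a self-extracting ZIP archive, as suggested above, Python's standard `zipfile` module can usually open it directly, because it locates the archive directory from the end of the file and tolerates an executable stub in front. The sketch below rests on that assumption (which has not been verified against the distributor's files) and also assumes the .txt may sit either directly in the archive or inside a nested .zip. The names `extract_txt_from_sfx` and `_copy_txt_members` are made up for the example.

```python
import io
import zipfile
from pathlib import Path, PurePosixPath


def _copy_txt_members(archive, dest_dir, written):
    """Write every .txt member to dest_dir, descending into nested .zip members."""
    for info in archive.infolist():
        if info.is_dir():
            continue
        # Keep only the bare file name so members cannot escape dest_dir.
        name = PurePosixPath(info.filename.replace("\\", "/")).name
        if name.lower().endswith(".txt"):
            target = dest_dir / name
            target.write_bytes(archive.read(info))
            written.append(target)
        elif name.lower().endswith(".zip"):
            # Open the nested archive in memory; nothing is executed.
            with zipfile.ZipFile(io.BytesIO(archive.read(info))) as inner:
                _copy_txt_members(inner, dest_dir, written)


def extract_txt_from_sfx(exe_path, dest_dir):
    """Extract .txt files from a self-extracting ZIP without running it.

    Raises zipfile.BadZipFile if the .exe does not contain a readable ZIP.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(exe_path) as outer:
        _copy_txt_members(outer, dest_dir, written)
    return written
```

A `zipfile.BadZipFile` here would suggest the wrapper is not a plain self-extracting ZIP, in which case a tool such as 7zip may still be able to say what format it is.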
Re: [Tutor] Using Python to access .txt files stored behind a firewall as .exe files
Hi Steven,

Thanks for your commentary, it made me laugh. I wish switching distributors were that easy. I could give them reasons why .exe files won't work for me, but they don't really care if I take the data files on their site or not. So I guess, to answer your question, we need them more.

That said, I think my plan is to use requests to pull the .exe file down and then try to write a Python script to extract the .zip without running the .exe (maybe with pandas?). I'm a beginner with Python, so we'll see how it goes!

Thanks for your help
-Ian

On Tue, May 2, 2017 at 9:44 AM, Steven D'Aprano wrote:
> On Mon, May 01, 2017 at 10:20:42AM -0700, Ian Monat wrote:
> [...]
> > Then you have to run the .exe which produces a zipped file, and inside the
> > zipped file, is the .txt, which is what I really want. There's no way the
> > distributor will change anything about how they store files on their
> > website for me. I've written a script using the requests module but I
> > think a web scraper like Scrapy, Beautiful Soup or Selenium may be
> > required.
> >
> > What would you do?
>
> Find another distributor.
>
> (It's this sort of business to business incompetence that makes me laugh
> when people say that private industry is always more efficient than the
> alternatives. Did I say laugh? I meant cry.)
>
> Seriously, can't you tell them that your anti-virus blocks the .exe
> files, and if they want you to use their system, they'll have to provide
> text files as text files?
>
> Or tell them that you're using Apple Macs and the .exe files don't run
> under Mac.
>
> I guess it depends on whether you need them more than they need you.
>
> In any case, this isn't a problem that can be solved by a web scraper.
> The distributor's website provides .exe files. There's nothing you can
> do about that except complain or leave. The website gives you a .exe
> file, so that's what you receive.
>
> However, once you have the .exe file in your possession, you *may* be
> able to hack open the file and extract the .zip file without running it.
> That will require detailed knowledge of how the .exe file does its job,
> but it is conceivable that it will work. A good low-level hacker could
> probably determine whether the zip file is embedded in the .exe or if it
> is generated on the fly. That's beyond my skills though.
>
> If it is generated on the fly, you're screwed. You have no choice but to
> run the .exe; until you do, the zip doesn't even exist. But if it is
> embedded, it can be extracted, and once the zip file is extracted,
> Python can easily unzip it.
>
> --
> Steve
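One cheap way to probe the "embedded or generated on the fly" question, without running anything, is to look for ZIP signatures in the raw bytes of the .exe and then see whether `zipfile` can actually parse it. This is a heuristic sketch only: the signature bytes are the standard ZIP markers, but a match can occur by chance inside executable code, and an archive that is compressed or encrypted by the wrapper would not show up at all. The name `describe_embedded_zip` is made up for the example.

```python
import io
import zipfile

LOCAL_FILE_HEADER = b"PK\x03\x04"   # starts each file entry in a ZIP
END_OF_CENTRAL_DIR = b"PK\x05\x06"  # marks the directory record near the end of a ZIP


def describe_embedded_zip(data):
    """Report where ZIP signatures occur in `data` and whether it parses as a ZIP."""
    first_local = data.find(LOCAL_FILE_HEADER)
    last_eocd = data.rfind(END_OF_CENTRAL_DIR)
    try:
        # zipfile tolerates leading non-ZIP bytes such as an .exe stub.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            members = archive.namelist()
    except zipfile.BadZipFile:
        members = None
    return {
        "first_local_header": first_local,   # -1 if not found
        "end_of_central_dir": last_eocd,     # -1 if not found
        "looks_embedded": 0 <= first_local < last_eocd,
        "members": members,                  # None if it does not parse
    }
```

Calling it as `describe_embedded_zip(Path("some_file.exe").read_bytes())` (with a hypothetical file name) and getting a list of member names back would be a good sign that the archive is embedded and can be extracted from Python.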