On 20.06.2018 18:20, Nils Gerlach wrote:
> It does not delete any html-file or anything else. Either it is accepted
> and kept or it is saved forever.
> With the tip about --accept and --accept-regex I can get wget to traverse
> the links but it does not go deep
> enough to get the *l.jpgs I tried to increase -l but to no avail. It seems
> like it is going only 1 link deep.
> And not deletes.

Yes, my mistake. Looking at the code, the regex options are applied
without taking --recursive or --level into account. They are dumb URL
filters.

We are back at

  wget -d -olog -r -Dcomicstriplibrary.org -A "*little-nemo*s.jpeg" 'http://comicstriplibrary.org/search?search=little+nemo'

which doesn't work as expected. Somehow it doesn't follow certain links,
so the little-nemo*s.jpeg files aren't found.

Interestingly, wget2 finds and downloads those files with the same
options. At first glance, those files are linked from an RSS / Atom
file. wget doesn't support those, but wget2 does parse them for URLs.

Want to give it a try?
https://gitlab.com/gnuwget/wget2

Regards, Tim

>
> 2018-06-20 16:58 GMT+02:00 Tim Rühsen <[email protected]>:
>
>> Hi Nils,
>>
>> please always answer to the mailing list (no problem if you CC me, but
>> not needed).
>>
>> It was just an example for POSIX regexes - it's up to you to work out
>> the details ;-) Or maybe there is a volunteer reading this.
>>
>> The implicitly downloaded HTML pages should be removed after parsing
>> when you use --accept-regex. Except the explicitly 'starting' page from
>> your command line.
>>
>> Regards, Tim
>>
>> On 06/20/2018 04:28 PM, Nils Gerlach wrote:
>>> Hi Tim,
>>>
>>> I am sorry but your command does not work. It only downloads the thumbnails
>>> from the first page
>>> and follows none of the links. Open the link in a browser. Click on the
>>> pictures to get a larger picture.
>>> There is a link "high quality picture" the pictures behind those links are
>>> the ones i want to download.
>>> Regex being ".*little-nemo.*n\l.jpeg". And not only the first page but from
>>> the other search result pages, too.
>>> Can you work that one out? Does this work with wget? Best result would be
>>> if the visited html-pages were
>>> deleted by wget. But if they stay I can delete them afterwards. But
>>> automatism would be better, that's why I am
>>> trying to use wget ;)
>>>
>>> Thanks for the information on the filename and path, though.
>>>
>>> Greetings
>>>
>>> 2018-06-20 16:13 GMT+02:00 Tim Rühsen <[email protected]>:
>>>
>>>> Hi Nils,
>>>>
>>>> On 06/20/2018 06:16 AM, Nils Gerlach wrote:
>>>>> Hi there,
>>>>>
>>>>> in #wget on freenode I was suggested to write this to you:
>>>>> I tried using wget to get some images:
>>>>> wget -nd -rH -Dcomicstriplibrary.org -A
>>>>> "little-nemo*s.jpeg","*html*","*.html.*","*.tmp","*page*","*display*" -p -e
>>>>> robots=off 'http://comicstriplibrary.org/search?search=little+nemo'
>>>>> I wanted to download the images only but wget was not following any of the
>>>>> links so I got that much more into -A. But it still does not follow the
>>>>> links.
>>>>> Page numbers of the search result contain "page" in the link, links to the
>>>>> big pictures i want wget to download contain "display". Both are given in
>>>>> -A and are seen in the html-document wget gets. Neither is followed by wget.
>>>>>
>>>>> Why does this not work at all? Website is public, anybody is free to test.
>>>>> But this is not my website!
>>>>
>>>> -A / -R works only on the filename, not on the path. The docs (man page)
>>>> are not very explicit about it.
>>>>
>>>> Instead try --accept-regex / --reject-regex which acts on the complete
>>>> URL - but shell wildcards won't work.
>>>>
>>>> For your example this means to replace '.' by '\.' and '*' by '.*'.
>>>>
>>>> To download those nemo jpegs:
>>>> wget -d -rH -Dcomicstriplibrary.org --accept-regex
>>>> ".*little-nemo.*n\.jpeg" -p -e robots=off
>>>> 'http://comicstriplibrary.org/search?search=little+nemo'
>>>> --regex-type=posix
>>>>
>>>> Regards, Tim
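
The conversion rule quoted above for moving a -A wildcard over to --accept-regex (replace '.' by '\.' and '*' by '.*') can be sketched as a small shell helper. This is only an illustration: the function name glob_to_regex and the example.org URL are made up, only '.' and '*' are handled (no '?' or bracket expressions), and grep -E merely stands in for a POSIX regex check. It says nothing about how wget itself walks links.

```shell
#!/bin/sh
# Hypothetical helper: turn a simple shell wildcard into a POSIX regex.
# Dots are escaped first so that the '.' introduced by '.*' stays unescaped.
glob_to_regex() {
    printf '%s\n' "$1" | sed -e 's/\./\\./g' -e 's/\*/.*/g'
}

pattern=$(glob_to_regex 'little-nemo*s.jpeg')
printf '%s\n' "$pattern"    # prints: little-nemo.*s\.jpeg

# Made-up URL, used only to try the converted pattern against a full URL.
url='http://example.org/images/little-nemo-1905-s.jpeg'
if printf '%s\n' "$url" | grep -Eq "$pattern"; then
    echo "pattern matches the full URL"
fi
```

The resulting string is the kind of value one would pass to --accept-regex together with --regex-type=posix, as in the command quoted above.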
