Hi Evan,

    wget -rH -D $(cat trash/text.txt) williamstallings.com
is not what you want. Leave out the -H; otherwise host-spanning is ON and
-D will be ignored.

> I bring this up for one of my two questions. Can someone recommend a
> better method of performance testing?

What you want to know is how many CPU cycles wget needs to perform a
defined task (if you compare, make sure exactly the same files are
downloaded). Measuring real (wall-clock) time depends on many time-variant
side effects, so two runs of wget are hardly comparable. Use

    valgrind --tool=callgrind wget ...

You can then use kcachegrind to display and analyse which parts of wget
took how many CPU cycles. (Command sketches for both points follow below
the quoted message.)

Regards,
Tim

On Tuesday 02 June 2015 17:21:48 ekrell wrote:
> Greetings,
>
> I recently used wget in such a way that the result disagreed with my
> understanding of what should have happened. This came about during a
> small programming exercise I am currently working on: I am attempting to
> see whether a large number of domains (from the '-D' option) would be
> processed more quickly by using the hashtable included in hash.c. While
> comparing the speed of my hashed implementation of host checking against
> an unmodified version of wget, the standard wget did not seem to respect
> my list of accepted domains.
>
> For the hash table version, I did the following:
> In recur.c, I init a hashtable with all of the accepted domains from
> opt.domain. Ignoring (for the moment) the increased memory usage, I
> assumed that this would surely be faster than the current method of
> checking the url's host.
> However, when performing the check inside host.c's accept_domain
> function, I realized that I would need to parse u->host to get just the
> domain component. This involves some overhead that may make hashing not
> worth it. Also, during this entire exercise, I have been assuming that if
> it would provide any significant improvement, it would most likely have
> been done before my decision to try it out. Nonetheless, I've enjoyed
> playing around with it.
>
> My first couple of tests were against my own website, using a list of
> over 5000 domains. Both wget and wget-modified downloaded the same files,
> and at roughly the same speed. My website is so small that I wanted
> something larger, but not so large that it would take more than a few
> minutes. I know going around and mirroring random sites is perhaps not
> recommended behaviour (without a delay), but it worked.
>
> I bring this up for one of my two questions. Can someone recommend a
> better method of performance testing?
>
> Having found my target website, I went ahead and ran the two wget
> versions, one after the other. When mine came out to be almost twice as
> fast, I knew to assume that something was amiss. Sure enough, wget had
> downloaded much more content... and spanned to many more domains.
>
> This is the command I ran for each:
>
> <pathToVersion>/src/wget -rH -D $(cat trash/text.txt) williamstallings.com
>
> Excusing the useless use of cat, text.txt contains the massive
> comma-separated list of domains.
> Each of those domains is a randomly generated numeric value, except for
> the final one: williamstallings.com
>
> Previously, whenever I ran this test against a (smaller) website, both
> versions of wget would only recursively download from the single "real"
> domain in the list. However, this time (and I did it twice to make sure)
> the original wget went on to download from over 20 other domains.
>
> I would appreciate it if someone could explain what is going on here.
> Seeing as this behaviour exists with the version I obtained from
> git://git.savannah.gnu.org/wget.git as well as the wget from the package
> manager, I am not proclaiming "found a bug!". I imagine that I just
> misunderstand what should have taken place, since I expected to only
> have the single directory from williamstallings.com.
>
> Thanks,
> Evan Krell
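To make the first point concrete, the non-spanning form of the command
simply drops the -H and keeps the accept list (a sketch only; trash/text.txt
stands for the same comma-separated domain list as in the original command):

    wget -r -D $(cat trash/text.txt) williamstallings.com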
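For the measurement question, here is a minimal sketch of the callgrind
workflow, assuming the unmodified and the hashed builds live under
wget-orig/ and wget-hash/ (hypothetical paths) and that domains.txt holds
the comma-separated -D list:

    # Record instruction counts (callgrind's Ir events) instead of
    # wall-clock time; each run writes its profile to the named output file.
    valgrind --tool=callgrind --callgrind-out-file=orig.callgrind \
        wget-orig/src/wget -r -D $(cat domains.txt) -P run-orig williamstallings.com
    valgrind --tool=callgrind --callgrind-out-file=hash.callgrind \
        wget-hash/src/wget -r -D $(cat domains.txt) -P run-hash williamstallings.com

    # Confirm both runs fetched exactly the same files before comparing numbers.
    diff -r run-orig run-hash && echo "identical downloads"

    # Per-function instruction counts on the terminal, or browse them in
    # kcachegrind.
    callgrind_annotate orig.callgrind | head -n 30
    callgrind_annotate hash.callgrind | head -n 30
    kcachegrind orig.callgrind &

Callgrind reports instruction counts rather than literal CPU cycles, but
those counts are stable from run to run, which is what makes the two builds
comparable.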
