I want to crawl http://www.amazone.com/ and extract only the product title, product information, writer, and publisher. I want to ignore all other data.
How about http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html ? Or, if you're prepared to wait or help out, there's http://svn.apache.org/repos/asf/labs/droids/README.TXT
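Whichever crawler you end up with, the "keep these fields, ignore the rest" step might look roughly like the sketch below. It is plain standard-library Python, not the Nutch or Droids API, and the element ids in `WANTED_IDS` are hypothetical: you would have to inspect the real product pages to find out how the title, writer, and publisher are actually marked up.

```python
from html.parser import HTMLParser

# Hypothetical element ids mapped to output field names (an assumption;
# inspect the real page markup to find the actual ones).
WANTED_IDS = {
    "product-title": "title",
    "product-info": "information",
    "writer": "writer",
    "publisher": "publisher",
}

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class FieldExtractor(HTMLParser):
    """Collects text only from elements whose id is listed in WANTED_IDS."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # name of the field being captured, if any
        self._depth = 0       # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1  # nested tag inside a wanted element
            return
        field = WANTED_IDS.get(dict(attrs).get("id"))
        if field:
            self._current = field
            self._depth = 1
            self.fields[field] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the wanted element has closed
            self.fields[self._current] = self.fields[self._current].strip()
            self._current = None

    def handle_data(self, data):
        # Text outside a wanted element is simply dropped.
        if self._current is not None:
            self.fields[self._current] += data


def extract_fields(html):
    """Return a dict of the wanted fields found in one page of HTML."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields
```

You would call `extract_fields(page_html)` on each fetched page and index or store only the returned dict. Fields that are missing from a page are just absent from the result.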