Try this if you haven't used Python before:
http://gun.io/blog/python-for-the-web/
Keep in mind that using "some very well-known search engine" this way usually
violates their ToS, so sooner or later they will block you, at the very
least.
Be gentle and polite, and you might even make it work... ;)
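To make "gentle and polite" a bit more concrete, here is a minimal sketch, assuming you honour robots.txt and pause between requests. The robots.txt content, the bot name `example-bot`, and the `fetch` callable are all hypothetical placeholders, not anything a particular site or library prescribes; only `urllib.robotparser` is standard Python.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 2
"""

USER_AGENT = "example-bot"  # hypothetical bot name


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def polite_delay(parser, user_agent=USER_AGENT, default=1.0):
    """Use the site's Crawl-delay if present, else an assumed default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(urls, fetch, parser, sleep=time.sleep):
    """Fetch allowed URLs one at a time, sleeping between requests.

    `fetch` is any callable taking a URL and returning its content.
    """
    results = {}
    delay = polite_delay(parser)
    for url in allowed_urls(parser, urls):
        results[url] = fetch(url)
        sleep(delay)
    return results
```

In real use you would load robots.txt from the target site (for example with `set_url()` and `read()` on the parser) and pass a real download function as `fetch`. Even then, this only covers politeness; it does not make a ToS violation acceptable.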
Hello Marco, Markus and Óscar.
Thank you very much for your answers. What you suggest, Óscar, sounds very
interesting. I mean the alternative that covers data mining with any
'popular search engine'. Do you know of any tutorial or book that could teach
me the first steps?
Bye!
Hi Luis, just an opinion (worked with Nutch intensively, 2005-2008).
Web crawling is a bitch, and Nutch won't make it any easier.
Some problems you'll find along the way:
1. Spidering tunnels/traps
2. Duplicate and near-duplicate content removal
3. GET parameter explosion in dynamic pages
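A rough sketch of how one might attack the three problems above in plain Python. None of this is how Nutch does it; the ignored-parameter list, the depth cap, and the shingle size are assumptions picked for illustration, and all function names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change the page content.
IGNORED_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}
MAX_DEPTH = 8  # assumed cap on path segments


def canonicalize(url):
    """Tame GET parameter explosion: drop noise params, sort the rest."""
    parts = urlsplit(url)
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def looks_like_trap(url, max_depth=MAX_DEPTH):
    """Heuristic spider-trap check: very deep or repetitive paths."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # A path segment repeated more than twice is treated as suspicious.
    return any(count > 2 for count in Counter(segments).values())


def shingles(text, size=3):
    """Set of overlapping word n-grams used for near-duplicate checks."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two pages whose shingle sets score above some threshold (say 0.9, again an assumption) could be treated as near-duplicates. Comparing every pair this way does not scale to a large crawl, which is part of why this is harder than it looks.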
I'm a bit biased, but I would certainly use Nutch, as it seems to be the right
tool for the job. Developing custom plugins is actually easier than you might
think.
Solr, with its extracting request handler, can only help in a very limited
way.
Hi Luis,
Have you tried the copyField function with custom analyzers and tokenizers?
bye,
Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
2011/10/18 Luis Cappa Banda
> Hello everyone.
>
> I've b
Hello everyone.
I've been thinking about a way to retrieve information from a domain (for
example, http://www.ign.com) to process and index it. My idea is to use Solr as
the search engine. I'm familiar with Apache Nutch and I know that the latest
version has a gateway to Solr to retrieve and index infor