Hi Tony,
Strangely I started looking into the Solr/Nutch integration yesterday
so I might be able to help :)
The documentation for it is very sparse, but the trunk of nutch does
have the solr integration committed.
If I remember correctly, what I had to do was...
I went through one of the nutch setup guides and set it up as if I
wasn't going to use solr. (Can't remember which one, sorry).
Then I copied the crawl script from here: http://www.foofactory.fi/files/nutch-solr/crawl.sh
into my nutch directory.
I was running this under the soy-latte JVM on OSX, and I had to modify
the crawl script a little so that it picks up the segment filenames
instead of the permission strings.
This line was changed (note the 'cut' command):
SEGMENT=`bin/hadoop dfs -ls $BASEDIR/segments|grep $BASEDIR|cut -d' ' -f17|sort|tail -1`
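The field number in that 'cut' depends on how the listing is padded, so it may differ on other platforms. As a rough sketch only (not what the original crawl script does), the same selection can be done by taking the last whitespace-separated column instead. The function name latest_segment is made up for illustration, and this assumes the path is always the last column and that segment names sort as timestamps:

```shell
# Pick the newest segment path from a directory listing read on stdin.
# Assumption: the path is the last whitespace-separated field on each line,
# and segment names are timestamps, so a plain sort puts the newest last.
latest_segment() {
  basedir="$1"
  grep "$basedir" | awk '{print $NF}' | sort | tail -1
}

# Hypothetical usage inside the crawl script:
# SEGMENT=$(bin/hadoop dfs -ls "$BASEDIR/segments" | latest_segment "$BASEDIR")
```

Because it does not count spaces, this version should not need a different field number when the column widths change.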
I also changed the second to last line to match the required
parameters for the new solr indexer:
bin/nutch org.apache.nutch.indexer.solr.SolrIndexer http://localhost:8983/solr/ $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
Copy the schema.xml from the nutch config directory into a fresh solr
install & start it up.
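For the schema step, something along these lines might do. It is only a sketch: the function name install_schema is made up, and both directory paths in the usage comment are assumptions about a stock Nutch checkout and the example directory of a Solr download, so adjust them to your layout:

```shell
# Copy Nutch's schema.xml into a Solr conf directory, keeping a backup
# of whatever schema.xml was there before.
install_schema() {
  nutch_conf="$1"   # assumed: the conf directory of the Nutch checkout
  solr_conf="$2"    # assumed: the conf directory of the Solr home
  [ -f "$nutch_conf/schema.xml" ] || { echo "no schema.xml in $nutch_conf" >&2; return 1; }
  [ -d "$solr_conf" ] || { echo "no such directory: $solr_conf" >&2; return 1; }
  if [ -f "$solr_conf/schema.xml" ]; then
    cp "$solr_conf/schema.xml" "$solr_conf/schema.xml.orig"
  fi
  cp "$nutch_conf/schema.xml" "$solr_conf/schema.xml"
}

# Hypothetical usage, then starting the bundled Jetty from Solr's example directory:
# install_schema "$NUTCH_HOME/conf" "$SOLR_HOME/example/solr/conf"
# (cd "$SOLR_HOME/example" && java -jar start.jar)
```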
Run crawl.sh, and you should end up with content in your solr
instance.
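One way to check that documents actually arrived is to ask Solr for a match-all query with zero rows and read the numFound attribute from the response. A small sketch, assuming Solr's default XML response format and the same localhost URL used for indexing; the helper name num_found is made up:

```shell
# Extract the numFound attribute from a Solr XML response read on stdin.
# Assumption: the response contains numFound="<digits>", as the default
# XML response writer produces.
num_found() {
  sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p' | head -1
}

# Hypothetical usage against the local Solr instance:
# curl -s 'http://localhost:8983/solr/select?q=*:*&rows=0' | num_found
```

A count above zero after the crawl suggests the indexer reached Solr; zero or no output means it is worth looking at the indexer output and the Solr log.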
I probably won't be able to answer many nutch-related questions, but
that's how I managed to get it up and running.
Toby.
On 6 Mar 2009, at 11:27, Andrzej Bialecki wrote:
Tony Wang wrote:
Hi Hoss,
But I cannot find documents about the integration of Nutch and Solr
anywhere. Could you give me some clue? Thanks
Tony, I suggest that you follow Hoss's advice and ask these
questions on nutch-user. This integration is built into Nutch, and
not Solr, so it's less likely that people on this list know what you
are talking about.
This integration is quite fresh, too, so there are almost no docs
except on the mailing list. Eventually someone is going to create
some docs, and if you keep asking questions on nutch-user you will
contribute to the creation of such docs ;)
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Toby Cole
Software Engineer
Semantico
Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE
T: +44 (0)1273 358 238
F: +44 (0)1273 723 232
E: toby.c...@semantico.com
W: www.semantico.com