Sent: Friday, July 6, 2007 2:19:21 AM
Subject: Re: Indexing HTML and other doc types
Peter,

I had been playing with Nutch for quite some time before Solr, so I know
Nutch better than Solr. Nutch has a plugin mechanism so that you can add a
parser for each document type. It comes with parser plugins for most popular
doc types (with varying degrees of international text support).

My question ...
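
To give a rough picture of the pluggable-parser idea: this is not Nutch's
actual Parser interface (which varies by Nutch version), and all names below
are hypothetical, but the shape is similar. Nutch itself maps content types
to parser plugins through its plugin descriptors.

import java.util.HashMap;
import java.util.Map;

interface DocParser {
    // Extract plain text from the raw document bytes.
    String parse(byte[] raw);
}

public class ParserRegistry {
    // One parser per MIME type, registered much like Nutch's parser plugins.
    private final Map<String, DocParser> parsers = new HashMap<String, DocParser>();

    public void register(String mimeType, DocParser parser) {
        parsers.put(mimeType, parser);
    }

    public String parse(String mimeType, byte[] raw) {
        DocParser p = parsers.get(mimeType);
        if (p == null) {
            throw new IllegalArgumentException("no parser for " + mimeType);
        }
        return p.parse(raw);
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // A trivial plain-text "parser"; real plugins would handle PDF, DOC, etc.
        registry.register("text/plain", new DocParser() {
            public String parse(byte[] raw) { return new String(raw); }
        });
        System.out.println(registry.parse("text/plain", "hello".getBytes()));
    }
}

The point is that adding a new document type means adding a parser, not
changing the crawler.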

I guess I misread your original question. I believe Nutch would be the
choice for crawling; however, I do not know about its abilities for indexing
other document types. If you needed to index multiple document types such as
PDF, DOC, etc., and Nutch does not provide the functionality to do so, you
would ...

Thank you, Otis and Peter, for your replies.

> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> doc of some type -> parse content into various fields -> post to Solr

I understand this part, but the question is who should do this. I was under
the assumption that it is the Solr client's job to crawl the ...
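
For what it's worth, the "post to Solr" step itself is small and can live
entirely in the client. A minimal sketch, assuming a Solr instance at the
default http://localhost:8983/solr/update and a schema with id, title, and
content fields:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPoster {
    // Post one parsed document to Solr's XML update handler.
    public static void post(String id, String title, String content) throws Exception {
        String xml = "<add><doc>"
                + "<field name=\"id\">" + escape(id) + "</field>"
                + "<field name=\"title\">" + escape(title) + "</field>"
                + "<field name=\"content\">" + escape(content) + "</field>"
                + "</doc></add>";
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(xml.getBytes("UTF-8"));
        out.close();
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("Solr update failed: " + conn.getResponseCode());
        }
        // Documents become searchable after a separate <commit/> is posted.
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

The open question is only who runs this: a standalone client, or the crawler
itself.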

A coworker of mine posted the code that we used for adding PDF, DOC, XLS,
etc. documents into Solr. You can find the files at the following location:

https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Just apply the patch and put the ...
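
For anyone who wants to do the extraction client-side instead, a library
like PDFBox handles the PDF case. A sketch of the extraction step alone,
assuming PDFBox 2.x on the classpath (the class name here is mine, not from
the patch):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfExtractor {
    // Extract the full plain text of a PDF; the resulting string can then
    // be posted to Solr as the content field.
    public static String extractText(File pdf) throws Exception {
        PDDocument doc = PDDocument.load(pdf);
        try {
            return new PDFTextStripper().getText(doc);
        } finally {
            doc.close();
        }
    }
}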

Kuro,

doc of some type -> parse content into various fields -> post to Solr
Even Nutch does the same - there is a title field, a content field, and so on
(the exact names may be different).
Of course, you can always just combine everything into a single content field.
Otis