Re: Indexing HTML and other doc types

2007-07-06 Thread Otis Gospodnetic
Sent: Friday, July 6, 2007 2:19:21 AM Subject: Re: Indexing HTML and other doc types I guess I misread your original question. I believe Nutch would be the choice for crawling; however, I do not know about its abilities for indexing other document types. If you needed to index multiple do

RE: Indexing HTML and other doc types

2007-07-06 Thread Teruhiko Kurosaka
Peter, I was playing with Nutch for quite some time before Solr, so I know Nutch better than Solr. Nutch has a plugin mechanism so that you can add a parser for a document type. It comes with parser plugins for most popular doc types (with varying degrees of international text support). My que

Re: Indexing HTML and other doc types

2007-07-05 Thread Peter Manis
I guess I misread your original question. I believe Nutch would be the choice for crawling; however, I do not know about its abilities for indexing other document types. If you needed to index multiple document types such as PDF, DOC, etc., and Nutch does not provide functionality to do so you woul

RE: Indexing HTML and other doc types

2007-07-05 Thread Teruhiko Kurosaka
Thank you, Otis and Peter, for your replies. > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > doc of some type -> parse content into various fields -> post to Solr I understand this part, but the question is who should do this. I was under the assumption that it's the Solr client's job to crawl the
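
For illustration, the quoted pipeline (doc of some type -> parse content into various fields -> post to Solr) might look roughly like this on the client side. This is a minimal Python sketch, not code from the thread: the helper names and the field names id, title, and text are hypothetical and would have to match your own schema. It only builds the <add> message of Solr's XML update format; sending it to the update handler over HTTP is left out.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and all text found outside <title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def html_to_fields(doc_id, html):
    """Step 2 of the pipeline: parse content into fields (names are hypothetical)."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return {
        "id": doc_id,
        "title": parser.title.strip(),
        "text": " ".join(parser.text_parts),
    }


def build_add_xml(fields):
    """Step 3, minus the HTTP POST: wrap the fields in an <add><doc> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, <, > on serialization
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    page = "<html><head><title>Hello</title></head><body><p>Solr &amp; Nutch</p></body></html>"
    print(build_add_xml(html_to_fields("doc-1", page)))
```

Other document types (PDF, DOC, and so on) would need their own parser in place of html_to_fields; the step that builds the <add> message stays the same.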

Re: Indexing HTML and other doc types

2007-07-04 Thread Peter Manis
A coworker of mine posted the code that we used for adding PDF, DOC, XLS, etc. documents into Solr. You can find the files at the following location. https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel Just apply the patch and put the

Re: Indexing HTML and other doc types

2007-07-03 Thread Otis Gospodnetic
Otis -- Simpy -- http://www.simpy.com/ - Tag - Search - Share ----- Original Message ----- From: Teruhiko Kurosaka <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Tuesday, July 3, 2007 8:56:23 PM Subject: Indexing HTML and other doc types

Indexing HTML and other doc types

2007-07-03 Thread Teruhiko Kurosaka
Solr looks very good for indexing and searching structured data. But I noticed there is no tool in the Solr distribution with which documents of other doc types can be indexed. Are there other side projects that develop Solr clients for indexing documents of other doc types? Or is the generic f