RE: Indexing HTML and other doc types

Teruhiko Kurosaka Thu, 05 Jul 2007 17:02:26 -0700

Thank you, Otis and Peter, for your replies.

> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]


> doc of some type -> parse content into various fields -> post to Solr

I understand this part, but the question is who should do this.
I was under the assumption that it's the Solr client's job to crawl
the net, read documents, parse them, put the contents into different
fields (the "contents", title, author, date, URL, etc.), then post the
result to Solr via HTTP in XML or CSV.
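
To make the client-side flow concrete, here is a minimal sketch of
what I have in mind, using only the Python standard library. It is an
illustration, not an existing project: the field names (id, url,
title, content) are assumptions that would have to match the schema,
and the update URL is assumed to be the one used by the Solr example
setup. Crawling itself is left out.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)
        elif self.skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def html_to_fields(url, html):
    """Parse one HTML document into a dict of field name -> value.

    The field names are hypothetical; they must exist in the Solr schema.
    """
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return {
        "id": url,
        "url": url,
        "title": "".join(parser.title_parts).strip(),
        "content": " ".join(parser.text_parts),
    }


def fields_to_solr_xml(fields):
    """Build an <add><doc>...</doc></add> update message from the fields."""
    body = "".join(
        '<field name="%s">%s</field>' % (name, escape(value))
        for name, value in fields.items()
    )
    return "<add><doc>%s</doc></add>" % body


def post_to_solr(xml, update_url="http://localhost:8983/solr/update"):
    """POST the update message over HTTP. The URL is an assumed default."""
    request = Request(
        update_url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urlopen(request) as response:
        return response.status
```

A crawler would call html_to_fields() for each fetched page, pass the
result through fields_to_solr_xml(), and send it with post_to_solr().
A separate <commit/> message would still be needed before the
documents become searchable.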

And I was asking whether there are open-source projects for building
such clients.

Peter's approach is different; he adds the document-parsing
intelligence to Solr itself. (I guess the crawling still has to be
done by clients.)  I wonder whether this fits the model that Solr has.
Or is my picture of that model mistaken?

-kuro
