: > ...When we get to it, I'd like to hear why it (things like PDF parsing)
: > should be inside Solr rather than outside using our update interfaces....
:
: Same here.

I wouldn't say that i think it *should* be inside of Solr, just that it
*could* be inside of Solr.  the use case i imagine is when you have
multiple clients that all want to index PDF files according to some custom
rules to map pieces of the files to fields in your schema ... if they have
to send Solr XML data listing all the field=value pairs then they all have
to not only load the same PDF parsing library, but they also have to share
the same biz logic built in to understand what kinds of Solr XML documents
to produce and send to the server.

If you let people write their own PDFMUpdateHandler then all of those
clients can POST (or upload, refer via URL) the raw PDF file, and the
extraction logic is in one place.
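
To make that concrete, here is a rough sketch of the shape i have in mind.
Every name in it (RawUpdateHandler, SketchPdfUpdateHandler, the author_s
field) is made up for illustration and is not an existing Solr API, and the
"parsing" is faked by reading "key: value" lines so the example runs
without a real PDF library.  The point is only that the mapping rules live
in one class on the server instead of in every client.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface, NOT Solr's actual API: turns a raw posted body
// into the field=value pairs that would be indexed.
interface RawUpdateHandler {
    Map<String, String> toFields(byte[] rawBody);
}

// Stand-in for a custom PDF handler. A real one would call a PDF library
// here; this sketch just reads "key: value" lines from the body.
class SketchPdfUpdateHandler implements RawUpdateHandler {
    @Override
    public Map<String, String> toFields(byte[] rawBody) {
        Map<String, String> fields = new LinkedHashMap<>();
        String text = new String(rawBody, StandardCharsets.UTF_8);
        for (String line : text.split("\n")) {
            int idx = line.indexOf(':');
            if (idx <= 0) {
                continue; // not a "key: value" line, skip it
            }
            String key = line.substring(0, idx).trim().toLowerCase();
            String value = line.substring(idx + 1).trim();
            // The site-specific mapping rules live here, once, on the server.
            if (key.equals("title")) {
                fields.put("title", value);
            } else if (key.equals("author")) {
                fields.put("author_s", value); // assumed schema field name
            }
        }
        return fields;
    }
}
```

With something like this, clients only need to know how to send bytes; none
of them has to carry the parsing library or the mapping rules.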


At this point though, I can't for the life of me remember what Ryan said to
convince me that it made sense to have a DocumentParser concept that
UpdateHandlers could delegate to -- as opposed to the UpdateHandler doing
it directly :)
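
For reference, the delegating shape would look roughly like the sketch
below.  Again, all of the names here (DelegatingUpdateHandler,
WholeBodyParser, and this particular DocumentParser signature) are
assumptions for illustration, not anything that exists in Solr today, and
the parser just dumps the whole body into one "text" field.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

// Hypothetical: knows one document format and nothing about requests.
interface DocumentParser {
    Map<String, String> parse(byte[] raw);
}

// Hypothetical: deals with the update request and hands the raw body
// to whichever parser it was configured with.
class DelegatingUpdateHandler {
    private final DocumentParser parser;

    DelegatingUpdateHandler(DocumentParser parser) {
        this.parser = parser;
    }

    Map<String, String> handle(byte[] rawBody) {
        return parser.parse(rawBody);
    }
}

// Trivial parser used only to make the sketch runnable: the entire body
// becomes a single "text" field.
class WholeBodyParser implements DocumentParser {
    @Override
    public Map<String, String> parse(byte[] raw) {
        return Collections.singletonMap(
            "text", new String(raw, StandardCharsets.UTF_8));
    }
}
```

The non-delegating alternative is simply an UpdateHandler whose handle
method contains the parsing code itself.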

: I haven't had time to follow the recent (rich) design discussions
: about this stuff, but if I was designing this, I'd put all the
: document processing code in a separate module (separate servlet?) and

never fret ... i too want to keep Solr lean.  The idea (in my mind anyway)
is that there are very few out of the box UpdateHandlers (one for XML, one
for CSV, probably one for JDBC) but that there could be lots of contrib
style Updaters that know how to deal with different exotic document types,
which users could load if they wanted to.




-Hoss
