: > ...When we get to it, I'd like to hear why it (things like PDF parsing)
: > should be inside Solr rather than outside using our update interfaces....
:
: Same here.

I wouldn't say that I think it *should* be inside of Solr, just that it
*could* be inside of Solr. The use case I imagine is when you run an
operation in which multiple clients all want to index PDF files according
to some custom rules that map pieces of the files to fields in your
schema ... if they have to send Solr XML data listing all the field=value
pairs, then they not only all have to load the same PDF parsing library,
they also all have to share the same biz logic for deciding what kinds of
Solr XML documents to produce and send to the server.

If you let people write their own PDFMUpdateHandler, then all of those
clients can POST (or upload, or refer to via URL) the raw PDF file, and the
extraction logic lives in one place.

At this point though, I can't for the life of me remember what Ryan said to
convince me that it made sense to have a DocumentParser concept that
UpdateHandlers could delegate to -- as opposed to the UpdateHandler doing
it directly :)

: I haven't had time to follow the recent (rich) design discussions
: about this stuff, but if I was designing this, I'd put all the
: document processing code in a separate module (separate servlet?) and

Never fret ... I too want to keep Solr lean. The idea (in my mind anyway)
is that there are very few out-of-the-box UpdateHandlers (one for XML, one
for CSV, probably one for JDBC), but that there could be lots of
contrib-style Updaters that know how to deal with different exotic document
types, which users could load if they wanted to.

-Hoss
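
A rough sketch of the "extraction logic in one place" idea, purely for illustration. None of these names are real Solr APIs: `RawUpdateHandler`, `FakePdfUpdateHandler` and `UpdateDispatcher` are all made up, and the fake handler parses plain "key: value" text where a real contrib handler would call an actual PDF parsing library. The point is only that clients send raw bytes plus a content type, and the field-mapping rules live on the server.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface, NOT a real Solr API: turns a raw uploaded
// body into the field=value pairs the schema expects.
interface RawUpdateHandler {
    Map<String, String> toFields(byte[] rawBody);
}

// Stand-in for a contrib-style PDF handler. A real one would run a PDF
// parsing library here; this fake treats the body as "key: value" lines.
class FakePdfUpdateHandler implements RawUpdateHandler {
    @Override
    public Map<String, String> toFields(byte[] rawBody) {
        Map<String, String> fields = new LinkedHashMap<>();
        String text = new String(rawBody, StandardCharsets.UTF_8);
        for (String line : text.split("\n")) {
            int sep = line.indexOf(':');
            if (sep < 0) {
                continue; // not a "key: value" line
            }
            String key = line.substring(0, sep).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(sep + 1).trim();
            // The shared mapping rules live here, once, on the server.
            // The schema field names below are assumptions for the example.
            if (key.equals("title")) {
                fields.put("title", value);
            } else if (key.equals("author")) {
                fields.put("author_s", value);
            }
            // Any other key is dropped: it has no mapping in this sketch.
        }
        return fields;
    }
}

// Server-side registry: picks a handler by content type, so clients only
// need to send the raw file and say what it is.
class UpdateDispatcher {
    private final Map<String, RawUpdateHandler> handlers = new LinkedHashMap<>();

    void register(String contentType, RawUpdateHandler handler) {
        handlers.put(contentType, handler);
    }

    Map<String, String> handle(String contentType, byte[] rawBody) {
        RawUpdateHandler handler = handlers.get(contentType);
        if (handler == null) {
            throw new IllegalArgumentException("no handler for " + contentType);
        }
        return handler.toFields(rawBody);
    }
}
```

With something shaped like this, adding support for another exotic document type is one more `register(...)` call on the server, and no client has to change or load a parsing library.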