Hi Grant,

Thanks for the quick response. My colleague and I both looked into the code a bit. Here is what I see (my Java sucks):

http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/SolrContentHandler.java

    //handle the literals from the params
    Iterator<String> paramNames = params.getParameterNamesIterator();
    while (paramNames.hasNext()) {
      String name = paramNames.next();
      if (name.startsWith(LITERALS_PREFIX)) {
        String fieldName = name.substring(LITERALS_PREFIX.length());
        //no need to map names here, since they are literals from the user
        SchemaField schFld = schema.getFieldOrNull(fieldName);
        if (schFld != null) {
          String value = params.get(name);
          boost = getBoost(fieldName);
          //no need to transform here, b/c we can assume the user sent it in correctly
          document.addField(fieldName, value, boost);
        } else {
          handleUndeclaredField(fieldName);
        }
      }
    }

I don't know the Solr source well enough to know whether document.addField() can take a struct in the form of some serialized string. As far as I can tell, the loop above reads a single value per literal via params.get(name), so how can I pass a multi-valued field via a file-upload/multi-part POST?

One idea: as one of the POST fields, I could add an XML payload that the XML handler could parse. We could then instantiate that handler, pass in the doc by reference, and get its multi-valued fields all populated nicely.

This perhaps isn't a fantastic solution, though. I'm really not much of a Java programmer, so I would love to hear your expert opinion on how to solve this.

Best,
J

On Fri, Dec 12, 2008 at 6:40 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> Hmmm, I think I see the disconnect, but I'm not sure. Sending to the ERH
> (ExtractingReqHandler) is not an XML command at all, it's a file-upload/
> multi-part encoding. I think you will need an API that does something like:
>
> (Just making this up, this is not real code)
> File file = new File(fileToIndex)
> resp = solr.addFile(file, params);
> ----
>
> Where params contains the literals, captures, etc.
> Then, in your API you
> need to do whatever PHP does to send that file as a multipart file (I think
> you can also POST it, too, but that has some downsides as described on the
> wiki)
>
> I'll try to whip up some SolrJ sample code, as I know others have asked for
> that.
>
> -Grant
>
> On Dec 12, 2008, at 5:34 AM, Jacob Singh wrote:
>
>> Hi Grant,
>>
>> Happy to.
>>
>> Currently we are sending over documents by building a big XML file of
>> all of the fields of that document. Something like this:
>>
>> $document = new Apache_Solr_Document();
>> $document->id = apachesolr_document_id($node->nid);
>> $document->title = $node->title;
>> $document->body = strip_tags($text);
>> $document->type = $node->type;
>> foreach ($categories as $cat) {
>>   $document->setMultiValue('category', $cat);
>> }
>>
>> The PHP client library then takes all of this and builds it into an
>> XML payload, which we POST over to Solr.
>>
>> When we implement rich file handling, I see these instructions:
>>
>> -----------------------------
>> Literals
>>
>> To add in your own metadata, pass in the literal parameter along with the
>> file:
>>
>> curl
>> http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1
>> -F "tutori...@tutorial.pdf"
>>
>> -----------------------------
>>
>> So it seems we can:
>>
>> a) Refactor the class to not generate XML, but rather to build post
>> headers for each field. We would like to avoid this.
>> b) Instead, I was hoping we could send the XML payload with all the
>> literal fields defined (like id, type, etc.), and the post fields
>> required for the file content and the field it belongs to, in one
>> request.
>>
>> Since my understanding is that docs in Solr are immutable, there is no:
>> c) Send the file contents over, give it an ID, and then send over the
>> rest of the fields and merge into that ID.
>>
>> If the unfortunate answer is a, then how do we deal with multi-value
>> fields? I don't know how to format them given the ext.literal format
>> above.
>>
>> Thanks for your help and awesome contributions!
>>
>> -Jacob
>>
>> On Fri, Dec 12, 2008 at 4:52 AM, Grant Ingersoll <gsing...@apache.org>
>> wrote:
>>>
>>> On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote:
>>>
>>>> Hey folks,
>>>>
>>>> I'm looking at implementing ExtractingRequestHandler in the
>>>> Apache_Solr_PHP
>>>> library, and I'm wondering what we can do about adding meta-data.
>>>>
>>>> I saw the docs, which suggest you use different post headers to pass
>>>> field
>>>> values along with ext.literal. Is there any way to use the
>>>> XmlUpdateHandler
>>>> instead along with a document? I'm not sure how this would work,
>>>> perhaps it
>>>> would require 2 trips, perhaps the XML would be in the post "content"
>>>> and
>>>> the file in something else? The thing is we would need to refactor the
>>>> class pretty heavily in this case when indexing RichDocs and we were
>>>> hoping
>>>> to avoid it.
>>>>
>>>
>>> I'm not sure I follow how the XmlUpdateHandler plays in, can you explain
>>> a little more? My PHP is weak, but maybe some code will help...
>>>
>>>> Thanks,
>>>> Jacob
>>>> --
>>>>
>>>> +1 510 277-0891 (o)
>>>> +91 9999 33 7458 (m)
>>>>
>>>> web: http://pajamadesign.com
>>>>
>>>> Skype: pajamadesign
>>>> Yahoo: jacobsingh
>>>> AIM: jacobsingh
>>>> gTalk: jacobsi...@gmail.com
>>>
>>
>> --
>>
>> +1 510 277-0891 (o)
>> +91 9999 33 7458 (m)
>>
>> web: http://pajamadesign.com
>>
>> Skype: pajamadesign
>> Yahoo: jacobsingh
>> AIM: jacobsingh
>> gTalk: jacobsi...@gmail.com
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>

--

+1 510 277-0891 (o)
+91 9999 33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com
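
P.S. To make the multi-valued question concrete, here is a rough sketch of the behavior I'm hoping for, so take it with a grain of salt. It does not use Solr's classes at all: the params map, the LiteralParams class, and collectLiterals() are all made up for illustration, and the "ext.literal." prefix is only my assumption based on the curl example from the wiki. The idea is that a literal parameter repeated in the POST would contribute one field value per occurrence:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the literal handling; not Solr code.
class LiteralParams {
    // Assumed prefix, taken from the ext.literal.* parameters in the wiki's curl example.
    static final String LITERALS_PREFIX = "ext.literal.";

    // Collects every value of every literal parameter, keyed by field name.
    // The map stands in for the request parameters: name -> all values sent.
    static Map<String, List<String>> collectLiterals(Map<String, String[]> params) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : params.entrySet()) {
            String name = entry.getKey();
            if (!name.startsWith(LITERALS_PREFIX)) {
                continue; // not a literal, e.g. ext.def.fl
            }
            String fieldName = name.substring(LITERALS_PREFIX.length());
            List<String> values = fields.computeIfAbsent(fieldName, k -> new ArrayList<>());
            // Keep all values instead of only the first one.
            for (String value : entry.getValue()) {
                values.add(value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String[]> params = new LinkedHashMap<>();
        params.put("ext.literal.id", new String[] {"node-42"});
        params.put("ext.literal.category", new String[] {"news", "sports"});
        params.put("ext.def.fl", new String[] {"text"});
        System.out.println(collectLiterals(params));
    }
}
```

Running it prints {id=[node-42], category=[news, sports]}, so "category" ends up with two values. Whether the real handler can read repeated parameters this way is exactly the part I'm unsure about.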