No, I didn't mean storing the binary along with it, just that I could send a binary file (or a text file) which Tika could process and store along with the XML which describes its literal metadata.
Best,
Jacob

On Mon, Dec 15, 2008 at 7:17 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> On Dec 15, 2008, at 8:20 AM, Jacob Singh wrote:
>
>> Hi Erik,
>>
>> Sorry I wasn't totally clear. Some responses inline:
>>>
>>> If the file is visible from the Solr server, there is no need to actually
>>> send the bits through HTTP. Solr's content stream capabilities allow a
>>> file to be retrieved from Solr itself.
>>>
>>
>> Yeah, I know. But in my case not possible. Perhaps a simple file
>> receiving HTTP POST handler which simply stored the file on disk and
>> returned a path to it is the way to go here.
>>
>>>> So I could send the file, and receive back a token which I would then
>>>> throw into one of my fields as a reference. Then using it to map tika
>>>> fields as well. like:
>>>>
>>>> <str name="file_mod_date">${FILETOKEN}.last_modified</str>
>>>>
>>>> <str name="file_body">${FILETOKEN}.content</str>
>>>
>>> Huh? I don't follow the file token thing. Perhaps you're thinking
>>> you'll post the file, then later update other fields on that same
>>> document.
>>> An important point here is that Solr currently does not have document
>>> update capabilities. A document can be fully replaced, but cannot have
>>> fields added to it, once indexed. It needs to be handled all in one shot
>>> to accomplish the blending of file/field indexing. Note the
>>> ExtractingRequestHandler already has the field mapping capability.
>>>
>>
>> Sorta... I was more thinking of a new feature wherein a Solr Request
>> handler doesn't actually put the file in the index, merely runs it
>> through tika and stores a datastore which links a "token" with the
>> tika extraction. Then the client could make another request w/ the
>> XMLUpdateHandler which referenced parts of the stored tika extraction.
>>
>
> Hmmm, thinking out loud....
>
> Override SolrContentHandler. It is responsible for mapping the Tika output
> to a Solr Document.
> Capture all the content into a single buffer.
> Add said buffer to a field that is stored only
> Add a second field that is indexed. This is your "token". You could, just
> as well, have that token be the only thing that gets returned by extract
> only.
>
> Alternately, you could implement an UpdateProcessor thingamajob that takes
> the output and stores it to the filesystem and just adds the token to a
> document.
>
>
>>> But, here's a solution that will work for you right now... let Tika
>>> extract the content and return back to you, then turn around and post it
>>> and whatever other fields you like:
>>>
>>> <http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput>
>>>
>>> In that example, the contents aren't being indexed, just returned back to
>>> the client. And you can leverage the content stream capability with this
>>> as well avoiding posting the actual binary file, pointing the extracting
>>> request to a file path visible by Solr.
>>>
>>
>> Yeah, I saw that. This is pretty much what I was talking about above,
>> the only disadvantage (which is a deal breaker in our case) is the
>> extra bandwidth to move the file back and forth.
>>
>> Thanks for your help and quick response.
>>
>> I think we'll integrate the POST fields as Grant has kindly provided
>> multi-value input now, and see what happens in the future. I realize
>> what I'm talking about (XML and binary together) is probably not a
>> high priority feature.
>>
>
> Is the use case this:
>
> 1. You want to assign metadata and also store the original and have it
> stored in binary format, too? Thus, Solr becomes a backing, searchable
> store?
>
> I think we could possibly add an option to serialize the ContentStream onto
> a Field on the Document. In other words, store the original with the
> Document. Of course, buyer beware on the cost of doing so.
>

--
+1 510 277-0891 (o)
+91 9999 33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com
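
For anyone following along, the "token" idea discussed in this thread could be sketched roughly as below. This is only an illustration of the concept, not anything Solr or Tika provides: `ExtractionStore`, `make_ref`, and `resolve_fields` are hypothetical names, the store is a plain in-memory dict, and the extracted values are made-up sample data standing in for real Tika output.

```python
import re
import uuid


class ExtractionStore:
    """Hypothetical in-memory store linking a token to extracted fields."""

    def __init__(self):
        self._by_token = {}

    def put(self, extracted):
        # Hand back an opaque token the client can reference later.
        token = uuid.uuid4().hex
        self._by_token[token] = dict(extracted)
        return token

    def get(self, token, field):
        return self._by_token[token][field]


# Matches values of the form ${TOKEN}.field_name, as in the thread's example.
_REF = re.compile(r"^\$\{(?P<token>[0-9a-f]+)\}\.(?P<field>\w+)$")


def make_ref(token, field):
    """Build a ${TOKEN}.field reference string."""
    return "${" + token + "}." + field


def resolve_fields(fields, store):
    """Replace ${TOKEN}.field references with the stored extraction values."""
    resolved = {}
    for name, value in fields.items():
        match = _REF.match(value)
        if match:
            resolved[name] = store.get(match["token"], match["field"])
        else:
            resolved[name] = value
    return resolved


# Step 1: the extraction output (sample values, assumed) is stored under a token.
store = ExtractionStore()
token = store.put({
    "last_modified": "2008-12-15T19:17:00Z",
    "content": "extracted body text",
})

# Step 2: a later update references parts of the stored extraction by token.
doc = resolve_fields(
    {
        "id": "doc-1",
        "file_mod_date": make_ref(token, "last_modified"),
        "file_body": make_ref(token, "content"),
    },
    store,
)
```

A real version would need the store to live on the server side and to expire tokens that are never referenced, which this sketch leaves out.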