Re: ExtractingRequestHandler and XmlUpdateHandler

Jacob Singh Tue, 16 Dec 2008 02:44:25 -0800

No, I didn't mean storing the binary itself in the index, just that I
could send a binary file (or a text file) which Tika would process, with
the extracted content stored alongside the XML that describes the
document's literal metadata.
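
To make that concrete, here is a rough sketch of the single-request shape I have in mind: the file goes in the POST body and the literal metadata rides along as request parameters. The names here are assumptions, not something confirmed in this thread: the `/update/extract` path, the `literal.<field>`, `fmap.<source>` and `extractOnly` parameter names follow the convention in released Solr versions and may differ in the current patch, and `build_extract_url` is a hypothetical helper. It only builds the URL; it does not send anything.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, literals, field_map=None, extract_only=False):
    """Build a request URL for the extracting handler (hypothetical helper).

    literals:  metadata supplied by the client, e.g. {"id": "doc1"}.
               A list or tuple value produces one parameter per entry
               (multi-valued field).
    field_map: Tika field name -> Solr field name, e.g.
               {"last_modified": "file_mod_date"}.
    """
    params = []
    for name, value in literals.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            # Assumed parameter convention: literal.<fieldname>=<value>
            params.append((f"literal.{name}", v))
    for source, target in (field_map or {}).items():
        # Assumed parameter convention: fmap.<tika field>=<solr field>
        params.append((f"fmap.{source}", target))
    if extract_only:
        # Ask for the extraction to be returned instead of indexed.
        params.append(("extractOnly", "true"))
    return f"{base_url.rstrip('/')}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983/solr",
    {"id": "doc1", "tag": ["a", "b"]},
    field_map={"last_modified": "file_mod_date", "content": "file_body"},
)
```

The binary file would then be POSTed to `url` as the request body, so metadata and content arrive in one shot and no second update is needed.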


Best,
Jacob

On Mon, Dec 15, 2008 at 7:17 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> On Dec 15, 2008, at 8:20 AM, Jacob Singh wrote:
>
>> Hi Erik,
>>
>> Sorry I wasn't totally clear.  Some responses inline:
>>>
>>> If the file is visible from the Solr server, there is no need to actually
>>> send the bits through HTTP.  Solr's content stream capabilities allow a
>>> file
>>> to be retrieved from Solr itself.
>>>
>>
>> Yeah, I know.  But in my case not possible.   Perhaps a simple file
>> receiving HTTP POST handler which simply stored the file on disk and
>> returned a path to it is the way to go here.
>>
>>>> So I could send the file, and receive back a token which I would then
>>>> throw into one of my fields as a reference.  Then using it to map tika
>>>> fields as well. like:
>>>>
>>>> <str name="file_mod_date">${FILETOKEN}.last_modified</str>
>>>>
>>>> <str name="file_body">${FILETOKEN}.content</str>
>>>
>>> Huh?  I don't follow the file token thing.  Perhaps you're thinking
>>> you'll post the file, then later update other fields on that same
>>> document.
>>> An important point here is that Solr currently does not have document
>>> update capabilities.  A document can be fully replaced, but cannot have
>>> fields added to it, once indexed.  It needs to be handled all in one shot
>>> to
>>> accomplish the blending of file/field indexing.  Note the
>>> ExtractingRequestHandler already has the field mapping capability.
>>>
>>
>> Sorta... I was more thinking of a new feature wherein a Solr Request
>> handler doesn't actually put the file in the index, merely runs it
>> through tika and stores a datastore which links a "token" with the
>> tika extraction.  Then the client could make another request w/ the
>> XMLUpdateHandler which referenced parts of the stored tika extraction.
>>
>
> Hmmm, thinking out loud....
>
> Override SolrContentHandler.  It is responsible for mapping the Tika output
> to a Solr Document.
> Capture all the content into a single buffer.
> Add said buffer to a field that is stored only
> Add a second field that is indexed.  This is your "token".  You could, just
> as well, have that token be the only thing that gets returned by extract
> only.
>
> Alternately, you could implement an UpdateProcessor thingamajob that takes
> the output and stores it to the filesystem and just adds the token to a
> document.
>
>
>
>
>
>>> But, here's a solution that will work for you right now... let Tika
>>> extract
>>> the content and return back to you, then turn around and post it and
>>> whatever other fields you like:
>>>
>>> <http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput>
>>>
>>> In that example, the contents aren't being indexed, just returned back to
>>> the client.  And you can leverage the content stream capability with this
>>> as
>>> well avoiding posting the actual binary file, pointing the extracting
>>> request to a file path visible by Solr.
>>>
>>
>> Yeah, I saw that.  This is pretty much what I was talking about above,
>> the only disadvantage (which is a deal breaker in our case) is the
>> extra bandwidth to move the file back and forth.
>>
>> Thanks for your help and quick response.
>>
>> I think we'll integrate the POST fields as Grant has kindly provided
>> multi-value input now, and see what happens in the future.  I realize
>> what I'm talking about (XML and binary together) is probably not a
>> high priority feature.
>>
>
> Is the use case this:
>
> 1. You want to assign metadata and also store the original and have it
> stored in binary format, too?  Thus, Solr becomes a backing, searchable
> store?
>
> I think we could possibly add an option to serialize the ContentStream onto
> a Field on the Document.  In other words, store the original with the
> Document.  Of course, buyer beware on the cost of doing so.
>
>



-- 

+1 510 277-0891 (o)
+91 9999 33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com
