bq: I'd ideally like to put the burden of tika-extraction into the Solr-process.

Why? That puts the entire parsing burden on the Solr
machine. Under any significant indexing load, parsing the doc
may become a bottleneck. If you do the Tika extraction on the client,
you can spread that (sometimes quite heavy) load over as
many clients as you can muster without adversely affecting
searching. And that would increase your maximum indexing rate.

Or perhaps I'm just not understanding the division you need here.

If you really would be better served by doing the extraction on the
Solr side, see the ExtractingRequestHandler (ERH) as another option
here:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

So you'd have something like this:
gather the metadata you care about from your system, include
all those fields as "literals" on the call to SolrCell, and
let ERH extract the text content from the doc and index
it along with the data you've passed as literals.
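
To make the "literals" idea concrete, here is a minimal sketch (plain Java, no SolrJ dependency) of how the request URL for ERH could be assembled. The literal.<fieldname> parameter convention comes from Solr Cell; the host, port, core name (collection1) and the field names id and title are assumptions for illustration only. The file itself would then be sent as the body of a POST to that URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: build the URL for a Solr Cell (ERH) request in which the client
// passes its own metadata as literal.<field> parameters.
class ExtractUrlSketch {

    // Assumed location of the extract handler; adjust host, port and core.
    static final String EXTRACT_URL =
            "http://localhost:8983/solr/collection1/update/extract";

    // Appends one literal.<field>=<value> pair per map entry, in map order.
    static String buildExtractUrl(Map<String, String> literals) {
        StringBuilder url = new StringBuilder(EXTRACT_URL);
        char separator = '?';
        for (Map.Entry<String, String> e : literals.entrySet()) {
            url.append(separator)
               .append(encode("literal." + e.getKey()))
               .append('=')
               .append(encode(e.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // Form-encodes a parameter name or value as UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical metadata known to the client, not taken from the file.
        Map<String, String> literals = new LinkedHashMap<>();
        literals.put("id", "doc-42");
        literals.put("title", "Annual report");
        System.out.println(buildExtractUrl(literals));
    }
}
```

With the two literals shown in main, this produces a URL ending in /update/extract?literal.id=doc-42&literal.title=Annual+report. ERH then adds the extracted text to the same document.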

One word of caution if you want to extract metadata _from the
document_. Unless you have very uniform documents, this
gets "interesting". Say you wanted to pull out the "last edited"
date _from the document_. Word might have a meta-data
field conveniently named "last_edited", which you could map
into your Solr schema. However, a PDF file might have a field
"latest_change" expressing the same concept. Note: the names
are made up, the point is there's no standard amongst different
types of docs. I find all this easier to deal with in a SolrJ program
using Tika, but I admit that's largely a matter of where my comfort
zone is.

The previous paragraph of course doesn't apply at all if all you care
about is the text.
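
If you do need the metadata, the mapping problem described above can be handled on the client with a small alias table. This is only a sketch in plain Java: the source names last_edited and latest_change are the made-up ones from above, and the target Solr field last_modified is likewise an assumption, not something Tika or Solr defines.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: fold differently named, format-specific metadata keys into one
// Solr field before building the document on the client.
class MetadataMappingSketch {

    // Hypothetical alias table: extracted metadata name -> Solr field name.
    static final Map<String, String> FIELD_ALIASES = new HashMap<>();
    static {
        FIELD_ALIASES.put("last_edited", "last_modified");   // e.g. from Word
        FIELD_ALIASES.put("latest_change", "last_modified"); // e.g. from PDF
    }

    // Keeps only the metadata entries that have a known alias and renames
    // them to the Solr field; everything else is dropped.
    static Map<String, String> normalize(Map<String, String> extracted) {
        Map<String, String> solrFields = new HashMap<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            String solrField = FIELD_ALIASES.get(e.getKey());
            if (solrField != null) {
                solrFields.put(solrField, e.getValue());
            }
        }
        return solrFields;
    }

    public static void main(String[] args) {
        Map<String, String> fromPdf = new HashMap<>();
        fromPdf.put("latest_change", "2014-09-12");
        fromPdf.put("producer", "some tool"); // no alias, so it is ignored
        System.out.println(normalize(fromPdf));
    }
}
```

Dropping unknown keys is a design choice for the sketch; you could just as well route them to a catch-all field.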

Even so, I'm still puzzled by why you see it as advantageous to
make this split. If I understand correctly, you already have
a SolrJ program to find the docs to send to Solr. You're already
going to have to send SolrInputDocuments to Solr with certain
metadata. Adding the Tika extraction is just a few lines and has,
from my perspective, several near- and long-term advantages.

Which probably just means I don't understand your problem
space in sufficient depth....

Best,
Erick

On Fri, Sep 12, 2014 at 11:36 PM, Clemens Wyss DEV <clemens...@mysign.ch> wrote:
> Erick, thanks for your input. You are right that the "miraculous connection"
> is not always that miraculous ;)
>
> In your example the extraction is being done on the client side. But as I
> said, I'd ideally like to put the burden of tika-extraction into the
> Solr-process. All fields but the file-content-based fields should be filled
> on the client side, and only the file-content-based fields shall be extracted
> (before indexing) in Solr. So it would "only" be the files that needed to be
> "shared"
>
> --Clemens
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, 12 September 2014 17:57
> To: solr-user@lucene.apache.org
> Subject: Re: SolrJ : fieldcontent from (multiple) file(s)
>
> bq: I could of course push in the filename(s) in a field, but this would 
> require Solr (due to field-type e.g. "filecontent") to extract the content 
> from the given file.
>
> Why? If you're already dealing with SolrJ, you do all the work you need to 
> there by adding fields to a SolrInputDocument, including any metadata and 
> content your client extracts. Here's an example that uses Tika (shipped with 
> Solr) to do just that, as well as extract DB contents etc.
>
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Fri, Sep 12, 2014 at 5:55 AM, Clemens Wyss DEV <clemens...@mysign.ch> 
> wrote:
>> Thanks Alex,
>>> Do you just care about document content?
>> content only.
>>
>> The documents (not necessarily coming from a DB) are being pushed (through
>> SolrJ). This is at least the initial idea, mainly due to the dynamic nature
>> of our index/search architecture.
>> I could of course push in the filename(s) in a field, but this would require
>> Solr (due to field-type e.g. "filecontent") to extract the content from the
>> given file. Is something like this possible in Solr-indexing?
>>
>>> DataImportHandler
>> Would I need to write a custom DIH? Or can the DIH be used as is, i.e. just
>> configured through the data-config.xml?
>>
>>> nested entities design
>> Could you link me to this concept/idea?
>>
>> -----Original Message-----
>> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
>> Sent: Friday, 12 September 2014 14:12
>> To: solr-user
>> Subject: Re: SolrJ : fieldcontent from (multiple) file(s)
>>
>> Do you just care about document content? Not metadata, such as file name, 
>> date, author, etc?
>>
>> Does it have to be a push into Solr, or can it be a pull? If pull,
>> DataImportHandler should be able to do what you want with a nested
>> entities design.
>>
>> Regards,
>>    Alex.
>> Personal: http://www.outerthoughts.com/ and @arafalov
>> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
>> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>>
>>
>> On 12 September 2014 06:53, Clemens Wyss DEV <clemens...@mysign.ch> wrote:
>>> Looks like I haven't finished "I know"
>>> I know I could extract the content on our server's side, but I'd really
>>> like to take that burden off it.
>>> That said:
>>> Can I hand in the path-to-the-file in a "specific field" which would yield
>>> an extraction in Solr?
>>>
>>> -----Original Message-----
>>> From: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
>>> Sent: Friday, 12 September 2014 11:30
>>> To: 'solr-user@lucene.apache.org'
>>> Subject: SolrJ : fieldcontent from (multiple) file(s)
>>>
>>> First of all, I'd like to say hello to the Solr world/community ;) So far
>>> we have been using Lucene as-is and now intend to go for Solr.
>>>
>>> Say I have a document which in one field should have the content of
>>> a file (indexed only, not stored), in order to make the document
>>> searchable due to the file's content. I know
>>>
>>> How is this achieved using SolrJ, i.e. how do I hand in this document?
>>>
>>> Thx
>>> Clemens
>>>
