Re: index multiple files into one index entity

Erick Erickson Sun, 26 May 2013 07:47:26 -0700

I'm still not quite getting the issue. Each separate request (i.e. each
addition of a SolrInputDocument) is treated as a separate document.
There's no notion of "append the contents of one doc to another based
on ID", unless you're doing atomic updates.


And Tika takes some care to index separate files as separate documents.
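
To make that concrete, here is a toy model in plain Java of what happens
when several documents arrive with the same uniqueKey. It is only a sketch:
it is not SolrJ and not how Solr is implemented, and the class and method
names are made up for illustration. The point is that an add with an
existing id replaces the stored document rather than merging into it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of uniqueKey semantics; hypothetical names, NOT Solr code. */
final class UniqueKeyModel {
    // Maps the uniqueKey value to the single document stored under it.
    private final Map<String, Map<String, Object>> docs = new LinkedHashMap<>();

    /** Adds a document; an existing document with the same "id" is replaced, not merged. */
    void add(Map<String, Object> doc) {
        docs.put((String) doc.get("id"), doc);
    }

    Map<String, Object> get(String id) {
        return docs.get(id);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UniqueKeyModel index = new UniqueKeyModel();
        // Three files sent under the same id, as in the question.
        for (String file : List.of("a.pdf", "b.doc", "c.txt")) {
            index.add(Map.of("id", "set-1", "contentfilename", file));
        }
        // Only one document survives, and it is the last one added.
        System.out.println(index.size() + " doc(s), contentfilename="
                + index.get("set-1").get("contentfilename"));
    }
}
```

That matches the log in the original post: six adds with the same id, and
only the last one is left in the index.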

Now, if you don't need these to share the same uniqueKey, you might
index them as separate documents and include a field that lets you
associate these documents somehow (see the group/field collapsing Wiki
page).
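
A minimal sketch of that idea in plain Java, with no SolrJ dependency. The
"setid" field name and the id scheme (set id plus a running index) are my
assumptions, not something from your schema; "contentfilename" is taken
from your field list. Each file gets its own unique id, and the shared
"setid" value is what ties the files of one set together.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds one document per file; hypothetical helper, not part of Solr. */
final class GroupedDocs {
    static List<Map<String, String>> buildDocs(String setId, List<String> fileNames) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (int i = 0; i < fileNames.size(); i++) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", setId + "_" + i);               // truly unique per file
            doc.put("setid", setId);                      // assumed grouping field, same for the whole set
            doc.put("contentfilename", fileNames.get(i)); // per-file attribute
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        // Print the documents that would be sent, one per file.
        for (Map<String, String> doc : buildDocs("26afa5dc", List.of("a.pdf", "b.doc"))) {
            System.out.println(doc);
        }
    }
}
```

At query time, something like group=true&group.field=setid should then
collapse the per-file hits into one group per set. As far as I know the
grouping field has to be single-valued, so it would need its own field
definition.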

But otherwise, I think I need a higher-level view of what you're
trying to accomplish to make an intelligent comment.

Best
Erick

On Thu, May 23, 2013 at 9:05 AM,  <mark.ka...@t-systems.com> wrote:
> Hello Erick,
> Thank you for your fast answer.
>
> Maybe I didn't explain my question clearly.
> I want to index many files into one index entity. I want the same behavior as
> with any other multivalued field, whose values can all be indexed under one
> unique id. So I think every ContentStreamUpdateRequest represents one index
> entity, doesn't it? And with each addContentStream I add one file to this
> entity.
>
> Thank you and best regards
> Mark
>
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, May 23, 2013 14:11
> To: solr-user@lucene.apache.org
> Subject: Re: index multiple files into one index entity
>
> I just skimmed your post, but I'm responding to the last bit.
>
> If you have <uniqueKey> defined as "id" in schema.xml then no, you cannot 
> have multiple documents with the same ID.
> Whenever a new doc comes in it replaces the old doc with that ID.
>
> You can remove the <uniqueKey> definition and do what you want, but there are
> very few Solr installations with no <uniqueKey> and it's probably a better
> idea to make your ids truly unique.
>
> Best
> Erick
>
> On Thu, May 23, 2013 at 6:14 AM,  <mark.ka...@t-systems.com> wrote:
>> Hello Solr team,
>>
>> I want to index multiple files into one Solr index entity, with the
>> same id. We are using Solr 4.1.
>>
>>
>> I tried it with the following source fragment:
>>
>>     public void addContentSet(ContentSet contentSet) throws SearchProviderException {
>>
>>                                 ...
>>
>>             ContentStreamUpdateRequest csur = generateCSURequest(contentSet.getIndexId(), contentSet);
>>             String indexId = contentSet.getIndexId();
>>
>>             ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
>>             server.request(csur);
>>
>>                                 ...
>>     }
>>
>>     private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet)
>>             throws IOException {
>>         ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest(confStore.getExtractUrl());
>>
>>         ModifiableSolrParams parameters = csur.getParams();
>>         if (parameters == null) {
>>             parameters = new ModifiableSolrParams();
>>         }
>>
>>         parameters.set("literalsOverride", "false");
>>
>>         // maps the Tika default content attribute to the attribute with name 'fulltext'
>>         parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
>>         // create an empty content stream, this seems necessary for ContentStreamUpdateRequest
>>         csur.addContentStream(new ImaContentStream());
>>
>>         for (Content content : contentSet.getContentList()) {
>>             csur.addContentStream(new ImaContentStream(content));
>>             // for each content stream add additional attributes
>>             parameters.add("literal." + SearchSystemAttributeDef.CONTENT_ID.getName(),
>>                     content.getBinaryObjectId().toString());
>>             parameters.add("literal." + SearchSystemAttributeDef.CONTENT_KEY.getName(), content.getContentKey());
>>             parameters.add("literal." + SearchSystemAttributeDef.FILE_NAME.getName(), content.getContentName());
>>             parameters.add("literal." + SearchSystemAttributeDef.MIME_TYPE.getName(), content.getMimeType());
>>         }
>>
>>         parameters.set("literal.id", indexId);
>>
>>         // adding some other attributes
>>         ...
>>
>>         csur.setParams(parameters);
>>
>>         return csur;
>>     }
>>
>> During debugging I can see that the method 'server.request(csur)' reads the
>> buffer of each ImaContentStream.
>> Looking at the Solr catalina log, I can see that the attached files reach
>> the Solr servlet.
>>
>> INFO: Releasing directory:/data/V-4-1/master0/data/index
>> Apr 25, 2013 5:48:07 AM
>> org.apache.solr.update.processor.LogUpdateProcessor finish
>> INFO: [master0] webapp=/solr-4-1 path=/update/extract 
>> params={literal.searchconnectortest15_c8150e41_cc49_4a ...... 
>> &literal.id=26afa5dc-40ad-442a-ac79-0e7880c06aa1& .....
>> {add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720),
>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424),
>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910976610304),
>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910983950336),
>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910989193216),
>> 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910995484672)]} 0 58
>>
>>
>> But only the last file in the content list gets indexed.
>>
>>
>> My schema.xml has the following field definitions:
>>
>>     <field name="id" type="string" indexed="true" stored="true" required="true" />
>>     <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
>>
>>     <field name="contentkey" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="contentid" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="contentfilename" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="contentmimetype" type="string" indexed="true" stored="true" multiValued="true"/>
>>
>>     <field name="fulltext" type="text_general" indexed="true" stored="true" multiValued="true"/>
>>
>>
>> I'm using the Tika ExtractingRequestHandler, which can extract binary files.
>>
>>
>>
>>   <requestHandler name="/update/extract"
>>                   startup="lazy"
>>                   class="solr.extraction.ExtractingRequestHandler" >
>>     <lst name="defaults">
>>       <str name="lowernames">true</str>
>>       <str name="uprefix">ignored_</str>
>>
>>       <!-- capture link hrefs but ignore div attributes -->
>>       <str name="captureAttr">true</str>
>>       <str name="fmap.a">links</str>
>>       <str name="fmap.div">ignored_</str>
>>
>>     </lst>
>>   </requestHandler>
>>
>> Is it possible to index multiple files with the same id?
>> Is it necessary to implement my own RequestHandler?
>>
>> With best regards Mark
>>
>>
>>
