index multiple files into one index entity

2013-05-23 Thread Mark.Kappe
Hello Solr team,

I want to index multiple files into one Solr index entity, with the same id.
We are using Solr 4.1.


I tried it with the following source fragment:

public void addContentSet(ContentSet contentSet) throws SearchProviderException {

    ...

    ContentStreamUpdateRequest csur =
            generateCSURequest(contentSet.getIndexId(), contentSet);
    String indexId = contentSet.getIndexId();

    ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
    server.request(csur);

    ...
}

private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet)
        throws IOException {
    ContentStreamUpdateRequest csur =
            new ContentStreamUpdateRequest(confStore.getExtractUrl());

    ModifiableSolrParams parameters = csur.getParams();
    if (parameters == null) {
        parameters = new ModifiableSolrParams();
    }

    parameters.set("literalsOverride", "false");

    // map the Tika default content attribute to the attribute named 'fulltext'
    parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
    // create an empty content stream; this seems to be necessary for
    // ContentStreamUpdateRequest
    csur.addContentStream(new ImaContentStream());

    for (Content content : contentSet.getContentList()) {
        csur.addContentStream(new ImaContentStream(content));
        // for each content stream, add additional attributes
        parameters.add("literal." + SearchSystemAttributeDef.CONTENT_ID.getName(),
                content.getBinaryObjectId().toString());
        parameters.add("literal." + SearchSystemAttributeDef.CONTENT_KEY.getName(),
                content.getContentKey());
        parameters.add("literal." + SearchSystemAttributeDef.FILE_NAME.getName(),
                content.getContentName());
        parameters.add("literal." + SearchSystemAttributeDef.MIME_TYPE.getName(),
                content.getMimeType());
    }

    parameters.set("literal.id", indexId);

    // adding some other attributes
    ...

    csur.setParams(parameters);

    return csur;
}

During debugging I can see that the call 'server.request(csur)' reads the buffer
of each ImaContentStream.
When I look at the Solr Catalina log, I can see that the attached files reach the
Solr servlet:

INFO: Releasing directory:/data/V-4-1/master0/data/index
Apr 25, 2013 5:48:07 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [master0] webapp=/solr-4-1 path=/update/extract params={literal.searchconnectortest15_c8150e41_cc49_4a .. &literal.id=26afa5dc-40ad-442a-ac79-0e7880c06aa1& .
{add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910976610304), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910983950336), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910989193216), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910995484672)]} 0 58


But only the last entry in the content list is actually indexed.
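
For reference, this is roughly how I check what ends up in the index (just a
sketch, not our production code; dumpEntity and the URL handling are simplified):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public void dumpEntity(String solrUrl, String indexId) throws SolrServerException {
    // query the core directly and print the stored fields of the document
    // with the shared id; after the run above, only the values of the last
    // content stream are present
    HttpSolrServer queryServer = new HttpSolrServer(solrUrl);
    SolrQuery query = new SolrQuery("id:" + indexId);
    for (SolrDocument doc : queryServer.query(query).getResults()) {
        System.out.println(doc.getFieldValueMap());
    }
}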


My schema.xml has the following field definitions (the mailer stripped the
opening tags; '...' marks the lost parts):

<field name="id" ... required="true" />
<field name="..." ... stored="true" multiValued="true"/>

<field name="..." ... multiValued="true"/>
<field name="..." ... multiValued="true"/>
<field name="..." ... multiValued="true"/>
<field name="..." ... stored="true" multiValued="true"/>

<field name="..." ... stored="true" multiValued="true"/>

I'm using the Tika ExtractingRequestHandler, which can extract binary files.
My solrconfig.xml contains (restored from the stock example config, whose
values match what survived in the mail):

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

Is it possible to index multiple files under the same id?
Or is it necessary to implement my own RequestHandler?

With best regards Mark





Re: index multiple files into one index entity

2013-05-23 Thread Mark.Kappe
Hello Erick,
Thank you for your fast answer.

Maybe I didn't express my question clearly.
I want to index many files into one index entity, with the same behavior as any
other multivalued field, which can hold several values under one unique id.
So I think every ContentStreamUpdateRequest represents one index entity, doesn't
it? And with each addContentStream I add one file to this entity.
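
To make the expected behavior concrete: if I extracted the text myself with Tika
on the client side, one entity with a multivalued fulltext field would look
roughly like this (only a sketch; openStream() and the plain field names are
placeholders, our real names come from SearchSystemAttributeDef):

import java.io.InputStream;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public void addAsSingleDocument(SolrServer server, ContentSet contentSet) throws Exception {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", contentSet.getIndexId());

    for (Content content : contentSet.getContentList()) {
        // extract the plain text of each file on the client ...
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        InputStream in = content.openStream(); // placeholder accessor
        try {
            new AutoDetectParser().parse(in, handler, new Metadata());
        } finally {
            in.close();
        }
        // ... and append it as one more value of the multivalued field
        doc.addField("fulltext", handler.toString());
        doc.addField("file_name", content.getContentName());
    }

    server.add(doc); // one document, one id, many values
    server.commit();
}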

Thank you and with best regards,
Mark




-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, 23 May 2013 14:11
To: solr-user@lucene.apache.org
Subject: Re: index multiple files into one index entity

I just skimmed your post, but I'm responding to the last bit.

If you have <uniqueKey> defined as "id" in schema.xml then no, you cannot have
multiple documents with the same ID.
Whenever a new doc comes in, it replaces the old doc with that ID.

You can remove the <uniqueKey> definition and do what you want, but there are
very few Solr installations with no <uniqueKey>, and it's probably a better idea
to make your ids truly unique.
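
For example (just a sketch, the field names are invented): give every file its
own document and carry your shared id in a separate field you can query on:

import java.util.UUID;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public void addFileDocument(SolrServer server, String sharedId,
                            String fileName, String fulltext) throws Exception {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", UUID.randomUUID().toString()); // truly unique key per file
    doc.addField("content_set_id", sharedId);         // invented field: groups the set
    doc.addField("file_name", fileName);
    doc.addField("fulltext", fulltext);
    server.add(doc);
}

A query like content_set_id:<your shared id> then returns all files of one
entity in a single request.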

Best
Erick

On Thu, May 23, 2013 at 6:14 AM,   wrote:
> Hello Solr team,
>
> I want to index multiple files into one Solr index entity, with the
> same id. We are using Solr 4.1.
> [...]