Is this SolrCloud or a single Solr instance?

On Jan 2, 2015 3:44 PM, <j...@ece.ubc.ca> wrote:
> Happy New Year Everyone :)
>
> I am trying to automatically generate a document id when indexing a csv
> file that contains multiple lines of documents. The desired case: if the
> csv file contains 2 lines (each line is a document), then the index
> should contain 2 documents.
>
> What I observed: if the csv file contains 2 lines, then the index
> contains 3 documents, because the 1st document is repeated once. An
> example output:
>
> <doc>
>   <str name="col1"> doc1 </str>
>   <str name="col2"> rank1 </str>
>   <str name="id"> randomlyGeneratedId1</str>
> </doc>
> <doc>
>   <str name="col1"> doc1 </str>
>   <str name="col2"> rank1 </str>
>   <str name="id"> randomlyGeneratedId2</str>
> </doc>
> <doc>
>   <str name="col1"> doc2 </str>
>   <str name="col2"> rank2 </str>
>   <str name="id"> randomlyGeneratedId3</str>
> </doc>
>
> And if the csv file contains 3 lines, then the index contains 6 documents,
> because document 1 appears 3 times and document 2 appears twice, as
> follows:
>
> <doc>
>   <str name="col1"> doc1 </str>
>   <str name="col2"> rank1 </str>
>   <str name="id"> randomlyGeneratedId1</str>
> </doc>
> <doc>
>   <str name="col1"> doc1 </str>
>   <str name="col2"> rank1 </str>
>   <str name="id"> randomlyGeneratedId2</str>
> </doc>
> <doc>
>   <str name="col1"> doc2 </str>
>   <str name="col2"> rank2 </str>
>   <str name="id"> randomlyGeneratedId3</str>
> </doc>
> <doc>
>   <str name="col1"> doc1 </str>
>   <str name="col2"> rank1 </str>
>   <str name="id"> randomlyGeneratedId4</str>
> </doc>
> <doc>
>   <str name="col1"> doc2 </str>
>   <str name="col2"> rank2 </str>
>   <str name="id"> randomlyGeneratedId5</str>
> </doc>
> <doc>
>   <str name="col1"> doc3 </str>
>   <str name="col2"> rank3 </str>
>   <str name="id"> randomlyGeneratedId6</str>
> </doc>
>
> Here's what I have done:
>
> 1. In my solrconfig.xml:
>
> <updateRequestProcessorChain name="autoGenId">
>   <processor class="solr.UUIDUpdateProcessorFactory">
>     <str name="fieldName">doc_key</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">autoGenId</str>
>   </lst>
> </requestHandler>
>
> 2. In schema.xml:
>
> <field name="doc_key" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
> <field name="col1" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
> <field name="col2" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
> <uniqueKey>id</uniqueKey>
>
> This problem doesn't exist when I assign an id field myself instead of
> using the UUIDUpdateProcessorFactory, so I assume the problem is there.
> It looks like the csv file is processed one line at a time, and the index
> records every intermediate step, so each previous line is repeated in
> the output. Is there a way to not show the 'appending of previous
> lines', and rather just the 'final results', so that the total number of
> indexed documents matches the number of documents in the csv file?
>
> Many thanks,
> Jia
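
One thing that may be worth checking in the meantime: the posted chain writes the generated UUID to doc_key, while schema.xml declares id as the uniqueKey and the sample output shows the generated values in id. The fragment below is only a sketch of what the two files might look like if both named the same field. It assumes id is the intended key and that an id field of type string is acceptable; it is not a confirmed fix for the duplicated documents.

```xml
<!-- solrconfig.xml (sketch): generate the UUID into the uniqueKey field.
     Assumption: "id" is the intended key; the original post used "doc_key" here. -->
<updateRequestProcessorChain name="autoGenId">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- schema.xml (sketch): declare the field that uniqueKey refers to.
     Assumption: this declaration is not already present elsewhere in the schema. -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
```

If the real configuration already uses id in both places and the mismatch is only in the email, this can be ignored.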