Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

Meraj A. Khan Fri, 02 Jan 2015 12:57:44 -0800

Is this SolrCloud or a single Solr instance?
On Jan 2, 2015 3:44 PM, <j...@ece.ubc.ca> wrote:


> Happy New Year Everyone :)
>
> I am trying to automatically generate document Id when indexing a csv
> file that contains multiple lines of documents. The desired case: if the
> csv file contains 2 lines (each line is a document), then the index
> should contain 2 documents.
>
> What I observed: if the csv file contains 2 lines, then the index
> contains 3 documents, because the 1st document is repeated once. An
> example output:
> <doc>
> <str name="col1"> doc1 </str>
> <str name="col2"> rank1 </str>
> <str name="id"> randomlyGeneratedId1</str>
> </doc>
> <doc>
> <str name="col1"> doc1 </str>
> <str name="col2"> rank1 </str>
> <str name="id"> randomlyGeneratedId2</str>
> </doc>
> <doc>
> <str name="col1"> doc2 </str>
> <str name="col2"> rank2 </str>
> <str name="id"> randomlyGeneratedId3</str>
> </doc>
>
> And if the csv file contains 3 lines, then the index contains 6 documents,
> because document 1 appears 3 times and document 2 appears twice,
> as follows:
> <doc>
> <str name="col1"> doc1 </str>
> <str name="col2"> rank1 </str>
> <str name="id"> randomlyGeneratedId1</str>
> </doc>
> <doc>
> <str name="col1"> doc1 </str>
> <str name="col2"> rank1 </str>
> <str name="id"> randomlyGeneratedId2</str>
> </doc>
> <doc>
> <str name="col1"> doc2 </str>
> <str name="col2"> rank2 </str>
> <str name="id"> randomlyGeneratedId3</str>
> </doc>
> <doc>
> <str name="col1"> doc1 </str>
> <str name="col2"> rank1 </str>
> <str name="id"> randomlyGeneratedId4</str>
> </doc>
> <doc>
> <str name="col1"> doc2 </str>
> <str name="col2"> rank2 </str>
> <str name="id"> randomlyGeneratedId5</str>
> </doc>
> <doc>
> <str name="col1"> doc3 </str>
> <str name="col2"> rank3 </str>
> <str name="id"> randomlyGeneratedId6</str>
> </doc>
>
> Here's what I have done:
> 1. In my solrconfig.xml:
> <updateRequestProcessorChain name="autoGenId">
>   <processor class="solr.UUIDUpdateProcessorFactory">
>     <str name="fieldName">doc_key</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">autoGenId</str>
>   </lst>
> </requestHandler>
> 2. In schema.xml:
> <field name="doc_key" type="string" indexed="true" stored="true"
>        required="true" multiValued="false"/>
> <field name="col1" type="string" indexed="true" stored="true"
>        required="true" multiValued="false"/>
> <field name="col2" type="string" indexed="true" stored="true"
>        required="true" multiValued="false"/>
> <uniqueKey>id</uniqueKey>
>
> This problem doesn't exist when I assign an id field explicitly instead
> of using the UUIDUpdateProcessorFactory, so I assume the problem is
> there. It looks like the csv file is processed one line at a time, and
> the index shows the entire process, so we see each previous line
> repeated in the output. Is there a way to not show the 'appending of
> previous lines', and rather just the 'final results', so that the total
> number of indexed documents matches the number of lines in the csv file?
>
> Many thanks,
> Jia
>
