Erick - thanks very much, all of this makes sense.  But the one thing I still
find puzzling is that re-adding the file a second, third, fourth, etc. time
causes numDocs to increase, and ALWAYS by the same amount (141,645).  Any
ideas as to what could cause that?

Dan


Erick Erickson wrote:
> 
> I think the root of your problem is that unique fields should NOT
> be multivalued. See
> http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key)
> 
> In this case, since you're tokenizing, your "query" field is
> implicitly multi-valued, so I don't know what the behavior will be.
> 
> But there's another problem:
> All the filters in your analyzer definition will mess up the
> correspondence between the Unix uniq count and numDocs even
> if you got past the above. For example:
> 
> StopFilter would make the lines "a problem" and "the problem" identical.
> WordDelimiter would do all kinds of interesting things....
> LowerCaseFilter would make "Myproblem" and "myproblem" identical.
> RemoveDuplicatesFilter would make "interesting interesting" and
> "interesting" identical.
> 
> You could define a second field, make *that* one unique, and NOT analyze
> it in any way...
> 
> You could hash your sentences and define the hash as your unique key.
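> 
> If you go the hash route, an untested sketch of the preprocessing step
> might look like the following (the output file name and the "id" field
> name are made up; you'd declare id as a plain, non-analyzed string
> field, make it the uniqueKey, and index with fieldnames=id,query):
> 
>   # Prepend an MD5 of each raw line as a tab-separated first column.
>   import hashlib
> 
>   with open("file1", encoding="utf-8") as src, \
>        open("file1.hashed", "w", encoding="utf-8") as dst:
>       for line in src:
>           line = line.rstrip("\n")
>           digest = hashlib.md5(line.encode("utf-8")).hexdigest()
>           dst.write(digest + "\t" + line + "\n")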
> 
> You could....
> 
> HTH
> Erick
> 
> On Wed, Jan 6, 2010 at 1:06 PM, danben <dan...@gmail.com> wrote:
> 
>>
>> The problem:
>>
>> Not all of the documents that I expect to be indexed are showing up
>> in the index.
>>
>> The background:
>>
>> I start off with an empty index based on a schema with a single field
>> named 'query', marked as unique and using the following analyzer:
>>
>> <analyzer type="index">
>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>>           words="stopwords.txt" enablePositionIncrements="true"/>
>>   <filter class="solr.WordDelimiterFilterFactory"
>>           generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> </analyzer>
>>
>> My input is a UTF-8 encoded file with one sentence per line.  Its
>> total size is about 60MB.  I would like each line of the file to
>> correspond to a single document in the Solr index.  If I print the
>> number of unique lines in the file (using cat | sort | uniq | wc -l),
>> I get a little over 2M.  Printing the total number of lines in the
>> file gives me around 2.7M.
>>
>> I use the following to start indexing:
>>
>> curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'
>>
>> When this command completes, I see numDocs is approximately 470k (which
>> is what I find strange) and maxDocs is approximately 890k (which is
>> fine since I know I have around 700k duplicates).  Even more confusing
>> is that if I run this exact command a second time without performing
>> any other operations, numDocs goes up to around 610k, and a third time
>> brings it up to about 750k.
>>
>> Can anyone tell me what might cause Solr not to index everything in my
>> input file the first time, and why it would be able to index new
>> documents the second and third times?
>>
>> I also have this line in solrconfig.xml, if it matters:
>>
>> <requestParsers enableRemoteStreaming="true"
>> multipartUploadLimitInKB="20480000" />
>>
>> Thanks,
>> Dan
>>
>>
>>
> 
> 
