Erick - thanks very much, all of this makes sense. But the one thing I still find puzzling is that re-adding the file a second, third, fourth, etc. time causes numDocs to increase, and ALWAYS by the same amount (141,645). Any ideas as to what could cause that?
Dan

Erick Erickson wrote:
>
> I think the root of your problem is that unique fields should NOT
> be multivalued. See
> http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key)
>
> In this case, since you're tokenizing, your "query" field is
> implicitly multi-valued; I don't know what the behavior will be.
>
> But there's another problem:
> All the filters in your analyzer definition will mess up the
> correspondence between the Unix uniq count and numDocs even
> if you get past the above. I.e....
>
> StopFilter would make the lines "a problem" and "the problem" identical.
> WordDelimiter would do all kinds of interesting things....
> LowerCaseFilter would make "Myproblem" and "myproblem" identical.
> RemoveDuplicatesFilter would make "interesting interesting" and
> "interesting" identical.
>
> You could define a second field, make *that* one unique, and NOT analyze
> it in any way...
>
> You could hash your sentences and define the hash as your unique key.
>
> You could....
>
> HTH
> Erick
>
> On Wed, Jan 6, 2010 at 1:06 PM, danben <dan...@gmail.com> wrote:
>
>> The problem:
>>
>> Not all of the documents that I expect to be indexed are showing up
>> in the index.
>>
>> The background:
>>
>> I start off with an empty index based on a schema with a single field
>> named 'query', marked as unique and using the following analyzer:
>>
>> <analyzer type="index">
>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>>           words="stopwords.txt" enablePositionIncrements="true"/>
>>   <filter class="solr.WordDelimiterFilterFactory"
>>           generateWordParts="1" generateNumberParts="1"
>>           catenateWords="1" catenateNumbers="1" catenateAll="0"
>>           splitOnCaseChange="1"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> </analyzer>
>>
>> My input is a utf-8 encoded file with one sentence per line. Its total
>> size is about 60MB. I would like each line of the file to correspond
>> to a single document in the solr index. If I print the number of
>> unique lines in the file (using cat | sort | uniq | wc -l), I get a
>> little over 2M. Printing the total number of lines in the file gives
>> me around 2.7M.
>>
>> I use the following to start indexing:
>>
>> curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'
>>
>> When this command completes, I see numDocs is approximately 470k
>> (which is what I find strange) and maxDocs is approximately 890k
>> (which is fine since I know I have around 700k duplicates). Even more
>> confusing is that if I run this exact command a second time without
>> performing any other operations, numDocs goes up to around 610k, and
>> a third time brings it up to about 750k.
>>
>> Can anyone tell me what might cause Solr not to index everything in
>> my input file the first time, and why it would be able to index new
>> documents the second and third times?
>>
>> I also have this line in solrconfig.xml, if it matters:
>>
>> <requestParsers enableRemoteStreaming="true"
>>                 multipartUploadLimitInKB="20480000" />
>>
>> Thanks,
>> Dan
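
For reference, a minimal sketch of the second-field approach Erick suggests above. The field name "query_raw" and the stock "string" type are my own illustrative choices, not something from the thread:

  <!-- schema.xml sketch (hypothetical names): keep the analyzed "query"
       field for searching, and add an unanalyzed string field to serve
       as the unique key. copyField copies the raw input value *before*
       any analysis runs, so "query" is still tokenized and filtered
       while "query_raw" stays byte-for-byte identical to the input. -->
  <field name="query"     type="text"   indexed="true" stored="true"/>
  <field name="query_raw" type="string" indexed="true" stored="true"/>

  <copyField source="query_raw" dest="query"/>

  <uniqueKey>query_raw</uniqueKey>

The CSV load would then pass fieldnames=query_raw instead of fieldnames=query; with an unanalyzed key, numDocs should line up with the cat | sort | uniq | wc -l count.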
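
Similarly, a sketch of the hashing idea using Solr's deduplication support (SignatureUpdateProcessorFactory, which shipped with 1.4). The chain name "dedupe" and the "sig" field are illustrative, and wiring the chain into the update handler is an assumption on my part:

  <!-- solrconfig.xml sketch (hypothetical names): compute a hash of the
       "query" value at index time and store it in a dedicated signature
       field, which the schema then declares as the unique key. -->
  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">sig</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">query</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

  <!-- schema.xml: the signature field becomes the unique key. -->
  <field name="sig" type="string" indexed="true" stored="true"/>
  <uniqueKey>sig</uniqueKey>

If I'm reading the 1.4 docs right, the update request also needs update.processor=dedupe (renamed update.chain in later releases) for the chain to run.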