Hoss, Thanks for your answers. You are absolutely right, I should have provided you more details.
We index using 4 processes that read from a queue of documents. Each process sends one document at a time to the /update handler.

Yes, I double-checked that no deletes occur. Since that first indexing run, I have re-indexed the same set of documents twice, and we always end up with 7725 documents; the count of ~10000 documents that we saw the first time never showed up again. The difference is that the first run lasted a couple of hours, because the documents were not always accessible in our document queue. The other times, all the documents were available, so it took around 20 minutes to re-index them. That left no time for an auto-commit to happen during those runs (our autoCommit maxTime is 30 minutes), so the log never shows the newSearcher warming query that I use as a document count.

About the newSearcher warming query: it is a typo in the config. It should have been 'qt'. Thanks for this one!

In my schema.xml, I have defined the id and signature fields like this:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="signature" type="string" indexed="true" stored="true"/>
...
<uniqueKey>id</uniqueKey>
<defaultSearchField>fulltext</defaultSearchField>

And here is our solrconfig.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Perform a <commit/> automatically under certain conditions:
         maxDocs - number of updates since last commit is greater than this
         maxTime - oldest uncommited update (in ms) is this long ago
    -->
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>1800000</maxTime>
    </autoCommit>
  </updateHandler>

  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>
    <filterCache class="solr.FastLRUCache" size="1048576" initialSize="4096" autowarmCount="1024"/>
    <queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="128"/>
    <documentCache class="solr.FastLRUCache" size="1048576" initialSize="512" autowarmCount="0"/>
    <enableLazyFieldLoading>true</enableLazyFieldLoading>
    <queryResultWindowSize>50</queryResultWindowSize>
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
    <HashDocSet maxSize="3000" loadFactor="0.75"/>

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">original_date desc</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="wt">dismax</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">source</str>
          <str name="facet.field">author</str>
          <str name="facet.field">type</str>
          <str name="facet.field">site</str>
        </lst>
      </arr>
    </listener>

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">original_date desc</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="wt">dismax</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">source</str>
          <str name="facet.field">author</str>
          <str name="facet.field">type</str>
          <str name="facet.field">site</str>
        </lst>
      </arr>
    </listener>

    <useColdSearcher>false</useColdSearcher>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
    </httpCaching>
  </requestDispatcher>

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">5</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.onlyMorePopular">true</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

  <requestHandler name="dismax" class="solr.SearchHandler" >
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.2</float>
      <str name="qf"> fulltext^1.2 title^2.3 text^1.2 </str>
      <str name="pf"> fulltext^1.2 title^1.8 text^1.2 </str>
      <str name="bf">recip(rord(original_date),1,10000,10000)^100</str>
      <str name="bq">original_date:[NOW-10DAY TO *]^2</str>
      <str name="fl"> id,title,text,author,original_date,source,section </str>
      <str name="mm"> 2&lt;100% 3&lt;-1 4&lt;-2 8&lt;60% </str>
      <int name="ps">100</int>
      <str name="q.alt">*:*</str>
      <!-- example highlighter config, enable per-query with hl=true -->
      <str name="hl.fl">text features name</str>
      <!-- for this field, we want no fragmenting, just highlighting -->
      <str name="f.name.hl.fragsize">0</str>
      <!-- instructs Solr to return the field itself if no query terms are found -->
      <str name="f.name.hl.alternateField">name</str>
      <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">5</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.onlyMorePopular">true</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
      <str>facetcleaner</str>
      <str>docreader</str>
      <str>queryelevation</str>
      <str>didyoumean</str>
      <str>likethis</str>
    </arr>
  </requestHandler>

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spellchecker</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
    </lst>
  </searchComponent>

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
  <requestHandler name="/analysis" class="solr.AnalysisRequestHandler" startup="lazy"/>
  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />

  <requestHandler name="/admin/ping" class="PingRequestHandler">
    <lst name="defaults">
      <str name="qt">standard</str>
      <str name="q">solrpingquery</str>
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>

  <requestHandler name="/debug/dump" class="solr.DumpRequestHandler" startup="lazy">
    <lst name="defaults">
      <str name="echoParams">explicit</str> <!-- for all params (including the default etc) use: 'all' -->
      <str name="echoHandler">true</str>
    </lst>
  </requestHandler>

  <requestHandler name="/mlt" class="org.apache.solr.handler.MoreLikeThisHandler" />

  <highlighting>
    <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
      <lst name="defaults">
        <int name="hl.fragsize">100</int>
      </lst>
    </fragmenter>
    <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
      <lst name="defaults">
        <!-- slightly smaller fragsizes work better because of slop -->
        <int name="hl.fragsize">70</int>
        <!-- allow 50% slop on fragment sizes -->
        <float name="hl.regex.slop">0.5</float>
        <!-- a basic sentence pattern -->
        <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
      </lst>
    </fragmenter>
    <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
      <lst name="defaults">
        <str name="hl.simple.pre"><![CDATA[<em>]]></str>
        <str name="hl.simple.post"><![CDATA[</em>]]></str>
      </lst>
    </formatter>
  </highlighting>

  <queryResponseWriter name="xslt" class="org.apache.solr.request.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">5</int>
  </queryResponseWriter>

  <admin>
    <defaultQuery>solr</defaultQuery>
  </admin>

  <requestHandler name="/replication" class="solr.ReplicationHandler" >
    <lst name="master">
      <str name="enable">${enable.master:false}</str>
      <str name="replicateAfter">startup</str>
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">optimize</str>
    </lst>
    <lst name="slave">
      <str name="enable">${enable.slave:false}</str>
      <str name="masterUrl">${slave.master.url}</str>
      <str name="pollInterval">${slave.poll.interval}</str>
    </lst>
  </requestHandler>

  <updateRequestProcessorChain>
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">signature</str>
      <str name="fields">title,text</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
</config>

Again, thanks for your help!

hossman wrote:
>
>
> : I have encounter a situation that I can't explain.
> : We are indexing documents that are often duplicates so we activated
> : deduplication like this:
>
> FWIW: w/o providing us more info about what your schema looks like, and
> how you are indexing documents, all we can do is speculate about some of
> the possible causes of your problems -- for all we know you don't have
> your uniqueKey configured properly, or have something in DIH configured to
> do deletes on delta imports, etc... We need all the facts to make
> informed suggestions.
>
> : What I can't explain is that when I look at the documents count in the log,
> : I see documents disappearing.
> :
> : 11:24:23 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0
>
> 1) it looks like you only included the "newSearcher" related warming query
> log messages in your email ... i assume you double checked that there were
> no "delete" messages logged by the LogUpdateProcessor ?
>
> 2) that's a fairly non-sensical warming query ... do you really have a
> queryResponseWriter registered with the name "dismax" (it's typically used
> as either a RequestHandler (qt) or QParser (defType)) ... w/o knowing what
> your default requestHandler declaration looks like, it's totally possible
> that the number you are seeing has nothing to do with the total number of
> docs in your index, and instead just indicates how many docs match the
> literal string "*:*" in your default search field (or some set of query
> fields if you are using dismax as the default QParser) which can
> certainly change as you update existing documents..
>
> As i said: full configs would make it a lot easier to help clear up what
> you are seeing.
>
>
>
> -Hoss
>
>

--
View this message in context: http://old.nabble.com/Documents-disappearing-tp27659047p27714221.html
Sent from the Solr - User mailing list archive at Nabble.com.
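
For reference, here is a minimal sketch of what the second newSearcher warming entry might look like once the 'wt' typo is changed to 'qt'. It assumes the intent was to route that query to the "dismax" request handler declared in the solrconfig.xml above; it is an illustration only, not the configuration actually deployed, and the other warming entries are left out for brevity.

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <!-- qt picks the request handler; wt would pick a response writer,
           and no response writer named "dismax" is registered in this config -->
      <str name="qt">dismax</str>
    </lst>
  </arr>
</listener>
```

Whether the hits value logged for this entry reflects the total number of documents still depends on how the dismax parser treats the string "*:*", as Hoss points out. The first warming entry (q=*:* sorted by original_date against the default "standard" handler) is probably the more reliable one to read as a document count.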