Re: Documents disappearing

Pascal Dimassimo Wed, 24 Feb 2010 06:55:41 -0800

Hoss,

Thanks for your answers. You are absolutely right, I should have provided
you with more details.


We index using 4 processes that read from a queue of documents. Each process
sends one document at a time to the /update handler.
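
For illustration, a single-document update message of the kind described
above might look like the sketch below. The field names id, title and text
are taken from the schema and deduplication settings quoted further down;
the values are invented, and the signature field is left out on the
assumption that the deduplication processor fills it in.

```xml
<!-- Hypothetical example document; all field values are invented -->
<add>
  <doc>
    <field name="id">doc-0001</field>
    <field name="title">Example title</field>
    <field name="text">Example body text.</field>
  </doc>
</add>
```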

Yes, I double-checked that no deletes occurred. Since that first indexation,
I have re-indexed the same set of documents twice, and we always end up with
7725 documents; the log never showed the ~10000 document count that we saw
the first time. The difference between the first indexation and the others
is that the first one lasted a couple of hours, because the documents were
not always available in our document queue. The other times, all the
documents were available, so it took around 20 minutes to re-index them.
That left no time for an auto-commit to happen during those runs (the
autoCommit maxTime in our config is 30 minutes), so the log never shows the
newSearcher warming query that I use as a document count.
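
One possible way to get the count into the log after a short run, sketched
here as an assumption rather than something verified on this setup, is to
post an explicit commit to /update once the queue is drained. A commit opens
a new searcher, which should in turn fire the newSearcher warming queries:

```xml
<!-- Posted to /update after the last document of the run -->
<commit/>
```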

About the newSearcher warming query: the 'wt' parameter is a typo in the
config. It should have been 'qt'. Thanks for this one!
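
With that fix applied, the second warming query entry would presumably read
as follows. This is a sketch of the corrected entry only and has not been
re-tested:

```xml
<lst>
  <str name="q">*:*</str>
  <!-- qt selects the request handler named "dismax";
       wt would instead select a response writer -->
  <str name="qt">dismax</str>
</lst>
```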

In my schema.xml, I have defined the id and signature fields like this:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="signature" type="string" indexed="true" stored="true"/>
...
<uniqueKey>id</uniqueKey>
<defaultSearchField>fulltext</defaultSearchField>


And here is our solrconfig.xml:
<?xml version="1.0" encoding="UTF-8" ?>

<config>
 
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Perform a <commit/> automatically under certain conditions:
         maxDocs - number of updates since last commit is greater than this
         maxTime - oldest uncommitted update (in ms) is this long ago
    -->
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>1800000</maxTime>
    </autoCommit>
  </updateHandler>


  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>

    <filterCache
      class="solr.FastLRUCache"
      size="1048576"
      initialSize="4096"
      autowarmCount="1024"/>

    <queryResultCache
      class="solr.LRUCache"
      size="16384"
      initialSize="4096"
      autowarmCount="128"/>

    <documentCache
      class="solr.FastLRUCache"
      size="1048576"
      initialSize="512"
      autowarmCount="0"/>

    <enableLazyFieldLoading>true</enableLazyFieldLoading>
    <queryResultWindowSize>50</queryResultWindowSize>
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
    <HashDocSet maxSize="3000" loadFactor="0.75"/>

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">original_date desc</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="wt">dismax</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">source</str>
          <str name="facet.field">author</str>
          <str name="facet.field">type</str>
          <str name="facet.field">site</str>
        </lst>
      </arr>
    </listener>

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">original_date desc</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="wt">dismax</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">source</str>
          <str name="facet.field">author</str>
          <str name="facet.field">type</str>
          <str name="facet.field">site</str>
        </lst>
      </arr>
    </listener>

    <useColdSearcher>false</useColdSearcher>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
    </httpCaching>
  </requestDispatcher>
      
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="spellcheck.extendedResults">true</str>
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.onlyMorePopular">true</str>
     </lst>
         
         <arr name="last-components">
        <str>spellcheck</str>           
     </arr>
  </requestHandler>

<requestHandler name="dismax" class="solr.SearchHandler" > 
   <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.2</float>
    <str name="qf">
       fulltext^1.2 title^2.3 text^1.2
    </str>
    <str name="pf">
       fulltext^1.2 title^1.8 text^1.2
    </str>
    <str name="bf">recip(rord(original_date),1,10000,10000)^100</str>
    <str name="bq">original_date:[NOW-10DAY TO *]^2</str>
    <str name="fl">
id,title,text,author,original_date,source,section
    </str>
    <str name="mm">
       2&lt;100% 3&lt;-1 4&lt;-2 8&lt;60%
    </str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <!-- example highlighter config, enable per-query with hl=true -->

    <str name="hl.fl">text features name</str>
    <!-- for this field, we want no fragmenting, just highlighting -->
    <str name="f.name.hl.fragsize">0</str>
    <!-- instructs Solr to return the field itself if no query terms are
         found -->
    <str name="f.name.hl.alternateField">name</str>
    <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
    
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">true</str>
   </lst>
   
   <arr name="last-components">
    <str>spellcheck</str>
    <str>facetcleaner</str>
    <str>docreader</str>
    <str>queryelevation</str>
    <str>didyoumean</str>
    <str>likethis</str>
   </arr>
 </requestHandler>
  
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spellchecker</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
    </lst>
  </searchComponent>
  
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
  <requestHandler name="/analysis" class="solr.AnalysisRequestHandler" startup="lazy"/>
  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
  
  <requestHandler name="/admin/ping" class="PingRequestHandler">
    <lst name="defaults">
      <str name="qt">standard</str>
      <str name="q">solrpingquery</str>
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>
    
  <requestHandler name="/debug/dump" class="solr.DumpRequestHandler" startup="lazy">
    <lst name="defaults">
     <str name="echoParams">explicit</str> <!-- for all params (including the default etc) use: 'all' -->
     <str name="echoHandler">true</str>
    </lst>
  </requestHandler>
  
  <requestHandler name="/mlt" class="org.apache.solr.handler.MoreLikeThisHandler" />
  
  <highlighting>
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
     <int name="hl.fragsize">100</int>
    </lst>
   </fragmenter>
   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <!-- slightly smaller fragsizes work better because of slop -->
      <int name="hl.fragsize">70</int>
      <!-- allow 50% slop on fragment sizes -->
      <float name="hl.regex.slop">0.5</float> 
      <!-- a basic sentence pattern -->
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
    </lst>
   </fragmenter>
   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<em>]]></str>
     <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
   </formatter>
  </highlighting>

  <queryResponseWriter name="xslt" class="org.apache.solr.request.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">5</int>
  </queryResponseWriter> 
     
  <admin>
    <defaultQuery>solr</defaultQuery>
  </admin>
  
  <requestHandler name="/replication" class="solr.ReplicationHandler" >
         <lst name="master">
            <str name="enable">${enable.master:false}</str>
            <str name="replicateAfter">startup</str> 
            <str name="replicateAfter">commit</str>
            <str name="replicateAfter">optimize</str>
         </lst>
         <lst name="slave">
            <str name="enable">${enable.slave:false}</str> 
            <str name="masterUrl">${slave.master.url}</str>
            <str name="pollInterval">${slave.poll.interval}</str>
         </lst>
  </requestHandler>
  
  <updateRequestProcessorChain>
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">signature</str>
      <str name="fields">title,text</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>    
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

</config>

Again, thanks for your help!


hossman wrote:
> 
> 
> : I have encountered a situation that I can't explain. We are indexing documents
> : that are often duplicates so we activated deduplication like this:
> 
> FWIW: w/o providing us more info about what your schema looks like, and 
> how you are indexing documents, all we can do is speculate about some of 
> the possible causes of your problems -- for all we know you don't have 
> your uniqueKey configured properly, or have something in DIH configured to 
> do deletes on delta imports, etc...  We need all the facts to make 
> informed suggestions.
> 
> : What I can't explain is that when I look at the documents count in the log,
> : I see documents disappearing.
> : 
> : 11:24:23 INFO  - [myindex] webapp=null path=null
> : params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0
> 
> 1) it looks like you only included the "newSearcher" related warming query 
> log messages in your email ... i assume you double checked that there were 
> no "delete" messages logged by the LogUpdateProcessor ?
> 
> 2) that's a fairly nonsensical warming query ... do you really have a 
> queryResponseWriter registered with the name "dismax"? (it's typically used 
> as either a RequestHandler (qt) or QParser (defType)) ... w/o knowing what 
> your default requestHandler declaration looks like, it's totally possible 
> that the number you are seeing has nothing to do with the total number of 
> docs in your index, and instead just indicates how many docs match the 
> literal string "*:*" in your default search field (or some set of query 
> fields if you are using dismax as the default QParser) which can 
> certainly change as you update existing documents.
> 
> As i said: full configs would make it a lot easier to help clear up what 
> you are seeing.
> 
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Documents-disappearing-tp27659047p27714221.html
Sent from the Solr - User mailing list archive at Nabble.com.
