Thanks Erick,

We've never merged indexes. We don't use the MapReduceIndexerTool, but we do
use an external MapReduce process to reindex. To reindex from an empty
state, we run a MapReduce job on a separate HBase cluster that indexes
into this shard. During this job the mappers concurrently make HTTP update
requests to the shard, but only one mapper should post a document for any
given unique "id".
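
As a rough illustration of the kind of HTTP update each mapper sends, here is
a minimal Python sketch. It is not the actual job code: the function name,
host, core name, and commitWithin value are assumptions made for the example.
It only builds a request for Solr's JSON /update handler and leaves sending it
to the caller.

```python
import json
from urllib import request


def build_update_request(base_url, core, docs, commit_within_ms=None):
    """Build (but do not send) a POST that adds `docs` to a Solr core.

    With a <uniqueKey> defined, a document whose key matches an existing
    one is expected to replace it rather than be added alongside it.
    """
    url = f"{base_url}/solr/{core}/update"
    if commit_within_ms is not None:
        url += f"?commitWithin={commit_within_ms}"
    body = json.dumps(docs).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, core, and document, for illustration only.
req = build_update_request(
    "http://solr:8983", "filesearch", [{"id": "file_382506116"}],
    commit_within_ms=60000,
)
# request.urlopen(req) would send it; omitted so the sketch has no side effects.
```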

Reindexing from scratch is done roughly every 3 months. Between reindexes,
a worker external to Solr reads from an event stream and posts HTTP updates
to the Solr cluster.

The <uniqueKey> has never been updated to my knowledge, but if it has, it
definitely wasn't updated in the 3 months since the last reindexing.

Also, since the last reindexing nothing in solrconfig.xml or managed-schema
has been updated, nor has the index been manipulated outside of the Solr
framework.
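
For anyone who wants to check their own results for the same symptom, here is
a minimal Python sketch that flags any "id" returned under more than one
internal docid. The function name is hypothetical, and it assumes the query
asked for fl=id,[docid] as in the query quoted below; the sample data is taken
from that quoted response.

```python
from collections import defaultdict


def find_duplicate_ids(solr_response):
    """Return {id: [docid, ...]} for ids that appear under several docids."""
    docids_by_id = defaultdict(set)
    for doc in solr_response["response"]["docs"]:
        docids_by_id[doc["id"]].add(doc["[docid]"])
    # Keep only ids backed by more than one internal Lucene document.
    return {
        doc_id: sorted(docids)
        for doc_id, docids in docids_by_id.items()
        if len(docids) > 1
    }


# The two documents from the quoted response.
sample = {
    "response": {
        "numFound": 2,
        "start": 0,
        "maxScore": 1.0,
        "docs": [
            {"id": "file_382506116", "[docid]": 346266675, "score": 1.0},
            {"id": "file_382506116", "[docid]": 170442733, "score": 1.0},
        ],
    }
}

duplicates = find_duplicate_ids(sample)
```

This only inspects the page of results that was returned, so it is a spot
check for a single filtered query rather than a scan of the whole index.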

On Tue, May 14, 2019 at 5:24 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> This is indeed strange. First of all, forget about explanations that
> involve the transaction log etc. When Lucene opens a searcher, it is only
> for closed segments; the tlog has nothing to do with that.
>
> Have you ever merged indexes? The MapReduceIndexerTool, if you ever used
> it, does not de-duplicate. Ditto if you ever changed the <uniqueKey>. The
> fact that you say that this clears up when you re-index the document leads
> me to wonder whether you have manipulated the index outside the normal Solr
> framework.
>
> IOW, I’ve never seen this before, so I suspect there’s something you did
> in your setup that seemed innocent at the time that led to this
> (temporary) situation.
>
> Best,
> Erick
>
> > On May 14, 2019, at 5:43 PM, Adam Walz <a...@adamwalz.net> wrote:
> >
> > In my solr schema I have set a uniqueKey of "id" where the id field is a
> > solr.StrField. When querying with this field as a filter I would expect to
> > always get 1 or 0 documents as a result. However I am getting back multiple
> > documents with the same "id" field, but different internal `docid`s. This
> > problem is intermittent and seems to resolve itself when the document is
> > updated. This is happening on solr 7.0.1 without SolrCloud and while only
> > querying a single shard without routing.
> >
> > Any thoughts on what could be causing this behavior? This is a very large
> > single shard with 300 million documents and an index size of 750GB. I know
> > that is not recommended for a single shard, but could it explain these
> > duplicate results possibly because of the time it takes to commit, merge,
> > or something with tlogs?
> >
> > -- Query --
> > http://solr:8983/solr/filesearch/select?fl=id,[docid],score&fq=id:file_382506116&q=*:*
> >
> > -- Response --
> >
> > {
> >  "responseHeader":{
> >    "status":0,
> >    "QTime":0,
> >    "params":{
> >      "mm":" 1<-0% ",
> >      "q.alt":"*:*",
> >      "ps":"100",
> >      "echoParams":"all",
> >      "fl":"id,[docid],score",
> >      "fq":"id:file_413041895994",
> >      "sort":"score desc",
> >      "rows":"35",
> >      "version":"2.2",
> >      "q":"*:*",
> >      "tie":"0.01",
> >      "defType":"edismax",
> >      "qf":"id name_combined^10 name_zh-cn^10 name_shingle
> > name_shingle_zh-cn name_token^60 description file_content_en
> > file_content_fr file_content_de file_content_it file_content_es
> > file_content_zh-cn user_name user_email comments tags",
> >      "pf":"description name_shingle^100 name_shingle_zh-cn^100 comments tags",
> >      "wt":"json",
> >      "debugQuery":"off"}},
> >  "response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
> >      {
> >        "id":"file_382506116",
> >        "[docid]":346266675,
> >        "score":1.0},
> >      {
> >        "id":"file_382506116",
> >        "[docid]":170442733,
> >        "score":1.0}]
> >  }}
> >
> >
> > -- Schema snippet --
> > <fields>
> >  <field name="id" type="string" indexed="true" stored="true"
> > required="true"/>
> > </fields>
> > <uniqueKey>id</uniqueKey>
> >
> > --
> > Adam Walz
>
