Yes, id is the unique key:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

> I bet that if you redefined your updateHandler to give it some name other
> than “/update” in solrconfig.xml two things would happen:

Hmm, nice. I didn't think of that but that would definitely identify the
problem. We do have other scripts writing to the index but they are not of
type "talk"; talk is handled completely by a single script (although my
suspicion has been that we have a rogue script somewhere).

Will definitely give a rename a try, at least for my own sanity.

Thanks again.

On Mon, 28 Sep 2020 at 21:26, Erick Erickson <erickerick...@gmail.com> wrote:

> Is your “id” field your <uniqueKey>, and is it tokenized? It shouldn’t
> be, use something like “string” or keywordTokenizer. Definitely do NOT use,
> say, text_general.
>
> It’s very unlikely that records are not being flushed on commit, I’m
> 99.99% certain that’s a red herring and that this is a problem in your
> environment.
>
> Or that some process you don’t know about is sending documents that don’t
> have the information you expect. The fact that you say you’ve disabled your
> update scripts but see this second record being indexed 3 minutes later is
> strong evidence that _someone_ is updating records, is there a cron job
> somewhere that’s sending docs? Other??
>
> I bet that if you redefined your updateHandler to give it some name other
> than “/update” in solrconfig.xml two things would happen:
> 1> this problem will go away
> 2> you’ll get some error report from somewhere telling you that Solr is
> broken because it isn’t accepting documents for update ;)
>
> > On Sep 28, 2020, at 9:01 AM, Mr Havercamp <mrhaverc...@gmail.com> wrote:
> >
> > Thanks Erick. My knowledge is fairly limited but 1) sounds feasible.
> > Some logs:
> >
> > I write a bunch of records to Solr:
> >
> > 2020-09-28 11:01:01.255 INFO (qtp918312414-21) [ x:vnc]
> > o.a.s.u.p.LogUpdateProcessorFactory [vnc] webapp=/solr path=/update
> > params={json.nl=flat&omitHeader=false&wt=json}{add=[
> > talk.tq0rkem4pc.jaydeep.pan...@dev.vnc.de (1679075166122934272),
> > talk.tq0rkem4pc.dmitry.zolotni...@dev.vnc.de (1679075166123982848),
> > talk.tq0rkem4pc.hayden.yo...@dev.vnc.de (1679075166125031424),
> > talk.tq0rkem4pc.nishant.j...@dev.vnc.de (1679075166125031425),
> > talk.tq0rkem4pc.macanh....@dev.vnc.de (1679075166126080000),
> > talk.tq0rkem4pc.kapil.nadiyap...@dev.vnc.de (1679075166126080001),
> > talk.tq0rkem4pc.sanjay.domad...@dev.vnc.de (1679075166126080002),
> > talk.tq0rkem4pc.umesh.sarva...@dev.vnc.de (1679075166127128576)],commit=} 0 8
> >
> > Selecting records looks good:
> >
> > {
> > "talk_id_s":"tq0rkem4pc",
> > "talk_internal_id_s":"29896",
> > "from_s":"from address",
> > "content_txt":["test_1000016"],
> > "raw_txt":["<body xmlns=\"http://www.w3.org/1999/xhtml\">test_1000016</body>"],
> > "created_dt":"2020-09-28T11:00:02Z",
> > "updated_dt":"2020-09-28T11:00:02Z",
> > "type_s":"talk",
> > "talk_type_s":"groupchat",
> > "title_s":"role__change__1_talk@conference",
> > "to_ss":["bunch", "of", "names"],
> > "owner_s":"owner address",
> > "id":"talk.tq0rkem4pc.email@address",
> > "_version_":1679075166127128576}
> >
> > Then, a few minutes later:
> >
> > 2020-09-28 11:04:33.070 INFO (qtp918312414-21) [ x:vnc]
> > o.a.s.u.p.LogUpdateProcessorFactory [vnc] webapp=/solr path=/update
> > params={wt=json}{add=[talk.tq0rkem4pc.hayden.yo...@dev.vnc.de
> > (1679075388234399744)]} 0 1
> > 2020-09-28 11:04:33.150 INFO (qtp918312414-21) [ x:vnc]
> > o.a.s.u.p.LogUpdateProcessorFactory [vnc] webapp=/solr path=/update
> > params={wt=json}{add=[talk.tq0rkem4pc (1679075388318285824)]} 0 0
> >
> > Checking the record again:
> >
> > {
> > "id":"talk.tq0rkem4pc.email@address",
> > "_version_":1679075388234399744},
> > {
> > "id":"talk.tq0rkem4pc",
> > "_version_":1679075388318285824}
> >
> > A couple of strange things here:
> >
> > 1. my talk.tq0rkem4pc.email@address record no longer has any data in it.
> > Just id and version.
> >
> > 2. The second entry is really strange; this isn't a valid record at all and
> > I don't have any record of creating it.
> >
> > I've ruled out reindexing items both from my indexing script (I just don't
> > run it) and an external code snippet updating the record at a later time.
> >
> > Not sure if I've got the terminology right but would I be correct in
> > assuming that it is possible records are not being flushed from the buffer
> > when added? I'm assuming there is some kind of buffering or caching going
> > on before records are committed? Is it possible they are getting corrupted
> > under higher than usual load?
> >
> >
> > On Mon, 28 Sep 2020 at 20:41, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> There are several possibilities:
> >>
> >> 1> you simply have some process incorrectly updating documents.
> >>
> >> 2> you’ve changed your schema sometime without completely deleting your
> >> old index and re-indexing all documents from scratch. I recommend in fact
> >> indexing into a new collection and using collection aliasing if you can’t
> >> delete and recreate the collection before re-indexing. There’s some support
> >> for this idea because you say that the doc in question not only changes one
> >> way, but then changes back mysteriously. So seg1 (old def) merges with seg2
> >> (new def) into seg10 using the old def because merging saw seg1 first. Then
> >> sometime later seg3 (new def) merges with seg10 and the data mysteriously
> >> comes back because that merge uses seg3 (new def) as a template for how the
> >> index “should” look.
> >>
> >> But I’ve never heard of Solr (well, Lucene actually) doing this by itself,
> >> and I have heard of the merging process doing “interesting” stuff with
> >> segments created with changed schema definitions.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Sep 28, 2020, at 8:26 AM, Mr Havercamp <mrhaverc...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We're seeing strange behaviour when records have been committed. It doesn't
> >>> happen all the time but enough that the index is very inconsistent.
> >>>
> >>> What happens:
> >>>
> >>> 1. We commit a doc to Solr,
> >>> 2. The doc shows in the search results,
> >>> 3. Later (may be immediate, may take minutes, may take hours), the same
> >>> document is emptied of all data except version and id.
> >>>
> >>> We have custom scripts which add to the index but even without them being
> >>> executed we see records being updated in this way.
> >>>
> >>> For example committing:
> >>>
> >>> { id: talk.1234, from: "me", to: "you", "content": "some content", title:
> >>> "some title"}
> >>>
> >>> will, after an initial successful search, suddenly end up as:
> >>>
> >>> { id: talk.1234, version: 1234}
> >>>
> >>> Not sure how to proceed on debugging this issue. It seems to settle in
> >>> after Solr has been running for a while but can just as quickly rectify
> >>> itself.
> >>>
> >>> At a loss how to debug and proceed.
> >>>
> >>> Any help much appreciated.
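
One detail in the quoted log lines may help narrow down the sender: the
original bulk add arrived with params={json.nl=flat&omitHeader=false&wt=json},
while the two later adds carry only params={wt=json}, which could mean they
come from a different client. Below is a minimal Python sketch for grouping
add requests by their params string. It assumes the log layout is exactly as
quoted in this thread; the regular expressions and the name
group_adds_by_params are illustrative and not part of Solr.

```python
import re
from collections import defaultdict

# Pattern modelled only on the LogUpdateProcessorFactory lines quoted in this
# thread (assumption); other Solr versions or log layouts may need changes.
UPDATE_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*?"
    r"path=(?P<path>\S+) params=\{(?P<params>[^}]*)\}"
    r"\{add=\[(?P<adds>[^\]]*)\]"
)
# Each added document is logged as "<id> (<version>)".
DOC_ID = re.compile(r"(\S+) \(\d+\)")


def group_adds_by_params(lines):
    """Map each distinct params string to a list of (timestamp, doc id)."""
    groups = defaultdict(list)
    for line in lines:
        match = UPDATE_LINE.search(line)
        if match is None:
            continue  # not an add request, or not in the expected layout
        for doc_id in DOC_ID.findall(match.group("adds")):
            groups[match.group("params")].append((match.group("ts"), doc_id))
    return dict(groups)


# Log lines from this thread, unwrapped to one line each. The first entry is
# abbreviated to two of its eight ids to keep the sample short.
sample = [
    "2020-09-28 11:01:01.255 INFO (qtp918312414-21) [ x:vnc] "
    "o.a.s.u.p.LogUpdateProcessorFactory [vnc] webapp=/solr path=/update "
    "params={json.nl=flat&omitHeader=false&wt=json}{add=["
    "talk.tq0rkem4pc.jaydeep.pan...@dev.vnc.de (1679075166122934272), "
    "talk.tq0rkem4pc.hayden.yo...@dev.vnc.de (1679075166125031424)],commit=} 0 8",
    "2020-09-28 11:04:33.070 INFO (qtp918312414-21) [ x:vnc] "
    "o.a.s.u.p.LogUpdateProcessorFactory [vnc] webapp=/solr path=/update "
    "params={wt=json}{add=[talk.tq0rkem4pc.hayden.yo...@dev.vnc.de "
    "(1679075388234399744)]} 0 1",
    "2020-09-28 11:04:33.150 INFO (qtp918312414-21) [ x:vnc] "
    "o.a.s.u.p.LogUpdateProcessorFactory [vnc] webapp=/solr path=/update "
    "params={wt=json}{add=[talk.tq0rkem4pc (1679075388318285824)]} 0 0",
]

if __name__ == "__main__":
    for params, adds in group_adds_by_params(sample).items():
        print(params)
        for ts, doc_id in adds:
            print("   ", ts, doc_id)
```

If the known indexing script always sends the longer params string, any group
with a different params string is a candidate for the unknown writer. This is
only a heuristic, since two clients can share the same request parameters.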