Is your “id” field your <uniqueKey>, and is it tokenized? It shouldn’t be; use something like “string” or a KeywordTokenizer-based field type. Definitely do NOT use, say, text_general.
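For reference, a minimal sketch of what I mean in schema.xml terms (the “string” type below is the stock solr.StrField definition from the default configsets; adjust the attributes to whatever your schema actually uses):

  <!-- the uniqueKey field must not be tokenized: solr.StrField keeps the whole value as a single token -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <uniqueKey>id</uniqueKey>

If “id” is tokenized (e.g. text_general), uniqueKey matching on updates can behave unpredictably, which is exactly why the stock schemas define it as “string”.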
It’s very unlikely that records are not being flushed on commit; I’m 99.99% certain that’s a red herring and that this is a problem in your environment, or that some process you don’t know about is sending documents that don’t have the information you expect. The fact that you say you’ve disabled your update scripts but still see this second record being indexed 3 minutes later is strong evidence that _someone_ is updating records. Is there a cron job somewhere that’s sending docs? Something else?

I bet that if you redefined your update request handler in solrconfig.xml to give it some name other than “/update”, two things would happen:

1> this problem will go away
2> you’ll get some error report from somewhere telling you that Solr is broken because it isn’t accepting documents for update ;)
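Concretely, something along these lines in solrconfig.xml is what I have in mind — the “/update-internal” name is just one I made up, and note that recent Solr versions also register “/update” implicitly, so check the docs for your version before treating this as gospel:

  <!-- sketch only: expose the update handler under a non-standard path -->
  <requestHandler name="/update-internal" class="solr.UpdateRequestHandler"/>

The point of the exercise is that whatever is still posting to “/update” starts failing loudly, and those errors tell you where the mystery updates are coming from.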
> On Sep 28, 2020, at 9:01 AM, Mr Havercamp <mrhaverc...@gmail.com> wrote:
>
> Thanks Erick. My knowledge is fairly limited but 1) sounds feasible. Some logs:
>
> I write a bunch of records to Solr:
>
> 2020-09-28 11:01:01.255 INFO (qtp918312414-21) [ x:vnc] o.a.s.u.p.LogUpdateProcessorFactory [vnc] webapp=/solr path=/update params={json.nl=flat&omitHeader=false&wt=json}{add=[talk.tq0rkem4pc.jaydeep.pan...@dev.vnc.de (1679075166122934272), talk.tq0rkem4pc.dmitry.zolotni...@dev.vnc.de (1679075166123982848), talk.tq0rkem4pc.hayden.yo...@dev.vnc.de (1679075166125031424), talk.tq0rkem4pc.nishant.j...@dev.vnc.de (1679075166125031425), talk.tq0rkem4pc.macanh....@dev.vnc.de (1679075166126080000), talk.tq0rkem4pc.kapil.nadiyap...@dev.vnc.de (1679075166126080001), talk.tq0rkem4pc.sanjay.domad...@dev.vnc.de (1679075166126080002), talk.tq0rkem4pc.umesh.sarva...@dev.vnc.de (1679075166127128576)],commit=} 0 8
>
> Selecting records looks good:
>
> {
>   "talk_id_s":"tq0rkem4pc",
>   "talk_internal_id_s":"29896",
>   "from_s":"from address",
>   "content_txt":["test_1000016"],
>   "raw_txt":["<body xmlns=\"http://www.w3.org/1999/xhtml\">test_1000016</body>"],
>   "created_dt":"2020-09-28T11:00:02Z",
>   "updated_dt":"2020-09-28T11:00:02Z",
>   "type_s":"talk",
>   "talk_type_s":"groupchat",
>   "title_s":"role__change__1_talk@conference",
>   "to_ss":["bunch", "of", "names"],
>   "owner_s":"owner address",
>   "id":"talk.tq0rkem4pc.email@address",
>   "_version_":1679075166127128576}
>
> Then, a few minutes later:
>
> 2020-09-28 11:04:33.070 INFO (qtp918312414-21) [ x:vnc] o.a.s.u.p.LogUpdateProcessorFactory [vnc] webapp=/solr path=/update params={wt=json}{add=[talk.tq0rkem4pc.hayden.yo...@dev.vnc.de (1679075388234399744)]} 0 1
> 2020-09-28 11:04:33.150 INFO (qtp918312414-21) [ x:vnc] o.a.s.u.p.LogUpdateProcessorFactory [vnc] webapp=/solr path=/update params={wt=json}{add=[talk.tq0rkem4pc (1679075388318285824)]} 0 0
>
> Checking the record again:
>
> {
>   "id":"talk.tq0rkem4pc.email@address",
>   "_version_":1679075388234399744},
> {
>   "id":"talk.tq0rkem4pc",
>   "_version_":1679075388318285824}
>
> A couple of strange things here:
>
> 1. My talk.tq0rkem4pc.email@address record no longer has any data in it, just id and version.
>
> 2. The second entry is really strange; this isn't a valid record at all and I don't have any record of creating it.
>
> I've ruled out reindexing items both from my indexing script (I just don't run it) and from an external code snippet updating the record at a later time.
>
> Not sure if I've got the terminology right, but would I be correct in assuming that it is possible records are not being flushed from the buffer when added? I'm assuming there is some kind of buffering or caching going on before records are committed? Is it possible they are getting corrupted under higher than usual load?
>
> On Mon, 28 Sep 2020 at 20:41, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> There are several possibilities:
>>
>> 1> you simply have some process incorrectly updating documents.
>>
>> 2> you’ve changed your schema at some point without completely deleting your old index and re-indexing all documents from scratch. I recommend in fact indexing into a new collection and using collection aliasing if you can’t delete and recreate the collection before re-indexing. There’s some support for this idea because you say that the doc in question not only changes one way, but then changes back mysteriously. So seg1 (old def) merges with seg2 (new def) into seg10 using the old def because merging saw seg1 first. Then sometime later seg3 (new def) merges with seg10 and the data mysteriously comes back because that merge uses seg3 (new def) as a template for how the index “should” look.
>>
>> But I’ve never heard of Solr (well, Lucene actually) doing this by itself, and I have heard of the merging process doing “interesting” stuff with segments created with changed schema definitions.
>>
>> Best,
>> Erick
>>
>>> On Sep 28, 2020, at 8:26 AM, Mr Havercamp <mrhaverc...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> We're seeing strange behaviour when records have been committed. It doesn't happen all the time, but enough that the index is very inconsistent.
>>>
>>> What happens:
>>>
>>> 1. We commit a doc to Solr,
>>> 2. The doc shows in the search results,
>>> 3. Later (may be immediate, may take minutes, may take hours), the same document is emptied of all data except version and id.
>>>
>>> We have custom scripts which add to the index, but even without them being executed we see records being updated in this way.
>>>
>>> For example, committing:
>>>
>>> { id: talk.1234, from: "me", to: "you", "content": "some content", title: "some title"}
>>>
>>> will, after an initial successful search, suddenly end up as:
>>>
>>> { id: talk.1234, version: 1234}
>>>
>>> Not sure how to proceed on debugging this issue. It seems to settle in after Solr has been running for a while but can just as quickly rectify itself.
>>>
>>> At a loss how to debug and proceed.
>>>
>>> Any help much appreciated.
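P.S. Re the collection aliasing I mentioned earlier in this thread: if you're on SolrCloud, the usual pattern is to index into a brand-new collection and then flip an alias over to it, roughly like this (collection and alias names made up for illustration):

  /solr/admin/collections?action=CREATE&name=vnc_v2&numShards=1
  ... reindex everything into vnc_v2 ...
  /solr/admin/collections?action=CREATEALIAS&name=vnc&collections=vnc_v2

Clients keep talking to "vnc" and never notice the switch. On standalone Solr there are no aliases; the rough equivalent is building a new core and using the CoreAdmin SWAP action.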