Please allow me to answer my own question. I was using the ThreadLocalCleaner <https://github.com/apache/sling-org-apache-sling-commons-threads/blob/master/src/main/java/org/apache/sling/commons/threads/impl/ThreadLocalCleaner.java> class from the Apache Sling project, which is a very useful (but dangerous) tool. Bottom line: it doesn't like weak references held in ThreadLocals, as in Solr's CloseableThreadLocal class.
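A minimal sketch of this kind of failure, assuming a ThreadLocal that holds only a weak reference to its value. This is not the Lucene or Sling code: WeakThreadLocalSketch, Reader and simulateExternalCleanup() are hypothetical names, and the explicit clear() call is a deterministic stand-in for whatever drops the referent in a real run (for example, garbage collection after the hard reference is lost).

```java
import java.lang.ref.WeakReference;

// Hypothetical illustration only; not taken from Lucene, Solr or Sling.
class WeakThreadLocalSketch {

    // Stand-in for a per-thread reader object.
    static final class Reader {
        String visit(int docId) {
            return "doc-" + docId;
        }
    }

    // The ThreadLocal stores only a weak reference to the reader.
    private final ThreadLocal<WeakReference<Reader>> local = new ThreadLocal<>();

    // Hard reference that keeps the reader alive (simplified to one thread).
    private Reader hardRef;

    Reader get() {
        WeakReference<Reader> ref = local.get();
        if (ref == null) {
            // First access on this thread: create the value and remember it.
            hardRef = new Reader();
            local.set(new WeakReference<>(hardRef));
            return hardRef;
        }
        // The entry exists, but its referent may be gone: this can be null.
        return ref.get();
    }

    // Assumption: models an external tool interfering with the thread's
    // state so that the weak reference survives but its referent does not.
    void simulateExternalCleanup() {
        hardRef = null;
        WeakReference<Reader> ref = local.get();
        if (ref != null) {
            ref.clear(); // deterministic stand-in for the referent being collected
        }
    }

    public static void main(String[] args) {
        WeakThreadLocalSketch sketch = new WeakThreadLocalSketch();
        System.out.println(sketch.get().visit(7));

        sketch.simulateExternalCleanup();
        try {
            // get() now returns null, so the chained call fails.
            sketch.get().visit(7);
        } catch (NullPointerException e) {
            System.out.println("NPE after cleanup");
        }
    }
}
```

The point of the sketch is only that a getter backed by a weakly referenced ThreadLocal value can return null once the hard reference is lost, so a chained call on its result throws a NullPointerException.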
b.

On Tue, Nov 19, 2019 at 4:34 PM Bram Biesbrouck <bram.biesbro...@reinvention.be> wrote:

> Hi all,
>
> I think I might have discovered a synchronization bug when ingesting a lot
> of data into Solr, but want to check with the specialists first ;-)
>
> I'm using a little custom-written map/reduce framework that boots
> 20-something threads to do some heavy processing on data-preparation. When
> this processing is done, the results of these threads are gathered in a
> reduce step, where they are ingested into an (embedded) Solr instance. To
> maximize throughput, I'm ingesting the data in parallel in a couple of
> threads of their own and this is where I run into a synchronization error.
>
> As with all synchronization bugs, it happens "some" of the time and
> they're hard to debug, but I think I managed to get my finger on the root
> (I'm using Solr 8.3):
>
> The class org.apache.lucene.index.CodecReader throws an NPE on line 84:
> getFieldsReader().visitDocument(docID, visitor);
>
> The issue is that the getFieldsReader() getter is mapped to a ThreadLocal
> (more explicitly,
> org.apache.lucene.index.SegmentCoreReaders.fieldsReaderLocal) that seems to
> be released (set to null) somewhere automatically, and read afterwards,
> without synchronizing the two.
>
> I don't think I should set any resource locks of my own, since I'm only
> using the SolrJ API and the /update endpoint.
>
> I know this is quite a low-level question, but could anyone point me in
> the right direction to further investigate this issue? I.e., what could be
> the reason the reader is released out-of-sync?
>
> best,
>
> b.