It should not be necessary for your transientCacheSize to be greater
than or equal to the number of cores being updated simultaneously.

Theoretically, it works like this:

- A request comes in and a reference-counted core is opened to serve
it. That may require loading the core.
- If another request comes in that bumps this core out of the cache,
the core should still stay open until the current request is done.
- Once the request is done, the reference count is decremented and,
when it reaches zero, the core is closed.

So theoretically (I love that word) even though you have your
transient cache size set to 1 you can have N open transient cores, all
pending closure.
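
To make the theory concrete, here is a minimal sketch of that
reference-counting scheme. It is not Solr's implementation:
RefCountedCore, TransientCoreCache and TransientCacheDemo are
hypothetical names invented for illustration, and "loading" a core is
reduced to constructing an object.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

// Hypothetical stand-in for a core; NOT Solr's SolrCore.
class RefCountedCore {
    final String name;
    private int refCount = 1;       // the reference held by the cache
    private boolean closed = false;

    RefCountedCore(String name) { this.name = name; }

    synchronized void incref() {
        if (closed) throw new IllegalStateException(name + " is already closed");
        refCount++;
    }

    // The core is closed only when the last reference is released.
    synchronized void decref() {
        if (--refCount == 0) closed = true;
    }

    synchronized boolean isClosed() { return closed; }
}

// Hypothetical LRU cache of transient cores (assumed design, for illustration).
class TransientCoreCache {
    private final int maxSize;
    // accessOrder=true makes iteration start at the least recently used entry.
    private final LinkedHashMap<String, RefCountedCore> cores =
            new LinkedHashMap<>(16, 0.75f, true);

    TransientCoreCache(int maxSize) { this.maxSize = maxSize; }

    // Returns the core with its count incremented; the caller must decref().
    synchronized RefCountedCore acquire(String name) {
        RefCountedCore core = cores.get(name);
        if (core == null) {
            core = new RefCountedCore(name);   // "load" the core
            cores.put(name, core);
            evictIfNeeded();
        }
        core.incref();
        return core;
    }

    private void evictIfNeeded() {
        Iterator<RefCountedCore> it = cores.values().iterator();
        while (cores.size() > maxSize && it.hasNext()) {
            RefCountedCore eldest = it.next();
            it.remove();
            eldest.decref();   // drop only the cache's reference
        }
    }
}

class TransientCacheDemo {
    public static void main(String[] args) {
        TransientCoreCache cache = new TransientCoreCache(1);
        RefCountedCore c1 = cache.acquire("core1");
        RefCountedCore c2 = cache.acquire("core2"); // bumps core1 out of the cache
        System.out.println(c1.isClosed());          // false: request still holds it
        c1.decref();                                // request on core1 finishes
        System.out.println(c1.isClosed());          // true: last reference released
        c2.decref();
    }
}
```

With a cache size of 1, core1 is evicted as soon as core2 is acquired,
but in this sketch it stays open until the request holding it calls
decref().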

That said, I don't think there is a test case that deals with this
explicitly. The problem here is that you may have M requests queued up
for the _same_ core, each with a new update request. So theory aside,
Shawn's comment is very likely a way to get around this.

The model for transient cores is that a core is opened, used for a
while, then thrown away. It wasn't built with the idea of rapidly
updating a single transient core, so I can certainly believe that
that's a problem.

TestLazyCores.java has a multi-threaded test for a race condition, so
it should be possible to write a test case for the above.
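
As a rough idea of the shape such a test could take, here is a
standalone sketch. It does not use Solr or TestLazyCores at all: the
Core and Cache classes are hypothetical toy stand-ins, and the only
thing it checks is that no thread ever observes a closed core while
holding a reference, with the cache size set to 1 and threads
alternating between two core names. It does not reproduce the reported
corruption.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class RaceSketch {
    // Hypothetical core: counts as closed once its reference count hits zero.
    static final class Core {
        private int refs = 1;   // the reference held by the cache
        synchronized void incref() { refs++; }
        synchronized void decref() { refs--; }
        synchronized boolean isClosed() { return refs == 0; }
    }

    // Hypothetical bounded cache; eviction only drops the cache's own reference.
    static final class Cache {
        private final int maxSize;
        private final LinkedHashMap<String, Core> cores =
                new LinkedHashMap<>(16, 0.75f, true);

        Cache(int maxSize) { this.maxSize = maxSize; }

        synchronized Core acquire(String name) {
            Core core = cores.get(name);
            if (core == null) {
                core = new Core();
                cores.put(name, core);
                Iterator<Core> it = cores.values().iterator();
                while (cores.size() > maxSize && it.hasNext()) {
                    Core eldest = it.next();
                    it.remove();
                    eldest.decref();
                }
            }
            core.incref();
            return core;
        }
    }

    // Returns how often a thread saw a closed core while holding a reference.
    static int run(int threads, int iterations) {
        Cache cache = new Cache(1);
        AtomicInteger violations = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            String name = "core" + (t % 2 + 1);   // alternate core1 / core2
            pool.submit(() -> {
                try {
                    start.await();                // release all threads together
                    for (int i = 0; i < iterations; i++) {
                        Core core = cache.acquire(name);
                        try {
                            if (core.isClosed()) violations.incrementAndGet();
                        } finally {
                            core.decref();        // request done
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return violations.get();
    }

    public static void main(String[] args) {
        System.out.println("violations: " + run(8, 10_000));
    }
}
```

A real test would have to drive actual transient cores with update
requests instead of this toy cache.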

Best,
Erick

On Wed, Aug 22, 2018 at 9:19 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 8/16/2018 7:14 PM, Michael Hu (CMBU) wrote:
>>
>> Environment:
>>
>>    *   solr 7.4.1
>>    *   all cores are vanilla cores with "loadOnStartUp" set to false, and
>> "transient" set to true
>>    *   we have about 75 cores with "transientCacheSize" set to 32
>>
>> Issue: we have core corruption from time to time (2-3 core corruption a
>> day)
>>
>> How to reproduce:
>>
>>    *   Set the "transientCacheSize" to 1
>>    *   Ingest high load to core1 only (no issue at this time)
>>    *   Continue ingest high load to core1 and start ingest load to core2
>> simultaneously (core2 immediately corrupted) (stack trace is attached below)
>
>
> If a core gets unloaded while you're sending data to it, operation is
> probably unpredictable.  Core corruption isn't good, but I'm not surprised
> that it happens in this scenario.
>
> Your transientCacheSize must allow all cores which are getting updates to be
> in memory at the same time, so unless that's all of your cores, the number
> should probably be larger than the number of cores getting updates, so you
> can query other cores simultaneously.
>
> Thanks,
> Shawn
>
