Hi,

We've been experiencing some problems during search stress tests and we have no clue why this is happening.
We have the following setup:

- 3 servers
- WebSphere 7
- ZooKeeper 3.4.5 on each server
- Solr 4.5.0 on each server
- 1 shard (so one leader and 2 replicas)
- The index contains 7M documents (about 2 GB)

We've run several stress tests with JMeter using 100-500 concurrent threads. The outcome varies with the number of threads, but apart from the timings and whether the system fully recovers or not, the sequence is always the same:

1. The Solr instances start answering queries, each with a stable number of threads (fewer than 10).
2. Once the test has been running for several minutes, we kill one of the Solr instances (most of the time the leader).
3. The remaining instances keep answering queries, with a slight increase in the number of threads used.
4. After a few minutes we restart the killed instance (and this is where our problem starts).
5. Once it is up, its thread count starts growing (up to 100 or more), and worst of all, even the other two instances start responding slowly (or not at all).

What happens next depends on the number of concurrent queries. With few, everything goes back to normal in about 3 minutes (though almost no queries are served during that period). With more than 200 concurrent queries, the restarted server's thread count grows so much that it crashes.

During the minutes when the three instances are not responding nothing is logged, and a thread dump shows a lot of stalled threads with sun.misc.Unsafe.park in their stack traces.

I don't understand this behaviour at all: not only does the cluster work better with two instances than after restarting the third, but the restart also affects the two instances that were still running...

Does anybody have any clue about this?

Thanks in advance

--
Alejandro Marqués Rodríguez
Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
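
The thread-dump observation above (many stalled threads with sun.misc.Unsafe.park in their traces) can be narrowed down by grouping the parked threads by the code that is waiting. The following is only a minimal sketch, not what was actually run here. It assumes a jstack-style text dump in which each thread's stack is a block separated by blank lines and frames start with "at "; dumps produced by other JVMs may need a different split. The function name `summarize_parked_threads` is made up for illustration.

```python
from collections import Counter

PARK_FRAME = "sun.misc.Unsafe.park"
# Frames from these packages are the parking machinery itself, not the caller.
INFRA_PREFIXES = ("sun.misc.", "java.util.concurrent.")


def summarize_parked_threads(dump_text):
    """Count parked threads, keyed by the first frame outside the parking machinery.

    Assumes jstack-style text: one block per thread, blocks separated by
    blank lines, stack frames on lines starting with "at ".
    """
    waiting_on = Counter()
    normalized = dump_text.replace("\r\n", "\n")
    for block in normalized.split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        frames = [ln[3:] for ln in lines if ln.startswith("at ")]
        # Skip threads that are not parked.
        if not any(f.startswith(PARK_FRAME) for f in frames):
            continue
        # The first frame that is not part of the parking machinery tells us
        # which code is blocked (or that it is just an idle pool thread).
        caller = next(
            (f for f in frames if not f.startswith(INFRA_PREFIXES)),
            "<unknown>",
        )
        waiting_on[caller] += 1
    return waiting_on
```

Calling `summarize_parked_threads(open("dump.txt").read()).most_common(10)` on a saved dump (the file name is a placeholder) lists the most frequent waiting callers. Many threads parked under the same caller would point at a single contended resource, while callers such as `java.lang.Thread.run` usually indicate idle pool threads.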