Did you try running with debugQuery=on when performance is slow, to see which component (query, facet, highlight, ...) is taking the time? Just to rule out that some other component is the culprit...
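As an illustration of what to look for: the per-component breakdown appears under debug > timing in the response, and a few lines of code can rank the components by time. This is only a sketch — the sample response below is made up, but it is shaped like Solr's debug=timing output (a "prepare" and a "process" phase, each with per-component "time" entries):

```python
# Sketch: rank search components by time, given the "debug" -> "timing"
# section of a Solr response requested with debug=timing (or debugQuery=on).
# The sample dict below is illustrative, not from a real server.

def slowest_components(timing: dict) -> list:
    """Return (component, ms) pairs from the 'process' phase, slowest first."""
    process = timing.get("process", {})
    comps = [(name, info["time"])
             for name, info in process.items()
             if isinstance(info, dict) and "time" in info]
    return sorted(comps, key=lambda nt: nt[1], reverse=True)

# Illustrative timing section, shaped like Solr's debug output:
sample = {
    "time": 212.0,
    "prepare": {"time": 1.0, "query": {"time": 1.0}},
    "process": {
        "time": 211.0,
        "query": {"time": 5.0},
        "facet": {"time": 2.0},
        "highlight": {"time": 204.0},
    },
}

for name, ms in slowest_components(sample):
    print(f"{name}: {ms} ms")
```

If one component (say, highlighting) dominates like this, you know where to dig; if the whole of QTime is small, the slowness is outside the search components entirely (response writing, network, etc.).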
Thnx

On Mon, Jun 25, 2018 at 2:06 PM, Chris Troullis <[email protected]> wrote:

> FYI to all, just as an update, we rebuilt the index in question from
> scratch for a second time this weekend and the problem went away on 1 node,
> but we were still seeing it on the other node. After restarting the
> problematic node, the problem went away. Still makes me a little uneasy as
> we weren't able to determine the cause, but at least we are back to normal
> query times now.
>
> Chris
>
> On Fri, Jun 15, 2018 at 8:06 AM, Chris Troullis <[email protected]> wrote:
>
> > Thanks Shawn,
> >
> > As mentioned previously, we are hard committing every 60 seconds, which we
> > have been doing for years, and have had no issues until enabling CDCR. We
> > have never seen large tlog sizes before, and even manually issuing a hard
> > commit to the collection does not reduce the size of the tlogs. I believe
> > this is because when using the CDCRUpdateLog the tlogs are not purged until
> > the docs have been replicated over. Anyway, since we manually purged the
> > tlogs they seem to now be staying at an acceptable size, so I don't think
> > that is the cause. The documents are not abnormally large, maybe ~20
> > string/numeric fields with simple whitespace tokenization.
> >
> > To answer your questions:
> >
> > - Solr version: 7.2.1
> > - What OS vendor and version Solr is running on: CentOS 6
> > - Total document count on the server (counting all index cores): 13
> >   collections totaling ~60 million docs
> > - Total index size on the server (counting all cores): ~60GB
> > - What the total of all Solr heaps on the server is: 16GB heap (we had to
> >   increase for CDCR because it was using a lot more heap).
> > - Whether there is software other than Solr on the server: No
> > - How much total memory the server has installed: 64 GB
> >
> > All of this has been consistent for multiple years across multiple Solr
> > versions and we have only started seeing this issue once we started using
> > the CDCRUpdateLog and CDCR, hence why that is the only real thing we can
> > point to. And again, the issue is only affecting 1 of the 13 collections on
> > the server, so if it was hardware/heap/GC related then I would think we
> > would be seeing it for every collection, not just one, as they all share
> > the same resources.
> >
> > I will take a look at the GC logs, but I don't think that is the cause.
> > The consistent nature of the slow performance doesn't really point to GC
> > issues, and we have profiling set up in New Relic and it does not show any
> > long/frequent GC pauses.
> >
> > We are going to try and rebuild the collection from scratch again this
> > weekend as that has solved the issue in some lower environments, although
> > it's not really consistent. At this point it's all we can think of to do.
> >
> > Thanks,
> >
> > Chris
> >
> > On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey <[email protected]> wrote:
> >
> >> On 6/12/2018 12:06 PM, Chris Troullis wrote:
> >> > The issue we are seeing is with 1 collection in particular, after we
> >> > set up CDCR, we are getting extremely slow response times when retrieving
> >> > documents. Debugging the query shows QTime is almost nothing, but the
> >> > overall responseTime is like 5x what it should be. The problem is
> >> > exacerbated by larger result sizes. IE retrieving 25 results is almost
> >> > normal, but 200 results is way slower than normal. I can run the exact same
> >> > query multiple times in a row (so everything should be cached), and I still
> >> > see response times way higher than another environment that is not using
> >> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
> >> > we are using the CDCRUpdateLog. The problem started happening even before
> >> > we enabled CDCR.
> >> >
> >> > In a lower environment we noticed that the transaction logs were huge
> >> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
> >> > restarting, and that seemed to fix the performance issue. We tried the same
> >> > thing in production the other day but it had no effect, so now I don't know
> >> > if it was a coincidence or not.
> >>
> >> There is one other cause besides CDCR buffering that I know of for huge
> >> transaction logs, and it has nothing to do with CDCR: a lack of hard
> >> commits. It is strongly recommended to have autoCommit set to a
> >> reasonably short interval (about a minute in my opinion, but 15 seconds
> >> is VERY common). Most of the time openSearcher should be set to false
> >> in the autoCommit config, and other mechanisms (which might include
> >> autoSoftCommit) should be used for change visibility. The example
> >> autoCommit settings might seem superfluous because they don't affect
> >> what's searchable, but it is actually a very important configuration to
> >> keep.
> >>
> >> Are the docs in this collection really big, by chance?
> >>
> >> As I went through previous threads you've started on the mailing list, I
> >> have noticed that none of your messages provided some details that would
> >> be useful for looking into performance problems:
> >>
> >> * What OS vendor and version Solr is running on.
> >> * Total document count on the server (counting all index cores).
> >> * Total index size on the server (counting all cores).
> >> * What the total of all Solr heaps on the server is.
> >> * Whether there is software other than Solr on the server.
> >> * How much total memory the server has installed.
> >>
> >> If you name the OS, I can use that information to help you gather some
> >> additional info which will actually show me most of that list. Total
> >> document count is something that I cannot get from the info I would help
> >> you gather.
> >>
> >> Something else that can cause performance issues is GC pauses. If you
> >> provide a GC log (The script that starts Solr logs this by default), we
> >> can analyze it to see if that's a problem.
> >>
> >> Attachments to messages on the mailing list typically do not make it to
> >> the list, so a file sharing website is a better way to share large
> >> logfiles. A paste website is good for log data that's smaller.
> >>
> >> Thanks,
> >> Shawn
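For reference, the autoCommit setup Shawn describes (hard commit on a short interval, openSearcher=false, visibility handled separately) looks something like this in solrconfig.xml. This is a sketch, not a drop-in config: the 60-second hard-commit interval matches what the thread discusses, but the soft-commit interval is an arbitrary placeholder to tune per application:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 60s to flush segments and rotate tlogs;
       don't open a new searcher on these commits. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Change visibility via soft commits; 30s is a placeholder value. -->
  <autoSoftCommit>
    <maxTime>30000</maxTime>
  </autoSoftCommit>
</updateHandler>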

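As a footnote on the GC-pause question: before shipping a full log off for analysis, it's easy to pre-scan it yourself. Below is a minimal sketch that assumes Java 8-style GC logs containing "Total time for which application threads were stopped" lines (produced by -XX:+PrintGCApplicationStoppedTime, which Solr's default GC logging options include on Java 8); the sample lines are fabricated for illustration:

```python
import re

# Sketch: flag long stop-the-world pauses in a JVM GC log. Assumes
# -XX:+PrintGCApplicationStoppedTime style lines; adjust the regex for
# other JVMs/log formats. The sample lines below are fabricated.

PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9.]+) seconds")

def long_pauses(lines, threshold_secs=0.5):
    """Return pause durations (in seconds) at or above the threshold."""
    pauses = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            secs = float(m.group(1))
            if secs >= threshold_secs:
                pauses.append(secs)
    return pauses

sample_log = [
    "2018-06-25T14:06:01.123+0000: 12.345: Total time for which application "
    "threads were stopped: 0.0042 seconds, Stopping threads took: 0.0001 seconds",
    "2018-06-25T14:06:05.456+0000: 16.678: Total time for which application "
    "threads were stopped: 1.2345 seconds, Stopping threads took: 0.0002 seconds",
]

print(long_pauses(sample_log))
```

If this turns up frequent pauses above a few hundred milliseconds, GC tuning is worth a look; if not, it supports the New Relic observation that GC is not the cause here.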