Also check the ZooKeeper config file settings for ZooKeeper's keep-alive; in some versions that was set to 15 seconds. Still, that shouldn't be tripped by a 4-second GC pause.
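For reference, the knobs involved are roughly these (file locations and values are only illustrative; check your own zoo.cfg and solr.in.sh):

  # zoo.cfg (ZooKeeper side): session timeouts are negotiated within these bounds
  tickTime=2000             # ms; by default min session timeout = 2*tickTime, max = 20*tickTime
  # maxSessionTimeout=40000 # can also be pinned explicitly

  # solr.in.sh (Solr side): the timeout the client asks for
  ZK_CLIENT_TIMEOUT=15000   # equivalent to passing -DzkClientTimeout=15000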
Best,
Erick

On Thu, Oct 15, 2015 at 7:33 AM, Lorenzo Fundaró
<lorenzo.fund...@dawandamail.com> wrote:
> On 15 October 2015 at 15:59, Eric Torti <ericzu...@gmail.com> wrote:
>
>> Hi, Lorenzo,
>>
>> I don't think this has a direct relation to your problem, but it looks
>> like you're setting -DzkClientTimeout twice. From what I know about
>
> Thanks, I am aware of this double setting, but I also think the JVM is
> taking the last argument, which is the intended one.
>
>> setting VM arguments twice, you're probably ending up with the last one
>> being enforced.
>>
>> Just something to be aware of, I guess.
>>
>> I don't think this relates to your problem, because the GC pauses are
>> not longer than 30 s, which seems to be the time ZooKeeper would let a
>> node be unresponsive before considering it in recovery.
>
> Ok, good to consider. Thanks :)
>
>> Best,
>>
>> Eric Torti
>>
>> On Thu, Oct 15, 2015 at 6:51 AM, Lorenzo Fundaró
>> <lorenzo.fund...@dawandamail.com> wrote:
>> > On 14 October 2015 at 20:35, Pushkar Raste <pushkar.ra...@gmail.com> wrote:
>> >
>> >> You may want to start Solr with the following settings to enable
>> >> logging of GC details. Here are some flags you might want to enable:
>> >>
>> >> -Xloggc:<path_to_log_dir>/gc.log
>> >> -XX:+PrintGCDetails
>> >> -XX:+PrintGCDateStamps
>> >> -XX:+PrintGCTimeStamps
>> >> -XX:+PrintTenuringDistribution
>> >> -XX:+PrintGCApplicationStoppedTime
>> >> -XX:+PrintHeapAtGC
>> >>
>> >> Once you have GC logs, look for the string "Total time for which
>> >> application threads were stopped" to check whether you have long pauses
>> >> (you may get long pauses even with young-generation GC).
>> >
>> > Yes, there are several lines indicating that threads are being stopped.
>> > There is one in particular that drew my attention, because right after
>> > it happened, 2 of my replicas went into recovery mode, including the
>> > one that suffered the thread stop:
>> >
>> > solr_gc.log.1.current:2015-10-15T07:47:03.263+0000: 251173.653: Total time
>> > for which application threads were stopped: 1.4936161 seconds, Stopping
>> > threads took: 0.0000502 seconds
>> >
>> > (Is a second of stopped threads enough to put a node into recovery mode?)
>> > When this happened, the leader had a couple of connection resets while
>> > trying to communicate with this replica.
>> >
>> > On this server the longest stop was 4 s:
>> >
>> > solr_gc.log.1.current:2015-10-14T20:24:01.353+0000: 210191.743: Total time
>> > for which application threads were stopped: 4.0111066 seconds, Stopping
>> > threads took: 0.0000776 seconds
>> >
>> > These are the JVM flags:
>> >
>> > -XX:NewSize=256m -XX:MaxNewSize=256m
>> > /usr/lib/jvm/java-8-oracle/bin/java -server -Xss256k -Xms16g -Xmx16g
>> > -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90
>> > -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
>> > -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark
>> > -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly
>> > -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
>> > -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -verbose:gc
>> > -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
>> > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
>> > -XX:+PrintGCApplicationStoppedTime -Xloggc:/solr_gc.log
>> > -DzkClientTimeout=15000 -Duser.timezone=UTC
>> > -Djava.net.preferIPv4Stack=true -DzkClientTimeout=30000 -XX:NewSize=256m
>> > -XX:MaxNewSize=256m
>> >
>> > All of these options are the defaults that come with the Solr startup
>> > script; only the ones in the first line were added by us.
>> >
>> >> -- Pushkar Raste
>> >>
>> >> On Wed, Oct 14, 2015 at 11:47 AM, Lorenzo Fundaró <
>> >> lorenzo.fund...@dawandamail.com> wrote:
>> >>
>> >> > <<What do you see if you attach &debug=true to the query?>>
>> >> >
>> >> > "debug": {
>> >> >   "rawquerystring": "*:*",
>> >> >   "querystring": "*:*",
>> >> >   "parsedquery": "(+MatchAllDocsQuery(*:*))/no_coord",
>> >> >   "parsedquery_toString": "+*:*",
>> >> >   "explain": {
>> >> >     "Product:47047358": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:32211113": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:30852121": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:35018929": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:31682082": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:31077677": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:22298365": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:41094514": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:13106166": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:19142249": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:38243373": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:20434065": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:25194801": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:885482": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:45356790": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:67719831": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:12843394": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:38126213": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:38798130": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:30292169": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:11535854": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:8443674": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:51012182": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:75780871": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:20227881": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:38093629": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:3142218": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:15295602": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:3375982": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:38276777": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:10726118": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:50827742": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:5771722": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:3245678": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:13702130": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> >> >     "Product:25679953": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n"
>> >> >   },
>> >> >   "QParser": "ExtendedDismaxQParser",
>> >> >   "altquerystring": null,
>> >> >   "boost_queries": null,
>> >> >   "parsed_boost_queries": [],
>> >> >   "boostfuncs": null,
>> >> >   "filter_queries": [
>> >> >     "{!cost=1 cache=true}type_s:Product AND is_valid_b:true",
>> >> >     "{!cost=50 cache=true}in_languages_t:de",
>> >> >     "{!cost=99 cache=false}(shipping_country_codes_mt: (DE OR EURO OR EUR OR ALL)) AND (cents_ri: [* TO 3000])"
>> >> >   ],
>> >> >   "parsed_filter_queries": [
>> >> >     "+type_s:Product +is_valid_b:true",
>> >> >     "in_languages_t:de",
>> >> >     "{!cache=false cost=99}+(shipping_country_codes_mt:de shipping_country_codes_mt:euro shipping_country_codes_mt:eur shipping_country_codes_mt:all) +cents_ri:[* TO 3000]"
>> >> >   ],
>> >> >   "timing": {
>> >> >     "time": 18,
>> >> >     "prepare": { "time": 0, "query": { "time": 0 }, "facet": { "time": 0 },
>> >> >       "mlt": { "time": 0 }, "highlight": { "time": 0 }, "stats": { "time": 0 },
>> >> >       "expand": { "time": 0 }, "spellcheck": { "time": 0 }, "debug": { "time": 0 } },
>> >> >     "process": { "time": 18, "query": { "time": 0 }, "facet": { "time": 0 },
>> >> >       "mlt": { "time": 0 }, "highlight": { "time": 0 }, "stats": { "time": 0 },
>> >> >       "expand": { "time": 0 }, "spellcheck": { "time": 0 }, "debug": { "time": 18 } }
>> >> >
>> >> > I think this clause is wrong:
>> >> > (cents_ri: [* 3000])
>> >> >
>> >> > I think you mean
>> >> > (cents_ri: [* TO 3000])
>> >> >
>> >> > I think it made no difference; I tried both and they both worked.
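>> >> >
>> >> > To narrow it down further, my plan is to run each fq on its own and
>> >> > compare QTime, roughly along these lines (host and collection name are
>> >> > placeholders):
>> >> >
>> >> > curl -G 'http://localhost:8983/solr/products/select' \
>> >> >      --data-urlencode 'q=*:*' --data-urlencode 'rows=0' \
>> >> >      --data-urlencode 'wt=json' --data-urlencode 'debug=timing' \
>> >> >      --data-urlencode 'fq={!cost=99 cache=false}(shipping_country_codes_mt:(DE OR EURO OR EUR OR ALL)) AND (cents_ri:[* TO 3000])'
>> >> >
>> >> > and the same with only the type_s/is_valid_b filter and only the
>> >> > in_languages_t filter, since each fq is evaluated (and cached) on its own.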
>> >> > But are these slow queries constant or intermittent?
>> >> >
>> >> > They are definitely cached; the second time they run in no time.
>> >> >
>> >> > I am going to try adding them to the pre-warm cache too and see the results.
>> >> >
>> >> > The field that I use for sorting is indexed but not stored, and it is
>> >> > not a DocValue. I tried the query without the sort and the performance
>> >> > didn't change significantly.
>> >> >
>> >> > By the time I saw that log, the server was getting around 7k updates per
>> >> > minute. I don't have the query figures right now, but I think it would
>> >> > qualify as a heavy query load.
>> >> >
>> >> > Thank you!
>> >> >
>> >> > On 14 October 2015 at 17:14, Erick Erickson <erickerick...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > A couple of things don't particularly make sense here:
>> >> > >
>> >> > > You specify edismax and q=*:*, yet you specify an empty qf=.
>> >> > > You're searching across whatever you defined as the default
>> >> > > field in the request handler. What do you see if you attach
>> >> > > &debug=true to the query?
>> >> > >
>> >> > > I think this clause is wrong:
>> >> > > (cents_ri: [* 3000])
>> >> > >
>> >> > > I think you mean
>> >> > > (cents_ri: [* TO 3000])
>> >> > >
>> >> > > I'm not sure either of those is the problem, but they are places I'd start.
>> >> > >
>> >> > > As far as the size of your filter cache goes, a hit ratio of .87 actually
>> >> > > isn't bad. Upping the size would add some marginal benefit, but it's
>> >> > > unlikely to be a magic bullet.
>> >> > >
>> >> > > But are these slow queries constant or intermittent? In other words,
>> >> > > are all queries of this general form slow, or just the first few? In
>> >> > > particular, is the first query that mentions sorting on this field slow
>> >> > > but subsequent ones faster? In that case, consider adding a query to the
>> >> > > newSearcher event in solrconfig.xml that mentions this sort; that would
>> >> > > pre-warm the sort values (a sketch is at the end of this mail). Also,
>> >> > > defining all fields that you sort on as docValues="true" is recommended
>> >> > > at this point.
>> >> > >
>> >> > > What I'd try is removing clauses to see which one is the problem. On
>> >> > > the surface this is surprisingly slow. And how heavily loaded is the
>> >> > > server? Your autocommit settings look fine; my question is more how much
>> >> > > indexing and querying is going on when you take these measurements.
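>> >> > >
>> >> > > Something along these lines in solrconfig.xml would do the warming (the
>> >> > > query and sort here are only an example; adapt them to your real fields):
>> >> > >
>> >> > > <listener event="newSearcher" class="solr.QuerySenderListener">
>> >> > >   <arr name="queries">
>> >> > >     <lst>
>> >> > >       <!-- touching the sort field warms its sort values on the new searcher -->
>> >> > >       <str name="q">*:*</str>
>> >> > >       <str name="sort">view_counter_i desc</str>
>> >> > >       <str name="rows">0</str>
>> >> > >     </lst>
>> >> > >   </arr>
>> >> > > </listener>
>> >> > >
>> >> > > A similar entry under the firstSearcher event covers the very first
>> >> > > searcher after startup.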
>> >> > > Best,
>> >> > > Erick
>> >> > >
>> >> > > On Wed, Oct 14, 2015 at 3:03 AM, Lorenzo Fundaró
>> >> > > <lorenzo.fund...@dawandamail.com> wrote:
>> >> > > > Hello,
>> >> > > >
>> >> > > > I have the following conf for filters and commits:
>> >> > > >
>> >> > > > Concurrent LFU Cache(maxSize=64, initialSize=64, minSize=57,
>> >> > > > acceptableSize=60, cleanupThread=false, timeDecay=true, autowarmCount=8,
>> >> > > > regenerator=org.apache.solr.search.SolrIndexSearcher$2@169ee0fd)
>> >> > > >
>> >> > > > <autoCommit>
>> >> > > >   <!-- Every 15 seconds -->
>> >> > > >   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
>> >> > > >   <openSearcher>false</openSearcher>
>> >> > > > </autoCommit>
>> >> > > >
>> >> > > > <autoSoftCommit>
>> >> > > >   <!-- Every 10 minutes -->
>> >> > > >   <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
>> >> > > > </autoSoftCommit>
>> >> > > >
>> >> > > > and the following stats for filters:
>> >> > > >
>> >> > > > lookups = 3602
>> >> > > > hits = 3148
>> >> > > > hit ratio = 0.87
>> >> > > > inserts = 455
>> >> > > > evictions = 400
>> >> > > > size = 63
>> >> > > > warmupTime = 770
>> >> > > >
>> >> > > > *Problem:* a lot of slow queries, for example:
>> >> > > >
>> >> > > > {q=*:*&tie=1.0&defType=edismax&qt=standard&json.nl=map&qf=&fl=pk_i,score
>> >> > > > &start=0&sort=view_counter_i desc
>> >> > > > &fq={!cost=1 cache=true}type_s:Product AND is_valid_b:true
>> >> > > > &fq={!cost=50 cache=true}in_languages_t:de
>> >> > > > &fq={!cost=99 cache=false}(shipping_country_codes_mt: (DE OR EURO OR EUR OR ALL)) AND (cents_ri: [* 3000])
>> >> > > > &rows=36&wt=json} hits=3768003 status=0 QTime=1378
>> >> > > >
>> >> > > > I could increase the size of the filter cache so as to reduce the
>> >> > > > evictions (see the sketch at the end of this mail), but it seems to me
>> >> > > > this would not solve the root problem.
>> >> > > >
>> >> > > > Any ideas on where/how to start optimising? Is it actually normal
>> >> > > > that this query takes this much time?
>> >> > > >
>> >> > > > We have an index of ~14 million docs. 4 replicas with two cores and
>> >> > > > 1 shard each.
>> >> > > >
>> >> > > > thank you.
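>> >> > > >
>> >> > > > For completeness, raising the filter cache size would just be a change
>> >> > > > in solrconfig.xml, roughly like this (the numbers are only an
>> >> > > > illustration, not a recommendation):
>> >> > > >
>> >> > > > <filterCache class="solr.LFUCache"
>> >> > > >              size="512"
>> >> > > >              initialSize="512"
>> >> > > >              autowarmCount="32"
>> >> > > >              timeDecay="true"/>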
>
> --
> Lorenzo Fundaro
> Backend Engineer
> E-Mail: lorenzo.fund...@dawandamail.com
>
> Fax + 49 - (0)30 - 25 76 08 52
> Tel + 49 - (0)179 - 51 10 982
>
> DaWanda GmbH
> Windscheidstraße 18
> 10627 Berlin
>
> Geschäftsführer: Claudia Helming, Michael Pütz
> Amtsgericht Charlottenburg HRB 104695 B