Also check the ZooKeeper config file settings for ZooKeeper's keep-alive;
in some versions that was set to 15 seconds. Even so, it shouldn't
be tripped by a 4 second GC pause.
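
If memory serves, ZooKeeper also clamps whatever session timeout the client
asks for to a window derived from tickTime, so the value in zoo.cfg matters
as much as -DzkClientTimeout. A sketch (values are illustrative, not your
config):

  # zoo.cfg
  tickTime=2000
  # ZooKeeper negotiates client session timeouts into the range
  # [2 * tickTime, 20 * tickTime], i.e. 4s..40s here, so a
  # -DzkClientTimeout=30000 on the Solr side only sticks if it
  # falls inside that window.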

Best,
Erick

On Thu, Oct 15, 2015 at 7:33 AM, Lorenzo Fundaró
<lorenzo.fund...@dawandamail.com> wrote:
> On 15 October 2015 at 15:59, Eric Torti <ericzu...@gmail.com> wrote:
>
>> Hi, Lorenzo,
>>
>> I don't think this has a direct relation to your problem but it looks
>> like you're setting -DzkClientTimeout twice. From what I know about
>>
>
> Thanks, I am aware of the double setting, but I also think the JVM is
> taking the last argument, which is the intended one.
>
>
>> setting VM arguments twice, you're
>> probably ending up with the last one being enforced.
>>
>> Just something to be aware of I guess.
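>>
>> If you want to double-check which of the two values actually won,
>> something like this should print the effective system property (a
>> sketch; assumes a JDK with jinfo on the box and that Solr is the only
>> process matching start.jar):
>>
>>   jinfo -sysprops $(pgrep -f start.jar) | grep zkClientTimeout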
>>
>> I don't think this relates to your problem, because the GC pauses don't
>> exceed 30s, which seems to be how long ZooKeeper will let a node be
>> unresponsive before considering it in recovery.
>>
>
> Ok, good to consider. Thanks :)
>
>>
>> Best,
>>
>> Eric Torti
>>
>> On Thu, Oct 15, 2015 at 6:51 AM, Lorenzo Fundaró
>> <lorenzo.fund...@dawandamail.com> wrote:
>> > On 14 October 2015 at 20:35, Pushkar Raste <pushkar.ra...@gmail.com>
>> wrote:
>> >
>> >> You may want to start Solr with the following settings to enable GC
>> >> logging. Here are some flags you might want to enable:
>> >>
>> >> -Xloggc:<path_to_log_dir>/gc.log
>> >> -XX:+PrintGCDetails
>> >> -XX:+PrintGCDateStamps
>> >> -XX:+PrintGCTimeStamps
>> >> -XX:+PrintTenuringDistribution
>> >> -XX:+PrintGCApplicationStoppedTime
>> >> -XX:+PrintHeapAtGC
>> >>
>> >> Once you have GC logs, look for the string "Total time for which
>> >> application threads were stopped" to check whether you have long pauses
>> >> (you may get long pauses even with young-generation GC).
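>> >>
>> >> For example, something like this pulls out just the long pauses (a
>> >> sketch; the field position can shift between JVM versions, so eyeball
>> >> a line first):
>> >>
>> >>   grep "Total time for which application threads were stopped" gc.log \
>> >>     | awk '$11 + 0 > 1.0'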
>> >>
>> >
>> >
>> > Yes, there are several lines indicating that threads are being stopped.
>> > There is one in particular that drew my attention, because about a second
>> > after it happened, 2 of my replicas went into recovery mode, including
>> > the one that suffered the thread stop.
>> >
> solr_gc.log.1.current:2015-10-15T07:47:03.263+0000: 251173.653: Total time
> for which application threads were stopped: 1.4936161 seconds, Stopping
> threads took: 0.0000502 seconds
>
> (Is one second of stopped threads enough to put a node into recovery mode?)
>> > When this happened, the leader had a couple of connection resets while
>> > trying to communicate with this replica.
>> >
> And on this server the longest stop takes 4s:
>> >
> solr_gc.log.1.current:2015-10-14T20:24:01.353+0000: 210191.743: Total time
> for which application threads were stopped: 4.0111066 seconds, Stopping
> threads took: 0.0000776 seconds
>> >
>> > These are the JVM flags:
>> >
>> > -XX:NewSize=256m -XX:MaxNewSize=256m
>> > /usr/lib/jvm/java-8-oracle/bin/java -server -Xss256k -Xms16g -Xmx16g
>> > -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90
>> > -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
>> > -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark
>> > -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly
>> > -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
>> > -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -verbose:gc
>> > -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
>> > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
>> > -XX:+PrintGCApplicationStoppedTime -Xloggc:/solr_gc.log
>> > -DzkClientTimeout=15000 -Duser.timezone=UTC
>> > -Djava.net.preferIPv4Stack=true -DzkClientTimeout=30000 -XX:NewSize=256m
>> > -XX:MaxNewSize=256m
>> >
>> > All of the options are the defaults that come with the Solr startup
>> > script; only the ones specified in the first line are set by us.
>> >
>> >
>> >
>> >
>> >>
>> >> -- Pushkar Raste
>> >>
>> >> On Wed, Oct 14, 2015 at 11:47 AM, Lorenzo Fundaró <
>> >> lorenzo.fund...@dawandamail.com> wrote:
>> >>
>> >> > <<What do you see if you attach
>> >> > &debug=true to the query?>>
>> >> >
>> >> > "debug": { "rawquerystring": "*:*", "querystring": "*:*",
>> "parsedquery":
>> >> > "(+MatchAllDocsQuery(*:*))/no_coord", "parsedquery_toString": "+*:*",
>> "
>> >> > explain": { "Product:47047358": "\n1.0 = (MATCH) MatchAllDocsQuery,
>> >> product
>> >> > of:\n 1.0 = queryNorm\n", "Product:32211113": "\n1.0 = (MATCH)
>> >> > MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> "Product:30852121":
>> >> > "\n1.0
>> >> > = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n", "
>> >> > Product:35018929": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n
>> 1.0
>> >> =
>> >> > queryNorm\n", "Product:31682082": "\n1.0 = (MATCH) MatchAllDocsQuery,
>> >> > product of:\n 1.0 = queryNorm\n", "Product:31077677": "\n1.0 = (MATCH)
>> >> > MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> "Product:22298365":
>> >> > "\n1.0
>> >> > = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n", "
>> >> > Product:41094514": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n
>> 1.0
>> >> =
>> >> > queryNorm\n", "Product:13106166": "\n1.0 = (MATCH) MatchAllDocsQuery,
>> >> > product of:\n 1.0 = queryNorm\n", "Product:19142249": "\n1.0 = (MATCH)
>> >> > MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> "Product:38243373":
>> >> > "\n1.0
>> >> > = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n", "
>> >> > Product:20434065": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n
>> 1.0
>> >> =
>> >> > queryNorm\n", "Product:25194801": "\n1.0 = (MATCH) MatchAllDocsQuery,
>> >> > product of:\n 1.0 = queryNorm\n", "Product:885482": "\n1.0 = (MATCH)
>> >> > MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> "Product:45356790":
>> >> > "\n1.0
>> >> > = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n", "
>> >> > Product:67719831": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n
>> 1.0
>> >> =
>> >> > queryNorm\n", "Product:12843394": "\n1.0 = (MATCH) MatchAllDocsQuery,
>> >> > product of:\n 1.0 = queryNorm\n", "Product:38126213": "\n1.0 = (MATCH)
>> >> > MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> "Product:38798130":
>> >> > "\n1.0
>> >> > = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n", "
>> >> > Product:30292169": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n
>> 1.0
>> >> =
>> >> > queryNorm\n", "Product:11535854": "\n1.0 = (MATCH) MatchAllDocsQuery,
>> >> > product of:\n 1.0 = queryNorm\n", "Product:8443674": "\n1.0 = (MATCH)
>> >> > MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> "Product:51012182":
>> >> > "\n1.0
>> >> > = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n", "
>> >> > Product:75780871": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n
>> 1.0
>> >> =
>> >> > queryNorm\n", "Product:20227881": "\n1.0 = (MATCH) MatchAllDocsQuery,
>> >> > product of:\n 1.0 = queryNorm\n", "Product:38093629": "\n1.0 = (MATCH)
>> >> > MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> "Product:3142218":
>> >> > "\n1.0
>> >> > = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n", "
>> >> > Product:15295602": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n
>> 1.0
>> >> =
>> >> > queryNorm\n", "Product:3375982": "\n1.0 = (MATCH) MatchAllDocsQuery,
>> >> > product of:\n 1.0 = queryNorm\n", "Product:38276777": "\n1.0 = (MATCH)
>> >> > MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> "Product:10726118":
>> >> > "\n1.0
>> >> > = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n", "
>> >> > Product:50827742": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n
>> 1.0
>> >> =
>> >> > queryNorm\n", "Product:5771722": "\n1.0 = (MATCH) MatchAllDocsQuery,
>> >> > product of:\n 1.0 = queryNorm\n", "Product:3245678": "\n1.0 = (MATCH)
>> >> > MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n",
>> "Product:13702130":
>> >> > "\n1.0
>> >> > = (MATCH) MatchAllDocsQuery, product of:\n 1.0 = queryNorm\n", "
>> >> > Product:25679953": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n
>> 1.0
>> >> =
>> >> > queryNorm\n" }, "QParser": "ExtendedDismaxQParser", "altquerystring":
>> >> null,
>> >> > "boost_queries": null, "parsed_boost_queries": [], "boostfuncs":
>> null, "
>> >> > filter_queries": [ "{!cost=1 cache=true}type_s:Product AND
>> >> > is_valid_b:true",
>> >> > "{!cost=50 cache=true}in_languages_t:de", "{!cost=99
>> >> > cache=false}(shipping_country_codes_mt: (DE OR EURO OR EUR OR ALL))
>> AND
>> >> > (cents_ri: [* TO 3000])" ], "parsed_filter_queries": [
>> "+type_s:Product
>> >> > +is_valid_b:true", "in_languages_t:de", "{!cache=false
>> >> > cost=99}+(shipping_country_codes_mt:de shipping_country_codes_mt:euro
>> >> > shipping_country_codes_mt:eur shipping_country_codes_mt:all)
>> +cents_ri:[*
>> >> > TO 3000]" ], "timing": { "time": 18, "prepare": { "time": 0, "query":
>> { "
>> >> > time": 0 }, "facet": { "time": 0 }, "mlt": { "time": 0 },
>> "highlight": {
>> >> "
>> >> > time": 0 }, "stats": { "time": 0 }, "expand": { "time": 0 },
>> >> "spellcheck":
>> >> > {
>> >> > "time": 0 }, "debug": { "time": 0 } }, "process": { "time": 18,
>> "query":
>> >> {
>> >> > "
>> >> > time": 0 }, "facet": { "time": 0 }, "mlt": { "time": 0 },
>> "highlight": {
>> >> "
>> >> > time": 0 }, "stats": { "time": 0 }, "expand": { "time": 0 },
>> >> "spellcheck":
>> >> > {
>> >> > "time": 0 }, "debug": { "time": 18 } }
>> >> >
>> >> > I think this clause is wrong:
>> >> > (cents_ri: [* 3000])
>> >> >
>> >> > I think you mean
>> >> > (cents_ri: [* TO 3000])
>> >> >
>> >> > I think it made no difference. I tried both and they both worked.
>> >> >
>> >> > But are these slow queries constant or intermittent?
>> >> >
>> >> > They are definitely cached. The second time it runs in no time.
>> >> >
>> >> > I am going to try adding them to the cache pre-warming too, and see
>> >> > the results.
>> >> >
>> >> > The field that I use for sorting is indexed but not stored, and it's
>> >> > not a DocValue. I tried the query without the sort and the performance
>> >> > didn't change significantly.
>> >> >
>> >> > At the time I saw that log, the server was getting around 7k updates
>> >> > per minute. I don't have the query information now, but I think it
>> >> > would qualify as a heavy-load query.
>> >> >
>> >> > Thank you !
>> >> >
>> >> >
>> >> > On 14 October 2015 at 17:14, Erick Erickson <erickerick...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > A couple of things don't particularly make sense here:
>> >> > >
>> >> > > You specify edismax and q=*:*, yet your qf= is empty, so you're
>> >> > > searching across whatever you defined as the default field in the
>> >> > > request handler. What do you see if you attach &debug=true to the
>> >> > > query?
>> >> > >
>> >> > > I think this clause is wrong:
>> >> > > (cents_ri: [* 3000])
>> >> > >
>> >> > > I think you mean
>> >> > > (cents_ri: [* TO 3000])
>> >> > >
>> >> > > I'm not sure either of those is the problem, but they're places I'd
>> >> > > start.
>> >> > >
>> >> > > As far as the size of your filter cache goes, a hit ratio of .87
>> >> > > actually isn't bad. Upping the size would add some marginal benefit,
>> >> > > but it's unlikely to be a magic bullet.
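>> >> > >
>> >> > > If you do want to experiment, it's a one-line change in
>> >> > > solrconfig.xml (a sketch; the sizes are illustrative, not a
>> >> > > recommendation):
>> >> > >
>> >> > >   <filterCache class="solr.FastLRUCache" size="512"
>> >> > >                initialSize="512" autowarmCount="32"/>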
>> >> > >
>> >> > > But are these slow queries constant or intermittent? In other words,
>> >> > > are all queries of this general form slow, or just the first few? In
>> >> > > particular, is the first query that mentions sorting on this field
>> >> > > slow but subsequent ones faster? In that case consider adding a query
>> >> > > to the newSearcher event in solrconfig.xml that mentions this sort;
>> >> > > that would pre-warm the sort values. Also, defining all fields that
>> >> > > you sort on as docValues="true" is recommended at this point.
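>> >> > >
>> >> > > Something along these lines in solrconfig.xml would do the pre-warm
>> >> > > (a sketch using the sort from your log; adjust q/rows to taste):
>> >> > >
>> >> > >   <listener event="newSearcher" class="solr.QuerySenderListener">
>> >> > >     <arr name="queries">
>> >> > >       <lst>
>> >> > >         <str name="q">*:*</str>
>> >> > >         <str name="sort">view_counter_i desc</str>
>> >> > >         <str name="rows">0</str>
>> >> > >       </lst>
>> >> > >     </arr>
>> >> > >   </listener>
>> >> > >
>> >> > > and in the schema, docValues on the sort field (the type name here
>> >> > > is just an example, and the change requires a reindex):
>> >> > >
>> >> > >   <field name="view_counter_i" type="tint" indexed="true"
>> >> > >          stored="false" docValues="true"/>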
>> >> > >
>> >> > > What I'd try is removing clauses to see which one is the problem; on
>> >> > > the surface this is surprisingly slow. And how heavily loaded is the
>> >> > > server? Your autocommit settings look fine; my question is more how
>> >> > > much indexing and querying is going on when you take these
>> >> > > measurements.
>> >> > >
>> >> > > Best,
>> >> > > Erick
>> >> > >
>> >> > > On Wed, Oct 14, 2015 at 3:03 AM, Lorenzo Fundaró
>> >> > > <lorenzo.fund...@dawandamail.com> wrote:
>> >> > > > Hello,
>> >> > > >
>> >> > > > I have the following conf for filters and commits:
>> >> > > >
>> >> > > > Concurrent LFU Cache(maxSize=64, initialSize=64, minSize=57,
>> >> > > > acceptableSize=60, cleanupThread=false, timeDecay=true,
>> >> > > > autowarmCount=8,
>> >> > > > regenerator=org.apache.solr.search.SolrIndexSearcher$2@169ee0fd)
>> >> > > >
>> >> > > >      <autoCommit>
>> >> > > >        <!-- Every 15 seconds -->
>> >> > > >        <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
>> >> > > >        <openSearcher>false</openSearcher>
>> >> > > >      </autoCommit>
>> >> > > >
>> >> > > >      <autoSoftCommit>
>> >> > > >        <!-- Every 10 minutes -->
>> >> > > >        <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
>> >> > > >      </autoSoftCommit>
>> >> > > >
>> >> > > > and the following stats for filters:
>> >> > > >
>> >> > > > lookups = 3602
>> >> > > > hits  =  3148
>> >> > > > hit ratio = 0.87
>> >> > > > inserts = 455
>> >> > > > evictions = 400
>> >> > > > size = 63
>> >> > > > warmupTime = 770
>> >> > > >
>> >> > > > *Problem:* a lot of slow queries, for example:
>> >> > > >
>> >> > > > {q=*:*&tie=1.0&defType=edismax&qt=standard&json.nl=map&qf=
>> >> > > > &fl=pk_i,score&start=0&sort=view_counter_i desc
>> >> > > > &fq={!cost=1 cache=true}type_s:Product AND is_valid_b:true
>> >> > > > &fq={!cost=50 cache=true}in_languages_t:de
>> >> > > > &fq={!cost=99 cache=false}(shipping_country_codes_mt: (DE OR EURO
>> >> > > > OR EUR OR ALL)) AND (cents_ri: [* 3000])
>> >> > > > &rows=36&wt=json} hits=3768003 status=0 QTime=1378
>> >> > > >
>> >> > > > I could increase the size of the filter cache to reduce the number
>> >> > > > of evictions, but it seems to me this would not solve the root
>> >> > > > problem.
>> >> > > >
>> >> > > > Any ideas on where/how to start optimising? Is it actually normal
>> >> > > > for this query to take this long?
>> >> > > >
>> >> > > > We have an index of ~14 million docs: 4 replicas, each with two
>> >> > > > cores and 1 shard.
>> >> > > >
>> >> > > > Thank you.
>> >> > > >
>> >> > > >
>> >> > >
>> >> >
>> >> >
>> >> >
>> >> >
>> >>
>> >
>> >
>> >
>>
>
>
>
> --
>
> --
> Lorenzo Fundaro
> Backend Engineer
> E-Mail: lorenzo.fund...@dawandamail.com
>
> Fax       + 49 - (0)30 - 25 76 08 52
> Tel        + 49 - (0)179 - 51 10 982
>
> DaWanda GmbH
> Windscheidstraße 18
> 10627 Berlin
>
> Geschäftsführer: Claudia Helming, Michael Pütz
> Amtsgericht Charlottenburg HRB 104695 B
