Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

苗海泉 Tue, 27 Feb 2018 05:44:34 -0800

Thank you. We originally ran 49 shards on 49 nodes, but in that setup Solr
often lost its connection to ZooKeeper; having too many nodes in ZooKeeper
made Solr unstable, so we reduced the shard count to 25. If performance
cannot keep up later, we will need to increase it again.
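
For reference, here is a rough sketch of the arithmetic behind the shard counts discussed in this thread. It assumes shards are spread evenly over the 49 nodes, which is a simplification and not something we have measured; the function name is only illustrative. The collection counts are the ones mentioned in the quoted messages below.

```python
# Rough shards-per-node estimate. Inputs are figures quoted in this thread;
# the even distribution over nodes is an assumption, not a measurement.
NODES = 49


def shards_per_node(collections: int, shards_per_collection: int,
                    replicas: int = 1, nodes: int = NODES) -> float:
    """Average number of shard replicas (cores) each node keeps open."""
    return collections * shards_per_collection * replicas / nodes


if __name__ == "__main__":
    # Collection counts mentioned in the thread: fast below 800,
    # normal below 936, slowing at 984, very slow around 1100.
    for collections in (800, 936, 984, 1100):
        with_25 = shards_per_node(collections, 25)
        with_49 = shards_per_node(collections, 49)
        print(f"{collections} collections: "
              f"{with_25:.0f} shards/node at 25 shards, "
              f"{with_49:.0f} shards/node at 49 shards")
```

With 25 shards per collection and about 1100 collections this gives roughly 560 shards per node, and going back to 49 shards per collection would roughly double that.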


When indexing is very slow, we find no errors in either Solr or ZooKeeper;
index building is simply slow. The log shows that automatic commits are
slow, but the root cause may not be the commit itself.

I am sorry, I do not know how to check Java heap utilization. Judging by
the GC log, GC pauses are not long. Here is the log:


{Heap before GC invocations=1144021 (full 72):
 garbage-first heap   total 33554432K, used 26982419K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
  region size 8192K, 204 young (1671168K), 26 survivors (212992K)
 Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 109051904 bytes, new threshold 1 (max 15)
- age   1:  113878760 bytes,  113878760 total
- age   2:   21264744 bytes,  135143504 total
- age   3:   17020096 bytes,  152163600 total
- age   4:   26870864 bytes,  179034464 total
, 0.0579794 secs]
   [Parallel Time: 46.9 ms, GC Workers: 18]
      [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max: 4668016046.4, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9, Sum: 116.9]
      [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum: 62.0]
         [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum: 113]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum: 428.1]
      [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum: 228.9]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum: 1.2]
      [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3, Sum: 838.0]
      [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max: 4668016092.8, Diff: 0.0]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 10.7 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 5.9 ms]
      [Ref Enq: 0.2 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 2.2 ms]
      [Humongous Reclaim: 0.4 ms]
      [Free CSet: 0.4 ms]
   [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap: 25.7G(32.0G)->24.3G(32.0G)]
Heap after GC invocations=1144022 (full 72):
 garbage-first heap   total 33554432K, used 25489656K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
  region size 8192K, 10 young (81920K), 10 survivors (81920K)
 Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
}
 [Times: user=0.84 sys=0.01, real=0.05 secs]
2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which application threads were stopped: 0.0661383 seconds, Stopping threads took: 0.0004141 seconds
2018-02-27T21:43:02.092+0800: 4668016.343: [GC concurrent-mark-end, 2.5757061 secs]
2018-02-27T21:43:02.100+0800: 4668016.351: [GC remark 2018-02-27T21:43:02.100+0800: 4668016.351: [Finalize Marking, 0.0016508 secs] 2018-02-27T21:43:02.102+0800: 4668016.352: [GC ref-proc, 0.0277818 secs] 2018-02-27T21:43:02.129+0800: 4668016.380: [Unloading, 0.0118102 secs], 0.0704296 secs]
 [Times: user=0.85 sys=0.04, real=0.07 secs]
2018-02-27T21:43:02.171+0800: 4668016.422: Total time for which application threads were stopped: 0.0785762 seconds, Stopping threads took: 0.0006159 seconds
2018-02-27T21:43:02.178+0800: 4668016.429: [GC cleanup 24G->24G(32G), 0.0391915 secs]
 [Times: user=0.64 sys=0.00, real=0.04 secs]
2018-02-27T21:43:02.218+0800: 4668016.469: Total time for which application threads were stopped: 0.0470020 seconds, Stopping threads took: 0.0001684 seconds
2018-02-27T21:43:02.540+0800: 4668016.791: Total time for which application threads were stopped: 0.0074829 seconds, Stopping threads took: 0.0004834 seconds
{Heap before GC invocations=1144023 (full 72):
 garbage-first heap   total 33554432K, used 27078904K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
  region size 8192K, 204 young (1671168K), 10 survivors (81920K)
 Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
2018-02-27T21:43:04.076+0800: 4668018.326: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
- age   1:   47719032 bytes,   47719032 total
, 0.0554183 secs]
   [Parallel Time: 48.0 ms, GC Workers: 18]
      [GC Worker Start (ms): Min: 4668018329.0, Avg: 4668018329.1, Max: 4668018329.3, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 2.9, Avg: 5.7, Max: 47.4, Diff: 44.6, Sum: 103.0]
      [Update RS (ms): Min: 0.0, Avg: 14.3, Max: 16.2, Diff: 16.2, Sum: 257.6]
         [Processed Buffers: Min: 0, Avg: 17.4, Max: 22, Diff: 22, Sum: 314]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.1, Avg: 10.9, Max: 11.9, Diff: 11.8, Sum: 196.9]
      [Termination (ms): Min: 0.0, Avg: 16.6, Max: 17.6, Diff: 17.6, Sum: 299.1]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.5]
      [GC Worker Total (ms): Min: 47.5, Avg: 47.6, Max: 47.8, Diff: 0.3, Sum: 857.6]
      [GC Worker End (ms): Min: 4668018376.7, Avg: 4668018376.8, Max: 4668018376.8, Diff: 0.0]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 7.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 2.3 ms]
      [Ref Enq: 0.2 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 2.2 ms]
      [Humongous Reclaim: 0.4 ms]
      [Free CSet: 0.4 ms]
   [Eden: 1552.0M(1552.0M)->0.0B(1488.0M) Survivors: 80.0M->144.0M Heap: 25.8G(32.0G)->24.4G(32.0G)]
Heap after GC invocations=1144024 (full 72):
 garbage-first heap   total 33554432K, used 25550050K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
  region size 8192K, 18 young (147456K), 18 survivors (147456K)
 Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
}
 [Times: user=0.82 sys=0.00, real=0.05 secs]
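
On the question of how to read heap utilization from a log like this: one possible approach is to pull the `Heap: before(capacity)->after(capacity)` summary out of each pause record and add up the reported stop times. The script below is only a minimal sketch under that assumption; the function names and the two-record sample are illustrative, and the regular expressions were written against the JDK 8 style G1 output shown above, not against any other log format. They match across line breaks, so they also tolerate the wrapping that mail clients introduce.

```python
import re

# "Heap: 25.7G(32.0G)->24.3G(32.0G)" at the end of a G1 pause record.
HEAP_RE = re.compile(
    r"Heap:\s*([\d.]+)([BKMG])\(([\d.]+)([BKMG])\)"
    r"->([\d.]+)([BKMG])\(([\d.]+)([BKMG])\)"
)
# "... threads were stopped: 0.0661383 seconds"
STOPPED_RE = re.compile(r"threads were stopped:\s*([\d.]+) seconds")

UNIT = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(value: str, unit: str) -> float:
    """Convert a value such as ('24.3', 'G') to bytes."""
    return float(value) * UNIT[unit]


def heap_per_pause(log_text: str):
    """Yield (used_before, used_after, capacity) in bytes for each pause."""
    for m in HEAP_RE.finditer(log_text):
        yield (to_bytes(m.group(1), m.group(2)),
               to_bytes(m.group(5), m.group(6)),
               to_bytes(m.group(7), m.group(8)))


def total_stopped_seconds(log_text: str) -> float:
    """Sum every 'application threads were stopped' interval in the log."""
    return sum(float(m.group(1)) for m in STOPPED_RE.finditer(log_text))


# Two summary lines and one stop-time line copied from the log above.
SAMPLE = """\
   [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap: 25.7G(32.0G)->24.3G(32.0G)]
2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which application threads were stopped: 0.0661383 seconds, Stopping threads took: 0.0004141 seconds
   [Eden: 1552.0M(1552.0M)->0.0B(1488.0M) Survivors: 80.0M->144.0M Heap: 25.8G(32.0G)->24.4G(32.0G)]
"""

if __name__ == "__main__":
    for before, after, capacity in heap_per_pause(SAMPLE):
        print(f"heap after GC: {after / capacity:.0%} of "
              f"{capacity / UNIT['G']:.0f}G (before: {before / capacity:.0%})")
    print(f"stopped in total: {total_stopped_seconds(SAMPLE):.4f} s")
```

For the two pauses above, the heap is still about three-quarters full (24.3G and 24.4G of 32G) after each young collection, even though the individual pauses are short.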




2018-02-27 20:58 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:

> Ah, so there are ~560 shards per node and not all nodes are indexing at
> the same time. Why is that? You can have better throughput if indexing on
> all nodes. If happy with shard size, you can create new collection with 49
> shards every 2h and have everything the same and index on all nodes.
>
> Back to main question: what is the heap utilisation? When you restart node
> what is heap utilisation? Do you see any errors in your logs? Do you see
> any errors in ZK logs?
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 27 Feb 2018, at 13:22, 苗海泉 <mseaspr...@gmail.com> wrote:
> >
> > Thanks for your reply again.
> > There may be a misunderstanding about what I said: we have 49 Solr
> > nodes, each collection has 25 shards, and each shard has only one
> > replica of the data (there is no additional copy). I have also reduced
> > part of the cache. If you need the metric data, I can look it up and
> > send it to you. In addition, ours is an append-only system; existing
> > data is never changed.
> >
> > 2018-02-27 20:05 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
> >
> >> Hi,
> >> It is hard to tell without looking more into your metrics. It seems to
> >> me that you are reaching limits of your cluster. I would doublecheck if
> >> memory is the issue. If I got it right, you have ~1120 shards per node.
> >> It takes some heap just to keep them open. If you have some caches
> >> enabled and if it is append only system, old shards will keep caches
> >> until reloaded.
> >> Probably will not make much diff, but with 25x2=50 shards and 49 nodes,
> >> one node will need to handle double indexing load.
> >>
> >> Emir
> >>
> >>> On 27 Feb 2018, at 12:54, 苗海泉 <mseaspr...@gmail.com> wrote:
> >>>
> >>> In addition, we found that the indexing rate was normal when the number
> >>> of collections was kept below 936, and that it became slower and slower
> >>> at 984. So far we could only work around this by temporarily deleting
> >>> older collections, but now we need more collections online. This problem
> >>> has puzzled us for a long time without a good solution; we would greatly
> >>> appreciate any ideas on how to solve it.
> >>>
> >>> 2018-02-27 19:46 GMT+08:00 苗海泉 <mseaspr...@gmail.com>:
> >>>
> >>>> Thank you for your reply.
> >>>> Each collection has 25 shards with one replica each, and each Solr node
> >>>> holds about 5T on disk.
> >>>> We checked GC and changed the settings as follows:
> >>>> SOLR_JAVA_MEM="-Xms32768m -Xmx32768m "
> >>>> GC_TUNE=" \
> >>>> -XX:+UseG1GC \
> >>>> -XX:+PerfDisableSharedMem \
> >>>> -XX:+ParallelRefProcEnabled \
> >>>> -XX:G1HeapRegionSize=8m \
> >>>> -XX:MaxGCPauseMillis=250 \
> >>>> -XX:InitiatingHeapOccupancyPercent=75 \
> >>>> -XX:+UseLargePages \
> >>>> -XX:+AggressiveOpts \
> >>>> -XX:+UseLargePages"
> >>>>
> >>>> 2018-02-27 19:27 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
> >>>>
> >>>>> Hi,
> >>>>> To get more complete picture, can you tell us how many shards/replicas
> >>>>> do you have per collection? Also what is index size on disk? Did you
> >>>>> check GC?
> >>>>>
> >>>>> BTW, using 32GB heap prevents you from using compressed oops, resulting
> >>>>> in less memory available than 31GB.
> >>>>>
> >>>>> Thanks,
> >>>>> Emir
> >>>>>
> >>>>>> On 27 Feb 2018, at 11:36, 苗海泉 <mseaspr...@gmail.com> wrote:
> >>>>>>
> >>>>>> We have run into a serious problem while using Solr. We use Solr
> >>>>>> 6.0, ingest about 500 billion documents per day, and create a new
> >>>>>> collection every hour, so there are more than a thousand collections
> >>>>>> online across 49 Solr nodes. With fewer than 800 collections,
> >>>>>> indexing is still very fast. At around 1100 collections, Solr
> >>>>>> indexing speed drops sharply: one of our programs that originally ran
> >>>>>> at about 2-3 million TPS dropped to only a few hundred or even a few
> >>>>>> tens of TPS. We have not found a good way to track this down. By the
> >>>>>> way, each Solr node is assigned 32G of memory. We checked memory,
> >>>>>> CPU, disk IO, and network IO usage, and all are in a normal state.
> >>>>>> If anyone has encountered a similar problem, please share the
> >>>>>> solution. Thank you very much.
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> ==============================
> >>>> 联创科技
> >>>> 知行如一
> >>>> ==============================
> >>>>


-- 
==============================
联创科技
知行如一
==============================
