Hi Tim,

For what it is worth, behind Trove (http://trove.nla.gov.au/) are 3
SOLR-managed indices and 1 Lucene index. None of ours is as big as one
of your shards, and one of our SOLR-managed indices is tiny, but your
experiences with long GC pauses are familiar to us.

One of the most difficult indices to tune is our bibliographic index
of around 38M mostly-metadata records, which is around 125GB on disk
with a 97MB tii file.

We need to commit updates and reopen the index every 90 seconds. The
facet recalculation (using UnInverted) was taking quite a lot of time,
and seemed to generate lots of objects to be collected on each
reopen.
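
In case it helps, the reopen cycle is conceptually just the following
(a simplified sketch using Lucene 3.x-era APIs - the scheduling and
error handling are illustrative only, not our actual code, and real
code has to hold off closing the old reader until in-flight searches
have finished):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class PeriodicReopen {
    private final IndexWriter writer;      // receives the updates
    private volatile IndexReader reader;   // served to searchers

    public PeriodicReopen(IndexWriter writer) throws Exception {
        this.writer = writer;
        this.reader = IndexReader.open(writer.getDirectory());
    }

    // Every 90 seconds: commit pending updates, then swap in a fresh reader.
    public void start() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                try {
                    writer.commit();
                    IndexReader newReader = reader.reopen();
                    if (newReader != reader) {
                        IndexReader old = reader;
                        reader = newReader;  // facet data is rebuilt against this reader
                        old.close();         // real code waits for in-flight searches first
                    }
                } catch (Exception e) {
                    e.printStackTrace();     // log and retry on the next cycle
                }
            }
        }, 90, 90, TimeUnit.SECONDS);
    }
}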

Although we'd been through several rounds of tuning which seemed to
work, at least temporarily, a few months ago we started getting
12-second "full GC" pauses every 90 seconds, which was no good!

We noticed/did three things:

1) optimise to 1 segment - we'd got to the stage where 50% of the
documents had been updated (hence deleted), so maxDoc was 50% bigger
than it needed to be, and hence data structures whose size was
proportional to maxDoc had grown a lot.  Optimising to 1 segment
greatly reduced full GC frequency and times (a sketch follows after
point 3).

2) for most of our facets, forcing the facets to be filters rather
than uninverted happened to work better (also sketched after point 3)
- but this depends on many factors and certainly isn't a cure-all for
all facets - uninverted often works much better than filters!

3) after lots of benchmarking real updates and queries on a dev
system, we came up with this set of JVM parameters that worked "best"
for our environment (at the moment!):

-Xmx17000M -XX:NewSize=3500M -XX:SurvivorRatio=3 \
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
-XX:+CMSIncrementalMode

I can't say exactly why, except that with this combination of
parameters and our data, a much bigger new generation led to less
promotion of objects to the old generation, and non-full-GC
collections of the old generation worked much better.  Currently we
are seeing fewer than 10 full GCs a day, and they almost always take
less than 4 seconds.
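
For point 1, the optimise itself is nothing special - something along
these lines, run at a quiet time (a sketch with Lucene 3.x-era APIs;
the path and analyzer are placeholders, not our actual code):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class OptimiseToOneSegment {
    public static void main(String[] args) throws Exception {
        // path is a placeholder, not our real index location
        FSDirectory dir = FSDirectory.open(new File("/index/bib"));
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36,
                        new StandardAnalyzer(Version.LUCENE_36)));
        writer.optimize();   // merge down to 1 segment, expunging deleted docs
        writer.close();      // (forceMerge(1) in Lucene 4+)
        dir.close();
    }
}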
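
For point 2, in SOLR terms this comes down to the per-field
facet.method setting: enum builds filters via the filterCache, while
fc uses UnInverted.  For example, with SolrJ (the field names here are
made up, not our schema):

import org.apache.solr.client.solrj.SolrQuery;

public class FacetMethodExample {
    public static SolrQuery buildQuery(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.setFacet(true);
        q.addFacetField("format", "language", "author");  // made-up field names
        // low-cardinality fields: filter (filterCache) based faceting
        q.set("f.format.facet.method", "enum");
        q.set("f.language.facet.method", "enum");
        // high-cardinality field: leave it uninverted
        q.set("f.author.facet.method", "fc");
        return q;
    }
}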

This index is running on an 8-core X5570 machine with 64GB of RAM,
shared with a large/busy MySQL instance and the Trove web server.

One of our other indices is only updated once per day, but is larger:
33.5M docs representing the full text of archived web pages, 246GB on
disk, with a 36MB tii file.

JVM parms are -Xmx10000M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC.

It also does fewer than 10 full GCs per day, each taking less than 5 seconds.

Our other large index, newspapers, is a native Lucene index, about
180GB with a comparatively large tii of 280MB (probably for the same
reason your tii is large - the contents of this index are mostly
OCR'ed text).  This index is updated/reopened every 3 minutes (to
incorporate OCR text corrections and tagging), and we use bitmaps to
represent all facet values, which typically take 5 seconds to rebuild
on each reopen.
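
If it's useful to anyone, the per-reopen rebuild is conceptually just
this per facet value (a stripped-down sketch using Lucene 3.x TermDocs
and OpenBitSet; the field/value handling is illustrative - our real
code walks all values of several facet fields):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

public class FacetBitmaps {
    // Build a bitmap of the documents carrying one facet value,
    // sized by maxDoc of the newly reopened reader.
    public static OpenBitSet bitmapFor(IndexReader reader, String field, String value)
            throws Exception {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs(new Term(field, value));
        try {
            while (termDocs.next()) {
                bits.set(termDocs.doc());
            }
        } finally {
            termDocs.close();
        }
        return bits;
    }
}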

JVM parms: -mx15000M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

Although this JVM usually does fewer than 5 full GCs per day, they
often take 20-30 seconds, and we need to test increasing NewSize on
this JVM to see if we can reduce these pauses.

The web archive and newspaper indices are running on an 8-core X5570
machine with 72GB of RAM.

We are also running a separate copy/version of this index behind the
site http://newspapers.nla.gov.au/ - the main difference is that the
Trove version uses shingling (inspired by the HathiTrust results) to
improve searches containing common words.  This other version is
running on a machine with 32GB and 8 X5460 cores, and has JVM parms:
  -mx11500M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
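
The shingling is nothing fancy - conceptually just wrapping the
existing analyzer so that adjacent word pairs are indexed as single
terms, along these lines (a sketch using Lucene's contrib
ShingleAnalyzerWrapper; the base analyzer and shingle size are
illustrative, not exactly what Trove uses):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class ShinglingExample {
    // Wrap a base analyzer so that word pairs ("the court", "court found", ...)
    // are indexed as single terms; queries containing common words can then
    // match on the much rarer pair terms.
    public static Analyzer shinglingAnalyzer() {
        Analyzer base = new StandardAnalyzer(Version.LUCENE_36);
        return new ShingleAnalyzerWrapper(base, 2);  // max shingle size 2 = word pairs
    }
}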


Apart from the old newspapers index, all other SOLR/Lucene indices are
maintained on SSDs (Intel X25-M 160GB), which, whilst having nothing
to do with GC, work very well - we couldn't cope with our current
query volumes on rotating disks without spending a great deal of
money.  The old newspaper index is running on a SAN with 24 fast disks
backing it, and we can't support the same query rate on it as we can
with the other newspaper index on SSDs (even before the shingling
change).

Kent Fitch
Trove development team
National Library of Australia
