Hi Nicolas,

> What kind of change exactly on the merge factor side ?


We increased maxMergeAtOnce and segmentsPerTier from 5 to 50. This makes
Solr merge segments less frequently after many index updates. Yes, you
need to find the sweet spot here, but do try increasing these values from
the defaults. I strongly recommend giving a two-minute read to this
<https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>.
Do note that increasing these values will require larger physical storage
until segments merge.
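As an illustrative sketch (using the TieredMergePolicyFactory syntax from the Solr 6.6 ref guide linked above; the value 50 is what worked for us, not a universal recommendation), the indexConfig section could look like:

```xml
<!-- indexConfig section of solrconfig.xml: allow more segments per tier
     before a merge is triggered, so merges happen less frequently -->
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">50</int>
    <int name="segmentsPerTier">50</int>
  </mergePolicyFactory>
</indexConfig>
```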

Besides this, do review your autoCommit config
<https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>
and the frequency of your hard commits. In our case, we don't need real-time
updates, so we can commit less frequently, which makes indexing faster.
How often do you commit? Are you committing after each XML is indexed? If
yes, what is your batch (XML) size? Review the default autoCommit settings
and consider increasing them. Do you need real-time visibility of updates?
If not, you can trade off commits and merge factors for faster indexing,
and skip soft commits entirely.
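A minimal sketch of such an updateHandler section (the thresholds are illustrative; tune them for your own load):

```xml
<!-- updateHandler section of solrconfig.xml: hard-commit rarely, and
     don't open a new searcher on every hard commit -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>50000</maxDocs>   <!-- commit after 50,000 docs -->
    <maxTime>600000</maxTime>  <!-- or after 10 minutes, whichever first -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- no autoSoftCommit block: soft commits stay disabled when
       real-time visibility of updates is not needed -->
</updateHandler>
```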

In our case, I have set autoCommit to commit after 50,000 documents are
indexed. With EdgeNGram tokenization, we have seen the index grow beyond
60 GB during full indexing. Once full indexing is done, I optimize the
index and its size comes down below 13 GB! Since we can trade off space
temporarily for increased indexing speed, we are still looking for sweeter
spots for faster indexing. For statistics, we index over 250 million
documents, which converge to 60 million unique documents after atomic
updates (full indexing).
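For reference, the optimize we run after full indexing can be sent as a plain update command; shown here in Solr's XML update message format (the maxSegments value is illustrative):

```xml
<!-- POST to /solr/<core>/update to force-merge the index down to a
     single segment, e.g. once full indexing is complete -->
<optimize maxSegments="1" waitSearcher="true"/>
```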



> Would you say atomical update is faster than regular replacement of
> documents?


No, I wouldn't say that. Either of the two configs (autoCommit, merge
policy) will impact regular indexing too. In our case, non-atomic indexing
is out of the question.
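For completeness, appending values to a multi-valued field via an atomic update (the use case discussed in this thread) looks roughly like this in Solr's XML update format (document id and field name are hypothetical):

```xml
<!-- POST to /solr/<core>/update: atomically append two values to the
     multi-valued field "tags" of document "doc-1"; the other fields are
     left untouched, but must be stored (or docValues) to be preserved -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="tags" update="add">new-value-1</field>
    <field name="tags" update="add">new-value-2</field>
  </doc>
</add>
```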

On Wed, 23 Oct 2019 at 00:43, Nicolas Paris <nicolas.pa...@riseup.net>
wrote:

> > We, at Auto-Suggest, also do atomic updates daily and specifically
> > changing merge factor gave us a boost of ~4x
>
> Interesting. What kind of change exactly on the merge factor side ?
>
>
> > At current configuration, our core atomically updates ~423 documents
> > per second.
>
> Would you say atomical update is faster than regular replacement of
> documents ? (considering my first thought on this below)
>
> > > I am wondering if **atomic update feature** would faster the process.
> > > From one hand, using this feature would save network because only a
> > > small subset of the document would be send from the client to the
> > > server.
> > > On the other hand, the server will have to collect the values from the
> > > disk and reindex them. In addition, this implies to store the values
> > > every fields (I am not storing every fields) and use more space.
>
>
> Thanks Paras
>
>
>
> On Tue, Oct 22, 2019 at 01:00:10PM +0530, Paras Lehana wrote:
> > Hi Nicolas,
> >
> > Have you tried playing with values of *IndexConfig*
> > <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html
> >
> > (merge factor, segment size, maxBufferedDocs, Merge Policies)? We, at
> > Auto-Suggest, also do atomic updates daily and specifically changing
> merge
> > factor gave us a boost of ~4x during indexing. At current configuration,
> > our core atomically updates ~423 documents per second.
> >
> > On Sun, 20 Oct 2019 at 02:07, Nicolas Paris <nicolas.pa...@riseup.net>
> > wrote:
> >
> > > > Maybe you need to give more details. I recommend always to try and
> > > > test yourself as you know your own solution best. What performance
> > > > does your use case need and what is your current performance?
> > >
> > > I have 10 collections on 4 shards (no replications). The collections
> are
> > > quite large ranging from 2GB to 60 GB per shard. In every case, the
> > > update process only add several values to an indexed array field on a
> > > document subset of each collection. The proportion of the subset is
> from
> > > 0 to 100%, and 95% of time below 20%. The array field represents 1 over
> > > 20 fields which are mainly unstored fields with some large textual
> > > fields.
> > >
> > > The 4 solr instance collocate with the spark. Right now I tested with
> 40
> > > spark executors. Commit timing and commit number document are both set
> > > to 20000. Each shard has 20g of memory.
> > > Loading/replacing the largest collection is about 2 hours - which is
> > > quite fast I guess. Updating 5% percent of documents of each
> > > collections, is about half an hour.
> > >
> > > Because my need is "only" to append several values to an array I
> suspect
> > > there is some trick to make things faster.
> > >
> > >
> > >
> > > On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> > > > Maybe you need to give more details. I recommend always to try and
> test
> > > yourself as you know your own solution best. Depending on your spark
> > > process atomic updates  could be faster.
> > > >
> > > > With Spark-Solr additional complexity comes. You could have too many
> > > executors for your Solr instance(s), ie a too high parallelism.
> > > >
> > > > Probably the most important question is:
> > > > What performance does your use case need and what is your current
> > > performance?
> > > >
> > > > Once this is clear further architecture aspects can be derived, such
> as
> > > number of spark executors, number of Solr instances, sharding,
> replication,
> > > commit timing etc.
> > > >
> > > > > Am 19.10.2019 um 21:52 schrieb Nicolas Paris <
> nicolas.pa...@riseup.net
> > > >:
> > > > >
> > > > > Hi community,
> > > > >
> > > > > Any advice to speed-up updates ?
> > > > > Is there any advice on commit, memory, docvalues, stored or any
> tips to
> > > > > faster things ?
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> > > > >> Hi
> > > > >>
> > > > >> I am looking for a way to faster the update of documents.
> > > > >>
> > > > >> In my context, the update replaces one of the many existing
> indexed
> > > > >> fields, and keep the others as is.
> > > > >>
> > > > >> Right now, I am building the whole document, and replacing the
> > > existing
> > > > >> one by id.
> > > > >>
> > > > >> I am wondering if **atomic update feature** would faster the
> process.
> > > > >>
> > > > >> From one hand, using this feature would save network because only
> a
> > > > >> small subset of the document would be send from the client to the
> > > > >> server.
> > > > >> On the other hand, the server will have to collect the values
> from the
> > > > >> disk and reindex them. In addition, this implies to store the
> values
> > > for
> > > > >> every fields (I am not storing every fields) and use more space.
> > > > >>
> > > > >> Also I have read about the ConcurrentUpdateSolrServer class might
> be
> > > an
> > > > >> optimized way of updating documents.
> > > > >>
> > > > >> I am using spark-solr library to deal with solr-cloud. If
> something
> > > > >> exist to faster the process, I would be glad to implement it in
> that
> > > > >> library.
> > > > >> Also, I have split the collection over multiple shard, and I admit
> > > this
> > > > >> faster the update process, but who knows ?
> > > > >>
> > > > >> Thoughts ?
> > > > >>
> > > > >> --
> > > > >> nicolas
> > > > >>
> > > > >
> > > > > --
> > > > > nicolas
> > > >
> > >
> > > --
> > > nicolas
> > >
> >
> >
> > --
> > Regards,
> >
> > *Paras Lehana* [65871]
> > Software Programmer, Auto-Suggest,
> > IndiaMART Intermesh Ltd.
> >
> > 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > Noida, UP, IN - 201303
> >
> > Mob.: +91-9560911996
> > Work: 01203916600 | Extn:  *8173*
> >
> > --
> > IMPORTANT:
> > NEVER share your IndiaMART OTP/ Password with anyone.
>
> --
> nicolas
>


-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

-- 
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.
