Just to update: I kept the defaults. Indexing got only a small boost, but I have decided to continue with the defaults and do only incremental experiments. To my surprise, our development server had only 12GB of RAM, of which 8GB was allocated to Java. Since I could not increase the RAM, I tried decreasing the heap to 4GB instead, and my indexing speed improved by over *50x*. Erick, thanks for helping. I should also do more homework on GC; your GC guess seems to be valid. I have raised a request to increase the RAM on the development server to 24GB.
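For reference, this is roughly how I am resizing the heap and capturing GC logs while I do that homework. A minimal sketch, assuming the stock bin/solr start script on Java 9+; the log path is illustrative, not something from this thread:

    # Start Solr with a 4GB heap; -m sets both -Xms and -Xmx.
    bin/solr start -m 4g

    # Unified JVM GC logging (Java 9+) via SOLR_OPTS; adjust the path for your install.
    SOLR_OPTS="-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime" bin/solr start -m 4g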
On Mon, 9 Dec 2019 at 20:23, Erick Erickson <erickerick...@gmail.com> wrote:

> Note that that article is from 2011. That was in the Solr 3.x days, when
> many, many, many things were different; there was no SolrCloud, for
> instance. Plus, Tom's problem space is indexing _books_. Whole, complete
> books. That is actually not "normal" indexing at all, as most Solr
> indexes hold much smaller documents. Books are a perfectly reasonable
> use case, of course, but they have a whole bunch of special requirements.
>
> get-by-id should be very efficient, _except_ that the longer you wait
> before opening a new searcher, the larger the internal data buffers
> supporting get-by-id need to be.
>
> Anyway, best of luck
> Erick
>
>> On Dec 9, 2019, at 1:05 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>> Hi Erick,
>>
>> I have reverted to the original values and yes, I did see an improvement.
>> I will collect more stats. *Thank you for helping. :)*
>>
>> Also, here is the article I had referred to when changing the values:
>> https://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1
>>
>> The article was perhaps about normal indexing and thus suggested
>> increasing mergeFactor and then optimizing at the end. In my case, could
>> a large number of segments have impacted the get-by-id of atomic
>> updates? Just curious.
>>
>> On Fri, 6 Dec 2019 at 19:02, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>>> Hey Erick,
>>>
>>> We had just upgraded to 8.3 before starting the indexing. We were on
>>> 6.6 before that.
>>>
>>> Thank you for your continued support and resources. Again, I have
>>> already taken your suggestion to start afresh, and that's what I'm
>>> going to do. Don't get me wrong; I have just been asking questions. I
>>> will surely report back with my experience after performing the full
>>> indexing.
>>>
>>> Thanks again! :)
>>>
>>> On Fri, 6 Dec 2019 at 18:48, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>>> Nothing implicitly handles optimization; you must continue to do that
>>>> externally.
>>>>
>>>> Until you get to the bottom of your indexing slowdown, I wouldn't
>>>> bother with it at all. Trying to do all these things at once is what
>>>> led to your problem in the first place; please change one thing at a
>>>> time. You say:
>>>>
>>>> "For a full indexing, optimizations occurred 30 times between batches".
>>>>
>>>> This is horrible. I'm not sure what version of Solr you're using. If
>>>> it's 7.4 or earlier, this means the entire index was rewritten 30
>>>> times. The first time it would condense all segments into a single
>>>> segment, or 1/30 of the total. The second time it would rewrite all
>>>> that, 2/30 of the index, into a new segment. The third time, 3/30. And
>>>> so on.
>>>>
>>>> If Solr 7.5 or later, it wouldn't be as bad, assuming your index was
>>>> over 5G. But still.
>>>>
>>>> See:
>>>> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/ for 7.4 and earlier,
>>>> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/ for 7.5 and later
>>>>
>>>> Eventually you can optimize by sending an HTTP or curl request like this:
>>>> ../solr/collection/update?optimize=true
>>>>
>>>> You also changed to using StandardDirectory. The default has
>>>> heuristics built in to choose the best directory implementation.
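For concreteness, the optimize request Erick describes can be issued with curl. A minimal sketch only: "product" is the collection name from my indexing command later in this thread, and $port is whatever port your Solr listens on:

    # Explicit optimize of the "product" collection via the update handler.
    curl "http://localhost:$port/solr/product/update?optimize=true"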
>>>> I can't emphasize enough that you're changing lots of things at one
>>>> time. I _strongly_ urge you to go back to the standard setup, make
>>>> _no_ modifications, and change things one at a time. Some very bright
>>>> people have done a lot of work to make Lucene/Solr work well.
>>>>
>>>> Make one change at a time. Measure. If that change isn't helpful, undo
>>>> it and move to the next one. You're trying to second-guess the
>>>> Lucene/Solr developers, who have years of understanding of how this
>>>> all works. Assume they picked reasonable defaults and that Lucene/Solr
>>>> performs reasonably well. When I get inexplicably poor results, I
>>>> usually assume it was the last thing I changed....
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>> On Dec 6, 2019, at 1:31 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>>>
>>>>> Hi Erick,
>>>>>
>>>>> I believed optimizing explicitly merges segments, and that's why I
>>>>> was expecting it to give a performance boost. I know that
>>>>> optimizations should not be done very frequently. For a full
>>>>> indexing, optimizations occurred 30 times between batches. I take
>>>>> your suggestion to undo all the changes, and that's what I'm going to
>>>>> do. I mentioned the optimizations giving an indexing boost (for some
>>>>> time) only to support your point about my mergePolicy backfiring. I
>>>>> will certainly read about the merge process again.
>>>>>
>>>>> Taking your suggestions: so commits would be handled by autoCommit.
>>>>> What implicitly handles optimizations? I think the merge policy, or
>>>>> is there any other setting I'm missing?
>>>>>
>>>>> I'm indexing via curl on the same server. The current speed reported
>>>>> by curl is only 50k (down from 1300k in the first batch). I think
>>>>> that, as curl transmits the XML, the documents are being indexed;
>>>>> only then would the speed be so low. I don't think the whole XML is
>>>>> being held in memory; I remember I had to change the curl options to
>>>>> get rid of a transmission error for large files.
>>>>>
>>>>> This is my curl request:
>>>>>
>>>>> curl "http://localhost:$port/solr/product/update?commit=true" -T batch1.xml -X POST -H 'Content-type: text/xml'
>>>>>
>>>>> Although we have been doing this for ages, I think I should now
>>>>> consider using Solr's post tool (since the indexing files stay on the
>>>>> same server) or Solarium (we use PHP to build the XMLs).
>>>>>
>>>>> On Thu, 5 Dec 2019 at 20:00, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>
>>>>>>> I think I should have also done optimize between batches, no?
>>>>>>
>>>>>> No, no, no, no. Absolutely not. Never. Never, never, never between
>>>>>> batches. I don't recommend optimizing at _all_ unless there are
>>>>>> demonstrable improvements.
>>>>>>
>>>>>> Please don't take this the wrong way: the whole merge process is
>>>>>> really hard to get your head around. But the very fact that you'd
>>>>>> suggest optimizing between batches shows that the entire merge
>>>>>> process is opaque to you. I've seen many people just start changing
>>>>>> things and get themselves into a bad place, then try to change more
>>>>>> things to get out of that hole. Rinse. Repeat.
>>>>>>
>>>>>> I _strongly_ recommend that you undo all your changes.
>>>>>> Neither commit nor optimize from outside Solr. Set your autoCommit
>>>>>> settings to something like 5 minutes with openSearcher=true. Set all
>>>>>> autowarm counts in your caches in solrconfig.xml to 0, especially
>>>>>> filterCache and queryResultCache.
>>>>>>
>>>>>> Do not set soft commit at all; leave it at -1.
>>>>>>
>>>>>> I repeat: do _not_ commit or optimize from the client! Just let your
>>>>>> autocommit settings do the commits.
>>>>>>
>>>>>> It's also pushing things to send 5M docs in a single XML packet.
>>>>>> That all has to be held in memory and then indexed, adding to the
>>>>>> pressure on the heap. I usually index from SolrJ in batches of
>>>>>> 1,000. See:
>>>>>> https://lucidworks.com/post/indexing-with-solrj/
>>>>>>
>>>>>> Simply put, your slowdown should not be happening. I strongly
>>>>>> believe that it's something in your environment, most likely:
>>>>>> 1> your changes eventually shoot you in the foot, OR
>>>>>> 2> you are running with too little memory and eventually GC is
>>>>>> killing you. Really, analyze your GC logs. OR
>>>>>> 3> you are running on underpowered hardware which just can't take
>>>>>> the load, OR
>>>>>> 4> something else in your environment.
>>>>>>
>>>>>> I've never heard of a Solr installation with such a massive slowdown
>>>>>> during indexing that was fixed by tweaking things like the merge
>>>>>> policy etc.
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>>> On Dec 5, 2019, at 12:57 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>>>>>
>>>>>>> Hey Erick,
>>>>>>>
>>>>>>>> This is a huge red flag to me: "(but I could only test for the
>>>>>>>> first few thousand documents"
>>>>>>>
>>>>>>> Yup, that's probably where the culprit lies. I could only test the
>>>>>>> starting batch because I had to wait a day to actually compare. I
>>>>>>> tweaked the merge values and kept whatever gave a speed boost. My
>>>>>>> first batch of 5 million docs took only 40 minutes (atomic updates
>>>>>>> included) and the last batch of 5 million took more than 18 hours.
>>>>>>> If this is an issue of mergePolicy, I think I should have also done
>>>>>>> optimize between batches, no? I remember that when I indexed a
>>>>>>> single XML of 80 million docs, after optimizing the core already
>>>>>>> indexed with 30 XMLs of 5 million each, I could post the 80 million
>>>>>>> in just a day.
>>>>>>>
>>>>>>>> The indexing rate you're seeing is abysmal unless these are _huge_
>>>>>>>> documents
>>>>>>>
>>>>>>> Documents only contain the suggestion name, possible titles,
>>>>>>> phonetics/spellcheck/synonym fields, and numerical fields for
>>>>>>> boosting. They are far smaller than what a search document would
>>>>>>> contain. Auto-Suggest is only concerned with suggestions, so you
>>>>>>> can guess how simple the documents are.
>>>>>>>
>>>>>>>> Some data is held on the heap and some in the OS RAM due to
>>>>>>>> MMapDirectory
>>>>>>>
>>>>>>> I'm using StandardDirectory (which will make Solr choose the right
>>>>>>> implementation). I'm also planning to read more about these
>>>>>>> (looking forward to using MMap). Thanks for the article!
>>>>>>>
>>>>>>> You're right: I should change one thing at a time. Let me
>>>>>>> experiment, and then I will summarize here what I tried. Thank you
>>>>>>> for your responses. :)
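To make those commit settings concrete, here is a minimal solrconfig.xml sketch of the configuration Erick describes above; the values (5 minutes, openSearcher=true, soft commit disabled) are his, but the snippet itself is only an illustration, not a verified drop-in:

    <!-- Inside <updateHandler> in solrconfig.xml: hard commit every
         5 minutes, opening a new searcher, as suggested above. -->
    <autoCommit>
      <maxTime>300000</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>

    <!-- Leave soft commits disabled, per the advice above. -->
    <autoSoftCommit>
      <maxTime>-1</maxTime>
    </autoSoftCommit>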
>>>>>>> On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is a huge red flag to me: "(but I could only test for the
>>>>>>>> first few thousand documents"
>>>>>>>>
>>>>>>>> You're probably right that that would speed things up, but pretty
>>>>>>>> soon, when you're indexing your entire corpus, there are lots of
>>>>>>>> other considerations.
>>>>>>>>
>>>>>>>> The indexing rate you're seeing is abysmal unless these are _huge_
>>>>>>>> documents, but you indicate that at the start you're getting 1,400
>>>>>>>> docs/second, so I don't think the complexity of the docs is the
>>>>>>>> issue here.
>>>>>>>>
>>>>>>>> Do note that when we're throwing RAM figures around, we need to
>>>>>>>> draw a sharp distinction between Java heap and total RAM. Some
>>>>>>>> data is held on the heap and some in the OS RAM due to
>>>>>>>> MMapDirectory; see Uwe's excellent article:
>>>>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>>>>>
>>>>>>>> Uwe recommends about 25% of your available physical RAM be
>>>>>>>> allocated to Java as a starting point. Your particular Solr
>>>>>>>> installation may need a larger percentage, IDK.
>>>>>>>>
>>>>>>>> But basically, I'd go back to all default settings and change one
>>>>>>>> thing at a time. First, I'd look at GC performance. Is it taking
>>>>>>>> all your CPU? In which case you probably need to increase your
>>>>>>>> heap. I pick this first because it's very common that this is a
>>>>>>>> root cause.
>>>>>>>>
>>>>>>>> Next, I'd put a profiler on it to see exactly where I'm spending
>>>>>>>> time. Otherwise you wind up making random changes and hoping one
>>>>>>>> of them works.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>>
>>>>>>>>> On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>>>>>>>
>>>>>>>>> (but I could only test for the first few thousand documents
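On Erick's GC question, one quick, non-invasive way to check whether garbage collection is consuming CPU during indexing is jstat from the JDK. A sketch, with <solr_pid> standing in for Solr's JVM process id:

    # Print GC utilization and counters every 1000 ms.
    # Rapidly climbing FGC/FGCT (full GC count/time) while indexing
    # slows down points to heap pressure.
    jstat -gcutil <solr_pid> 1000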
--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn: *8173*