Hey Erick,

We upgraded to 8.3 just before starting the indexing. We were on 6.6 before that.
Thank you for your continued support and resources. Again, I have already taken your suggestion to start afresh, and that's what I'm going to do. Don't get me wrong; I have only been asking questions to clear my doubts. I will surely get back with my experience after performing the full indexing. Thanks again! :)

On Fri, 6 Dec 2019 at 18:48, Erick Erickson <erickerick...@gmail.com> wrote:

> Nothing implicitly handles optimization; you must continue to do that
> externally.
>
> Until you get to the bottom of your indexing slowdown, I wouldn't bother
> with it at all. Trying to do all these things at once is what led to your
> problem in the first place; please change one thing at a time. You say:
>
> "For a full indexing, optimizations occurred 30 times between batches".
>
> This is horrible. I'm not sure what version of Solr you're using. If it's
> 7.4 or earlier, this means the entire index was rewritten 30 times.
> The first time it would condense all segments into a single segment, or
> 1/30 of the total. The second time it would rewrite all that, 2/30 of the
> index, into a new segment. The third time 3/30. And so on.
>
> If Solr 7.5 or later, it wouldn't be as bad, assuming your index was over
> 5G. But still.
>
> See:
> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
> for 7.4 and earlier, and
> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> for 7.5 and later.
>
> Eventually you can optimize by sending an HTTP or curl request like this:
>
>   ../solr/collection/update?optimize=true
>
> You also changed to using StandardDirectory. The default has heuristics
> built in to choose the best directory implementation.
>
> I can't emphasize enough that you're changing lots of things at one time.
> I _strongly_ urge you to go back to the standard setup, make _no_
> modifications, and change things one at a time. Some very bright people
> have done a lot of work to try to make Lucene/Solr work well.
>
> Make one change at a time.
> Measure. If that change isn't helpful, undo it and
> move on to the next one. You're trying to second-guess the Lucene/Solr
> developers, who have years of understanding of how this all works. Assume
> they picked reasonable options for defaults and that Lucene/Solr performs
> reasonably well. When I get inexplicably poor results, I usually assume it
> was the last thing I changed....
>
> Best,
> Erick
>
> > On Dec 6, 2019, at 1:31 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >
> > Hi Erick,
> >
> > I believed optimizing explicitly merges segments, and that's why I was
> > expecting it to give a performance boost. I know that optimizations
> > should not be done very frequently. For a full indexing, optimizations
> > occurred 30 times between batches. I take your suggestion to undo all
> > the changes, and that's what I'm going to do. I mentioned the
> > optimizations giving an indexing boost (for some time) only to support
> > your point about my mergePolicy backfiring. I will certainly read again
> > about the merge process.
> >
> > Taking your suggestions - so, commits would be handled by autoCommit.
> > What implicitly handles optimizations? I think the merge policy, or is
> > there any other setting I'm missing?
> >
> > I'm indexing via curl on the same server. The current speed of curl is
> > only 50k (down from 1300k in the first batch). I think that as curl
> > transmits the XML, the documents are getting indexed; only then would
> > the speed be so low. I don't think the whole XML is taking up memory -
> > I remember I had to change the curl options to get rid of a
> > transmission error for large files.
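(An aside on configuration: the autoCommit-driven setup being discussed - Solr commits on its own schedule, the client never commits - might look roughly like this in solrconfig.xml. This is a sketch using the values Erick suggests; the cache classes and sizes shown are common defaults, not something from this thread, so adapt rather than paste.)

```xml
<!-- Inside <updateHandler>: hard commit every 5 minutes (300000 ms),
     opening a new searcher so indexed docs become visible. -->
<autoCommit>
  <maxTime>300000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

<!-- Leave soft commit disabled, as advised. -->
<autoSoftCommit>
  <maxTime>-1</maxTime>
</autoSoftCommit>

<!-- Inside <query>: autowarm counts at 0, especially these two caches.
     Classes/sizes here are illustrative defaults. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
```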
> > This is my curl request:
> >
> >   curl 'http://localhost:$port/solr/product/update?commit=true' -T
> >   batch1.xml -X POST -H 'Content-type:text/xml'
> >
> > Although we have been doing this for ages, I think I should now
> > consider using the Solr post service (since the indexing files stay on
> > the same server) or using Solarium (we use PHP to make the XMLs).
> >
> > On Thu, 5 Dec 2019 at 20:00, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >>> I think I should have also done optimize between batches, no?
> >>
> >> No, no, no, no. Absolutely not. Never. Never, never, never between
> >> batches. I don't recommend optimizing at _all_ unless there are
> >> demonstrable improvements.
> >>
> >> Please don't take this the wrong way; the whole merge process is really
> >> hard to get your head around. But the very fact that you'd suggest
> >> optimizing between batches shows that the entire merge process is
> >> opaque to you. I've seen many people just start changing things and
> >> get themselves into a bad place, then try to change more things to get
> >> out of that hole. Rinse. Repeat.
> >>
> >> I _strongly_ recommend that you undo all your changes. Neither
> >> commit nor optimize from outside Solr. Set your autocommit
> >> settings to something like 5 minutes with openSearcher=true.
> >> Set all autowarm counts in your caches in solrconfig.xml to 0,
> >> especially filterCache and queryResultCache.
> >>
> >> Do not set soft commit at all; leave it at -1.
> >>
> >> I repeat: do _not_ commit or optimize from the client! Just let your
> >> autocommit settings do the commits.
> >>
> >> It's also pushing things to send 5M docs in a single XML packet.
> >> That all has to be held in memory and then indexed, adding to
> >> pressure on the heap. I usually index from SolrJ in batches
> >> of 1,000. See:
> >> https://lucidworks.com/post/indexing-with-solrj/
> >>
> >> Simply put, your slowdown should not be happening.
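(The "batches of 1,000" point above is independent of client language. The chunking itself can be sketched in Python; `post_batch` here is a hypothetical stand-in for whatever actually sends documents to Solr, e.g. a SolrJ call or an HTTP POST, and is not something from this thread.)

```python
from itertools import islice


def batches(docs, size=1000):
    """Yield lists of at most `size` docs from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_all(docs, post_batch, size=1000):
    """Send docs to Solr in fixed-size batches instead of one huge payload.

    `post_batch` is a hypothetical callable doing the actual update
    request; commits are left entirely to Solr's autoCommit settings.
    Returns the number of documents handed off.
    """
    sent = 0
    for batch in batches(docs, size):
        post_batch(batch)
        sent += len(batch)
    return sent
```

With five million documents this would issue 5,000 requests of 1,000 docs each rather than one multi-gigabyte XML POST, which is the memory-pressure point Erick is making.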
> >> I strongly believe that it's something in your environment, most likely:
> >> 1> your changes eventually shoot you in the foot, OR
> >> 2> you are running in too little memory and eventually GC is killing
> >>    you. Really, analyze your GC logs. OR
> >> 3> you are running on underpowered hardware which just can't take the
> >>    load, OR
> >> 4> something else in your environment.
> >>
> >> I've never heard of a Solr installation with such a massive slowdown
> >> during indexing that was fixed by tweaking things like the merge
> >> policy etc.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Dec 5, 2019, at 12:57 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >>>
> >>> Hey Erick,
> >>>
> >>>> This is a huge red flag to me: "(but I could only test for the first
> >>>> few thousand documents".
> >>>
> >>> Yup, that's probably where the culprit lies. I could only test for the
> >>> starting batch because I had to wait a day to actually compare. I
> >>> tweaked the merge values and kept whatever gave a speed boost. My
> >>> first batch of 5 million docs took only 40 minutes (atomic updates
> >>> included), and the last batch of 5 million took more than 18 hours. If
> >>> this is an issue of mergePolicy, I think I should have also done
> >>> optimize between batches, no? I remember that when I indexed a single
> >>> XML of 80 million docs, after optimizing the core already indexed with
> >>> 30 XMLs of 5 million each, posting the 80 million took a whole day.
> >>>
> >>>> The indexing rate you're seeing is abysmal unless these are _huge_
> >>>> documents
> >>>
> >>> Documents only contain the suggestion name, possible titles,
> >>> phonetics/spellcheck/synonym fields, and numerical fields for
> >>> boosting. They are far smaller than what a search document would
> >>> contain. Auto-Suggest is only concerned with suggestions, so you can
> >>> guess how simple the documents would be.
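(For readers unfamiliar with the XML update format being posted via curl in this thread: a single document of the shape described above might look like the following. Every field name here is invented for illustration; the actual schema is not shown in the thread.)

```xml
<!-- Hypothetical Solr XML update payload for one Auto-Suggest document. -->
<add>
  <doc>
    <field name="suggestion">nut bolt</field>
    <field name="title">Nut Bolts</field>
    <field name="phonetic">NT BLT</field>
    <field name="popularity_boost">42</field>
  </doc>
</add>
```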
> >>>> Some data is held on the heap and some in the OS RAM due to
> >>>> MMapDirectory
> >>>
> >>> I'm using StandardDirectory (which will make Solr choose the right
> >>> implementation). I'm also planning to read more about these (looking
> >>> forward to using MMap). Thanks for the article!
> >>>
> >>> You're right: I should change one thing at a time. Let me experiment,
> >>> and then I will summarize here what I tried. Thank you for your
> >>> responses. :)
> >>>
> >>> On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerick...@gmail.com> wrote:
> >>>
> >>>> This is a huge red flag to me: "(but I could only test for the first
> >>>> few thousand documents"
> >>>>
> >>>> You're probably right that that would speed things up, but pretty
> >>>> soon, when you're indexing your entire corpus, there are lots of
> >>>> other considerations.
> >>>>
> >>>> The indexing rate you're seeing is abysmal unless these are _huge_
> >>>> documents, but you indicate that at the start you're getting 1,400
> >>>> docs/second, so I don't think the complexity of the docs is the
> >>>> issue here.
> >>>>
> >>>> Do note that when we're throwing RAM figures out, we need to draw a
> >>>> sharp distinction between Java heap and total RAM. Some data is held
> >>>> on the heap and some in the OS RAM due to MMapDirectory; see Uwe's
> >>>> excellent article:
> >>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >>>>
> >>>> Uwe recommends that about 25% of your available physical RAM be
> >>>> allocated to Java as a starting point. Your particular Solr
> >>>> installation may need a larger percentage, IDK.
> >>>>
> >>>> But basically, I'd go back to all default settings and change one
> >>>> thing at a time. First, I'd look at GC performance. Is it taking all
> >>>> your CPU? In that case you probably need to increase your heap.
> >>>> I pick this first because it's very common that this is a root cause.
> >>>>
> >>>> Next, I'd put a profiler on it to see exactly where I'm spending
> >>>> time. Otherwise you wind up making random changes and hoping one of
> >>>> them works.
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>>> On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >>>>>
> >>>>> (but I could only test for the first few
> >>>>> thousand documents

--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn: *8173*