Hi Erick,

I have reverted to the original values and yes, I did see an improvement. I will collect more stats. *Thank you for helping. :)*
Also, here is the reference article I referred to when changing the values: https://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1

The article was perhaps about normal indexing and thus suggested increasing mergeFactor and then optimizing at the end. In my case, could a large number of segments have impacted the get-by-id of atomic updates? Just being curious.

On Fri, 6 Dec 2019 at 19:02, Paras Lehana <paras.leh...@indiamart.com> wrote:

> Hey Erick,
>
> We had just upgraded to 8.3 before starting the indexing. We were on 6.6 before that.
>
> Thank you for your continued support and resources. Again, I have already taken your suggestion to start afresh and that's what I'm going to do. Don't get me wrong, I have just been asking doubts. I will surely get back with my experience after performing the full indexing.
>
> Thanks again! :)
>
> On Fri, 6 Dec 2019 at 18:48, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Nothing implicitly handles optimization; you must continue to do that externally.
>>
>> Until you get to the bottom of your indexing slowdown, I wouldn't bother with it at all. Trying to do all these things at once is what led to your problem in the first place; please change one thing at a time. You say:
>>
>> "For a full indexing, optimizations occurred 30 times between batches".
>>
>> This is horrible. I'm not sure what version of Solr you're using. If it's 7.4 or earlier, this means the entire index was rewritten 30 times. The first time it would condense all segments into a single segment, i.e. 1/30 of the total. The second time it would rewrite all that, 2/30 of the index, into a new segment. The third time 3/30. And so on.
>>
>> If Solr 7.5 or later, it wouldn't be as bad, assuming your index was over 5G. But still.
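For concreteness, the explicit optimize being discussed is just an update request with `optimize=true`. A minimal sketch, in which the collection name `product` and port `8983` are assumptions to adjust for your install:

```shell
# Hedged sketch: trigger an explicit optimize over HTTP.
# Collection name and port are assumptions, not taken from this thread.
SOLR="http://localhost:8983/solr/product"

# optimize=true asks Lucene to merge segments down. On Solr 7.5+ the
# default TieredMergePolicy respects its max segment size (~5GB) unless
# you also pass maxSegments=1. The "|| echo" keeps the sketch safe to
# run when no Solr is listening.
curl -s "$SOLR/update?optimize=true" || echo "Solr not reachable at $SOLR"
```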
>>
>> See:
>> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/ for 7.4 and earlier,
>> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/ for 7.5 and later
>>
>> Eventually you can optimize by sending in an http or curl request like this:
>> ../solr/collection/update?optimize=true
>>
>> You also changed to using StandardDirectory. The default has heuristics built in to choose the best directory implementation.
>>
>> I can’t emphasize enough that you’re changing lots of things at one time. I _strongly_ urge you to go back to the standard setup, make _no_ modifications, and change things one at a time. Some very bright people have done a lot of work to try to make Lucene/Solr work well.
>>
>> Make one change at a time. Measure. If that change isn’t helpful, undo it and move to the next one. You’re trying to second-guess the Lucene/Solr developers who have years of understanding of how this all works. Assume they picked reasonable options for defaults and that Lucene/Solr performs reasonably well. When I get unexplainably poor results, I usually assume it was the last thing I changed….
>>
>> Best,
>> Erick
>>
>> > On Dec 6, 2019, at 1:31 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>> >
>> > Hi Erick,
>> >
>> > I believed optimizing explicitly merges segments and that’s why I was expecting it to give a performance boost. I know that optimizations should not be done very frequently. For a full indexing, optimizations occurred 30 times between batches. I take your suggestion to undo all the changes and that’s what I’m going to do. I mentioned the optimizations giving an indexing boost (for some time) only to support your point about my mergePolicy backfiring. I will certainly read again about the merge process.
>> >
>> > Taking your suggestions - so, commits would be handled by autoCommit.
>> > What implicitly handles optimizations? I think the merge policy, or is there any other setting I’m missing?
>> >
>> > I’m indexing via the curl API on the same server. The current speed of curl is only 50k (down from 1300k in the first batch). I think that, as curl is transmitting the XML, the documents are getting indexed - only then would the speed be so low. I don’t think the whole XML is taking up memory - I remember I had to change the curl options to get rid of the transmission error for large files.
>> >
>> > This is my curl request:
>> >
>> > curl 'http://localhost:$port/solr/product/update?commit=true' -T batch1.xml -X POST -H 'Content-type:text/xml'
>> >
>> > Although we have been doing this for ages, I think I should now consider using the Solr post service (since the indexing files stay on the same server) or using Solarium (we use PHP to make the XMLs).
>> >
>> > On Thu, 5 Dec 2019 at 20:00, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >>> I think I should have also done optimize between batches, no?
>> >>
>> >> No, no, no, no. Absolutely not. Never. Never, never, never between batches. I don’t recommend optimizing at _all_ unless there are demonstrable improvements.
>> >>
>> >> Please don’t take this the wrong way; the whole merge process is really hard to get your head around. But the very fact that you’d suggest optimizing between batches shows that the entire merge process is opaque to you. I’ve seen many people just start changing things and get themselves into a bad place, then try to change more things to get out of that hole. Rinse. Repeat.
>> >>
>> >> I _strongly_ recommend that you undo all your changes. Neither commit nor optimize from outside Solr. Set your autocommit settings to something like 5 minutes with openSearcher=true.
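The commit setup Erick recommends maps onto solrconfig.xml roughly as follows. The 5-minute figure is his; the element names are standard Solr update-handler config, but treat this as a sketch rather than a drop-in snippet:

```xml
<!-- Hard commit every 5 minutes (300000 ms); openSearcher=true makes the
     committed docs visible, so the client never needs to send commits. -->
<autoCommit>
  <maxTime>300000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

<!-- Soft commit left disabled (-1), matching the advice to not set
     soft commit at all. -->
<autoSoftCommit>
  <maxTime>-1</maxTime>
</autoSoftCommit>
```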
>> >> Set all autowarm counts in your caches in solrconfig.xml to 0, especially filterCache and queryResultCache.
>> >>
>> >> Do not set soft commit at all; leave it at -1.
>> >>
>> >> Repeat: do _not_ commit or optimize from the client! Just let your autocommit settings do the commits.
>> >>
>> >> It’s also pushing things to send 5M docs in a single XML packet. That all has to be held in memory and then indexed, adding to pressure on the heap. I usually index from SolrJ in batches of 1,000. See:
>> >> https://lucidworks.com/post/indexing-with-solrj/
>> >>
>> >> Simply put, your slowdown should not be happening. I strongly believe that it’s something in your environment, most likely:
>> >> 1> your changes eventually shoot you in the foot OR
>> >> 2> you are running in too little memory and eventually GC is killing you. Really, analyze your GC logs. OR
>> >> 3> you are running on underpowered hardware which just can’t take the load OR
>> >> 4> something else in your environment
>> >>
>> >> I’ve never heard of a Solr installation with such a massive slowdown during indexing that was fixed by tweaking things like the merge policy etc.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >>> On Dec 5, 2019, at 12:57 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>> >>>
>> >>> Hey Erick,
>> >>>
>> >>>> This is a huge red flag to me: "(but I could only test for the first few thousand documents”.
>> >>>
>> >>> Yup, that’s probably where the culprit lies. I could only test for the starting batch because I had to wait for a day to actually compare. I tweaked the merge values and kept whatever gave a speed boost. My first batch of 5 million docs took only 40 minutes (atomic updates included) and the last batch of 5 million took more than 18 hours.
>> >>> If this is an issue of mergePolicy, I think I should have also done optimize between batches, no? I remember that when I indexed a single XML of 80 million after optimizing the core already indexed with 30 XMLs of 5 million each, it took a whole day to post the 80 million.
>> >>>
>> >>>> The indexing rate you’re seeing is abysmal unless these are _huge_ documents
>> >>>
>> >>> Documents only contain the suggestion name, possible titles, phonetics/spellcheck/synonym fields, and numerical fields for boosting. They are far smaller than what a Search document would contain. Auto-Suggest is only concerned with suggestions, so you can guess how simple the documents would be.
>> >>>
>> >>>> Some data is held on the heap and some in the OS RAM due to MMapDirectory
>> >>>
>> >>> I’m using StandardDirectory (which will make Solr choose the right implementation). Also, I’m planning to read more about these (looking forward to using MMap). Thanks for the article!
>> >>>
>> >>> You’re right. I should change one thing at a time. Let me experiment and then I will summarize here what I tried. Thank you for your responses. :)
>> >>>
>> >>> On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerick...@gmail.com> wrote:
>> >>>
>> >>>> This is a huge red flag to me: "(but I could only test for the first few thousand documents”
>> >>>>
>> >>>> You’re probably right that that would speed things up, but pretty soon when you’re indexing your entire corpus there are lots of other considerations.
>> >>>>
>> >>>> The indexing rate you’re seeing is abysmal unless these are _huge_ documents, but you indicate that at the start you’re getting 1,400 docs/second, so I don’t think the complexity of the docs is the issue here.
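Since analyzing GC logs comes up repeatedly in this thread, one low-effort first check is to grep Solr's GC log for full collections. The path below is an assumption (recent Solr versions write a `solr_gc.log` under the logs directory by default); adjust it for your install:

```shell
# Hedged sketch: count full-GC pauses in Solr's GC log. With Java 9+
# unified logging these appear as lines containing "Pause Full (...)".
# The log path is an assumption about a default-ish install layout.
GC_LOG="server/logs/solr_gc.log"

if [ -f "$GC_LOG" ]; then
  # A steadily growing count during indexing points at heap pressure.
  grep -c "Pause Full" "$GC_LOG"
else
  echo "no GC log at $GC_LOG"
fi
```

A handful of full GCs over a long run is normal; frequent ones during a slowdown suggest the heap is too small, which matches Erick's point 2 above.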
>> >>>>
>> >>>> Do note that when we’re throwing RAM figures out, we need to draw a sharp distinction between Java heap and total RAM. Some data is held on the heap and some in the OS RAM due to MMapDirectory; see Uwe’s excellent article:
>> >>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> >>>>
>> >>>> Uwe recommends about 25% of your available physical RAM be allocated to Java as a starting point. Your particular Solr installation may need a larger percentage, IDK.
>> >>>>
>> >>>> But basically I’d go back to all default settings and change one thing at a time. First, I’d look at GC performance. Is it taking all your CPU? In that case you probably need to increase your heap. I pick this first because it’s very common that this is a root cause.
>> >>>>
>> >>>> Next, I’d put a profiler on it to see exactly where I’m spending time. Otherwise you wind up making random changes and hoping one of them works.
>> >>>>
>> >>>> Best,
>> >>>> Erick
>> >>>>
>> >>>>> On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>> >>>>>
>> >>>>> (but I could only test for the first few
>> >>>>> thousand documents
>> >>>>
>> >>>
>> >>> --
>> >>> Regards,
>> >>>
>> >>> *Paras Lehana* [65871]
>> >>> Development Engineer, Auto-Suggest,
>> >>> IndiaMART Intermesh Ltd.
>> >>>
>> >>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> >>> Noida, UP, IN - 201303
>> >>>
>> >>> Mob.: +91-9560911996
>> >>> Work: 01203916600 | Extn: *8173*
>> >>
>> >
>> > --
>> > Regards,
>> >
>> > *Paras Lehana* [65871]
>> > Development Engineer, Auto-Suggest,
>> > IndiaMART Intermesh Ltd.
>> >
>> > 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> > Noida, UP, IN - 201303
>> >
>> > Mob.: +91-9560911996
>> > Work: 01203916600 | Extn: *8173*
>>

--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn: *8173*