Hey Erick,

We upgraded to 8.3 just before starting the indexing. We were on 6.6 before that.
Thank you for your continued support and resources. Again, I have already taken your suggestion to start afresh, and that's what I'm going to do. Don't get me wrong; I have only been asking questions to clear my doubts. I will surely get back with my experience after performing the full indexing. Thanks again! :)

On Fri, 6 Dec 2019 at 18:48, Erick Erickson <erickerick...@gmail.com> wrote:

> Nothing implicitly handles optimization; you must continue to do that
> externally.
>
> Until you get to the bottom of your indexing slowdown, I wouldn't bother
> with it at all. Trying to do all these things at once is what led to your
> problem in the first place; please change one thing at a time. You say:
>
> "For a full indexing, optimizations occurred 30 times between batches".
>
> This is horrible. I'm not sure what version of Solr you're using. If it's
> 7.4 or earlier, this means the entire index was rewritten 30 times.
> The first time it would condense all segments into a single segment, or
> 1/30 of the total. The second time it would rewrite all that, 2/30 of the
> index, into a new segment. The third time 3/30. And so on.
>
> If Solr 7.5 or later, it wouldn't be as bad, assuming your index was over
> 5G. But still.
>
> See:
> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
> for 7.4 and earlier, and
> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> for 7.5 and later.
>
> Eventually you can optimize by sending an HTTP or curl request like this:
>
>   ../solr/collection/update?optimize=true
>
> You also changed to using StandardDirectory. The default has heuristics
> built in to choose the best directory implementation.
>
> I can't emphasize enough that you're changing lots of things at one time.
> I _strongly_ urge you to go back to the standard setup, make _no_
> modifications, and change things one at a time. Some very bright people
> have done a lot of work to try to make Lucene/Solr work well.
>
> Make one change at a time.
> Measure. If that change isn't helpful, undo it and
> move on to the next one. You're trying to second-guess the Lucene/Solr
> developers, who have years of understanding of how this all works. Assume
> they picked reasonable options for defaults and that Lucene/Solr performs
> reasonably well. When I get inexplicably poor results, I usually assume it
> was the last thing I changed....
>
> Best,
> Erick
>
> > On Dec 6, 2019, at 1:31 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >
> > Hi Erick,
> >
> > I believed optimizing explicitly merges segments, and that's why I was
> > expecting it to give a performance boost. I know that optimizations
> > should not be done very frequently. For a full indexing, optimizations
> > occurred 30 times between batches. I take your suggestion to undo all
> > the changes, and that's what I'm going to do. I mentioned the
> > optimizations giving an indexing boost (for some time) only to support
> > your point about my mergePolicy backfiring. I will certainly read again
> > about the merge process.
> >
> > Taking your suggestions - so, commits would be handled by autoCommit.
> > What implicitly handles optimizations? I think the merge policy, or is
> > there any other setting I'm missing?
> >
> > I'm indexing via curl on the same server. The current speed of curl is
> > only 50k (down from 1300k in the first batch). I think that as curl
> > transmits the XML, the documents are getting indexed; only then would
> > the speed be so low. I don't think the whole XML is taking up memory -
> > I remember I had to change the curl options to get rid of a
> > transmission error for large files.
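(An aside on configuration: the autoCommit-driven setup being discussed - Solr commits on its own schedule, the client never commits - might look roughly like this in solrconfig.xml. This is a sketch using the values Erick suggests; the cache classes and sizes shown are common defaults, not something from this thread, so adapt rather than paste.)

```xml
<!-- Inside <updateHandler>: hard commit every 5 minutes (300000 ms),
     opening a new searcher so indexed docs become visible. -->
<autoCommit>
  <maxTime>300000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

<!-- Leave soft commit disabled, as advised. -->
<autoSoftCommit>
  <maxTime>-1</maxTime>
</autoSoftCommit>

<!-- Inside <query>: autowarm counts at 0, especially these two caches.
     Classes/sizes here are illustrative defaults. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
```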
> > This is my curl request:
> >
> >   curl 'http://localhost:$port/solr/product/update?commit=true' -T
> >   batch1.xml -X POST -H 'Content-type:text/xml'
> >
> > Although we have been doing this for ages, I think I should now
> > consider using the Solr post service (since the indexing files stay on
> > the same server) or using Solarium (we use PHP to make the XMLs).
> >
> > On Thu, 5 Dec 2019 at 20:00, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >>> I think I should have also done optimize between batches, no?
> >>
> >> No, no, no, no. Absolutely not. Never. Never, never, never between
> >> batches. I don't recommend optimizing at _all_ unless there are
> >> demonstrable improvements.
> >>
> >> Please don't take this the wrong way; the whole merge process is really
> >> hard to get your head around. But the very fact that you'd suggest
> >> optimizing between batches shows that the entire merge process is
> >> opaque to you. I've seen many people just start changing things and
> >> get themselves into a bad place, then try to change more things to get
> >> out of that hole. Rinse. Repeat.
> >>
> >> I _strongly_ recommend that you undo all your changes. Neither
> >> commit nor optimize from outside Solr. Set your autocommit
> >> settings to something like 5 minutes with openSearcher=true.
> >> Set all autowarm counts in your caches in solrconfig.xml to 0,
> >> especially filterCache and queryResultCache.
> >>
> >> Do not set soft commit at all; leave it at -1.
> >>
> >> I repeat: do _not_ commit or optimize from the client! Just let your
> >> autocommit settings do the commits.
> >>
> >> It's also pushing things to send 5M docs in a single XML packet.
> >> That all has to be held in memory and then indexed, adding to
> >> pressure on the heap. I usually index from SolrJ in batches
> >> of 1,000. See:
> >> https://lucidworks.com/post/indexing-with-solrj/
> >>
> >> Simply put, your slowdown should not be happening.
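(The "batches of 1,000" point above is independent of client language. The chunking itself can be sketched in Python; `post_batch` here is a hypothetical stand-in for whatever actually sends documents to Solr, e.g. a SolrJ call or an HTTP POST, and is not something from this thread.)

```python
from itertools import islice


def batches(docs, size=1000):
    """Yield lists of at most `size` docs from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_all(docs, post_batch, size=1000):
    """Send docs to Solr in fixed-size batches instead of one huge payload.

    `post_batch` is a hypothetical callable doing the actual update
    request; commits are left entirely to Solr's autoCommit settings.
    Returns the number of documents handed off.
    """
    sent = 0
    for batch in batches(docs, size):
        post_batch(batch)
        sent += len(batch)
    return sent
```

With five million documents this would issue 5,000 requests of 1,000 docs each rather than one multi-gigabyte XML POST, which is the memory-pressure point Erick is making.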
> >> I strongly believe that it's something in your environment, most likely:
> >> 1> your changes eventually shoot you in the foot, OR
> >> 2> you are running in too little memory and eventually GC is killing
> >>    you. Really, analyze your GC logs. OR
> >> 3> you are running on underpowered hardware which just can't take the
> >>    load, OR
> >> 4> something else in your environment.
> >>
> >> I've never heard of a Solr installation with such a massive slowdown
> >> during indexing that was fixed by tweaking things like the merge
> >> policy etc.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Dec 5, 2019, at 12:57 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >>>
> >>> Hey Erick,
> >>>
> >>>> This is a huge red flag to me: "(but I could only test for the first
> >>>> few thousand documents".
> >>>
> >>> Yup, that's probably where the culprit lies. I could only test for the
> >>> starting batch because I had to wait a day to actually compare. I
> >>> tweaked the merge values and kept whatever gave a speed boost. My
> >>> first batch of 5 million docs took only 40 minutes (atomic updates
> >>> included), and the last batch of 5 million took more than 18 hours. If
> >>> this is an issue of mergePolicy, I think I should have also done
> >>> optimize between batches, no? I remember that when I indexed a single
> >>> XML of 80 million docs, after optimizing the core already indexed with
> >>> 30 XMLs of 5 million each, posting the 80 million took a whole day.
> >>>
> >>>> The indexing rate you're seeing is abysmal unless these are _huge_
> >>>> documents
> >>>
> >>> Documents only contain the suggestion name, possible titles,
> >>> phonetics/spellcheck/synonym fields, and numerical fields for
> >>> boosting. They are far smaller than what a search document would
> >>> contain. Auto-Suggest is only concerned with suggestions, so you can
> >>> guess how simple the documents would be.
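(For readers unfamiliar with the XML update format being posted via curl in this thread: a single document of the shape described above might look like the following. Every field name here is invented for illustration; the actual schema is not shown in the thread.)

```xml
<!-- Hypothetical Solr XML update payload for one Auto-Suggest document. -->
<add>
  <doc>
    <field name="suggestion">nut bolt</field>
    <field name="title">Nut Bolts</field>
    <field name="phonetic">NT BLT</field>
    <field name="popularity_boost">42</field>
  </doc>
</add>
```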
> >>>> Some data is held on the heap and some in the OS RAM due to
> >>>> MMapDirectory
> >>>
> >>> I'm using StandardDirectory (which will make Solr choose the right
> >>> implementation). I'm also planning to read more about these (looking
> >>> forward to using MMap). Thanks for the article!
> >>>
> >>> You're right: I should change one thing at a time. Let me experiment,
> >>> and then I will summarize here what I tried. Thank you for your
> >>> responses. :)
> >>>
> >>> On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerick...@gmail.com> wrote:
> >>>
> >>>> This is a huge red flag to me: "(but I could only test for the first
> >>>> few thousand documents"
> >>>>
> >>>> You're probably right that that would speed things up, but pretty
> >>>> soon, when you're indexing your entire corpus, there are lots of
> >>>> other considerations.
> >>>>
> >>>> The indexing rate you're seeing is abysmal unless these are _huge_
> >>>> documents, but you indicate that at the start you're getting 1,400
> >>>> docs/second, so I don't think the complexity of the docs is the
> >>>> issue here.
> >>>>
> >>>> Do note that when we're throwing RAM figures out, we need to draw a
> >>>> sharp distinction between Java heap and total RAM. Some data is held
> >>>> on the heap and some in the OS RAM due to MMapDirectory; see Uwe's
> >>>> excellent article:
> >>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >>>>
> >>>> Uwe recommends that about 25% of your available physical RAM be
> >>>> allocated to Java as a starting point. Your particular Solr
> >>>> installation may need a larger percentage, IDK.
> >>>>
> >>>> But basically, I'd go back to all default settings and change one
> >>>> thing at a time. First, I'd look at GC performance. Is it taking all
> >>>> your CPU? In that case you probably need to increase your heap.
> >>>> I pick this first because it's very common that this is a root cause.
> >>>>
> >>>> Next, I'd put a profiler on it to see exactly where I'm spending
> >>>> time. Otherwise you wind up making random changes and hoping one of
> >>>> them works.
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>>> On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >>>>>
> >>>>> (but I could only test for the first few
> >>>>> thousand documents

--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn: *8173*