Hi Erick,

I have reverted to the original values and yes, I did see an improvement. I will collect more stats. *Thank you for helping. :)*
Also, here is the reference article I referred to when changing the values: https://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1

The article was perhaps about normal indexing and thus suggested increasing mergeFactor and then optimizing at the end. In my case, could a large number of segments have impacted the get-by-id of atomic updates? Just being curious.

On Fri, 6 Dec 2019 at 19:02, Paras Lehana <paras.leh...@indiamart.com> wrote:

> Hey Erick,
>
> We had just upgraded to 8.3 before starting the indexing. We were on 6.6 before that.
>
> Thank you for your continued support and resources. Again, I have already taken your suggestion to start afresh and that's what I'm going to do. Don't get me wrong, I have just been asking doubts. I will surely get back with my experience after performing the full indexing.
>
> Thanks again! :)
>
> On Fri, 6 Dec 2019 at 18:48, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Nothing implicitly handles optimization; you must continue to do that externally.
>>
>> Until you get to the bottom of your indexing slowdown, I wouldn't bother with it at all. Trying to do all these things at once is what led to your problem in the first place; please change one thing at a time. You say:
>>
>> "For a full indexing, optimizations occurred 30 times between batches".
>>
>> This is horrible. I'm not sure what version of Solr you're using. If it's 7.4 or earlier, this means the entire index was rewritten 30 times. The first time it would condense all segments into a single segment, i.e. 1/30 of the total. The second time it would rewrite all that, 2/30 of the index, into a new segment. The third time 3/30. And so on.
>>
>> If Solr 7.5 or later, it wouldn't be as bad, assuming your index was over 5G. But still.
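For concreteness, the explicit optimize being discussed is just an update request with `optimize=true`. A minimal sketch, in which the collection name `product` and port `8983` are assumptions to adjust for your install:

```shell
# Hedged sketch: trigger an explicit optimize over HTTP.
# Collection name and port are assumptions, not taken from this thread.
SOLR="http://localhost:8983/solr/product"

# optimize=true asks Lucene to merge segments down. On Solr 7.5+ the
# default TieredMergePolicy respects its max segment size (~5GB) unless
# you also pass maxSegments=1. The "|| echo" keeps the sketch safe to
# run when no Solr is listening.
curl -s "$SOLR/update?optimize=true" || echo "Solr not reachable at $SOLR"
```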
>>
>> See:
>> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/ for 7.4 and earlier,
>> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/ for 7.5 and later
>>
>> Eventually you can optimize by sending in an http or curl request like this:
>> ../solr/collection/update?optimize=true
>>
>> You also changed to using StandardDirectory. The default has heuristics built in to choose the best directory implementation.
>>
>> I can’t emphasize enough that you’re changing lots of things at one time. I _strongly_ urge you to go back to the standard setup, make _no_ modifications, and change things one at a time. Some very bright people have done a lot of work to try to make Lucene/Solr work well.
>>
>> Make one change at a time. Measure. If that change isn’t helpful, undo it and move to the next one. You’re trying to second-guess the Lucene/Solr developers who have years of understanding of how this all works. Assume they picked reasonable options for defaults and that Lucene/Solr performs reasonably well. When I get unexplainably poor results, I usually assume it was the last thing I changed….
>>
>> Best,
>> Erick
>>
>> > On Dec 6, 2019, at 1:31 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>> >
>> > Hi Erick,
>> >
>> > I believed optimizing explicitly merges segments and that’s why I was expecting it to give a performance boost. I know that optimizations should not be done very frequently. For a full indexing, optimizations occurred 30 times between batches. I take your suggestion to undo all the changes and that’s what I’m going to do. I mentioned the optimizations giving an indexing boost (for some time) only to support your point about my mergePolicy backfiring. I will certainly read again about the merge process.
>> >
>> > Taking your suggestions - so, commits would be handled by autoCommit.
>> > What implicitly handles optimizations? I think the merge policy, or is there any other setting I’m missing?
>> >
>> > I’m indexing via the curl API on the same server. The current speed of curl is only 50k (down from 1300k in the first batch). I think that, as curl is transmitting the XML, the documents are getting indexed - only then would the speed be so low. I don’t think the whole XML is taking up memory - I remember I had to change the curl options to get rid of the transmission error for large files.
>> >
>> > This is my curl request:
>> >
>> > curl 'http://localhost:$port/solr/product/update?commit=true' -T batch1.xml -X POST -H 'Content-type:text/xml'
>> >
>> > Although we have been doing this for ages, I think I should now consider using the Solr post service (since the indexing files stay on the same server) or using Solarium (we use PHP to make the XMLs).
>> >
>> > On Thu, 5 Dec 2019 at 20:00, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >>> I think I should have also done optimize between batches, no?
>> >>
>> >> No, no, no, no. Absolutely not. Never. Never, never, never between batches. I don’t recommend optimizing at _all_ unless there are demonstrable improvements.
>> >>
>> >> Please don’t take this the wrong way; the whole merge process is really hard to get your head around. But the very fact that you’d suggest optimizing between batches shows that the entire merge process is opaque to you. I’ve seen many people just start changing things and get themselves into a bad place, then try to change more things to get out of that hole. Rinse. Repeat.
>> >>
>> >> I _strongly_ recommend that you undo all your changes. Neither commit nor optimize from outside Solr. Set your autocommit settings to something like 5 minutes with openSearcher=true.
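The commit setup Erick recommends maps onto solrconfig.xml roughly as follows. The 5-minute figure is his; the element names are standard Solr update-handler config, but treat this as a sketch rather than a drop-in snippet:

```xml
<!-- Hard commit every 5 minutes (300000 ms); openSearcher=true makes the
     committed docs visible, so the client never needs to send commits. -->
<autoCommit>
  <maxTime>300000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

<!-- Soft commit left disabled (-1), matching the advice to not set
     soft commit at all. -->
<autoSoftCommit>
  <maxTime>-1</maxTime>
</autoSoftCommit>
```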
>> >> Set all autowarm counts in your caches in solrconfig.xml to 0, especially filterCache and queryResultCache.
>> >>
>> >> Do not set soft commit at all; leave it at -1.
>> >>
>> >> Repeat: do _not_ commit or optimize from the client! Just let your autocommit settings do the commits.
>> >>
>> >> It’s also pushing things to send 5M docs in a single XML packet. That all has to be held in memory and then indexed, adding to pressure on the heap. I usually index from SolrJ in batches of 1,000. See:
>> >> https://lucidworks.com/post/indexing-with-solrj/
>> >>
>> >> Simply put, your slowdown should not be happening. I strongly believe that it’s something in your environment, most likely:
>> >> 1> your changes eventually shoot you in the foot OR
>> >> 2> you are running in too little memory and eventually GC is killing you. Really, analyze your GC logs. OR
>> >> 3> you are running on underpowered hardware which just can’t take the load OR
>> >> 4> something else in your environment
>> >>
>> >> I’ve never heard of a Solr installation with such a massive slowdown during indexing that was fixed by tweaking things like the merge policy etc.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >>> On Dec 5, 2019, at 12:57 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>> >>>
>> >>> Hey Erick,
>> >>>
>> >>>> This is a huge red flag to me: "(but I could only test for the first few thousand documents”.
>> >>>
>> >>> Yup, that’s probably where the culprit lies. I could only test for the starting batch because I had to wait for a day to actually compare. I tweaked the merge values and kept whatever gave a speed boost. My first batch of 5 million docs took only 40 minutes (atomic updates included) and the last batch of 5 million took more than 18 hours.
>> >>> If this is an issue of mergePolicy, I think I should have also done optimize between batches, no? I remember that when I indexed a single XML of 80 million after optimizing the core already indexed with 30 XMLs of 5 million each, it took a whole day to post the 80 million.
>> >>>
>> >>>> The indexing rate you’re seeing is abysmal unless these are _huge_ documents
>> >>>
>> >>> Documents only contain the suggestion name, possible titles, phonetics/spellcheck/synonym fields, and numerical fields for boosting. They are far smaller than what a Search document would contain. Auto-Suggest is only concerned with suggestions, so you can guess how simple the documents would be.
>> >>>
>> >>>> Some data is held on the heap and some in the OS RAM due to MMapDirectory
>> >>>
>> >>> I’m using StandardDirectory (which will make Solr choose the right implementation). Also, I’m planning to read more about these (looking forward to using MMap). Thanks for the article!
>> >>>
>> >>> You’re right. I should change one thing at a time. Let me experiment and then I will summarize here what I tried. Thank you for your responses. :)
>> >>>
>> >>> On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerick...@gmail.com> wrote:
>> >>>
>> >>>> This is a huge red flag to me: "(but I could only test for the first few thousand documents”
>> >>>>
>> >>>> You’re probably right that that would speed things up, but pretty soon when you’re indexing your entire corpus there are lots of other considerations.
>> >>>>
>> >>>> The indexing rate you’re seeing is abysmal unless these are _huge_ documents, but you indicate that at the start you’re getting 1,400 docs/second, so I don’t think the complexity of the docs is the issue here.
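Since analyzing GC logs comes up repeatedly in this thread, one low-effort first check is to grep Solr's GC log for full collections. The path below is an assumption (recent Solr versions write a `solr_gc.log` under the logs directory by default); adjust it for your install:

```shell
# Hedged sketch: count full-GC pauses in Solr's GC log. With Java 9+
# unified logging these appear as lines containing "Pause Full (...)".
# The log path is an assumption about a default-ish install layout.
GC_LOG="server/logs/solr_gc.log"

if [ -f "$GC_LOG" ]; then
  # A steadily growing count during indexing points at heap pressure.
  grep -c "Pause Full" "$GC_LOG"
else
  echo "no GC log at $GC_LOG"
fi
```

A handful of full GCs over a long run is normal; frequent ones during a slowdown suggest the heap is too small, which matches Erick's point 2 above.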
>> >>>>
>> >>>> Do note that when we’re throwing RAM figures out, we need to draw a sharp distinction between Java heap and total RAM. Some data is held on the heap and some in the OS RAM due to MMapDirectory; see Uwe’s excellent article:
>> >>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> >>>>
>> >>>> Uwe recommends about 25% of your available physical RAM be allocated to Java as a starting point. Your particular Solr installation may need a larger percentage, IDK.
>> >>>>
>> >>>> But basically I’d go back to all default settings and change one thing at a time. First, I’d look at GC performance. Is it taking all your CPU? In that case you probably need to increase your heap. I pick this first because it’s very common that this is a root cause.
>> >>>>
>> >>>> Next, I’d put a profiler on it to see exactly where I’m spending time. Otherwise you wind up making random changes and hoping one of them works.
>> >>>>
>> >>>> Best,
>> >>>> Erick
>> >>>>
>> >>>>> On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>> >>>>>
>> >>>>> (but I could only test for the first few
>> >>>>> thousand documents
>> >>>>
>> >>>
>> >>> --
>> >>> Regards,
>> >>>
>> >>> *Paras Lehana* [65871]
>> >>> Development Engineer, Auto-Suggest,
>> >>> IndiaMART Intermesh Ltd.
>> >>>
>> >>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> >>> Noida, UP, IN - 201303
>> >>>
>> >>> Mob.: +91-9560911996
>> >>> Work: 01203916600 | Extn: *8173*
>> >>
>> >
>> > --
>> > Regards,
>> >
>> > *Paras Lehana* [65871]
>> > Development Engineer, Auto-Suggest,
>> > IndiaMART Intermesh Ltd.
>> >
>> > 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> > Noida, UP, IN - 201303
>> >
>> > Mob.: +91-9560911996
>> > Work: 01203916600 | Extn: *8173*
>>

--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn: *8173*