Just to update: I kept the defaults. Indexing got only a small boost, but I have decided to continue with the defaults and do only incremental experiments. To my surprise, our development server had only 12GB of RAM, of which 8GB was allocated to Java. Since I could not increase the RAM, I tried decreasing the heap to 4GB instead, and my indexing speed improved by over *50x*. Erick, thanks for helping. I should also do more homework on GC; your GC guess seems to be valid. I have raised a request to increase the RAM on the development server to 24GB.
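For reference, this is roughly how I am resizing the heap and capturing GC logs while I do that homework. A minimal sketch, assuming the stock bin/solr start script on Java 9+; the log path is illustrative, not something from this thread:

    # Start Solr with a 4GB heap; -m sets both -Xms and -Xmx.
    bin/solr start -m 4g

    # Unified JVM GC logging (Java 9+) via SOLR_OPTS; adjust the path for your install.
    SOLR_OPTS="-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime" bin/solr start -m 4g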
On Mon, 9 Dec 2019 at 20:23, Erick Erickson <erickerick...@gmail.com> wrote:

> Note that that article is from 2011. That was in the Solr 3.x days, when
> many, many, many things were different; there was no SolrCloud, for
> instance. Plus, Tom's problem space is indexing _books_. Whole, complete
> books. That is actually not "normal" indexing at all, as most Solr
> indexes hold much smaller documents. Books are a perfectly reasonable
> use case, of course, but they have a whole bunch of special requirements.
>
> get-by-id should be very efficient, _except_ that the longer you wait
> before opening a new searcher, the larger the internal data buffers
> supporting get-by-id need to be.
>
> Anyway, best of luck
> Erick
>
>> On Dec 9, 2019, at 1:05 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>> Hi Erick,
>>
>> I have reverted to the original values and yes, I did see an improvement.
>> I will collect more stats. *Thank you for helping. :)*
>>
>> Also, here is the article I had referred to when changing the values:
>> https://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1
>>
>> The article was perhaps about normal indexing and thus suggested
>> increasing mergeFactor and then optimizing at the end. In my case, could
>> a large number of segments have impacted the get-by-id of atomic
>> updates? Just curious.
>>
>> On Fri, 6 Dec 2019 at 19:02, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>>> Hey Erick,
>>>
>>> We had just upgraded to 8.3 before starting the indexing. We were on
>>> 6.6 before that.
>>>
>>> Thank you for your continued support and resources. Again, I have
>>> already taken your suggestion to start afresh, and that's what I'm
>>> going to do. Don't get me wrong; I have just been asking questions. I
>>> will surely report back with my experience after performing the full
>>> indexing.
>>>
>>> Thanks again! :)
>>>
>>> On Fri, 6 Dec 2019 at 18:48, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>>> Nothing implicitly handles optimization; you must continue to do that
>>>> externally.
>>>>
>>>> Until you get to the bottom of your indexing slowdown, I wouldn't
>>>> bother with it at all. Trying to do all these things at once is what
>>>> led to your problem in the first place; please change one thing at a
>>>> time. You say:
>>>>
>>>> "For a full indexing, optimizations occurred 30 times between batches".
>>>>
>>>> This is horrible. I'm not sure what version of Solr you're using. If
>>>> it's 7.4 or earlier, this means the entire index was rewritten 30
>>>> times. The first time it would condense all segments into a single
>>>> segment, or 1/30 of the total. The second time it would rewrite all
>>>> that, 2/30 of the index, into a new segment. The third time, 3/30. And
>>>> so on.
>>>>
>>>> If Solr 7.5 or later, it wouldn't be as bad, assuming your index was
>>>> over 5G. But still.
>>>>
>>>> See:
>>>> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/ for 7.4 and earlier,
>>>> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/ for 7.5 and later
>>>>
>>>> Eventually you can optimize by sending an HTTP or curl request like this:
>>>> ../solr/collection/update?optimize=true
>>>>
>>>> You also changed to using StandardDirectory. The default has
>>>> heuristics built in to choose the best directory implementation.
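For concreteness, the optimize request Erick describes can be issued with curl. A minimal sketch only: "product" is the collection name from my indexing command later in this thread, and $port is whatever port your Solr listens on:

    # Explicit optimize of the "product" collection via the update handler.
    curl "http://localhost:$port/solr/product/update?optimize=true"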
>>>> I can't emphasize enough that you're changing lots of things at one
>>>> time. I _strongly_ urge you to go back to the standard setup, make
>>>> _no_ modifications, and change things one at a time. Some very bright
>>>> people have done a lot of work to make Lucene/Solr work well.
>>>>
>>>> Make one change at a time. Measure. If that change isn't helpful, undo
>>>> it and move to the next one. You're trying to second-guess the
>>>> Lucene/Solr developers, who have years of understanding of how this
>>>> all works. Assume they picked reasonable defaults and that Lucene/Solr
>>>> performs reasonably well. When I get inexplicably poor results, I
>>>> usually assume it was the last thing I changed....
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>> On Dec 6, 2019, at 1:31 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>>>
>>>>> Hi Erick,
>>>>>
>>>>> I believed optimizing explicitly merges segments, and that's why I
>>>>> was expecting it to give a performance boost. I know that
>>>>> optimizations should not be done very frequently. For a full
>>>>> indexing, optimizations occurred 30 times between batches. I take
>>>>> your suggestion to undo all the changes, and that's what I'm going to
>>>>> do. I mentioned the optimizations giving an indexing boost (for some
>>>>> time) only to support your point about my mergePolicy backfiring. I
>>>>> will certainly read about the merge process again.
>>>>>
>>>>> Taking your suggestions: so commits would be handled by autoCommit.
>>>>> What implicitly handles optimizations? I think the merge policy, or
>>>>> is there any other setting I'm missing?
>>>>>
>>>>> I'm indexing via curl on the same server. The current speed reported
>>>>> by curl is only 50k (down from 1300k in the first batch). I think
>>>>> that, as curl transmits the XML, the documents are being indexed;
>>>>> only then would the speed be so low. I don't think the whole XML is
>>>>> being held in memory; I remember I had to change the curl options to
>>>>> get rid of a transmission error for large files.
>>>>>
>>>>> This is my curl request:
>>>>>
>>>>> curl "http://localhost:$port/solr/product/update?commit=true" -T batch1.xml -X POST -H 'Content-type: text/xml'
>>>>>
>>>>> Although we have been doing this for ages, I think I should now
>>>>> consider using Solr's post tool (since the indexing files stay on the
>>>>> same server) or Solarium (we use PHP to build the XMLs).
>>>>>
>>>>> On Thu, 5 Dec 2019 at 20:00, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>
>>>>>>> I think I should have also done optimize between batches, no?
>>>>>>
>>>>>> No, no, no, no. Absolutely not. Never. Never, never, never between
>>>>>> batches. I don't recommend optimizing at _all_ unless there are
>>>>>> demonstrable improvements.
>>>>>>
>>>>>> Please don't take this the wrong way: the whole merge process is
>>>>>> really hard to get your head around. But the very fact that you'd
>>>>>> suggest optimizing between batches shows that the entire merge
>>>>>> process is opaque to you. I've seen many people just start changing
>>>>>> things and get themselves into a bad place, then try to change more
>>>>>> things to get out of that hole. Rinse. Repeat.
>>>>>>
>>>>>> I _strongly_ recommend that you undo all your changes.
>>>>>> Neither commit nor optimize from outside Solr. Set your autoCommit
>>>>>> settings to something like 5 minutes with openSearcher=true. Set all
>>>>>> autowarm counts in your caches in solrconfig.xml to 0, especially
>>>>>> filterCache and queryResultCache.
>>>>>>
>>>>>> Do not set soft commit at all; leave it at -1.
>>>>>>
>>>>>> I repeat: do _not_ commit or optimize from the client! Just let your
>>>>>> autocommit settings do the commits.
>>>>>>
>>>>>> It's also pushing things to send 5M docs in a single XML packet.
>>>>>> That all has to be held in memory and then indexed, adding to the
>>>>>> pressure on the heap. I usually index from SolrJ in batches of
>>>>>> 1,000. See:
>>>>>> https://lucidworks.com/post/indexing-with-solrj/
>>>>>>
>>>>>> Simply put, your slowdown should not be happening. I strongly
>>>>>> believe that it's something in your environment, most likely:
>>>>>> 1> your changes eventually shoot you in the foot, OR
>>>>>> 2> you are running with too little memory and eventually GC is
>>>>>> killing you. Really, analyze your GC logs. OR
>>>>>> 3> you are running on underpowered hardware which just can't take
>>>>>> the load, OR
>>>>>> 4> something else in your environment.
>>>>>>
>>>>>> I've never heard of a Solr installation with such a massive slowdown
>>>>>> during indexing that was fixed by tweaking things like the merge
>>>>>> policy etc.
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>>> On Dec 5, 2019, at 12:57 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>>>>>
>>>>>>> Hey Erick,
>>>>>>>
>>>>>>>> This is a huge red flag to me: "(but I could only test for the
>>>>>>>> first few thousand documents"
>>>>>>>
>>>>>>> Yup, that's probably where the culprit lies. I could only test the
>>>>>>> starting batch because I had to wait a day to actually compare. I
>>>>>>> tweaked the merge values and kept whatever gave a speed boost. My
>>>>>>> first batch of 5 million docs took only 40 minutes (atomic updates
>>>>>>> included) and the last batch of 5 million took more than 18 hours.
>>>>>>> If this is an issue of mergePolicy, I think I should have also done
>>>>>>> optimize between batches, no? I remember that when I indexed a
>>>>>>> single XML of 80 million docs, after optimizing the core already
>>>>>>> indexed with 30 XMLs of 5 million each, I could post the 80 million
>>>>>>> in just a day.
>>>>>>>
>>>>>>>> The indexing rate you're seeing is abysmal unless these are _huge_
>>>>>>>> documents
>>>>>>>
>>>>>>> Documents only contain the suggestion name, possible titles,
>>>>>>> phonetics/spellcheck/synonym fields, and numerical fields for
>>>>>>> boosting. They are far smaller than what a search document would
>>>>>>> contain. Auto-Suggest is only concerned with suggestions, so you
>>>>>>> can guess how simple the documents are.
>>>>>>>
>>>>>>>> Some data is held on the heap and some in the OS RAM due to
>>>>>>>> MMapDirectory
>>>>>>>
>>>>>>> I'm using StandardDirectory (which will make Solr choose the right
>>>>>>> implementation). I'm also planning to read more about these
>>>>>>> (looking forward to using MMap). Thanks for the article!
>>>>>>>
>>>>>>> You're right: I should change one thing at a time. Let me
>>>>>>> experiment, and then I will summarize here what I tried. Thank you
>>>>>>> for your responses. :)
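To make those commit settings concrete, here is a minimal solrconfig.xml sketch of the configuration Erick describes above; the values (5 minutes, openSearcher=true, soft commit disabled) are his, but the snippet itself is only an illustration, not a verified drop-in:

    <!-- Inside <updateHandler> in solrconfig.xml: hard commit every
         5 minutes, opening a new searcher, as suggested above. -->
    <autoCommit>
      <maxTime>300000</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>

    <!-- Leave soft commits disabled, per the advice above. -->
    <autoSoftCommit>
      <maxTime>-1</maxTime>
    </autoSoftCommit>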
>>>>>>> On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is a huge red flag to me: "(but I could only test for the
>>>>>>>> first few thousand documents"
>>>>>>>>
>>>>>>>> You're probably right that that would speed things up, but pretty
>>>>>>>> soon, when you're indexing your entire corpus, there are lots of
>>>>>>>> other considerations.
>>>>>>>>
>>>>>>>> The indexing rate you're seeing is abysmal unless these are _huge_
>>>>>>>> documents, but you indicate that at the start you're getting 1,400
>>>>>>>> docs/second, so I don't think the complexity of the docs is the
>>>>>>>> issue here.
>>>>>>>>
>>>>>>>> Do note that when we're throwing RAM figures around, we need to
>>>>>>>> draw a sharp distinction between Java heap and total RAM. Some
>>>>>>>> data is held on the heap and some in the OS RAM due to
>>>>>>>> MMapDirectory; see Uwe's excellent article:
>>>>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>>>>>
>>>>>>>> Uwe recommends about 25% of your available physical RAM be
>>>>>>>> allocated to Java as a starting point. Your particular Solr
>>>>>>>> installation may need a larger percentage, IDK.
>>>>>>>>
>>>>>>>> But basically, I'd go back to all default settings and change one
>>>>>>>> thing at a time. First, I'd look at GC performance. Is it taking
>>>>>>>> all your CPU? In which case you probably need to increase your
>>>>>>>> heap. I pick this first because it's very common that this is a
>>>>>>>> root cause.
>>>>>>>>
>>>>>>>> Next, I'd put a profiler on it to see exactly where I'm spending
>>>>>>>> time. Otherwise you wind up making random changes and hoping one
>>>>>>>> of them works.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>>
>>>>>>>>> On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>>>>>>>
>>>>>>>>> (but I could only test for the first few thousand documents
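On Erick's GC question, one quick, non-invasive way to check whether garbage collection is consuming CPU during indexing is jstat from the JDK. A sketch, with <solr_pid> standing in for Solr's JVM process id:

    # Print GC utilization and counters every 1000 ms.
    # Rapidly climbing FGC/FGCT (full GC count/time) while indexing
    # slows down points to heap pressure.
    jstat -gcutil <solr_pid> 1000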
--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn: *8173*