Hey Erick,

> This is a huge red flag to me: "(but I could only test for the first few
> thousand documents"
Yup, that's probably where the culprit lies. I could only test on the first
batch because I had to wait a day before I could compare the full run. I
tweaked the merge values and kept whatever gave a speed boost. My first
batch of 5 million docs took only 40 minutes (atomic updates included),
while the last batch of 5 million took more than 18 hours. If this is a
mergePolicy issue, should I also have run optimize between batches? I
remember that when I indexed a single XML of 80 million documents after
optimizing the core that already held 30 XMLs of 5 million each, posting
the 80 million took only a day.

> The indexing rate you're seeing is abysmal unless these are _huge_
> documents

The documents only contain the suggestion name, possible titles,
phonetics/spellcheck/synonym fields, and numerical fields for boosting.
They are far smaller than what a search document would contain. Auto-Suggest
is only concerned with suggestions, so you can guess how simple the
documents are.

> Some data is held on the heap and some in the OS RAM due to MMapDirectory

I'm using StandardDirectoryFactory (which lets Solr choose the right
implementation). I'm also planning to read more about this (looking forward
to using MMap). Thanks for the article!

You're right, I should change one thing at a time. Let me experiment, and
then I will summarize here what I tried. Thank you for your responses. :)

On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerick...@gmail.com> wrote:

> This is a huge red flag to me: "(but I could only test for the first few
> thousand documents"
>
> You're probably right that that would speed things up, but pretty soon
> when you're indexing your entire corpus there are lots of other
> considerations.
>
> The indexing rate you're seeing is abysmal unless these are _huge_
> documents, but you indicate that at the start you're getting 1,400
> docs/second so I don't think the complexity of the docs is the issue here.
> Do note that when we're throwing RAM figures out, we need to draw a sharp
> distinction between Java heap and total RAM. Some data is held on the heap
> and some in the OS RAM due to MMapDirectory, see Uwe's excellent article:
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Uwe recommends about 25% of your available physical RAM be allocated to
> Java as a starting point. Your particular Solr installation may need a
> larger percent, IDK.
>
> But basically I'd go back to all default settings and change one thing at
> a time. First, I'd look at GC performance. Is it taking all your CPU? In
> which case you probably need to increase your heap. I pick this first
> because it's very common that this is a root cause.
>
> Next, I'd put a profiler on it to see exactly where I'm spending time.
> Otherwise you wind up making random changes and hoping one of them works.
>
> Best,
> Erick
>
> > On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >
> > (but I could only test for the first few
> > thousand documents

-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park,
Sector 142, Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn: *8173*
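P.S. A minimal sketch of the "commit/optimize between batches" idea
discussed above, in case it helps anyone reading the archive. This only
builds the Solr /update request URLs; the core name "autosuggest" and the
localhost host/port are my assumptions, not something from this thread.

```python
# Hypothetical sketch: build Solr /update URLs for committing and
# (optionally) force-merging between indexing batches. The core name
# "autosuggest" and localhost:8983 are assumptions for illustration.

def update_url(core, host="localhost", port=8983, **params):
    """Build a Solr /update request URL with the given query parameters."""
    query = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    return f"http://{host}:{port}/solr/{core}/update?{query}"

# Commit after a batch so new segments are flushed and searchable.
commit_url = update_url("autosuggest", commit="true")

# Optionally force-merge ("optimize") down to one segment between batches.
# Note this is expensive and rewrites the whole index, so it is usually
# something to measure, not to run blindly.
optimize_url = update_url("autosuggest", maxSegments=1, optimize="true")
```

Fire these with curl or any HTTP client after each batch finishes, and
compare per-batch timings with and without the optimize step.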
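P.P.S. And a tiny sketch of the ~25% heap guideline from Uwe's article that
Erick cites above, just to make the heap-vs-OS-RAM split concrete (the
function name and the 25% default are only a starting point, per the
article, not a fixed rule):

```python
def suggested_heap_gb(physical_ram_gb, fraction=0.25):
    """Starting-point Java heap size per the ~25%-of-RAM guideline.

    The remaining RAM is deliberately left to the OS page cache, which is
    what MMapDirectory relies on for fast index access.
    """
    return physical_ram_gb * fraction

# e.g. a 32 GB machine -> start with roughly an 8 GB heap (-Xmx8g)
# and tune from there while watching GC behaviour.
```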