On 4/13/2018 10:11 AM, Jesus Olivan wrote:
> we're trying to launch a full import of approx. 375 million docs from a
> MySQL database to our SolrCloud cluster. Until now, this full import
> process takes around 24 to 27 hours to finish due to a huge import query
> (several group bys, left joins, etc), but after another import query
> modification (adding more complexity), we're unable to execute this full
> import from MySQL.
>
> We've done some research about migrating to PostgreSQL, but this option is
> not a real option at this time, because it implies a big refactoring from
> several dev teams.
>
> Are there any alternative ways to successfully perform this full import
> process?

DIH is a capable tool, and for what it does, it's remarkably efficient. It can't really be made any faster, because it's single threaded.

To get increased index speed with Solr, you must index documents from several sources/processes/threads at the same time. Writing custom software that can retrieve information from your source, build the documents you require, and send several update requests simultaneously will yield the best results. The source itself may be a bottleneck though -- this is frequently the case, and Solr is often MUCH faster than the information source.

You said that you're unable to execute an updated import from MySQL. What exactly happens when you try? Are there any errors in your solr logfile?

I'm not going to debate whether MySQL or PostgreSQL is the better solution. For my indexes, my source data is in MySQL. It works well, but full rebuilds using DIH are slower than I would like -- because it's single-threaded. Our overall system architecture would probably be improved by a switch to PostgreSQL, but it would be an extremely time-consuming transition process. We aren't having any real issues with MySQL, so we have no incentive to spend the required effort.

Thanks,
Shawn
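
P.S. To make the "several update requests simultaneously" idea a bit more concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration under assumptions, not a drop-in replacement for DIH: the update URL, the collection name `mycollection`, the batch size, and the function names (`batches`, `post_to_solr`, `index_in_parallel`) are all hypothetical. It assumes the documents have already been built as dictionaries from the source database, and that commits are handled separately (for example by autoCommit in the Solr configuration).

```python
import json
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_to_solr(batch,
                 update_url="http://localhost:8983/solr/mycollection/update"):
    """Send one batch as a JSON array to a Solr update handler.

    The URL and collection name are placeholders; adjust for your cluster.
    """
    req = urllib.request.Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return len(batch)


def index_in_parallel(docs, send_batch=post_to_solr,
                      batch_size=1000, workers=4):
    """Send batches from `workers` threads at once; return total docs sent."""
    source = batches(docs, batch_size)
    lock = threading.Lock()

    def worker():
        sent = 0
        while True:
            # Only one thread at a time pulls from the shared source, so
            # at most `workers` batches are held in memory at any moment.
            with lock:
                batch = next(source, None)
            if batch is None:
                return sent
            sent += send_batch(batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        return sum(f.result() for f in futures)
```

Each worker pulls its next batch from the shared source instead of having everything queued up front, which matters when the source holds hundreds of millions of rows. If the database query is the slow part, more threads on the Solr side will not help much; in that case it may be worth splitting the query itself (by ID range, for instance) so that several readers run at once.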