On 3/20/2016 6:11 PM, Amit Jha wrote:
> In my case I am using DIH to index the data and Query is having 2 join
> statements. To index 70K documents it is taking 3-4Hours. Document size would
> be around 10-20KB. DB is MSSQL and using solr4.2.10 in cloud mode.

My source data is in a MySQL database. I use DIH for full rebuilds and SolrJ for maintenance. My index is sharded, but I'm not running SolrCloud. When using DIH, all of my shards build at once, and each one achieves about 750 docs per second. With six large shards, rebuilding a 146 million document index takes 9-10 hours. It produces a total index size in the ballpark of 170GB.

DIH has a performance limitation -- it's single-threaded. I obtain the speeds that I do because all of my shards import at the same time -- six dataimport instances running at once, each one with a single thread, importing a little more than 24 million documents.

I have discovered that Solr is the bottleneck on my setup. The data retrieval from MySQL can proceed much faster than Solr can handle with a single indexing thread. My situation is a little bit unusual -- as Erick mentioned, usually the bottleneck is data retrieval, not Solr.

At this point, if I want to make bulk indexing go faster, I need to build a SolrJ application that can index with multiple threads to each Solr core at the same time. This is on my roadmap, but it's not going to be a trivial project.

At 10-20KB, your documents are large, but not excessively so. If 70000 documents take 3-4 hours, then one of a few problems is happening:

1) Your database is VERY slow.
2) Your analysis chain in schema.xml is running SUPER slow analysis components.
3) Your server or its configuration is not providing enough resources (CPU/RAM/IO) for Solr to run efficiently.

#2 seems rather unlikely, so I would suspect one of the other two.

----

I have seen one situation related to the Microsoft side of your setup that might cause a problem like this. If any of your machines are running on Windows Server 2012 and you have bridged NICs (usually for failover in the event of a switch failure), then you will need to break the bridge and just run one NIC.
The performance improvement on the network when a bridged NIC is removed from Server 2012 is enough to blow your mind, especially if the access is over a high-latency network link, like a VPN or WAN connection. The same setup on Server 2003 or Server 2008 has very good performance. Microsoft seems to have a bug with bridged NICs in Server 2012. Last time I tried to figure out whether it could be fixed, I ran into this problem:

https://xkcd.com/979/

Thanks,
Shawn
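
A quick back-of-the-envelope check of the throughput figures in this thread, as a minimal Python sketch. The inputs (750 docs per second per shard, six shards, 146 million documents, 70K documents in 3-4 hours) are taken from the messages above. The function names are made up for illustration, and the calculation assumes a constant indexing rate and evenly sized shards, which real imports only approximate.

```python
def rebuild_hours(total_docs, shards, docs_per_sec_per_shard):
    """Hours for a full rebuild when all shards import in parallel.

    Assumes evenly sized shards and a constant per-shard rate.
    """
    docs_per_shard = total_docs / shards
    return docs_per_shard / docs_per_sec_per_shard / 3600


def docs_per_second(total_docs, hours):
    """Average indexing rate for a job that took the given number of hours."""
    return total_docs / (hours * 3600)


if __name__ == "__main__":
    # Sharded DIH rebuild: six single-threaded imports running at once.
    print(f"sharded rebuild: {rebuild_hours(146_000_000, 6, 750):.1f} hours")

    # The 70K-document import reported as taking 3-4 hours.
    for hours in (3, 4):
        rate = docs_per_second(70_000, hours)
        print(f"70K docs in {hours} hours: {rate:.1f} docs/sec")
```

The sharded rebuild comes out at roughly 9 hours, in line with the 9-10 hours observed, while the 70K-document import averages only about 5 to 6.5 documents per second. That is more than a hundred times slower than a single DIH thread in the sharded setup, which is why the slowdown points at the database, the server resources, or the network rather than at DIH itself.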