On 3/20/2016 6:11 PM, Amit Jha wrote:
> In my case I am using DIH to index the data and Query is having 2 join 
> statements. To index 70K documents it is taking 3-4Hours. Document size would 
> be around 10-20KB. DB is MSSQL and using solr4.2.10 in cloud mode.

My source data is in a MySQL database.  I use DIH for full rebuilds and
SolrJ for maintenance.

My index is sharded, but I'm not running SolrCloud.  When using DIH, all
of my shards build at once, and each one achieves about 750 docs per
second.  With six large shards, rebuilding a 146 million document index
takes 9-10 hours.  It produces a total index size in the ballpark of 170GB.
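
As a rough cross-check of those numbers, here is a small Python sketch.  The
function name is made up for illustration, and the 750 docs/sec rate is only
the approximate per-shard figure quoted above, so treat the result as a
ballpark estimate rather than a measurement.

```python
# Back-of-the-envelope estimate of a full rebuild, assuming every shard
# imports in parallel at roughly the same steady rate.
SHARDS = 6
DOCS_PER_SEC_PER_SHARD = 750   # approximate rate quoted in the message
TOTAL_DOCS = 146_000_000


def rebuild_hours(total_docs, shards, docs_per_sec_per_shard):
    """Return the estimated wall-clock hours for a parallel rebuild."""
    aggregate_rate = shards * docs_per_sec_per_shard  # docs/sec, all shards
    return total_docs / aggregate_rate / 3600


print(f"{rebuild_hours(TOTAL_DOCS, SHARDS, DOCS_PER_SEC_PER_SHARD):.1f} hours")
```

This comes out at about 9 hours, which lines up with the low end of the
observed 9-10 hour range.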

DIH has a performance limitation -- it's single-threaded.  I obtain the
speeds that I do because all of my shards import at the same time -- six
dataimport instances running at the same time, each one with a single
thread, importing a little more than 24 million documents.  I have
discovered that Solr is the bottleneck on my setup.  The data retrieval
from MySQL can proceed much faster than Solr can handle with a single
indexing thread.  My situation is a little bit unusual -- as Erick
mentioned, usually the bottleneck is data retrieval, not Solr.

At this point, if I want to make bulk indexing go faster, I need to
build a SolrJ application that can index with multiple threads to each
Solr core at the same time.  This is on my roadmap, but it's not going
to be a trivial project.
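
To illustrate the general shape of such an indexer, here is a minimal
sketch of the threading pattern, written in Python for brevity.  It is not
SolrJ code and not the application I plan to build.  The send_batch
callable is a hypothetical stand-in for whatever actually posts a batch of
documents to a Solr core, and the thread count is an arbitrary assumption.

```python
import queue
import threading


def index_in_parallel(doc_batches, send_batch, threads=4):
    """Feed batches to several worker threads; return the number of docs sent.

    send_batch is a hypothetical callable that posts one batch to Solr.
    """
    work = queue.Queue(maxsize=threads * 2)  # bounded: reader cannot run far ahead
    lock = threading.Lock()
    sent_counts = []
    errors = []

    def worker():
        while True:
            batch = work.get()
            if batch is None:          # sentinel: no more work for this thread
                return
            try:
                send_batch(batch)
                with lock:
                    sent_counts.append(len(batch))
            except Exception as exc:   # remember the failure, keep draining
                with lock:
                    errors.append(exc)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for batch in doc_batches:          # blocks whenever the queue is full
        work.put(batch)
    for _ in workers:                  # one sentinel per worker
        work.put(None)
    for w in workers:
        w.join()
    if errors:
        raise errors[0]
    return sum(sent_counts)
```

The bounded queue is there so that a fast database reader cannot pile up
millions of documents in memory while the indexing threads catch up.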

At 10-20KB, your documents are large, but not excessively so.  If 70000
documents take 3-4 hours, then one of a few problems is likely happening.

1) Your database is VERY slow.
2) Your analysis chain in schema.xml is running SUPER slow analysis
components.
3) Your server or its configuration is not providing enough resources
(CPU/RAM/IO) for Solr to run efficiently.

#2 seems rather unlikely, so I would suspect one of the other two.
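
To put the 3-4 hour figure in perspective, here is a quick Python sketch
that converts it to a documents-per-second rate.  The helper name is made up
for illustration, and the 750 docs/sec comparison point is just the
approximate single-thread DIH rate from my own setup, not a benchmark.

```python
# Convert "70000 documents in 3-4 hours" into a documents-per-second rate.
def docs_per_second(doc_count, hours):
    """Average indexing rate over the whole import."""
    return doc_count / (hours * 3600)


fast = docs_per_second(70_000, 3)   # best case: the import takes 3 hours
slow = docs_per_second(70_000, 4)   # worst case: the import takes 4 hours
reference_rate = 750                # approximate single-thread DIH rate

print(f"{slow:.1f} to {fast:.1f} docs/sec")
print(f"roughly {reference_rate / fast:.0f}x to {reference_rate / slow:.0f}x slower")
```

That works out to only about 5-6 documents per second, which is more than
a hundred times slower than one DIH thread achieves on my hardware.  A gap
that large points at something being badly wrong rather than a tuning issue.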

----

I have seen one situation related to the Microsoft side of your setup
that might cause a problem like this.  If any of your machines are
running on Windows Server 2012 and you have bridged NICs (usually for
failover in the event of a switch failure), then you will need to break
the bridge and just run one NIC.

The performance improvement on the network when a bridged NIC is removed
from Server 2012 is enough to blow your mind, especially if the access
is over a high-latency network link, like a VPN or WAN connection.  The
same setup on Server 2003 or Server 2008 has very good performance.
Microsoft seems to have a bug with bridged NICs in Server 2012.  Last
time I tried to figure out whether it could be fixed, I ran into this
problem:

https://xkcd.com/979/

Thanks,
Shawn
