Rerunning the Data Import Handler on the Linux machine has started producing some errors and warnings:

On the node on which DIH was started:

WARN SolrWriter Error creating document : SolrInputDocument org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: collectionmain slice: shard1

On the second node:

WARN ReplicationHandler Exception while writing response for params: command=filecontent&checksum=true&generation=1047&qt=/replication&wt=filestream&file=_1oo_Lucene50_0.tip java.nio.file.NoSuchFileException: /var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip

ERROR Index fetch failed :org.apache.solr.common.SolrException: Unable to download _169.si completely. Downloaded 0!=466

ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: Unable to download _169.si completely. Downloaded 0!=466

WARN IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum is 3549855722 and actual is checksum 2062372352. expected length is 72522 and actual length is 39227

WARN UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=840638 deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}

Any suggestions about this?

Thanks

On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> The first thing I'd be looking at is how the JDBC batch size compares
> between the two machines.
>
> AFAIK, Solr shouldn't notice the difference, and since a large majority
> of the development is done on Linux-based systems, I'd be surprised if
> this was worse than Windows, which would lead me to the one thing that
> is definitely different between the two: your JDBC driver and its settings.
> At least that's where I'd look first.
>
> If nothing immediate pops up, I'd probably write a small driver program to
> just access the database from the two machines and process your 10M
> records _without_ sending them to Solr and see what the comparison is.
>
> You can also forgo DIH and do a simple import program via SolrJ.
> The advantage here is that the comparison I'm talking about above is
> really simple, just comment out the call that sends data to Solr. Here's an
> example...
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards <tedwards415...@gmail.com> wrote:
> > Sorry, I should explain further. The Data Import Handler had been running
> > for a while retrieving only about 150000 records from the database. Both in
> > development env (windows) and linux machine it took about 3 mins.
> >
> > The query has been changed and we are now trying to retrieve about 10
> > million records. We do expect the time to increase.
> >
> > With the new query the time taken on windows machine is consistently around
> > 40 mins. While the DIH is running queries slow down i.e. a query that
> > typically took 60 msec takes 100 msec.
> >
> > The time taken on linux machine is consistently around 2.5 hours. While the
> > DIH is running queries take about 200 to 400 msec.
> >
> > Thanks!
> >
> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> What happens if you run just the SQL query from the
> >> windows box and from the linux box? Is there any chance
> >> that somehow the connection from the linux box is
> >> just slower?
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> >> > What are you importing from? Is the source and Solr machine collocated
> >> > in the same fashion on dev and prod?
> >> >
> >> > Have you tried running this on a Linux dev machine? Perhaps your prod
> >> > machine is loaded much more than a dev.
> >> >
> >> > Regards,
> >> > Alex.
> >> > ----
> >> > Newsletter and resources for Solr beginners and intermediates:
> >> > http://www.solr-start.com/
> >> >
> >> >
> >> > On 2 February 2016 at 13:21, Troy Edwards <tedwards415...@gmail.com> wrote:
> >> >> We have a windows development machine on which the Data Import Handler
> >> >> consistently takes about 40 mins to finish. Queries run fine. JVM memory is
> >> >> 2 GB per node.
> >> >>
> >> >> But on a linux machine it consistently takes about 2.5 hours. The queries
> >> >> also run slower. JVM memory here is also 2 GB per node.
> >> >>
> >> >> How should I go about analyzing and tuning the linux machine?
> >> >>
> >> >> Thanks
> >> >
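
For reference, the suggestion quoted above (read all the records from the database on each machine without sending anything to Solr, then compare the timings) could be sketched roughly as below. This is only an illustration under assumptions: it uses Python's DB-API with an in-memory SQLite table as a stand-in for the real database, and the function name `time_fetch`, the `records` table, and the batch sizes are all hypothetical. Since it does not go through JDBC, it would only help separate database and connection speed from Solr; checking the JDBC driver settings themselves would need the same loop written in Java against the real driver.

```python
import sqlite3
import time


def time_fetch(conn, query, batch_size=1000):
    """Read every row of `query` in batches and return (row_count, seconds).

    Nothing is sent to Solr; rows are only counted, so the timing reflects
    the database and the connection alone.
    """
    cur = conn.cursor()
    cur.arraysize = batch_size  # number of rows fetchmany() returns per call
    start = time.perf_counter()
    cur.execute(query)
    rows = 0
    while True:
        batch = cur.fetchmany()
        if not batch:
            break
        rows += len(batch)
    elapsed = time.perf_counter() - start
    cur.close()
    return rows, elapsed


if __name__ == "__main__":
    # Stand-in data source (assumption): an in-memory SQLite table.
    # In a real comparison this would be a connection to the actual database,
    # opened the same way on both machines.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO records (body) VALUES (?)",
        (("row %d" % i,) for i in range(100000)),
    )
    for size in (100, 1000, 10000):
        rows, elapsed = time_fetch(conn, "SELECT id, body FROM records", size)
        print(f"batch_size={size}: {rows} rows in {elapsed:.3f}s")
```

Running the same script on both boxes against the same query gives a baseline: if the Linux box is already much slower here, the indexing side is probably not the first place to look.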