Re: Data Import Handler takes different time on different machines

Troy Edwards Tue, 02 Feb 2016 10:12:12 -0800

Rerunning the Data Import Handler on the Linux machine has
started producing some errors and warnings:


On the node on which DIH was started:

WARN SolrWriter Error creating document : SolrInputDocument

org.apache.solr.common.SolrException: No registered leader was found
after waiting for 4000ms , collection: collectionmain slice: shard1



On the second node:

WARN ReplicationHandler Exception while writing response for params:
command=filecontent&checksum=true&generation=1047&qt=/replication&wt=filestream&file=_1oo_Lucene50_0.tip

java.nio.file.NoSuchFileException:
/var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip


ERROR

Index fetch failed :org.apache.solr.common.SolrException: Unable to
download _169.si completely. Downloaded 0!=466


ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: Unable to download _169.si
completely. Downloaded 0!=466

WARN
IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum is
3549855722 and actual is checksum 2062372352. expected length is 72522 and
actual length is 39227

WARN UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=840638
deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}


Any suggestions about this?

Thanks

On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> The first thing I'd be looking at is how the JDBC batch size compares
> between the two machines...
>
> AFAIK, Solr shouldn't notice the difference, and since a large majority
> of the development is done on Linux-based systems, I'd be surprised if
> this was worse than Windows, which would lead me to the one thing that
> is definitely different between the two: Your JDBC driver and its settings.
> At least that's where I'd look first.
>
> If nothing immediate pops up, I'd probably write a small driver program to
> just access the database from the two machines and process your 10M
> records _without_ sending them to Solr and see what the comparison is.
>
> You can also forgo DIH and do a simple import program via SolrJ. The
> advantage here is that the comparison I'm talking about above becomes
> really simple: just comment out the call that sends data to Solr. Here's an
> example...
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards <tedwards415...@gmail.com>
> wrote:
> > Sorry, I should explain further. The Data Import Handler had been running
> > for a while retrieving only about 150000 records from the database. On both
> > the development env (Windows) and the Linux machine it took about 3 mins.
> >
> > The query has been changed and we are now trying to retrieve about 10
> > million records. We do expect the time to increase.
> >
> > With the new query the time taken on the Windows machine is consistently
> > around 40 mins. While the DIH is running, queries slow down, i.e. a query
> > that typically took 60 msec takes 100 msec.
> >
> > The time taken on the Linux machine is consistently around 2.5 hours. While
> > the DIH is running, queries take about 200 to 400 msec.
> >
> > Thanks!
> >
> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> What happens if you run just the SQL query from the
> >> windows box and from the linux box? Is there any chance
> >> that somehow the connection from the linux box is
> >> just slower?
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
> >> <arafa...@gmail.com> wrote:
> >> > What are you importing from? Are the source and Solr machine collocated
> >> > in the same fashion on dev and prod?
> >> >
> >> > Have you tried running this on a Linux dev machine? Perhaps your prod
> >> > machine is loaded much more than a dev.
> >> >
> >> > Regards,
> >> >    Alex.
> >> > ----
> >> > Newsletter and resources for Solr beginners and intermediates:
> >> > http://www.solr-start.com/
> >> >
> >> >
> >> > On 2 February 2016 at 13:21, Troy Edwards <tedwards415...@gmail.com> wrote:
> >> >> We have a windows development machine on which the Data Import Handler
> >> >> consistently takes about 40 mins to finish. Queries run fine. JVM memory
> >> >> is 2 GB per node.
> >> >>
> >> >> But on a linux machine it consistently takes about 2.5 hours. The queries
> >> >> also run slower. JVM memory here is also 2 GB per node.
> >> >>
> >> >> How should I go about analyzing and tuning the linux machine?
> >> >>
> >> >> Thanks
> >>
>
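
Erick's suggestion in the quoted thread, a small driver program that reads the records from each machine without sending them to Solr, could look roughly like the sketch below. It is only an illustration under assumptions: the class name JdbcReadTimer, its method names, and the command-line arguments are hypothetical and not part of Solr or DIH, and the JDBC URL, credentials, and SQL have to be supplied by whoever runs it. The fetch size is passed to the driver as a hint, and drivers differ in how they honor it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of Solr or DIH: times a raw JDBC read
// so the database side can be compared between the two machines.
class JdbcReadTimer {

    // Rows per second for a row count and an elapsed time in nanoseconds.
    // Returns 0.0 when no time has elapsed, to avoid dividing by zero.
    static double rowsPerSecond(long rows, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            return 0.0;
        }
        return rows / (elapsedNanos / 1_000_000_000.0);
    }

    // Runs the query, reads every column of every row, and returns the
    // number of rows seen. Nothing is sent to Solr.
    static long drain(Connection conn, String sql, int fetchSize) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(fetchSize); // a hint only; driver behavior varies
            try (ResultSet rs = stmt.executeQuery(sql)) {
                int cols = rs.getMetaData().getColumnCount();
                long rows = 0;
                while (rs.next()) {
                    for (int i = 1; i <= cols; i++) {
                        rs.getObject(i); // force the driver to materialize the value
                    }
                    rows++;
                }
                return rows;
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 4) {
            System.out.println(
                "usage: JdbcReadTimer <jdbcUrl> <user> <password> <sql> [fetchSize]");
            return;
        }
        int fetchSize = args.length > 4 ? Integer.parseInt(args[4]) : 1000;
        long start = System.nanoTime(); // timing includes opening the connection
        try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2])) {
            long rows = drain(conn, args[3], fetchSize);
            long elapsed = System.nanoTime() - start;
            System.out.printf("%d rows in %.1f s (%.0f rows/s)%n",
                    rows, elapsed / 1e9, rowsPerSecond(rows, elapsed));
        }
    }
}
```

Running the same command with the same SQL on the Windows and the Linux machine gives a database-only baseline. For scale, the end-to-end DIH times reported in this thread (about 10 million records in 40 mins versus 2.5 hours) work out to roughly 4,200 and 1,100 rows/s. If the raw read shows a similar gap, the JDBC driver, its settings, or the network path is the likely place to look; if it does not, the difference is more likely on the Solr side.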
