RE: DataImportHandler | Query | performance

Prateek Jain J Fri, 23 Dec 2016 08:22:04 -0800

Thanks a lot Shawn.

Regards,
Prateek Jain

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: 23 December 2016 01:36 PM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler | Query | performance

On 12/23/2016 5:15 AM, Prateek Jain J wrote:
> We need some advice/views on the way we push our documents into Solr (4.8.1).
> So, here are the requirements:
>
> 1. Documents can be from 5 to 100 KB in size.
>
> 2. 10-50 users actively query Solr with different sorts of data.
>
> 3. Data will frequently become available to be pushed to Solr (streaming).
> It must be available to query within 15 seconds.
>
> Current scenario:
> We dump data to a JSON file and have a cron job (in Java, each time a
> new file is created) which reads this file periodically and sends it to Solr
> using SolrJ (via HTTP). This file is massive and can be gigabytes in size in
> some cases (soft and hard Solr commits are configured appropriately).
>
> Issue:
>
> 1. Multiple cores exist in this Solr instance, and they follow a similar
> pattern too.
>
> 2. This causes Solr to hang, and in some cases causes OOM errors because too
> many file descriptors are open (sometimes because of other issues).
>
> We would like to know whether using DataImportHandler would give us any
> advantage. I took a quick look at the Solr wiki, but it is not clear whether
> it offers any advantage in terms of performance (in this scenario).

If you do find a way to do this with DIH, it might make your "too many open 
files" problems *worse*, not better.  Currently these files you are talking 
about are being handled by a completely separate process, not Solr.  If you 
move this inside Solr, then Solr will open *more* files.

Your SolrJ program should read the files and construct SolrInputDocument 
objects, then send them in batches to Solr.  It should not send massive files 
directly.  That might fix the OOM issues, or it might not -- if not, then your 
Solr machine needs a larger heap.  To deal with the open files problem, you're 
going to have to raise the operating system's limit on open files for the user 
that runs Solr (on Linux, for example, the per-process limit reported by 
`ulimit -n`).
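
As a rough illustration of the batching idea, here is a minimal sketch that 
uses only the Java standard library (Java 8 or later assumed).  The class name 
BatchSender, the method sendInBatches, and the batch size are hypothetical and 
not from this thread.  The sink stands in for whatever actually sends documents 
to Solr; with SolrJ 4.x that would presumably be a call such as 
server.add(batch) on a SolrServer, with each element being a SolrInputDocument 
built from one record of the file.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of SolrJ: groups items from a streaming
// source into fixed-size batches so that only one batch is held in memory
// at a time, instead of the whole multi-gigabyte file.
final class BatchSender {
    private BatchSender() {}

    // Reads items from source, hands each full batch to sink, and returns
    // the number of batches sent. The final batch may be smaller.
    static <T> int sendInBatches(Iterator<T> source, int batchSize,
                                 Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                // Assumption: in a real indexer this is where the batch
                // would be sent to Solr, e.g. server.add(batch) in SolrJ.
                sink.accept(batch);
                batches++;
                // Start a fresh list so the sink may keep the old one.
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // send the leftover partial batch
            batches++;
        }
        return batches;
    }
}
```

The source iterator would be fed by a streaming JSON parser rather than by 
loading the file into memory first.  A batch size somewhere in the hundreds to 
low thousands of documents is a common starting point, but the right value 
depends on document size and available heap.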

DIH has limitations that frequently make it necessary for users to write their 
own programs to do indexing.  Since you already have an external process, you 
should improve that, rather than trying to use DIH.

Thanks,
Shawn
