Re: SolrCloud scaling/optimization for high request rate

2018-10-27 Thread Deepak Goel
On Fri, Oct 26, 2018 at 9:25 PM Sofiya Strochyk 
wrote:

> Hi everyone,
>
> We have a SolrCloud setup with the following configuration:
>
> - 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon
>   E5-1650v2, 12 cores, with SSDs)
> - One collection, 4 shards, each has only a single replica (so 4
>   replicas in total), using compositeId router
> - Total index size is about 150M documents/320GB, so about 40M/80GB
>   per node
> - Zookeeper is on a separate server
> - Documents consist of about 20 fields (most of them are both stored
>   and indexed), average document size is about 2kB
> - Queries are mostly 2-3 words in the q field, with 2 fq parameters,
>   with complex sort expression (containing IF functions)
> - We don't use faceting due to performance reasons but need to add it
>   in the future
> - Majority of the documents are reindexed 2 times/day, as fast as the
>   SOLR allows, in batches of 1000-1 docs. Some of the documents are also
>   deleted (by id, not by query)
> - autoCommit is set to maxTime of 1 minute with openSearcher=false and
>   autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits from
>   clients are ignored.
> - Heap size is set to 8GB.
>
> Target query rate is up to 500 qps, maybe 300, and we need to keep
> response time at <200ms. But at the moment we only see very good search
> performance with up to 100 requests per second. Whenever it grows to about
> 200, average response time abruptly increases to 0.5-1 second. (Also it
> seems that request rate reported by SOLR in admin metrics is 2x higher than
> the real one, because for every query, every shard receives 2 requests: one
> to obtain IDs and second one to get data by IDs; so target rate for SOLR
> metrics would be 1000 qps).
>
> During high request load, CPU usage increases dramatically on the SOLR
> nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and about
> 93% on 1 server (random server each time, not the smallest one).
>
> The documentation mentions replication to spread the load between the
> servers. We tested replicating to smaller servers (32GB RAM, Intel Core
> i7-4770). However, when we tested it, the replicas were going out of sync
> all the time (possibly during commits) and reported errors like "PeerSync
> Recovery was not successful - trying replication." Then they proceed with
> replication which takes hours and the leader handles all requests
> singlehandedly during that time. Also both leaders and replicas started
> encountering OOM errors (heap space) for unknown reason. Heap dump analysis
> shows that most of the memory is consumed by [J (array of long) type, my
> best guess would be that it is "_version_" field, but it's still unclear
> why it happens. Also, even though with replication request rate and CPU
> usage drop 2 times, it doesn't seem to affect mean_ms, stddev_ms or p95_ms
> numbers (p75_ms is much smaller on nodes with replication, but still not as
> low as under load of <100 requests/s).
>
> Garbage collection is much more active during high load as well. Full GC
> happens almost exclusively during those times. We have tried tuning GC
> options like suggested here
> 
> and it didn't change things though.
>
> My questions are
>
> - How do we increase throughput? Is replication the only solution?

1. Increase the CPU speed
2. Increase the heap size (and tune the GC)
3. Replication
4. Run one more Solr node on the same hardware server (if the CPU is not
   reaching 100%)

> - if yes - then why doesn't it affect response times, considering that
>   CPU is not 100% used and index fits into memory?
> - How to deal with OOM and replicas going into recovery?

1. Increase the heap size
2. Debug memory usage to check for memory leaks (rare)

> - Is memory or CPU the main problem? (When searching on the internet,
>   I never see CPU as main bottleneck for SOLR, but our case might be
>   different)

1. Could be both

> - Or do we need smaller shards? Could segments merging be a problem?
> - How to add faceting without search queries slowing down too much?
> - How to diagnose these problems and narrow down to the real reason in
>   hardware or setup?

1. I would first tune all the software (OS, JVM, Solr) and benchmark the
   current hardware setup
2. Then I would play around with the hardware to check performance benefits

> Any help would be much appreciated.
>

The jump in response time to around 1 second when you increase the load
indicates that requests are queuing somewhere in your setup. (Since the CPU
is not 100% utilised, the bottleneck is most likely memory, disk, network,
or software.)

Lastly, what is the nature of your requests? Are the queries the same, or
are they very random? Random queries would need more tuning than identical
ones.
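
As a rough illustration of the queuing point above, Little's law (average requests in flight = arrival rate x average response time) can be applied to the figures from the original post. This is only a sketch: the class and method names are made up for the example, 0.2 s is the stated latency target used as an upper bound for the 100 requests/s case, and the factor of 2 is the two-phase distributed request (IDs first, then data) described in the original post.

```java
// Back-of-the-envelope queuing estimate based on Little's law: L = lambda * W.
// All names here are hypothetical; the input figures come from the thread.
final class QueueEstimate {

    /** Average number of requests in flight for a given arrival rate and latency. */
    static double inFlight(double requestsPerSecond, double responseSeconds) {
        return requestsPerSecond * responseSeconds;
    }

    /** Request rate reported by Solr metrics when each client query costs several shard requests. */
    static double metricsRate(double clientQps, int shardRequestsPerQuery) {
        return clientQps * shardRequestsPerQuery;
    }

    public static void main(String[] args) {
        // About 100 requests/s while latency stays under the 200 ms target.
        System.out.println("in flight at 100 rps, 0.2 s: " + inFlight(100, 0.2));
        // About 200 requests/s with the reported 0.5-1 s response times.
        System.out.println("in flight at 200 rps, 0.5 s: " + inFlight(200, 0.5));
        System.out.println("in flight at 200 rps, 1.0 s: " + inFlight(200, 1.0));
        // Target of 500 client queries/s, two shard requests per query.
        System.out.println("metrics rate at 500 qps:     " + metricsRate(500, 2));
    }
}
```

If the number of requests in flight grows much faster than the request rate, as it does here, that would be consistent with requests waiting on some shared resource rather than being limited by raw CPU.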

> Thanks!
> --
>
> *Sofiia Strochyk *
>
>
> s...@interlogic.com.ua

RE: Tesseract language

2018-10-27 Thread Martin Frank Hansen (MHQ)
Hi Rohan,

Thanks for your reply. Are you using tess4j with Tika or on its own? I will
take a look at tess4j if I can't make it work with Tika alone.

Best regards
Martin
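
One way to narrow down the symptom described in the quoted messages below (Tesseract works from the command line but not from Java) might be to check what the Java process itself sees. The sketch below is an assumption-laden diagnostic, not part of Tika or tess4j: it prints the TESSDATA_PREFIX value visible to the JVM and runs `tesseract --list-langs` through `ProcessBuilder`. The class and method names are hypothetical, and the parsing assumes the output is a header line starting with "List of" followed by one language code per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or tess4j.
final class TesseractLangCheck {

    /** Extracts language codes from `tesseract --list-langs` output, skipping the header line. */
    static List<String> parseLanguages(List<String> outputLines) {
        List<String> langs = new ArrayList<>();
        for (String line : outputLines) {
            String trimmed = line.trim();
            // Assumption: the header starts with "List of"; other non-empty lines are language codes.
            // Error messages from tesseract, if any, would also end up in this list.
            if (trimmed.isEmpty() || trimmed.startsWith("List of")) {
                continue;
            }
            langs.add(trimmed);
        }
        return langs;
    }

    /** Runs the tesseract binary found on this JVM's PATH and returns its output lines. */
    static List<String> listLangsOutput() throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("tesseract", "--list-langs");
        pb.redirectErrorStream(true); // merge stderr, since some versions print the list there
        Process process = pb.start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) {
        // The environment seen by the JVM can differ from the one in an interactive shell.
        System.out.println("TESSDATA_PREFIX = " + System.getenv("TESSDATA_PREFIX"));
        try {
            List<String> langs = parseLanguages(listLangsOutput());
            System.out.println("Languages: " + langs);
            System.out.println("Danish available: " + langs.contains("dan"));
        } catch (IOException e) {
            System.out.println("Could not start tesseract from Java: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If the variable is missing or "dan" is not listed here while it is listed in a terminal, the IDE or service running the Java code probably started with a different environment.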


-Original Message-
From: Rohan Kasat 
Sent: 26 October 2018 21:45
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Hi Martin,

Are you using it for image formats? I think you can try tess4j and set
TESSDATA_PREFIX as the home for the Tesseract configs.

I have tried it and it works pretty well on my local machine.

I used Java 8 and Tesseract 3 for this.

Regards,
Rohan Kasat

On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi Tim,
>
> You were right.
>
> When I called `tesseract testing/eurotext.png testing/eurotext-dan -l
> dan`, I got an error message so I downloaded "dan.traineddata" and
> added it to the Tesseract-OCR/tessdata folder. Furthermore I added the
> 'TESSDATA_PREFIX' variable to the path-variables pointing to
> "Tesseract-OCR/tessdata".
>
> Now Tesseract works with Danish language from the CMD, but now I can't
> make the code work in Java, not even with default settings (which I
> could before). Am I missing something or just mixing some things up?
>
>
>
> -Original Message-
> From: Tim Allison 
> Sent: 26. oktober 2018 19:58
> To: solr-user@lucene.apache.org
> Subject: Re: Tesseract language
>
> Tika relies on you to install tesseract and all the language libraries
> you'll need.
>
> If you can successfully call `tesseract testing/eurotext.png
> testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> with your code above.
> On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ) wrote:
> >
> > Hi again,
> >
> > Now I moved the OCR part to Tika, but I still can't make it work with
> > Danish. It works when using default language settings and it seems
> > like Tika is missing Danish dictionary.
> >
> > My java code looks like this:
> >
> > {
> >     File file = new File(pathfilename);
> >
> >     Metadata meta = new Metadata();
> >
> >     InputStream stream = TikaInputStream.get(file);
> >
> >     Parser parser = new AutoDetectParser();
> >     BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> >
> >     TesseractOCRConfig config = new TesseractOCRConfig();
> >     config.setLanguage("dan"); // code works if this phrase is commented out.
> >
> >     ParseContext parseContext = new ParseContext();
> >
> >     parseContext.set(TesseractOCRConfig.class, config);
> >
> >     parser.parse(stream, handler, meta, parseContext);
> >     System.out.println(handler.toString());
> > }
> >
> > Hope that someone can help here.
> >
> > -Original Message-
> > From: Martin Frank Hansen (MHQ) 
> > Sent: 22 October 2018 07:58
> > To: solr-user@lucene.apache.org
> > Subject: SV: Tesseract language
> >
> > Hi Erick,
> >
> > Thanks for the help! I will take a look at it.
> >
> >
> > Martin Frank Hansen, Senior Data Analytiker
> >
> > Data, IM & Analytics
> >
> >
> >
> > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
> > www.kmd.dk Mobil +4525571418
> >
> > -Original Message-
> > From: Erick Erickson 
> > Sent: 21 October 2018 22:49
> > To: solr-user 
> > Subject: Re: Tesseract language
> >
> > Here's a skeletal program that uses Tika in a stand-alone client.
> > Rip the RDBMS parts out
> >
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> > >
> > > Usually, we just say to do a custom solution using SolrJ client to
> > > connect. This gives you maximum flexibility and allows you to
> > > integrate Tika either inside your code or as a server. Latest Tika
> > > actually has some off-thread handling I believe, to make it safer to
> > > embed.
> > >
> > > For DIH alternatives, if you want configuration over custom code,
> > > you could look at something like Apache NiFi. It can push data
> > > into Solr.
> > > Obviously it is a bigger solution, but it is correspondingly more
> > > robust too.
> > >
> > > Regards,
> > >Alex.
> > > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) wrote:
> > > >
> > > > Hi Alexandre,
> > > >
> > > > Thanks for your reply.
> > > >
> > > > Yes, right now it is just for testing the possibilities of Solr
> > > > and Tesseract.
> > > >
> > > > I will take a look at the Tika documentation to see if I can
> > > > make it work.
> > > >
> > > > You said that DIH is not recommended for production usage. What
> > > > is the recommended method(s) to upload data to a Solr instance?
> > > >
> > > > Best regards
> > > >
> > > > Martin Frank Hansen
> > > >
> > > > -Oprindelig meddelelse-
> > > > Fra: Alexandre Rafalovitch 
> > > > Sendt: 21. oktober 2018 16:26
> > > > Til: solr-user

Re: Tesseract language

2018-10-27 Thread Rohan Kasat
I used tess4j for image formats and Tika for scanned PDFs and images within
PDFs.

Regards,
Rohan Kasat


How to get spellcheck results per field in solr ?

2018-10-27 Thread govind nitk
Hi,

I have done suggestion using suggest component. And the results returned
are having format:

suggest: { "cityname_suggest": { }, "location_suggest": {},
"area_suggest":{} }
given cityname_suggest, location_suggest, area_suggest are different
dictionary names.

Now comparing this result structure to spellcheck response, my questions
are :
1. how to build multiple spellcheck results per dictionary ?


What I have tried :
copying multiple fields data into "get_spell" field and build spellcheck on
top of this. But is there any way to get spellcheck results per dictionary
mentioned ?
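
One possible approach, offered only as a sketch, is to define a separate spellchecker per field and send one request per dictionary, so that each response holds the results for a single dictionary. The example below just builds the request URLs. It assumes a `/spell` request handler with the spellcheck component attached and spellcheckers named `cityname_spell`, `location_spell` and `area_spell` defined in solrconfig.xml; the host, collection name, handler path, dictionary names and the class itself are all hypothetical. The `spellcheck` and `spellcheck.dictionary` parameters are the standard Solr spellcheck request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds one spellcheck request URL per dictionary.
final class SpellcheckRequests {

    /** URL-encodes a parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Returns one URL per dictionary so that each response covers a single dictionary. */
    static List<String> perDictionaryUrls(String handlerUrl, String query, List<String> dictionaries) {
        List<String> urls = new ArrayList<>();
        for (String dictionary : dictionaries) {
            urls.add(handlerUrl
                    + "?q=" + encode(query)
                    + "&rows=0" // only the spellcheck section is of interest here
                    + "&spellcheck=true"
                    + "&spellcheck.dictionary=" + encode(dictionary));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Host, collection, handler and dictionary names are assumptions for illustration.
        List<String> urls = perDictionaryUrls(
                "http://localhost:8983/solr/places/spell",
                "new delih",
                Arrays.asList("cityname_spell", "location_spell", "area_spell"));
        for (String url : urls) {
            System.out.println(url);
        }
    }
}
```

The client would then label each response with the dictionary it asked for, which gives a structure similar to the suggester output above at the cost of extra requests.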

Thanks


Re: Tesseract language

2018-10-27 Thread Tim Allison
Martin,
  Let’s move this over to user@tika.

Rohan,
  Is there something about Tika’s use of tesseract for image files that can
be improved?

Best,
   Tim
