Data Import Handler

2013-11-06 Thread Ramesh
Hi Folks,

 

Can anyone suggest how I can customize the data-config.xml file?

I want to provide database details (db_url, username, password) from my own
properties file instead of the data-config.xml file.



RE: Data Import Handler

2013-11-13 Thread Ramesh
James, can you elaborate on how to process driver="${dataimporter.request.driver}"
and url="${dataimporter.request.url}", and where these should be specified?
My purpose is to configure my DB details (url, username, password) in a properties file.

-Original Message-
From: Dyer, James [mailto:james.d...@ingramcontent.com] 
Sent: Wednesday, November 06, 2013 7:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

If you prepend the variable name with "dataimporter.request", you can
include variables like these as request parameters:



/dih?driver=some.driver.class&url=jdbc:url:something
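The data-config.xml side of this did not survive the archive; a minimal sketch of
the dataSource declaration it would pair with (the JdbcDataSource type and the
user/password parameters are assumptions, not from the original mail):

 <dataConfig>
   <dataSource type="JdbcDataSource"
               driver="${dataimporter.request.driver}"
               url="${dataimporter.request.url}"
               user="${dataimporter.request.user}"
               password="${dataimporter.request.password}"/>
   <document>
     <!-- entities go here -->
   </document>
 </dataConfig>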

If you want to include these in solrcore.properties, you can additionally
add each property to solrconfig.xml like this:

 <lst name="defaults">
   <str name="driver">${dih.driver}</str>
   <str name="url">${dih.url}</str>
 </lst>

Then in solrcore.properties:
 dih.driver=some.driver.class
 dih.url=jdbc:url:something

See http://wiki.apache.org/solr/SolrConfigXml?#System_property_substitution


James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: Ramesh [mailto:ramesh.po...@vensaiinc.com]
Sent: Wednesday, November 06, 2013 7:25 AM
To: solr-user@lucene.apache.org
Subject: Data Import Handler

Hi Folks,

 

Can anyone suggest how I can customize the data-config.xml file?

I want to provide database details (db_url, username, password) from my own
properties file instead of the data-config.xml file.





RE: Data Import Handler

2013-11-13 Thread Ramesh
It needs to be kept outside of Solr, e.g. a customized Mysolr_core.properties.
How do I access it from there?

-Original Message-
From: Dyer, James [mailto:james.d...@ingramcontent.com] 
Sent: Wednesday, November 13, 2013 8:50 PM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

In solrcore.properties, put:

datasource.url=jdbc:xxx:yyy
datasource.driver=com.some.driver

In solrconfig.xml, put:

 <requestHandler name="/dataimport"
                 class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
     ...
     <str name="driver">${datasource.driver}</str>
     <str name="url">${datasource.url}</str>
     ...
   </lst>
 </requestHandler>

In data-config.xml, put:
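(The dataSource line here was stripped by the archive; it most likely referenced
the request parameters set up by the defaults above, roughly like this sketch,
with the JdbcDataSource type assumed:)

 <dataSource type="JdbcDataSource"
             driver="${dataimporter.request.driver}"
             url="${dataimporter.request.url}"/>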


Hope this works for you.

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Ramesh [mailto:ramesh.po...@vensaiinc.com]
Sent: Wednesday, November 13, 2013 9:00 AM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

James, can you elaborate on how to process driver="${dataimporter.request.driver}"
and url="${dataimporter.request.url}", and where these should be specified? My purpose
is to configure my DB details (url, username, password) in a properties file.

-Original Message-
From: Dyer, James [mailto:james.d...@ingramcontent.com]
Sent: Wednesday, November 06, 2013 7:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

If you prepend the variable name with "dataimporter.request", you can
include variables like these as request parameters:



/dih?driver=some.driver.class&url=jdbc:url:something

If you want to include these in solrcore.properties, you can additionally
add each property to solrconfig.xml like this:

 <lst name="defaults">
   <str name="driver">${dih.driver}</str>
   <str name="url">${dih.url}</str>
 </lst>

Then in solrcore.properties:
 dih.driver=some.driver.class
 dih.url=jdbc:url:something

See http://wiki.apache.org/solr/SolrConfigXml?#System_property_substitution


James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: Ramesh [mailto:ramesh.po...@vensaiinc.com]
Sent: Wednesday, November 06, 2013 7:25 AM
To: solr-user@lucene.apache.org
Subject: Data Import Handler

Hi Folks,

 

Can anyone suggest me how can customize dataconfig.xml file 

I want to provide database details like( db_url,uname,password ) from my own
properties file instead of dataconfig.xaml file









Solr is making Jboss server slower..

2014-02-14 Thread Ramesh
Hi,
We are using Solr search, and sometimes our entire system becomes slow because
of it. We haven't found any clue, so we tried load and stress testing through
JMeter. We have tried this many times with different testing approaches. At
random times in JMeter we get the following message, and at random times the
server hangs and goes down. I have checked our code as well; all connections
are closed. Can you please look at this and tell us what could be the reason
for the failure?

java.lang.Thread.start0(Native Method)
java.lang.Thread.start(Thread.java:679)
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:120)
com.ning.http.client.providers.jdk.JDKAsyncHttpProvider.execute(JDKAsyncHttpProvider.java:158)
com.ning.http.client.providers.jdk.JDKAsyncHttpProvider.execute(JDKAsyncHttpProvider.java:120)
com.ning.http.client.AsyncHttpClient.executeRequest(AsyncHttpClient.java:512)
com.ning.http.client.AsyncHttpClient$BoundRequestBuilder.execute(AsyncHttpClient.java:234)
com.digite.utils.GeneralUtils.getTagData(GeneralUtils.java:171)
com.digite.app.kanban.search.web.action.SearchAction.getTagResults(SearchAction.java:337)
com.digite.app.kanban.search.web.action.SearchAction.searchItem(SearchAction.java:73)
sun.reflect.GeneratedMethodAccessor707.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:622)
com.opensymphony.xwork2.DefaultActionInvocation.invokeAction(DefaultActionInvocation.java:453)



Thanks in Advance... Waiting for your reply...








Deployment issue on Solr

2013-09-23 Thread Ramesh
Unable to deploy Solr 4.4 on JBoss 4.0.0. I am getting an error like

 



RE: solr4.4 admin page show "loading"

2013-09-24 Thread Ramesh
Use Mozilla for better results; even in IE it is not working properly.

-Original Message-
From: William Bell [mailto:billnb...@gmail.com] 
Sent: Tuesday, September 24, 2013 12:02 PM
To: solr-user@lucene.apache.org
Subject: Re: solr4.4 admin page show "loading"

Use Chrome.


On Thu, Sep 19, 2013 at 7:32 AM, Micheal Chao wrote:

> hi, I have installed solr4.4 on tomcat7.0. The problem is I can't see
> the solr admin page; it always shows "loading". I can't find any
> errors in the tomcat logs, and I can send search requests and get results.
>
> what can I do? please help me, thank you very much.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr4-4-admin-page-show-loading-tp4
> 091039.html Sent from the Solr - User mailing list archive at 
> Nabble.com.
>



--
Bill Bell
billnb...@gmail.com
cell 720-256-8076




Re: Auto-optimization for Solr indexes

2015-12-20 Thread Rahul Ramesh
Hi Erick,
We index several million documents per day and we optimize every day
when the relative load is low. The reason we optimize is that we don't want the
index sizes to grow too large and auto optimize to kick in. When auto
optimize kicks in, it results in unpredictable performance, as it is CPU and
IO intensive.

In older Solr (4.2), when the segment size grew too large, insertions used
to fail. Has this problem been seen in SolrCloud?

Also, we have observed that recovery takes a bit more time when the index is not
optimized. We don't have any quantitative measurement for this; it's just
an observation. Is this observation correct?

If we optimize every day, the indexes will not be skewed, right?

Please let me know if my understanding is correct.

Regards,
Rahul

On Mon, Dec 21, 2015 at 9:54 AM, Erick Erickson 
wrote:

> You'll probably have to shard before you get to the TB range. At that
> point, all the optimization is done individually on each shard so it
> really doesn't matter how many shards you have.
>
> Just issuing
> http://solr:port/solr/collection/update?optimize=true
>
> is sufficient, that'll forward the optimize command to all the shards
> in the collection.
>
> Best,
> Erick
>
> On Sun, Dec 20, 2015 at 8:19 PM, Zheng Lin Edwin Yeo
>  wrote:
> > Thanks for your information Erick.
> >
> > We have yet to decide how often we will update the index to include new
> > documents that came in. Let's say we update the index once a day, then
> when
> > the indexed is updated, we do the optimization (this will be done at
> night
> > when there are not many users using the system).
> > But my index size will probably grow quite big (potentially can go up to
> > more than 1TB in the future), so does that have to be taken into
> > consideration too?
> >
> > Regards,
> > Edwin
> >
> >
> > On 21 December 2015 at 12:12, Erick Erickson 
> > wrote:
> >
> >> Much depends on how often the index is updated. If your index only
> >> changes, say, once a day then it's probably a good idea. If you're
> >> constantly updating your index, then I'd recommend that you do _not_
> >> optimize.
> >>
> >> Optimizing will create one large segment. That segment will be
> >> unlikely to be merged since it is so large relative to other segments
> >> for quite a while, resulting in significant wasted space. So if you're
> >> regularly indexing documents that _replace_ existing documents, this
> >> will skew your index.
> >>
> >> Bottom line:
> >> If you have a relatively static index the you can build and then use
> >> for an extended time (as in 12 hours plus) it can be worth the time to
> >> optimize. Otherwise I wouldn't bother.
> >>
> >> Best,
> >> Erick
> >>
> >> On Sun, Dec 20, 2015 at 7:57 PM, Zheng Lin Edwin Yeo
> >>  wrote:
> >> > Hi,
> >> >
> >> > I would like to find out, will it be good to do write a script to do
> an
> >> > auto-opitmization of the indexes at a certain time every day? Is there
> >> any
> >> > advantage to do so?
> >> >
> >> > I found that optimization can reduce the index size by quite a
> >> > signification amount, and allow the searching of the index to run
> faster.
> >> > But will there be advantage if we do the optimization every day?
> >> >
> >> > I'm using Solr 5.3.0
> >> >
> >> > Regards,
> >> > Edwin
> >>
>


Re: Auto-optimization for Solr indexes

2015-12-20 Thread Rahul Ramesh
Thanks Erick!

Rahul

On Mon, Dec 21, 2015 at 10:07 AM, Erick Erickson 
wrote:

> Rahul:
>
> bq:  we dont want the index sizes to grow too large and auto optimzie to
> kick in
>
> Not quite what's going on. There is no "auto optimize". What
> there is is background merging that will take _some_ segments and
> merge them together. Very occasionally this will be the same as a full
> optimize if it just happens that "some" means all the segments.
>
> bq: recovery takes a bit more time when it is not optimized
>
> I'd be interested in formal measurements here. A recovery that copied
> the _entire_ index down from the leader shouldn't really have that
> much be different between an optimized and non-optimized index, but
> all things are possible. If the recovery is a "peer sync" it shouldn't
> matter at all.
>
> If you're continually adding documents that _replace_ older documents,
> optimizing will recover any "holes" left by the old updated docs. An
> update is really a mark-as-deleted for the old version and a re-index
> of the new. Since segments are write-once, the old data is left there
> until the segment is merged. Now, one of the bits of information that
> goes into deciding whether to merge a segment or not is the size.
> Another is the percentage of deleted docs. When you optimize, you get
> one huge segment. Now you have to update a lot of docs for that
> segment to have a large percentage of deleted documents and be merged,
> thus wasting space and memory.
>
> So it's a tradeoff. But if you're getting satisfactory performance
> from what you have now, there's no reason to change.
>
> Here's a wonderful video about the process. You want the third one
> down (TieredMergePolicy) as that's the default.
>
>
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>
> Best,
> Erick
>
> On Sun, Dec 20, 2015 at 8:26 PM, Rahul Ramesh  wrote:
> > Hi Erick,
> > We index around several million documents/ day and we optimize everyday
> > when the relative load is low. The reason we optimize is, we dont want
> the
> > index sizes to grow too large and auto optimzie to kick in. When auto
> > optimize kicks in, it results in unpredictable performance as it is CPU
> and
> > IO intensive.
> >
> > In older solr (4.2), when the segment size grows too large, insertion
> used
> > to fail .  Have we seen this problem in solr cloud?
> >
> > Also, we have observed, recovery takes a bit more time when it is not
> > optimized. We dont have any quantitative measurement for the same. Its
> just
> > an observation. Is this correct observation?
> >
> > If we optimize it every day, the indexes will not be skewed right?
> >
> > Please let me know if my understanding is correct.
> >
> > Regards,
> > Rahul
> >
> > On Mon, Dec 21, 2015 at 9:54 AM, Erick Erickson  >
> > wrote:
> >
> >> You'll probably have to shard before you get to the TB range. At that
> >> point, all the optimization is done individually on each shard so it
> >> really doesn't matter how many shards you have.
> >>
> >> Just issuing
> >> http://solr:port/solr/collection/update?optimize=true
> >>
> >> is sufficient, that'll forward the optimize command to all the shards
> >> in the collection.
> >>
> >> Best,
> >> Erick
> >>
> >> On Sun, Dec 20, 2015 at 8:19 PM, Zheng Lin Edwin Yeo
> >>  wrote:
> >> > Thanks for your information Erick.
> >> >
> >> > We have yet to decide how often we will update the index to include
> new
> >> > documents that came in. Let's say we update the index once a day, then
> >> when
> >> > the indexed is updated, we do the optimization (this will be done at
> >> night
> >> > when there are not many users using the system).
> >> > But my index size will probably grow quite big (potentially can go up
> to
> >> > more than 1TB in the future), so does that have to be taken into
> >> > consideration too?
> >> >
> >> > Regards,
> >> > Edwin
> >> >
> >> >
> >> > On 21 December 2015 at 12:12, Erick Erickson  >
> >> > wrote:
> >> >
> >> >> Much depends on how often the index is updated. If your index only
> >> >> changes, say, once a day then it's probably a good idea. If you're
> >> >> constantly updating your inde

Re: Pros and cons of using Solr Cloud vs standard Master Slave Replica

2016-01-11 Thread Rahul Ramesh
Please have a look at this post

https://support.lucidworks.com/hc/en-us/articles/201298317-What-is-SolrCloud-And-how-does-it-compare-to-master-slave-

We don't use a master/slave architecture; we use Solr Cloud and
standalone Solr for our documents.

Indexing is a bit slower in cloud mode compared to standalone, I think
because of replication. However, you will get faster query
responses.

Solr Cloud also requires a slightly more elaborate setup with ZooKeeper
compared to master/slave or standalone.

However, once Solr Cloud is set up, it runs very smoothly and you don't have
to worry about performance / high availability.

Please check the post; a detailed analysis and comparison between the two
is given there.

-Rahul


On Mon, Jan 11, 2016 at 4:58 PM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> Hi guys,
>
>
>
> a customer needs a comprehensive list of all the pros and cons of using a standard
> Master Slave replica VS using Solr Cloud. I'm interested especially in
> query performance considerations, because in this specific situation the
> rate of new documents is really low, but the amount of data is about 50
> million documents, and the index size on disk for a single core is about
> 30 GB.
>
>
>
> Such an amount of data should be easily handled by a Master Slave replica
> with a single core replicated on a certain number of slaves, but we also need
> to evaluate the option of SolrCloud, especially for fault tolerance.
>
>
>
> I've googled around, but did not find anything really comprehensive, so
> I'm looking for real experience from you on the mailing list.
>
>
>
> Thanks in advance.
>
>
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
>


Understanding solr commit

2016-01-25 Thread Rahul Ramesh
We are facing an issue and we are finding it difficult to debug the
problem. We want to understand how Solr commit works.
A background on our setup:
We have a 3-node Solr cluster running version 5.3.1. It's an index-heavy
use case; at peak load, we index 400-500 documents/second.
We also want these documents to be visible as quickly as possible, hence we
run an external script which commits every 3 minutes.
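Such a script typically just issues an explicit commit against each collection's
update handler; a minimal sketch (host and collection names are placeholders):

 for coll in coll1 coll2 coll3; do
   # commit=true on the update handler triggers an explicit (hard) commit
   curl -s "http://solr-node:8983/solr/$coll/update?commit=true"
 done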

Consider the three nodes as N1, N2, N3. Commit is a synchronous operation,
so we will not get control back until the commit operation is complete.

Consider the following scenario. Although it looks like a basic scenario in
a distributed system :-), we just wanted to eliminate this possibility.

Step 1: At time T1, a commit happens on node N1.
Step 2: At the same time T1, we search for all the documents inserted on node
N2.

My questions are:

1. Is commit an atomic operation? I mean, will the commit happen on all the
nodes at the same time?
2. Can we say that the search result will always contain the documents either
before the commit or after the commit? Or can it happen that we get new
documents from N1, N2 but old documents (i.e., before the commit) from N3?

Thank you,
Rahul


Re: Understanding solr commit

2016-01-25 Thread Rahul Ramesh
Thanks for your replies.

A bit more detail about our setup:
The index size is close to 80GB spread across 30 collections. The main
memory available is around 32GB, so we are always short of memory!
Unfortunately we could not expand the memory, as the server motherboard
doesn't support it.

We tried Solr's auto commit features. However, sometimes we were getting
Java OOM exceptions, and when I started digging into it, somebody
suggested that I was not committing the collections often enough. So we started
committing the collections explicitly.

Please let me know if our approach is not correct.

*Emir*,
We are committing to the collection only once. We have nodes N1, N2 and N3,
and for a collection Coll1 the commit goes to N1/coll1 every 3 minutes;
we are not doing it for every node. We will remove the _shard<>_replica<> suffix and
use only the collection name to commit.

*Alessandro*,
We are using Solr Cloud with a replication factor of 2 and the number of shards
either 2 or 3.

Thanks,
Rahul









On Mon, Jan 25, 2016 at 4:43 PM, Alessandro Benedetti  wrote:

> Let me answer in line :
>
> On 25 January 2016 at 11:02, Rahul Ramesh  wrote:
>
> > We are facing some issue and we are finding it difficult to debug the
> > problem. We wanted to understand how solr commit works.
> > A background on our setup:
> > We have  3 Node Solr Cluster running in version 5.3.1. Its a index heavy
> > use case. In peak load, we index 400-500 documents/second.
> > We also want these documents to be visible as quickly as possible, hence
> we
> > run an external script which commits every 3 mins.
> >
>
> This is weird, why not using the auto-soft commit if you want visibility
> every 3 minutes ?
> Is there any particular reason you trigger the commit from the client ?
>
> >
> > Consider the three nodes as N1, N2, N3. Commit is an synchronous
> operation.
> > So, we will not get control till the commit operation is complete.
> >
> > Consider the following scenario. Although it looks like a basic scenario
> in
> > distributed system:-) but we just wanted to eliminate this possibility.
> >
> > step 1 : At time T1, commit happens to Node N1
> > step 2: At same time T1, we search for all the documents inserted in Node
> > N2.
> >
> > My question is
> >
> > 1. Is commit an atomic operation? I mean, will commit happen on all the
> > nodes at the same time?
> >
> Which kind of architecture of Solr are you using ? Are you using SolrCloud
> ?
>
> 2. Can we say that, the search result will always contain the documents
> > before commit / or after commit . Or can it so happen that we get new
> > documents fron N1, N2 but old documents (i.e., before commit)  from N3?
> >
> With a manual cluster it could faintly happen.
> In SolrCloud it should not, but I should double check the code !
>
> >
> > Thank you,
> > Rahul
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Understanding solr commit

2016-01-25 Thread Rahul Ramesh
Can you give us bit more details about Solr heap parameters.
Each node has 32GB of RAM and we are using 8GB for the heap.
The index size on each node is around 80GB.
Number of collections: 30


Also can you give us info about auto commit (both hard and soft) you used
when experienced OOM.
 15000 15000 false 

soft commit is not enabled.
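The fragment above lost its XML tags in the archive; it most likely corresponds
to an autoCommit block in solrconfig.xml, and the soft commit discussed in this
thread would be a separate autoSoftCommit block. A sketch (how the two 15000
values split between maxTime and maxDocs is an assumption):

 <autoCommit>
   <maxTime>15000</maxTime>
   <maxDocs>15000</maxDocs>
   <openSearcher>false</openSearcher>
 </autoCommit>

 <!-- soft commit every 3 minutes, matching the external commit interval -->
 <autoSoftCommit>
   <maxTime>180000</maxTime>
 </autoSoftCommit>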

-Rahul



On Mon, Jan 25, 2016 at 6:00 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Rahul,
> It is good that you commit only once, but not sure how external commits
> can do something auto commit cannot.
> Can you give us a bit more detail about your Solr heap parameters. Running Solr
> on the edge of OOM always risks starting a snowball effect and crashing the
> entire cluster. Also, can you give us info about the auto commit settings (both hard and
> soft) you used when you experienced the OOM.
>
> Thanks,
> Emir
>
> On 25.01.2016 12:28, Rahul Ramesh wrote:
>
>> Thanks for your replies.
>>
>> A bit more detail about our setup.
>> The index size is close to 80Gb spread across 30 collections. The main
>> memory available is around 32Gb. We are always in short of memory!
>> Unfortunately we could not expand the memory as the server motherboard
>> doesnt support it.
>>
>> We tried with solr auto commit features. However, sometimes we were
>> getting
>> Java OOM exception and when I start digging more about it, somebody
>> suggested that I am not committing the collections often. So, we started
>> committing the collections explicitly.
>>
>> Please let me know if our approach is not correct.
>>
>> *Emir*,
>> We are committing to the collection only once. We have Node N1, N2 and N3
>> and for a collection Coll1, commit will happen to N1/coll1 every 3
>> minutes.
>> we are not doing it for every node. We will remove _shard<>_replica<> and
>> use only the collection name to commit.
>>
>> *Alessandro*,
>>
>> We are using Solr Cloud with replication factor of 2 and no of shards as
>> either 2 or 3.
>>
>> Thanks,
>> Rahul
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Jan 25, 2016 at 4:43 PM, Alessandro Benedetti <
>> abenede...@apache.org
>>
>>> wrote:
>>> Let me answer in line :
>>>
>>> On 25 January 2016 at 11:02, Rahul Ramesh  wrote:
>>>
>>> We are facing some issue and we are finding it difficult to debug the
>>>> problem. We wanted to understand how solr commit works.
>>>> A background on our setup:
>>>> We have  3 Node Solr Cluster running in version 5.3.1. Its a index heavy
>>>> use case. In peak load, we index 400-500 documents/second.
>>>> We also want these documents to be visible as quickly as possible, hence
>>>>
>>> we
>>>
>>>> run an external script which commits every 3 mins.
>>>>
>>>> This is weird, why not using the auto-soft commit if you want visibility
>>> every 3 minutes ?
>>> Is there any particular reason you trigger the commit from the client ?
>>>
>>> Consider the three nodes as N1, N2, N3. Commit is an synchronous
>>>>
>>> operation.
>>>
>>>> So, we will not get control till the commit operation is complete.
>>>>
>>>> Consider the following scenario. Although it looks like a basic scenario
>>>>
>>> in
>>>
>>>> distributed system:-) but we just wanted to eliminate this possibility.
>>>>
>>>> step 1 : At time T1, commit happens to Node N1
>>>> step 2: At same time T1, we search for all the documents inserted in
>>>> Node
>>>> N2.
>>>>
>>>> My question is
>>>>
>>>> 1. Is commit an atomic operation? I mean, will commit happen on all the
>>>> nodes at the same time?
>>>>
>>>> Which kind of architecture of Solr are you using ? Are you using
>>> SolrCloud
>>> ?
>>>
>>> 2. Can we say that, the search result will always contain the documents
>>>
>>>> before commit / or after commit . Or can it so happen that we get new
>>>> documents fron N1, N2 but old documents (i.e., before commit)  from N3?
>>>>
>>>> With a manual cluster it could faintly happen.
>>> In SolrCloud it should not, but I should double check the code !
>>>
>>> Thank you,
>>>> Rahul
>>>>
>>>>
>>>
>>> --
>>> --
>>>
>>> Benedetti Alessandro
>>> Visiting card : http://about.me/alessandro_benedetti
>>>
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>>
>>> William Blake - Songs of Experience -1794 England
>>>
>>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Understanding solr commit

2016-01-25 Thread Rahul Ramesh
Thank you Emir and Alessandro for the inputs. We use Sematext for monitoring.
We understand that Solr needs more memory, but unfortunately that would mean
moving to an altogether new range of servers.
As you say, eventually we will have to upgrade our servers.

Thanks,
Rahul


On Mon, Jan 25, 2016 at 6:32 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Rahul,
> It is hard to tell without seeing metrics, but an 8GB heap seems small for
> such a setup - e.g. with an indexing buffer of 32MB and 30 collections, it will
> eat almost 1GB of memory.
> About commits, you can set auto commit to be more frequent (keep
> openSearcher=false) and add soft commits every 3 min.
> What you need to tune is your heap and heap-related settings - indexing
> buffer, caches. Not sure what you use for monitoring Solr, but Sematext's
> SPM (http://sematext.com/spm) is one such tool that can show you how
> your Solr, JVM and host handle different loads, and can give you
> enough info to tune your Solr.
>
> Regards,
> Emir
>
>
> On 25.01.2016 13:42, Rahul Ramesh wrote:
>
>> Can you give us bit more details about Solr heap parameters.
>> Each node has 32Gb of RAM and we are using 8Gb for heap.
>> Index size in each node is around 80Gb
>> #of collections 30
>>
>>
>> Also can you give us info about auto commit (both hard and soft) you used
>> when experienced OOM.
>>  15000 15000
>> >
>>> false 
>>>
>> soft commit is not enabled.
>>
>> -Rahul
>>
>>
>>
>> On Mon, Jan 25, 2016 at 6:00 PM, Emir Arnautovic <
>> emir.arnauto...@sematext.com> wrote:
>>
>> Hi Rahul,
>>> It is good that you commit only once, but not sure how external commits
>>> can do something auto commit cannot.
>>> Can you give us bit more details about Solr heap parameters. Running Solr
>>> on the edge of OOM is always risk of starting snowball effect and
>>> crashing
>>> entire cluster. Also can you give us info about auto commit (both hard
>>> and
>>> soft) you used when experienced OOM.
>>>
>>> Thanks,
>>> Emir
>>>
>>> On 25.01.2016 12:28, Rahul Ramesh wrote:
>>>
>>> Thanks for your replies.
>>>>
>>>> A bit more detail about our setup.
>>>> The index size is close to 80Gb spread across 30 collections. The main
>>>> memory available is around 32Gb. We are always in short of memory!
>>>> Unfortunately we could not expand the memory as the server motherboard
>>>> doesnt support it.
>>>>
>>>> We tried with solr auto commit features. However, sometimes we were
>>>> getting
>>>> Java OOM exception and when I start digging more about it, somebody
>>>> suggested that I am not committing the collections often. So, we started
>>>> committing the collections explicitly.
>>>>
>>>> Please let me know if our approach is not correct.
>>>>
>>>> *Emir*,
>>>> We are committing to the collection only once. We have Node N1, N2 and
>>>> N3
>>>> and for a collection Coll1, commit will happen to N1/coll1 every 3
>>>> minutes.
>>>> we are not doing it for every node. We will remove _shard<>_replica<>
>>>> and
>>>> use only the collection name to commit.
>>>>
>>>> *Alessandro*,
>>>>
>>>> We are using Solr Cloud with replication factor of 2 and no of shards as
>>>> either 2 or 3.
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jan 25, 2016 at 4:43 PM, Alessandro Benedetti <
>>>> abenede...@apache.org
>>>>
>>>> wrote:
>>>>> Let me answer in line :
>>>>>
>>>>> On 25 January 2016 at 11:02, Rahul Ramesh  wrote:
>>>>>
>>>>> We are facing some issue and we are finding it difficult to debug the
>>>>>
>>>>>> problem. We wanted to understand how solr commit works.
>>>>>> A background on our setup:
>>>>>> We have  3 Node Solr Cluster running in version 5.3.1. Its a index
>>>>>> heavy
>>>>>> use case. In peak load, we index 400-500 documents/second.
>>>>>> We also want these documents to be visible as quickly as possible,
>>>>>> hence
>>>>>>
>

Re: Increasing Solr5 time out from 30 seconds while starting solr

2015-12-08 Thread Rahul Ramesh
Hi Debraj,
I don't think increasing the timeout will help. Are you sure Solr or any other
program is not already running on 8789? Please check the output of lsof -i :8789.

Regards,
Rahul

On Tue, Dec 8, 2015 at 11:58 PM, Debraj Manna 
wrote:

> Can someone help me on this?
> On Dec 7, 2015 7:55 PM, "D"  wrote:
>
> > Hi,
> >
> > Many times while starting Solr I see the message below, and then Solr is
> > not reachable.
> >
> > debraj@boutique3:~/solr5$ sudo bin/solr start -p 8789
> > Waiting to see Solr listening on port 8789 [-]  Still not seeing Solr
> listening on 8789 after 30 seconds!
> >
> > However when I try to start solr again by trying to execute the same
> > command. It says that *"solr is already running on port 8789. Try using a
> > different port with -p"*
> >
> > I am having two cores in my local set-up. I am guessing this is happening
> > because one of the cores is a little big, so Solr is timing out while
> > loading that core. If I take one of the cores out of Solr then everything
> > works fine.
> >
> > Can some one let me know how can I increase this timeout value from
> > default 30 seconds?
> >
> > I am using Solr 5.2.1 on Debian 7.
> >
> > Thanks,
> >
> >
>


Re: Moving to SolrCloud, specifying dataDir correctly

2015-12-14 Thread Rahul Ramesh
We recently moved data from a magnetic drive to an SSD. We run Solr in cloud
mode; only the data is stored on the drive, the configuration is stored in ZK.
We start Solr using the -s option to specify the data dir.
Command to start Solr:
./bin/solr start -c -h <host> -p <port> -z <zk hosts> -s <data dir>
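As a concrete illustration (hostnames, ports and paths below are placeholders,
not the actual setup):

 ./bin/solr start -c -h solr-node1.example.com -p 8983 \
     -z zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr \
     -s /ssd/solr/data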

We followed the following steps to migrate data

1. Stop all new insertions.
2. Copy the Solr data to the new location.
3. Restart the server with the -s option pointing to the new Solr directory name.
4. We have a 3-node Solr cluster; the restarted server will get in sync
with the other two servers.
5. Repeat this procedure for the other two servers.

We used a similar procedure to upgrade from 5.2.1 to 5.3.1.





On Tue, Dec 15, 2015 at 5:07 AM, Jeff Wartes  wrote:

>
> Don’t set solr.data.dir. Instead, set the install dir. Something like:
> -Dsolr.solr.home=/data/solr
> -Dsolr.install.dir=/opt/solr
>
> I have many solrcloud collections, and separate data/install dirs, and
> I’ve never had to do anything with manual per-collection or per-replica
> data dirs.
>
> That said, it’s been a while since I set this up, and I may not remember
> all the pieces.
> You might need something like this too, for example:
>
> -Djetty.home=/opt/solr/server
>
>
> On 12/14/15, 3:11 PM, "Erick Erickson"  wrote:
>
> >Currently, it'll be a little tedious but here's what you can do (going
> >partly from memory)...
> >
> >When you create the collection, specify the special value EMPTY for
> >createNodeSet (Solr 5.3+).
> >Use ADDREPLICA to add each individual replica. When you do this, you
> >can add a dataDir for
> >each individual replica and thus keep them separate, i.e. for a
> >particular box the first
> >replica would get /data/solr/collection1_shard1_replica1, the second
> >/data/solr/collection1_shard2_replica1 and so forth.
> >
> >If you don't have Solr 5.3+, you can still to the same thing, except
> >you create your collection letting
> >the replicas fall where they will. Then do the ADDREPLICA as above.
> >When that's all done,
> >DELETEREPLICA for the original replicas.
> >
> >Best,
> >Erick
> >
> >On Mon, Dec 14, 2015 at 2:21 PM, Tom Evans 
> >wrote:
> >> On Mon, Dec 14, 2015 at 1:22 PM, Shawn Heisey 
> >>wrote:
> >>> On 12/14/2015 10:49 AM, Tom Evans wrote:
>  When I tried this in SolrCloud mode, specifying
>  "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine
>  for the first collection, but then the second collection tried to use
>  the same directory to store its index, which obviously failed. I fixed
>  this by changing solrconfig.xml in each collection to specify a
>  specific directory, like so:
> 
>    <dataDir>${solr.data.dir:}products</dataDir>
> 
>  Looking back after the weekend, I'm not a big fan of this. Is there a
>  way to add a core.properties to ZK, or a way to specify
>  core.baseDatadir on the command line, or just a better way of handling
>  this that I'm not aware of?
> >>>
> >>> Since you're running SolrCloud, just let Solr handle the dataDir, don't
> >>> try to override it.  It will default to "data" relative to the
> >>> instanceDir.  Each instanceDir is likely to be in the solr home.
> >>>
> >>> With SolrCloud, your cores will not contain a "conf" directory (unless
> >>> you create it manually), therefore the on-disk locations will be *only*
> >>> data, there's not really any need to have separate locations for
> >>> instanceDir and dataDir.  All active configuration information for
> >>> SolrCloud is in zookeeper.
> >>>
> >>
> >> That makes sense, but I guess I was asking the wrong question :)
> >>
> >> We have our SSDs mounted on /data/solr, which is where our indexes
> >> should go, but our solr install is on /opt/solr, with the default solr
> >> home in /opt/solr/server/solr. How do we change where the indexes get
> >> put so they end up on the fast storage?
> >>
> >> Cheers
> >>
> >> Tom
>
>


Re: solrcloud used a lot of memory and memory keep increasing during long time run

2015-12-15 Thread Rahul Ramesh
You should actually decrease the Solr heap size. Let me explain a bit.

Solr requires relatively little heap memory for its own operation, and more
memory outside the heap for keeping index data in main memory. This is because
Solr uses mmap for the index files.
Please check the link
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html for
an explanation of how Solr operates on files.

Solr has the typical garbage collection problem once you set the heap size to a
large value: it will have indeterminate pauses due to GC. The amount of
heap memory required is difficult to predict. The way we tuned this
parameter was to set it to a low value and increase it by 1GB whenever
an OOM was thrown.

Please check the problem of having large Java Heap

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap


Just for your reference, in our production setup we have data of around
60GB/node spread across 25 collections. We have configured an 8GB heap and
leave the rest of the memory for the OS to manage. We do around 1000
(searches + inserts)/second on this data.
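For reference, the heap is typically set in solr.in.sh (or with bin/solr's -m
option); a minimal sketch of the kind of setting described above (the value
reflects this particular setup, not a general recommendation):

 # solr.in.sh: cap the JVM heap and leave the remaining RAM to the OS page
 # cache for the mmapped index files
 SOLR_HEAP="8g"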

I hope this helps.

Regards,
Rahul



On Tue, Dec 15, 2015 at 4:33 PM, zhenglingyun  wrote:

> Hi, list
>
> I’m new to solr. Recently I encounter a “memory leak” problem with
> solrcloud.
>
> I have two 64GB servers running a solrcloud cluster. In the solrcloud, I
> have
> one collection with about 400k docs. The index size of the collection is
> about
> 500MB. Memory for solr is 16GB.
>
> Following is "ps aux | grep solr” :
>
> /usr/java/jdk1.7.0_67-cloudera/bin/java
> -Djava.util.logging.config.file=/var/lib/solr/tomcat-deployment/conf/logging.properties
> -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
> -Djava.net.preferIPv4Stack=true -Dsolr.hdfs.blockcache.enabled=true
> -Dsolr.hdfs.blockcache.direct.memory.allocation=true
> -Dsolr.hdfs.blockcache.blocksperbank=16384
> -Dsolr.hdfs.blockcache.slab.count=1 -Xms16608395264 -Xmx16608395264
> -XX:MaxDirectMemorySize=21590179840 -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
> -Xloggc:/var/log/solr/gc.log
> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -DzkHost=
> bjzw-datacenter-hadoop-160.d.yourmall.cc:2181,
> bjzw-datacenter-hadoop-163.d.yourmall.cc:2181,
> bjzw-datacenter-hadoop-164.d.yourmall.cc:2181/solr
> -Dsolr.solrxml.location=zookeeper -Dsolr.hdfs.home=hdfs://datacenter/solr
> -Dsolr.hdfs.confdir=/var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/hadoop-conf
> -Dsolr.authentication.simple.anonymous.allowed=true
> -Dsolr.security.proxyuser.hue.hosts=*
> -Dsolr.security.proxyuser.hue.groups=* -Dhost=
> bjzw-datacenter-solr-15.d.yourmall.cc -Djetty.port=8983 -Dsolr.host=
> bjzw-datacenter-solr-15.d.yourmall.cc -Dsolr.port=8983
> -Dlog4j.configuration=file:///var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/log4j.properties
> -Dsolr.log=/var/log/solr -Dsolr.admin.port=8984
> -Dsolr.max.connector.thread=1 -Dsolr.solr.home=/var/lib/solr
> -Djava.net.preferIPv4Stack=true -Dsolr.hdfs.blockcache.enabled=true
> -Dsolr.hdfs.blockcache.direct.memory.allocation=true
> -Dsolr.hdfs.blockcache.blocksperbank=16384
> -Dsolr.hdfs.blockcache.slab.count=1 -Xms16608395264 -Xmx16608395264
> -XX:MaxDirectMemorySize=21590179840 -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
> -Xloggc:/var/log/solr/gc.log
> -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -DzkHost=
> bjzw-datacenter-hadoop-160.d.yourmall.cc:2181,
> bjzw-datacenter-hadoop-163.d.yourmall.cc:2181,
> bjzw-datacenter-hadoop-164.d.yourmall.cc:2181/solr
> -Dsolr.solrxml.location=zookeeper -Dsolr.hdfs.home=hdfs://datacenter/solr
> -Dsolr.hdfs.confdir=/var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/hadoop-conf
> -Dsolr.authentication.simple.anonymous.allowed=true
> -Dsolr.security.proxyuser.hue.hosts=*
> -Dsolr.security.proxyuser.hue.groups=* -Dhost=
> bjzw-datacenter-solr-15.d.yourmall.cc -Djetty.port=8983 -Dsolr.host=
> bjzw-datacenter-solr-15.d.yourmall.cc -Dsolr.port=8983
> -Dlog4j.configuration=file:///var/run/cloudera-scm-agent/process/6288-solr-SOLR_SERVER/log4j.properties
> -Dsolr.log=/var/log/solr -Dsolr.admin.port=8984
> -Dsolr.max.connector.thread=1 -Dsolr.solr.home=/var/lib/solr
> -Djava.endorsed.dirs=/usr/lib/bigtop-tomcat/endorsed -classpath
> /usr/lib/bigtop-tomcat/bin/bootstrap.jar
> -Dcatalina.base=/var/lib/solr/tomcat-deployment
> -Dcatalina.home=/usr/lib/bigtop-tomcat -Djava.io.tmpdir=/var/lib/solr/
> org.apache.catalina.startup.Bootstrap start
>
>
> solr version is solr4.4.0-cdh5.3.0
> jdk version is 1.7.0_67
>
> Soft commit time is 1.5s. And we have real time indexing/partialupdating
> rate about 100 docs per second.
>
> When fresh started, Solr will use ab

Querying Nested documents

2015-07-13 Thread Ramesh Nuthalapati
(Duplicate post as the xml is not formatted well in nabble, so posting
directly to the list)

Hi, I have a question regarding nested documents.

My document looks like the one below,

 
1234
xger


0
0
   2015-06-15T13:29:07Z
ege
Duper
http://www.domain.com
zoome
parent

1234-images
http://somedomain.com/some.jpg
1:1


1234-platform-ios
ios
https://somedomain.com
somelink
false
2015-03-23T10:58:00Z
-12-30T19:00:00Z


1234-platform-android

android
somedomain.com
somelink
false
2015-03-23T10:58:00Z
-12-30T19:00:00Z
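The document above lost its XML tags in the archive; a minimal sketch of what
such a parent/child update document typically looks like in Solr's XML update
format (the field names here are illustrative, apart from image_uri_s which
appears in the queries below):

 <add>
   <doc>
     <field name="id">1234</field>
     <field name="type">parent</field>
     <!-- child: image -->
     <doc>
       <field name="id">1234-images</field>
       <field name="image_uri_s">http://somedomain.com/some.jpg</field>
     </doc>
     <!-- children: platform entries -->
     <doc>
       <field name="id">1234-platform-ios</field>
       <field name="platform_s">ios</field>
     </doc>
     <doc>
       <field name="id">1234-platform-android</field>
       <field name="platform_s">android</field>
     </doc>
   </doc>
 </add>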



Right now I can query like this

http://localhost:8983/solr/demo/select?q=
{!parent%20which=%27type:parent%27}&fl=*,[child%20parentFilter=type:parent%20childFilter=image_uri_s:*]&indent=true

and get the parent and child document with matching criteria (just parent
and image child document).

*But I want to get all the other children* (1234-platform-ios and
1234-platform-android) even if I query based on image_uri_s (1234-images),
since they are the other children that belong to the same parent document.

Is it possible?

Appreciate your help !

Thanks,
Ramesh


Re: Querying Nested documents

2015-07-14 Thread Ramesh Nuthalapati
Yes, you are right.

So the query you are suggesting should be like the one below, or did I
misunderstand it?

http://localhost:8983/solr/demo/select?q= {!parent
which='type:parent'}&fl=*,[child parentFilter=type:parent
childFilter=(image_uri_s:somevalue) OR (-image_uri_s:*)]&indent=true

If so, I am getting an error with parsing field name.





Re: Querying Nested documents

2015-07-15 Thread Ramesh Nuthalapati
Mikhail - 

This worked great.

http://localhost:8983/solr/demo/select?q={!parent 
which='type:parent'}image_uri_s:somevalue&fl=*,[child 
parentFilter=type:parent 
childFilter=-type:parent]&indent=true 

Thank you.





Solr 6.1.x Release Date ??

2016-06-06 Thread Ramesh Shankar
Hi,

Any idea of the Solr 6.1.x release date?

I am interested in the [subquery] transformer and would like to know the release
date, since it's available only in 6.1.x.

Thanks & Regards
Ramesh


Re: Solr 6.1.x Release Date ??

2016-06-09 Thread Ramesh Shankar
Hi,

I found the [subquery] transformer working in the solr-6.1.0-79 nightly builds.
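For reference, a [subquery] request follows this pattern (the field names and
the "depts" subquery prefix below are illustrative, not from an actual schema):

 q=name_s:john
 &fl=name_s,depts:[subquery]
 &depts.q={!terms f=dept_id_s v=$row.dept_ss}
 &depts.rows=10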

Regards
Ramesh

On Tue, Jun 7, 2016 at 11:08 AM, Ramesh Shankar  wrote:

> Hi,
>
> Any idea of Solr 6.1.X Release Date ??
>
> I am interested in the [subquery] transformer and like to know the release
> date since its available only in 6.1.x
>
> Thanks & Regards
> Ramesh
>


Re: Solr 6.1.x Release Date ??

2016-06-16 Thread Ramesh shankar
Hi,

Yes, I used the solr-6.1.0-79 nightly builds and the [subquery] transformer is
working fine in them. Any idea of the expected release date for 6.1?

Regards
Ramesh





How to retain the original format of input document in search results in SOLR - Tomcat

2013-11-20 Thread ramesh py
Hi All,



I am new to Apache Solr. Recently I was able to configure Solr with
Tomcat successfully, and it is working fine except for the format of the search
results, i.e., the search results are not displayed in the same format as the
input document.



I am doing the below things



1.   Indexing the xml file into solr

2.   Format of the xml as below

**

some text

 Title1: descriptions of the title

Title2 : description of the title2

Title3 : description of title3



some text 





3.   After indexing, the results are displayed in the format below.



*F1 : *some text

*F2*: Title1: descriptions of the title Title2 : description of the title2
Title3 : description of title3

*F3*: some text



*Expected Result :*



*F1 : *some text

*F2*: Title1: descriptions of the title

  Title2 : description of the title2

  Title3 : description of title3

*F3*: some text





If you look at the F2 field, the format is getting changed, i.e., the input
format of the F2 field is line by line for each subtitle, but in the result it
is displayed as a single line.





I would like the results displayed so that whenever a subtitle occurs in the xml
file for any field, that subtitle appears on a new line in the
results.



Can anyone please help on this. Thanks in advance.





Regards,

Ramesh p.y

-- 
Ramesh P.Y
pyrames...@gmail.com
Mobile No:+91-9176361984


7.3.1: Query of death - all nodes ran out of memory and had to be shut down

2018-08-20 Thread Ash Ramesh
Hi everyone,

We ran into an issue yesterday where all our EC2 machines running Solr
ran out of memory and could not heal themselves. I'll try to break down what
happened here.

*System Architecture:*

- Solr Version: 7.3.1
- Replica Types: TLOG/PULL
- Num Shards: 8 (default hashing mechanism)
- Doc Count: > 20m
- Index Size: 17G
- EC2 Machine Spec: 16 Core | 32G ram | 100G SSD
- Num EC2 Machines: 7+ (scales up and down)
- Max Shards per node (one node per EC2 instance): 8 (some nodes had 4,
some had 8)
- Num TLOG shard replicas: 3 (3 copies of each shard as TLOG)
- Num PULL shard replicas: 3+
- Heap: 4G

*What was run prior to the issue:*

We ran these queries around 2.55pm

We ran a bunch of deep paginated queries (offset of 1,000,000) with a
filter query. We set the timeout to 5 seconds and it did time out. We aren't
sure if this is what caused the irrecoverable failure, but by reading this
-
https://lucene.apache.org/solr/guide/7_4/pagination-of-results.html#performance-problems-with-deep-paging
, we feel that this was the cause.

We did not use a cursor.
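(For comparison, the cursor-based approach replaces the large start offset with
a cursorMark parameter plus a sort that ends on the uniqueKey field; a rough
sketch, assuming id is the uniqueKey and reusing the field name from our queries:)

 First request:
   /solr/media/select?q=*:*&fq=someField:someValue&rows=10&sort=id+asc&cursorMark=*
 Each response includes a nextCursorMark value, which is passed back as
 cursorMark on the next request instead of increasing start.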

This cluster was healthy for about 1 week, but we noticed the degradation
soon after (within 30min) of running the offset queries mentioned above. We
currently use a single sharded collection in production, however are
transitioning to an 8 shard cluster. We hit this issue in a controlled 8
sharded environment, but don't notice any issues on our production (single
sharded) cluster. On production the query still timed out (with same num
docs etc.) but didn't go into a crazy state.

*What Happened:*

- All the EC2 instances started logging OOM error. None of the nodes were
responsive to new requests.
- We saw that the Heap usage jumped from an average of 2.7G to the max of
4G within a 5 minute window.
- CPU across all 16 cores was at 100%
- We saw that the distributed requests were timing out across all machines.
- We shutdown all the machines that only had PULL replicas on them and it
still didn't 'fix' itself.
- Eventually we shut down SOLR on the main node which had all the master
TLOG replicas. Once restarted, the machine started working again.


*Questions:*
- Did this deep pagination query *DEFINITELY* cause this issue?
- Is each node single threaded? I don't think so, but I'd like to confirm
that.
- Is there any configuration that we could use to avoid this in the future?
- Why could the nodes not recover by themselves? When we ran the same query
on the single shard cluster it failed and didn't spin out of control.

Thanks for all your help, Logs are pasted below from different timestamps.

Regards,
Ash

*Logs:*

Here are some logs we collected. Not sure if it tells a lot outside of what
we know.

*Time: 2.55pm ~ Requests are failing to complete in time*

> ERROR RequestHandlerBase org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Time allowed to handle
> this request exceeded:[
> http://10.0.9.204:8983/solr/media_shard1_replica_p57,
> http://10.0.9.204:8983/solr/media_shard4_replica_p80,
> http://10.0.9.204:8983/solr/media_shard3_replica_p73,
> http://10.0.9.204:8983/solr/media_shard2_replica_p68]
> #011at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:410)
> #011at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:195)
> #011at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
> #011at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:711)
> #011at org.apache.s...
>  #011at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
>  #011at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
>  #011at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
>  #011at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
>  #011at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
>  #011at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
>  #011at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  #011at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
>  #011at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
>  #011at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>  #011at
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>  #011at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>  #011at org.eclipse.jetty.server.Server.handle(Server.java:530)
>  #011at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)
>  #011at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
>  #011at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>  #011at
> org

Re: 7.3.1: Query of death - all nodes ran out of memory and had to be shut down

2018-08-20 Thread Ash Ramesh
Hi Erick,

Sorry, I phrased that the wrong way. I meant to ask whether there is a high
probability that this was the cause of the issue. Do you know why Solr itself
isn't able to recover, or is that to be expected when allowing such deep
pagination? We are going to remove it going forward, but want to make sure that
we find the root cause.

Appreciate your help as always :)

Ash

On Tue, Aug 21, 2018 at 2:59 PM Erick Erickson 
wrote:

> Did the large offsets _definitely_ cause the OOM? How do you expect
> that to be answerable? It's likely though. To return rows 1,000,000
> through 1,000,010 the system has to keep a list of 1,000,010 top
> documents. It has to be this way because you don't know (and can't
> guess) the score of a doc prior to, well, scoring it. And these very
> large structures are kept for every query being processed. Not only
> will that chew up memory, it'll chew up CPU cycles as well as this an
> ordered list.
>
> This is an anti-pattern, cursors were invented because this pattern is
> very costly (as you're finding out).
>
> Further, 4G isn't very much memory by modern standards.
>
> So it's very likely (but not guaranteed) that using cursors will fix
> this problem.
>
> Best,
> Erick
>
>
>
> On Mon, Aug 20, 2018 at 8:55 PM, Ash Ramesh  wrote:
> > Hi everyone,
> >
> > We ran into an issue yesterday where all our ec2 machines, running solr,
> > ran out of memory and could not heal themselves. I'll try break down what
> > happened here.
> >
> > *System Architecture:*
> >
> > - Solr Version: 7.3.1
> > - Replica Types: TLOG/PULL
> > - Num Shards: 8 (default hashing mechanism)
> > - Doc Count: > 20m
> > - Index Size: 17G
> > - EC2 Machine Spec: 16 Core | 32G ram | 100G SSD
> > - Num EC2 Machines: 7+ (scales up and down)
> > - Max Shards per node (one node per EC2 instance): 8 (some nodes had 4,
> > some had 8)
> > - Num TLOG shard replicas: 3 (3 copies of each shard as TLOG)
> > - Num PULL shard replicas: 3+
> > - Heap: 4G
> >
> > *What was run prior to the issue:*
> >
> > We ran these queries around 2.55pm
> >
> > We ran a bunch of deep paginated queries (offset of 1,000,000) with a
> > filter query. We set the timeout to 5 seconds and it did timeout. We
> aren't
> > sure if this is what caused the irrecoverable failure, but by reading
> this
> > -
> >
> https://lucene.apache.org/solr/guide/7_4/pagination-of-results.html#performance-problems-with-deep-paging
> > , we feel that this was the cause.
> >
> > We did not use a cursor.
> >
> > This cluster was healthy for about 1 week, but we noticed the degradation
> > soon after (within 30min) of running the offset queries mentioned above.
> We
> > currently use a single sharded collection in production, however are
> > transitioning to an 8 shard cluster. We hit this issue in a controlled 8
> > sharded environment, but don't notice any issues on our production
> (single
> > sharded) cluster. On production the query still timed out (with same num
> > docs etc.) but didn't go into a crazy state.
> >
> > *What Happened:*
> >
> > - All the EC2 instances started logging OOM error. None of the nodes were
> > responsive to new requests.
> > - We saw that the Heap usage jumped from an average of 2.7G to the max of
> > 4G within a 5 minute window.
> > - CPU across all 16 cores was at 100%
> > - We saw that the distributed requests were timing out across all
> machines.
> > - We shutdown all the machines that only had PULL replicas on them and it
> > still didn't 'fix' itself.
> > - Eventually we shut down SOLR on the main node which had all the master
> > TLOG replicas. Once restarted, the machine started working again.
> >
> >
> > *Questions:*
> > - Did this deep pagination query *DEFINITELY* cause this issue?
> > - Is each node single threaded? I don't think so, but I'd like to confirm
> > that.
> > - Is there any configuration that we could use to avoid this in the
> future?
> > - Why could the nodes not recover by themselves? When we ran the same
> query
> > on the single shard cluster it failed and didn't spin out of control.
> >
> > Thanks for all your help, Logs are pasted below from different
> timestamps.
> >
> > Regards,
> > Ash
> >
> > *Logs:*
> >
> > Here are some logs we collected. Not sure if it tells a lot outside of
> what
> > we know.
> >
> > *Time: 2.55pm ~ Requests are failing to

Re: 7.3.1: Query of death - all nodes ran out of memory and had to be shut down

2018-08-22 Thread Ash Ramesh
Thank you all :) We have made the necessary changes to mitigate this issue

On Wed, Aug 22, 2018 at 6:01 AM Shawn Heisey  wrote:

> On 8/20/2018 9:55 PM, Ash Ramesh wrote:
> > We ran a bunch of deep paginated queries (offset of 1,000,000) with a
> > filter query. We set the timeout to 5 seconds and it did timeout. We
> aren't
> > sure if this is what caused the irrecoverable failure, but by reading
> this
> > -
> >
> https://lucene.apache.org/solr/guide/7_4/pagination-of-results.html#performance-problems-with-deep-paging
> > , we feel that this was the cause.
>
> Yes, this is most likely the cause.
>
> Since you have three shards, the problem is even worse than Erick
> described.  Those 1,000,010 results will be returned by EVERY shard, and
> consolidated on the machine that's actually making the query.  So it
> will have three million results in memory that it must sort.
>
> Unless you're running on Windows, the bin/solr script will configure
> Java to kill itself when OutOfMemoryError occurs.  It does this because
> program behavior after OOME occurs is completely unpredictable, so
> there's a good chance that if it keeps running, it will corrupt the index.
>
> If you're going to be doing queries like this, you need a larger heap.
> There's no way around that.
>
> Thanks,
> Shawn
>
>








Understanding how timeAllowed works in a distributed cluster

2018-08-22 Thread Ash Ramesh
Hi again,

Specs: 7.3.1 | 8 Shards | Solr Cloud

I was wondering how the timeAllowed parameter works when you architect your
cluster in a sharded and distributed manner. This is the curl command and
the timing

Query:
http://localhost:9983/solr/media/select?fq=someField:someRedactedValue&q=*:*&qq=*&rows=10&start=40&timeAllowed=5000

Time: 14.16 sec

The request actually returns results, which apparently is expected, but why
does it not time out? I've grepped the internet to understand this, but
still haven't found a full explanation. I inferred that the behaviour is
different between a single-shard cluster and a multi-shard cluster.

Best Regards,

Ash









Re: Understanding how timeAllowed works in a distributed cluster

2018-08-22 Thread Ash Ramesh
Sorry, I didn't understand your answer. Could you break that down a bit
further? Thank you :)


On Wed, Aug 22, 2018 at 11:53 PM Mikhail Khludnev  wrote:

> timeAllowed is not a hard real-time limit (Java itself isn't real-time). There are only a few
> places where it can break out of the request.
>
> On Wed, Aug 22, 2018 at 4:09 PM Ash Ramesh  wrote:
>
> > Hi again,
> >
> > Specs: 7.3.1 | 8 Shards | Solr Cloud
> >
> > I was wondering how the timeAllowed parameter works when you architect
> your
> > cluster in a sharded and distributed manner. This is the curl command and
> > the timing
> >
> > Query:
> >
> >
> http://localhost:9983/solr/media/select?fq=someField:someRedactedValue&q=*:*&qq=*&rows=10&start=40&timeAllowed=5000
> > <
> >
> http://localhost:9983/solr/media/select/images?fq=brand:BACsIfMoQVQ&q=*:*&qq=*&rows=10&start=40&timeAllowed=5000
> > >
> > Time: 14.16 sec
> >
> > The request actually returns results, which apparently is expected, but
> why
> > does it not timeout? I've grepped the internet to understand this, but
> > still haven't gotten a full explanation. I inferred that the behaviour is
> > different between a single sharded cluster and multi sharded cluster.
> >
> > Best Regards,
> >
> > Ash
> >
> > --
> > *P.S. We've launched a new blog to share the latest ideas and case
> studies
> > from our team. Check it out here: product.canva.com
> > <http://product.canva.com/>. ***
> > ** <https://canva.com>Empowering the world
> > to design
> > Also, we're hiring. Apply here!
> > <https://about.canva.com/careers/>
> >  <https://twitter.com/canva>
> > <https://facebook.com/canva> <https://au.linkedin.com/company/canva>
> > <https://instagram.com/canva>
> >
> >
> >
> >
> >
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>
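For anyone else reading along: timeAllowed is only checked at a few points during a request (query expansion / term enumeration and document collection inside Lucene), so any work outside those checks can run well past the limit - which is how a 14-second response with timeAllowed=5000 is possible. When the limit does trip, Solr returns whatever was collected so far and marks the response as partial. A sketch of what to look for, with placeholder names:

  /solr/mycollection/select?q=*:*&fq=someField:someValue&rows=10&timeAllowed=5000

  "responseHeader": {
    "partialResults": true,
    ...
  }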








Solr Cloud not routing to PULL replicas

2018-08-28 Thread Ash Ramesh
Hi again,

We are currently using Solr 7.3.1 and have an 8-shard collection. All our
TLOGs are on separate machines & PULLs on others. Since not all shards are
on the same machine, the request will be distributed. However, we are
seeing that most of the 'distributed' parts of the requests are being
routed to the TLOG machines. This is evident as the TLOGs are saturated at
80%+ CPU while the PULL machines are sitting at 25%, even though the load
balancer only routes to the PULL machines. I know we can use
'preferLocalShards', but that still doesn't solve the problem.

Is there something we have configured incorrectly? We are currently rushing
to upgrade to 7.4.0 so we can take advantage of
'shards.preference=replica.location:local,replica.type:PULL' parameter. In
the meantime, we would like to know if there is a reason for this behavior
and if there is anything we can do to avoid it.

Thank you & regards,

Ash
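For reference, once on 7.4 the preference can be sent per request or baked into the handler defaults; a sketch with placeholder names:

  /solr/mycollection/select?q=*:*&shards.preference=replica.location:local,replica.type:PULL

or in solrconfig.xml:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="shards.preference">replica.location:local,replica.type:PULL</str>
    </lst>
  </requestHandler>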









Potential bug? maxConnectionsPerHost on requestHandler configuration

2018-09-10 Thread Ash Ramesh
Hi,

I tried setting up a bespoke ShardHandlerFactory configuration for each
request handler in solrconfig.xml. However when I stepped through the code
in debug mode (via IntelliJ) I could see that the ShardHandler created and
used in the searcher still didn't reflect the values in solrconfig (even
after a core RELOAD).

I did find that it did reflect changes to the ShardHandlerFactory in
solr.xml when I changed it, pushed to ZK and restarted Solr.

Is this expected or am I going about this the wrong way.

Example RequestHandler syntax:




6
1000
99
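For comparison, the documented shape for configuring a shard handler on an individual request handler is roughly the following (the numeric values here are placeholders, not the ones above):

  <requestHandler name="/select" class="solr.SearchHandler">
    <!-- usual defaults ... -->
    <shardHandlerFactory class="HttpShardHandlerFactory">
      <int name="socketTimeout">1000</int>
      <int name="connTimeout">5000</int>
      <int name="maxConnectionsPerHost">20</int>
    </shardHandlerFactory>
  </requestHandler>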



We are trying to understand why our machines have their CPUs stalling at
60-80% all the time. We suspect it's because of the maxConnections, but ran
into this issue first.

Best Regards,

Ash









Questions about stored fields and updates.

2018-11-03 Thread Ash Ramesh
Hi everyone,

My company currently uses SOLR to completely hydrate client objects by
storing all fields (stored=true). Therefore we have 2 types of fields:

   1. indexed=true | stored=true : For fields that will be used for
   searching, sorting, etc.
   2. indexed=false | stored=true: For fields that only need hydrating for
   clients

We are re-architecting this so that we will eventually only get the id from
SOLR (fl=id) and hydrate from another data source. This means we can
obviously delete all the indexed=false | stored=true fields to reduce our
index size.

However, when it comes to the indexed=true | stored=true fields, we are not
sure whether to also set them to be stored=false and perform in-place
updates or leave it as is and perform atomic updates. We've done a fair bit
of research on the archives of this mailing list, but are still a bit
confused:

1. Will converting fields from indexed=true | stored=true to
indexed=true | stored=false reduce our index size? Will it also
make indexing less compute-expensive, since there is less stored-field
data to compress?
2. Are atomic updates preferred to in-place updates? Obviously if we move
to index only fields, then we have to do in-place updates all the time.
This isn't an issue for us, but we are a bit concerned about how SOLR's
indexing speed will suffer & deleted docs increase. Currently we perform
both.

Some points about our SOLR usecase:
- 40-60M docs with 8 shards (PULL/TLOG structure) Solr 7.4
- No need for extremely fast indexing
- Need for high query throughput (thus why we only want to retrieve the id
field and hydrate with a faster db store)

Thanks everyone, always appreciate the good information being shared here
daily :)

Regards,

Ash









Re: Questions about stored fields and updates.

2018-11-04 Thread Ash Ramesh
Sorry Shawn,

I seem to have gotten my wording wrong. I meant that we wanted to move away
from atomic-updates to replacing/reindexing the document entirely again
when changes are made.
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-index-handlers.html#adding-documents

Regards,

Ash

On Mon, Nov 5, 2018 at 11:29 AM Shawn Heisey  wrote:

> On 11/3/2018 9:45 PM, Ash Ramesh wrote:
> > My company currently uses SOLR to completely hydrate client objects by
> > storing all fields (stored=true). Therefore we have 2 types of fields:
> >
> > 1. indexed=true | stored=true : For fields that will be used for
> > searching, sorting, etc.
> > 2. indexed=false | stored=true: For fields that only need hydrating
> for
> > clients
> >
> > We are re-architecting this so that we will eventually only get the id
> from
> > SOLR (fl=id) and hydrate from another data source. This means we can
> > obviously delete all the indexed=false | stored=true fields to reduce our
> > index size.
> >
> > However, when it comes to the indexed=true | stored=true fields, we are
> not
> > sure whether to also set them to be stored=false and perform in-place
> > updates or leave it as is and perform atomic updates. We've done a fair
> bit
> > of research on the archives of this mailing list, but are still a bit
> > confused:
> >
> > 1. Will having the fields be converted from indexed=true | stored=true ->
> > indexed=true | stored=false cause our index size to reduce? Will it also
> > mean that indexing will be less compute expensive due to the compression
> of
> > stored field logic?
>
> Pretty much anything you change from true to false in the schema will
> reduce index size.
>
> Removal of stored data will not *directly* improve query speed -- stored
> data is not used during the query phase.  It might *indirectly* increase
> query speed by removing data from the OS disk cache, leaving more room
> for inverted index data.
>
> The direct improvement from removing stored data will be during data
> retrieval (after the query itself).  It will also mean there is less
> data to compress, which means that indexing speed might increase.
>
> > 2. Are atomic updates preferred to in-place updates? Obviously if we move
> > to index only fields, then we have to do in-place updates all the time.
> > This isn't an issue for us, but we are a bit concerned about how SOLR's
> > indexing speed will suffer & deleted docs increase. Currently we perform
> > both.
>
> If you change stored to false, you will most likely not be able to do
> atomic updates.  Atomic update functionality has very specific
> requirements:
>
>
> https://lucene.apache.org/solr/guide/7_5/updating-parts-of-documents.html#field-storage
>
> In-place updates have requirements that are even more strict than atomic
> updates -- the field cannot be indexed:
>
>
> https://lucene.apache.org/solr/guide/7_5/updating-parts-of-documents.html#in-place-updates
>
> Thanks,
> Shawn
>
>
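For reference, the in-place update requirements in that last link boil down to a single-valued, docValues-only field; a sketch with a hypothetical field name and type:

  <field name="popularity" type="pfloat" indexed="false" stored="false" docValues="true"/>

An in-place update then uses the normal atomic-update syntax, e.g.:

  { "id": "doc-1", "popularity": { "set": 42.0 } }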








Re: Questions about stored fields and updates.

2018-11-04 Thread Ash Ramesh
Also thanks for the information Shawn! :)

On Mon, Nov 5, 2018 at 12:09 PM Ash Ramesh  wrote:

> Sorry Shawn,
>
> I seem to have gotten my wording wrong. I meant that we wanted to move
> away from atomic-updates to replacing/reindexing the document entirely
> again when changes are made.
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-index-handlers.html#adding-documents
>
> Regards,
>
> Ash
>
> On Mon, Nov 5, 2018 at 11:29 AM Shawn Heisey  wrote:
>
>> On 11/3/2018 9:45 PM, Ash Ramesh wrote:
>> > My company currently uses SOLR to completely hydrate client objects by
>> > storing all fields (stored=true). Therefore we have 2 types of fields:
>> >
>> > 1. indexed=true | stored=true : For fields that will be used for
>> > searching, sorting, etc.
>> > 2. indexed=false | stored=true: For fields that only need hydrating
>> for
>> > clients
>> >
>> > We are re-architecting this so that we will eventually only get the id
>> from
>> > SOLR (fl=id) and hydrate from another data source. This means we can
>> > obviously delete all the indexed=false | stored=true fields to reduce
>> our
>> > index size.
>> >
>> > However, when it comes to the indexed=true | stored=true fields, we are
>> not
>> > sure whether to also set them to be stored=false and perform in-place
>> > updates or leave it as is and perform atomic updates. We've done a fair
>> bit
>> > of research on the archives of this mailing list, but are still a bit
>> > confused:
>> >
>> > 1. Will having the fields be converted from indexed=true | stored=true
>> ->
>> > indexed=true | stored=false cause our index size to reduce? Will it also
>> > mean that indexing will be less compute expensive due to the
>> compression of
>> > stored field logic?
>>
>> Pretty much anything you change from true to false in the schema will
>> reduce index size.
>>
>> Removal of stored data will not *directly* improve query speed -- stored
>> data is not used during the query phase.  It might *indirectly* increase
>> query speed by removing data from the OS disk cache, leaving more room
>> for inverted index data.
>>
>> The direct improvement from removing stored data will be during data
>> retrieval (after the query itself).  It will also mean there is less
>> data to compress, which means that indexing speed might increase.
>>
>> > 2. Are atomic updates preferred to in-place updates? Obviously if we
>> move
>> > to index only fields, then we have to do in-place updates all the time.
>> > This isn't an issue for us, but we are a bit concerned about how SOLR's
>> > indexing speed will suffer & deleted docs increase. Currently we perform
>> > both.
>>
>> If you change stored to false, you will most likely not be able to do
>> atomic updates.  Atomic update functionality has very specific
>> requirements:
>>
>>
>> https://lucene.apache.org/solr/guide/7_5/updating-parts-of-documents.html#field-storage
>>
>> In-place updates have requirements that are even more strict than atomic
>> updates -- the field cannot be indexed:
>>
>>
>> https://lucene.apache.org/solr/guide/7_5/updating-parts-of-documents.html#in-place-updates
>>
>> Thanks,
>> Shawn
>>
>>








Problem understanding why QPS is so low

2019-03-19 Thread Ash Ramesh
Hi everybody,

My team runs a Solr cluster with very low QPS throughput. I have been
going through the different configurations in our setup, and I think that
it's probably the way we have defined our request handlers that is causing
the slowness.

Details of our cluster are below the fold.

*Questions:*

   1. Obviously we have a set of 'expensive' boosts here. Are there any
   obvious anti-patterns in the request handler?
   2. Is it normal for such a request handler to max out at around 13 QPS
   before latency starts hitting 2-4 seconds?
   3. Have we maybe architected our cluster incorrectly?
   4. Are there any patterns we should adopt to increase throughput?


Thank you so much for taking time to read this email. We would really
appreciate any feedback. We are happy to provide more details into our
cluster if needed.

Regards,

Ash

*Information about our Cluster:*

   - *Solr Version: *7.4.0
   - *Architecture: *TLOG/PULL - 8 Shards (Default shard hashing)
  - Doc Count: 50 Million ~
  - TLOG - EC2 Machines hosting TLOGs have all 8 shards. Approximately
  12G index total
  - PULL - EC2 machines host 2 shards each. There are 4 ASGs such that each
  ASG hosts one of the shard combinations - [shard1, shard2], [shard3,
  shard4], [shard5, shard6], [shard7, shard8]
 - We scale up on CPU utilisation
  - Schema: No stored fields (except for `id`)
  - Indexing: Use the SolrJ Zookeeper Client to talk directly to TLOGs
  to update (fully replace) documents
 - Deleted docs: Between 10-25% depending on when the merge policy
 was last executed
  - Request Serving: PULL ASGs are wrapped around an ELB, such that we
  use the SolrJ HTTP Client to make requests.
 - All read requests are sent with the
 '"shards.preference=replica.location:local,replica.type:PULL"' in an
 attempt to direct all traffic to PULL nodes.
 - *Average QPS* per full copy of the index (PULL nodes of
   shard1-shard8): *13 queries per second*
   - *Heap Size PULL: *15G
  - Index is fully memory mapped with extra RAM to spare on all PULL
  machines
   - *Solr Caches:*
  - Document Cache: Disabled - No stored fields, seems pointless
  - Query Cache: Disabled - too many distinct queries, so no reason to use
  this
  - Filter Cache: 1600 in size (900 autowarm) - we have a set of well
  defined filter queries, we are thinking of increasing this since hit rate
  is 0.86
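For reference, the filterCache sizing described in the last bullet corresponds to a solrconfig.xml entry along these lines (the class shown is the stock default, not taken from our actual config):

  <filterCache class="solr.FastLRUCache"
               size="1600"
               initialSize="1600"
               autowarmCount="900"/>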

*Example Request Handler (Obfuscated field names and boost values)*


  en

  edismax
  10
  id
  * _query_

  fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 fieldE^0.99
  fieldA_$${lang}
  fieldB_$${lang}
  fieldC_$${lang}
  fieldD_$${lang}
  textContent_$${lang}

  2
  0.99
  true

  fieldA_$${lang}^0.99 fieldB_$${lang}^0.99

  {!type=edismax v=$qq}

  {!edismax qf=fieldA^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}
  {!edismax qf=fieldB^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}
  {!edismax qf=fieldC^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}
  {!edismax qf=fieldD^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}

  {!edismax qf=fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}

  {!func}mul(termfreq(docBoostFieldB,$qq),100)
  if(termfreq(docBoostFieldB,$qq),1,def(docBoostFieldA,1))

  elevator

*Notes:*

   - *We have a data science team that feeds back click through data to the
   boostFields to re-order results for popular queries*
   - *We do sorting on 'score DESC dateSubmitted DESC'*
   - *We use the 'elevator' component quite heavily - e.g.
   'elevateIds=A,B,C'*
   - *We have some localized fields - thus we do aliasing in the request
   handler*









Performance problems with extremely common terms in collection (Solr 7.4)

2019-04-07 Thread Ash Ramesh
Hi everybody,

We have a corpus of 50+ million documents in our collection. I've noticed
that some queries with specific keywords are extremely slow, e.g.
q='photography' or q='background'. After digging into the raw
documents, I could see that these two terms appear in greater than 90% of
all documents, which means Solr has to score each of those documents.

Is there a best practice for dealing with these sorts of queries? Should Solr be
able to handle them quickly under normal conditions (we have 8 shards)? The
average, reproducible response time for these queries is between
1.5 and 2.5 seconds.

Please let me know if more information is required.

Regards.

Ash









Re: Performance problems with extremely common terms in collection (Solr 7.4)

2019-04-08 Thread Ash Ramesh
Hi Toke,

Thanks for the prompt reply. I'm glad to hear that this is a common
problem. In regards to stop words, I've been thinking about trying that
out. In our business case, most of these terms are keywords related to
stock photography, therefore it's natural for 'photography' or 'background'
to appear commonly in a document's keyword list. It seems unlikely we can
use the common-grams solution for our use case.

Regards,

Ash

On Mon, Apr 8, 2019 at 5:01 PM Toke Eskildsen  wrote:

> On Mon, 2019-04-08 at 09:58 +1000, Ash Ramesh wrote:
> > We have a corpus of 50+ million documents in our collection. I've
> > noticed that some queries with specific keywords tend to be extremely
> > slow.
> > E.g. the q=`photography' or q='background'. After digging into the
> > raw documents, I could see that these two terms appear in greater
> > than 90% of all documents, which means solr has to score each of
> > those documents.
>
> That is known behaviour, which can be remedied somewhat. Stop words is
> a common approach, but your samples does not seem to fit well with
> that. Instead you can look at Common Grams, where your high-frequency
> words gets concatenated with surrounding words. This only works with
> phrases though. There's a nice article at
>
>
> https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
>
> - Toke Eskildsen, Royal Danish Library
>
>
>
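If you do end up trying it, Common Grams is configured in the field's analysis chain; a minimal sketch (type name and words file are placeholders):

  <fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>

As the linked article notes, this only helps phrase-style queries; a bare single-term query like q=background still has to visit every posting for that term.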









Re: Solr running on Tomcat

2018-02-16 Thread Ramesh b
Yes, we are running Solr 5.3.1 in Tomcat for our production system. It
serves us well.

Thanks,
Ramesh

> On Feb 15, 2018, at 10:54 PM, GVK Prasad  wrote:
> 
> 
> I read some posts on setting up Solr to Run on Tomcat. But all these posts 
> are about Solr version 4.0 or earlier. 
> I am thinking of hosting  Solr on Tomcat for scalability. 
> Any recommendation on this. 
> 
> Prasad
> 
> 
> 
> 
> ---
> This email has been checked for viruses by Avast antivirus software.
> https://www.avast.com/antivirus


Memory requirements for TLOGs (7.3.1)

2018-07-17 Thread Ash Ramesh
Hi everybody,

I have a quick question about what the memory requirements for TLOG
machines are on 7.3.1. We currently run replication where there are 3 TLOGs
with 8gb ram (2gb heap) and N PULL replicas with 32gb ram (4gb heap). We
have > 10M documents (1 collection) with the index size being ~ 17gb. We
send all read traffic to the PULLs and send Updates and Additions to the
Leader TLOG.

We are wondering how this setup can affect performance for replication,
etc. We are thinking of increasing the heap of the TLOG to 4gb but leaving
the total memory on the machine at 8gb. What will that do to performance?
We also expect our index to grow 3/4x in the next 6 months.

Any assistance would be well appreciated :)

Regards,

Ash









Re: Memory requirements for TLOGs (7.3.1)

2018-07-18 Thread Ash Ramesh
Thanks for the quick responses Shawn & Erick! Just to clarify another few
points:
 1. Does having a larger heap size impact ingesting additional documents to
the index (all CRUD operations) onto a TLOG?
 2. Does having a larger ram configured machine (in this case 32gb) affect
ingestion on TLOGS also?
 3. We are currently routing queries via Amazon ASG / Load Balancer. Is
this one of the recommended ways to set up SOLR infrastructure?

Best Regards,

Ash


On Thu, Jul 19, 2018 at 12:56 AM Erick Erickson 
wrote:

> There's little good reason to _not_ route searches to your TLOG
> replicas. The only difference between the PULL and TLOG replicas is
> that the TLOG replicas get a raw copy of the incoming document from
> the leader and write them to the TLOG. I.e. there's some additional
> I/O.
>
> It's possible that if you have extremely heavy indexing you might
> notice some additional load on the TLOG .vs. PULL replicas, but from
> what you've said I doubt you have that much indexing traffic.
>
> So basically I'd configure my TLOG and PULL replicas pretty much
> identically and search them both.
>
> Best,
> Erick
>
> On Wed, Jul 18, 2018 at 7:46 AM, Shawn Heisey  wrote:
> > On 7/18/2018 12:04 AM, Ash Ramesh wrote:
> >>
> >> I have a quick question about what the memory requirements for TLOG
> >> machines are on 7.3.1. We currently run replication where there are 3
> >> TLOGs
> >> with 8gb ram (2gb heap) and N PULL replicas with 32gb ram (4gb heap). We
> >> have > 10M documents (1 collection) with the index size being ~ 17gb. We
> >> send all read traffic to the PULLs and send Updates and Additions to the
> >> Leader TLOG.
> >>
> >> We are wondering how this setup can affect performance for replication,
> >> etc. We are thinking of increasing the heap of the TLOG to 4gb but
> leaving
> >> the total memory on the machine at 8gb. What will that do to
> performance?
> >> We also expect our index to grow 3/4x in the next 6 months.
> >
> >
> > Performance has more to do with index size and memory size than the type
> of
> > replication you're doing.
> >
> > SolrCloud will load balance queries across the cloud, so your low-memory
> > TLOG replicas are most likely handling queries as well.  In a SolrCloud
> > cluster, a query is not necessarily handled by the machine that you send
> the
> > query to.
> >
> > With memory resources that low compared to index size, the 8GB machines
> > probably do not perform queries as well as the 32GB machines.  If you
> > increase the heap to 4GB, that will only leave 4GB available for the OS
> disk
> > cache, and that's going to drop query performance even further.
> >
> > There is a feature in Solr 7.4 that will allow you to prefer certain
> replica
> > types, so you can tell Solr that it should prefer PULL replicas.  But
> since
> > you're running 7.3.1, you don't have that feature.
> >
> > https://issues.apache.org/jira/browse/SOLR-11982
> >
> > There is also a "preferLocalShards" parameter that has existed for longer
> > than the new feature mentioned above.  This tells Solr that it should not
> > load balance queries in the cloud if there is a local index that can
> satisfy
> > the query.  This parameter should only be used if you have an external
> load
> > balancer.
> >
> > Indexing is a heap-intensive operation that doesn't benefit much from
> having
> > a lot of extra memory for the operating system. I have no idea whether
> 2GB
> > of heap is enough or not.  Increasing the heap size MIGHT make
> performance
> > better, or it might make no difference at all.
> >
> > Thanks,
> > Shawn
> >
>
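For what it's worth, the heap change being discussed is normally made in solr.in.sh (the exact path depends on how Solr was installed); a sketch:

  # solr.in.sh
  SOLR_HEAP="4g"
  # or, equivalently:
  # SOLR_JAVA_MEM="-Xms4g -Xmx4g"

With the machine staying at 8 GB of RAM, that leaves roughly 4 GB for the OS disk cache, which is the trade-off Shawn describes above.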








Best Practises around relevance tuning per query

2020-02-17 Thread Ashwin Ramesh
Hi,

We are in the process of applying a scoring model to our search results. In
particular, we would like to add scores for documents per query and user
context.

For example, we want to have a score from 500 to 1 for the top 500
documents for the query “dog” for users who speak US English.

We believe it becomes infeasible to store these scores in Solr because we
want to update the scores regularly, and the number of scores increases
rapidly with increased user attributes.

One solution we explored was to store these scores in a secondary data
store, and use this at Solr query time with a boost function such as:

`bf=mul(termfreq(id,’ID-1'),500) mul(termfreq(id,'ID-2'),499) …
mul(termfreq(id,'ID-500'),1)`

We have over a hundred thousand documents in one Solr collection, and about
fifty million in another Solr collection. We have some queries for which
roughly 80% of the results match, although this is an edge case. We wanted
to know the worst case performance, so we tested with such a query. For
both of these collections we found a message similar to the following
in the Solr cloud logs (tested on a laptop):

Elapsed time: 5020. Exceeded allowed search time: 5000 ms.

We then tried using the following boost, which seemed simpler:

`boost=if(query($qq), 10, 1)&qq=id:(ID-1 OR ID-2 OR … OR ID-500)`

We then saw the following in the Solr cloud logs:

`The request took too long to iterate over terms.`

All responses above took over 5000 milliseconds to return.

We are considering Solr’s re-ranker, but I don’t know how we would use this
without pushing all the query-context-document scores to Solr.


The alternative solution that we are currently considering involves
invoking multiple solr queries.

This means we would make a request to solr to fetch the top N results (id,
score) for the query. E.g. q=dog, fq=featureA:foo, fq=featureB=bar, limit=N.

Another request would be made using a filter query with a set of doc ids
that we know are high value for the user’s query. E.g. q=*:*,
fq=featureA:foo, fq=featureB:bar, fq=id:(d1, d2, d3), limit=N.

We would then do a reranking phase in our service layer.

Do you have any suggestions for known patterns of how we can store and
retrieve scores per user context and query?

Regards,
Ash & Spirit.














Re: Best Practises around relevance tuning per query

2020-02-18 Thread Ashwin Ramesh
ping on this :)

On Tue, Feb 18, 2020 at 11:50 AM Ashwin Ramesh  wrote:

> Hi,
>
> We are in the process of applying a scoring model to our search results.
> In particular, we would like to add scores for documents per query and user
> context.
>
> For example, we want to have a score from 500 to 1 for the top 500
> documents for the query “dog” for users who speak US English.
>
> We believe it becomes infeasible to store these scores in Solr because we
> want to update the scores regularly, and the number of scores increases
> rapidly with increased user attributes.
>
> One solution we explored was to store these scores in a secondary data
> store, and use this at Solr query time with a boost function such as:
>
> `bf=mul(termfreq(id,’ID-1'),500) mul(termfreq(id,'ID-2'),499) …
> mul(termfreq(id,'ID-500'),1)`
>
> We have over a hundred thousand documents in one Solr collection, and
> about fifty million in another Solr collection. We have some queries for
> which roughly 80% of the results match, although this is an edge case. We
> wanted to know the worst case performance, so we tested with such a query.
> For both of these collections we found the a message similar to the
> following in the Solr cloud logs (tested on a laptop):
>
> Elapsed time: 5020. Exceeded allowed search time: 5000 ms.
>
> We then tried using the following boost, which seemed simpler:
>
> `boost=if(query($qq), 10, 1)&qq=id:(ID-1 OR ID-2 OR … OR ID-500)`
>
> We then saw the following in the Solr cloud logs:
>
> `The request took too long to iterate over terms.`
>
> All responses above took over 5000 milliseconds to return.
>
> We are considering Solr’s re-ranker, but I don’t know how we would use
> this without pushing all the query-context-document scores to Solr.
>
>
> The alternative solution that we are currently considering involves
> invoking multiple solr queries.
>
> This means we would make a request to solr to fetch the top N results (id,
> score) for the query. E.g. q=dog, fq=featureA:foo, fq=featureB=bar, limit=N.
>
> Another request would be made using a filter query with a set of doc ids
> that we know are high value for the user’s query. E.g. q=*:*,
> fq=featureA:foo, fq=featureB:bar, fq=id:(d1, d2, d3), limit=N.
>
> We would then do a reranking phase in our service layer.
>
> Do you have any suggestions for known patterns of how we can store and
> retrieve scores per user context and query?
>
> Regards,
> Ash & Spirit.
>













Re: Best Practises around relevance tuning per query

2020-02-26 Thread Ashwin Ramesh
Hi everybody,

Thank you for all the amazing feedback. I apologize for the formatting of
my question.

I guess if I were to generalize my question: 'What are the most common
approaches to storing query-level features in Solr documents?'

For example, a normalized_click_score is a document-level feature, but how
would you scalably do the same per query? E.g. how do you
express *for the query 'ipod', this specific document is very relevant*?

Thanks again!

Regards,

Ash

On Wed, Feb 19, 2020 at 6:14 PM Jörn Franke  wrote:

> You are too much focus on the solution. If you would describe the business
> case in more detail without including the solution itself more people could
> help.
>
> Eg it ie not clear why you have a scoring model and why this can address
> business needs.
>
> > Am 18.02.2020 um 01:50 schrieb Ashwin Ramesh :
> >
> > Hi,
> >
> > We are in the process of applying a scoring model to our search results.
> In
> > particular, we would like to add scores for documents per query and user
> > context.
> >
> > For example, we want to have a score from 500 to 1 for the top 500
> > documents for the query “dog” for users who speak US English.
> >
> > We believe it becomes infeasible to store these scores in Solr because we
> > want to update the scores regularly, and the number of scores increases
> > rapidly with increased user attributes.
> >
> > One solution we explored was to store these scores in a secondary data
> > store, and use this at Solr query time with a boost function such as:
> >
> > `bf=mul(termfreq(id,’ID-1'),500) mul(termfreq(id,'ID-2'),499) …
> > mul(termfreq(id,'ID-500'),1)`
> >
> > We have over a hundred thousand documents in one Solr collection, and
> about
> > fifty million in another Solr collection. We have some queries for which
> > roughly 80% of the results match, although this is an edge case. We
> wanted
> > to know the worst case performance, so we tested with such a query. For
> > both of these collections we found the a message similar to the following
> > in the Solr cloud logs (tested on a laptop):
> >
> > Elapsed time: 5020. Exceeded allowed search time: 5000 ms.
> >
> > We then tried using the following boost, which seemed simpler:
> >
> > `boost=if(query($qq), 10, 1)&qq=id:(ID-1 OR ID-2 OR … OR ID-500)`
> >
> > We then saw the following in the Solr cloud logs:
> >
> > `The request took too long to iterate over terms.`
> >
> > All responses above took over 5000 milliseconds to return.
> >
> > We are considering Solr’s re-ranker, but I don’t know how we would use
> this
> > without pushing all the query-context-document scores to Solr.
> >
> >
> > The alternative solution that we are currently considering involves
> > invoking multiple solr queries.
> >
> > This means we would make a request to solr to fetch the top N results
> (id,
> > score) for the query. E.g. q=dog, fq=featureA:foo, fq=featureB=bar,
> limit=N.
> >
> > Another request would be made using a filter query with a set of doc ids
> > that we know are high value for the user’s query. E.g. q=*:*,
> > fq=featureA:foo, fq=featureB:bar, fq=id:(d1, d2, d3), limit=N.
> >
> > We would then do a reranking phase in our service layer.
> >
> > Do you have any suggestions for known patterns of how we can store and
> > retrieve scores per user context and query?
> >
> > Regards,
> > Ash & Spirit.
> >
> > --
> > **
> > ** <https://www.canva.com/>Empowering the world to design
> > Also, we're
> > hiring. Apply here! <https://about.canva.com/careers/>
> >
> > <https://twitter.com/canva> <https://facebook.com/canva>
> > <https://au.linkedin.com/company/canva> <https://twitter.com/canva>
> > <https://facebook.com/canva>  <https://au.linkedin.com/company/canva>
> > <https://instagram.com/canva>
>













Overseer & Backups - Questions

2020-03-09 Thread Ashwin Ramesh
Hi everybody,

Quick Specs:
- Solr 7.4 Solr Cloud
- 30gb index on 8 shards Tlog/Pull

We run daily backups on our 30GB index and noticed that the Overseer does
not process other jobs on its task list while the backup is being taken.
They remain on the pending list (in ZK). Is this expected?

Also, is there a safe way to cancel a currently running
task or to delete pending tasks?

Regards,

Ash














Re: Overseer & Backups - Questions

2020-03-10 Thread Ashwin Ramesh
We use the Collections API to invoke backups. The tasks we noticed
stalling were ADDREPLICA. As expected, once the backup completed a few hours
ago, those tasks then completed. Is there some concurrency setting for
these tasks? Or is a backup a blocking task? We noticed that the index was
still being flushed to segments, though.

Regards,

Ash

On Wed, Mar 11, 2020 at 3:18 AM Aroop Ganguly
 wrote:

> May we know how you are invoking backups ?
>
> > On Mar 9, 2020, at 11:53 PM, Ashwin Ramesh 
> wrote:
> >
> > Hi everybody,
> >
> > Quick Specs:
> > - Solr 7.4 Solr Cloud
> > - 30gb index on 8 shards Tlog/Pull
> >
> > We run daily backups on our 30gb index and noticed that the overseer does
> > not process other jobs on it's task list while the backup is being taken.
> > They remain on the pending list (in ZK). Is this expected?
> >
> > Also I was wondering if there was a safe way to cancel a currently
> running
> > task or deleing pending tasks?
> >
> > Regards,
> >
> > Ash
> >
> > --
> > **
> > ** <https://www.canva.com/>Empowering the world to design
> > Also, we're
> > hiring. Apply here! <https://about.canva.com/careers/>
> >
> > <https://twitter.com/canva> <https://facebook.com/canva>
> > <https://au.linkedin.com/company/canva> <https://twitter.com/canva>
> > <https://facebook.com/canva>  <https://au.linkedin.com/company/canva>
> > <https://instagram.com/canva>
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>













Re: Overseer & Backups - Questions

2020-03-10 Thread Ashwin Ramesh
Hey Aroop,

Yes we sent ASYNC=

Backups are taken to an EFS drive (AWS's managed NFS)

I also thought it was async and Solr can process multiple tasks at once.
But the ZK state definitely showed that only the backup task was in
progress while all the other tasks were queued up.

Regards,

Ash

On Wed, Mar 11, 2020 at 9:21 AM Aroop Ganguly
 wrote:

> Backups on hdfs ?
> These should not be blocking if invoked asynchronously, are u doing them
> async by passing the async flag?
>
> > On Mar 10, 2020, at 3:19 PM, Ashwin Ramesh 
> wrote:
> >
> > We use the collection API to invoke backups. The tasks we noticed that
> > stalled are ADDREPLICA. As expected when the backup completed a few hours
> > ago, the task then got completed. Is there some concurrency setting with
> > these tasks? Or is a backup a blocking task? We noticed that the index
> was
> > still being flushed to segments though.
> >
> > Regards,
> >
> > Ash
> >
> > On Wed, Mar 11, 2020 at 3:18 AM Aroop Ganguly
> >  wrote:
> >
> >> May we know how you are invoking backups ?
> >>
> >>> On Mar 9, 2020, at 11:53 PM, Ashwin Ramesh 
> >> wrote:
> >>>
> >>> Hi everybody,
> >>>
> >>> Quick Specs:
> >>> - Solr 7.4 Solr Cloud
> >>> - 30gb index on 8 shards Tlog/Pull
> >>>
> >>> We run daily backups on our 30gb index and noticed that the overseer
> does
> >>> not process other jobs on it's task list while the backup is being
> taken.
> >>> They remain on the pending list (in ZK). Is this expected?
> >>>
> >>> Also I was wondering if there was a safe way to cancel a currently
> >> running
> >>> task or deleing pending tasks?
> >>>
> >>> Regards,
> >>>
> >>> Ash
> >>>
> >>> --
> >>> **
> >>> ** <https://www.canva.com/>Empowering the world to design
> >>> Also, we're
> >>> hiring. Apply here! <https://about.canva.com/careers/>
> >>>
> >>> <https://twitter.com/canva> <https://facebook.com/canva>
> >>> <https://au.linkedin.com/company/canva> <https://twitter.com/canva>
> >>> <https://facebook.com/canva>  <https://au.linkedin.com/company/canva>
> >>> <https://instagram.com/canva>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> > --
> > **
> > ** <https://www.canva.com/>Empowering the world to design
> > Also, we're
> > hiring. Apply here! <https://about.canva.com/careers/>
> >
> > <https://twitter.com/canva> <https://facebook.com/canva>
> > <https://au.linkedin.com/company/canva> <https://twitter.com/canva>
> > <https://facebook.com/canva>  <https://au.linkedin.com/company/canva>
> > <https://instagram.com/canva>
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>













LTR - FieldValueFeature Question

2020-04-23 Thread Ashwin Ramesh
Hi everybody,

Do we need to have 'indexed=true' to be able to retrieve the value of a
field via FieldValueFeature, or is having docValues=true enough?

Currently, we have some dynamic fields defined as [dynamicField=true, stored=false,
indexed=false, docValues=true]. However, we are noticing that the value
extracted is '0.0'.

This is the code I read for FieldValueFeature:
https://github.com/apache/lucene-solr/blob/master/solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/FieldValueFeature.java

Thanks,

Ash














Solr 7.4 - LTR reranker not adhering by Elevate Plugin

2020-05-14 Thread Ashwin Ramesh
Hi everybody,

We are running a query with both elevateIds=1,2,3 & a reranker phase using
LTR plugin.

We noticed that the results do not return in the expected order - per the
elevateIds param.
Example LTR rq param {!ltr.model=foo reRankDocs=250 efi.query=$q}

When I used the standard reranker ({!rerank reRankQuery=$titleQuery
reRankDocs=1000 reRankWeight=3}) , it did adhere.

I assumed it's because the elevate plugin runs before the reranker (LTR).
However I'm finding it hard to confirm. The model is a linear model.

Is this expected behaviour?

Regards,

Ash














Cannot add replica during backup

2020-08-10 Thread Ashwin Ramesh
Hi everybody,

We are using Solr 7.6 (SolrCloud). We noticed that when a backup is
running, we cannot add any replicas to the collection. By the looks of it,
the job to add the replica is put into the Overseer queue, but it is not
being processed. Is this expected? And are there any workarounds?

Our backups take about 12 hours. Maybe we should try to optimize that too.

Regards,

Ash














Re: Backups in SolrCloud using snapshots of individual cores?

2020-08-10 Thread Ashwin Ramesh
I would love an answer to this too!

On Fri, Aug 7, 2020 at 12:18 AM Bram Van Dam  wrote:

> Hey folks,
>
> Been reading up about the various ways of creating backups. The whole
> "shared filesystem for Solrcloud backups"-thing is kind of a no-go in
> our environment, so I've been looking for ways around that, and here's
> what I've come up with so far:
>
> 1. Stop applications from writing to solr
>
> 2. Commit everything
>
> 3. Identify a single core for each shard in each collection
>
> 4. Snapshot that core using CREATESNAPSHOT in the Collections API
>
> 5. Once complete, re-enable application write access to Solr
>
> 6. Create a backup from these snapshots using the replication handler's
> backup function (replication?command=backup&commitName=mySnapshot)
>
> 7. Put the backups somewhere safe
>
> 8. Clean up snapshots
>
>
> This seems ... too good to be true? I've seen so many threads about how
> hard it is to create backups in SolrCloud on this mailing list over the
> years, but this seems pretty straightforward? Am I missing some
> glaringly obvious reason why this will fail catastrophically?
>
> Using Solr 7.7 in this case.
>
> Feedback much appreciated!
>
> Thanks,
>
>  - Bram
>
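For what it's worth, steps 4 and 6 of the quoted procedure map to calls roughly like the following (collection, core and location names are placeholders):

  # step 4: snapshot the current commit of a collection
  /solr/admin/collections?action=CREATESNAPSHOT&collection=myCollection&commitName=mySnapshot

  # step 6: back up that snapshot via the replication handler on the chosen core
  /solr/myCollection_shard1_replica_n1/replication?command=backup&commitName=mySnapshot&location=/backups/shard1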














Re: Cannot add replica during backup

2020-08-10 Thread Ashwin Ramesh
Hey Aroop, the general process for our backup is:
- Connect all machines to an EFS drive (AWS's NFS service)
- Call the collections API to backup into EFS
- ZIP the directory once the backup is completed
- Copy the ZIP into an s3 bucket

I'll probably have to see which part of the process is the slowest.
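For reference, the backup call itself is the Collections API BACKUP action; run asynchronously it returns immediately and you poll the request id. A sketch with placeholder names and paths:

  /solr/admin/collections?action=BACKUP&name=nightly-2020-08-10&collection=media&location=/mnt/efs/solr-backups&async=backup-2020-08-10

  # poll until the state reports completed
  /solr/admin/collections?action=REQUESTSTATUS&requestid=backup-2020-08-10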

On another note, can you simply remove the task from the ZK path to
continue the execution of tasks?

Regards,

Ash

On Tue, Aug 11, 2020 at 11:40 AM Aroop Ganguly
 wrote:

> 12 hours is extreme, we take backups of 10TB worth of indexes in 15 mins
> using the collection backup api.
> How are you taking the backup?
>
> Do you actually see any backup progress or u are just seeing the task in
> the overseer queue linger ?
> I have seen restore tasks hanging in the queue forever despite process
> completing in Solr 77 so wouldn’t be surprised this happens with backup as
> well. And also observed that unless that unless that task is removed from
> the overseer-collection-queue the next ones do not proceed.
>
> Also adding replicas while backup seems like overkill, why don’t you just
> have the appropriate replication factor in the first place and have
> autoAddReplicas=true for indemnity?
>
> > On Aug 10, 2020, at 6:32 PM, Ashwin Ramesh 
> wrote:
> >
> > Hi everybody,
> >
> > We are using solr 7.6 (SolrCloud). We notices that when the backup is
> > running, we cannot add any replicas to the collection. By the looks of
> it,
> > the job to add the replica is put into the Overseer queue, but it is not
> > being processed. Is this expected? And are there any workarounds?
> >
> > Our backups take about 12 hours. Maybe we should try optimize that too.
> >
> > Regards,
> >
> > Ash
> >
> > --
> > **
> > ** <https://www.canva.com/>Empowering the world to design
> > Share accurate
> > information on COVID-19 and spread messages of support to your community.
> >
> > Here are some resources
> > <
> https://about.canva.com/coronavirus-awareness-collection/?utm_medium=pr&utm_source=news&utm_campaign=covid19_templates>
>
> > that can help.
> > <https://twitter.com/canva> <https://facebook.com/canva>
> > <https://au.linkedin.com/company/canva> <https://twitter.com/canva>
> > <https://facebook.com/canva>  <https://au.linkedin.com/company/canva>
> > <https://instagram.com/canva>
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>













Re: Cannot add replica during backup

2020-08-10 Thread Ashwin Ramesh
Hi Aroop,

We have 16 shards, each approx 30GB - ~480GB in total. I'm also pretty sure
it's a network issue. Very interesting that you can back up 20x that amount of data in
15 min!

>> It would also help to ensure your overseer is on a node with a role that
exempts it from any Solr index responsibilities.
How would I ensure this? First I'm hearing about this!

Thanks for all the help!!

On Tue, Aug 11, 2020 at 11:48 AM Aroop Ganguly
 wrote:

> Hi Ashwin
>
> Thanks for sharing this detail.
> Do you mind sharing how big are each of these indices ?
> I am almost sure this is network capacity and constraints related per your
> aws setup.
>
> Yes if you can confirm that the backup is complete, or you just want the
> system to move on discarding the backup process, your removal of the backup
> flag from zookeeper will help Solr in moving on to the next task in the
> queue.
>
> It would also help to ensure your overseer is on a node with a role that
> exempts it from any Solr index responsibilities.
>
>
> > On Aug 10, 2020, at 6:43 PM, Ashwin Ramesh 
> wrote:
> >
> > Hey Aroop, the general process for our backup is:
> > - Connect all machines to an EFS drive (AWS's NFS service)
> > - Call the collections API to backup into EFS
> > - ZIP the directory once the backup is completed
> > - Copy the ZIP into an s3 bucket
> >
> > I'll probably have to see which part of the process is the slowest.
> >
> > On another note, can you simply remove the task from the ZK path to
> > continue the execution of tasks?
> >
> > Regards,
> >
> > Ash
> >
> > On Tue, Aug 11, 2020 at 11:40 AM Aroop Ganguly
> >  wrote:
> >
> >> 12 hours is extreme, we take backups of 10TB worth of indexes in 15 mins
> >> using the collection backup api.
> >> How are you taking the backup?
> >>
> >> Do you actually see any backup progress or u are just seeing the task in
> >> the overseer queue linger ?
> >> I have seen restore tasks hanging in the queue forever despite process
> >> completing in Solr 77 so wouldn’t be surprised this happens with backup
> as
> >> well. And also observed that unless that unless that task is removed
> from
> >> the overseer-collection-queue the next ones do not proceed.
> >>
> >> Also adding replicas while backup seems like overkill, why don’t you
> just
> >> have the appropriate replication factor in the first place and have
> >> autoAddReplicas=true for indemnity?
> >>
> >>> On Aug 10, 2020, at 6:32 PM, Ashwin Ramesh 
> >> wrote:
> >>>
> >>> Hi everybody,
> >>>
> >>> We are using solr 7.6 (SolrCloud). We notices that when the backup is
> >>> running, we cannot add any replicas to the collection. By the looks of
> >> it,
> >>> the job to add the replica is put into the Overseer queue, but it is
> not
> >>> being processed. Is this expected? And are there any workarounds?
> >>>
> >>> Our backups take about 12 hours. Maybe we should try optimize that too.
> >>>
> >>> Regards,
> >>>
> >>> Ash
> >>>
> >>> --
> >>> **
> >>> ** <https://www.canva.com/>Empowering the world to design
> >>> Share accurate
> >>> information on COVID-19 and spread messages of support to your
> community.
> >>>
> >>> Here are some resources
> >>> <
> >>
> https://about.canva.com/coronavirus-awareness-collection/?utm_medium=pr&utm_source=news&utm_campaign=covid19_templates
> >
> >>
> >>> that can help.
> >>> <https://twitter.com/canva> <https://facebook.com/canva>
> >>> <https://au.linkedin.com/company/canva> <https://twitter.com/canva>
> >>> <https://facebook.com/canva>  <https://au.linkedin.com/company/canva>
> >>> <https://instagram.com/canva>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> > --
> > **
> > ** <https://www.canva.com/>Empowering the world to design
> > Share accurate
> > information on COVID-19 and spread messages of support to your community.
> >
> > Here are some resources
> > <
> https://about.canva.com/coronavirus-awareness-collection/?utm_medium=pr&utm_source=news&utm_campaign=covid19_templates>
>
> > that can help.
> > <https://twitter.com/canva> <https://facebook.com/canva>
> > <https://au.linkedin.com/company/canva> <https://twitter.com/canva>
> > <https://facebook.com/canva>  <https://au.linkedin.com/company/canva>
> > <https://instagram.com/canva>
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>













Re: Cannot add replica during backup

2020-08-11 Thread Ashwin Ramesh
Hey Matthew,

Unfortunately, our shard leaders are spread across multiple nodes, so a single
EBS volume wouldn't work. Did you manage to get around this issue yourself?

Regards,

Ash

On Tue, Aug 11, 2020 at 9:00 PM matthew sporleder 
wrote:

> I can already tell you it is EFS that is slow. I had to switch to an ebs
> disk for backups on a different project because efs couldn't keep up.
>
> > On Aug 10, 2020, at 9:43 PM, Ashwin Ramesh 
> wrote:
> >
> > Hey Aroop, the general process for our backup is:
> > - Connect all machines to an EFS drive (AWS's NFS service)
> > - Call the collections API to backup into EFS
> > - ZIP the directory once the backup is completed
> > - Copy the ZIP into an s3 bucket
> >
> > I'll probably have to see which part of the process is the slowest.
> >
> > On another note, can you simply remove the task from the ZK path to
> > continue the execution of tasks?
> >
> > Regards,
> >
> > Ash
> >
> >> On Tue, Aug 11, 2020 at 11:40 AM Aroop Ganguly
> >>  wrote:
> >>
> >> 12 hours is extreme, we take backups of 10TB worth of indexes in 15 mins
> >> using the collection backup api.
> >> How are you taking the backup?
> >>
> >> Do you actually see any backup progress or u are just seeing the task in
> >> the overseer queue linger ?
> >> I have seen restore tasks hanging in the queue forever despite process
> >> completing in Solr 77 so wouldn’t be surprised this happens with backup
> as
> >> well. And also observed that unless that unless that task is removed
> from
> >> the overseer-collection-queue the next ones do not proceed.
> >>
> >> Also adding replicas while backup seems like overkill, why don’t you
> just
> >> have the appropriate replication factor in the first place and have
> >> autoAddReplicas=true for indemnity?
> >>
> >>> On Aug 10, 2020, at 6:32 PM, Ashwin Ramesh 
> >> wrote:
> >>>
> >>> Hi everybody,
> >>>
> >>> We are using solr 7.6 (SolrCloud). We notices that when the backup is
> >>> running, we cannot add any replicas to the collection. By the looks of
> >> it,
> >>> the job to add the replica is put into the Overseer queue, but it is
> not
> >>> being processed. Is this expected? And are there any workarounds?
> >>>
> >>> Our backups take about 12 hours. Maybe we should try optimize that too.
> >>>
> >>> Regards,
> >>>
> >>> Ash
> >>>
> >>> --
> >>> **
> >>> ** <https://www.canva.com/>Empowering the world to design
> >>> Share accurate
> >>> information on COVID-19 and spread messages of support to your
> community.
> >>>
> >>> Here are some resources
> >>> <
> >>
> https://about.canva.com/coronavirus-awareness-collection/?utm_medium=pr&utm_source=news&utm_campaign=covid19_templates
> >
> >>
> >>> that can help.
> >>> <https://twitter.com/canva> <https://facebook.com/canva>
> >>> <https://au.linkedin.com/company/canva> <https://twitter.com/canva>
> >>> <https://facebook.com/canva>  <https://au.linkedin.com/company/canva>
> >>> <https://instagram.com/canva>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
>


Sort on docValue field is slow.

2019-05-20 Thread Ashwin Ramesh
Hello everybody,

Hoping to get advice on a specific issue - We have a collection of 50M
documents. We recently added a featuredAt field defined as such -
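Roughly, the definition looks like this (some of the attributes below may
differ from our exact schema):

```
<field name="featuredAt" type="date" indexed="true" stored="true"
       required="false" multiValued="false" docValues="true"/>
```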



This field is sparsely populated, such that only a small subset (3-5 thousand
documents currently) has been tagged with that field.

We have a business case where we want to order this content by most
recently featured -> least recently featured -> the rest of the content in
any order. However, adding the `sort=featuredAt desc` param results in
qTime > 5000 ms (our hard timeout is 5000 ms).

The request handler processing this request is defined as follows:

  
*
  
  
id
edismax
10
id
  
  
elevator
  


We hydrate content from a separate store.
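For reference, the request handler above is essentially a plain SearchHandler
along these lines (the handler name and exact parameter names are reconstructed
here, so treat this as a sketch rather than the exact config):

```
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="fl">id</str>
    <str name="defType">edismax</str>
    <int name="rows">10</int>
  </lst>
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>
```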

Any advice on how to improve the performance of this request handler and the
sorting would be much appreciated.

System/Architecture Specs:
Solr 7.4
8 Shards
TLOG / PULLs

Thank you & Regards,

Ash


Re: Sort on docValue field is slow.

2019-05-20 Thread Ashwin Ramesh
Hi Shawn,

Thanks for the prompt response.

1. date type def - 

2. The field is brand new. I added it to schema.xml, uploaded it to ZK &
reloaded the collection. After that we started indexing the few thousand
documents. Did we still need to do a full reindex to a fresh collection?

3. It is the only difference. I am testing the raw URL call timing
difference with and without the extra sort.

Hope this helps,

Regards,

Ash



On Mon, May 20, 2019 at 11:17 PM Shawn Heisey  wrote:

> On 5/20/2019 6:25 AM, Ashwin Ramesh wrote:
> > Hoping to get advice on a specific issue - We have a collection of 50M
> > documents. We recently added a featuredAt field defined as such -
> >
> >  > required="false"
> > multiValued="false" docValues="true"/>
>
> What is the fieldType definition for "date"?  We cannot assume that you
> have left this the same as Solr's sample configs.
>
> > This field is sparsely populated, such that only a small subset (3-5
> > thousand documents currently) has been tagged with that field.
>
> Did you completely reindex, or just index those few thousand records?
> When changing fields related to docValues, you must completely delete
> the old index and reindex.  That's just how docValues works.
>
> > We have a business case where we want to order this content by most
> > recently featured -> least recently featured -> the rest of the content in
> > any order. However, adding the `sort=featuredAt desc` param results in
> > qTime > 5000 (our hard timeout is 5000).
>
> Is the definition of the sort parameter the ONLY difference?  Are you
> querying on the new field?  Can you share the entire query URL, or the
> code that produced it if you're using a Solr client?  What is the before
> QTime?
>
> Thanks,
> Shawn
>


Are docValues useful for FilterQueries?

2019-07-08 Thread Ashwin Ramesh
Hi everybody,

I can't find concrete evidence whether docValues are indeed useful for
filter queries. One example of a field:



This field will have a value between 0 and 1. The only use case for this
field is to filter on a range / subset of values. There will be no scoring
/ querying on this field. Is this a good use case for docValues?

Regards,

Ash


Is it possible to skip scoring completely?

2019-09-11 Thread Ashwin Ramesh
Hi everybody,

I was wondering if there is a way we can tell Solr (7.3+) to run none of
its scoring logic. We would like to simply add a set of filter queries and
order on a specific docValue field.

e.g. "Give me all fq=color:red documents ORDER on popularityScore DESC"
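Concretely, something like this is what we have in mind (assuming
popularityScore is a docValues field; the parameter values are only
illustrative):

```
q=*:*
fq=color:red
sort=popularityScore desc
fl=id
rows=10
```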

Thanks in advance,

Ash


Re: Is it possible to skip scoring completely?

2019-09-11 Thread Ashwin Ramesh
Thanks Shawn & Emir,

I just tried a * query with filters and fl=id,score. I noticed that all
scores were 1.0, which I assume means no scoring was done. When I added a
sort after that test, the scores were still 1.0.

I guess all I have to do is set q=* & set a sort.

Appreciate your help,

Ash

On Thu, Sep 12, 2019 at 4:40 PM Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Ash,
> I did not check the code, so not sure if your question is based on
> something that you find in the codebase or you are just assuming that
> scoring is called? I would assume differently: if you use only fq, then
> Solr does not have anything to score. Also, if you order by something other
> than score and do not request score to be returned, I would also assume
> that Solr will not calculate score. Again, didn’t have time to check the
> code, so these are just assumptions.
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 12 Sep 2019, at 01:27, Ashwin Ramesh  wrote:
> >
> > Hi everybody,
> >
> > I was wondering if there is a way we can tell Solr (7.3+) to run none of
> > its scoring logic. We would like to simply add a set of filter queries and
> > order on a specific docValue field.
> >
> > e.g. "Give me all fq=color:red documents ORDER on popularityScore DESC"
> >
> > Thanks in advance,
> >
> > Ash
> >
>
>


Re: Is it possible to skip scoring completely?

2019-09-12 Thread Ashwin Ramesh
Ah! Thanks so much!

On Thu., 12 Sep. 2019, 11:56 pm Shawn Heisey,  wrote:

> On 9/12/2019 12:43 AM, Ashwin Ramesh wrote:
> > I just tried a * query with filters with fl=id,score. I noticed that all
> > scores were 1.0. Which I assume means no scoring was done. When I added a
> > sort after that test, scores were still 1.0.
> >
> > I guess all I have to do is set q=* & set a sort.
>
> Don't use q=* for your query. This is a wildcard query.  What that means
> is that if the field you're querying contains 10 million different
> values, then your actual query will be built with all 10 million of
> those values.  It will be huge, and VERY slow.
>
> Use q=*:* if you mean all documents.  This is special syntax that Lucene
> and Solr understand and translate into a very fast "all documents
> query".  That query will probably also generate 1.0 for scores, though I
> haven't checked.
>
> Thanks,
> Shawn
>


Best field type for boosting all documents

2019-09-16 Thread Ashwin Ramesh
Hi everybody,

We have a use case where we want to push a popularity boost for each
document in our collection. When a user searches for any term, we would
like to arbitrarily add an additional boost by this value (which is
different for each document).

E.g. q=foo&boost=def(popularityBoostField,1)

Should we define the field 'popularityBoostField' as a docValue or regular
field?

If the field is sparsely filled, will that cause any issues?
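For concreteness, something along these lines is what we have in mind (the
float type and field attributes below are assumptions on our side):

```
<field name="popularityBoostField" type="pfloat" indexed="true"
       stored="false" docValues="true"/>

q=foo
defType=edismax
boost=def(popularityBoostField,1)
```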

Regards,

Ash


Dealing with multi-word keywords and SOW=true

2019-09-30 Thread Ashwin Ramesh
Hi everybody,

I am using the edismax parser and have noticed a very specific behaviour
with how sow=true (default) handles multiword keywords.

We have a field called 'keywords', which uses the general
KeywordTokenizerFactory. There are also other text fields like title and
description. etc.

When we index a document with a keyword "ice cream", for example, we know
it gets indexed into that field as "ice cream".

However, at query time, I noticed that if we run an Edismax query:
q=ice cream
qf=keywords

I do not get that document back as a match. This is due to sow=true
splitting the user's query and the final tokens not being present in the
keywords field.

I was wondering what the best practice around this was? Some thoughts I
have had:

1. Index multi-word keywords with hyphens or something similar. E.g. "ice
cream" -> "ice-cream"
2. Additionally index the separate words as keywords also. E.g. "ice cream"
-> "ice cream", "ice", "cream". However, this method will result in the loss
of intent (q=ice would return this document).
3. Add a boost query which is an edismax query where we explicitly set
sow=false and add a huge boost. E.g. bq={!edismax qf=keywords^1000
sow=false bq="" boost="" pf="" tie=1.00 v="ice cream"}

Is there an industry-standard solution to handle this type of problem? Keep
in mind that the other text fields may also include these terms. E.g.
title="This is ice cream", which would match the query. This specific
problem affects the keywords field for the obvious reason that the indexing
pipeline does not tokenize keywords.

Thank you for all your amazing help,

Regards,

Ash


Re: Dealing with multi-word keywords and SOW=true

2019-09-30 Thread Ashwin Ramesh
Thanks Erick, that seems to work!

Should I also leave the keywords field in qf? For example, the query "blue dog"
may be represented as separate tokens in the keywords index.
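Just to make sure I have understood the suggestion, the query would then look
roughly like this (values are only illustrative):

```
q=blue dog keywords:"blue dog"
qf=title description
defType=edismax
```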



On Mon, Sep 30, 2019 at 9:32 PM Erick Erickson 
wrote:

> Have you tried taking your keyword field out of the “qf” param and adding
> it explicitly? As keyword:”ice cream”
>
> Best,
> Erick
>
> > On Sep 30, 2019, at 5:27 AM, Ashwin Ramesh  wrote:
> >
> > Hi everybody,
> >
> > I am using the edismax parser and have noticed a very specific behaviour
> > with how sow=true (default) handles multiword keywords.
> >
> > We have a field called 'keywords', which uses the general
> > KeywordTokenizerFactory. There are also other text fields like title and
> > description. etc.
> >
> > When we index a document with a keyword "ice cream", for example, we know
> > it gets indexed into that field as "ice cream".
> >
> > However, at query time, I noticed that if we run an Edismax query:
> > q=ice cream
> > qf=keywords
> >
> > I do not get that document back as a match. This is due to sow=true
> > splitting the user's query and the final tokens not being present in the
> > keywords field.
> >
> > I was wondering what the best practice around this was? Some thoughts I
> > have had:
> >
> > 1. Index multi-word keywords with hyphens or something similar. E.g. "ice
> > cream" -> "ice-cream"
> > 2. Additionally index the separate words as keywords also. E.g. "ice cream"
> > -> "ice cream", "ice", "cream". However, this method will result in the loss
> > of intent (q=ice would return this document).
> > 3. Add a boost query which is an edismax query where we explicitly set
> > sow=false and add a huge boost. E.g. bq={!edismax qf=keywords^1000
> > sow=false bq="" boost="" pf="" tie=1.00 v="ice cream"}
> >
> > Is there an industry-standard solution to handle this type of problem? Keep
> > in mind that the other text fields may also include these terms. E.g.
> > title="This is ice cream", which would match the query. This specific
> > problem affects the keywords field for the obvious reason that the
> indexing
> > pipeline does not tokenize keywords.
> >
> > Thank you for all your amazing help,
> >
> > Regards,
> >
> > Ash
> >
>
>


Solr 7.6.0 - OOM Caused Down Replica. Cannot recover. Please advice

2021-02-24 Thread Ashwin Ramesh
Hi everyone,

We had an OOM event earlier this morning. This has caused one of our shards
to lose all its replicas, and its leader is still in a down state. We have
restarted the Java process (Solr) and it is still in a down state. Logs
below:

```
Feb 25, 2021 @ 11:46:43.000 2021-02-25 00:46:43.268 WARN
 (updateExecutor-3-thread-1-processing-n:10.0.10.43:8983_solr
x:search-collection-2018-10-30_shard2_5_replica_n1480
c:search-collection-2018-10-30 s:shard2_5 r:core_node1481)
[c:search-collection-2018-10-30 s:shard2_5 r:core_node1481
x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481] ∎
Feb 25, 2021 @ 11:46:40.000 2021-02-25 00:46:40.759 WARN
 (zkCallback-7-thread-2) [c:search-collection-2018-10-30 s:shard2_5
r:core_node1481 x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481] ∎
Feb 25, 2021 @ 11:46:35.000 2021-02-25 00:46:35.761 WARN
 (zkCallback-7-thread-2) [c:search-collection-2018-10-30 s:shard2_5
r:core_node1481 x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481] ∎
Feb 25, 2021 @ 11:46:33.000 2021-02-25 00:46:33.270 WARN
 (updateExecutor-3-thread-2-processing-n:10.0.10.43:8983_solr
x:search-collection-2018-10-30_shard2_5_replica_n1480
c:search-collection-2018-10-30 s:shard2_5 r:core_node1481)
[c:search-collection-2018-10-30 s:shard2_5 r:core_node1481
x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481] ∎
Feb 25, 2021 @ 11:46:30.000 2021-02-25 00:46:30.759 WARN
 (zkCallback-7-thread-2) [c:search-collection-2018-10-30 s:shard2_5
r:core_node1481 x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481] ∎
Feb 25, 2021 @ 11:46:25.000 2021-02-25 00:46:25.761 WARN
 (zkCallback-7-thread-2) [c:search-collection-2018-10-30 s:shard2_5
r:core_node1481 x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481] ∎
Feb 25, 2021 @ 11:46:23.000 2021-02-25 00:46:23.279 WARN
 (updateExecutor-3-thread-1-processing-n:10.0.10.43:8983_solr
x:search-collection-2018-10-30_shard2_5_replica_n1480
c:search-collection-2018-10-30 s:shard2_5 r:core_node1481)
[c:search-collection-2018-10-30 s:shard2_5 r:core_node1481
x:search-collection-2018-10-30_shard2_5_replica_n1480]
o.a.s.c.RecoveryStrategy Stopping recovery for
core=[search-collection-2018-10-30_shard2_5_replica_n1480]
coreNodeName=[core_node1481] ∎
```

Questions:
1. Is there anything we can do to force this core to go live?
2. If the core is unrecoverable, is there a way to clear the core up such
that we can reindex only that shard?
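For reference, the kind of thing we are considering is roughly the following
(collection/shard/replica names taken from the logs above; whether this is
safe to do is part of what we are unsure about):

```
/admin/collections?action=DELETEREPLICA&collection=search-collection-2018-10-30&shard=shard2_5&replica=core_node1481
/admin/collections?action=ADDREPLICA&collection=search-collection-2018-10-30&shard=shard2_5
```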

Any other advice would be great too :)

Ash


SOLR IndexSearcher Opening

2017-04-05 Thread Murari, Ramesh Babu
Hi All,
Can you please tell me what conditions can cause a Solr
instance to re-open a searcher?

I know Replication & Hard Commit are obvious answers to this; can you please
help me understand what else can trigger the reopening of a searcher.
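For context, this is the sort of solrconfig.xml commit configuration I have in
mind (the values below are just examples):

```
<autoCommit>
  <maxTime>60000</maxTime>
  <!-- a hard commit only opens a new searcher when openSearcher is true -->
  <openSearcher>true</openSearcher>
</autoCommit>

<autoSoftCommit>
  <!-- soft commits exist to make documents visible, so they open a new searcher -->
  <maxTime>5000</maxTime>
</autoSoftCommit>
```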

Thanks in advance.

Thanks & Regards,
Ramesh babu Murari







Solr sorting issue : can not sort on multivalued field

2011-12-06 Thread Ramesh kumar Velusamy
Hi,

  I am getting this weird error message `can not sort on multivalued
field: fieldname` on all the indexed fields. This is the full error message
from Solr:

HTTP Status 400 - can not sort on multivalued field: price

type: Status report
message: can not sort on multivalued field: price
description: The request sent by the client was syntactically incorrect
(can not sort on multivalued field: price).

GlassFish Server Open Source Edition 3.1

I am sure that my indexed field doesn't have `multiValued=true` set on it.
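For reference, the declaration is roughly like this (the numeric type name
below is my guess from the TrieField reference in the stack trace, and other
attributes may differ):

```
<field name="price" type="tfloat" indexed="true" stored="true"
       multiValued="false"/>
```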



To make sure, I have added `multiValued=false` explicitly, but I am still
getting the same error.

This is the URL request sent to Solr:


http://localhost:8080/apache-solr-3.1.0/select?wt=ruby&q=flat&fl=_id&sort=price+asc&limit=5&offset=0

It all works fine if I remove the sort from the request.


Here is the complete stack trace from the Solr log:


[#|2011-12-06T16:03:35.813+0530|SEVERE|glassfish3.1|org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=Thread-1;|org.apache.solr.common.SolrException:
can not sort on multivalued field: price
at
org.apache.solr.schema.SchemaField.checkSortability(SchemaField.java:161)
at org.apache.solr.schema.TrieField.getSortField(TrieField.java:128)
at org.apache.solr.schema.SchemaField.getSortField(SchemaField.java:144)
at org.apache.solr.search.QueryParsing.parseSort(QueryParsing.java:385)
at org.apache.solr.search.QParser.getSort(QParser.java:251)
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:102)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:215)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:279)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at
org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:655)
at
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:595)
at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:98)
at
com.sun.enterprise.web.PESessionLockingStandardPipeline.invoke(PESessionLockingStandardPipeline.java:91)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:162)
at
org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:326)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:227)
at
com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:170)
at
com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:822)
at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:719)
at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1013)
at
com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:225)
at
com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
at
com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
at
com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
at
com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79)
at
com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
at
com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
at
com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
at
com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
at java.lang.Thread.run(Thread.java:662)
|#]


[#|2011-12-06T16:03:35.814+0530|INFO|glassfish3.1|org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=Thread-1;|[]
webapp=/apache-solr-3.5.0 path=/select
params={wt=ruby&q=flat&fl=_id&sort=price+asc&limit=5&offset=0} status=400
QTime=42 |#]


PS: I do have only one multivalued field in the document, but that's not
used in the sort. And I have verified this in both Solr versions 3.1 and
3.5; same error.

 Can someone help me out?

Cheers
Ramesh vel


Re: Highlight feature

2012-05-15 Thread Ramesh K Balasubramanian
That is the default response format. If you would like to change that, you
could extend the search handler or post-process the XML data. Another option
would be to use javabin (if your app is Java based) and build the XML the way
your app would need.
 
Best Regards,
Ramesh


Re: Invalid version (expected 2, but 60) on CentOS in production please Help!!!

2012-05-15 Thread Ramesh K Balasubramanian
I have seen similar errors before when the Solr version and the SolrJ version
used in the client don't match.
 
Best Regards,
Ramesh