Re: Huge Query execution time for multiple ORs

2017-11-30 Thread Emir Arnautović
Hi Faraz,
It is a bit worse than that - it also needs to calculate scores, so for each
matching doc of one query part it has to check whether it appears in the results
of the other query parts. If you use the term query parser, you avoid calculating
scores - all docs will have score 1.
Solr is based on Lucene, which is essentially an inverted index:
https://en.wikipedia.org/wiki/Inverted_index
so knowing that helps understand
how expensive some queries are. It is relatively easy to figure out what steps
are needed for different query types. Of course, Lucene includes a lot of
smartness, and it is probably not using the naive approach, but it cannot avoid
the limitations of an inverted index.
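
For example, the {!terms} query parser builds a constant-score filter from a
comma-separated list. A sketch (the collection name is a placeholder; the field
comes from the thread):

    curl 'http://localhost:8983/solr/mycollection/select' \
      --data-urlencode 'q=*:*' \
      --data-urlencode 'fq={!terms f=author}name1,name2,name3'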

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 30 Nov 2017, at 02:39, Faraz Fallahi  wrote:
> 
> Hi Toke,
> 
> Just to be clear and to make sure I understand. Does this mean that a query
> of the form
> author:name1 OR author:name2 OR author:name3
> 
> is being processed like, e.g.:
> 
> 1 query against the index with author:name1 getting 4 results
> Then 1 query against the index with author:name2 getting 3 results
> Then 1 query against the index with author:name3 getting 1 result
> 
> And in the end all results are merged and I get a result of 8?
> 
> So a query of a thousand authors will be split into a thousand single
> queries against the index?
> 
> Do I understand this correctly?
> 
> Thx for the help
> Faraz
> 
> 
> On 28.11.2017 15:39, "Toke Eskildsen" wrote:
> 
> On Tue, 2017-11-28 at 11:07 +0100, Faraz Fallahi wrote:
>> I have a question regarding Solr queries.
>> My query basically contains thousands of OR conditions for authors
>> (author:name1 OR author:name2 OR author:name3 OR author:name4 ...)
>> The execution time on my index is huge (around 15 sec). When I tag
>> all the associated documents with a custom field and value like
>> authorlist:1 and then I change my query to just search for
>> authorlist:1 it executes in 78 ms. How come there is such a big
>> difference in exec-time?
> 
> Due to the nature of inverted indexes (which lie at the heart of
> Solr), your thousands of OR-queries mean thousands of lookups, whereas
> your authorlist means a single lookup. On top of this, the results for
> each author need to be merged with the other author-results - for
> authorlist the results are there directly.
> 
> If your author lists are static, indexing them as you did in your test
> is the best solution.
> 
> If they are not static, using a filter-query will ensure that they are
> at least cached subsequently, so that only the first call will be
> slow.
> 
> If they are semi-static and there are not too many of them, you could
> do warm-up filter-queries for all the different groups so that the
> users do not pay the first-call penalty. This requires your filter-
> cache to be large enough to hold all the author lists.
> 
> - Toke Eskildsen, Royal Danish Library



Re: Huge Query execution time for multiple ORs

2017-11-30 Thread Faraz Fallahi
Uff... I see... thx for the explanation :)

On 30.11.2017 3:13 PM, "Emir Arnautović" <
emir.arnauto...@sematext.com> wrote:

> Hi Faraz,
> It is a bit worse than that - it also needs to calculate scores, so for
> each matching doc of one query part it has to check whether it appears in the
> results of the other query parts. If you use the term query parser, you avoid
> calculating scores - all docs will have score 1.
> Solr is based on Lucene, which is essentially an inverted index:
> https://en.wikipedia.org/wiki/Inverted_index so knowing that helps understand how expensive some
> queries are. It is relatively easy to figure out what steps are needed for
> different query types. Of course, Lucene includes a lot of smartness, and it
> is probably not using the naive approach, but it cannot avoid the limitations
> of an inverted index.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 30 Nov 2017, at 02:39, Faraz Fallahi 
> wrote:
> >
> > Hi Toke,
> >
> > Just to be clear and to make sure I understand. Does this mean that a query
> > of the form
> > author:name1 OR author:name2 OR author:name3
> >
> > is being processed like, e.g.:
> >
> > 1 query against the index with author:name1 getting 4 results
> > Then 1 query against the index with author:name2 getting 3 results
> > Then 1 query against the index with author:name3 getting 1 result
> >
> > And in the end all results are merged and I get a result of 8?
> >
> > So a query of a thousand authors will be split into a thousand single
> > queries against the index?
> >
> > Do I understand this correctly?
> >
> > Thx for the help
> > Faraz
> >
> >
> > On 28.11.2017 15:39, "Toke Eskildsen" wrote:
> >
> > On Tue, 2017-11-28 at 11:07 +0100, Faraz Fallahi wrote:
> >> I have a question regarding Solr queries.
> >> My query basically contains thousands of OR conditions for authors
> >> (author:name1 OR author:name2 OR author:name3 OR author:name4 ...)
> >> The execution time on my index is huge (around 15 sec). When I tag
> >> all the associated documents with a custom field and value like
> >> authorlist:1 and then I change my query to just search for
> >> authorlist:1 it executes in 78 ms. How come there is such a big
> >> difference in exec-time?
> >
> > Due to the nature of inverted indexes (which lie at the heart of
> > Solr), your thousands of OR-queries mean thousands of lookups, whereas
> > your authorlist means a single lookup. On top of this, the results for
> > each author need to be merged with the other author-results - for
> > authorlist the results are there directly.
> >
> > If your author lists are static, indexing them as you did in your test
> > is the best solution.
> >
> > If they are not static, using a filter-query will ensure that they are
> > at least cached subsequently, so that only the first call will be
> > slow.
> >
> > If they are semi-static and there are not too many of them, you could
> > do warm-up filter-queries for all the different groups so that the
> > users do not pay the first-call penalty. This requires your filter-
> > cache to be large enough to hold all the author lists.
> >
> > - Toke Eskildsen, Royal Danish Library
>
>


Re: does the payload_check query parser have support for simple query parser operators?

2017-11-30 Thread Erik Hatcher
No, it doesn’t. The payload parsers currently just do simple tokenization, with no
special syntax supported.
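
What does work is plain tokenized input with one payload per token, e.g. (the
field and payload come from the question; the two-token variant is an
illustrative sketch):

    {!payload_check f=text payloads='NOUN'}apple
    {!payload_check f=text payloads='NOUN NOUN'}apple pie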

 Erik

> On Nov 30, 2017, at 02:41, John Anonymous  wrote:
> 
> I would like to use wildcards and fuzzy search with the payload_check query
> parser. Are these supported?
> 
> {!payload_check f=text payloads='NOUN'}apple~1
> 
> {!payload_check f=text payloads='NOUN'}app*
> 
> Thanks


check softCommit , autocommit and hard commit count

2017-11-30 Thread Puppy Linux Distros
Hi,

I am trying to calculate the total number of soft commits, autocommits and
hard commits from the Solr logs. Can you please check whether the commands
below are correct?

Let me know how to find the total soft commits, hard commits and autocommits
from the logs.


1. totalcommit=`cat $solrlogfile | grep "start commit" | wc -l`

   totalcommit = 41906

2. totalsoftcommit=`cat $solrlogfile | grep "start commit" | grep "softCommit=true" | wc -l`

   totalsoftcommit = 921

3. totalhardcommits=`cat $solrlogfile | grep "start commit" | grep "softCommit=false" | grep "openSearcher=true" | wc -l`

   totalhardcommits = 40982

4. totalautocommit=`cat $solrlogfile | grep "realtime" | wc -l`

   totalautocommit = 3



When I did a soft commit I saw an autocommit triggered after 15 min. There are
921 soft commits in the logs, so there should be an equal number of autocommits
in the log, but I can see only 3 autocommits. Is it because a hard commit is
triggered immediately after the soft commit?

-- 
Regards,

Vivek CV


Dedupe documents inside of each group

2017-11-30 Thread Diego Ceccarelli (BLOOMBERG/ QUEEN VIC)
Hello, I have a use case where I need to dedupe documents in each group based 
on a particular field:

example: 

doc1 = { field_a=1 field_b=2 }
doc2 = { field_a=1 field_b=2 }

doc3 = { field_a=1 field_b=3 }

doc4 = { field_a=2 field_b=3 }
doc5 = { field_a=2 field_b=3 }

and I want to run "Group by field_a, dedupe by field_b" obtaining:

[ group { field_a=1, docs = [doc1, doc3] }, group { field_a=2, docs = [doc4] } ]

(doc2 is deleted because it is the same as doc1, and doc5 because it is the same as doc4).
I would also like to be able to specify how many docs per group.
Generalizing, it could be seen as 'grouping inside grouping'.
Does anyone know if it is possible to do this in Solr without changing the code?

Thanks




Solr Wildcard Search

2017-11-30 Thread Georgy Nevsky
Can somebody help me understand how Solr wildcard search works?

If I search for the term “ship*” I get many strings in the result,
like “Shipping Weight”, “Ship From”, “Shipping Calculator”, etc.

But if I search for “shipp*” I don't get any results.



In the best we trust

Georgy Nevsky


spellcheck.q issue

2017-11-30 Thread Georgy Nevsky
I have an issue with the spellcheck.q parameter. I think it is a bug.



If I do a search without specifying the spellcheck.q parameter, then I
get spellcheck suggestions.

Query: /select?q=text_en-us:baring&spellcheck.dictionary=en-us&spellcheck=on

Result:

  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="baring">
        <int name="numFound">1</int>
        <int name="startOffset">11</int>
        <int name="endOffset">17</int>
        <arr name="suggestion">
          <str>bearing</str>
        </arr>
      </lst>
    </lst>
  </lst>

But I really want to use the spellcheck.q parameter to pass the clean input
search string, and then I don't get any spellcheck suggestions.

Query:
/select?q=text_en-us:baring&spellcheck.dictionary=en-us&spellcheck.q=baring&spellcheck=on

Result:

  <lst name="spellcheck">
    <lst name="suggestions"/>
  </lst>

Here is the relevant piece of solrconfig.xml:

  <lst name="spellchecker">
    <str name="name">en-us</str>
    <str name="field">text_en-us</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>

I’m using the latest stable version - 7.1.0.





In the best we trust

Georgy Nevsky


Re: Solr Wildcard Search

2017-11-30 Thread Rick Leir
George,
When you get those results it could be due to stemming.

Wildcard processing expands your term to multiple terms, OR'd together. It also 
takes you down a different analysis pathway, as many analysis components do not 
work with multiple terms. Look into the SolrAdmin console, and use the analysis 
tab to understand what is going on.
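
If you prefer a request over the console, the field analysis handler exposes the
same information. A sketch (the collection and field type names are placeholders):

    curl 'http://localhost:8983/solr/mycollection/analysis/field?analysis.fieldtype=text_en&analysis.fieldvalue=shipping&analysis.query=shipp*'

It returns the token stream after each stage of the index and query analysis chains.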

If you still have doubts, tell us more about your config.
Cheers --Rick


On November 30, 2017 7:06:42 AM EST, Georgy Nevsky 
 wrote:
>Can somebody help me understand how Solr Wildcard Search is working?
>
>If I’m doing search for “ship*” term I’m getting in result many
>strings,
>like “Shipping Weight”, “Ship From”, “Shipping Calculator”, etc.
>
>But if I’m searching for “shipp*” I don’t get any result.
>
>
>
>In the best we trust
>
>Georgy Nevsky

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

RE: Solr Wildcard Search

2017-11-30 Thread Georgy Nevsky
I wish to understand whether I can do something to get the term "shipping" in
the results when searching for "shipp*".

Here is the field definition:

  <field name="..." type="text_en" indexed="true" stored="true" multiValued="false"/>

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

Is anything else important? Most configuration parameters are the defaults of
Apache Solr 7.1.0.

In the best we trust
Georgy Nevsky


-Original Message-
From: Rick Leir [mailto:rl...@leirtech.com]
Sent: Thursday, November 30, 2017 7:32 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

George,
When you get those results it could be due to stemming.

Wildcard processing expands your term to multiple terms, OR'd together. It
also takes you down a different analysis pathway, as many analysis
components do not work with multiple terms. Look into the SolrAdmin console,
and use the analysis tab to understand what is going on.

If you still have doubts, tell us more about your config.
Cheers --Rick


On November 30, 2017 7:06:42 AM EST, Georgy Nevsky
 wrote:
>Can somebody help me understand how Solr Wildcard Search is working?
>
>If I’m doing search for “ship*” term I’m getting in result many
>strings, like “Shipping Weight”, “Ship From”, “Shipping Calculator”,
>etc.
>
>But if I’m searching for “shipp*” I don’t get any result.
>
>
>
>In the best we trust
>
>Georgy Nevsky

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Solr Wildcard Search

2017-11-30 Thread Atita Arora
As Rick raised the most important aspect here - the phrase is broken
into multiple terms ORed together - I believe that if the use case requires
performing wildcard search on phrases, we would need to store the entire phrase
as a single term in the index, which probably is not happening right now, and
hence the phrases are not found when sent across as phrases.
I tried this on my local Solr 7.1: without a phrase this works as expected;
however, as soon as I do a phrase search it fails for the reason I mentioned
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky 
wrote:

> I wish to understand if I can do something to get in result term "shipping"
> when search for "shipp*"?
>
> Here is the field definition:
>
>   <field name="..." type="text_en" indexed="true" stored="true" multiValued="false"/>
>
>   <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> Anything else can be important? Most configuration parameters are default
> to
> Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>
> -Original Message-
> From: Rick Leir [mailto:rl...@leirtech.com]
> Sent: Thursday, November 30, 2017 7:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Wildcard Search
>
> George,
> When you get those results it could be due to stemming.
>
> Wildcard processing expands your term to multiple terms, OR'd together. It
> also takes you down a different analysis pathway, as many analysis
> components do not work with multiple terms. Look into the SolrAdmin
> console,
> and use the analysis tab to understand what is going on.
>
> If you still have doubts, tell us more about your config.
> Cheers --Rick
>
>
> On November 30, 2017 7:06:42 AM EST, Georgy Nevsky
>  wrote:
> >Can somebody help me understand how Solr Wildcard Search is working?
> >
> >If I’m doing search for “ship*” term I’m getting in result many
> >strings, like “Shipping Weight”, “Ship From”, “Shipping Calculator”,
> >etc.
> >
> >But if I’m searching for “shipp*” I don’t get any result.
> >
> >
> >
> >In the best we trust
> >
> >Georgy Nevsky
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
The initial question wasn't about a phrasal search, but I largely agree that
different query parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!  

As has been pointed out, use the Solr admin window and the "debug" in the query 
option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified 
by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is 
stemmed to "ship"...hence all of your matches work.  Porter doesn't have rules 
for words ending in "pp", so it doesn't stem "shipp" to "ship".  So, your 
wildcard query is looking for words that start with "shipp", and given that 
"shipping" was stemmed to "ship", it won't find it.  It would find "shippqrs" 
because porter wouldn't know what to do with that 😊

Again, Porter can be very dangerous if it doesn't align with user expectations.



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com] 
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken into 
multiple terms ORed together , I believe if the use case requires to perform 
wildcard search on phrases , we would need to store the entire phrase as a 
single term in the index which probably is not happening right now and hence 
are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected , 
however as soon as I do phrase search it fails for the reason as i mentioned 
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky 
wrote:

> I wish to understand if I can do something to get in result term "shipping"
> when search for "shipp*"?
>
> Here is the field definition:
>
>   <field name="..." type="text_en" indexed="true" stored="true" multiValued="false"/>
>
>   <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> Anything else can be important? Most configuration parameters are 
> default to Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>
> -Original Message-
> From: Rick Leir [mailto:rl...@leirtech.com]
> Sent: Thursday, November 30, 2017 7:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Wildcard Search
>
> George,
> When you get those results it could be due to stemming.
>
> Wildcard processing expands your term to multiple terms, OR'd 
> together. It also takes you down a different analysis pathway, as many 
> analysis components do not work with multiple terms. Look into the 
> SolrAdmin console, and use the analysis tab to understand what is 
> going on.
>
> If you still have doubts, tell us more about your config.
> Cheers --Rick
>
>
> On November 30, 2017 7:06:42 AM EST, Georgy Nevsky 
>  wrote:
> >Can somebody help me understand how Solr Wildcard Search is working?
> >
> >If I’m doing search for “ship*” term I’m getting in result many 
> >strings, like “Shipping Weight”, “Ship From”, “Shipping Calculator”, 
> >etc.
> >
> >But if I’m searching for “shipp*” I don’t get any result.
> >
> >
> >
> >In the best we trust
> >
> >Georgy Nevsky
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


RE: Solr Wildcard Search

2017-11-30 Thread Georgy Nevsky
I understand stemming reason. Thank you.

What do you suggest using for stemming instead of "Porter"? I guess it
wasn't chosen intentionally.

In the best we trust
Georgy Nevsky


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

The initial question wasn't about a phrasal search, but I largely agree that
diff q parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!

As has been pointed out, use the Solr admin window and the "debug" in the
query option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being
modified by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is
stemmed to "ship"...hence all of your matches work.  Porter doesn't have
rules for words ending in "pp", so it doesn't stem "shipp" to "ship".  So,
your wildcard query is looking for words that start with "shipp", and given
that "shipping" was stemmed to "ship", it won't find it.  It would find
"shippqrs" because porter wouldn't know what to do with that 😊

Again, Porter can be very dangerous if it doesn't align with user
expectations.



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com]
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken
into multiple terms ORed together , I believe if the use case requires to
perform wildcard search on phrases , we would need to store the entire
phrase as a single term in the index which probably is not happening right
now and hence are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected ,
however as soon as I do phrase search it fails for the reason as i mentioned
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky 
wrote:

> I wish to understand if I can do something to get in result term
> "shipping"
> when search for "shipp*"?
>
> Here is the field definition:
>
>   <field name="..." type="text_en" indexed="true" stored="true" multiValued="false"/>
>
>   <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> Anything else can be important? Most configuration parameters are
> default to Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>
> -Original Message-
> From: Rick Leir [mailto:rl...@leirtech.com]
> Sent: Thursday, November 30, 2017 7:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Wildcard Search
>
> George,
> When you get those results it could be due to stemming.
>
> Wildcard processing expands your term to multiple terms, OR'd
> together. It also takes you down a different analysis pathway, as many
> analysis components do not work with multiple terms. Look into the
> SolrAdmin console, and use the analysis tab to understand what is
> going on.
>
> If you still have doubts, tell us more about your config.
> Cheers --Rick
>
>
> On November 30, 2017 7:06:42 AM EST, Georgy Nevsky
>  wrote:
> >Can somebody help me understand how Solr Wildcard Search is working?
> >
> >If I’m doing search for “ship*” term I’m getting in result many
> >strings, like “Shipping Weight”, “Ship From”, “Shipping Calculator”,
> >etc.
> >
> >But if I’m searching for “shipp*” I don’t get any result.
> >
> >
> >
> >In the best we trust
> >
> >Georgy Nevsky
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


Re: check softCommit , autocommit and hard commit count

2017-11-30 Thread Shawn Heisey

On 11/30/2017 4:36 AM, Puppy Linux Distros wrote:

I am trying to calculate the total number of soft commits, autocommits and
hard commits from the Solr logs. Can you please check whether the commands
below are correct?

Let me know how to find the total soft commits, hard commits and autocommits
from the logs.


1. totalcommit=`cat $solrlogfile | grep "start commit" | wc -l`

   totalcommit = 41906

2. totalsoftcommit=`cat $solrlogfile | grep "start commit" | grep "softCommit=true" | wc -l`

   totalsoftcommit = 921


These look reasonable ... but be aware that the default logging config 
will roll the solr.log file to a new empty file when it reaches 4 
megabytes, which doesn't really take that long on a busy server, so if 
you're only looking at "solr.log" you may have an incomplete picture.  I 
personally change the roll size limit to 4 gigabytes so solr.log covers 
a lot more time.


Solr restarts will *also* roll/archive logfiles, so you probably can't 
just look through every file in the logs directory that starts with 
"solr.log" -- it may be difficult to figure out exactly which files 
apply to the current running instance.  It might turn out that I'm 
completely wrong in that statement -- I haven't confirmed exactly what a 
Solr restart actually does with the logfiles.



3. totalhardcommits=`cat $solrlogfile | grep "start commit" | grep "softCommit=false" | grep "openSearcher=true" | wc -l`

   totalhardcommits = 40982


If you have configured autoCommit in solrconfig.xml and have set 
openSearcher to false in that config, then there will be hard commits 
that *don't* open a new searcher, so the "openSearcher=true" part will 
not catch those commits.  Example configs in recent versions have 
autoCommit set up this way, and this is the recommended config for 
*everybody*.  The default autoCommit interval in the example configs is 
15 seconds, which I think is a little too aggressive, but this kind of 
commit is typically very fast, so I've never seen that config cause 
problems.
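
A sketch of a count that also catches the commits which don't open a searcher,
assuming the same log format as the commands above:

    grep "start commit" $solrlogfile | grep "softCommit=false" | grep -c "openSearcher=false"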


The example configs do not have autoSoftCommit configured.  If users 
want to automatically do commits for visibility, we recommend that they 
use autoSoftCommit.



4. totalautocommit=`cat $solrlogfile | grep "realtime" | wc -l`

   totalautocommit = 3


These aren't autoCommits.  They are new searchers for the realtime get 
handler, which is capable of accessing documents that haven't been 
committed yet.  In addition to the index on disk, it searches the 
transaction logs.  Opening a new realtime searcher should be very fast, 
and they happen without any configuration. I'm not sure why you're only 
seeing this happen three times here. Presumably in a log where there are 
over 40,000 total commits, you are doing a fair amount of indexing, so I would 
have expected a new realtime searcher to have been created much more 
frequently, even if there were no commits done at all.


Maybe the realtime get handler can use the standard searcher, and only 
opens a new realtime searcher in cases where new documents have been 
indexed but there hasn't been a recent commit that opens a new 
searcher.  If that's the case, then I have no idea how long it would 
wait before firing up a new realtime searcher.  I wouldn't expect that 
to be very long ... so if your indexing/committing cycles are normally 
very fast, maybe Solr doesn't feel it's necessary to open realtime 
searchers very often.


Thanks,
Shawn



RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
At the very least the English possessive filter, which you have.  Great!

Depending on what your query log analysis finds -- perhaps users are pretty 
much only searching on nouns? -- you might consider 
EnglishMinimalStemFilterFactory.

I wouldn't say that Porter was or wasn't chosen intentionally.  It may be good 
for some use cases.  However, for the use cases I've seen, it has been 
disastrous.
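
A minimal sketch of that swap, assuming the stock text_en analyzer chain: replace

    <filter class="solr.PorterStemFilterFactory"/>

with

    <filter class="solr.EnglishMinimalStemFilterFactory"/>

and reindex, since stemming is applied at index time as well as query time.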

I have code that shows "equivalence sets" for analysis chain A vs analysis 
chain B...with some noise...assume same tokenization...  I should probably 
share that code on github or fold it into Luke somehow?  You can see this on a 
one-off basis in the Solr admin window via the Analysis tab, but to see this on 
your corpus/corpora across terms can be eye-opening, and then to cross-check it 
against query logs...quite powerful.


On one corpus, when I compared the same analysis chain A without Porter and B 
with porter, the output is e.g.:

"stemmed\tunstemmed #docs|unstemmed #docs..."

public  public 9834 | publication 1429 | publications 960 | publicly 662 | 
public's 176 | publicize 118 | publicized 107 | publicity 91 | publically 66 | 
publicizing 63 | publication's 6 | publicizes 4 | public_ 1 | publication_ 1 | 
publiced 1

effect  effective 6329 | effect 3157 | effectively 1745 | effectiveness 1198 | 
effects 831 | effected 139 | effecting 85 | effectives 1

new new 13279 | newness 6 | newed 3 | newe 2 | newing 1

order   order 7256 | orders 3125 | ordered 1840 | ordering 758 | orderly 241 | 
order's 17 | orderable 3 | orders_ 1

Imagine users searching for "publication" (~2500 docs) and getting back every 
document that mentions "public" (~10k).  That's a huge problem in many 
circumstances.  Good luck finding the name "newing".


-Original Message-
From: Georgy Nevsky [mailto:gnevsky.cn...@thomasnet.com] 
Sent: Thursday, November 30, 2017 8:31 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

I understand stemming reason. Thank you.

What do you suggest to use for stemming instead of "Porter" ? I guess, it 
wasn't chosen intentionally.

In the best we trust
Georgy Nevsky


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

The initial question wasn't about a phrasal search, but I largely agree that 
diff q parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!

As has been pointed out, use the Solr admin window and the "debug" in the query 
option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified 
by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is 
stemmed to "ship"...hence all of your matches work.  Porter doesn't have rules 
for words ending in "pp", so it doesn't stem "shipp" to "ship".  So, your 
wildcard query is looking for words that start with "shipp", and given that 
"shipping" was stemmed to "ship", it won't find it.  It would find "shippqrs" 
because porter wouldn't know what to do with that 😊

Again, Porter can be very dangerous if it doesn't align with user expectations.



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com]
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken into 
multiple terms ORed together , I believe if the use case requires to perform 
wildcard search on phrases , we would need to store the entire phrase as a 
single term in the index which probably is not happening right now and hence 
are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected , 
however as soon as I do phrase search it fails for the reason as i mentioned 
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky 
wrote:

> I wish to understand if I can do something to get in result term 
> "shipping"
> when search for "shipp*"?
>
> Here is the field definition:
>
>   <field name="..." type="text_en" indexed="true" stored="true" multiValued="false"/>
>
>   <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> Anything else can be important? Most configuration parameters are 
> default to Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>
> -Original Message-
> From: Rick Leir [mailto:rl...@leirtech.com]
> Sent: Thursday, November 30, 2017 7:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Wildcard Search
>
> George,
> When you get those results it could be due to stemming.
>
> Wildcard processing expands your term to multiple terms, OR'd 
> together. It also takes you d

RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
A slightly more refined answer...  In my experience with the systems I've 
worked with, Porter and other stemmers can be useful as a "fallback field" with 
a really low boost, but you should be really careful if you're only searching 
on one field.

Cannot recommend Doug Turnbull and John Berryman's "Relevant Search" enough on 
how to layer fields...among many other great insights: 
https://www.manning.com/books/relevant-search


 -Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, November 30, 2017 9:20 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

At the very least the English possessive filter, which you have.  Great!

Depending on what your query log analysis finds -- perhaps users are pretty 
much only searching on nouns? -- you might consider 
EnglishMinimalStemFilterFactory.

I wouldn't say that porter was or wasn't chosen intentionally.  It may be good 
for some use cases.  However, for the use cases I've seen, it has been 
disastrous.   

I have code that shows "equivalence sets" for analysis chain A vs analysis 
chain B...with some noise...assume same tokenization...  I should probably 
share that code on github or fold it into Luke somehow?  You can see this on a 
one-off basis in the Solr admin window via the Analysis tab, but to see this on 
your corpus/corpora across terms can be eye-opening, and then to cross-check it 
against query logs...quite powerful.


On one corpus, when I compared the same analysis chain A without Porter and B 
with porter, the output is e.g.:

"stemmed\tunstemmed #docs|unstemmed #docs..."

public  public 9834 | publication 1429 | publications 960 | publicly 662 | 
public's 176 | publicize 118 | publicized 107 | publicity 91 | publically 66 | 
publicizing 63 | publication's 6 | publicizes 4 | public_ 1 | publication_ 1 | 
publiced 1

effect  effective 6329 | effect 3157 | effectively 1745 | effectiveness 1198 | 
effects 831 | effected 139 | effecting 85 | effectives 1

new new 13279 | newness 6 | newed 3 | newe 2 | newing 1

order   order 7256 | orders 3125 | ordered 1840 | ordering 758 | orderly 241 | 
order's 17 | orderable 3 | orders_ 1

Imagine users searching for "publication" (~2500 docs) and getting back every 
document that mentions "public" (~10k).  That's a huge problem in many 
circumstances.  Good luck finding the name "newing".


-Original Message-
From: Georgy Nevsky [mailto:gnevsky.cn...@thomasnet.com]
Sent: Thursday, November 30, 2017 8:31 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

I understand stemming reason. Thank you.

What do you suggest to use for stemming instead of "Porter" ? I guess, it 
wasn't chosen intentionally.

In the best we trust
Georgy Nevsky


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

The initial question wasn't about a phrasal search, but I largely agree that 
diff q parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!

As has been pointed out, use the Solr admin window and the "debug" in the query 
option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified 
by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is 
stemmed to "ship"...hence all of your matches work.  Porter doesn't have rules 
for words ending in "pp", so it doesn't stem "shipp" to "ship".  So, your 
wildcard query is looking for words that start with "shipp", and given that 
"shipping" was stemmed to "ship", it won't find it.  It would find "shippqrs" 
because porter wouldn't know what to do with that 😊

Again, Porter can be very dangerous if it doesn't align with user expectations.



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com]
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken into 
multiple terms ORed together , I believe if the use case requires to perform 
wildcard search on phrases , we would need to store the entire phrase as a 
single term in the index which probably is not happening right now and hence 
are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected , 
however as soon as I do phrase search it fails for the reason as i mentioned 
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky 
wrote:

> I wish to understand if I can do something to get in result term 
> "shipping"
> when search for "shipp*"?
>
> Here field definition:
>  multiValued="false"/>
>
>  positionIncrementGap="100">
>   
> 
>  ignoreCase="true"
> words="l

Issue with CDCR bootstrapping in Solr 7.1

2017-11-30 Thread Tom Peters
I'm running into an issue with the initial CDCR bootstrapping of an existing 
index. In short, after turning on CDCR only the leader replica in the target 
data center will have the documents replicated and it will not exist in any of 
the follower replicas in the target data center. All subsequent incremental 
updates made to the source datacenter will appear in all replicas in the target 
data center.

A little more detail:

I have two clusters setup, a source cluster and a target cluster. Each cluster 
has only one shard and three replicas. I used the configuration detailed in the 
Source and Target sections of the reference guide as-is with the exception of 
updating the zkHost 
(https://lucene.apache.org/solr/guide/7_1/cross-data-center-replication-cdcr.html#cdcr-configuration-2).

The source data center has the following nodes:
solr01-a, solr01-b, and solr01-c

The target data center has the following nodes:
solr02-a, solr02-b, and solr02-c

Here are the steps that I've done:

1. Create collection in source and target data centers

2. Add a number of documents to the source data center

3. Verify:

$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
solr01-a: 81
solr01-b: 81
solr01-c: 81
solr02-a: 0
solr02-b: 0
solr02-c: 0

4. Start CDCR:

$ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'

5. See if target data center has received the initial index

$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
solr01-a: 81
solr01-b: 81
solr01-c: 81
solr02-a: 0
solr02-b: 0
solr02-c: 81

note: only -c has received the index

6. Add another document to the source cluster

7. See how many documents are in each node:

$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
solr01-a: 82
solr01-b: 82
solr01-c: 82
solr02-a: 1
solr02-b: 1
solr02-c: 82


As you can see, the initial index only made it to one of the replicas in the 
target data center, but subsequent incremental updates have appeared everywhere 
I would expect. Any help would be greatly appreciated, thanks.



This message and any attachment may contain information that is confidential 
and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
this e-mail or any attached file by anyone other than the intended recipient is 
strictly prohibited. If you have received this message in error, please notify 
the sender by reply email and delete the message and any attachments. Thank you.


RE: [EXTERNAL] - Re: Basic SolrCloud help

2017-11-30 Thread Steve Pruitt
Thanks Shawn, it all mainly made sense.

I took the hint and looked at both solr.in.cmd and solr.in.sh.  Clearly setting 
ZK_HOST is a first step.  I am sure this is explained somewhere, but I 
overlooked it.
From here, once I have Solr installed, I can run the Control Script to upload a
config set, either when creating a collection or independently of creating the
collection.

When I install Solr on the three nodes I have planned, I run the Control Script
and just point it to wherever on disk I have the config set stored.

One question buried in my first missive was about mixing the Solr machines.  I
was thinking of installing Solr on two VMs running CentOS and then making my
third Solr node my local machine, i.e. Windows.  I can't think of why this
could be an issue, as long as everything is set up with the right ZK hosts,
etc.  Does anyone know of any potential issues doing this?

One last clarification.  

Per, " The definition wouldn't come from the running Solr, it would come from 
the *config* that started the running Solr."

I am not sure what "definition" and *config* are referencing.  When I initially 
install Solr it will not have a config set.  I haven't created a Collection 
yet.   The running Solr instance upon initial install has no config yet.  But, 
I think I am not understanding what "definition" and "*config*" mean.

Thanks in advance.


-S


On 11/29/2017 11:44 AM, Steve Pruitt wrote:
> I want ZK to manage the config files.  The config set and the solr.xml file.  
> I wanted to upload them explicitly.
>
> This is where my questions begin.
> I assume I upload the config files prior to starting Solr?

If you're storing solr.xml in ZK, then you need to upload that file before 
starting Solr.  Note that you cannot have server-specific configurations in 
solr.xml if it is in zookeeper -- the exact same solr.xml file will be used for 
all Solr instances connecting to that ZK ensemble.

You can upload the collection configurations either before or after you start 
Solr, but they definitely need to be there before you create collections that 
use them.

> Since I have Solr installed locally, I can use the local scripts to upload
> the config files?
> Looking at the example command for uploading the solr.xml file.
>
> bin/solr zk cp file:local/file/path/to/solr.xml zk:/solr.xml -z 
> localhost:2181
>   
> It lists a single ZK host (example localhost). If uploading to a ZK ensemble, 
> do I list all three hosts as in my case?
> Or, do I send it to one ZK host and ZK makes it available to the other ZK 
> hosts?

I would personally include the entire zkhost string listing all your servers 
and any chroot.  But if you only list one (and the chroot if you're using one), 
then it will *probably* work without any problems. The connection is only 
needed for a few moments with the commandline copies.

> Looking at the example command for uploading the configuration files.
>
> bin/solr zk upconfig -n <name of configset> -d <path to directory with configset>
>
> I see no ZK hosts listed.  I assume this means Solr has been started in cloud 
> mode and already knows the ZK hosts?

The definition wouldn't come from the running Solr, it would come from the 
*config* that started the running Solr.

It would take an exhaustive code review to be SURE about what I'm going to say 
here, but this is how I *think* it works:  If the -z option is not provided, 
the bin/solr script is going to expect ZK_HOST to be defined in the environment 
or the include script.  If you have
*installed* Solr (rather than just extracted the archive and started it), then 
the include script is going to be in /etc/default and will typically be named 
solr.in.sh, but the name could be different if you changed the name of the 
service when you installed it.  If you have just started Solr manually, then it 
will probably be bin/solr.in.sh.

If you have defined ZK_HOST in your include script, then you probably don't 
need the -z option for the solr.xml copy command above either.
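
For example, a typical ZK_HOST line in the include script (host names and the
/solr chroot are placeholders):

    ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr"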

If what I've just said is correct, then I can be reasonably sure it's the case 
for 6.6.x and 7.x, but I do not know for sure with older versions.

> Per the solr.xml file.  When installing Solr, I can leave installed solr.xml 
> file in place?  ZK replaces it with the uploaded version?

As I understand it, if Solr finds solr.xml in zookeeper on startup, it is going 
to use that file, and won't even look for a local copy.

Thanks,
Shawn



Re: does the payload_check query parser have support for simple query parser operators?

2017-11-30 Thread John Anonymous
Ok, thanks.  Do you know if there are any plans to support special syntax
in the future?

On Thu, Nov 30, 2017 at 5:04 AM, Erik Hatcher 
wrote:

> No it doesn’t.   The payload parsers currently just simple tokenize with
> no special syntax supported.
>
>  Erik
>
> > On Nov 30, 2017, at 02:41, John Anonymous  wrote:
> >
> > I would like to use wildcards and fuzzy search with the payload_check
> query
> > parser. Are these supported?
> >
> > {!payload_check f=text payloads='NOUN'}apple~1
> >
> > {!payload_check f=text payloads='NOUN'}app*
> >
> > Thanks
>


Skewed IDF in multi lingual index, again

2017-11-30 Thread Markus Jelsma
Hello,

We already discussed this problem five years ago [1]. In short: documents in 
foreign languages are scored higher for some terms.

It was solved back then by using docCount instead of maxDoc when calculating
idf, and it worked really well! But, probably due to index changes, the problem
is back for some terms, mostly proper nouns, just like five years ago.

We already deboost documents that are not in the user's preference language by
0.7, but in some cases it is not enough. I could go further by reducing that
boost, but that's not what I prefer.

I'd like to know if there are additional tricks to solve the problem.

Many thanks!
Markus

[1] 
http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html


Re: Issue with CDCR bootstrapping in Solr 7.1

2017-11-30 Thread Amrit Sarkar
Hi Tom,

I see what you are saying and I too think this is a bug, but I will confirm
once I check the code. Bootstrapping should happen on all the nodes of the
target.

Meanwhile, can you index more than 100 documents in the source and do the
exact same experiment again? Followers will not copy the entire index of the
leader unless the difference in doc versions is more than
"numRecordsToKeep", which defaults to 100 unless you have modified it in
solrconfig.xml.
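
For reference, that setting lives in the updateLog section of solrconfig.xml; a
sketch with an illustrative larger value:

    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <int name="numRecordsToKeep">1000</int>
    </updateLog>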

Looking forward to your analysis.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Thu, Nov 30, 2017 at 9:03 PM, Tom Peters  wrote:

> I'm running into an issue with the initial CDCR bootstrapping of an
> existing index. In short, after turning on CDCR only the leader replica in
> the target data center will have the documents replicated and it will not
> exist in any of the follower replicas in the target data center. All
> subsequent incremental updates made to the source datacenter will appear in
> all replicas in the target data center.
>
> A little more details:
>
> I have two clusters setup, a source cluster and a target cluster. Each
> cluster has only one shard and three replicas. I used the configuration
> detailed in the Source and Target sections of the reference guide as-is
> with the exception of updating the zkHost (https://lucene.apache.org/
> solr/guide/7_1/cross-data-center-replication-cdcr.html#
> cdcr-configuration-2).
>
> The source data center has the following nodes:
> solr01-a, solr01-b, and solr01-c
>
> The target data center has the following nodes:
> solr02-a, solr02-b, and solr02-c
>
> Here are the steps that I've done:
>
> 1. Create collection in source and target data centers
>
> 2. Add a number of documents to the source data center
>
> 3. Verify:
>
> $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
> solr01-a: 81
> solr01-b: 81
> solr01-c: 81
> solr02-a: 0
> solr02-b: 0
> solr02-c: 0
>
> 4. Start CDCR:
>
> $ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'
>
> 5. See if target data center has received the initial index
>
> $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
> solr01-a: 81
> solr01-b: 81
> solr01-c: 81
> solr02-a: 0
> solr02-b: 0
> solr02-c: 81
>
> note: only -c has received the index
>
> 6. Add another document to the source cluster
>
> 7. See how many documents are in each node:
>
> $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
> solr01-a: 82
> solr01-b: 82
> solr01-c: 82
> solr02-a: 1
> solr02-b: 1
> solr02-c: 82
>
>
> As you can see, the initial index only made it to one of the replicas in
> the target data center, but subsequent incremental updates have appeared
> everywhere I would expect. Any help would be greatly appreciated, thanks.
>
>
>
>


Re: Skewed IDF in multi lingual index, again

2017-11-30 Thread Walter Underwood
I’ve occasionally considered using Unicode language tags (U+E0001 and friends) 
on each term. That would make a term specific to a language, so we would get 
[en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big 
hammer, because it restricts matches to the same language. If the entire 
document is in one language, might as well use a filter query for that 
language. The tags would work for multiple languages in one document.
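
A sketch of that per-language restriction, assuming each document carries a
language field (field and collection names are placeholders):

    curl 'http://localhost:8983/solr/mycollection/select?q=laserjet&fq=language:en'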

Maybe make the untagged term a synonym. For cross-language terms like 
“LaserJet”, the untagged one would have worse idf.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 30, 2017, at 8:14 AM, Markus Jelsma  wrote:
> 
> Hello,
> 
> We already discussed this problem five years ago [1]. In short: documents in 
> foreign languages are scored higher for some terms.
> 
> It was solved back then by using docCount instead of maxDoc when calculating 
> idf, it worked really well! But, probably due to index changes, the problem 
> is back for some terms, mostly proper nouns, well, just like five years ago.
> 
> We already deboost documents by 0.7 that are not in the user's preference 
> language but in some cases it is not enough. I can go on by reducing that 
> boost but that's not what i prefer.
> 
> I'd like to know if there are additional tricks to solve the problem.
> 
> Many thanks!
> Markus
> 
> [1] 
> http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html



RE: Skewed IDF in multi lingual index, again

2017-11-30 Thread Markus Jelsma
This is unfortunately not what we want. Some customers use filters to restrict
the language, but some customers don't. They want to be able to find documents
in all languages, so we use the user's preference to get their local language on
top - except that very relevant documents in foreign languages should still
surface, hence the deboost is not too low.

Thanks,
Markus

 
-Original message-
> From:Walter Underwood 
> Sent: Thursday 30th November 2017 17:29
> To: solr-user@lucene.apache.org
> Subject: Re: Skewed IDF in multi lingual index, again
> 
> I’ve occasionally considered using Unicode language tags (U+E001 and friends) 
> on each term. That would make a term specific to a language, so we would get 
> [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big 
> hammer, because it restricts matches to the same language. If the entire 
> document is in one language, might as well use a filter query for that 
> language. The tags would work for multiple languages in one document.
> 
> Maybe make the untagged term a synonym. For cross-language terms like 
> “LaserJet”, the untagged one would have worse idf.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Nov 30, 2017, at 8:14 AM, Markus Jelsma  
> > wrote:
> > 
> > Hello,
> > 
> > We already discussed this problem five years ago [1]. In short: documents 
> > in foreign languages are scored higher for some terms.
> > 
> > It was solved back then by using docCount instead of maxDoc when 
> > calculating idf, it worked really well! But, probably due to index changes, 
> > the problem is back for some terms, mostly proper nouns, well, just like 
> > five years ago.
> > 
> > We already deboost documents by 0.7 that are not in the user's preference 
> > language but in some cases it is not enough. I can go on by reducing that 
> > boost but that's not what i prefer.
> > 
> > I'd like to know if there are additional tricks to solve the problem.
> > 
> > Many thanks!
> > Markus
> > 
> > [1] 
> > http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
> 
> 


Re: Issue with CDCR bootstrapping in Solr 7.1

2017-11-30 Thread Tom Peters
Hi Amrit,

Starting with more documents doesn't appear to have made a difference. This 
time I tried with >1000 docs. Here are the steps I took:

1. Deleted the collection on both the source and target DCs.

2. Recreated the collections.

3. Indexed >1000 documents on the source data center, hard commit

  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
  solr01-a: 1368
  solr01-b: 1368
  solr01-c: 1368
  solr02-a: 0
  solr02-b: 0
  solr02-c: 0

4. Enabled CDCR and checked docs

  $ curl 'solr01-a:8080/solr/synacor/cdcr?action=START'

  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
  solr01-a: 1368
  solr01-b: 1368
  solr01-c: 1368
  solr02-a: 0
  solr02-b: 0
  solr02-c: 1368

Some additional notes:

* I do not have numRecordsToKeep defined in my solrconfig.xml, so I assume it 
will use the default of 100

* I found a way to get the follower replicas to receive the documents from the 
leader in the target data center, I have to restart the solr instance running 
on that server. Not sure if this information helps at all.

> On Nov 30, 2017, at 11:22 AM, Amrit Sarkar  wrote:
> 
> Hi Tom,
> 
> I see what you are saying and I too think this is a bug, but I will confirm
> once I check the code. Bootstrapping should happen on all the nodes of the
> target.
> 
> Meanwhile, can you index more than 100 documents in the source and do the
> exact same experiment again? Followers will not copy the entire index of the
> leader unless the difference in doc versions is more than
> "numRecordsToKeep", which defaults to 100 unless you have modified it in
> solrconfig.xml.
> 
> Looking forward to your analysis.
> 
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
> 
> On Thu, Nov 30, 2017 at 9:03 PM, Tom Peters  wrote:
> 
>> I'm running into an issue with the initial CDCR bootstrapping of an
>> existing index. In short, after turning on CDCR only the leader replica in
>> the target data center will have the documents replicated and it will not
>> exist in any of the follower replicas in the target data center. All
>> subsequent incremental updates made to the source datacenter will appear in
>> all replicas in the target data center.
>> 
>> A little more details:
>> 
>> I have two clusters setup, a source cluster and a target cluster. Each
>> cluster has only one shard and three replicas. I used the configuration
>> detailed in the Source and Target sections of the reference guide as-is
>> with the exception of updating the zkHost (https://lucene.apache.org/
>> solr/guide/7_1/cross-data-center-replication-cdcr.html#
>> cdcr-configuration-2).
>> 
>> The source data center has the following nodes:
>>solr01-a, solr01-b, and solr01-c
>> 
>> The target data center has the following nodes:
>>solr02-a, solr02-b, and solr02-c
>> 
>> Here are the steps that I've done:
>> 
>> 1. Create collection in source and target data centers
>> 
>> 2. Add a number of documents to the source data center
>> 
>> 3. Verify:
>> 
>>$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>solr01-a: 81
>>solr01-b: 81
>>solr01-c: 81
>>solr02-a: 0
>>solr02-b: 0
>>solr02-c: 0
>> 
>> 4. Start CDCR:
>> 
>>$ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'
>> 
>> 5. See if target data center has received the initial index
>> 
>>$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>solr01-a: 81
>>solr01-b: 81
>>solr01-c: 81
>>solr02-a: 0
>>solr02-b: 0
>>solr02-c: 81
>> 
>>note: only -c has received the index
>> 
>> 6. Add another document to the source cluster
>> 
>> 7. See how many documents are in each node:
>> 
>>$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>solr01-a: 82
>>solr01-b: 82
>>solr01-c: 82
>>solr02-a: 1
>>solr02-b: 1
>>solr02-c: 82
>> 
>> 
>> As you can see, the initial index only made it to one of the replicas in
>> the target data center, but subsequent incremental updates have appeared
>> everywhere I would expect. Any help would be greatly appreciated, thanks.
>> 
>> 
>> 

Re: Skewed IDF in multi lingual index, again

2017-11-30 Thread Walter Underwood
Expanding the query to use both the tagged and untagged term might work. I’m 
not sure the effect would be a lot different than boosting the preferred 
language.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 30, 2017, at 8:35 AM, Markus Jelsma  wrote:
> 
> This is unfortunately not what we want. Some customers use filters to 
> restrict language, but some customers don't. They want to be able to find 
> documents in all languages, so we use user preference to get their local 
> language on top. Except for very relevant documents in foreign languages, 
> hence the deboost is not too low.
> 
> Thanks,
> Markus
> 
> 
> -Original message-
>> From:Walter Underwood 
>> Sent: Thursday 30th November 2017 17:29
>> To: solr-user@lucene.apache.org
>> Subject: Re: Skewed IDF in multi lingual index, again
>> 
>> I’ve occasionally considered using Unicode language tags (U+E0001 and
>> friends) on each term. That would make a term specific to a language, so we 
>> would get [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a 
>> pretty big hammer, because it restricts matches to the same language. If the 
>> entire document is in one language, might as well use a filter query for 
>> that language. The tags would work for multiple languages in one document.
>> 
>> Maybe make the untagged term a synonym. For cross-language terms like 
>> “LaserJet”, the untagged one would have worse idf.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Nov 30, 2017, at 8:14 AM, Markus Jelsma  
>>> wrote:
>>> 
>>> Hello,
>>> 
>>> We already discussed this problem five years ago [1]. In short: documents 
>>> in foreign languages are scored higher for some terms.
>>> 
>>> It was solved back then by using docCount instead of maxDoc when 
>>> calculating idf, it worked really well! But, probably due to index changes, 
>>> the problem is back for some terms, mostly proper nouns, well, just like 
>>> five years ago.
>>> 
>>> We already deboost documents by 0.7 that are not in the user's preference 
>>> language but in some cases it is not enough. I can go on by reducing that 
>>> boost but that's not what I prefer.
>>> 
>>> I'd like to know if there are additional tricks to solve the problem.
>>> 
>>> Many thanks!
>>> Markus
>>> 
>>> [1] 
>>> http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
>> 
>> 



Re: Issue with CDCR bootstrapping in Solr 7.1

2017-11-30 Thread Amrit Sarkar
Tom,

This is very useful:

> I found a way to get the follower replicas to receive the documents from
> the leader in the target data center, I have to restart the solr instance
> running on that server. Not sure if this information helps at all.


You have to issue a hard commit on the target after the bootstrapping is
done. Reloading makes the core open a new searcher. While an explicit commit
is issued at the target leader after the bootstrap is done, the followers are
left unattended even though the docs are copied over.
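
For illustration, a hard commit on the target can be issued through the
update handler (host and collection names as used earlier in this thread):

  $ curl 'solr02-a:8080/solr/mycollection/update?commit=true'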

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Thu, Nov 30, 2017 at 10:06 PM, Tom Peters  wrote:

> Hi Amrit,
>
> Starting with more documents doesn't appear to have made a difference.
> This time I tried with >1000 docs. Here are the steps I took:
>
> 1. Deleted the collection on both the source and target DCs.
>
> 2. Recreated the collections.
>
> 3. Indexed >1000 documents on source data center, hard commit
>
>   $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>   solr01-a: 1368
>   solr01-b: 1368
>   solr01-c: 1368
>   solr02-a: 0
>   solr02-b: 0
>   solr02-c: 0
>
> 4. Enabled CDCR and checked docs
>
>   $ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'
>
>   $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>   solr01-a: 1368
>   solr01-b: 1368
>   solr01-c: 1368
>   solr02-a: 0
>   solr02-b: 0
>   solr02-c: 1368
>
> Some additional notes:
>
> * I do not have numRecordsToKeep defined in my solrconfig.xml, so I assume
> it will use the default of 100
>
> * I found a way to get the follower replicas to receive the documents from
> the leader in the target data center, I have to restart the solr instance
> running on that server. Not sure if this information helps at all.
>
> > On Nov 30, 2017, at 11:22 AM, Amrit Sarkar 
> wrote:
> >
> > Hi Tom,
> >
> > I see what you are saying and I too think this is a bug, but I will
> > confirm once I check the code. Bootstrapping should happen on all the
> > nodes of the target.
> >
> > Meanwhile, can you index more than 100 documents in the source and do the
> > exact same experiment again? Followers will not copy the entire index of
> > the Leader unless the difference in document versions is more than
> > "numRecordsToKeep", which defaults to 100 unless you have modified it in
> > solrconfig.xml.
> >
> > Looking forward to your analysis.
> >
> > Amrit Sarkar
> > Search Engineer
> > Lucidworks, Inc.
> > 415-589-9269
> > www.lucidworks.com
> > Twitter http://twitter.com/lucidworks
> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> > Medium: https://medium.com/@sarkaramrit2
> >
> > On Thu, Nov 30, 2017 at 9:03 PM, Tom Peters  wrote:
> >
> >> I'm running into an issue with the initial CDCR bootstrapping of an
> >> existing index. In short, after turning on CDCR only the leader replica
> in
> >> the target data center will have the documents replicated and it will
> not
> >> exist in any of the follower replicas in the target data center. All
> >> subsequent incremental updates made to the source datacenter will
> appear in
> >> all replicas in the target data center.
> >>
> >> A little more details:
> >>
> >> I have two clusters setup, a source cluster and a target cluster. Each
> >> cluster has only one shard and three replicas. I used the configuration
> >> detailed in the Source and Target sections of the reference guide as-is
> >> with the exception of updating the zkHost (https://lucene.apache.org/
> >> solr/guide/7_1/cross-data-center-replication-cdcr.html#
> >> cdcr-configuration-2).
> >>
> >> The source data center has the following nodes:
> >>solr01-a, solr01-b, and solr01-c
> >>
> >> The target data center has the following nodes:
> >>solr02-a, solr02-b, and solr02-c
> >>
> >> Here are the steps that I've done:
> >>
> >> 1. Create collection in source and target data centers
> >>
> >> 2. Add a number of documents to the source data center
> >>
> >> 3. Verify:
> >>
> >>$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
> >> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound';
> done
> >>solr01-a: 81
> >>solr01-b: 81
> >>solr01-c: 81
> >>solr02-a: 0
> >>solr02-b: 0
> >>solr02-c: 0
> >>
> >> 4. Start CDCR:
> >>
> >>$ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'
> >>
> >> 5. See if target data center has received the initial index
> >>
> >>$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
> >> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound';
> done
> >>solr01-a: 81
> >>solr01-b: 81
> >>solr01-c: 81
> >>solr02-a: 0
> >>solr02-b: 0
> >>solr02-c: 81
> >>
> >>note: only -c has received the index
> >>
> >>

Re: Compile problems with anonymous SimpleCollector in custom request handler

2017-11-30 Thread Tod Olson
Shawn,

Thanks for the response! Yes, that was it, an older version unexpectedly in the 
classpath.

And for the benefit of anyone who searches the list archive with a similar 
debugging need, it's pretty easy to print out the classpath from ant's 
build.xml:

  <!-- the path id and jar locations here are illustrative; use whatever
       your build already defines -->
  <path id="classpath">
    <fileset dir="lib">
      <include name="*.jar"/>
    </fileset>
  </path>

  <target name="print-classpath">
    <pathconvert property="classpathProp" refid="classpath"/>
    <echo>Classpath: ${classpathProp}</echo>
  </target>
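
Running it (the print-classpath target name is from the snippet above):

  $ ant print-classpath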


-Tod

On Nov 29, 2017, at 6:00 PM, Shawn Heisey <apa...@elyograg.org> wrote:

On 11/29/2017 2:27 PM, Tod Olson wrote:
I'm modifying a existing custom request handler for an open source project, and 
am looking for some help with a compile error around an anonymous 
SimpleCollector. The build failure message from ant and the source of the 
specific method are below. I am compiling on a Mac with Java 1.8 and Solr 
6.4.2. There are two things I do not understand.

First:
   [javac] 
/Users/tod/src/vufind-browse-handler/browse-handler/java/org/vufind/solr/handler/BrowseRequestHandler.java:445:
 error: <anonymous SimpleCollector> is not abstract and does
not override abstract method setNextReader(AtomicReaderContext) in Collector
   [javac] db.search(q, new SimpleCollector() {

Based on the javadoc, neither SimpleCollector nor Collector define a 
setNextReader(AtomicReaderContext) method. Grepping through the Lucene 6.4.2 
source reveals neither a setNextReader method (though maybe a couple archaic 
comments), nor an AtomicReaderContext class or interface.



Second:
   [javac] method IndexSearcher.search(Query,Collector) is not applicable
   [javac]   (argument mismatch;  cannot be 
converted to Collector)

How is it that SimpleCollector cannot be converted to Collector? Perhaps this 
is just a consequence of the first error.

For the first error:  What version of Solr/Lucene are you compiling
against?  I have found that Collector *did* have a setNextReader method
up through Lucene 4.10.4, but in 5.0, that method was gone.  I suspect
that what's causing your first problem is that you have older Lucene
jars (4.x or earlier) on your classpath, in addition to a newer version
that you actually want to use for the compile.

I think that can also explain the second problem.  It looks like
SimpleCollector didn't exist in Lucene 4.10, which is the last version
where Collector had setNextReader.  SimpleCollector is mentioned in the
javadoc for Collector as of 5.0, though.

Thanks,
Shawn





Fwd: solr-security-proxy

2017-11-30 Thread Rick Leir
Hi all
I have just been looking at solr-security-proxy, which seems to be a great 
little app to put in front of Solr (link below). But would it make more sense 
to use a whitelist of Solr parameters instead of a blacklist?
Thanks
Rick

https://github.com/dergachev/solr-security-proxy

solr-security-proxy
Node.js based reverse proxy to make a solr instance read-only, rejecting 
requests that have the potential to modify the solr index.
--invalidParams   Block these query params (comma separated)  [default: 
"qt,stream"]


-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Issue with CDCR bootstrapping in Solr 7.1

2017-11-30 Thread Tom Peters
Hi Amrit, I tried issuing hard commits to the various nodes in the target
cluster and it does not appear to cause the follower replicas to receive the
initial index. The only way I can get the replicas to see the original index
is by restarting those nodes (taking care not to restart the leader node,
otherwise it will replicate from one of the replicas that is missing the
index).
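
One way to check which replica is currently the leader before restarting
anything is the Collections API CLUSTERSTATUS action (collection name as used
in this thread):

  $ curl 'solr02-a:8080/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection'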


> On Nov 30, 2017, at 12:16 PM, Amrit Sarkar  wrote:
> 
> Tom,
> 
> This is very useful:
> 
>> I found a way to get the follower replicas to receive the documents from
>> the leader in the target data center, I have to restart the solr instance
>> running on that server. Not sure if this information helps at all.
> 
> 
> You have to issue a hard commit on the target after the bootstrapping is
> done. Reloading makes the core open a new searcher. While an explicit commit
> is issued at the target leader after the bootstrap is done, the followers
> are left unattended even though the docs are copied over.
> 
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
> 
> On Thu, Nov 30, 2017 at 10:06 PM, Tom Peters  wrote:
> 
>> Hi Amrit,
>> 
>> Starting with more documents doesn't appear to have made a difference.
>> This time I tried with >1000 docs. Here are the steps I took:
>> 
>> 1. Deleted the collection on both the source and target DCs.
>> 
>> 2. Recreated the collections.
>> 
>> 3. Indexed >1000 documents on source data center, hard commit
>> 
>>  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>  solr01-a: 1368
>>  solr01-b: 1368
>>  solr01-c: 1368
>>  solr02-a: 0
>>  solr02-b: 0
>>  solr02-c: 0
>> 
>> 4. Enabled CDCR and checked docs
>> 
>>  $ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'
>> 
>>  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>  solr01-a: 1368
>>  solr01-b: 1368
>>  solr01-c: 1368
>>  solr02-a: 0
>>  solr02-b: 0
>>  solr02-c: 1368
>> 
>> Some additional notes:
>> 
>> * I do not have numRecordsToKeep defined in my solrconfig.xml, so I assume
>> it will use the default of 100
>> 
>> * I found a way to get the follower replicas to receive the documents from
>> the leader in the target data center, I have to restart the solr instance
>> running on that server. Not sure if this information helps at all.
>> 
>>> On Nov 30, 2017, at 11:22 AM, Amrit Sarkar 
>> wrote:
>>> 
>>> Hi Tom,
>>> 
>>> I see what you are saying and I too think this is a bug, but I will
>>> confirm once I check the code. Bootstrapping should happen on all the
>>> nodes of the target.
>>> 
>>> Meanwhile, can you index more than 100 documents in the source and do the
>>> exact same experiment again? Followers will not copy the entire index of
>>> the Leader unless the difference in document versions is more than
>>> "numRecordsToKeep", which defaults to 100 unless you have modified it in
>>> solrconfig.xml.
>>> 
>>> Looking forward to your analysis.
>>> 
>>> Amrit Sarkar
>>> Search Engineer
>>> Lucidworks, Inc.
>>> 415-589-9269
>>> www.lucidworks.com
>>> Twitter http://twitter.com/lucidworks
>>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>> Medium: https://medium.com/@sarkaramrit2
>>> 
>>> On Thu, Nov 30, 2017 at 9:03 PM, Tom Peters  wrote:
>>> 
 I'm running into an issue with the initial CDCR bootstrapping of an
 existing index. In short, after turning on CDCR only the leader replica
>> in
 the target data center will have the documents replicated and it will
>> not
 exist in any of the follower replicas in the target data center. All
 subsequent incremental updates made to the source datacenter will
>> appear in
 all replicas in the target data center.
 
 A little more details:
 
 I have two clusters setup, a source cluster and a target cluster. Each
 cluster has only one shard and three replicas. I used the configuration
 detailed in the Source and Target sections of the reference guide as-is
 with the exception of updating the zkHost (https://lucene.apache.org/
 solr/guide/7_1/cross-data-center-replication-cdcr.html#
 cdcr-configuration-2).
 
 The source data center has the following nodes:
   solr01-a, solr01-b, and solr01-c
 
 The target data center has the following nodes:
   solr02-a, solr02-b, and solr02-c
 
 Here are the steps that I've done:
 
 1. Create collection in source and target data centers
 
 2. Add a number of documents to the source data center
 
 3. Verify:
 
   $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
 $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound';
>> done
   solr01-a: 81
   solr01-b: 81
   solr01-c: 81
   solr

Re: Do i need to reindex after changing similarity setting

2017-11-30 Thread Nawab Zada Asad Iqbal
Hi Walter,

I read the following line in the reference docs; what does it mean by "as
long as the global similarity allows it"?

"

A field type may optionally specify a <similarity/> that will be used when
scoring documents that refer to fields with this type, as long as the
"global" similarity for the collection allows it.
"

On Wed, Nov 22, 2017 at 9:11 AM, Nawab Zada Asad Iqbal 
wrote:

> Thanks Walter
>
> On Mon, Nov 20, 2017 at 4:59 PM Walter Underwood 
> wrote:
>
>> Similarity is query time.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Nov 20, 2017, at 4:57 PM, Nawab Zada Asad Iqbal 
>> wrote:
>> >
>> > Hi,
>> >
>> > I want to switch to Classic similarity instead of BM25 (default in
>> solr7).
>> > Do I need to reindex all cores after this? Or is it only a query time
>> > setting?
>> >
>> >
>> > Thanks
>> > Nawab
>>
>>


Re: Do i need to reindex after changing similarity setting

2017-11-30 Thread Nawab Zada Asad Iqbal
This JIRA also sheds some light. There is a discussion of encoding norms
during indexing. The contributor eventually comments that "norms" encoded
by different similarities are compatible with each other.

On Thu, Nov 30, 2017 at 5:12 PM, Nawab Zada Asad Iqbal 
wrote:

> Hi Walter,
>
> I read the following line in the reference docs; what does it mean by "as
> long as the global similarity allows it"?
>
> "
>
> A field type may optionally specify a <similarity/> that will be used
> when scoring documents that refer to fields with this type, as long as the
> "global" similarity for the collection allows it.
> "
>
> On Wed, Nov 22, 2017 at 9:11 AM, Nawab Zada Asad Iqbal 
> wrote:
>
>> Thanks Walter
>>
>> On Mon, Nov 20, 2017 at 4:59 PM Walter Underwood 
>> wrote:
>>
>>> Similarity is query time.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
>>> > On Nov 20, 2017, at 4:57 PM, Nawab Zada Asad Iqbal 
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I want to switch to Classic similarity instead of BM25 (default in
>>> solr7).
>>> > Do I need to reindex all cores after this? Or is it only a query time
>>> > setting?
>>> >
>>> >
>>> > Thanks
>>> > Nawab
>>>
>>>
>


Re: Issue with CDCR bootstrapping in Solr 7.1

2017-11-30 Thread Amrit Sarkar
Tom,

> (taking care not to restart the leader node, otherwise it will replicate
> from one of the replicas that is missing the index).

How is this possible? OK, I will look more into it. I'd appreciate it if
anyone with a similar issue also chimes in.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Fri, Dec 1, 2017 at 4:49 AM, Tom Peters  wrote:

> Hi Amrit, I tried issuing hard commits to the various nodes in the target
> cluster and it does not appear to cause the follower replicas to receive
> the initial index. The only way I can get the replicas to see the original
> index is by restarting those nodes (taking care not to restart the leader
> node, otherwise it will replicate from one of the replicas that is missing
> the index).
>
>
> > On Nov 30, 2017, at 12:16 PM, Amrit Sarkar 
> wrote:
> >
> > Tom,
> >
> > This is very useful:
> >
> >> I found a way to get the follower replicas to receive the documents from
> >> the leader in the target data center, I have to restart the solr
> instance
> >> running on that server. Not sure if this information helps at all.
> >
> >
> > You have to issue a hard commit on the target after the bootstrapping is
> > done. Reloading makes the core open a new searcher. While an explicit
> > commit is issued at the target leader after the bootstrap is done, the
> > followers are left unattended even though the docs are copied over.
> >
> > Amrit Sarkar
> > Search Engineer
> > Lucidworks, Inc.
> > 415-589-9269
> > www.lucidworks.com
> > Twitter http://twitter.com/lucidworks
> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> > Medium: https://medium.com/@sarkaramrit2
> >
> > On Thu, Nov 30, 2017 at 10:06 PM, Tom Peters 
> wrote:
> >
> >> Hi Amrit,
> >>
> >> Starting with more documents doesn't appear to have made a difference.
> >> This time I tried with >1000 docs. Here are the steps I took:
> >>
> >> 1. Deleted the collection on both the source and target DCs.
> >>
> >> 2. Recreated the collections.
> >>
> >> 3. Indexed >1000 documents on source data center, hard commit
> >>
> >>  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
> >> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound';
> done
> >>  solr01-a: 1368
> >>  solr01-b: 1368
> >>  solr01-c: 1368
> >>  solr02-a: 0
> >>  solr02-b: 0
> >>  solr02-c: 0
> >>
> >> 4. Enabled CDCR and checked docs
> >>
> >>  $ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'
> >>
> >>  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
> >> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound';
> done
> >>  solr01-a: 1368
> >>  solr01-b: 1368
> >>  solr01-c: 1368
> >>  solr02-a: 0
> >>  solr02-b: 0
> >>  solr02-c: 1368
> >>
> >> Some additional notes:
> >>
> >> * I do not have numRecordsToKeep defined in my solrconfig.xml, so I
> assume
> >> it will use the default of 100
> >>
> >> * I found a way to get the follower replicas to receive the documents
> from
> >> the leader in the target data center, I have to restart the solr
> instance
> >> running on that server. Not sure if this information helps at all.
> >>
> >>> On Nov 30, 2017, at 11:22 AM, Amrit Sarkar 
> >> wrote:
> >>>
> >>> Hi Tom,
> >>>
> >>> I see what you are saying and I too think this is a bug, but I will
> >>> confirm once I check the code. Bootstrapping should happen on all the
> >>> nodes of the target.
> >>>
> >>> Meanwhile, can you index more than 100 documents in the source and do
> >>> the exact same experiment again? Followers will not copy the entire
> >>> index of the Leader unless the difference in document versions is more
> >>> than "numRecordsToKeep", which defaults to 100 unless you have modified
> >>> it in solrconfig.xml.
> >>>
> >>> Looking forward to your analysis.
> >>>
> >>> Amrit Sarkar
> >>> Search Engineer
> >>> Lucidworks, Inc.
> >>> 415-589-9269
> >>> www.lucidworks.com
> >>> Twitter http://twitter.com/lucidworks
> >>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>> Medium: https://medium.com/@sarkaramrit2
> >>>
> >>> On Thu, Nov 30, 2017 at 9:03 PM, Tom Peters 
> wrote:
> >>>
>  I'm running into an issue with the initial CDCR bootstrapping of an
>  existing index. In short, after turning on CDCR only the leader
> replica
> >> in
>  the target data center will have the documents replicated and it will
> >> not
>  exist in any of the follower replicas in the target data center. All
>  subsequent incremental updates made to the source datacenter will
> >> appear in
>  all replicas in the target data center.
> 
>  A little more details:
> 
>  I have two clusters setup, a source cluster and a target cluster. Each
>  cluster has only one shard and three replicas. I used the
> configuration
>  detailed in the Source and Target sections of the reference guide
> as-is
>  with the exception of updating