Error when adding user for Solr Basic Authentication
Hi,

When I try to add a user for Solr Basic Authentication using the following curl command:

curl --user user:password http://localhost:8983/solr/admin/authentication -H 'Content-type:application/json' -d '{"set-user": {"tom":"TomIsCool", "harry":"HarrysSecret"}}'

I get the following error:

{
  "responseHeader":{
    "status":400,
    "QTime":0},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"No contentStream",
    "code":400}}
curl: (3) [globbing] unmatched brace in column 1
curl: (3) [globbing] unmatched close brace/bracket in column 13

What does this error mean and how should we resolve it? I'm using SolrCloud on Solr 6.4.2.

Regards,
Edwin
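A note on the likely cause: the "[globbing] unmatched brace" lines come from curl itself, not Solr. On a Windows command prompt, single quotes are not quoting characters, so the JSON body gets split into fragments that curl treats as extra URL arguments (hence the globbing errors), and no body reaches Solr (hence "No contentStream"). A sketch of the same command quoted for cmd.exe, with the inner double quotes escaped:

curl --user user:password "http://localhost:8983/solr/admin/authentication" -H "Content-type:application/json" -d "{\"set-user\": {\"tom\":\"TomIsCool\", \"harry\":\"HarrysSecret\"}}"

Alternatively, the JSON can be put in a file and sent with -d @users.json, which sidesteps the shell quoting entirely.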
streaming expressions parallel merge
Hi,

With Solr streaming expressions, is there a way to parallel merge a number of Solr streams? Or a way to apply the parallel function to something like this?

merge(
  search(collection1, ...),
  search(collection2, ...),
  ...
  on="id asc")

Cheers,
Damien.
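For illustration, a hedged sketch of wrapping such a merge in parallel(), assuming each inner search adds partitionKeys so the streams can be partitioned across workers (untested; the exact parameters are in the streaming expressions documentation):

parallel(workerCollection,
         merge(
           search(collection1, q="*:*", fl="id", sort="id asc", partitionKeys="id"),
           search(collection2, q="*:*", fl="id", sort="id asc", partitionKeys="id"),
           on="id asc"),
         workers="4", sort="id asc")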
The resolution of the pf parameter with edismax and dismax is not consistent
Hi:

I have a question about edismax and dismax. I'm using Solr 6.3.0. Both queries below are otherwise identical, but the results differ.

There is a document with: pj_title:word1 word2 word3

1. edismax: q=word1 word2&qf=pj_title&pf=pj_title&defType=edismax
2. dismax: q=word1 word2&qf=pj_title&pf=pj_title&defType=dismax

The second case matches the document completely; the first does not. Why is that? Maybe I have not explained this well, but I tried my best.

Thank you.
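One way to see what each parser actually builds (an illustrative request, assuming the pj_title field above) is to add debugQuery=true and compare the parsedquery entries for the two defType values, since dismax and edismax construct the pf phrase clause differently:

http://localhost:8983/solr/<collection>/select?q=word1%20word2&qf=pj_title&pf=pj_title&defType=edismax&debugQuery=true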
Re: Filtering results by minimum relevancy score
Hi Koji,

Strictly talking about TF-IDF (and BM25, which is an evolution of that approach), I would say it is a weighting function/numerical statistic that can be used for ranking functions. It is based on probabilistic concepts (such as IDF), but it is not a probabilistic function [1]. Indeed, a BM25 score for a term is not assured to fall between 0 and 1.

[1] http://math.stackexchange.com/questions/610165/prove-that-the-bm25-scoring-function-is-probabilistic

---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Re: What does the replication factor parameter in collections api do?
Ok. Thank you for your quick reply. Though I still feel a little uneasy. Why is it possible then to alter replicationFactor via MODIFYCOLLECTION in the collections API? What would be the use case for this parameter at all then?

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, April 12, 2017 19:36
To: solr-user
Subject: Re: What does the replication factor parameter in collections api do?

Really <3>. replicationFactor is used to set up your collection initially; you have to be able to change your topology afterwards, so it's ignored thereafter. Once your replica is added, it's automatically made use of by the collection.

On Wed, Apr 12, 2017 at 9:30 AM, Johannes Knaus wrote:
> Hi,
>
> I am still quite new to Solr. I have the following setup:
> A SolrCloud setup with
> 38 nodes,
> maxShardsPerNode=2,
> implicit routing with routing field,
> and replication factor=2.
>
> Now, I want to add replica. This works fine by first increasing the
> maxShardsPerNode to a higher number and then add replicas.
> So far, so good. I can confirm changes of the maxShardsPerNode parameter and
> added replicas in the Admin UI.
> However, the Solr Admin UI still is showing me a replication factor of 2.
> I am a little confused about what the replicationfactor parameter actually
> does in my case:
>
> 1) What does that mean? Does Solr make use of all replicas I have or only of
> two?
> 2) Do I need to increase the replication factor value as well to really have
> more replicas available and usable? If this is true, do I need to
> restart/reload the collection newly upload configs to Zookeeper or anything
> alike?
> 3) Or is replicationfactor just a parameter that is needed for the first
> start of SolrCloud and can be ignored afterwards?
>
> Thank you very much for your help,
> All the best,
> Johannes
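For reference, replicas are added after creation through an explicit collections API call rather than by raising replicationFactor. An illustrative request (collection and shard names are placeholders; the node parameter is optional, and Solr picks a node itself if it is omitted):

http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=192.168.1.5:8983_solr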
RE: maxDoc ten times greater than numDoc
I have forced a merge yesterday and went back to one segment. One indexer program reindexes (most or all) every 20 minutes orso. There is nothing custom at that particular point. There is no autoCommit, the indexer program is responsible for a hard commit, it is the single source of reindexed data. After one cycle we had two segments, 50 % deleted, as expected. This was stable for many hours and many cycles. For some reason, i now have 2/3 deletes and three segments, now this situation is stable. So the merges do happen, but sometimes they don't. When they don't, the size increases (now three segments, 55 MB). But it appears that number of segments never decreases, and that is what bothers me. I was about to set segmentsPerTier to two but then i realized i can also delete everything prior to indexing as opposed to deleting only items older than the set i am already about to reindex. This strategy works fine with other reindexing programs, they don't suffer this problem. So, it is not solved, but not a problem anymore. Thanks all anyway :) Markus -Original message- > From:Erick Erickson > Sent: Wednesday 12th April 2017 17:51 > To: solr-user > Subject: Re: maxDoc ten times greater than numDoc > > Yes, this is very strange. My bet: you have something > custom, a setting, indexing code, whatever that > is getting in the way. > > Second possibility (really stretching here): your > merge settings are set to 10 segments having to exist > before merging and somehow not all the docs in the > segments are replaced. So until you get to the 10th > re-index (and assuming a single segment is > produced per re-index) the older segments aren't > merged. If that were the case I'd expect to see the > number of deleted docs drop back periodically > then build up again. A real shot in the dark. One way > to test this would be to specify "segmentsPerTier" of, say, > 2 rather than the default 10, see: > https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig > If this were the case I'd expect with a setting of 2 that > your index might have 50% deleted docs, that would at > least tell us whether we're on the right track. > > Take a look at your index on disk. If you're seeing gaps > in the numbering, you are getting merging, it may be > that they're not happening very often. > > And I take it you have no custom code here and you are > doing commits? (hard commits are all that matters > for merging, it doesn't matter whether openSearcher > is set to true or false). > > I just tried the "techproducts" example as follows: > 1> indexed all the sample files with the bin/solr -e techproducts example > 2> started re-indexing the sample docs one at a time with post.jar > > It took a while, but eventually the original segments got merged away so > I doubt it's any weirdness with a small index. > > Speaking of small index, why are you sharding with only > 8K docs? Sharding will probably slow things down for such > a small index. This isn't germane to your question though. > > Best, > Erick > > > On Wed, Apr 12, 2017 at 5:56 AM, Shawn Heisey wrote: > > On 4/12/2017 5:11 AM, Markus Jelsma wrote: > >> One of our 2 shard collections is rather small and gets all its entries > >> reindexed every 20 minutes orso. Now i just noticed maxDoc is ten times > >> greater than numDoc, the merger is never scheduled but settings are > >> default. We just overwrite the existing entries, all of them. 
> >> > >> Here are the stats: > >> > >> Last Modified:12 minutes ago > >> Num Docs: 8336 > >> Max Doc:82362 > >> Heap Memory Usage: -1 > >> Deleted Docs: 74026 > >> Version: 3125 > >> Segment Count: 10 > > > > This discrepancy would typically mean that when you reindex, you're > > indexing MOST of the documents, but not ALL of them, so at least one > > document is still not deleted in each older segment. When segments have > > all their documents deleted, they are automatically removed by Lucene, > > but if there's even one document NOT deleted, the segment will remain > > until it is merged. > > > > There's no information here about how large this core is, but unless the > > documents are REALLY enormous, I'm betting that an optimize would happen > > quickly. With a document count this low and an indexing pattern that > > results in such a large maxdoc, this might be a good time to go against > > general advice and perform an optimize at least once a day. > > > > An alternate idea that would not require optimizes: If the intent is to > > completely rebuild the index, you might want to consider issuing a > > "delete all docs by query" before beginning the indexing process. This > > would ensure that none of the previous documents remain. As long as you > > don't do a commit that opens a new searcher before the indexing is > > complete, clients won't ever know that everything was deleted. > > > >> This is the config: > >> > >> 6.5.0 > >> ${solr.data.dir:} > >>>
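An illustrative form of that delete-everything-first strategy (core name is a placeholder): issue the delete without a commit, reindex, and only commit at the end, so searchers never see an empty index:

curl 'http://localhost:8983/solr/mycore/update' -H 'Content-type:application/json' -d '{"delete": {"query": "*:*"}}'
(reindex all documents here)
curl 'http://localhost:8983/solr/mycore/update?commit=true'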
SOLR - 6.4.0 SolrCore Initialization Failures
Hi,

I have recently moved my cores from Solr 5.1.0 to 6.4.0. I am using a Windows environment. I have large data in the cores: 6 cores in total, with 142 GB of data. All cores migrated perfectly, but one is giving this error:

SolrCore Initialization Failures - core_name: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: JVM Error creating core [core_name]: Java heap space

I checked SOLR_HEAP="8g" in solr.in.sh. Why do I have this problem in only one core?

Thanks.

Regards,
Uchit Patel
Re: maxDoc ten times greater than numDoc
Maybe not every entry got deleted and it was holding up the segment. E.g. a child or parent record abandoned. If, for example, the parent record has a date field and the child does not, then deleting with a date-based query may trigger this. I think there was a bug about abandoned child or something. This is pure speculation of course. Regards, Alex. http://www.solr-start.com/ - Resources for Solr users, new and experienced On 13 April 2017 at 12:54, Markus Jelsma wrote: > I have forced a merge yesterday and went back to one segment. > > One indexer program reindexes (most or all) every 20 minutes orso. There is > nothing custom at that particular point. There is no autoCommit, the indexer > program is responsible for a hard commit, it is the single source of > reindexed data. > > After one cycle we had two segments, 50 % deleted, as expected. This was > stable for many hours and many cycles. For some reason, i now have 2/3 > deletes and three segments, now this situation is stable. So the merges do > happen, but sometimes they don't. When they don't, the size increases (now > three segments, 55 MB). But it appears that number of segments never > decreases, and that is what bothers me. > > I was about to set segmentsPerTier to two but then i realized i can also > delete everything prior to indexing as opposed to deleting only items older > than the set i am already about to reindex. This strategy works fine with > other reindexing programs, they don't suffer this problem. > > So, it is not solved, but not a problem anymore. Thanks all anyway :) > Markus > > -Original message- >> From:Erick Erickson >> Sent: Wednesday 12th April 2017 17:51 >> To: solr-user >> Subject: Re: maxDoc ten times greater than numDoc >> >> Yes, this is very strange. My bet: you have something >> custom, a setting, indexing code, whatever that >> is getting in the way. >> >> Second possibility (really stretching here): your >> merge settings are set to 10 segments having to exist >> before merging and somehow not all the docs in the >> segments are replaced. So until you get to the 10th >> re-index (and assuming a single segment is >> produced per re-index) the older segments aren't >> merged. If that were the case I'd expect to see the >> number of deleted docs drop back periodically >> then build up again. A real shot in the dark. One way >> to test this would be to specify "segmentsPerTier" of, say, >> 2 rather than the default 10, see: >> https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig >> If this were the case I'd expect with a setting of 2 that >> your index might have 50% deleted docs, that would at >> least tell us whether we're on the right track. >> >> Take a look at your index on disk. If you're seeing gaps >> in the numbering, you are getting merging, it may be >> that they're not happening very often. >> >> And I take it you have no custom code here and you are >> doing commits? (hard commits are all that matters >> for merging, it doesn't matter whether openSearcher >> is set to true or false). >> >> I just tried the "techproducts" example as follows: >> 1> indexed all the sample files with the bin/solr -e techproducts example >> 2> started re-indexing the sample docs one at a time with post.jar >> >> It took a while, but eventually the original segments got merged away so >> I doubt it's any weirdness with a small index. >> >> Speaking of small index, why are you sharding with only >> 8K docs? Sharding will probably slow things down for such >> a small index. 
This isn't germane to your question though. >> >> Best, >> Erick >> >> >> On Wed, Apr 12, 2017 at 5:56 AM, Shawn Heisey wrote: >> > On 4/12/2017 5:11 AM, Markus Jelsma wrote: >> >> One of our 2 shard collections is rather small and gets all its entries >> >> reindexed every 20 minutes orso. Now i just noticed maxDoc is ten times >> >> greater than numDoc, the merger is never scheduled but settings are >> >> default. We just overwrite the existing entries, all of them. >> >> >> >> Here are the stats: >> >> >> >> Last Modified:12 minutes ago >> >> Num Docs: 8336 >> >> Max Doc:82362 >> >> Heap Memory Usage: -1 >> >> Deleted Docs: 74026 >> >> Version: 3125 >> >> Segment Count: 10 >> > >> > This discrepancy would typically mean that when you reindex, you're >> > indexing MOST of the documents, but not ALL of them, so at least one >> > document is still not deleted in each older segment. When segments have >> > all their documents deleted, they are automatically removed by Lucene, >> > but if there's even one document NOT deleted, the segment will remain >> > until it is merged. >> > >> > There's no information here about how large this core is, but unless the >> > documents are REALLY enormous, I'm betting that an optimize would happen >> > quickly. With a document count this low and an indexing pattern that >> > results in such a large maxdoc, this might be a good time to go
Re: Using BasicAuth with SolrJ Code
That looks good. can you share the security.json (commenting out anything that's sensitive of course) On Wed, Apr 12, 2017 at 5:10 PM, Zheng Lin Edwin Yeo wrote: > This is what I get when I run the code. > > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error > from server at http://localhost:8983/solr/testing: Expected mime type > application/octet-stream but got text/html. > > > Error 401 require authentication > > HTTP ERROR 401 > Problem accessing /solr/testing/update. Reason: > require authentication > > > > at > org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268) > at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149) > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106) > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71) > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85) > at testing.indexing(testing.java:2939) > at testing.main(testing.java:329) > Exception in thread "main" > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error > from server at http://localhost:8983/solr/testing: Expected mime type > application/octet-stream but got text/html. > > > Error 401 require authentication > > HTTP ERROR 401 > Problem accessing /solr/testing/update. Reason: > require authentication > > > > at > org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279) > at > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268) > at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149) > at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:484) > at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463) > at testing.indexing(testing.java:3063) > at testing.main(testing.java:329) > > Regards, > Edwin > > > On 12 April 2017 at 14:28, Noble Paul wrote: > >> can u paste the stacktrace here >> >> On Tue, Apr 11, 2017 at 1:19 PM, Zheng Lin Edwin Yeo >> wrote: >> > I found from StackOverflow that we should declare it this way: >> > http://stackoverflow.com/questions/43335419/using- >> basicauth-with-solrj-code >> > >> > >> > SolrRequest req = new QueryRequest(new SolrQuery("*:*"));//create a new >> > request object >> > req.setBasicAuthCredentials(userName, password); >> > solrClient.request(req); >> > >> > Is that correct? >> > >> > For this, the NullPointerException is not coming out, but the SolrJ is >> > still not able to get authenticated. I'm still getting Error Code 401 >> even >> > after putting in this code. >> > >> > Any advice on which part of the SolrJ code should we place this code in? >> > >> > Regards, >> > Edwin >> > >> > >> > On 10 April 2017 at 23:50, Zheng Lin Edwin Yeo >> wrote: >> > >> >> Hi, >> >> >> >> I have just set up the Basic Authentication Plugin in Solr 6.4.2 on >> >> SolrCloud, and I am trying to modify my SolrJ code so that the code can >> go >> >> through the authentication and do the indexing. >> >> >> >> I tried using the following code from the Solr Documentation >> >> https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+ >> >> Plugin. 
>> >> >> >> SolrRequest req ;//create a new request object >> >> req.setBasicAuthCredentials(userName, password); >> >> solrClient.request(req); >> >> >> >> However, the code complains that the req is not initialized. >> >> >> >> If I initialized it, it will be initialize as null. >> >> >> >> SolrRequest req = null;//create a new request object >> >> req.setBasicAuthCredentials(userName, password); >> >> solrClient.request(req); >> >> >> >> This will caused a null pointer exception. >> >> Exception in thread "main" java.lang.NullPointerException >> >> >> >> How should we go about putting these codes, so that the error can be >> >> prevented? >> >> >> >> Regards, >> >> Edwin >> >> >> >> >> >> >> >> -- >> - >> Noble Paul >> -- - Noble Paul
Re: What does the replication factor parameter in collections api do?
On 4/13/2017 3:22 AM, Johannes Knaus wrote: > Ok. Thank you for your quick reply. Though I still feel a little > uneasy. Why is it possible then to alter replicationFactor via > MODIFYCOLLECTION in the collections API? What would be the use case > for this parameter at all then? If you use a very specific storage method for your indexes -- HDFS -- then replicationFactor has meaning beyond initial collection creation, in conjunction with the "autoAddReplicas" feature. https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS#RunningSolronHDFS-AutomaticallyAddReplicasinSolrCloud If you are NOT utilizing the very specific HDFS storage engine, then everything you were told applies. With standard storage mechanisms, replicationFactor has zero meaning after initial collection creation, and changing the value will have no effect. Thanks, Shawn
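An illustrative creation call where replicationFactor keeps its meaning on HDFS (names and counts are placeholders):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&autoAddReplicas=true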
Re: SOLR - 6.4.0 SolrCore Initialization Failures
On 4/13/2017 4:37 AM, Uchit Patel wrote: > I have recently moved my cores from SOLR 5.1.0 to 6.4.0. I am using windows > environment. I have large data in cores. I have total 6 cores with total data > 142 GB. All cores are migrated perfectly but one is giving error: > > SolrCore Initialization Failures > >- core_name: > org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: > JVM Error creating core [core_name]: Java heap space > > I checked SOLR_HEAP="8g" in solr.in.sh > Why I have problem in only one core ? Heap space problems affect the entire process, and the reason for needing more heap may not be apparent from logs. An OutOfMemoryError may be thrown from *ANY* part of the program when you run out of heap, even pieces of the program that aren't the actual problem. https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap Thanks, Shawn
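One more thing worth double-checking here, since the poster is on Windows (an educated guess, not confirmed in the thread): solr.in.sh is only read by the Unix start script; bin\solr.cmd reads solr.in.cmd instead, so a heap setting placed in solr.in.sh would never take effect. The equivalent lines in solr.in.cmd would look like:

set SOLR_HEAP=8g

or

set SOLR_JAVA_MEM=-Xms8g -Xmx8g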
Re: Autosuggestion
Thanks, that's very helpful! The third link especially is quite helpful. Is there any recommendation regarding using FST-based vs AnalyzingInfix suggesters? Thanks On Wed, Apr 12, 2017 at 6:23 PM, Andrea Gazzarini wrote: > Hi, > I think you got an old post. I would have a look at the built-in feature, > first. These posts can help you to get a quick overview: > > https://cwiki.apache.org/confluence/display/solr/Suggester > http://alexbenedetti.blogspot.it/2015/07/solr-you-complete-me.html > https://lucidworks.com/2015/03/04/solr-suggester/ > > HTH, > Andrea > > > On 12/04/17 14:43, OTH wrote: > >> Hello, >> >> Is there any recommended way to achieve auto-suggestion in textboxes using >> Solr? >> >> I'm new to Solr, but right now I have achieved this functionality by using >> an example I found online, doing this: >> >> I added a copy field, which is of the following type: >> >>> positionIncrementGap="100"> >> >>> maxGramSize="10"/> >> >> >> >>> maxGramSize="10"/> >> >> >> >> >> In the search box, after each character is typed, the above field is >> queried, and the results are shown in a drop-down list. >> >> However, this is performing quite slow. I'm not sure if that has to do >> with the front-end code, or because I'm not using the recommended approach >> in terms of how I'm using Solr. Is there any other recommended way to use >> Solr to achieve this functionality? >> >> Thanks >> >> >
Re: Using BasicAuth with SolrJ Code
The security.json which I'm using is the default one that is available from the Solr Documentation https://cwiki.apache.org/confluence/display/ solr/Basic+Authentication+Plugin. { "authentication":{ "blockUnknown": true, "class":"solr.BasicAuthPlugin", "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="} }, "authorization":{ "class":"solr.RuleBasedAuthorizationPlugin", "user-role":{"solr":"admin"}, "permissions":[{"name":"security-edit", "role":"admin"}] }} Regards, Edwin On 13 April 2017 at 19:53, Noble Paul wrote: > That looks good. can you share the security.json (commenting out > anything that's sensitive of course) > > On Wed, Apr 12, 2017 at 5:10 PM, Zheng Lin Edwin Yeo > wrote: > > This is what I get when I run the code. > > > > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: > Error > > from server at http://localhost:8983/solr/testing: Expected mime type > > application/octet-stream but got text/html. > > > > > > Error 401 require authentication > > > > HTTP ERROR 401 > > Problem accessing /solr/testing/update. Reason: > > require authentication > > > > > > > > at > > org.apache.solr.client.solrj.impl.HttpSolrClient. > executeMethod(HttpSolrClient.java:578) > > at > > org.apache.solr.client.solrj.impl.HttpSolrClient.request( > HttpSolrClient.java:279) > > at > > org.apache.solr.client.solrj.impl.HttpSolrClient.request( > HttpSolrClient.java:268) > > at org.apache.solr.client.solrj.SolrRequest.process( > SolrRequest.java:149) > > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106) > > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71) > > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85) > > at testing.indexing(testing.java:2939) > > at testing.main(testing.java:329) > > Exception in thread "main" > > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: > Error > > from server at http://localhost:8983/solr/testing: Expected mime type > > application/octet-stream but got text/html. > > > > > > Error 401 require authentication > > > > HTTP ERROR 401 > > Problem accessing /solr/testing/update. Reason: > > require authentication > > > > > > > > at > > org.apache.solr.client.solrj.impl.HttpSolrClient. > executeMethod(HttpSolrClient.java:578) > > at > > org.apache.solr.client.solrj.impl.HttpSolrClient.request( > HttpSolrClient.java:279) > > at > > org.apache.solr.client.solrj.impl.HttpSolrClient.request( > HttpSolrClient.java:268) > > at org.apache.solr.client.solrj.SolrRequest.process( > SolrRequest.java:149) > > at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:484) > > at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463) > > at testing.indexing(testing.java:3063) > > at testing.main(testing.java:329) > > > > Regards, > > Edwin > > > > > > On 12 April 2017 at 14:28, Noble Paul wrote: > > > >> can u paste the stacktrace here > >> > >> On Tue, Apr 11, 2017 at 1:19 PM, Zheng Lin Edwin Yeo > >> wrote: > >> > I found from StackOverflow that we should declare it this way: > >> > http://stackoverflow.com/questions/43335419/using- > >> basicauth-with-solrj-code > >> > > >> > > >> > SolrRequest req = new QueryRequest(new SolrQuery("*:*"));//create a > new > >> > request object > >> > req.setBasicAuthCredentials(userName, password); > >> > solrClient.request(req); > >> > > >> > Is that correct? 
> >> > > >> > For this, the NullPointerException is not coming out, but the SolrJ is > >> > still not able to get authenticated. I'm still getting Error Code 401 > >> even > >> > after putting in this code. > >> > > >> > Any advice on which part of the SolrJ code should we place this code > in? > >> > > >> > Regards, > >> > Edwin > >> > > >> > > >> > On 10 April 2017 at 23:50, Zheng Lin Edwin Yeo > >> wrote: > >> > > >> >> Hi, > >> >> > >> >> I have just set up the Basic Authentication Plugin in Solr 6.4.2 on > >> >> SolrCloud, and I am trying to modify my SolrJ code so that the code > can > >> go > >> >> through the authentication and do the indexing. > >> >> > >> >> I tried using the following code from the Solr Documentation > >> >> https://cwiki.apache.org/confluence/display/solr/Basic+ > Authentication+ > >> >> Plugin. > >> >> > >> >> SolrRequest req ;//create a new request object > >> >> req.setBasicAuthCredentials(userName, password); > >> >> solrClient.request(req); > >> >> > >> >> However, the code complains that the req is not initialized. > >> >> > >> >> If I initialized it, it will be initialize as null. > >> >> > >> >> SolrRequest req = null;//create a new request object > >> >> req.setBasicAuthCredentials(userName, password); > >> >> solrClient.request(req); > >> >> > >> >> This will caused a null pointer exception. > >> >> Exception in thread "main" java.lang.NullPointerException > >> >> > >> >> How should we go about putting these codes, so that the error can
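For the indexing path specifically, a rough SolrJ sketch (collection name, credentials, and fields are placeholders): the convenience methods solrClient.add(...) and solrClient.commit() build their own requests internally and never carry the credentials; using explicit UpdateRequest objects avoids that:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

CloudSolrClient solrClient = new CloudSolrClient.Builder()
    .withZkHost("localhost:9983").build();

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");

// Send the add with credentials attached to the request itself
UpdateRequest update = new UpdateRequest();
update.setBasicAuthCredentials("solr", "SolrRocks");
update.add(doc);
update.process(solrClient, "testing");

// The commit also needs its own authenticated request
UpdateRequest commit = new UpdateRequest();
commit.setBasicAuthCredentials("solr", "SolrRocks");
commit.setAction(UpdateRequest.ACTION.COMMIT, true, true);
commit.process(solrClient, "testing");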
Re: keyword-in-context for PDF document
Apologies, I meant "keyword-in-context".
keyword-in-content for PDF document
If I search for the word "growth" in a PDF, I want to output all the sentences with the word "growth" in them. How can that be done?
Re: Autosuggestion
bq: FST-based vs AnalyzingInfix They are two totally different things. FST-based suggesters are very fast and compact. But they only match from the beginning of the input. AnalyzingInfix creates a "sidecar" index that's searched like a normal index and the _field_ is returned. Thus analyzinginfix can suggest "my dog has fleas" when entering "fleas", but the FST-based suggesters cannot. Best, Erick On Thu, Apr 13, 2017 at 6:24 AM, OTH wrote: > Thanks, that's very helpful! > The third link especially is quite helpful. > Is there any recommendation regarding using FST-based vs AnalyzingInfix > suggesters? > Thanks > > On Wed, Apr 12, 2017 at 6:23 PM, Andrea Gazzarini wrote: > >> Hi, >> I think you got an old post. I would have a look at the built-in feature, >> first. These posts can help you to get a quick overview: >> >> https://cwiki.apache.org/confluence/display/solr/Suggester >> http://alexbenedetti.blogspot.it/2015/07/solr-you-complete-me.html >> https://lucidworks.com/2015/03/04/solr-suggester/ >> >> HTH, >> Andrea >> >> >> On 12/04/17 14:43, OTH wrote: >> >>> Hello, >>> >>> Is there any recommended way to achieve auto-suggestion in textboxes using >>> Solr? >>> >>> I'm new to Solr, but right now I have achieved this functionality by using >>> an example I found online, doing this: >>> >>> I added a copy field, which is of the following type: >>> >>>>> positionIncrementGap="100"> >>> >>>>> maxGramSize="10"/> >>> >>> >>> >>>>> maxGramSize="10"/> >>> >>> >>> >>> >>> In the search box, after each character is typed, the above field is >>> queried, and the results are shown in a drop-down list. >>> >>> However, this is performing quite slow. I'm not sure if that has to do >>> with the front-end code, or because I'm not using the recommended approach >>> in terms of how I'm using Solr. Is there any other recommended way to use >>> Solr to achieve this functionality? >>> >>> Thanks >>> >>> >>
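For comparison, a sketch of an AnalyzingInfix suggester in solrconfig.xml (field and analyzer names are placeholders; swapping lookupImpl to FuzzyLookupFactory gives an FST-based suggester instead):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">infixSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>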
Re: What does the replication factor parameter in collections api do?
bq: Why is it possible then to alter replicationFactor via MODIFYCOLLECTION in the collections API Because MODIFYCOLLECTION just changes properties in the collection definition generically and replicationFactor just happens to be one. IOW there's no overarching reason. It would be extra work to dis-allow that one case and possibly introduce errors without changing any functionality so nobody was willing to put in the effort. Best, Erick On Thu, Apr 13, 2017 at 5:48 AM, Shawn Heisey wrote: > On 4/13/2017 3:22 AM, Johannes Knaus wrote: >> Ok. Thank you for your quick reply. Though I still feel a little >> uneasy. Why is it possible then to alter replicationFactor via >> MODIFYCOLLECTION in the collections API? What would be the use case >> for this parameter at all then? > > If you use a very specific storage method for your indexes -- HDFS -- > then replicationFactor has meaning beyond initial collection creation, > in conjunction with the "autoAddReplicas" feature. > > https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS#RunningSolronHDFS-AutomaticallyAddReplicasinSolrCloud > > If you are NOT utilizing the very specific HDFS storage engine, then > everything you were told applies. With standard storage mechanisms, > replicationFactor has zero meaning after initial collection creation, > and changing the value will have no effect. > > Thanks, > Shawn >
Re: maxDoc ten times greater than numDoc
If you want to be brave Through a clever bit of reflection, the parameters that TieredMergePolicy uses to decide what segments to reclaim are settable in solrconfig.xml (undocumented, so use at your own risk). You could try bumping reclaimDeletesWeight in your TieredMergePolicy configuration if you wanted to experiment. There's no good reason not to set your segments per tier, it won't hurt. But as you say you have a solution so this is just for curiosity's sake. Best, Erick On Thu, Apr 13, 2017 at 4:42 AM, Alexandre Rafalovitch wrote: > Maybe not every entry got deleted and it was holding up the segment. > E.g. a child or parent record abandoned. If, for example, the parent > record has a date field and the child does not, then deleting with a > date-based query may trigger this. I think there was a bug about > abandoned child or something. > > This is pure speculation of course. > > Regards, >Alex. > > http://www.solr-start.com/ - Resources for Solr users, new and experienced > > > On 13 April 2017 at 12:54, Markus Jelsma wrote: >> I have forced a merge yesterday and went back to one segment. >> >> One indexer program reindexes (most or all) every 20 minutes orso. There is >> nothing custom at that particular point. There is no autoCommit, the indexer >> program is responsible for a hard commit, it is the single source of >> reindexed data. >> >> After one cycle we had two segments, 50 % deleted, as expected. This was >> stable for many hours and many cycles. For some reason, i now have 2/3 >> deletes and three segments, now this situation is stable. So the merges do >> happen, but sometimes they don't. When they don't, the size increases (now >> three segments, 55 MB). But it appears that number of segments never >> decreases, and that is what bothers me. >> >> I was about to set segmentsPerTier to two but then i realized i can also >> delete everything prior to indexing as opposed to deleting only items older >> than the set i am already about to reindex. This strategy works fine with >> other reindexing programs, they don't suffer this problem. >> >> So, it is not solved, but not a problem anymore. Thanks all anyway :) >> Markus >> >> -Original message- >>> From:Erick Erickson >>> Sent: Wednesday 12th April 2017 17:51 >>> To: solr-user >>> Subject: Re: maxDoc ten times greater than numDoc >>> >>> Yes, this is very strange. My bet: you have something >>> custom, a setting, indexing code, whatever that >>> is getting in the way. >>> >>> Second possibility (really stretching here): your >>> merge settings are set to 10 segments having to exist >>> before merging and somehow not all the docs in the >>> segments are replaced. So until you get to the 10th >>> re-index (and assuming a single segment is >>> produced per re-index) the older segments aren't >>> merged. If that were the case I'd expect to see the >>> number of deleted docs drop back periodically >>> then build up again. A real shot in the dark. One way >>> to test this would be to specify "segmentsPerTier" of, say, >>> 2 rather than the default 10, see: >>> https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig >>> If this were the case I'd expect with a setting of 2 that >>> your index might have 50% deleted docs, that would at >>> least tell us whether we're on the right track. >>> >>> Take a look at your index on disk. If you're seeing gaps >>> in the numbering, you are getting merging, it may be >>> that they're not happening very often. 
>>> >>> And I take it you have no custom code here and you are >>> doing commits? (hard commits are all that matters >>> for merging, it doesn't matter whether openSearcher >>> is set to true or false). >>> >>> I just tried the "techproducts" example as follows: >>> 1> indexed all the sample files with the bin/solr -e techproducts example >>> 2> started re-indexing the sample docs one at a time with post.jar >>> >>> It took a while, but eventually the original segments got merged away so >>> I doubt it's any weirdness with a small index. >>> >>> Speaking of small index, why are you sharding with only >>> 8K docs? Sharding will probably slow things down for such >>> a small index. This isn't germane to your question though. >>> >>> Best, >>> Erick >>> >>> >>> On Wed, Apr 12, 2017 at 5:56 AM, Shawn Heisey wrote: >>> > On 4/12/2017 5:11 AM, Markus Jelsma wrote: >>> >> One of our 2 shard collections is rather small and gets all its entries >>> >> reindexed every 20 minutes orso. Now i just noticed maxDoc is ten times >>> >> greater than numDoc, the merger is never scheduled but settings are >>> >> default. We just overwrite the existing entries, all of them. >>> >> >>> >> Here are the stats: >>> >> >>> >> Last Modified:12 minutes ago >>> >> Num Docs: 8336 >>> >> Max Doc:82362 >>> >> Heap Memory Usage: -1 >>> >> Deleted Docs: 74026 >>> >> Version: 3125 >>> >> Segment Count: 10 >>> > >>> > This discrepancy would typical
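For reference, a sketch of where those knobs sit in solrconfig.xml (values are illustrative; reclaimDeletesWeight is the undocumented, use-at-your-own-risk one mentioned above):

<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="segmentsPerTier">2</int>
    <double name="reclaimDeletesWeight">3.0</double>
  </mergePolicyFactory>
</indexConfig>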
Re: Grouped Result sort issue
I had the chance to do some investigation on the code side, and I basically confirm what Erick hypothesized and what Diego Ceccarelli mentioned in this other thread [1].

Grouping happens with a two-phase collector strategy:
1) The first phase retrieves and sorts the groups.
2) The second phase retrieves the top documents per group and sorts them.

The phases are independent, so the documents you retrieve in phase 2 don't affect the order of the groups (that was established in phase 1). Specifically for phase 1, we keep for each group the most representative value(s). If the sort is by score asc, for each group the min score is stored; if the sort is by score desc, for each group the max score is stored. Then, when ordering the groups, we just add them to a TreeSet using the field comparator on those values.

To conclude: when retrieving one doc per group as a flat list, this behaviour may sound counter-intuitive. I guess you should use field collapsing [2], and you should see behavior consistent with what you expect.

Cheers

[1] http://lucene.472066.n3.nabble.com/Question-about-grouping-in-distribute-mode-td4327679.html
[2] https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
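An illustrative request using the collapsing query parser instead of grouping (field name is a placeholder): the result is a single flat list of group heads, so the global sort applies to them directly:

q=foo&fq={!collapse field=group_field}&sort=score desc&expand=true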
Re: keyword-in-content for PDF document
With great difficulty. PDF does not usually preserve the text flow, it uses instead absolute positioning for text fragments. Extraction will try to approximate the right thing, but it is an approximation. And if you have two columns, it is harder again. Some documents may have accessibility layer, which would help. I'd start from using Tika (or extract handler with extractOnly=true) on the documents you have and seeing what comes out. See https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika Then you have to figure out whether you are searching just a word or across the sentence boundaries. You could probably (somehow) split on sentence boundary if you want to store each sentence as a value in a multivalued field. Or you could try using highlighter to return only the sentence. Of course, defining the sentence boundary is a lot trickier than it seems at first.. (eg. "He works for B.B.C.") Regards, Alex. http://www.solr-start.com/ - Resources for Solr users, new and experienced On 13 April 2017 at 15:54, ankur wrote: > If i am search for word "growth" in a PDF, i want to output all the sentences > with the word "growth" in it. > > How can that be done? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/keyword-in-content-for-PDF-document-tp4329754.html > Sent from the Solr - User mailing list archive at Nabble.com.
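An illustrative extract-only call to inspect what Tika recovers from a given PDF before worrying about sentences (collection and path are placeholders):

curl 'http://localhost:8983/solr/mycollection/update/extract?extractOnly=true&wt=json' -F 'myfile=@/path/to/file.pdf'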
Re: Filtering results by minimum relevancy score
BM25 came out of work on probabilistic engines, but using BM25 in Solr doesn’t automatically make it probabilistic. I read a paper once that showed the two models are not that different, maybe by Karen Sparck-Jones. Still, even with a probabilistic model, relevance cutoffs don’t work. It is still too easy for a good match to have a low score. We’re back to increasing the good hits vs reducing the bad hits. You really only achieve one of those two. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 12, 2017, at 7:41 PM, Koji Sekiguchi > wrote: > > Hi Walter, > > May I ask a tangential question? I'm curious the following line you wrote: > > > Solr is a vector-space engine. Some early engines (Verity VDK) were > > probabilistic engines. Those do give an absolute estimate of the relevance > > of each hit. Unfortunately, the relevance of results is just not as good as > > vector-space engines. So, probabilistic engines are mostly dead. > > Can you elaborate this? > > I thought Okapi BM25, which is the default Similarity on Solr, is based on > the probabilistic > model. Did you mean that Lucene/Solr is still based on vector space model but > they built > BM25Similarity on top of it and therefore, BM25Similarity is not pure > probabilistic scoring > system or Okapi BM25 is not originally probabilistic? > > As for me, I prefer the idea of vector space than probabilistic for the > information retrieval, > and I stick with ClassicSimilarity for my projects. > > Thanks, > > Koji > > > On 2017/04/13 4:08, Walter Underwood wrote: >> Fine. It can’t be done. If it was easy, Solr/Lucene would already have the >> feature, right? >> Solr is a vector-space engine. Some early engines (Verity VDK) were >> probabilistic engines. Those do give an absolute estimate of the relevance >> of each hit. Unfortunately, the relevance of results is just not as good as >> vector-space engines. So, probabilistic engines are mostly dead. >> But, “you don’t want to do it” is very good advice. Instead of trying to >> reduce bad hits, work on increasing good hits. It is really hard, sometimes >> not possible, to optimize both. Increasing the good hits makes your >> customers happy. Reducing the bad hits makes your UX team happy. >> Here is a process. Start collecting the clicks on the search results page >> (SRP) with each query. Look at queries that have below average clickthrough. >> See if those can be combined into categories, then address each category. >> Some categories that I have used: >> * One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all >> valid. Use synonyms or shingles (and maybe the word delimiter filter) to >> match these. >> * Misspellings. These should be about 10% of queries. Use fuzzy matching. I >> recommend the patch in SOLR-629. >> * Alternate vocabulary. You sell a “laptop”, but people call it a >> “notebook”. People search for “kids movies”, but your movie genre is >> “Children and Family”. Use synonyms. >> * Missing content. People can’t find anything about beach parking because >> there isn’t a page about that. Instead, there are scraps of info about beach >> parking in multiple other pages. Fix the content. >> wunder >> Walter Underwood >> wun...@wunderwood.org >> http://observer.wunderwood.org/ (my blog) >>> On Apr 12, 2017, at 11:44 AM, David Kramer wrote: >>> >>> The idea is to not return poorly matching results, not to limit the number >>> of results returned. 
One query may have hundreds of excellent matches and >>> another query may have 7. So cutting off by the number of results is >>> trivial but not useful. >>> >>> Again, we are not doing this for performance reasons. We’re doing this >>> because we don’t want to show products that are not very relevant to the >>> search terms specified by the user for UX reasons. >>> >>> I had hoped that the responses would have been more focused on “it’ can’t >>> be done” or “here’s how to do it” than “you don’t want to do it”. I’m >>> still left not knowing if it’s even possible. The one concrete answer of >>> using frange doesn’t help as referencing score in either the q or the fq >>> produces an “undefined field” error. >>> >>> Thanks. >>> >>> On 4/11/17, 8:59 AM, "Dorian Hoxha" wrote: >>> >>>Can't the filter be used in cases when you're paginating in >>>sharded-scenario ? >>>So if you do limit=10, offset=10, each shard will return 20 docs ? >>>While if you do limit=10, _score<=last_page.min_score, then each shard >>> will >>>return 10 docs ? (they will still score all docs, but merging will be >>>faster) >>> >>>Makes sense ? >>> >>>On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti >>> >>> wrote: >>> Can i ask what is the final requirement here ? What are you trying to do ? - just display less results ? you can easily do at search client time, cutting after a certain amount >>>
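On the frange error mentioned above: score is indeed not addressable as a field, but the usual workaround is to wrap the main query in the query() function so its score becomes a function value. A hedged sketch:

q=foo bar&fq={!frange l=0.5}query($q)

The caveats in this thread still apply: raw scores are not normalized, so a fixed threshold like 0.5 will behave differently from one query to the next.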
Re: keyword-in-content for PDF document
Thanks Alex. Yes, I am using Tika, so to some extent it preserves the text flow.

There is something interesting in your reply: "Or you could try using highlighter to return only the sentence."

I didn't understand that bit. How do we use the highlighter to return the sentence?

To be clear, I want to return all sentences where the word "growth" appears.
RE: keyword-in-content for PDF document
If you don't care about sentence boundaries, but just want a window around target terms and you want concordance functionality (sort before, after, etc), you might check out LUCENE-5317, which is available as a standalone jar on my github site [1] and is available through maven central. Using a highlighter, too, will get you close. See a crummy image of LUCENE-5317 [2] or the full presentation [3] [1] https://github.com/tballison/lucene-addons/tree/6.5-0.1 [2] https://twitter.com/_tallison/status/852492398793981952 [3] https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf slide 23ff. -Original Message- From: ankur [mailto:ankur.sancheti.netw...@gmail.com] Sent: Thursday, April 13, 2017 12:08 PM To: solr-user@lucene.apache.org Subject: Re: keyword-in-content for PDF document Thanks Alex. Yes, I am using TIKA. So, to some extent it preserves the text flow. There is something interesting in your reply, "Or you could try using highlighter to return only the sentence. ". I didnt understand that bit. How do we use Highlighter to return the sentence? To make sure, I want to return all sentences where the word "Growth" appears. -- View this message in context: http://lucene.472066.n3.nabble.com/keyword-in-context-for-PDF-document-tp4329754p4329794.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: keyword-in-content for PDF document
The boundary scanner supports sentence as per: https://cwiki.apache.org/confluence/display/solr/Highlighting So, the word in context should - if I remember correctly - give you the sentence that word is in even if the field has longer text. Regards, Alex. http://www.solr-start.com/ - Resources for Solr users, new and experienced On 13 April 2017 at 19:07, ankur wrote: > Thanks Alex. Yes, I am using TIKA. So, to some extent it preserves the text > flow. > > There is something interesting in your reply, "Or you could try using > highlighter to return only > the sentence. ". > > I didnt understand that bit. How do we use Highlighter to return the > sentence? > > To make sure, I want to return all sentences where the word "Growth" > appears. > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/keyword-in-context-for-PDF-document-tp4329754p4329794.html > Sent from the Solr - User mailing list archive at Nabble.com.
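An illustrative parameter set for that (Solr 6.x syntax; field name is a placeholder): the sentence boundary scanner is used by the FastVector highlighter, which needs termVectors, termPositions, and termOffsets enabled on the field:

hl=true&hl.fl=content&hl.q=growth&hl.useFastVectorHighlighter=true&hl.boundaryScanner=breakIterator&hl.bs.type=SENTENCE&hl.snippets=100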
keywords not found - google like feature
Hello All,

When we search Google, it sometimes returns results with a mention of keywords that were not found (shown as strike-through).

Does Solr provide such a feature?

Thanks,
Nilesh Kamani
Re: Long GC pauses while reading Solr docs using Cursor approach
Hi Shawn, Thanks for the insights into the memory requirements. Looks like cursor approach is going to require a lot of memory for millions of documents. If I run a query that returns only 500K documents still keeping 100K docs per page, I don't see long GC pauses. So it is not really the number of rows per page but the overall number of docs. May be I can reduce the document cache and the field cache. What do you think? Erick, I was using the streaming approach to get back results from Solr but I was running into some run time exceptions. That bug has been fixed in solr 6.0. But because of some reasons, I won't be able to move to Java 8 and hence I will have to stick to solr 5.5.0. That is the reason I had to switch to the cursor approach. Thanks! On Wed, Apr 12, 2017 at 8:37 PM, Erick Erickson wrote: > You're missing the point of my comment. Since they already are > docValues, you can use the /export functionality to get the results > back as a _stream_ and avoid all of the overhead of the aggregator > node doing a merge sort and all of that. > > You'll have to do this from SolrJ, but see CloudSolrStream. You can > see examples of its usage in StreamingTest.java. > > this should > 1> complete much, much faster. The design goal is 400K rows/second but YMMV > 2> use vastly less memory on your Solr instances. > 3> only require _one_ query > > Best, > Erick > > On Wed, Apr 12, 2017 at 7:36 PM, Shawn Heisey wrote: > > On 4/12/2017 5:19 PM, Chetas Joshi wrote: > >> I am getting back 100K results per page. > >> The fields have docValues enabled and I am getting sorted results based > on "id" and 2 more fields (String: 32 Bytes and Long: 8 Bytes). > >> > >> I have a solr Cloud of 80 nodes. There will be one shard that will get > top 100K docs from each shard and apply merge sort. So, the max memory > usage of any shard could be 40 bytes * 100K * 80 = 320 MB. Why would heap > memory usage shoot up from 8 GB to 17 GB? > > > > From what I understand, Java overhead for a String object is 56 bytes > > above the actual byte size of the string itself. And each character in > > the string will be two bytes -- Java uses UTF-16 for character > > representation internally. If I'm right about these numbers, it means > > that each of those id values will take 120 bytes -- and that doesn't > > include the size the actual response (xml, json, etc). > > > > I don't know what the overhead for a long is, but you can be sure that > > it's going to take more than eight bytes total memory usage for each one. > > > > Then there is overhead for all the Lucene memory structures required to > > execute the query and gather results, plus Solr memory structures to > > keep track of everything. I have absolutely no idea how much memory > > Lucene and Solr use to accomplish a query, but it's not going to be > > small when you have 200 million documents per shard. > > > > Speaking of Solr memory requirements, under normal query circumstances > > the aggregating node is going to receive at least 100K results from > > *every* shard in the collection, which it will condense down to the > > final result with 100K entries. The behavior during a cursor-based > > request may be more memory-efficient than what I have described, but I > > am unsure whether that is the case. > > > > If the cursor behavior is not more efficient, then each entry in those > > results will contain the uniqueKey value and the score. That's going to > > be many megabytes for every shard. 
If there are 80 shards, it would > > probably be over a gigabyte for one request. > > > > Thanks, > > Shawn > > >
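A rough SolrJ sketch of the /export streaming pattern described above (zkHost, collection, and fields are placeholders; constructor signatures vary somewhat between 5.x and 6.x):

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "*:*");
params.set("fl", "id,f1,f2");
params.set("sort", "id asc");
params.set("qt", "/export");   // stream from the export handler

CloudSolrStream stream = new CloudSolrStream("localhost:9983", "collection1", params);
try {
  stream.open();
  Tuple tuple = stream.read();
  while (!tuple.EOF) {         // the EOF tuple marks the end of the stream
    String id = tuple.getString("id");
    // ... process the tuple ...
    tuple = stream.read();
  }
} finally {
  stream.close();
}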
Need help with auto-suggester
Hello, I've followed the steps here to set up auto-suggest: https://lucidworks.com/2015/03/04/solr-suggester/ So basically I configured the auto-suggester in solrconfig.xml, where I told it which field in my index needs to be used for auto-suggestion. The problem is: When the user searches in the text box in the front end, if they are searching for cities, I also need the countries to appear in the drop-down list which the user sees. The field which is being searched is only 'city' here. However, I need to retrieve the corresponding value in the 'country' field as well. How could I do this using the suggester? Thanks
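One hedged way to carry the country along, assuming a DocumentDictionaryFactory-based suggester (field names are placeholders): the dictionary supports a payloadField, and the payload is returned with each suggestion, so the client can show the city plus its country:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">citySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">city</str>
    <str name="payloadField">country</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>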
Re: Autosuggestion
Hello So, from what I've picked up so far: FST only matches from the beginning of the input, but can handle spelling errors and do stemming. AnalyzingInfix can't handle spelling errors or stemming but can match from the middle of the string. (Is there anyway to achieve both of the functionalities above, if need be?) Performance-wise, FST's are faster and more compact? Thanks On Thu, Apr 13, 2017 at 7:57 PM, Erick Erickson wrote: > bq: FST-based vs AnalyzingInfix > > They are two totally different things. FST-based suggesters are very > fast and compact. But they only match from the beginning of the input. > > AnalyzingInfix creates a "sidecar" index that's searched like a normal > index and the _field_ is returned. Thus analyzinginfix can suggest > "my dog has fleas" when entering "fleas", but the FST-based suggesters > cannot. > > Best, > Erick > > On Thu, Apr 13, 2017 at 6:24 AM, OTH wrote: > > Thanks, that's very helpful! > > The third link especially is quite helpful. > > Is there any recommendation regarding using FST-based vs AnalyzingInfix > > suggesters? > > Thanks > > > > On Wed, Apr 12, 2017 at 6:23 PM, Andrea Gazzarini > wrote: > > > >> Hi, > >> I think you got an old post. I would have a look at the built-in > feature, > >> first. These posts can help you to get a quick overview: > >> > >> https://cwiki.apache.org/confluence/display/solr/Suggester > >> http://alexbenedetti.blogspot.it/2015/07/solr-you-complete-me.html > >> https://lucidworks.com/2015/03/04/solr-suggester/ > >> > >> HTH, > >> Andrea > >> > >> > >> On 12/04/17 14:43, OTH wrote: > >> > >>> Hello, > >>> > >>> Is there any recommended way to achieve auto-suggestion in textboxes > using > >>> Solr? > >>> > >>> I'm new to Solr, but right now I have achieved this functionality by > using > >>> an example I found online, doing this: > >>> > >>> I added a copy field, which is of the following type: > >>> > >>> >>> positionIncrementGap="100"> > >>> > >>> >>> maxGramSize="10"/> > >>> > >>> > >>> > >>> minGramSize="2" > >>> maxGramSize="10"/> > >>> > >>> > >>> > >>> > >>> In the search box, after each character is typed, the above field is > >>> queried, and the results are shown in a drop-down list. > >>> > >>> However, this is performing quite slow. I'm not sure if that has to do > >>> with the front-end code, or because I'm not using the recommended > approach > >>> in terms of how I'm using Solr. Is there any other recommended way to > use > >>> Solr to achieve this functionality? > >>> > >>> Thanks > >>> > >>> > >> >
Re: keywords not found - google like feature
Something like this. Does Solr have such a feature?

[image: Inline image 1]

On Thu, Apr 13, 2017 at 1:49 PM, Nilesh Kamani wrote:
> Hello All,
>
> When we search google, sometimes google returns results with mention of
> keywords not found (mentioned as strike-through)
>
> Does Solr provide such feature ?
>
>
> Thanks,
> Nilesh Kamani
Re: keywords not found - google like feature
Pasted images are generally stripped out, you'll have to provide an external link. On Thu, Apr 13, 2017 at 12:04 PM, Nilesh Kamani wrote: > Something like this. Does SOLR have such feature ? > > [image: Inline image 1] > > On Thu, Apr 13, 2017 at 1:49 PM, Nilesh Kamani > wrote: > >> Hello All, >> >> When we search google, sometimes google returns results with mention of >> keywords not found (mentioned as strike-through) >> >> Does Solr provide such feature ? >> >> >> Thanks, >> Nilesh Kamani >> > >
Re: keywords not found - google like feature
Are you asking about a visual representation or an actual feature? Because if all your keywords/clauses are optional (the default SHOULD), then Solr automatically tries to match the maximum number of them, and then fewer and fewer. So if not all words match, it will return results that match a smaller number of words.

The words not matched are effectively your strike-through negative space. You can probably recover that from the debug info, though it will not be pretty and will perhaps be a bit slower.

The real issue here is ranking. Does Google do something special with ranking when they do strike-through? Do they do some grouping and ranking within groups, not just a global one?

The biggest question is - of course - what is your business objective, as opposed to a look-alike objective. Explaining your needs through similarity with another product's secret implementation is a long way to get there: too much precision is lost in each explanation round.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced

On 13 April 2017 at 20:49, Nilesh Kamani wrote:
> Hello All,
>
> When we search google, sometimes google returns results with mention of
> keywords not found (mentioned as strike-through)
>
> Does Solr provide such feature ?
>
>
> Thanks,
> Nilesh Kamani
Re: keywords not found - google like feature
Here is the example:
https://www.google.ca/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#safe=off&q=solr+spring+trump

You will see this under the search results: Missing: trump

I am not asking for a visual representation of such a feature. Is there any way Solr can return such info in the response? My client has a specific requirement: when he searches, he wants to know which keywords were not found in the results.

On Thu, Apr 13, 2017 at 3:34 PM, Alexandre Rafalovitch wrote:
> Are you asking visual representation or an actual feature. Because if
> all your keywords/clauses are optional (default SHOULD) then Solr
> automatically tries to match maximum number of them and then less and
> less. So, if all words do not match, it will return results that match
> less number of words.
>
> And words not-matched is effectively your strike-through negative
> space. You can probably recover that from debug info, though it will
> be not pretty and perhaps a bit slower.
>
> The real issue here is ranking. Does Google do something special with
> ranking when they do strike through. Do they do some grouping and
> ranking within groups, not just a global one?
>
> The biggest question is - of course - what is your business - as
> opposed to look-alike - objective. Because explaining your needs
> through a similarity with other product's secret implementation is a
> long way to get there. Too much precision loss in each explanation
> round.
>
> Regards,
>    Alex.
>
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 13 April 2017 at 20:49, Nilesh Kamani wrote:
> > Hello All,
> >
> > When we search google, sometimes google returns results with mention of
> > keywords not found (mentioned as strike-through)
> >
> > Does Solr provide such feature ?
> >
> >
> > Thanks,
> > Nilesh Kamani
RE: keywords not found - google like feature
Hi - There is no such feature out-of-the-box in Solr. But you probably could modify a highlighter implementation to return this information; the highlighter is the component that comes closest to that feature. Regards, Markus -Original message- > From:Nilesh Kamani > Sent: Thursday 13th April 2017 21:52 > To: solr-user@lucene.apache.org > Subject: Re: keywords not found - google like feature > > Here is the example. > https://www.google.ca/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#safe=off&q=solr+spring+trump > > You will see this under search results. Missing: trump > > I am not asking for visual representation of such feature. > Is there anyway solr is returning such info in response ? > My client has this specific requirements that when he searches he wants to > know what keywords were not found in results. > > > > > On Thu, Apr 13, 2017 at 3:34 PM, Alexandre Rafalovitch > wrote: > > > Are you asking visual representation or an actual feature. Because if > > all your keywords/clauses are optional (default SHOULD) then Solr > > automatically tries to match maximum number of them and then less and > > less. So, if all words do not match, it will return results that match > > less number of words. > > > > And words not-matched is effectively your strike-through negative > > space. You can probably recover that from debug info, though it will > > be not pretty and perhaps a bit slower. > > > > The real issue here is ranking. Does Google do something special with > > ranking when they do strike through. Do they do some grouping and > > ranking within groups, not just a global one? > > > > The biggest question is - of course - what is your business - as > > opposed to look-alike - objective. Because explaining your needs > > through a similarity with other product's secret implementation is a > > long way to get there. Too much precision loss in each explanation > > round. > > > > Regards, > >Alex. > > > > http://www.solr-start.com/ - Resources for Solr users, new and experienced > > > > > > On 13 April 2017 at 20:49, Nilesh Kamani wrote: > > > Hello All, > > > > > > When we search google, sometimes google returns results with mention of > > > keywords not found (mentioned as strike-through) > > > > > > Does Solr provide such feature ? > > > > > > > > > Thanks, > > > Nilesh Kamani > > >
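[Editorial note: a rough way to approximate this with the stock highlighter before writing any custom code; a sketch, assuming a collection named collection1 and a searched field named text, with URL encoding elided for readability:

  curl "http://localhost:8983/solr/collection1/select?q=text:(solr spring trump)&hl=true&hl.fl=text&wt=json"

Any query term that never shows up wrapped in <em> tags in a document's highlighting section was not matched in that document, which is the same signal a modified highlighter implementation would surface directly.]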
RE: maxDoc ten times greater than numDoc
Thanks, but I am not going to be brave this time :) I have tried reclaimDeletesWeight on an ordinary index some time ago and it was very aggressive with slightly higher values than the default. I think setting this weight in this situation would be analogous to a forceMerge every time, which makes sense. Thanks, Markus -Original message- > From:Erick Erickson > Sent: Thursday 13th April 2017 17:07 > To: solr-user > Subject: Re: maxDoc ten times greater than numDoc > > If you want to be brave > > Through a clever bit of reflection, the parameters that > TieredMergePolicy uses to decide what segments to reclaim are settable > in solrconfig.xml (undocumented, so use at your own risk). You could > try bumping > > reclaimDeletesWeight > > in your TieredMergePolicy configuration if you wanted to experiment. > > There's no good reason not to set your segments per tier, it won't hurt. > > But as you say you have a solution so this is just for curiosity's sake. > > Best, > Erick > > On Thu, Apr 13, 2017 at 4:42 AM, Alexandre Rafalovitch > wrote: > > Maybe not every entry got deleted and it was holding up the segment. > > E.g. a child or parent record abandoned. If, for example, the parent > > record has a date field and the child does not, then deleting with a > > date-based query may trigger this. I think there was a bug about > > abandoned child or something. > > > > This is pure speculation of course. > > > > Regards, > >Alex. > > > > http://www.solr-start.com/ - Resources for Solr users, new and experienced > > > > > > On 13 April 2017 at 12:54, Markus Jelsma wrote: > >> I have forced a merge yesterday and went back to one segment. > >> > >> One indexer program reindexes (most or all) every 20 minutes orso. There > >> is nothing custom at that particular point. There is no autoCommit, the > >> indexer program is responsible for a hard commit, it is the single source > >> of reindexed data. > >> > >> After one cycle we had two segments, 50 % deleted, as expected. This was > >> stable for many hours and many cycles. For some reason, i now have 2/3 > >> deletes and three segments, now this situation is stable. So the merges do > >> happen, but sometimes they don't. When they don't, the size increases (now > >> three segments, 55 MB). But it appears that number of segments never > >> decreases, and that is what bothers me. > >> > >> I was about to set segmentsPerTier to two but then i realized i can also > >> delete everything prior to indexing as opposed to deleting only items > >> older than the set i am already about to reindex. This strategy works fine > >> with other reindexing programs, they don't suffer this problem. > >> > >> So, it is not solved, but not a problem anymore. Thanks all anyway :) > >> Markus > >> > >> -Original message- > >>> From:Erick Erickson > >>> Sent: Wednesday 12th April 2017 17:51 > >>> To: solr-user > >>> Subject: Re: maxDoc ten times greater than numDoc > >>> > >>> Yes, this is very strange. My bet: you have something > >>> custom, a setting, indexing code, whatever that > >>> is getting in the way. > >>> > >>> Second possibility (really stretching here): your > >>> merge settings are set to 10 segments having to exist > >>> before merging and somehow not all the docs in the > >>> segments are replaced. So until you get to the 10th > >>> re-index (and assuming a single segment is > >>> produced per re-index) the older segments aren't > >>> merged. If that were the case I'd expect to see the > >>> number of deleted docs drop back periodically > >>> then build up again. A real shot in the dark. One way > >>> to test this would be to specify "segmentsPerTier" of, say, > >>> 2 rather than the default 10, see: > >>> https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig > >>> If this were the case I'd expect with a setting of 2 that > >>> your index might have 50% deleted docs, that would at > >>> least tell us whether we're on the right track. > >>> > >>> Take a look at your index on disk. If you're seeing gaps > >>> in the numbering, you are getting merging, it may be > >>> that they're not happening very often. > >>> > >>> And I take it you have no custom code here and you are > >>> doing commits? (hard commits are all that matters > >>> for merging, it doesn't matter whether openSearcher > >>> is set to true or false). > >>> > >>> I just tried the "techproducts" example as follows: > >>> 1> indexed all the sample files with the bin/solr -e techproducts example > >>> 2> started re-indexing the sample docs one at a time with post.jar > >>> > >>> It took a while, but eventually the original segments got merged away so > >>> I doubt it's any weirdness with a small index. > >>> > >>> Speaking of small index, why are you sharding with only > >>> 8K docs? Sharding will probably slow things down for such > >>> a small index. This isn't germane to your question though. > >>> > >>> Best, > >>> Erick > >>> > >>> > >>>
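[Editorial note: for anyone who does want to experiment, a minimal sketch of the two settings discussed above, assuming the Solr 6.x <mergePolicyFactory> syntax in solrconfig.xml; the values are illustrative, and reclaimDeletesWeight is undocumented, so use at your own risk:

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <!-- merge sooner: allow 2 segments per tier instead of the default 10 -->
      <int name="segmentsPerTier">2</int>
      <!-- weight segments with many deleted docs more heavily when picking
           merges; the Lucene default is 2.0, higher is more aggressive -->
      <double name="reclaimDeletesWeight">3.0</double>
    </mergePolicyFactory>
  </indexConfig>]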
Re: keywords not found - google like feature
Another ugly solution would be to use the debugQuery=true option, then analyze the results in the explain output; if the word isn't in the explain, then you strike it out. On Thu, Apr 13, 2017 at 4:01 PM, Markus Jelsma wrote: > Hi - There is no such feature out-of-the-box in Solr. But you probably > could modify a highlighter implementation to return this information, the > highlighter is the component that comes closest to that feature. > > Regards, > Markus > > > > -Original message- > > From:Nilesh Kamani > > Sent: Thursday 13th April 2017 21:52 > > To: solr-user@lucene.apache.org > > Subject: Re: keywords not found - google like feature > > > > Here is the example. > > https://www.google.ca/webhp?sourceid=chrome-instant&ion=1&; > espv=2&ie=UTF-8#safe=off&q=solr+spring+trump > > > > You will see this under search results. Missing: trump > > > > I am not asking for visual representation of such feature. > > Is there anyway solr is returning such info in response ? > > My client has this specific requirements that when he searches he wants > to > > know what keywords were not found in results. > > > > > > > > > > On Thu, Apr 13, 2017 at 3:34 PM, Alexandre Rafalovitch < > arafa...@gmail.com> > > wrote: > > > > > Are you asking visual representation or an actual feature. Because if > > > all your keywords/clauses are optional (default SHOULD) then Solr > > > automatically tries to match maximum number of them and then less and > > > less. So, if all words do not match, it will return results that match > > > less number of words. > > > > > > And words not-matched is effectively your strike-through negative > > > space. You can probably recover that from debug info, though it will > > > be not pretty and perhaps a bit slower. > > > > > > The real issue here is ranking. Does Google do something special with > > > ranking when they do strike through. Do they do some grouping and > > > ranking within groups, not just a global one? > > > > > > The biggest question is - of course - what is your business - as > > > opposed to look-alike - objective. Because explaining your needs > > > through a similarity with other product's secret implementation is a > > > long way to get there. Too much precision loss in each explanation > > > round. > > > > > > Regards, > > >Alex. > > > > > > http://www.solr-start.com/ - Resources for Solr users, new and > experienced > > > > > > > > > On 13 April 2017 at 20:49, Nilesh Kamani > wrote: > > > > Hello All, > > > > > > > > When we search google, sometimes google returns results with mention > of > > > > keywords not found (mentioned as strike-through) > > > > > > > > Does Solr provide such feature ? > > > > > > > > > > > > Thanks, > > > > Nilesh Kamani > > > > > >
Re: keywords not found - google like feature
Regardless of the business case (which would be good to know) you might want to try something along the lines of http://stackoverflow.com/questions/25038080/how-can-i-tell-solr-to-return-the-hit-search-terms-per-document - basically generate pseudo-fields using the exists() function query which will return a boolean if the term is in a specific field. I've used this for simple cases where it worked well, though I wouldn't like to speculate on how well this scales if you have an edismax query where you might need to generate multiple term/field combinations. HTH -Simon On Thu, Apr 13, 2017 at 3:34 PM, Alexandre Rafalovitch wrote: > Are you asking visual representation or an actual feature. Because if > all your keywords/clauses are optional (default SHOULD) then Solr > automatically tries to match maximum number of them and then less and > less. So, if all words do not match, it will return results that match > less number of words. > > And words not-matched is effectively your strike-through negative > space. You can probably recover that from debug info, though it will > be not pretty and perhaps a bit slower. > > The real issue here is ranking. Does Google do something special with > ranking when they do strike through. Do they do some grouping and > ranking within groups, not just a global one? > > The biggest question is - of course - what is your business - as > opposed to look-alike - objective. Because explaining your needs > through a similarity with other product's secret implementation is a > long way to get there. Too much precision loss in each explanation > round. > > Regards, >Alex. > > http://www.solr-start.com/ - Resources for Solr users, new and experienced > > > On 13 April 2017 at 20:49, Nilesh Kamani wrote: > > Hello All, > > > > When we search google, sometimes google returns results with mention of > > keywords not found (mentioned as strike-through) > > > > Does Solr provide such feature ? > > > > > > Thanks, > > Nilesh Kamani >
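[Editorial note: a minimal sketch of that approach, assuming a collection named collection1 and a single searched field named text; the pseudo-field names are illustrative, the fl value is one parameter wrapped here for readability, and URL encoding is elided:

  curl "http://localhost:8983/solr/collection1/select?q=text:(solr spring trump)
    &fl=id,score,
        found_solr:exists(query({!v='text:solr'})),
        found_spring:exists(query({!v='text:spring'})),
        found_trump:exists(query({!v='text:trump'}))"

Each pseudo-field comes back as true or false per document, so a front end can strike through any term whose flag is false for a given result.]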
Re: keywords not found - google like feature
Thanks for your input guys. I will look into it. On Thu, Apr 13, 2017 at 4:07 PM, simon wrote: > Regardless of the business case (which would be good to know) you might > want to try something along the lines of > http://stackoverflow.com/questions/25038080/how-can-i- > tell-solr-to-return-the-hit-search-terms-per-document > - basically generate pseudo-fields using the exists() function query which > will return a boolean if the term is in a specific field. > I've used this for simple cases where it worked well, though I wouldn't > like to speculate on how well this scales if you have an edismax query > where you might need to generate multiple term/field combinations. > > HTH > > -Simon > > On Thu, Apr 13, 2017 at 3:34 PM, Alexandre Rafalovitch > > wrote: > > > Are you asking visual representation or an actual feature. Because if > > all your keywords/clauses are optional (default SHOULD) then Solr > > automatically tries to match maximum number of them and then less and > > less. So, if all words do not match, it will return results that match > > less number of words. > > > > And words not-matched is effectively your strike-through negative > > space. You can probably recover that from debug info, though it will > > be not pretty and perhaps a bit slower. > > > > The real issue here is ranking. Does Google do something special with > > ranking when they do strike through. Do they do some grouping and > > ranking within groups, not just a global one? > > > > The biggest question is - of course - what is your business - as > > opposed to look-alike - objective. Because explaining your needs > > through a similarity with other product's secret implementation is a > > long way to get there. Too much precision loss in each explanation > > round. > > > > Regards, > >Alex. > > > > http://www.solr-start.com/ - Resources for Solr users, new and > experienced > > > > > > On 13 April 2017 at 20:49, Nilesh Kamani > wrote: > > > Hello All, > > > > > > When we search google, sometimes google returns results with mention of > > > keywords not found (mentioned as strike-through) > > > > > > Does Solr provide such feature ? > > > > > > > > > Thanks, > > > Nilesh Kamani > > >
RE: keywords not found - google like feature
Hi - That is not going to be that easy out-of-the-box. In regular setups the output you find in debugging mode contains stemmed versions of the original input text. At best you can use KeepWordFilterFactory to get unstemmed terms, but those tokens would, in usual cases, also have passed through filters such as LowerCase, AsciiFolding or some language-specific normalizer, causing them not to match most original input tokens. Regards, Markus -Original message- > From:David Hastings > Sent: Thursday 13th April 2017 22:05 > To: solr-user@lucene.apache.org > Subject: Re: keywords not found - google like feature > > Another ugly solution would be to use the debugQuery=true option, then > analyze the reults in explain, if the word isnt in the explain, then you > strike it out. > > On Thu, Apr 13, 2017 at 4:01 PM, Markus Jelsma > wrote: > > > Hi - There is no such feature out-of-the-box in Solr. But you probably > > could modify a highlighter implementation to return this information, the > > highlighter is the component that comes closest to that feature. > > > > Regards, > > Markus > > > > > > > > -Original message- > > > From:Nilesh Kamani > > > Sent: Thursday 13th April 2017 21:52 > > > To: solr-user@lucene.apache.org > > > Subject: Re: keywords not found - google like feature > > > > > > Here is the example. > > > https://www.google.ca/webhp?sourceid=chrome-instant&ion=1&; > > espv=2&ie=UTF-8#safe=off&q=solr+spring+trump > > > > > > You will see this under search results. Missing: trump > > > > > > I am not asking for visual representation of such feature. > > > Is there anyway solr is returning such info in response ? > > > My client has this specific requirements that when he searches he wants > > to > > > know what keywords were not found in results. > > > > > > > > > > > > > > > On Thu, Apr 13, 2017 at 3:34 PM, Alexandre Rafalovitch < > > arafa...@gmail.com> > > > wrote: > > > > > > > Are you asking visual representation or an actual feature. Because if > > > > all your keywords/clauses are optional (default SHOULD) then Solr > > > > automatically tries to match maximum number of them and then less and > > > > less. So, if all words do not match, it will return results that match > > > > less number of words. > > > > > > > > And words not-matched is effectively your strike-through negative > > > > space. You can probably recover that from debug info, though it will > > > > be not pretty and perhaps a bit slower. > > > > > > > > The real issue here is ranking. Does Google do something special with > > > > ranking when they do strike through. Do they do some grouping and > > > > ranking within groups, not just a global one? > > > > > > > > The biggest question is - of course - what is your business - as > > > > opposed to look-alike - objective. Because explaining your needs > > > > through a similarity with other product's secret implementation is a > > > > long way to get there. Too much precision loss in each explanation > > > > round. > > > > > > > > Regards, > > > >Alex. > > > > > > > > http://www.solr-start.com/ - Resources for Solr users, new and > > experienced > > > > > > > > > > > > On 13 April 2017 at 20:49, Nilesh Kamani > > wrote: > > > > > Hello All, > > > > > > > > > > When we search google, sometimes google returns results with mention > > of > > > > > keywords not found (mentioned as strike-through) > > > > > > > > > > Does Solr provide such feature ? > > > > > > > > > > > > > > > Thanks, > > > > > Nilesh Kamani > > > > > > > > > >
Re: keywords not found - google like feature
bq: he searches he wants to know what keywords were not found in results. We need to distinguish between words not found in the returned documents and words not found at all. The solutions above tell you about the documents returned. If the keyword was found in a document not returned (say the 11th doc when you have rows set to 10) you'd have no way to know that the keyword was actually in _some_ document, just not one of the top N returned. So if your question is really "I want to know what terms were not found in any document", they won't help. Another rather ugly solution would be to facet on the keywords. You'd add some facet clauses like: facet.query=keywordfield:keyword1& facet.query=keywordfield:keyword2& facet.query=keywordfield:keyword3& facet.query=keywordfield:keyword4 The counts in those returned facets would represent the total number of documents having that keyword, regardless of whether they were in the top N returned. For a bazillion docs this is probably unworkable, I admit. Do _not_ facet on the keyword field as in &facet.field=keyword unless you are certain it has a pretty low cardinality, as in maybe 100 or so. Beyond that, test. Faceting on a field with a million unique values corpus-wide is just asking for trouble. Best, Erick On Thu, Apr 13, 2017 at 1:12 PM, Markus Jelsma wrote: > Hi - That is not going to be that easy out-of-the-box. In regular setups the > output you find in debugging mode contains stemmed versions of the original > input text. > > At best you use KeepWordsFilterFactory to get unstemmed terms, but those > tokens would, in usual cases, also have passed through filters such as > LowerCase, AsciiFolding or some language specific normalizer. Causing them > not to match most original input tokens. > > Regards, > Markus > > > > -Original message- >> From:David Hastings >> Sent: Thursday 13th April 2017 22:05 >> To: solr-user@lucene.apache.org >> Subject: Re: keywords not found - google like feature >> >> Another ugly solution would be to use the debugQuery=true option, then >> analyze the reults in explain, if the word isnt in the explain, then you >> strike it out. >> >> On Thu, Apr 13, 2017 at 4:01 PM, Markus Jelsma >> wrote: >> >> > Hi - There is no such feature out-of-the-box in Solr. But you probably >> > could modify a highlighter implementation to return this information, the >> > highlighter is the component that comes closest to that feature. >> > >> > Regards, >> > Markus >> > >> > >> > >> > -Original message- >> > > From:Nilesh Kamani >> > > Sent: Thursday 13th April 2017 21:52 >> > > To: solr-user@lucene.apache.org >> > > Subject: Re: keywords not found - google like feature >> > > >> > > Here is the example. >> > > https://www.google.ca/webhp?sourceid=chrome-instant&ion=1&; >> > espv=2&ie=UTF-8#safe=off&q=solr+spring+trump >> > > >> > > You will see this under search results. Missing: trump >> > > >> > > I am not asking for visual representation of such feature. >> > > Is there anyway solr is returning such info in response ? >> > > My client has this specific requirements that when he searches he wants >> > to >> > > know what keywords were not found in results. >> > > >> > > >> > > >> > > >> > > On Thu, Apr 13, 2017 at 3:34 PM, Alexandre Rafalovitch < >> > arafa...@gmail.com> >> > > wrote: >> > > >> > > > Are you asking visual representation or an actual feature. Because if >> > > > all your keywords/clauses are optional (default SHOULD) then Solr >> > > > automatically tries to match maximum number of them and then less and >> > > > less. So, if all words do not match, it will return results that match >> > > > less number of words. >> > > > >> > > > And words not-matched is effectively your strike-through negative >> > > > space. You can probably recover that from debug info, though it will >> > > > be not pretty and perhaps a bit slower. >> > > > >> > > > The real issue here is ranking. Does Google do something special with >> > > > ranking when they do strike through. Do they do some grouping and >> > > > ranking within groups, not just a global one? >> > > > >> > > > The biggest question is - of course - what is your business - as >> > > > opposed to look-alike - objective. Because explaining your needs >> > > > through a similarity with other product's secret implementation is a >> > > > long way to get there. Too much precision loss in each explanation >> > > > round. >> > > > >> > > > Regards, >> > > >Alex. >> > > > >> > > > http://www.solr-start.com/ - Resources for Solr users, new and >> > experienced >> > > > >> > > > >> > > > On 13 April 2017 at 20:49, Nilesh Kamani >> > wrote: >> > > > > Hello All, >> > > > > >> > > > > When we search google, sometimes google returns results with mention >> > of >> > > > > keywords not found (mentioned as strike-through) >> > > > > >> > > > > Does Solr provide such feature ? >> > > > > >> > > > > >> > > > > Thanks, >> > > > > Nilesh Kamani >> > > > >> > > >> > >>
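[Editorial note: a concrete sketch of that request, reusing the hypothetical keywordfield and keyword names from above, with URL encoding elided for readability:

  curl "http://localhost:8983/solr/collection1/select?q=keywordfield:(keyword1 keyword2 keyword3)
    &rows=10&facet=true
    &facet.query=keywordfield:keyword1
    &facet.query=keywordfield:keyword2
    &facet.query=keywordfield:keyword3"

In the facet_queries section of the response, a count of 0 for keywordfield:keyword2 means no document matching the query contained that keyword at all, not merely none of the 10 rows returned.]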
Re: DistributedUpdateProcessorFactory was explicitly disabled from this updateRequestProcessorChain
Why are you adding these update processors (esp. the AddSchemaFieldsUpdateProcessor) after DistributedUpdateProcessor? Try adding them before DUP; that way it has a better chance of working. On Wed, Apr 12, 2017 at 3:44 PM, Pratik Thaker < pratik.tha...@smartstreamrdu.com> wrote: > Hi All, > > I am facing this issue since very long, can you please provide your > suggestion on it ? > > Regards, > Pratik Thaker > > -Original Message- > From: Pratik Thaker [mailto:pratik.tha...@smartstreamrdu.com] > Sent: 09 February 2017 21:24 > To: 'solr-user@lucene.apache.org' > Subject: RE: DistributedUpdateProcessorFactory was explicitly disabled > from this updateRequestProcessorChain > > Hi Friends, > > Can you please try to give me some details about below issue ? > > Regards, > Pratik Thaker > > From: Pratik Thaker > Sent: 07 February 2017 17:12 > To: 'solr-user@lucene.apache.org' > Subject: DistributedUpdateProcessorFactory was explicitly disabled from > this updateRequestProcessorChain > > Hi All, > > I am using SOLR Cloud 6.0 > > I am receiving below exception very frequently in solr logs, > > o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: > RunUpdateProcessor has received an AddUpdateCommand containing a document > that appears to still contain Atomic document update operations, most > likely because DistributedUpdateProcessorFactory was explicitly disabled > from this updateRequestProcessorChain > at org.apache.solr.update.processor.RunUpdateProcessor.processAdd( > RunUpdateProcessorFactory.java:63) > at org.apache.solr.update.processor.UpdateRequestProcessor. > processAdd(UpdateRequestProcessor.java:48) > at org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessor > Factory$AddSchemaFieldsUpdateProcessor.processAdd( > AddSchemaFieldsUpdateProcessorFactory.java:335) > at org.apache.solr.update.processor.UpdateRequestProcessor. > processAdd(UpdateRequestProcessor.java:48) > at org.apache.solr.update.processor.FieldMutatingUpdateProcessor. > processAdd(FieldMutatingUpdateProcessor.java:117) > at org.apache.solr.update.processor.UpdateRequestProcessor. > processAdd(UpdateRequestProcessor.java:48) > at org.apache.solr.update.processor.FieldMutatingUpdateProcessor. > processAdd(FieldMutatingUpdateProcessor.java:117) > at org.apache.solr.update.processor.UpdateRequestProcessor. > processAdd(UpdateRequestProcessor.java:48) > at org.apache.solr.update.processor.FieldMutatingUpdateProcessor. > processAdd(FieldMutatingUpdateProcessor.java:117) > at org.apache.solr.update.processor.UpdateRequestProcessor. > processAdd(UpdateRequestProcessor.java:48) > at org.apache.solr.update.processor.FieldMutatingUpdateProcessor. > processAdd(FieldMutatingUpdateProcessor.java:117) > at org.apache.solr.update.processor.UpdateRequestProcessor. > processAdd(UpdateRequestProcessor.java:48) > at org.apache.solr.update.processor.FieldNameMutatingUpdateProcess > orFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:74) > at org.apache.solr.update.processor.UpdateRequestProcessor. > processAdd(UpdateRequestProcessor.java:48) > at org.apache.solr.update.processor.FieldMutatingUpdateProcessor. > processAdd(FieldMutatingUpdateProcessor.java:117) > at org.apache.solr.update.processor.UpdateRequestProcessor. > processAdd(UpdateRequestProcessor.java:48) > at org.apache.solr.update.processor.DistributedUpdateProcessor. > doLocalAdd(DistributedUpdateProcessor.java:936) > at org.apache.solr.update.processor.DistributedUpdateProcessor. > versionAdd(DistributedUpdateProcessor.java:1091) > at org.apache.solr.update.processor.DistributedUpdateProcessor. > processAdd(DistributedUpdateProcessor.java:714) > at org.apache.solr.update.processor.UpdateRequestProcessor. > processAdd(UpdateRequestProcessor.java:48) > at org.apache.solr.update.processor.AbstractDefaultValueUpdateProc > essorFactory$DefaultValueUpdateProcessor.processAdd( > AbstractDefaultValueUpdateProcessorFactory.java:93) > at org.apache.solr.handler.loader.JavabinLoader$1.update( > JavabinLoader.java:97) > > Can you please help me with the root cause ? Below is the snapshot of > solrconfig, > > > > > > > > > > [^\w-\.] > _ > > > > > > > -MM-dd'T'HH:mm:ss.SSSZ > -MM-dd'T'HH:mm:ss,SSSZ > -MM-dd'T'HH:mm:ss.SSS > -MM-dd'T'HH:mm:ss,SSS > -MM-dd'T'HH:mm:ssZ > -MM-dd'T'HH:mm:ss > -MM-dd'T'HH:mmZ > -MM-dd'T'HH:mm > -MM-dd HH:mm:ss.SSSZ > -MM-dd HH:mm:ss,SSSZ > -MM-dd HH:mm:ss.SSS > -MM-dd HH:mm:ss,SSS > -MM-dd HH:mm:ssZ > -MM-dd HH:mm:ss > -MM-dd HH:mmZ > -MM-dd HH:mm > -MM-dd > > > > strings >
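[Editorial note: the solrconfig snapshot quoted above lost its XML tags in the mail archive. For reference, a sketch of the ordering suggested in the reply, with the field-mutating processors ahead of the distributed processor; the processor list is abridged and the class names are taken from Solr's stock schemaless configuration:

  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
    <!-- field-mutating processors run first, before distribution -->
    <processor class="solr.ParseDateFieldUpdateProcessorFactory"/>
    <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
      <str name="defaultFieldType">strings</str>
    </processor>
    <!-- the distributed and run processors close the chain, so atomic
         update operations are resolved before RunUpdateProcessor runs -->
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>]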
Re: keywords not found - google like feature
After reading everyone's posts, my thought is that sometimes things are better achieved with smoke and mirrors. I achieved something similar by measuring my scores when there were no keyword hits. I wrote a simple jQuery script to do a CSS strike-through on the returned message if the score was poor, plus I returned zero results. I run different CSS for different messages all the time. Kind of working from the vantage that if your score is crap, so are the results. Generally I can get my searches down to ['response']['numFound']=0 ~ I animate the message sometimes. On 13 April 2017 at 13:49, Nilesh Kamani wrote: > Hello All, > > When we search google, sometimes google returns results with mention of > keywords not found (mentioned as strike-through) > > Does Solr provide such feature ? > > > Thanks, > Nilesh Kamani >
Re: Enable Gzip compression Solr 6.0
Hi Mahmoud Beware of using a proxy. Your web application will get attacked, and you should only forward the parameters that are needed for your app features. But you thought of that already. Cheers -- Rick On April 12, 2017 11:39:57 PM EDT, Mahmoud Almokadem wrote: >Thanks Rick, > >I already running Solr on my infrastructure and behind a web >application. > >The web application is working as a proxy before Solr, so I think I can >compress the content on Solr end. But I have made it on the proxy now. > >Thanks again, >Mahmoud > > >> On Apr 12, 2017, at 4:31 PM, Rick Leir wrote: >> >> Hi Mahmoud >> I assume you are running Solr 'behind' a web application, so Solr is >not directly on the net. >> >> The gzip compression is an Apache thing, and relates to your web >application. >> >> Connections to Solr are within your infrastructure, so you might not >want to gzip them. But maybe your setup is different? >> >> Older versions of Solr used Tomcat which supported gzip. Newer >versions use Zookeeper and Jetty and you prolly will find a way. >> Cheers -- Rick >> >>> On April 12, 2017 8:48:45 AM EDT, Mahmoud Almokadem > wrote: >>> Hello, >>> >>> How can I enable Gzip compression for Solr 6.0 to save bandwidth >>> between >>> the server and clients? >>> >>> Thanks, >>> Mahmoud >> >> -- >> Sorry for being brief. Alternate email is rickleir at yahoo dot com -- Sorry for being brief. Alternate email is rickleir at yahoo dot com
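[Editorial note: if the proxy happens to be nginx, a minimal sketch of compressing Solr responses at the proxy tier; the location path and upstream address are illustrative:

  # compress JSON/XML responses on their way out of the proxy
  gzip            on;
  gzip_types      application/json application/xml text/plain;
  gzip_min_length 1024;   # skip tiny responses not worth compressing

  location /solr/ {
      proxy_pass http://localhost:8983/solr/;
  }]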
Solr 4.10 and Distributed pivot faceting in Non-Solr cloud mode
Hi, Would like to know if Solr 4.10 supports distributed pivot faceting in non-SolrCloud mode. According to the JIRA below, it looks like it was fixed in 4.10, but we use Solr in non-cloud mode. https://issues.apache.org/jira/browse/SOLR-2894 Thank you, Raji
Re: Error when adding user for Solr Basic Authentication
I found from Stack Overflow that we have to escape the double quotes inside the JSON and wrap the whole string in double quotes instead of single quotes. http://stackoverflow.com/questions/43387719/error-when-adding-user-for-solr-basic-authentication/43387895#43387895 curl --user user:password http://localhost:8983/solr/admin/authentication -H "Content-type:application/json" -d "{ \"set-user\": {\"tom\" : \"TomIsCool\" , \"harry\":\"HarrysSecret\"}}" Regards, Edwin On 13 April 2017 at 15:03, Zheng Lin Edwin Yeo wrote: > Hi, > > When I try to add the user for the Solr Basic Authentication using the > following method in curl > > curl --user user:password http://localhost:8983/solr/admin/authentication > -H 'Content-type:application/json' -d '{ > "set-user": {"tom" : "TomIsCool" , >"harry":"HarrysSecret"}}' > > I get the following error: > > { > "responseHeader":{ > "status":400, > "QTime":0}, > "error":{ > "metadata":[ > "error-class","org.apache.solr.common.SolrException", > "root-error-class","org.apache.solr.common.SolrException"], > "msg":"No contentStream", > "code":400}} > curl: (3) [globbing] unmatched brace in column 1 > 枩]?V7`-{炘9叡 t肤 ,? E'qyT咐黣]儎;衷 鈛^W褹?curl: (3) [globbing] unmatched cl > ose brace/bracket in column 13 > > > What does this error means and how should we resolve it? > I'm using SolrCloud on Solr 6.4.2. > > > Regards, > Edwin > >
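[Editorial note: putting the payload in a file sidesteps the shell-quoting problem entirely; a sketch, assuming the JSON is saved as setuser.json:

  curl --user user:password http://localhost:8983/solr/admin/authentication -H "Content-type:application/json" --data-binary @setuser.json

With --data-binary @file, curl reads the request body from the file, so no quotes need escaping on the command line.]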
Re: Need help with auto-suggester
You can create a copy field, copy into it from all the fields you want the suggestions drawn from, and then use that field with the suggester. On Thu 13 Apr, 2017, 23:21 OTH, wrote: > Hello, > > I've followed the steps here to set up auto-suggest: > https://lucidworks.com/2015/03/04/solr-suggester/ > > So basically I configured the auto-suggester in solrconfig.xml, where I > told it which field in my index needs to be used for auto-suggestion. > > The problem is: > When the user searches in the text box in the front end, if they are > searching for cities, I also need the countries to appear in the drop-down > list which the user sees. > The field which is being searched is only 'city' here. However, I need to > retrieve the corresponding value in the 'country' field as well. > > How could I do this using the suggester? > > Thanks > -- Regards, Binoy Dalal
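[Editorial note: a minimal sketch of that setup, assuming fields named city and country and the AnalyzingInfixLookupFactory suggester; the field and suggester names are illustrative. In the schema:

  <field name="suggest_text" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <copyField source="city" dest="suggest_text"/>
  <copyField source="country" dest="suggest_text"/>

And in solrconfig.xml:

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">placeSuggester</str>
      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">suggest_text</str>
      <str name="suggestAnalyzerFieldType">text_general</str>
    </lst>
  </searchComponent>

Note that suggest_text must be stored for DocumentDictionaryFactory to read it. If the country needs to come back alongside each city suggestion rather than as a separate suggestion, the dictionary's payloadField option is worth a look.]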