Re: Way for DataImportHandler to use bind variables

2018-05-03 Thread Mikhail Khludnev
DIH does string replacement

https://github.com/apache/lucene-solr/blob/8b9c2a3185d824a9aaae5c993b872205358729dd/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SqlEntityProcessor.java#L73


Hard refactoring would be required to make it use a PreparedStatement.
However, there should be just a few JDBC calls involved, so it shouldn't be a
big problem.
There are also a few DIH facilities for avoiding things like N+1 query pitfalls.
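
For illustration only, here is plain JDBC (not DIH code) showing the difference;
conn (a java.sql.Connection) and category (a String) are assumed to exist:

  // What DIH effectively does today: the placeholder is replaced with literal
  // text before the SQL ever reaches the JDBC driver.
  String sql = "SELECT id, name FROM item WHERE category = '" + category + "'";
  try (Statement st = conn.createStatement();
       ResultSet rs = st.executeQuery(sql)) {
      // ... read rows ...
  }

  // A bind-variable version: the value is passed separately, so the driver can
  // reuse the statement/plan and the value never becomes part of the SQL text.
  try (PreparedStatement ps =
           conn.prepareStatement("SELECT id, name FROM item WHERE category = ?")) {
      ps.setString(1, category);
      try (ResultSet rs = ps.executeQuery()) {
          // ... read rows ...
      }
  }

(Statement, PreparedStatement and ResultSet are from java.sql.)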


On Wed, May 2, 2018 at 10:03 PM, Mike Konikoff 
wrote:

> Is there a way to configure the DataImportHandler to use bind variables for
> the entity queries? To improve database performance.
>
> Thanks,
>
> Mike
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Autocomplete returning shingles

2018-05-03 Thread Federico Méndez
Can you just add the ShingleFilter to your field?
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter
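
For example, a field type roughly along these lines (only a sketch; the name,
tokenizer and shingle parameters are illustrative and should be adapted to your
schema):

  <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- emits "new", "new york", "new york city", "york", ... -->
      <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3"
              outputUnigrams="true"/>
    </analyzer>
  </fieldType>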


  
  



On Wed, May 2, 2018 at 2:04 PM, O. Klein  wrote:

> I need to use autocomplete with edismax (ngrams,edgegrams) to return
> shingled
> suggestions. Field value "new york city" needs to return on query "ne" ->
> "new","new york","new york city". With suggester this is easy. But im
> forced
> to use edismax because I need to apply mutliple filter queries.
>
> What is best approach to deal with this?
>
> Any suggestions are appreciated.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: User queries end up in filterCache if facetting is enabled

2018-05-03 Thread Mikhail Khludnev
Enum facets, facet refinements and
https://lucene.apache.org/solr/guide/6_6/query-settings-in-solrconfig.html
come to mind.

On Wed, May 2, 2018 at 11:58 PM, Markus Jelsma 
wrote:

> Hello,
>
> Anyone here to reproduce this oddity? It shows up in all our collections
> once we enable the stats page to show filterCache entries.
>
> Is this normal? Am i completely missing something?
>
> Thanks,
> Markus
>
>
>
> -Original message-
> > From:Markus Jelsma 
> > Sent: Tuesday 1st May 2018 17:32
> > To: Solr-user 
> > Subject: User queries end up in filterCache if facetting is enabled
> >
> > Hello,
> >
> > We noticed the number of entries of the filterCache to be higher than we
> expected, using showItems="1024" something unexpected was listed as entries
> of the filterCache, the complete Query.toString() of our user queries,
> massive entries, a lot of them.
> >
> > We also spotted all entries of fields we facet on, even though we don't
> use them as filtes, but that is caused by facet.field=enum, and should be
> expected, right?
> >
> > Now, the user query entries are not expected. In the simplest set up,
> searching for something and only enabling the facet engine with facet=true
> causes it to appears in the cache as an entry. The following queries:
> >
> > http://localhost:8983/solr/search/select?q=content_nl:nog&facet=true
> > http://localhost:8983/solr/search/select?q=*:*&facet=true
> >
> > become listed as:
> >
> > CACHE.searcher.filterCache.item_*:*:
> > org.apache.solr.search.BitDocSet@​70051ee0
> >
> > CACHE.searcher.filterCache.item_content_nl:nog:
> > org.apache.solr.search.BitDocSet@​13150cf6
> >
> > This is on 7.3, but 7.2.1 does this as well.
> >
> > So, should i expect this? Can i disable this? Bug?
> >
> >
> > Thanks,
> > Markus
> >
> >
> >
> >
>



-- 
Sincerely yours
Mikhail Khludnev


Re: User queries end up in filterCache if facetting is enabled

2018-05-03 Thread Mikhail Khludnev
I mean
https://lucene.apache.org/solr/guide/6_6/query-settings-in-solrconfig.html#QuerySettingsinSolrConfig-useFilterForSortedQuery
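
That is, the setting in the <query> section of solrconfig.xml (sketch only):

  <query>
    ...
    <useFilterForSortedQuery>true</useFilterForSortedQuery>
    ...
  </query>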


On Thu, May 3, 2018 at 10:42 AM, Mikhail Khludnev  wrote:

> Enum facets, facet refinements and https://lucene.apache.org/
> solr/guide/6_6/query-settings-in-solrconfig.html comes to my mind.
>
> On Wed, May 2, 2018 at 11:58 PM, Markus Jelsma  > wrote:
>
>> Hello,
>>
>> Anyone here to reproduce this oddity? It shows up in all our collections
>> once we enable the stats page to show filterCache entries.
>>
>> Is this normal? Am i completely missing something?
>>
>> Thanks,
>> Markus
>>
>>
>>
>> -Original message-
>> > From:Markus Jelsma 
>> > Sent: Tuesday 1st May 2018 17:32
>> > To: Solr-user 
>> > Subject: User queries end up in filterCache if facetting is enabled
>> >
>> > Hello,
>> >
>> > We noticed the number of entries of the filterCache to be higher than
>> we expected, using showItems="1024" something unexpected was listed as
>> entries of the filterCache, the complete Query.toString() of our user
>> queries, massive entries, a lot of them.
>> >
>> > We also spotted all entries of fields we facet on, even though we don't
>> use them as filtes, but that is caused by facet.field=enum, and should be
>> expected, right?
>> >
>> > Now, the user query entries are not expected. In the simplest set up,
>> searching for something and only enabling the facet engine with facet=true
>> causes it to appears in the cache as an entry. The following queries:
>> >
>> > http://localhost:8983/solr/search/select?q=content_nl:nog&facet=true
>> > http://localhost:8983/solr/search/select?q=*:*&facet=true
>> >
>> > become listed as:
>> >
>> > CACHE.searcher.filterCache.item_*:*:
>> > org.apache.solr.search.BitDocSet@​70051ee0
>> >
>> > CACHE.searcher.filterCache.item_content_nl:nog:
>> > org.apache.solr.search.BitDocSet@​13150cf6
>> >
>> > This is on 7.3, but 7.2.1 does this as well.
>> >
>> > So, should i expect this? Can i disable this? Bug?
>> >
>> >
>> > Thanks,
>> > Markus
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>



-- 
Sincerely yours
Mikhail Khludnev


RE: User queries end up in filterCache if facetting is enabled

2018-05-03 Thread Markus Jelsma
Thanks Mikhail,

I thought about that setting too, but I do sort by score, as the Solr /select
handler does by default. The enum method accounts for all the values of a facet
field, but not for the user queries I see ending up in the cache.

Any other suggestions to shed light on this oddity?

Thanks!
Markus

 
 
-Original message-
> From:Mikhail Khludnev 
> Sent: Thursday 3rd May 2018 9:43
> To: solr-user 
> Subject: Re: User queries end up in filterCache if facetting is enabled
> 
> I mean
> https://lucene.apache.org/solr/guide/6_6/query-settings-in-solrconfig.html#QuerySettingsinSolrConfig-useFilterForSortedQuery
> 
> 
> On Thu, May 3, 2018 at 10:42 AM, Mikhail Khludnev  wrote:
> 
> > Enum facets, facet refinements and https://lucene.apache.org/
> > solr/guide/6_6/query-settings-in-solrconfig.html comes to my mind.
> >
> > On Wed, May 2, 2018 at 11:58 PM, Markus Jelsma  > > wrote:
> >
> >> Hello,
> >>
> >> Anyone here to reproduce this oddity? It shows up in all our collections
> >> once we enable the stats page to show filterCache entries.
> >>
> >> Is this normal? Am i completely missing something?
> >>
> >> Thanks,
> >> Markus
> >>
> >>
> >>
> >> -Original message-
> >> > From:Markus Jelsma 
> >> > Sent: Tuesday 1st May 2018 17:32
> >> > To: Solr-user 
> >> > Subject: User queries end up in filterCache if facetting is enabled
> >> >
> >> > Hello,
> >> >
> >> > We noticed the number of entries of the filterCache to be higher than
> >> we expected, using showItems="1024" something unexpected was listed as
> >> entries of the filterCache, the complete Query.toString() of our user
> >> queries, massive entries, a lot of them.
> >> >
> >> > We also spotted all entries of fields we facet on, even though we don't
> >> use them as filtes, but that is caused by facet.field=enum, and should be
> >> expected, right?
> >> >
> >> > Now, the user query entries are not expected. In the simplest set up,
> >> searching for something and only enabling the facet engine with facet=true
> >> causes it to appears in the cache as an entry. The following queries:
> >> >
> >> > http://localhost:8983/solr/search/select?q=content_nl:nog&facet=true
> >> > http://localhost:8983/solr/search/select?q=*:*&facet=true
> >> >
> >> > become listed as:
> >> >
> >> > CACHE.searcher.filterCache.item_*:*:
> >> > org.apache.solr.search.BitDocSet@​70051ee0
> >> >
> >> > CACHE.searcher.filterCache.item_content_nl:nog:
> >> > org.apache.solr.search.BitDocSet@​13150cf6
> >> >
> >> > This is on 7.3, but 7.2.1 does this as well.
> >> >
> >> > So, should i expect this? Can i disable this? Bug?
> >> >
> >> >
> >> > Thanks,
> >> > Markus
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> 


Re: Autocomplete returning shingles

2018-05-03 Thread Alessandro Benedetti
So, your problem is that you want to return shingled suggestions from an input
field, but also apply multiple filter queries to the documents you fetch
suggestions from.

Are you building an auxiliary index for that?
You need to design it accordingly.
If you want to map each suggestion to a single document in the auxiliary
index, then when you build this auxiliary index you need to calculate the
shingles client side and push multiple documents (one per suggestion) for each
original field value, e.g. along the lines of the sketch below.
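
A rough sketch of that client-side step (the auxiliary collection "suggest", its
"suggestion" field and the URL are illustrative, not a recommendation):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class SuggestIndexer {
      // Build word shingles of size 1..maxSize from the original field value.
      static List<String> shingles(String text, int maxSize) {
          String[] words = text.split("\\s+");
          List<String> out = new ArrayList<>();
          for (int start = 0; start < words.length; start++) {
              StringBuilder sb = new StringBuilder();
              for (int len = 1; len <= maxSize && start + len <= words.length; len++) {
                  if (len > 1) sb.append(' ');
                  sb.append(words[start + len - 1]);
                  out.add(sb.toString());
              }
          }
          return out;
      }

      public static void main(String[] args) throws Exception {
          try (SolrClient client =
                   new HttpSolrClient.Builder("http://localhost:8983/solr/suggest").build()) {
              // One auxiliary document per shingle of the original field content.
              for (String s : shingles("new york city", 3)) {
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("id", s);         // the shingle itself as the key
                  doc.addField("suggestion", s); // searchable/stored suggestion text
                  client.add(doc);
              }
              client.commit();
          }
      }
  }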

To do that automatically in Solr I was thinking you could write an
UpdateRequestProcessor that, given a document in input, splits it into
multiple docs, but unfortunately the current architecture of
UpdateRequestProcessors takes 1 doc in input and returns just 1 doc in output.
So it is not a viable approach.

Unfortunately the ShingleFilter doesn't help here, as you want shingles in the
output (analyzers don't affect stored content).

Cheers




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Regarding LTR feature

2018-05-03 Thread Alessandro Benedetti
Mmmm, first of all, you know that each Solr feature is calculated per
document, right?
So you want to calculate the payload score for the document you are
re-ranking, based on the query (your External Feature Information), and
normalize it across the different documents?

I would go with this feature and use the LTR normalization functionality:

{
  "store"  : "my_feature_store",
  "name"   : "in_aggregated_terms",
  "class"  : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!payload_score f=aggregated_terms func=max v=${query}}" }
}

Then in the model you specify something like:

"name" : "myModelName",
"features" : [
  { "name" : "isBook" },
  ...
  {
    "name" : "in_aggregated_terms",
    "norm" : {
      "class"  : "org.apache.solr.ltr.norm.MinMaxNormalizer",
      "params" : { "min" : "x", "max" : "y" }
    }
  }
]

Give it a try, let me know




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Shard size variation

2018-05-03 Thread Michael Joyner
We generally try not to change defaults when possible. It sounds like there
will be new default settings for the segment sizes and merging policy?


Am I right in thinking that expungeDeletes will (in theory) be a 7.4 
forwards option?



On 05/02/2018 01:29 PM, Erick Erickson wrote:

You can always increase the maximum segment size. For large indexes
that should reduce the number of segments. But watch your indexing
stats, I can't predict the consequences of bumping it to 100G for
instance. I'd _expect_  bursty I/O whne those large segments started
to be created or merged

You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably)
the idea of increasing the segment sizes and/or a related JIRA that
allows you to tweak how aggressively solr merges segments that have
deleted docs.

NOTE: that JIRA has the consequence that _by default_ the optimize
with no parameters respects the maximum segment size, which is a
change from now.

Finally, expungeDeletes may be useful as that too will respect max
segment size, again after LUCENE-7976 is committed.

Best,
Erick

On Wed, May 2, 2018 at 9:22 AM, Michael Joyner  wrote:

The main reason we go this route is that after awhile (with default
settings) we end up with hundreds of shards and performance of course drops
abysmally as a result. By using a stepped optimize a) we don't run into the
we need the 3x+ head room issue, b) optimize performance penalty during
optimize is less than the hundreds of shards not being optimized performance
penalty.

BTW, as we use batched a batch insert/update cycle [once daily] we only do
optimize to a segment of 1 after a complete batch has been run. Though
during the batch we reduce segment counts down to a max of 16 every 250K
insert/updates to prevent the large segment count performance penalty.


On 04/30/2018 07:10 PM, Erick Erickson wrote:

There's really no good way to purge deleted documents from the index
other than to wait until merging happens.

Optimize/forceMerge and expungeDeletes both suffer from the problem
that they create massive segments that then stick around for a very
long time, see:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Best,
Erick

On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner 
wrote:

Based on experience, 2x head room is room is not always enough, sometimes
not even 3x, if you are optimizing from many segments down to 1 segment
in a
single go.

We have however figured out a way that can work with as little as 51%
free
space via the following iteration cycle:

public void solrOptimize() {
    int initialMaxSegments = 256;
    int finalMaxSegments = 1;
    if (isShowSegmentCounter()) {
        log.info("Optimizing ...");
    }
    try (SolrClient solrServerInstance = getSolrClientInstance()) {
        for (int segments = initialMaxSegments; segments >= finalMaxSegments; segments--) {
            if (isShowSegmentCounter()) {
                System.out.println("Optimizing to a max of " + segments + " segments.");
            }
            solrServerInstance.optimize(true, true, segments);
        }
    } catch (SolrServerException | IOException e) {
        throw new RuntimeException(e);
    }
}


On 04/30/2018 04:23 PM, Walter Underwood wrote:

You need 2X the minimum index size in disk space anyway, so don’t worry
about keeping the indexes as small as possible. Worry about having
enough
headroom.

If your indexes are 250 GB, you need 250 GB of free space.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 30, 2018, at 1:13 PM, Antony A  wrote:

Thanks Erick/Deepak.

The cloud is running on baremetal (128 GB/24 cpu).

Is there an option to run a compact on the data files to make the size
equal on both the clouds? I am trying find all the options before I add
the
new fields into the production cloud.

Thanks
AA

On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson

wrote:


Anthony:

You are probably seeing the results of removing deleted documents from
the shards as they're merged. Even on replicas in the same _shard_,
the size of the index on disk won't necessarily be identical. This has
to do with which segments are selected for merging, which are not
necessarily coordinated across replicas.

The test is if the number of docs on each collection is the same. If
it is, then don't worry about index sizes.

Best,
Erick

On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel 
wrote:

Could you please also give the machine details of the two clouds you
are
running?



Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Mon, Apr 30, 2018 at 9:51 PM, Antony A 

wro

RE: User queries end up in filterCache if facetting is enabled

2018-05-03 Thread Markus Jelsma
By the way, the queries end up in the filterCache regardless of the value set 
in useFilterForSortedQuery.

Thanks,
Markus

-Original message-
> From:Markus Jelsma 
> Sent: Thursday 3rd May 2018 12:05
> To: solr-user@lucene.apache.org; solr-user 
> Subject: RE: User queries end up in filterCache if facetting is enabled
> 
> Thanks Mikhail,
> 
> But i thought about that setting too, but i do sort by score, as does Solr 
> /select handler by default. The enum method accounts for all the values for a 
> facet field, but not the user queries i see ending up in the cache.
> 
> Any other suggestions to shed light on this oddity?
> 
> Thanks!
> Markus
> 
>  
>  
> -Original message-
> > From:Mikhail Khludnev 
> > Sent: Thursday 3rd May 2018 9:43
> > To: solr-user 
> > Subject: Re: User queries end up in filterCache if facetting is enabled
> > 
> > I mean
> > https://lucene.apache.org/solr/guide/6_6/query-settings-in-solrconfig.html#QuerySettingsinSolrConfig-useFilterForSortedQuery
> > 
> > 
> > On Thu, May 3, 2018 at 10:42 AM, Mikhail Khludnev  wrote:
> > 
> > > Enum facets, facet refinements and https://lucene.apache.org/
> > > solr/guide/6_6/query-settings-in-solrconfig.html comes to my mind.
> > >
> > > On Wed, May 2, 2018 at 11:58 PM, Markus Jelsma  > > > wrote:
> > >
> > >> Hello,
> > >>
> > >> Anyone here to reproduce this oddity? It shows up in all our collections
> > >> once we enable the stats page to show filterCache entries.
> > >>
> > >> Is this normal? Am i completely missing something?
> > >>
> > >> Thanks,
> > >> Markus
> > >>
> > >>
> > >>
> > >> -Original message-
> > >> > From:Markus Jelsma 
> > >> > Sent: Tuesday 1st May 2018 17:32
> > >> > To: Solr-user 
> > >> > Subject: User queries end up in filterCache if facetting is enabled
> > >> >
> > >> > Hello,
> > >> >
> > >> > We noticed the number of entries of the filterCache to be higher than
> > >> we expected, using showItems="1024" something unexpected was listed as
> > >> entries of the filterCache, the complete Query.toString() of our user
> > >> queries, massive entries, a lot of them.
> > >> >
> > >> > We also spotted all entries of fields we facet on, even though we don't
> > >> use them as filtes, but that is caused by facet.field=enum, and should be
> > >> expected, right?
> > >> >
> > >> > Now, the user query entries are not expected. In the simplest set up,
> > >> searching for something and only enabling the facet engine with 
> > >> facet=true
> > >> causes it to appears in the cache as an entry. The following queries:
> > >> >
> > >> > http://localhost:8983/solr/search/select?q=content_nl:nog&facet=true
> > >> > http://localhost:8983/solr/search/select?q=*:*&facet=true
> > >> >
> > >> > become listed as:
> > >> >
> > >> > CACHE.searcher.filterCache.item_*:*:
> > >> > org.apache.solr.search.BitDocSet@​70051ee0
> > >> >
> > >> > CACHE.searcher.filterCache.item_content_nl:nog:
> > >> > org.apache.solr.search.BitDocSet@​13150cf6
> > >> >
> > >> > This is on 7.3, but 7.2.1 does this as well.
> > >> >
> > >> > So, should i expect this? Can i disable this? Bug?
> > >> >
> > >> >
> > >> > Thanks,
> > >> > Markus
> > >> >
> > >> >
> > >> >
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> > 
> > 
> > 
> > -- 
> > Sincerely yours
> > Mikhail Khludnev
> > 
> 


Solr question about deleting core permanently

2018-05-03 Thread Alexey Ponomarenko
Hi, I have a question:
https://stackoverflow.com/questions/50150507/how-can-i-delete-all-fields-after-corecollection-was-deleted
Can you help me?

This is regarding deleting a Solr core permanently.


Re: Shard size variation

2018-05-03 Thread Erick Erickson
"We generally try not to change defaults when possible, sounds like
there will be new default settings for the segment sizes and merging
policy?"

usually wise.

No, there won't be any change in the default settings.

What _will_ change is the behavior of a forceMerge (aka optimize) and
expungeDeletes when using TieredMergePolicy (the default) in that they
will by default respect maxSegmentSizeMB which has defaulted to 5G
since forever. The fact that optimize merged down to a single segment
by default is, in one view, a bug.

The current implementation can hover around 50% deleted docs in an
index, that behavior won't change with LUCENE-7976. The percentage
could possibly be larger if you've optimized before, see the problem
statement on that JIRA.

The other behavior that'll change is that if you _have_ merged down to one
segment, that very large segment will be eligible for merging in
situations where it wasn't before, so your index should hover around
50% deleted docs if it's large to begin with. See Mike's blog here:
https://www.elastic.co/blog/lucenes-handling-of-deleted-documents

LUCENE-8263 is where we're discussing adding a new parameter to TMP,
that won't be in LUCENE-7976

If you absolutely _insist_ on having one large segment as you do currently,
you will be able to use the existing maxSegments option for optimize
and get your one large segment back. It's strongly recommended that
you _don't_ do that, though. Optimize will still purge all deleted docs
from the index as it does now, just while respecting the max segment
size.

"Am I right in thinking that expungeDeletes will (in theory) be a 7.4
forwards option?"

expungeDeletes is a current option and has been around for quite a
while, see: 
https://lucene.apache.org/solr/guide/7_3/uploading-data-with-index-handlers.html.
It suffers from the same problem forceMerge/optimize does, however,
since it can create very large segments. That operation will also
respect the max segment size as of LUCENE-7976.
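
For example, it is requested as an attribute of a commit message posted to the
/update handler (the attributes shown are just a sketch):

  <commit expungeDeletes="true" waitSearcher="false"/>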

Best,
Erick


On Thu, May 3, 2018 at 7:02 AM, Michael Joyner  wrote:
> We generally try not to change defaults when possible, sounds like there
> will be new default settings for the segment sizes and merging policy?
>
> Am I right in thinking that expungeDeletes will (in theory) be a 7.4
> forwards option?
>
>
> On 05/02/2018 01:29 PM, Erick Erickson wrote:
>>
>> You can always increase the maximum segment size. For large indexes
>> that should reduce the number of segments. But watch your indexing
>> stats, I can't predict the consequences of bumping it to 100G for
>> instance. I'd _expect_  bursty I/O whne those large segments started
>> to be created or merged
>>
>> You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably)
>> the idea of increasing the segment sizes and/or a related JIRA that
>> allows you to tweak how aggressively solr merges segments that have
>> deleted docs.
>>
>> NOTE: that JIRA has the consequence that _by default_ the optimize
>> with no parameters respects the maximum segment size, which is a
>> change from now.
>>
>> Finally, expungeDeletes may be useful as that too will respect max
>> segment size, again after LUCENE-7976 is committed.
>>
>> Best,
>> Erick
>>
>> On Wed, May 2, 2018 at 9:22 AM, Michael Joyner  wrote:
>>>
>>> The main reason we go this route is that after awhile (with default
>>> settings) we end up with hundreds of shards and performance of course
>>> drops
>>> abysmally as a result. By using a stepped optimize a) we don't run into
>>> the
>>> we need the 3x+ head room issue, b) optimize performance penalty during
>>> optimize is less than the hundreds of shards not being optimized
>>> performance
>>> penalty.
>>>
>>> BTW, as we use batched a batch insert/update cycle [once daily] we only
>>> do
>>> optimize to a segment of 1 after a complete batch has been run. Though
>>> during the batch we reduce segment counts down to a max of 16 every 250K
>>> insert/updates to prevent the large segment count performance penalty.
>>>
>>>
>>> On 04/30/2018 07:10 PM, Erick Erickson wrote:

 There's really no good way to purge deleted documents from the index
 other than to wait until merging happens.

 Optimize/forceMerge and expungeDeletes both suffer from the problem
 that they create massive segments that then stick around for a very
 long time, see:


 https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

 Best,
 Erick

 On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner 
 wrote:
>
> Based on experience, 2x head room is room is not always enough,
> sometimes
> not even 3x, if you are optimizing from many segments down to 1 segment
> in a
> single go.
>
> We have however figured out a way that can work with as little as 51%
> free
> space via the following iteration cycle:
>
> public void solrOptimize() {
>   int initialMaxSegments = 256;
>   i

solrj (admin) requests

2018-05-03 Thread Arturas Mazeika
Hi Solr Team,

Short question:

How can I systematically explore the solrj functionality/API?

Long question:

I am discovering solrj functionality and I am pretty impressed by what
solrj can do. What I am less impressed by is my ability to find what I
am looking for. On the positive side, one can relatively quickly find ways
to insert/delete/update/query docs using solrj (basically using google [1]
or from [2]). It is a rather simple task to get exposed to some admin
functionality through [3]. From this point on, things get more difficult:
I was able to query for some admin info by trial and error using this Java
code:

ArrayList<String> urls = new ArrayList<>();
urls.add(solrUrl);
CloudSolrClient client = new CloudSolrClient.Builder(urls)
    .withConnectionTimeout(1)
    .withSocketTimeout(6)
    .build();
client.setDefaultCollection("tph");

final SolrRequest request = new CollectionAdminRequest.ClusterStatus();
final NamedList<Object> response = client.request(request);
final NamedList<Object> cluster = (NamedList<Object>) response.get("cluster");
final ArrayList<String> nodes = (ArrayList<String>) cluster.get("live_nodes");
final NamedList<Object> cols = (NamedList<Object>) cluster.get("collections");
final LinkedHashMap<String, Object> tph = (LinkedHashMap<String, Object>) cols.get("tph");

and then looking at what the key names are, etc. Things become more
difficult here, as (1) lots of functionality hides behind generic "get"
methods and (2) the containers that are returned vary from NamedList to
anything possible (ArrayList, String, HashMap, etc.)

Getting the size of one of the indexes, e.g., with

ArrayList<String> urls = new ArrayList<>();
urls.add("http://localhost:8983/solr");
CloudSolrClient client = new CloudSolrClient.Builder(urls)
    .withConnectionTimeout(1)
    .withSocketTimeout(6)
    .build();

client.setDefaultCollection("trans");

CoreAdminRequest request = new CoreAdminRequest();
request.setAction(CoreAdminAction.STATUS);
request.setCoreName("trans_shard2_replica_n4");

request.setIndexInfoNeeded(true);
CoreAdminResponse resp = request.process(client);

NamedList<Object> coreStatus = resp.getCoreStatus("trans_shard2_replica_n4");
NamedList<Object> indexStats = (NamedList<Object>) coreStatus.get("index");
System.out.println(indexStats.get("sizeInBytes"));

is even more challenging to learn how to do, as one needs to google deeper
and deeper [4].

I also looked at the following books for systematic ways to learn solrj:

* Apache Solr 4 Cookbook,
* Apache Solr Search Patterns

Is there a simple and systematic way to get exposed to solrj and available
API/functionality?

Cheers,
Arturas

[1] http://www.baeldung.com/apache-solrj
[2] https://lucene.apache.org/solr/guide/7_2/using-solrj.html
[3]
https://lucene.apache.org/solr/7_2_0/solr-solrj/index.html?org/apache/solr/client/solrj/request/CollectionAdminRequest.ClusterStatus.html
[4]
https://www.programcreek.com/java-api-examples/?api=org.apache.solr.client.solrj.request.CoreAdminRequest


Re: SolrCloud replicaition

2018-05-03 Thread Erick Erickson
Shalin's right, I was hurried in my response and forgot that the
min_rf just _allows_ the client to figure out that the update didn't
land on enough replicas; the client has to "do the right thing" with
that information. Thanks Shalin!

Right, your scenario is correct. When the follower goes back to
"active" and starts serving queries it will be all caught up with the
leader, including any missed documents.

Your step 4, the client gets a success response since the document was
indexed successfully on the leader. There's some additional
information in the response saying min_rf wasn't met and you should do
whatever you think appropriate. Stop indexing, retry, send a message
to your sysadmin, etc.

You can figure out exactly what you get back with a pretty simple experiment:
just take one replica of a two-replica system down and specify a min_rf of
2, as in the sketch below.
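
A rough SolrJ sketch of such an update (client is assumed to be an existing
CloudSolrClient, "mycollection" is illustrative, and getMinAchievedReplicationFactor
should be double-checked against your SolrJ version):

  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "min-rf-test-1");

  UpdateRequest req = new UpdateRequest();
  req.add(doc);
  req.setParam("min_rf", "2"); // ask Solr to report the achieved replication factor

  UpdateResponse rsp = req.process(client, "mycollection");
  // Compare achieved vs. requested and retry/alert as you see fit.
  int achieved = client.getMinAchievedReplicationFactor("mycollection", rsp.getResponse());
  System.out.println("achieved rf = " + achieved);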

Best,
Erick

On Wed, May 2, 2018 at 9:20 PM, Greenhorn Techie
 wrote:
> Shalin,
>
> Given the earlier response by Erick, wondering when this scenario occurs
> i.e. when the replica node recovers after a time period, wouldn’t it
> automatically recover all the missed updates by connecting to the leader?
> My understanding is the below from the responses so far (assuming
> replication factor of 2 for simplicity purposes):
>
> 1. Client tries an update request which is received by the shard leader
> 2. Leader once it updates on its own node, send the update to the
> unavailable replica node
> 3. Leader keeps trying to send the update to the replica node
> 4. After a while leader gives up and communicates to the client (not sure
> what kind of message will the client receive in this case?)
> 5. Replica node recovers and then realises that it needs to catch-up and
> hence receives all the updates in recovery mode
>
> Correct me if I am wrong in my understanding.
>
> Thnx!!
>
>
> On 3 May 2018 at 04:10:12, Shalin Shekhar Mangar (shalinman...@gmail.com)
> wrote:
>
> The min_rf parameter does not fail indexing. It only tells you how many
> replicas received the live update. So if the value is less than what you
> wanted then it is up to you to retry the update later.
>
> On Wed, May 2, 2018 at 3:33 PM, Greenhorn Techie 
>
> wrote:
>
>> Hi,
>>
>> Good Morning!!
>>
>> In the case of a SolrCloud setup with sharing and replication in place,
>> when a document is sent for indexing, what happens when only the shard
>> leader has indexed the document, but the replicas failed, for whatever
>> reason. Will the document be resent by the leader to the replica shards
> to
>> index the document after sometime or how is scenario addressed?
>>
>> Also, given the above context, when I set the value of min_rf parameter
> to
>> say 2, does that mean the calling application will be informed that the
>> indexing failed?
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.


Re: solrj (admin) requests

2018-05-03 Thread Erick Erickson
Yeah, that can be a pain. Unfortunately there's no official
"programming guide" for instance.

What there is, however, is an extensive suite of unit tests in
/Users/Erick/apache/solrJiras/master/solr/solrj/src/test/org/apache/solr/client/solrj.
>From there it's often a hunt though.

Best,
Erick

On Thu, May 3, 2018 at 8:07 AM, Arturas Mazeika  wrote:
> Hi Solr Team,
>
> Short question:
>
> How can I systematically explore the solrj functionality/API?
>
> Long question:
>
> I am discovering solrj functionality and I am pretty much impressed what
> solrj can do. What I am less impressed is my knowledge how to find what I
> am looking for. On the positive side, one can relatively quickly find ways
> to insert/delete/update/query docs using solrj (basically using google [1]
> or from [2]). It is a rather simple task to get exposed to some admin
> functionality through [3]. From this on, things are getting more difficult:
> I was able to query for some admin infos by trial and error using this java
> code:
>
> ArrayList urls = new ArrayList<>();
> urls.add(solrUrl);
> CloudSolrClient client = new CloudSolrClient.Builder(urls)
> .withConnectionTimeout(1)
> .withSocketTimeout(6)
> .build();
> client.setDefaultCollection("tph");
>
> final SolrRequest request = new CollectionAdminRequest.ClusterStatus();
> final NamedList response = client.request(request);
> final NamedList cluster  = (NamedList)
> response.get("cluster");
> final ArrayList nodes= (ArrayList)
> cluster.get("live_nodes");
> final NamedList cols   = (NamedList)
> cluster.get("collections");
> final LinkedHashMap tph = (LinkedHashMap)
> cols.get("tph");
>
> and then looking at what the keys names are, etc. Things are becoming more
> difficult here, as (1) lots of functionality hides behind generic "get"
> methods and (2) The containers that are returned vary from NamedList, to
> anything possible (arraylist, string, hashmap, etc.)
>
> Getting the size of one of the index, e.g., with
>
> ArrayList urls = new ArrayList<>();
> urls.add("http://localhost:8983/solr";);
> CloudSolrClient client = new CloudSolrClient.Builder(urls)
> .withConnectionTimeout(1)
> .withSocketTimeout(6)
> .build();
>
> client.setDefaultCollection("trans");
>
> CoreAdminRequest request = new CoreAdminRequest();
> request.setAction(CoreAdminAction.STATUS);
> request.setCoreName("trans_shard2_replica_n4");
>
> request.setIndexInfoNeeded(true);
> CoreAdminResponse resp = request.process(client);
>
> NamedList coreStatus =
> resp.getCoreStatus("trans_shard2_replica_n4");
> NamedList indexStats = (NamedList) coreStatus.get("index");
> System.out.println(indexStats.get("sizeInBytes"));
>
> is even more challenging to learn (how to) as one needs to google deeper
> and deeper [4].
>
> I also looked at the following books for systematic ways to learn solrj:
>
> * Apache Solr 4 Cookbook,
> * Apache Solr Search Patterns
>
> Is there a simple and systematic way to get exposed to solrj and available
> API/functionality?
>
> Cheers,
> Arturas
>
> [1] http://www.baeldung.com/apache-solrj
> [2] https://lucene.apache.org/solr/guide/7_2/using-solrj.html
> [3]
> https://lucene.apache.org/solr/7_2_0/solr-solrj/index.html?org/apache/solr/client/solrj/request/CollectionAdminRequest.ClusterStatus.html
> [4]
> https://www.programcreek.com/java-api-examples/?api=org.apache.solr.client.solrj.request.CoreAdminRequest


Re: SolrCloud replicaition

2018-05-03 Thread Greenhorn Techie
Perfect! Thanks Erick and Shalin!!


On 3 May 2018 at 16:13:06, Erick Erickson (erickerick...@gmail.com) wrote:

Shalin's right, I was hurried in my response and forgot that the
min_rf just _allows_ the client to figure out that the update didn't
get updated on enough replicas and the client has to "do the right
thing" with that information, thanks Shalin!

Right, your scenario is correct. When the follower goes back to
"active" and starts serving queries it will be all caught up with the
leader, including any missed documents.

Your step 4, the client gets a success response since the document was
indexed successfully on the leader. There's some additional
information in the response saying min_rf wasn't met and you should do
whatever you think appropriate. Stop indexing, retry, send a message
to your sysadmin, etc.

You can figure out exactly what by a pretty simple experiment, just
take one replica of a two-replica system down and specify min_rf of
2.

Best,
Erick

On Wed, May 2, 2018 at 9:20 PM, Greenhorn Techie
 wrote:
> Shalin,
>
> Given the earlier response by Erick, wondering when this scenario occurs
> i.e. when the replica node recovers after a time period, wouldn’t it
> automatically recover all the missed updates by connecting to the leader?
> My understanding is the below from the responses so far (assuming
> replication factor of 2 for simplicity purposes):
>
> 1. Client tries an update request which is received by the shard leader
> 2. Leader once it updates on its own node, send the update to the
> unavailable replica node
> 3. Leader keeps trying to send the update to the replica node
> 4. After a while leader gives up and communicates to the client (not sure
> what kind of message will the client receive in this case?)
> 5. Replica node recovers and then realises that it needs to catch-up and
> hence receives all the updates in recovery mode
>
> Correct me if I am wrong in my understanding.
>
> Thnx!!
>
>
> On 3 May 2018 at 04:10:12, Shalin Shekhar Mangar (shalinman...@gmail.com)
> wrote:
>
> The min_rf parameter does not fail indexing. It only tells you how many
> replicas received the live update. So if the value is less than what you
> wanted then it is up to you to retry the update later.
>
> On Wed, May 2, 2018 at 3:33 PM, Greenhorn Techie <
greenhorntec...@gmail.com>
>
> wrote:
>
>> Hi,
>>
>> Good Morning!!
>>
>> In the case of a SolrCloud setup with sharing and replication in place,
>> when a document is sent for indexing, what happens when only the shard
>> leader has indexed the document, but the replicas failed, for whatever
>> reason. Will the document be resent by the leader to the replica shards
> to
>> index the document after sometime or how is scenario addressed?
>>
>> Also, given the above context, when I set the value of min_rf parameter
> to
>> say 2, does that mean the calling application will be informed that the
>> indexing failed?
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.


inconsistent results

2018-05-03 Thread Satya Marivada
Hi there,

We have a Solr (6.3.0) index which is re-indexed every night; it
takes about 6-7 hours for the indexing to complete. During the time of
re-indexing, the index becomes flaky and serves an inconsistent count of
documents: 70,000 at times and 80,000 at times. After the indexing is
completed, it serves the consistent and correct number of documents that it
has indexed from the database. Any suggestions on this?

Also, Solr writes to the same location as the current index during re-indexing.
Could this be a cause for concern?

Thanks,
Satya


Re: Learning to Rank (LTR) with grouping

2018-05-03 Thread Diego Ceccarelli
Thanks ilayaraja,

I updated the PR today integrating your and Alan's comments. Now it works
also in distributed mode. Please let me know what do you think :)

Cheers
Diego

On Wed, May 2, 2018, 17:46 ilayaraja  wrote:

> Figured out that offset is used as part of the grouping patch which I
> applied
> (SOLR-8776) :
> solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java
> +  if (query instanceof AbstractReRankQuery){
> +topNGroups = cmd.getOffset() +
> ((AbstractReRankQuery)query).getReRankDocs();
> +  } else {
> +topNGroups = cmd.getOffset() + cmd.getLen();
>
>
>
>
>
>
> -
> --Ilay
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: inconsistent results

2018-05-03 Thread Erick Erickson
The short form is that different replicas in a shard have different
commit points if you go by wall-clock time. So during heavy indexing,
you can happen to catch the different counts. That really shouldn't
happen, though, unless you're clearing the index first on the
assumption that you're replacing the same docs each time.

One solution people use is to index to a "dark" collection, then use
collection aliasing to atomically switch when the job is done.

Best,
Erick


On Thu, May 3, 2018 at 11:55 AM, Satya Marivada
 wrote:
> Hi there,
>
> We have a solr (6.3.0) index which is being re-indexed every night, it
> takes about 6-7 hours for the indexing to complete. During the time of
> re-indexing, the index becomes flaky and would serve inconsistent count of
> documents 70,000 at times and 80,000 at times. After the indexing is
> completed, it serves the consistent and right number of documents that it
> has indexed from the database. Any suggestions on this.
>
> Also solr writes to the same location as current index during re-indexing.
> Could this be the cause of concern?
>
> Thanks,
> Satya


Re: inconsistent results

2018-05-03 Thread Satya Marivada
Yes, we are doing a clean and full import. Is it not supposed to serve the
old (existing) index till the new index is built, and only then do a cleanup and
replace the old index?

Would a full import without clean avoid this problem?

Thanks Erick, this would be useful.

On Thu, May 3, 2018, 4:28 PM Erick Erickson  wrote:

> The short for is that different replicas in a shard have different
> commit point if you go by wall-clock time. So during heavy indexing,
> you can happen to catch the different counts. That really shouldn't
> happen, though, unless you're clearing the index first on the
> assumption that you're replacing the same docs each time
>
> One solution people use is to index to a "dark" collection, then use
> collection aliasing to atomically switch when the job is done.
>
> Best,
> Erick
>
>
> On Thu, May 3, 2018 at 11:55 AM, Satya Marivada
>  wrote:
> > Hi there,
> >
> > We have a solr (6.3.0) index which is being re-indexed every night, it
> > takes about 6-7 hours for the indexing to complete. During the time of
> > re-indexing, the index becomes flaky and would serve inconsistent count
> of
> > documents 70,000 at times and 80,000 at times. After the indexing is
> > completed, it serves the consistent and right number of documents that it
> > has indexed from the database. Any suggestions on this.
> >
> > Also solr writes to the same location as current index during
> re-indexing.
> > Could this be the cause of concern?
> >
> > Thanks,
> > Satya
>


Re: inconsistent results

2018-05-03 Thread Shawn Heisey
On 5/3/2018 12:55 PM, Satya Marivada wrote:
> We have a solr (6.3.0) index which is being re-indexed every night, it
> takes about 6-7 hours for the indexing to complete. During the time of
> re-indexing, the index becomes flaky and would serve inconsistent count of
> documents 70,000 at times and 80,000 at times. After the indexing is
> completed, it serves the consistent and right number of documents that it
> has indexed from the database. Any suggestions on this.

Initial guess is that there are commits being fired before the whole
indexing process is complete.

If you're running in cloud mode, there could be other things going on.

> Also solr writes to the same location as current index during re-indexing.
> Could this be the cause of concern?

When you use an existing index as the write location for a re-index, you
must be very careful to ensure that you do not ever send any commit
requests before the entire indexing process is complete.  The autoCommit
config in solrconfig.xml must have openSearcher set to false, and
autoSoftCommit must not be active.  That way, all queries sent before
the process completes will be handled by the index that existed before
the indexing process started.  A commit when the process is done will
send new queries to the new state of the index.
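
In solrconfig.xml that looks roughly like this (the maxTime values are only
illustrative):

  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- keep soft commits disabled (maxTime of -1) while reindexing in place -->
  <autoSoftCommit>
    <maxTime>-1</maxTime>
  </autoSoftCommit>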

An alternate idea would be to index the replacement index into a
different core/collection, and then swap the indexes.  In SolrCloud
mode, the swap would be accomplished using the Collection Alias feature.

Thanks,
Shawn



Re: solrj (admin) requests

2018-05-03 Thread Shawn Heisey
On 5/3/2018 9:07 AM, Arturas Mazeika wrote:
> Short question:
>
> How can I systematically explore the solrj functionality/API?

As Erick said, there is not an extensive programming guide.  The
javadocs for SolrJ classes are pretty decent, but figuring out precisely
what the response objects actually contain does require experimentation.

> final NamedList cluster  = (NamedList)
> response.get("cluster");
> final ArrayList nodes= (ArrayList)
> cluster.get("live_nodes");
> final NamedList cols   = (NamedList)
> cluster.get("collections");
> final LinkedHashMap tph = (LinkedHashMap)
> cols.get("tph");

It's possible to replace all this code with one line:

LinkedHashMap tph = (LinkedHashMap)
response.findRecursive("cluster", "collections", "tph");

I *did* make sure this really does work with a 7.3 cloud example, so use
with confidence!

> and then looking at what the keys names are, etc. Things are becoming more
> difficult here, as (1) lots of functionality hides behind generic "get"
> methods and (2) The containers that are returned vary from NamedList, to
> anything possible (arraylist, string, hashmap, etc.)

SolrJ (and Solr itself, because SolrJ is an integral part of the server)
uses the NamedList object for a LOT of things.  It's useful for encoding
a wide range of responses and configuration information.  It can, as you
noticed, hold any type of object, including additional NamedList
instances.  There is typically a NamedList object in all response
objects which contains the *entire* response.

Some of the information in a response is ALSO available via sugar
methods which translate parts of the full response to other data types. 
For instance, you can get numFound out of a query response without
ripping the NamedList apart:

  SolrQuery q = new SolrQuery("*:*");
  QueryResponse r = client.query(q);
  long numFound = r.getResults().getNumFound();

This value is also available from the NamedList object, but the code to
obtain it is very ugly:

  long numF = ((SolrDocumentList)
r.getResponse().get("response")).getNumFound();

If you make requests manually (possibly in a browser) with
wt=json&indent=true, it only takes a little practice before you'll be
able to translate what you see into the NamedList structure that the
SolrJ response object will contain.  You can also print the output of
the toString() method on the NamedList to see the structure, but that
output usually doesn't include the object types, only their values.

Javadocs are a primary source of information.  Exploring the responses
fills in the holes.

Using wt=xml instead of wt=json in manual requests actually yields more
information, but json is easier to read.

Thanks,
Shawn



Re: Regarding LTR feature

2018-05-03 Thread prateek . agarwal
Thanks again Alessandro

I tried the feature and the MinMaxNormalizer you suggested, but there is a
slight problem with the params for the normalization: I don't really know the
range (min, max) of values the payload_score outputs, and they are different for
different queries.

I even tried looking at the source code to see if there is a way I can override
a class so that it iterates over all the re-ranked documents, calculates the max
and min there itself, and passes them to the MinMaxNormalizer class, but it seems
that's not possible.

Your help will be really appreciated.

Thanks



Regards,
Prateek



On 2018/05/03 14:00:00, Alessandro Benedetti  wrote: 
> Mmmm, first of all, you know that each Solr feature is calculated per
> document right ?
> So you want to calculate the payload score for the document you are
> re-ranking, based on the query ( your External Feature Information) and
> normalize across the different documents?
> 
> I would go with this feature and use the normalization LTR functionality :
> 
> { 
>   "store" : "my_feature_store", 
>   "name" : "in_aggregated_terms", 
>   "class" : "org.apache.solr.ltr.feature.SolrFeature", 
>   "params" : { "q" : "{!payload_score 
> f=aggregated_terms func=max v=${query}}" } 
> } 
> 
> Then in the model you specify something like :
> 
> "name" : "myModelName",
>"features" : [
>{
>  "name" : "isBook"
>},
> ...
>{
>  "name" : "in_aggregated_terms",
>  "norm": {
>  "class" : "org.apache.solr.ltr.norm.MinMaxNormalizer",
>  "params" : { "min":"x", "max":"y" }
>  }
>},
>}
> 
> Give it a try, let me know
> 
> 
> 
> 
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 


the number of docs in each group depends on rows

2018-05-03 Thread fatduo
Hi,
We use SolrCloud 7.1.0 (3 nodes, 3 shards with 2 replicas). When we use a
group query, we find that the number of docs in each group depends on the
rows value (the number of groups).

difference:
 

When rows is bigger than 5, the returned docs are correct and stable; for the
rest, the number of docs is smaller than the actual result.

Could you please explain why, and give me some suggestions about how to decide
the rows value?





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html