Re: Solr Result Tagging

2013-10-27 Thread Isaac Hebsh
Hi,
Try using facet.query on each part; you will get the total number of hits
for each OR clause.
If you need this info per document, the answer might appear when
specifying debugQuery=true. If that info is useful, try adding
"[explain]" to the fl param (probably requires registering the augmenter
plugin in solrconfig.xml).
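
For example (using the clauses from your query below as placeholders), the
same request can carry one facet.query per clause:

http://localhost:8983/solr/collection1/select?q=(A OR B OR C) OR (X AND Y AND Z)&facet=true&facet.query=(A OR B OR C)&facet.query=(X AND Y AND Z)

The facet_queries section of the response then reports a hit count per
clause (collection-wide, not per returned document).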

- Isaac.

On Friday, October 25, 2013, Cool Techi wrote:

> Hi,
> My search queries to solr are of the following nature,
>  (A OR B OR C) OR (X AND Y AND Z) OR ((ABC AND DEF) - XYZ)
> What I am trying to achieve is: when I fire the query, the results returned
> should be tagged with which part of the OR produced the result.
> In case all three parts above are applicable then the result should
> indicate the same. I tried the group.query feature, but it doesn't seem to
> work on SolrCloud.
> Thanks,Ayush
>


Bad fieldNorm when using morphologic synonyms

2013-12-05 Thread Isaac Hebsh
Hi,
we implemented a morphologic analyzer, which stems words at index time.
For several reasons, we index both the original word and the stem (at the same
position, of course).
The stemming is done for a specific language, so other languages are not
stemmed at all.

Because of that, two documents with the same number of terms may have
different termVector sizes. A document which contains many words that get
stemmed will have a double-sized termVector. This behaviour affects the
relevance score in a BAD way: the fieldNorm of these documents reduces
their score. This is NOT the desired behaviour in our case.

We are looking for a way to "mark" the stemmed words (at index time, of
course) so they won't affect the fieldNorm. Does such a way exist?

Do you have another idea?


Global query parameters to facet query

2013-12-05 Thread Isaac Hebsh
Hi,

It seems that a facet query does not use the global query parameters (for
example, field aliasing for edismax parser).
We make intensive use of facet queries (in some cases, we have a lot of
facet.query parameters for a single q), and using LocalParams for each
facet.query is not convenient.

Did I miss a normal way to solve this?
Has anyone else encountered this requirement?


Re: Bad fieldNorm when using morphologic synonyms

2013-12-05 Thread Isaac Hebsh
The field is our main textual field. In the standard case,
length normalization does significant work together with tf-idf; we don't
want to lose it.

Removing duplicates won't help here, because the terms are not duplicates:
one term is stemmed, and the other is not.


On Fri, Dec 6, 2013 at 9:48 AM, Ahmet Arslan  wrote:

> Hi Isaac,
>
> Did you consider omitting norms completely for that field? omitNorms="true"
> Are you using solr.RemoveDuplicatesTokenFilterFactory?
>
>
>
> On Thursday, December 5, 2013 8:55 PM, Isaac Hebsh 
> wrote:
>
> Hi,
> we implemented a morphologic analyzer, which stems words on index time.
> For some reasons, we index both the original word and the stem (on the same
> position, of course).
> The stemming is done on a specific language, so other languages are not
> stemmed at all.
>
> Because of that, two documents with the same amount of terms, may have
> different termVector size. document which contains many words that being
> stemmed, will have a double sized termVector. This behaviour affects the
> relevance score in a BAD way. the fieldNorm of these documents reduces
> thier score. This is NOT the wanted behaviour in our case.
>
> We are looking for a way to "mark" the stemmed words (on index time, of
> course) so they won't affect the fieldNorm. Do such a way exist?
>
> Do you have another idea?
>


Re: Bad fieldNorm when using morphologic synonyms

2013-12-06 Thread Isaac Hebsh
1) Positions look all right (to me).
2) fieldNorm is determined by the size of the termVector, isn't it? The
termVector size isn't affected by the positions.


On Fri, Dec 6, 2013 at 10:46 AM, Robert Muir  wrote:

> Your analyzer needs to set positionIncrement correctly: sounds like its
> broken.
>
> On Thu, Dec 5, 2013 at 1:53 PM, Isaac Hebsh  wrote:
> > Hi,
> > we implemented a morphologic analyzer, which stems words on index time.
> > For some reasons, we index both the original word and the stem (on the
> same
> > position, of course).
> > The stemming is done on a specific language, so other languages are not
> > stemmed at all.
> >
> > Because of that, two documents with the same amount of terms, may have
> > different termVector size. document which contains many words that being
> > stemmed, will have a double sized termVector. This behaviour affects the
> > relevance score in a BAD way. the fieldNorm of these documents reduces
> > thier score. This is NOT the wanted behaviour in our case.
> >
> > We are looking for a way to "mark" the stemmed words (on index time, of
> > course) so they won't affect the fieldNorm. Do such a way exist?
> >
> > Do you have another idea?
>


LocalParam for nested query without escaping?

2013-12-06 Thread Isaac Hebsh
We want to set a LocalParam on a nested query. When querying with the "v"
inline parameter, it works fine:
http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
{!lucene df=text v="TERM2 TERM3 \"TERM4 TERM5\""}

the parsedquery_toString is
+id:TERM1 +(text:term2 text:term3 text:"term4 term5")

Query using the "_query_" also works fine:
http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND
_query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\""

(parsedquery is exactly the same).


BUT, when trying to put the nested query in place, it yields a syntax error:
http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1 AND
{!lucene df=text}(TERM2 TERM3 "TERM4 TERM5")

org.apache.solr.search.SyntaxError: Cannot parse '(TERM2'

The previous options are less preferred because of the escaping that has to
be done on the nested query.

Can't I set a LocalParam to a nested query without escaping the query?


Re: LocalParam for nested query without escaping?

2013-12-06 Thread Isaac Hebsh
Obviously, there is the option of external parameter ({...
v=$nestedq}&nestedq=...)
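
For example, something like:

http://localhost:8983/solr/collection1/select?defType=lucene&df=id&q=TERM1 AND {!lucene df=text v=$nestedq}&nestedq=TERM2 TERM3 "TERM4 TERM5"

so the nested query text is passed as-is, with no escaping needed.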

This is a good solution, but it is not practical when there are a lot of such
nested queries.

Any ideas?

On Friday, December 6, 2013, Isaac Hebsh wrote:

> We want to set a LocalParam on a nested query. When quering with "v"
> inline parameter, it works fine:
>
> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND
>  {!lucene df=text v="TERM2 TERM3 \"TERM4 TERM5\""}
>
> the parsedquery_toString is
> +id:TERM1 +(text:term2 text:term3 text:"term4 term5")
>
> Query using the "_query_" also works fine:
>
> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND
>  _query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\""
>
> (parsedquery is exactly the same).
>
>
> BUT, when trying to put the nested query in place, it yields syntax error:
>
> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND
>  {!lucene df=text}(TERM2 TERM3 "TERM4 TERM5")
>
> org.apache.solr.search.SyntaxError: Cannot parse '(TERM2'
>
> The previous options are less preferred, because the escaping that should
> be made on the nested query.
>
> Can't I set a LocalParam to a nested query without escaping the query?
>


Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Isaac Hebsh
Hi Robert and Manuel.

The DefaultSimilarity indeed sets discountOverlaps to true by default.
BUT the *factory*, aka DefaultSimilarityFactory, when called by
IndexSchema (the getSimilarity method), explicitly sets this value to the
value of its corresponding class member.
This class member is initialized to FALSE when the instance is created
(like every boolean variable in the world). It should be set when the "init"
method is called; if the parameter is not set in schema.xml, the default is
true.

Everything seems to be all right, but the issue is that the "init" method is
NOT called if the similarity is not *explicitly* declared in schema.xml. In
that case, the discountOverlaps member (of the factory class) remains FALSE,
and getSimilarity explicitly calls setDiscountOverlaps with a value of FALSE.

This is very easy to reproduce and debug.
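
Until this is fixed, explicitly declaring the similarity in schema.xml (so
that init is actually called) should restore the expected behaviour,
something like:

<similarity class="solr.DefaultSimilarityFactory">
  <bool name="discountOverlaps">true</bool>
</similarity>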


On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir  wrote:

> no, its turned on by default in the default similarity.
>
> as i said, all that is necessary is to fix your analyzer to emit the
> proper position increments.
>
> On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
>  wrote:
> > In order to set discountOverlaps to true you must have added the
> > <similarity class="solr.DefaultSimilarityFactory"/> element to schema.xml,
> > which is commented out by default!
> >
> > As by default this param is false, the above situation is expected with
> > correct positioning, as said.
> >
> > In order to fix the field norms you'd have to reindex with the similarity
> > class which initializes the param to true.
> >
> > Cheers,
> > Manu
>


Re: LocalParam for nested query without escaping?

2013-12-09 Thread Isaac Hebsh
If so, can someone suggest how a query should be escaped (securely and
correctly)?
Should I escape only the quote mark (and the backslash itself)?
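
In other words, would a client-side replacement like this (backslash first,
then the quote) be enough for the v="..." value? Just a sketch of what I
mean, not something I have verified:

static String escapeForLocalParamValue(String s) {
  // assumption: only backslash and double-quote are special inside v="..."
  return s.replace("\\", "\\\\").replace("\"", "\\\"");
}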


On Fri, Dec 6, 2013 at 2:59 PM, Isaac Hebsh  wrote:

> Obviously, there is the option of external parameter ({...
> v=$nestedq}&nestedq=...)
>
> This is a good solution, but it is not practical, when having a lot of
> such nested queries.
>
> Any ideas?
>
> On Friday, December 6, 2013, Isaac Hebsh wrote:
>
>> We want to set a LocalParam on a nested query. When quering with "v"
>> inline parameter, it works fine:
>>
>> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND
>>  {!lucene df=text v="TERM2 TERM3 \"TERM4 TERM5\""}
>>
>> the parsedquery_toString is
>> +id:TERM1 +(text:term2 text:term3 text:"term4 term5")
>>
>> Query using the "_query_" also works fine:
>>
>> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND
>>  _query_:"{!lucene df=text}TERM2 TERM3 \"TERM4 TERM5\""
>>
>> (parsedquery is exactly the same).
>>
>>
>> BUT, when trying to put the nested query in place, it yields syntax error:
>>
>> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND
>>  {!lucene df=text}(TERM2 TERM3 "TERM4 TERM5")
>>
>> org.apache.solr.search.SyntaxError: Cannot parse '(TERM2'
>>
>> The previous options are less preferred, because the escaping that should
>> be made on the nested query.
>>
>> Can't I set a LocalParam to a nested query without escaping the query?
>>
>


Re: Global query parameters to facet query

2013-12-09 Thread Isaac Hebsh
created SOLR-5542.
Anyone else want it?


On Thu, Dec 5, 2013 at 8:55 PM, Isaac Hebsh  wrote:

> Hi,
>
> It seems that a facet query does not use the global query parameters (for
> example, field aliasing for edismax parser).
> We have an intensive use of facet queries (in some cases, we have a lot of
> facet.query for a single q), and the using of LocalParams for each
> facet.query is not convenient.
>
> Did I miss a normal way to solve it?
> Did anyone else encountered this requirement?
>


Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Isaac Hebsh
You can see the norm value in the "explain" text when setting
debugQuery=true.
If the same item gets a different norm before/after, that's it.

Note that this configuration is in schema.xml (not solrconfig.xml...)

On Monday, December 9, 2013, Roman Chyla wrote:

> Isaac, is there an easy way to recognize this problem? We also index
> synonym tokens in the same position (like you do, and I'm sure that our
> positions are set correctly). I could test whether the default similarity
> factory in solrconfig.xml had any effect (before/after reindexing).
>
> --roman
>
>
> On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh 
> >
> wrote:
>
> > Hi Robert and Manuel.
> >
> > The DefaultSimilarity indeed sets discountOverlap to true by default.
> > BUT, the *factory*, aka DefaultSimilarityFactory, when called by
> > IndexSchema (the getSimilarity method), explicitly sets this value to the
> > value of its corresponding class member.
> > This class member is initialized to be FALSE  when the instance is
> created
> > (like every boolean variable in the world). It should be set when "init"
> > method is called. If the parameter is not set in schema.xml, the default
> is
> > true.
> >
> > Everything seems to be alright, but the issue is that "init" method is
> NOT
> > called, if the similarity is not *explicitly* declared in schema.xml. In
> > that case, init method is not called, the discountOverlaps member (of the
> > factory class) remains FALSE, and getSimilarity explicitly calls
> > setDiscountOverlaps with value of FALSE.
> >
> > This is very easy to reproduce and debug.
> >
> >
> > On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir >
> wrote:
> >
> > > no, its turned on by default in the default similarity.
> > >
> > > as i said, all that is necessary is to fix your analyzer to emit the
> > > proper position increments.
> > >
> > > On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
> > > > wrote:
> > > > In order to set discountOverlaps to true you must have added the
> > > >  to the schema.xml,
> > > which
> > > > is commented out by default!
> > > >
> > > > As by default this param is false, the above situation is expected
> with
> > > > correct positioning, as said.
> > > >
> > > > In order to fix the field norms you'd have to reindex with the
> > similarity
> > > > class which initializes the param to true.
> > > >
> > > > Cheers,
> > > > Manu
> > >
> >
>


Re: Global query parameters to facet query

2013-12-09 Thread Isaac Hebsh
My case is field aliasing with edismax.
Consider this request, sent against the example configuration:

http://localhost:8983/solr/collection1/select?defType=edismax&q.myalias.qf=text&q=myalias:1234&facet=true&facet.query=myalias:1234


the result is:

undefined field myalias
400


When disabling the facet, the query works perfectly. I understand that
neither the defType parameter nor the q.myalias.qf parameter affects the
facet query (which always runs with the lucene parser??).
When adding {!edismax} to the facet query, no error is thrown, but I can't
determine whether the alias is OK or whether edismax's "recovery mode" is
just too forgiving. debugQuery does not show the parsed query for the facet
query.

Anyway, there is a problem here.


On Tue, Dec 10, 2013 at 3:09 AM, Chris Hostetter
wrote:

> : It seems that a facet query does not use the global query parameters (for
> : example, field aliasing for edismax parser).
>
> can you please give a specific example of a query that isn't working for
> you?
>
> Using this query against the examle data, things work exactly as i would
> expect showing that the QParsers used for facet.queries inherit the
> global params (unless overridden by local params of course)...
>
>
> http://localhost:8983/solr/select?q=*:*&wt=json&indent=true&&facet=true&facet.query={!dismax}solr+bogus&facet.query={!dismax%20mm=1}solr+bogus&facet.query={!dismax%20mm=1%20qf=%27foo_t%27}solr+bogus&rows=0&mm=2&qf=name
> {
>   "responseHeader":{
> "status":0,
> "QTime":2,
> "params":{
>   "mm":"2",
>   "facet":"true",
>   "indent":"true",
>   "facet.query":["{!dismax}solr bogus",
> "{!dismax mm=1}solr bogus",
> "{!dismax mm=1 qf='foo_t'}solr bogus"],
>   "q":"*:*",
>   "qf":"name",
>   "wt":"json",
>   "rows":"0"}},
>   "response":{"numFound":32,"start":0,"docs":[]
>   },
>   "facet_counts":{
> "facet_queries":{
>   "{!dismax}solr bogus":0,
>   "{!dismax mm=1}solr bogus":1,
>   "{!dismax mm=1 qf='foo_t'}solr bogus":0},
> "facet_fields":{},
> "facet_dates":{},
> "facet_ranges":{}}}
>
>
>
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: LocalParam for nested query without escaping?

2013-12-19 Thread Isaac Hebsh
created SOLR-5560


On Tue, Dec 10, 2013 at 8:48 AM, William Bell  wrote:

> Sounds like a bug.
>
>
> On Mon, Dec 9, 2013 at 1:16 PM, Isaac Hebsh  wrote:
>
> > If so, can someone suggest how a query should be escaped (securely and
> > correctly)?
> > Should I escape the quote mark (and backslash mark itself) only?
> >
> >
> > On Fri, Dec 6, 2013 at 2:59 PM, Isaac Hebsh 
> wrote:
> >
> > > Obviously, there is the option of external parameter ({...
> > > v=$nestedq}&nestedq=...)
> > >
> > > This is a good solution, but it is not practical, when having a lot of
> > > such nested queries.
> > >
> > > Any ideas?
> > >
> > > On Friday, December 6, 2013, Isaac Hebsh wrote:
> > >
> > >> We want to set a LocalParam on a nested query. When quering with "v"
> > >> inline parameter, it works fine:
> > >>
> > >>
> >
> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND{!lucenedf=text
>  v="TERM2 TERM3 \"TERM4 TERM5\""}
> > >>
> > >> the parsedquery_toString is
> > >> +id:TERM1 +(text:term2 text:term3 text:"term4 term5")
> > >>
> > >> Query using the "_query_" also works fine:
> > >>
> > >>
> >
> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND_query_:"{!lucene
> df=text}TERM2 TERM3 \"TERM4 TERM5\""
> > >>
> > >> (parsedquery is exactly the same).
> > >>
> > >>
> > >> BUT, when trying to put the nested query in place, it yields syntax
> > error:
> > >>
> > >>
> >
> http://localhost:8983/solr/collection1/select?debugQuery=true&defType=lucene&df=id&q=TERM1AND{!lucenedf=text}(TERM2
>  TERM3 "TERM4 TERM5")
> > >>
> > >> org.apache.solr.search.SyntaxError: Cannot parse '(TERM2'
> > >>
> > >> The previous options are less preferred, because the escaping that
> > should
> > >> be made on the nested query.
> > >>
> > >> Can't I set a LocalParam to a nested query without escaping the query?
> > >>
> > >
> >
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>


Re: Bad fieldNorm when using morphologic synonyms

2013-12-19 Thread Isaac Hebsh
Roman, do you have any results?

created SOLR-5561

Robert, if I'm wrong, you are welcome to close that issue.


On Mon, Dec 9, 2013 at 10:50 PM, Isaac Hebsh  wrote:

> You can see the norm value, in the "explain" text, when setting
> debugQuery=true.
> If the same item gets different norm before/after, that's it.
>
> Note that this configuration is in schema.xml (not solrconfig.xml...)
>
> On Monday, December 9, 2013, Roman Chyla wrote:
>
>> Isaac, is there an easy way to recognize this problem? We also index
>> synonym tokens in the same position (like you do, and I'm sure that our
>> positions are set correctly). I could test whether the default similarity
>> factory in solrconfig.xml had any effect (before/after reindexing).
>>
>> --roman
>>
>>
>> On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh 
>> wrote:
>>
>> > Hi Robert and Manuel.
>> >
>> > The DefaultSimilarity indeed sets discountOverlap to true by default.
>> > BUT, the *factory*, aka DefaultSimilarityFactory, when called by
>> > IndexSchema (the getSimilarity method), explicitly sets this value to
>> the
>> > value of its corresponding class member.
>> > This class member is initialized to be FALSE  when the instance is
>> created
>> > (like every boolean variable in the world). It should be set when "init"
>> > method is called. If the parameter is not set in schema.xml, the
>> default is
>> > true.
>> >
>> > Everything seems to be alright, but the issue is that "init" method is
>> NOT
>> > called, if the similarity is not *explicitly* declared in schema.xml. In
>> > that case, init method is not called, the discountOverlaps member (of
>> the
>> > factory class) remains FALSE, and getSimilarity explicitly calls
>> > setDiscountOverlaps with value of FALSE.
>> >
>> > This is very easy to reproduce and debug.
>> >
>> >
>> > On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir  wrote:
>> >
>> > > no, its turned on by default in the default similarity.
>> > >
>> > > as i said, all that is necessary is to fix your analyzer to emit the
>> > > proper position increments.
>> > >
>> > > On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
>> > >  wrote:
>> > > > In order to set discountOverlaps to true you must have added the
>> > > >  to the
>> schema.xml,
>> > > which
>> > > > is commented out by default!
>> > > >
>> > > > As by default this param is false, the above situation is expected
>> with
>> > > > correct positioning, as said.
>> > > >
>> > > > In order to fix the field norms you'd have to reindex with the
>> > similarity
>> > > > class which initializes the param to true.
>> > > >
>> > > > Cheers,
>> > > > Manu
>> > >
>> >
>>
>


Re: Bad fieldNorm when using morphologic synonyms

2013-12-26 Thread Isaac Hebsh
Attached a patch to the JIRA issue.
Reviews are welcome.


On Thu, Dec 19, 2013 at 7:24 PM, Isaac Hebsh  wrote:

> Roman, do you have any results?
>
> created SOLR-5561
>
> Robert, if I'm wrong, you are welcome to close that issue.
>
>
> On Mon, Dec 9, 2013 at 10:50 PM, Isaac Hebsh wrote:
>
>> You can see the norm value, in the "explain" text, when setting
>> debugQuery=true.
>> If the same item gets different norm before/after, that's it.
>>
>> Note that this configuration is in schema.xml (not solrconfig.xml...)
>>
>> On Monday, December 9, 2013, Roman Chyla wrote:
>>
>>> Isaac, is there an easy way to recognize this problem? We also index
>>> synonym tokens in the same position (like you do, and I'm sure that our
>>> positions are set correctly). I could test whether the default similarity
>>> factory in solrconfig.xml had any effect (before/after reindexing).
>>>
>>> --roman
>>>
>>>
>>> On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh 
>>> wrote:
>>>
>>> > Hi Robert and Manuel.
>>> >
>>> > The DefaultSimilarity indeed sets discountOverlap to true by default.
>>> > BUT, the *factory*, aka DefaultSimilarityFactory, when called by
>>> > IndexSchema (the getSimilarity method), explicitly sets this value to
>>> the
>>> > value of its corresponding class member.
>>> > This class member is initialized to be FALSE  when the instance is
>>> created
>>> > (like every boolean variable in the world). It should be set when
>>> "init"
>>> > method is called. If the parameter is not set in schema.xml, the
>>> default is
>>> > true.
>>> >
>>> > Everything seems to be alright, but the issue is that "init" method is
>>> NOT
>>> > called, if the similarity is not *explicitly* declared in schema.xml.
>>> In
>>> > that case, init method is not called, the discountOverlaps member (of
>>> the
>>> > factory class) remains FALSE, and getSimilarity explicitly calls
>>> > setDiscountOverlaps with value of FALSE.
>>> >
>>> > This is very easy to reproduce and debug.
>>> >
>>> >
>>> > On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir  wrote:
>>> >
>>> > > no, its turned on by default in the default similarity.
>>> > >
>>> > > as i said, all that is necessary is to fix your analyzer to emit the
>>> > > proper position increments.
>>> > >
>>> > > On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
>>> > >  wrote:
>>> > > > In order to set discountOverlaps to true you must have added the
>>> > > >  to the
>>> schema.xml,
>>> > > which
>>> > > > is commented out by default!
>>> > > >
>>> > > > As by default this param is false, the above situation is expected
>>> with
>>> > > > correct positioning, as said.
>>> > > >
>>> > > > In order to fix the field norms you'd have to reindex with the
>>> > similarity
>>> > > > class which initializes the param to true.
>>> > > >
>>> > > > Cheers,
>>> > > > Manu
>>> > >
>>> >
>>>
>>
>


MoinMoin Dump

2013-07-17 Thread Isaac Hebsh
Hi,

There was a thread about viewing the Solr Wiki offline about 6 months ago.
I'm interested, too.

It seems that a manual (cron?) dump would do the job...

Would it be too much to ask one of the admins to manually create
such a dump? (http://moinmo.in/HelpOnMoinCommand/ExportDump)

Otis, is there any progress made on this in Apache Infra?


Sending shard requests to all replicas

2013-07-26 Thread Isaac Hebsh
Hi!

When SolrCloud executes a query, it creates shard requests, which are sent
to one replica of each shard. Total QTime is determined by the slowest
shard response (plus some extra time). [For simplicity, let's assume that
no stored fields are requested.]

I suffer from a situation where in every query, some shards are much slower
than others.

We might consider a different approach, which sends the shard request to
*ALL* replicas of each shard. Solr would continue once a response has arrived
from at least one replica of each shard.

Of course, the amount of wasted work is big (multiplied by the
replicationFactor), but in my case there are very few concurrent queries,
and the most important metric is qtime. Such a solution might
improve qtime significantly.
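
Conceptually, per shard it would be a "first replica response wins" race,
something like this toy sketch (plain Java, not actual Solr code;
queryReplica() below is just a stand-in for the real HTTP shard request):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FirstResponseWins {
  // stand-in for the real shard request
  static String queryReplica(String url) throws Exception {
    return "response from " + url;
  }

  static String queryAnyReplica(ExecutorService pool, List<String> replicas)
      throws Exception {
    CompletionService<String> cs = new ExecutorCompletionService<String>(pool);
    for (final String url : replicas) {
      cs.submit(new Callable<String>() {
        public String call() throws Exception { return queryReplica(url); }
      });
    }
    // first completed response wins; the slower replicas' work is discarded
    return cs.take().get();
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    System.out.println(queryAnyReplica(pool,
        Arrays.asList("http://replica1/...", "http://replica2/...")));
    pool.shutdown();
  }
}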


Has anyone tried this before?
Any tip on where I should start in the code?


Re: Sending shard requests to all replicas

2013-07-27 Thread Isaac Hebsh
Hi Erick, thanks.

I have about 40 shards. repFactor=2.
Finding the cause of the slower shards is very interesting, and this is the
main approach we took.
Note that in every query, a different shard is the slowest. In 20%
of the queries, the slowest shard takes about 4 times longer than the average
shard qtime.
While the investigation continues, remember it might be the virtualization /
storage access / network / gc /..., so I thought that reducing the effect
of the slow shards might be a good (temporary or permanent) solution.

I thought it should be an almost trivial code change (to prove the
concept). Isn't it?


On Sat, Jul 27, 2013 at 6:11 PM, Erick Erickson wrote:

> This has been suggested, but so far it's not been implemented
> as far as I know.
>
> I'm curious though, how many shards are you dealing with? I
> wonder if it would be a better idea to try to figure out _why_
> you so often have a slow shard and whether the problem could
> be cured with, say, better warming queries on the shards...
>
> Best
> Erick
>
> On Fri, Jul 26, 2013 at 8:23 AM, Isaac Hebsh 
> wrote:
> > Hi!
> >
> > When SolrClound executes a query, it creates shard requests, which is
> sent
> > to one replica of each shard. Total QTime is determined by the slowest
> > shard response (plus some extra time). [For simplicity, let's assume that
> > no stored fields are requested.]
> >
> > I suffer from a situation where in every query, some shards are much
> slower
> > than others.
> >
> > We might consider a different approach, which sends the shard request to
> > *ALL* replicas of each shard. Solr will continue when responses are got
> > from at least one replica of each shard.
> >
> > Of course, the amount of work that is wasted is big (multiplied by
> > replicationFactor), but in my case, there are very few concurrent
> queries,
> > and the most important performance is the qtime. Such a solution might
> > improve qtime significantly.
> >
> >
> > Did someone tried this before?
> > Any tip from where should I start in the code?
>


Re: Sending shard requests to all replicas

2013-07-27 Thread Isaac Hebsh
Shawn, thank you for the tips.
I know the significant cons of virtualization, but I don't want to move
this thread into a virtualization pros/cons in the Solr(Cloud) case.

I just asked what minimal code change should be made, in order to
examine whether this is a possible solution or not.. :)


On Sun, Jul 28, 2013 at 1:06 AM, Shawn Heisey  wrote:

> On 7/27/2013 3:33 PM, Isaac Hebsh wrote:
> > I have about 40 shards. repFactor=2.
> > The cause of slower shards is very interesting, and this is the main
> > approach we took.
> > Note that in every query, it is another shard which is the slowest. In
> 20%
> > of the queries, the slowest shard takes about 4 times more than the
> average
> > shard qtime.
> > While continuing investigation, remember it might be the virtualization /
> > storage-access / network / gc /..., so I thought that reducing the effect
> > of the slow shards might be a good (temporary or permanent) solution.
>
> Virtualization is not the best approach for Solr.  Assuming you're
> dealing with your own hardware and not something based in the cloud like
> Amazon, you can get better results by running on bare metal and having
> multiple shards per host.
>
> Garbage collection is a very likely source of this problem.
>
> http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
>
> > I thought it should be an almost trivial code change (for proving the
> > concept). Isn't it?
>
> I have no idea what you're saying/asking here.  Can you clarify?
>
> It seems to me that sending requests to all replicas would just increase
> the overall load on the cluster, with no real benefit.
>
> Thanks,
> Shawn
>
>


Re: Sending shard requests to all replicas

2013-07-30 Thread Isaac Hebsh
Hi,
I submitted a new JIRA for this:
https://issues.apache.org/jira/browse/SOLR-5092

A (very initial) patch is already attached. Reviews are very welcome.


On Sun, Jul 28, 2013 at 4:50 PM, Erick Erickson wrote:

> You'd probably start in CloudSolrServer in SolrJ code,
> as far as I know that's where the request is sent out.
>
> I'd think that would be better than changing Solr itself
> since if you found that this was useful you wouldn't
> be patching your Solr release, just keeping your client
> up to date.
>
> Best
> Erick
>
> On Sat, Jul 27, 2013 at 7:28 PM, Isaac Hebsh 
> wrote:
> > Shawn, thank you for the tips.
> > I know the significant cons of virtualization, but I don't want to move
> > this thread into a virtualization pros/cons in the Solr(Cloud) case.
> >
> > I've just asked what is the minimal code change should be made, in order
> to
> > examine whether this is a possible solution or not.. :)
> >
> >
> > On Sun, Jul 28, 2013 at 1:06 AM, Shawn Heisey  wrote:
> >
> >> On 7/27/2013 3:33 PM, Isaac Hebsh wrote:
> >> > I have about 40 shards. repFactor=2.
> >> > The cause of slower shards is very interesting, and this is the main
> >> > approach we took.
> >> > Note that in every query, it is another shard which is the slowest. In
> >> 20%
> >> > of the queries, the slowest shard takes about 4 times more than the
> >> average
> >> > shard qtime.
> >> > While continuing investigation, remember it might be the
> virtualization /
> >> > storage-access / network / gc /..., so I thought that reducing the
> effect
> >> > of the slow shards might be a good (temporary or permanent) solution.
> >>
> >> Virtualization is not the best approach for Solr.  Assuming you're
> >> dealing with your own hardware and not something based in the cloud like
> >> Amazon, you can get better results by running on bare metal and having
> >> multiple shards per host.
> >>
> >> Garbage collection is a very likely source of this problem.
> >>
> >> http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
> >>
> >> > I thought it should be an almost trivial code change (for proving the
> >> > concept). Isn't it?
> >>
> >> I have no idea what you're saying/asking here.  Can you clarify?
> >>
> >> It seems to me that sending requests to all replicas would just increase
> >> the overall load on the cluster, with no real benefit.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>


Re: Sending shard requests to all replicas

2013-07-31 Thread Isaac Hebsh
Thanks to Ryan Ernst; my issue is a duplicate of SOLR-4449.
I think that this proposal might be very useful (some supporting links are
attached there; worth reading..)


On Tue, Jul 30, 2013 at 11:49 PM, Isaac Hebsh  wrote:

> Hi,
> I submitted a new JIRA for this:
> https://issues.apache.org/jira/browse/SOLR-5092
>
> A (very initial) patch is already attached. Reviews are very welcome.
>
>
> On Sun, Jul 28, 2013 at 4:50 PM, Erick Erickson 
> wrote:
>
>> You'd probably start in CloudSolrServer in SolrJ code,
>> as far as I know that's where the request is sent out.
>>
>> I'd think that would be better than changing Solr itself
>> since if you found that this was useful you wouldn't
>> be patching your Solr release, just keeping your client
>> up to date.
>>
>> Best
>> Erick
>>
>> On Sat, Jul 27, 2013 at 7:28 PM, Isaac Hebsh 
>> wrote:
>> > Shawn, thank you for the tips.
>> > I know the significant cons of virtualization, but I don't want to move
>> > this thread into a virtualization pros/cons in the Solr(Cloud) case.
>> >
>> > I've just asked what is the minimal code change should be made, in
>> order to
>> > examine whether this is a possible solution or not.. :)
>> >
>> >
>> > On Sun, Jul 28, 2013 at 1:06 AM, Shawn Heisey 
>> wrote:
>> >
>> >> On 7/27/2013 3:33 PM, Isaac Hebsh wrote:
>> >> > I have about 40 shards. repFactor=2.
>> >> > The cause of slower shards is very interesting, and this is the main
>> >> > approach we took.
>> >> > Note that in every query, it is another shard which is the slowest.
>> In
>> >> 20%
>> >> > of the queries, the slowest shard takes about 4 times more than the
>> >> average
>> >> > shard qtime.
>> >> > While continuing investigation, remember it might be the
>> virtualization /
>> >> > storage-access / network / gc /..., so I thought that reducing the
>> effect
>> >> > of the slow shards might be a good (temporary or permanent) solution.
>> >>
>> >> Virtualization is not the best approach for Solr.  Assuming you're
>> >> dealing with your own hardware and not something based in the cloud
>> like
>> >> Amazon, you can get better results by running on bare metal and having
>> >> multiple shards per host.
>> >>
>> >> Garbage collection is a very likely source of this problem.
>> >>
>> >> http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
>> >>
>> >> > I thought it should be an almost trivial code change (for proving the
>> >> > concept). Isn't it?
>> >>
>> >> I have no idea what you're saying/asking here.  Can you clarify?
>> >>
>> >> It seems to me that sending requests to all replicas would just
>> increase
>> >> the overall load on the cluster, with no real benefit.
>> >>
>> >> Thanks,
>> >> Shawn
>> >>
>> >>
>>
>
>


documentCache and lazyFieldLoading

2013-08-29 Thread Isaac Hebsh
Hi,
We've investigated a memory dump, which was taken after some frequent OOM
incidents.

The main issue we found was many millions of LazyField instances,
taking ~2GB of memory, even though queries request only about 10 small
fields.

We've found that LazyDocument creates a LazyField object for every item in
a multivalued field, even if we do not want this field.

For example, documents contain a multivalued field, named "f", with a lot
of values (let's say 100 values per document). Queries set fl=id (request
only the document id). The documentCache will still blow up in memory :(

In our case, documentCache was configured to 32000. There are 2 cores per
node, so 64000 LazyDocument instances are in memory. This is a pretty big
number, and we'll reduce it.
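
For reference, this is roughly the solrconfig.xml piece in question (inside
the <query> section; the sizes here are just an example of a more modest
configuration, not a recommendation):

<documentCache class="solr.LRUCache" size="4096" initialSize="1024" autowarmCount="0"/>
<enableLazyFieldLoading>true</enableLazyFieldLoading>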


I'm curious whether this is a known issue or not, and why the
LazyDocument should know the number of values in a multivalued field which
is not requested at all.

Another thought I had: is it reasonable to add something like
"{!cache=false}" which would affect the documentCache? For example, if my
query requests "id" only, with a big rows parameter, I don't want the
documentCache to hold these big LazyDocument objects.

Did anyone else encounter this?


Re: documentCache and lazyFieldLoading

2013-08-29 Thread Isaac Hebsh
Thanks Hoss.

1. We currently use Solr 4.3.0.
2. I understand this architecture of LazyFields, but I did not understand
why multiple LazyFields should be created for a multivalued field. You
can't load only part of them: if you request the field, you will get ALL of
its values, so 100 (or more) placeholders are not necessary in this case.
Moreover, why should Solr KNOW how many values are in that unloaded field?
3. In our poor case, we might handle some concurrent queries, each one
requesting rows=2000.

What do you think about temporarily disabling the documentCache for a
specific query?




On Thu, Aug 29, 2013 at 10:11 PM, Chris Hostetter
wrote:

>
> : The main issue we found was a lot of millions of LazyField instances,
> : taking ~2GB of memory, even though queries request about 10 small fields
> : only.
>
> which version of Solr are you using?  there was a really bad bug with
> lazyFieldLoading fixed in Solr 4.2.1 (SOLR-4589)
>
> : We've found that LazyDocument creates a LazyField object for every item
> in
> : a multivalued field, even if do not want this field.
>
> right, that's exactly how lazyFieldLoading is designed to work -- instead
> of loading the full field values into ram, only a small LazyField object
> is loaded in it's place and that LazyField only fetches the underlying
> data if/when it's requested.
>
> If the LazyField instances weren't created as placeholders, subsequent
> requests for the document that *might* request additional fields (beyond
> the "10 small fields" that were requested the first time) would have no
> way of knowing if/when those additional fields existed to be able to fetch
> them from the index.
>
> : In our case, documentCache was configured to 32000. There are 2 cores per
> : node, so 64000 LazyDocument instances are in memory. This is pretty big
> : number, and we'll reduce it.
>
> FWIW: Even at 1/10 that size, that seems like a ridiculously large
> documentCache to me.
>
>
> -Hoss
>


Getting a query parameter in a TokenFilter

2013-09-17 Thread Isaac Hebsh
Hi everyone,

We developed a TokenFilter.
It should act differently, depending on a parameter supplied in the
query (for the query chain only, not the index one, of course).
We found no way to pass that parameter into the TokenFilter flow. I guess
the root cause is that TokenFilter is a pure Lucene object.

As a last resort, we tried to pass the parameter as the first term in the
query text (q=...), and save it as a member of the TokenFilter instance.

Although it is ugly, it might work fine.
But the problem is that it is not guaranteed that all the terms of a
particular query will be analyzed by the same instance of the TokenFilter. In
that case, some terms will be analyzed without the required information from
that "parameter". We can produce such a race very easily.

How should I overcome this issue?
Does anyone have a better resolution?


Re: Getting a query parameter in a TokenFilter

2013-09-21 Thread Isaac Hebsh
Thinking about that again:
we can do this work in a search component which manipulates the query string.
The cons are the double QParser work and the double tokenization work.
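
To make that concrete, this is roughly what I have in mind (a rough sketch
only; "myfilter.mode" and the rewrite() body are hypothetical placeholders
for our own logic):

import java.io.IOException;

import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class QueryRewriteComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    String mode = params.get("myfilter.mode");   // hypothetical request parameter
    String q = params.get(CommonParams.Q);
    if (mode != null && q != null) {
      ModifiableSolrParams newParams = new ModifiableSolrParams(params);
      // do here what the TokenFilter would have done, based on "mode"
      newParams.set(CommonParams.Q, rewrite(q, mode));
      rb.req.setParams(newParams);
    }
  }

  private String rewrite(String q, String mode) {
    return q;   // placeholder for the real query-string manipulation
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // nothing to do at process time
  }

  @Override
  public String getDescription() { return "query rewrite sketch"; }

  @Override
  public String getSource() { return null; }
}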

Another approach which might solve this issue easily is "Dynamic query
analyze chain": https://issues.apache.org/jira/browse/SOLR-5053

What would you do?


On Tue, Sep 17, 2013 at 10:31 PM, Isaac Hebsh  wrote:

> Hi everyone,
>
> We developed a TokenFilter.
> It should act differently, depends on a parameter supplied in the
> query (for query chain only, not the index one, of course).
> We found no way to pass that parameter into the TokenFilter flow. I guess
> that the root cause is because TokenFilter is a pure lucene object.
>
> As a last resort, we tried to pass the parameter as the first term in the
> query text (q=...), and save it as a member of the TokenFilter instance.
>
> Although it is ugly, it might work fine.
> But, the problem is that it is not guaranteed that all the terms of a
> particular query will be analyzed by the same instance of a TokenFilter. In
> this case, some terms will be analyzed without the required information of
> that "parameter". We can produce such a race very easily.
>
> How should I overcome this issue?
> Do anyone have a better resolution?
>


Considerations about setting maxMergedSegmentMB

2013-09-30 Thread Isaac Hebsh
Hi,
Trying to solve a query performance issue, we suspect the number of index
segments, which might slow the query (due to I/O seeks, which happen for each
term in the query, multiplied by the number of segments).
We are on Solr 4.3 (TieredMergePolicy with a mergeFactor of 4).

We can reduce the number of segments by enlarging maxMergedSegmentMB, from
the default 5GB to something bigger (10GB, 15GB?).
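
Concretely, the change would be something like this in the <indexConfig>
section of solrconfig.xml (10240 is just an example value):

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">4</int>
  <int name="segmentsPerTier">4</int>
  <double name="maxMergedSegmentMB">10240</double>
</mergePolicy>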

What side effects should be considered when doing this?
Has anyone changed this setting in PROD for a while?


Re: Data duplication using Cloud+HDFS+Mirroring

2013-09-30 Thread Isaac Hebsh
Hi Greg, Did you get an answer?
I'm interested in the same question.

More generally, what are the benefits of HdfsDirectoryFactory, besides the
transparent restore of the shard contents in case of a disk failure, and
the ability to rebuild the index using MapReduce?
Is the following statement accurate? Blocks of a particular shard which are
replicated to another node will never be queried, since there is no Solr
core configured to read them.


On Wed, Aug 7, 2013 at 8:46 PM, Greg Walters
wrote:

> While testing Solr's new ability to store data and transaction directories
> in HDFS I added an additional core to one of my testing servers that was
> configured as a backup (active but not leader) core for a shard elsewhere.
> It looks like this extra core copies the data into its own directory rather
> than just using the existing directory with the data that's already
> available to it.
>
> Since HDFS likely already has redundancy of the data covered via the
> replicationFactor is there a reason for non-leader cores to create their
> own data directory rather than doing reads on the existing master copy? I
> searched Jira for anything that suggests this behavior might change and
> didn't find any issues; is there any intent to address this?
>
> Thanks,
> Greg
>


Re: Profiling Solr Lucene for query

2013-10-01 Thread Isaac Hebsh
Hi Dmitry,

I'm trying to examine your suggestion to create a frontend node. It sounds
pretty useful.
I saw that every node in a Solr cluster can serve requests for any collection,
even if it does not hold a core of that collection. Because of that, I
thought that adding a new node to the cluster (aka the frontend/gateway
server) and creating a dummy collection (with 1 dummy core) would solve
the problem.

But I see that a request sent to the gateway node is not then forwarded
directly to the shards. Instead, the request is proxied to a (random) core of
the requested collection, and from there it is sent to the shards. (It is
reasonable, because the SolrCore on the gateway might run with a different
configuration, etc.) This means that my new node isn't functioning as a
frontend (which is responsible for merging, sorting, etc.), but as a poor
load balancer. No performance improvement will come from this implementation.

So, how do you suggest implementing a frontend? On the one hand, it has to
run a core of the target collection, but on the other hand, we don't want
it to hold any shard contents.


On Fri, Sep 13, 2013 at 1:08 PM, Dmitry Kan  wrote:

> Manuel,
>
> Whether to have the front end solr as aggregator of shard results depends
> on your requirements. To repeat, we found merging from many shards very
> inefficient fo our use case. It can be the opposite for you (i.e. requires
> testing). There are some limitations with distributed search, see here:
>
> http://docs.lucidworks.com/display/solr/Distributed+Search+with+Index+Sharding
>
>
> On Wed, Sep 11, 2013 at 3:35 PM, Manuel Le Normand <
> manuel.lenorm...@gmail.com> wrote:
>
> > Dmitry - currently we don't have such a front end, this sounds like a
> good
> > idea creating it. And yes, we do query all 36 shards every query.
> >
> > Mikhail - I do think 1 minute is enough data, as during this exact
> minute I
> > had a single query running (that took a qtime of 1 minute). I wanted to
> > isolate these hard queries. I repeated this profiling few times.
> >
> > I think I will take the termInterval from 128 to 32 and check the
> results.
> > I'm currently using NRTCachingDirectoryFactory
> >
> >
> >
> >
> > On Mon, Sep 9, 2013 at 11:29 PM, Dmitry Kan 
> wrote:
> >
> > > Hi Manuel,
> > >
> > > The frontend solr instance is the one that does not have its own index
> > and
> > > is doing merging of the results. Is this the case? If yes, are all 36
> > > shards always queried?
> > >
> > > Dmitry
> > >
> > >
> > > On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand <
> > > manuel.lenorm...@gmail.com> wrote:
> > >
> > > > Hi Dmitry,
> > > >
> > > > I have solr 4.3 and every query is distributed and merged back for
> > > ranking
> > > > purpose.
> > > >
> > > > What do you mean by frontend solr?
> > > >
> > > >
> > > > On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan 
> > wrote:
> > > >
> > > > > are you querying your shards via a frontend solr? We have noticed,
> > that
> > > > > querying becomes much faster if results merging can be avoided.
> > > > >
> > > > > Dmitry
> > > > >
> > > > >
> > > > > On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand <
> > > > > manuel.lenorm...@gmail.com> wrote:
> > > > >
> > > > > > Hello all
> > > > > > Looking on the 10% slowest queries, I get very bad performances
> > (~60
> > > > sec
> > > > > > per query).
> > > > > > These queries have lots of conditions on my main field (more
> than a
> > > > > > hundred), including phrase queries and rows=1000. I do return
> only
> > > id's
> > > > > > though.
> > > > > > I can quite firmly say that this bad performance is due to slow
> > > storage
> > > > > > issue (that are beyond my control for now). Despite this I want
> to
> > > > > improve
> > > > > > my performances.
> > > > > >
> > > > > > As tought in school, I started profiling these queries and the
> data
> > > of
> > > > ~1
> > > > > > minute profile is located here:
> > > > > >
> > http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg
> > > > > >
> > > > > > Main observation: most of the time I do wait for readVInt, who's
> > > > > stacktrace
> > > > > > (2 out of 2 thread dumps) is:
> > > > > >
> > > > > > catalina-exec-3870 - Thread t@6615
> > > > > >  java.lang.Thread.State: RUNNABLE
> > > > > >  at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
> > > > > >  at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
> > > > > >  at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
> > > > > >  at org.apache.lucene.index.TermContext.build(TermContext.java:95)
> > > > > >  at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
> > > > > >  at org.apache.lucene.search.PhraseQuery

Re: Profiling Solr Lucene for query

2013-10-01 Thread Isaac Hebsh
Hi Shawn,
I know that every node operates as a frontend. This is the way our cluster
currently runs.

If I separate the frontend from the nodes which hold the shards, I can give
it a different amount of CPUs and RAM (e.g. a large amount of RAM for the
JVM, because this server won't need the OS cache for reading the index, or
more CPUs because the merging process might be more CPU intensive).

Isn't that possible?


On Wed, Oct 2, 2013 at 12:42 AM, Shawn Heisey  wrote:

> On 10/1/2013 2:35 PM, Isaac Hebsh wrote:
>
>> Hi Dmitry,
>>
>> I'm trying to examine your suggestion to create a frontend node. It sounds
>> pretty usefull.
>> I saw that every node in solr cluster can serve request for any
>> collection,
>> even if it does not hold a core of that collection. because of that, I
>> thought that adding a new node to the cluster (aka, the frontend/gateway
>> server), and creating a dummy collection (with 1 dummy core), will solve
>> the problem.
>>
>> But, I see that a request which sent to the gateway node, is not then sent
>> to the shards. Instead, the request is proxyed to a (random) core of the
>> requested collection, and from there it is sent to the shards. (It is
>> reasonable, because the SolrCore on the gateway might run with different
>> configuration, etc). This means that my new node isn't functioning as a
>> frontend (which responsible for sorting, etc.), but as a poor load
>> balancer. No performance improvement will come from this implementation.
>>
>> So, how do you suggest to implement a frontend? On the one hand, it has to
>> run a core of the target collection, but on the other hand, we don't want
>> it to hold any shard contents.
>>
>
> With SolrCloud, every node is a frontend node.  If you're running
> SolrCloud, then it doesn't make sense to try and use that concept.
>
> It only makes sense to create a frontend node (or core) if you are using
> traditional distributed search, where you need to include a shards
> parameter.
>
> http://wiki.apache.org/solr/**DistributedSearch<http://wiki.apache.org/solr/DistributedSearch>
>
> Thanks,
> Shawn
>
>


Re: Basic auth on SolrCloud /admin/* calls

2013-03-29 Thread Isaac Hebsh
Hi Tim,
Are you running Solr 4.2? (In 4.0 and 4.1, the Collections API didn't
return any failure message; see the SOLR-4043 issue.)

As far as I know, you can't tell Solr to use authentication credentials
when communicating with other nodes. It's a bigger issue.. for example, if
you want to protect the "/update" requestHandler, so unauthorized users
won't delete your whole collection, it can interfere with the replication
process.

I think it's a necessary mechanism in a production environment... I'm
curious how people use SolrCloud in production without it.





On Fri, Mar 29, 2013 at 3:42 AM, Vaillancourt, Tim wrote:

> Hey guys,
>
> I've recently setup basic auth under Jetty 8 for all my Solr 4.x
> '/admin/*' calls, in order to protect my Collections and Cores API.
>
> Although the security constraint is working as expected ('/admin/*' calls
> require Basic Auth or return 401), when I use the Collections API to create
> a collection, I receive a 200 OK to the Collections API CREATE call, but
> the background Cores API calls that are ran on the Collection API's behalf
> fail on the Basic Auth on other nodes with a 401 code, as I should have
> foreseen, but didn't.
>
> Is there a way to tell SolrCloud to use authentication on internal Cores
> API calls that are spawned on Collections API's behalf, or is this a new
> feature request?
>
> To reproduce:
>
> 1.   Implement basic auth on '/admin/*' URIs.
>
> 2.   Perform a CREATE Collections API call to a node (which will
> return 200 OK).
>
> 3.   Notice all Cores API calls fail (Collection isn't created). See
> stack trace below from the node that was issued the CREATE call.
>
> The stack trace I get is:
>
> "org.apache.solr.common.SolrException: Server at http:// HERE>:8983/solr returned non ok
> status:401, message:Unauthorized
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> at
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:169)
> at
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:135)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)"
>
> Cheers!
>
> Tim
>
>
>


Re: Combining Solr Indexes at SolrCloud

2013-03-29 Thread Isaac Hebsh
Let's say you have machine A and machine B, and you want to shut down B.
If all the shards on B have replicas (on A), you can shut down B instantly.
If there is a shard on B that has no replica, you should create one on
machine A (using the Core API), let it replicate the whole shard contents,
and then you are safe to shut down B.
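
With the CoreAdmin API that would be something like (host and names here are
placeholders; adjust collection/shard/core names to your setup):

http://machineA:8983/solr/admin/cores?action=CREATE&name=collection1_shard2_replica2&collection=collection1&shard=shard2

The new core registers itself as a replica of that shard and syncs the index
from the shard leader.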

[Changing the shard count of an existing collection is not possible for
now, so MERGing cores is not relevant.]


On Fri, Mar 29, 2013 at 11:23 AM, Furkan KAMACI wrote:

> Let's assume that I have two machine in a SolrCloud that works as a part of
> cloud. If I want to shutdown one of them an combine its indexes into other
> how can I do that?
>


SurroundQParser does not analyze the query text

2013-05-16 Thread Isaac Hebsh
Hi,

I'm trying to use Surround Query Parser for two reasons, which are not
covered by proximity slops:
1. find documents with two words within a given distance, *unordered*
2. given two lists of words, find documents with (at least) one word from
list A and (at least) one word from list B, within a given distance.
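
As far as I understand the surround syntax (N is the unordered proximity
operator, W the ordered one; the terms here are placeholders and must already
be lowercased, since the parser passes them through as-is), the two
requirements would look something like:

q={!surround}3n(word1, word2)
q={!surround}5n(or(a1, a2), or(b1, b2))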

The surround query parser looks great, but it has one big drawback: it
does not analyze the query text. This is documented in the [weak :(] wiki
page.

Can this issue be solved somehow, or it is a bigger constraint?
Should I open a JIRA issue for this?
Any work-around?


Re: SurroundQParser does not analyze the query text

2013-05-17 Thread Isaac Hebsh
Thank you Erik and Jack.

I opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-4834
I wish a will have time to sumbit a patch file soon.


On Fri, May 17, 2013 at 7:38 AM, Jack Krupansky wrote:

> (Erik: Or he can get the LucidWorks Search product and then use "near" and
> "before" operators so that he doesn't need the surround query parser!)
>
> -- Jack Krupansky
>
> -Original Message- From: Erik Hatcher
> Sent: Thursday, May 16, 2013 6:11 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SurroundQParser does not analyze the query text
>
>
> The issue can certainly be "solved".  But to me, it's actually a bit of a
> "feature" by design for the Lucene-level surround query parser to not do
> analysis, as it seems to have been meant for advanced query writers to
> piece together sophisticated SpanQuery-based pattern matching kinds of
> things utilizing their knowledge of how text was analyzed and indexed.
>
> But for sure it could be modified to do analysis, probably using the
> "multiterm" analyzer feature in there now elsewhere now.  I looked into
> this when I did the basic work of integrating the surround query parser,
> and determined it was a lot of work because it'd need changes in the Lucene
> level code to leverage analysis, and then glue at the Solr level to be
> field type aware and savvy.
>
> By all means open and JIRA and contribute!
>
> Workaround?  Client-side calls can be made to analyze text, and the
> client-side could build up a query expression based on term-by-term (or
> phrase) analysis results.  Maybe that means a prohibitive number of
> requests to Solr to build up a query in a way that leverages Solr's field
> type analysis settings, but it is a technologically possible technique
> maybe worth considering.
>
> Erik
>
>
>
> On May 16, 2013, at 16:38 , Isaac Hebsh wrote:
>
>  Hi,
>>
>> I'm trying to use Surround Query Parser for two reasons, which are not
>> covered by proximity slops:
>> 1. find documents with two words within a given distance, *unordered*
>> 2. given two lists of words, find documents with (at least) one word from
>> list A and (at least) one word from list B, within a given distance.
>>
>> The surround query parser looks great, but it have one big drawback - It
>> does not analyze the query text. It is documented in the [weak :(] wiki
>> page.
>>
>> Can this issue be solved somehow, or it is a bigger constraint?
>> Should I open a JIRA issue for this?
>> Any work-around?
>>
>
>


Bloom Filters

2013-05-17 Thread Isaac Hebsh
Hi everyone..

I'm indexing docs into Solr using the update request handler, by POSTing
data to the REST endpoint (not SolrJ, not DIH).
My indexer should return an indication of whether the document existed in
the collection before or not, based on its ID.

The obvious solution is to perform a query before trying to index the
document. Do I have any better choice?
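
(By "a query" I mean something as cheap as possible, along these lines, where
DOC123 stands for the incoming document's ID:

http://localhost:8983/solr/collection1/select?q=id:DOC123&fl=id&rows=0

and then checking whether numFound is greater than zero.)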

If the query approach is chosen, I thought that Bloom filters might make
this lookup very efficient. After searching the wiki and JIRA, I found this:
http://wiki.apache.org/solr/BloomIndexComponent

The related JIRA issue is very old and never got resolved. What effort
would be needed to get it resolved?


Prevention of heavy wildcard queries

2013-05-27 Thread Isaac Hebsh
Hi.

Searching for terms with a wildcard at their start is solved with
ReversedWildcardFilterFactory. But what about terms with a wildcard at both
start AND end?

Such a query is heavy, and I want to disallow these queries for my users.

I'm looking for a way to cause these queries to fail.
I guess there is no built-in support for my need, so it is OK to write a
new solution.

My current plan is to create a search component (which will run before
QueryComponent). It should analyze the query string, and reject the query
if "too heavy" wildcards are found.

Another option is to create a query parser, which wraps the current
(specified or default) qparser, and does the same work as above.

These two options require an analysis of the query text, which might be
ugly work (just think about nested queries [using _query_], or even a lot
of more basic scenarios like quoted terms, etc.)

Am I missing a simple and clean way to do this?
What would you do?

P.S. if no simple solution exists, the timeAllowed limit is the best
work-around I can think of. Any other suggestions?
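
To make the first option concrete, this is the kind of component I have in mind (a rough
sketch; the regex is deliberately naive, is only my assumption of what "too heavy" means,
and does not handle nested queries or quoted terms):

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class WildcardGuardComponent extends SearchComponent {

    // naive: a whitespace-separated token that both starts and ends with * or ?
    private static final Pattern DOUBLE_WILDCARD =
            Pattern.compile("(^|\\s)[*?]\\S*[*?](\\s|$)");

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        String q = rb.req.getParams().get(CommonParams.Q, "");
        if (DOUBLE_WILDCARD.matcher(q).find()) {
            throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                    "Terms with both leading AND trailing wildcards are not allowed");
        }
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // nothing to do at process time; the check happens in prepare()
    }

    @Override
    public String getDescription() {
        return "Rejects double-wildcard terms";
    }

    @Override
    public String getSource() {
        return null;
    }
}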


Re: Prevention of heavy wildcard queries

2013-05-27 Thread Isaac Hebsh
Thanks Roman.
Based on some of your suggestions, will the steps below do the job?

* Create (and register) a new SearchComponent
* In its prepare method: do this for Q and all of the FQs (so this
SearchComponent should run AFTER QueryComponent, in order to see all of the
FQs)
* Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
with a special implementation of QueryNodeProcessorPipeline, which contains
my NodeProcessor in the top of its list.
* Set my analyzer into that StandardQueryParser
* My NodeProcessor will be called for each term in the query, so it can
throw an exception if a (basic) querynode contains wildcard in both start
and end of the term.

Is there a way to avoid reimplementing the whole StandardQueryParser
class?
Will this work for both LuceneQParser and EdismaxQParser queries?

Any other solution/work-around? How do other production environments of
Solr overcome this issue?


On Mon, May 27, 2013 at 10:15 PM, Roman Chyla  wrote:

> You are right that starting to parse the query before the query component
> can get soon very ugly and complicated. You should take advantage of the
> flex parser, it is already in lucene contrib - but if you are interested in
> the better version, look at
> https://issues.apache.org/jira/browse/LUCENE-5014
>
> The way you can solve this is:
>
> 1. use the standard syntax grammar (which allows *foo*)
> 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or
> raise error etc
>
> this way, you are changing semantics - but don't need to touch the syntax
> definition; of course, you may also change the grammar and allow only one
> instance of wildcard (or some combination) but for that you should probably
> use LUCENE-5014
>
> roman
>
> On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh 
> wrote:
>
> > Hi.
> >
> > Searching terms with wildcard in their start, is solved with
> > ReversedWildcardFilterFactory. But, what about terms with wildcard in
> both
> > start AND end?
> >
> > This query is heavy, and I want to disallow such queries from my users.
> >
> > I'm looking for a way to cause these queries to fail.
> > I guess there is no built-in support for my need, so it is OK to write a
> > new solution.
> >
> > My current plan is to create a search component (which will run before
> > QueryComponent). It should analyze the query string, and to drop the
> query
> > if "too heavy" wildcard are found.
> >
> > Another option is to create a query parser, which wraps the current
> > (specified or default) qparser, and does the same work as above.
> >
> > These two options require an analysis of the query text, which might be
> an
> > ugly work (just think about nested queries [using _query_], OR even a lot
> > of more basic scenarios like quoted terms, etc.)
> >
> > Am I missing a simple and clean way to do this?
> > What would you do?
> >
> > P.S. if no simple solution exists, timeAllowed limit is the best
> > work-around I could think about. Any other suggestions?
> >
>


Re: Prevention of heavy wildcard queries

2013-05-27 Thread Isaac Hebsh
I don't want to affect the (correctness of the) real query parsing, so
creating a QParserPlugin is risky.
Instead, if I parse the query in my search component, it will be
detached from the real query parsing (obviously this causes double
parsing, but assume that's OK)...


On Tue, May 28, 2013 at 3:52 AM, Roman Chyla  wrote:

> Hi Issac,
> it is as you say, with the exception that you create a QParserPlugin, not a
> search component
>
> * create QParserPlugin, give it some name, eg. 'nw'
> * make a copy of the pipeline - your component should be at the same place,
> or just above, the wildcard processor
>
> also make sure you are setting your qparser for FQ queries, ie.
> fq="{!nw}foo"
>
>
> On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh 
> wrote:
>
> > Thanks Roman.
> > Based on some of your suggestions, will the steps below do the work?
> >
> > * Create (and register) a new SearchComponent
> > * In its prepare method: Do for Q and all of the FQs (so this
> > SearchComponent should run AFTER QueryComponent, in order to see all of
> the
> > FQs)
> > * Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
> > with a special implementation of QueryNodeProcessorPipeline, which
> contains
> > my NodeProcessor in the top of its list.
> > * Set my analyzer into that StandardQueryParser
> > * My NodeProcessor will be called for each term in the query, so it can
> > throw an exception if a (basic) querynode contains wildcard in both start
> > and end of the term.
> >
> > Do I have a way to avoid from reimplementing the whole
> StandardQueryParser
> > class?
> >
>
> you can try subclassing it, if it allows it
>
>
> > Will this work for both LuceneQParser and EdismaxQParser queries?
> >
>
> this will not work for edismax, nothing but changing the edismax qparser
> will do the trick
>
>
> >
> > Any other solution/work-around? How do other production environments of
> > Solr overcome this issue?
> >
>
> you can also try modifying the standard solr parser, or even the JavaCC
> generated classes
> I believe many people do just that (or some sort of preprocessing)
>
> roman
>
>
> >
> >
> > On Mon, May 27, 2013 at 10:15 PM, Roman Chyla 
> > wrote:
> >
> > > You are right that starting to parse the query before the query
> component
> > > can get soon very ugly and complicated. You should take advantage of
> the
> > > flex parser, it is already in lucene contrib - but if you are
> interested
> > in
> > > the better version, look at
> > > https://issues.apache.org/jira/browse/LUCENE-5014
> > >
> > > The way you can solve this is:
> > >
> > > 1. use the standard syntax grammar (which allows *foo*)
> > > 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case,
> or
> > > raise error etc
> > >
> > > this way, you are changing semantics - but don't need to touch the
> syntax
> > > definition; of course, you may also change the grammar and allow only
> one
> > > instance of wildcard (or some combination) but for that you should
> > probably
> > > use LUCENE-5014
> > >
> > > roman
> > >
> > > On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh 
> > > wrote:
> > >
> > > > Hi.
> > > >
> > > > Searching terms with wildcard in their start, is solved with
> > > > ReversedWildcardFilterFactory. But, what about terms with wildcard in
> > > both
> > > > start AND end?
> > > >
> > > > This query is heavy, and I want to disallow such queries from my
> users.
> > > >
> > > > I'm looking for a way to cause these queries to fail.
> > > > I guess there is no built-in support for my need, so it is OK to
> write
> > a
> > > > new solution.
> > > >
> > > > My current plan is to create a search component (which will run
> before
> > > > QueryComponent). It should analyze the query string, and to drop the
> > > query
> > > > if "too heavy" wildcard are found.
> > > >
> > > > Another option is to create a query parser, which wraps the current
> > > > (specified or default) qparser, and does the same work as above.
> > > >
> > > > These two options require an analysis of the query text, which might
> be
> > > an
> > > > ugly work (just think about nested queries [using _query_], OR even a
> > lot
> > > > of more basic scenarios like quoted terms, etc.)
> > > >
> > > > Am I missing a simple and clean way to do this?
> > > > What would you do?
> > > >
> > > > P.S. if no simple solution exists, timeAllowed limit is the best
> > > > work-around I could think about. Any other suggestions?
> > > >
> > >
> >
>


Re: Prevention of heavy wildcard queries

2013-06-02 Thread Isaac Hebsh
Hi everyone.

I came across another need for term extraction: I want to find pairs of
words that appear in queries together. All of the "clustering" work is
ready, and the only missing piece is how to get the basic terms from the query.

Has nobody tried this before? Is there really no clean way to do it?


On Tue, May 28, 2013 at 7:08 AM, Isaac Hebsh  wrote:

> I don't want to affect on the (correctness of the) real query parsing, so
> creating a QParserPlugin is risky.
> Instead, If I'll parse the query in my search component, it will be
> detached from the real query parsing, (obviously this causes double
> parsing, but assume it's OK)...
>
>
> On Tue, May 28, 2013 at 3:52 AM, Roman Chyla wrote:
>
>> Hi Issac,
>> it is as you say, with the exception that you create a QParserPlugin, not
>> a
>> search component
>>
>> * create QParserPlugin, give it some name, eg. 'nw'
>> * make a copy of the pipeline - your component should be at the same
>> place,
>> or just above, the wildcard processor
>>
>> also make sure you are setting your qparser for FQ queries, ie.
>> fq="{!nw}foo"
>>
>>
>> On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh 
>> wrote:
>>
>> > Thanks Roman.
>> > Based on some of your suggestions, will the steps below do the work?
>> >
>> > * Create (and register) a new SearchComponent
>> > * In its prepare method: Do for Q and all of the FQs (so this
>> > SearchComponent should run AFTER QueryComponent, in order to see all of
>> the
>> > FQs)
>> > * Create
>> org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
>> > with a special implementation of QueryNodeProcessorPipeline, which
>> contains
>> > my NodeProcessor in the top of its list.
>> > * Set my analyzer into that StandardQueryParser
>> > * My NodeProcessor will be called for each term in the query, so it can
>> > throw an exception if a (basic) querynode contains wildcard in both
>> start
>> > and end of the term.
>> >
>> > Do I have a way to avoid from reimplementing the whole
>> StandardQueryParser
>> > class?
>> >
>>
>> you can try subclassing it, if it allows it
>>
>>
>> > Will this work for both LuceneQParser and EdismaxQParser queries?
>> >
>>
>> this will not work for edismax, nothing but changing the edismax qparser
>> will do the trick
>>
>>
>> >
>> > Any other solution/work-around? How do other production environments of
>> > Solr overcome this issue?
>> >
>>
>> you can also try modifying the standard solr parser, or even the JavaCC
>> generated classes
>> I believe many people do just that (or some sort of preprocessing)
>>
>> roman
>>
>>
>> >
>> >
>> > On Mon, May 27, 2013 at 10:15 PM, Roman Chyla 
>> > wrote:
>> >
>> > > You are right that starting to parse the query before the query
>> component
>> > > can get soon very ugly and complicated. You should take advantage of
>> the
>> > > flex parser, it is already in lucene contrib - but if you are
>> interested
>> > in
>> > > the better version, look at
>> > > https://issues.apache.org/jira/browse/LUCENE-5014
>> > >
>> > > The way you can solve this is:
>> > >
>> > > 1. use the standard syntax grammar (which allows *foo*)
>> > > 2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case,
>> or
>> > > raise error etc
>> > >
>> > > this way, you are changing semantics - but don't need to touch the
>> syntax
>> > > definition; of course, you may also change the grammar and allow only
>> one
>> > > instance of wildcard (or some combination) but for that you should
>> > probably
>> > > use LUCENE-5014
>> > >
>> > > roman
>> > >
>> > > On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh 
>> > > wrote:
>> > >
>> > > > Hi.
>> > > >
>> > > > Searching terms with wildcard in their start, is solved with
>> > > > ReversedWildcardFilterFactory. But, what about terms with wildcard
>> in
>> > > both
>> > > > start AND end?
>> > > >
>> > > > This query is heavy, and I want to disallow such queries from my
>> users.
>> > > >
>> > > > I'm looking for a way to cause these queries to fail.
>> > > > I guess there is no built-in support for my need, so it is OK to
>> write
>> > a
>> > > > new solution.
>> > > >
>> > > > My current plan is to create a search component (which will run
>> before
>> > > > QueryComponent). It should analyze the query string, and to drop the
>> > > query
>> > > > if "too heavy" wildcard are found.
>> > > >
>> > > > Another option is to create a query parser, which wraps the current
>> > > > (specified or default) qparser, and does the same work as above.
>> > > >
>> > > > These two options require an analysis of the query text, which
>> might be
>> > > an
>> > > > ugly work (just think about nested queries [using _query_], OR even
>> a
>> > lot
>> > > > of more basic scenarios like quoted terms, etc.)
>> > > >
>> > > > Am I missing a simple and clean way to do this?
>> > > > What would you do?
>> > > >
>> > > > P.S. if no simple solution exists, timeAllowed limit is the best
>> > > > work-around I could think about. Any other suggestions?
>> > > >
>> > >
>> >
>>
>
>


OutOfMemory while indexing (PROD environment!)

2013-06-06 Thread Isaac Hebsh
Hi everyone,

My SolrCloud cluster (4.3.0) went into production a few days ago.
Docs are being indexed into Solr using the "/update" requestHandler, as POST
requests with a text/xml content-type.

The collection is sharded into 36 pieces, each shard has two replicas.
There are 36 nodes (each node on separate virtual machine), so each node
holds exactly 2 cores.

Each update request contains 100 docs, which means 2-3 docs for each shard.
There are 1-2 such requests every minute. Soft-commit happens every 10
minutes, Hard-commit every 30 minutes, and ramBufferSizeMB=128.

After 48 hours of zero problems, suddenly one shard went down (both of its
cores). The log says it's an OOM ("GC overhead limit exceeded"). The JVM is set to
Xmx=4G.
I'm pretty sure that a few minutes before this incident, JVM memory usage wasn't
that high (even the max memory usage indicator was below 2G).

Indexing requests did not stop, and started getting HTTP 503 errors ("no
server hosting shard"). At this point, some other cores started to go down
(I had all of the rainbow colors: Active, Recovering, Down, Recovery Failed
and Gone :).

Then I tried to restart Tomcat on the down nodes, but some of them failed
to start, due to the error message: "we are not the leader". Only shutting
down both cores and starting them gradually solved the problem,
and the whole cluster came back to a green state.

Solr is not yet exposed to users, so no queries were made at that time
(but maybe some non-heavy auto-warm queries were executed).

I don't think that all of the 4GB were being used for justifiable reasons...
I guess that adding more RAM will not solve the problem in the long term.

Where should I start my log investigation? (about the OOM itself, and about
the chain of failures that came after it)

I did a search for previous similar issues. There are a lot, but most of
them talk about very old versions of Solr.

[Versions:
Solr: 4.3.0
Tomcat 7
JVM: Oracle 7 (last, standard, JRE), 64bit.
OS: RedHat 6.3]


Wildcards and Phrase queries

2013-06-19 Thread Isaac Hebsh
Hi,

I'm trying to understand the status of enabling wildcards in phrase
queries.

Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486
Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604

It looks like these issues are not going to be solved in the near future
:( Will they? Did they come to a (partial) dead end with the current
approach? Can I contribute anything to get them fixed in an official
version?

Are the latest patches attached to the JIRAs production ready?

[Should this message be sent to the java-user list?]


Re: Wildcards and Phrase queries

2013-06-22 Thread Isaac Hebsh
Thanks Erick.
Maybe lucene (java-user) is a better mailing list to ask in?


On Sat, Jun 22, 2013 at 7:30 AM, Erick Erickson wrote:

> Wouldn't imagine they're production ready, they haven't been touched
> in months.
>
> So I'd say you're on your own here in terms of whether you wanted
> to use these for production.
>
> I confess I don't know what state they were left in or why they were
> never committed.
>
> FWIW,
> Erick
>
> On Wed, Jun 19, 2013 at 10:08 AM, Isaac Hebsh 
> wrote:
> > Hi,
> >
> > I'm trying to understand what is the status of enabling wildcards on
> phrase
> > queries?
> >
> > Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486
> > Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604
> >
> > It looks like these issues are not going to be solved in the close future
> > :( Will they? Did they came into a (partially) dead-end, in the current
> > approach. Can I contribute anything to make them fixed into an official
> > version?
> >
> > Does the lastest patches which attached to rthe JIRAs are production
> ready?
> >
> > [Should this message be sent to java-user list?]
>


Re: Wildcards and Phrase queries

2013-06-23 Thread Isaac Hebsh
Ahmet, it looks great!

Can you tell us why this code hasn't been committed into the lucene+solr trunk?


On Sun, Jun 23, 2013 at 2:28 PM, Ahmet Arslan  wrote:

> Hi Isaac,
>
> ComplexPhrase-4.2.1.zip should work with solr4.2.1. Zipball contains a
> ReadMe.txt file about instructions.
>
>
> You could try with higher solr versions too. If it does not work, please
> lets us know.
>
>
> https://issues.apache.org/jira/secure/attachment/12579832/ComplexPhrase-4.2.1.zip
>
>
>
> ____
>  From: Isaac Hebsh 
> To: solr-user@lucene.apache.org
> Sent: Saturday, June 22, 2013 9:33 PM
> Subject: Re: Wildcards and Phrase queries
>
>
> Thanks Erick.
> Maybe lucene (java-user) is a better mailing list to ask in?
>
>
> On Sat, Jun 22, 2013 at 7:30 AM, Erick Erickson  >wrote:
>
> > Wouldn't imagine they're production ready, they haven't been touched
> > in months.
> >
> > So I'd say you're on your own here in terms of whether you wanted
> > to use these for production.
> >
> > I confess I don't know what state they were left in or why they were
> > never committed.
> >
> > FWIW,
> > Erick
> >
> > On Wed, Jun 19, 2013 at 10:08 AM, Isaac Hebsh 
> > wrote:
> > > Hi,
> > >
> > > I'm trying to understand what is the status of enabling wildcards on
> > phrase
> > > queries?
> > >
> > > Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486
> > > Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604
> > >
> > > It looks like these issues are not going to be solved in the close
> future
> > > :( Will they? Did they came into a (partially) dead-end, in the current
> > > approach. Can I contribute anything to make them fixed into an official
> > > version?
> > >
> > > Does the lastest patches which attached to rthe JIRAs are production
> > ready?
> > >
> > > [Should this message be sent to java-user list?]
> >
>


Re: Solr cache considerations

2013-01-17 Thread Isaac Hebsh
Unfortunately, it seems (
http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that
these caches are not per-segment. In this case, I want to (soft) commit
less frequently. Am I right?

Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I
guess it contributes a lot to the time of standard (not only faceted) queries.
The SolrWiki claims that it is primarily used by faceting. What does that say
about complex textual queries?

documentCache:
Erick, after query processing is finished, don't some documents stay in
the documentCache? Can't I use it to accelerate queries that
retrieve stored fields of documents? In this case, a big documentCache can
hold more documents...

About commit frequency:
HardCommit: "openSearch=false" seems as a nice solution. Where can I read
about this? (found nothing but one unexplained sentence in SolrWiki).
SoftCommit: In my case, the required index freshness is 10 minutes. The
plan to soft commit every 10 minutes is similar to storing all of the
documents in a queue (outside to Solr), an indexing a bulk every 10 minutes.

Thanks.
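
For completeness, this is how I understand the two commit flavours from the SolrJ side
(a sketch; the three-argument commit with a softCommit flag, and the openSearcher request
parameter, are my reading of the 4.x API and wiki):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CommitKinds {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // soft commit: opens a new searcher, nothing is fsync'ed to disk
        // signature: commit(waitFlush, waitSearcher, softCommit)
        solr.commit(true, true, true);

        // hard commit: flushes segments and rolls the transaction log, also opens a searcher
        solr.commit(true, true);

        // a hard commit that does NOT open a searcher is (to my knowledge) only reachable
        // via the request parameter, e.g.
        //   /update?commit=true&openSearcher=false
        // or via the autoCommit/openSearcher setting in solrconfig.xml
    }
}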


On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe <
tomasflo...@gmail.com> wrote:

> I think fieldValueCache is not per segment, only fieldCache is. However,
> unless I'm missing something, this cache is only used for faceting on
> multivalued fields
>
>
> On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson  >wrote:
>
> > filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in
> > cache). Notice the /8. This reflects the fact that the filters are
> > represented by a bitset on the _internal_ Lucene ID. UniqueId has no
> > bearing here whatsoever. This is, in a nutshell, why warming is
> > required, the internal Lucene IDs may change. Note also that it's
> > maxDoc, the internal arrays have "holes" for deleted documents.
> >
> > Note this is an _upper_ bound, if there are only a few docs that
> > match, the size will be (num of matching docs) * sizeof(int)).
> >
> > fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
> > It depends on whether these are "per-segment" caches or not. Any "per
> > segment" cache is still valid.
> >
> > Think of documentCache as intended to hold the stored fields while
> > various components operate on it, thus avoiding repeatedly fetching
> > the data from disk. It's _usually_ not too big a worry.
> >
> > About hard-commits once a day. That's _extremely_ long. Think instead
> > of committing more frequently with openSearcher=false. If nothing
> > else, you transaction log will grow lots and lots and lots. I'm
> > thinking on the order of 15 minutes, or possibly even much less. With
> > softCommits happening more often, maybe every 15 seconds. In fact, I'd
> > start out with soft commits every 15 seconds and hard commits
> > (openSearcher=false) every 5 minutes. The problem with hard commits
> > being once a day is that, if for any reason the server is interrupted,
> > on startup Solr will try to replay the entire transaction log to
> > assure index integrity. Not to mention that your tlog will be huge.
> > Not to mention that there is some memory usage for each document in
> > the tlog. Hard commits roll over the tlog, flush the in-memory tlog
> > pointers, close index segments, etc.
> >
> > Best
> > Erick
> >
> > On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh 
> > wrote:
> > > Hi,
> > >
> > > I am going to build a big Solr (4.0?) index, which holds some dozens of
> > > millions of documents. Each document has some dozens of fields, and one
> > big
> > > textual field.
> > > The queries on the index are non-trivial, and a little-bit long (might
> be
> > > hundreds of terms). No query is identical to another.
> > >
> > > Now, I want to analyze the cache performance (before setting up the
> whole
> > > environment), in order to estimate how much RAM will I need.
> > >
> > > filterCache:
> > > In my scenariom, every query has some filters. let's say that each
> filter
> > > matches 1M documents, out of 10M. Does the estimated memory usage
> should
> > be
> > > 1M * sizeof(uniqueId) * num-of-filters-in-cache?
> > >
> > > fieldValueCache:
> > > Due to the difference between queries, I guess that fieldValueCache is
> > the
> > > most important factor on query performance. Here comes a generic
> > question:
> > > I'm indexing new documents to the index constantly. Soft commits will
> be
> > > performe

Re: Solr cache considerations

2013-01-19 Thread Isaac Hebsh
Ok. Thank you everyone for your helpful answers.
I understand that fieldValueCache is not used for resolving queries.
Is there any cache that can help this basic scenario (a lot of different
queries, on a small set of fields)?
Does Lucene's FieldCache help (implicitly)?
How can I use RAM to reduce I/O for this type of query?


On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe <
tomasflo...@gmail.com> wrote:

> No, the fieldValueCache is not used for resolving queries. Only for
> multi-token faceting and apparently for the stats component too. The
> document cache maintains in memory the stored content of the fields you are
> retrieving or highlighting on. It'll hit if the same document matches the
> query multiple times and the same fields are requested, but as Eirck said,
> it is important for cases when multiple components in the same request need
> to access the same data.
>
> I think soft committing every 10 minutes is totally fine, but you should
> hard commit more often if you are going to be using transaction log.
> openSearcher=false will essentially tell Solr not to open a new searcher
> after the (hard) commit, so you won't see the new indexed data and caches
> wont be flushed. openSearcher=false makes sense when you are using
> hard-commits together with soft-commits, as the "soft-commit" is dealing
> with opening/closing searchers, you don't need hard commits to do it.
>
> Tomás
>
>
> On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh 
> wrote:
>
> > Unfortunately, it seems (
> > http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that
> > these caches are not per-segment. In this case, I want to (soft) commit
> > less frequently. Am I right?
> >
> > Tomás, as the fieldValueCache is very similar to lucene's FieldCache, I
> > guess it has a big contribution to standard (not only faceted) queries
> > time. SolrWiki claims that it primarily used by faceting. What that says
> > about complex textual queries?
> >
> > documentCache:
> > Erick, After a query processing is finished, doesn't some documents stay
> in
> > the documentCache? can't I use it to accelerate queries that should
> > retrieve stored fields of documents? In this case, a big documentCache
> can
> > hold more documents..
> >
> > About commit frequency:
> > HardCommit: "openSearch=false" seems as a nice solution. Where can I read
> > about this? (found nothing but one unexplained sentence in SolrWiki).
> > SoftCommit: In my case, the required index freshness is 10 minutes. The
> > plan to soft commit every 10 minutes is similar to storing all of the
> > documents in a queue (outside to Solr), an indexing a bulk every 10
> > minutes.
> >
> > Thanks.
> >
> >
> > On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe <
> > tomasflo...@gmail.com> wrote:
> >
> > > I think fieldValueCache is not per segment, only fieldCache is.
> However,
> > > unless I'm missing something, this cache is only used for faceting on
> > > multivalued fields
> > >
> > >
> > > On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson <
> erickerick...@gmail.com
> > > >wrote:
> > >
> > > > filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in
> > > > cache). Notice the /8. This reflects the fact that the filters are
> > > > represented by a bitset on the _internal_ Lucene ID. UniqueId has no
> > > > bearing here whatsoever. This is, in a nutshell, why warming is
> > > > required, the internal Lucene IDs may change. Note also that it's
> > > > maxDoc, the internal arrays have "holes" for deleted documents.
> > > >
> > > > Note this is an _upper_ bound, if there are only a few docs that
> > > > match, the size will be (num of matching docs) * sizeof(int)).
> > > >
> > > > fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
> > > > It depends on whether these are "per-segment" caches or not. Any "per
> > > > segment" cache is still valid.
> > > >
> > > > Think of documentCache as intended to hold the stored fields while
> > > > various components operate on it, thus avoiding repeatedly fetching
> > > > the data from disk. It's _usually_ not too big a worry.
> > > >
> > > > About hard-commits once a day. That's _extremely_ long. Think instead
> > > > of committing more frequently with openSearcher=false. If nothing
> > > > else

Re: Solr cache considerations

2013-01-20 Thread Isaac Hebsh
Wow Erick, the MMap article is a very fundamental one. It totally changed my
view. It must be mentioned in SolrPerformanceFactors on the SolrWiki...
I'm sorry I did not know about it before.
Thank you a lot.
I promise to share my results when my cart starts to fly :)


On Sun, Jan 20, 2013 at 6:08 PM, Erick Erickson wrote:

> About your question about document cache: Typically the document cache
> has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very
> often. And remember that this cache is only hit when assembling the
> response for a few documents (your page size).
>
> Bottom line: I wouldn't worry about this cache much. It's quite useful
> for processing a particular query faster, but not really intended for
> cross-query use.
>
> Really, I think you're getting the cart before the horse here. Run it
> up the flagpole and try it. Rely on the OS to do its job
> (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
> Find  a bottleneck _then_ tune. Premature optimization and all
> that
>
> Several tens of millions of docs isn't that large unless the text
> fields are enormous.
>
> Best
> Erick
>
> On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh 
> wrote:
> > Ok. Thank you everyone for your helpful answers.
> > I understand that fieldValueCache is not used for resolving queries.
> > Is there any cache that can help this basic scenario (a lot of different
> > queries, on a small set of fields)?
> > Does Lucene's FieldCache help (implicitly)?
> > How can I use RAM to reduce I/O in this type of queries?
> >
> >
> > On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe <
> > tomasflo...@gmail.com> wrote:
> >
> >> No, the fieldValueCache is not used for resolving queries. Only for
> >> multi-token faceting and apparently for the stats component too. The
> >> document cache maintains in memory the stored content of the fields you
> are
> >> retrieving or highlighting on. It'll hit if the same document matches
> the
> >> query multiple times and the same fields are requested, but as Eirck
> said,
> >> it is important for cases when multiple components in the same request
> need
> >> to access the same data.
> >>
> >> I think soft committing every 10 minutes is totally fine, but you should
> >> hard commit more often if you are going to be using transaction log.
> >> openSearcher=false will essentially tell Solr not to open a new searcher
> >> after the (hard) commit, so you won't see the new indexed data and
> caches
> >> wont be flushed. openSearcher=false makes sense when you are using
> >> hard-commits together with soft-commits, as the "soft-commit" is dealing
> >> with opening/closing searchers, you don't need hard commits to do it.
> >>
> >> Tomás
> >>
> >>
> >> On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh 
> >> wrote:
> >>
> >> > Unfortunately, it seems (
> >> > http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html)
> that
> >> > these caches are not per-segment. In this case, I want to (soft)
> commit
> >> > less frequently. Am I right?
> >> >
> >> > Tomás, as the fieldValueCache is very similar to lucene's FieldCache,
> I
> >> > guess it has a big contribution to standard (not only faceted) queries
> >> > time. SolrWiki claims that it primarily used by faceting. What that
> says
> >> > about complex textual queries?
> >> >
> >> > documentCache:
> >> > Erick, After a query processing is finished, doesn't some documents
> stay
> >> in
> >> > the documentCache? can't I use it to accelerate queries that should
> >> > retrieve stored fields of documents? In this case, a big documentCache
> >> can
> >> > hold more documents..
> >> >
> >> > About commit frequency:
> >> > HardCommit: "openSearch=false" seems as a nice solution. Where can I
> read
> >> > about this? (found nothing but one unexplained sentence in SolrWiki).
> >> > SoftCommit: In my case, the required index freshness is 10 minutes.
> The
> >> > plan to soft commit every 10 minutes is similar to storing all of the
> >> > documents in a queue (outside to Solr), an indexing a bulk every 10
> >> > minutes.
> >> >
> >> > Thanks.
> >> >
> >> >
> >> > On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe <
>

Re: uniqueKey field type

2013-01-23 Thread Isaac Hebsh
"id" field is not serial, it generated randomly.. so range queries on this
field are almost useless.
I mentioned TrieField, because solr.LongField is internally implemented as
a string, while solr.TrieLongField is a number. It might improve
performace, even without setting a precisionStep...


On Thu, Jan 24, 2013 at 3:31 AM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> I think trie type fields add value only if you do range queries in them and
> it sounds like that is bit your use case.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Jan 23, 2013 2:53 PM, "Isaac Hebsh"  wrote:
>
> > Hi,
> >
> > In my use case, Solr have to to return only the "id" field, as a response
> > for queries. However, it should return 1000 docs at once (rows=1000).
> >
> > My id field is defined as StrField, due to external systems constraints.
> >
> > I guess that TrieFields are more efficient than StrFields.
> *Theoretically*,
> > the field content can be retrieved without loading the stored field.
> >
> > Should I strive that the id will be managed as a number, or it has no
> > contribution to performance (search & retrieve times)?
> >
> > (Yes, I know that lucene has an internal id mechanism. I think it is not
> > relevant to my question...)
> >
> >
> > - Isaac.
> >
>


Re: secure Solr server

2013-01-27 Thread Isaac Hebsh
You can define a security filter in WEB-INF/web.xml, on specific url
patterns.
You might want to set the url pattern to "/admin/*".

[find examples here:
http://stackoverflow.com/questions/7920092/how-can-i-bypass-security-filter-in-web-xml
]


On Sun, Jan 27, 2013 at 8:07 PM, Mingfeng Yang wrote:

> Before Solr 4.0, I secure solr by enable password protection in Jetty.
>  However, password protection will make solrcloud not work.
>
> We use EC2 now, and we need the www admin interface of solr to be
> accessible (with password) from anywhere.
>
> How do you protect your solr sever from unauthorized access?
>
> Thanks,
> Ming
>


Re: Distributed search

2013-01-28 Thread Isaac Hebsh
Well, my index is already broken into 16 shards...
The behaviour I assumed - it absolutely doesn't happen... right?
Does it make sense somehow as an improvement request?
Technically, can multiple Lucene responses be intersected this way?


On Mon, Jan 28, 2013 at 9:27 PM, Mingfeng Yang wrote:

> In your case, since there is no co-current queries, adding replicas won't
> help much on improving the response speed.  However, break your index into
> a few shards do help increase query performance. I recently break an index
> with 30 million documents (30G) into 4 shards, and the boost is pretty
> impressive (roughly 2-5x faster for a complicated query)
>
> Ming
>
>
> On Mon, Jan 28, 2013 at 10:54 AM, Isaac Hebsh 
> wrote:
>
> > Does adding replicas (on additional servers) help to improve search
> > performance?
> >
> > It is known that each query goes to all the shards. It's clear that if we
> > have massive load, then multiple cores serving the same shard are very
> > useful.
> >
> > But what happens if I'll never have concurrent queries (one query is in
> the
> > system at any time), but I want these single queries to return faster.
> Is a
> > bigger replication factor will contribute?
> >
> > Especially, Will a complicated query (with a large amount of queried
> > fields) go to multiple cores *of the same shard*? (E.g. core1 searching
> for
> > term1 in field1, and core2 searching for term 2 in field2)
> >
> > And what about a query on a single field, which contains a lot of terms?
> >
> > Thanks in advance..
> >
>


Re: Servlet Filter for randomizing core names

2013-02-03 Thread Isaac Hebsh
Thanks Shawn for your quick answer.

When using the collection name, Solr will choose the leader, when available on
the current server (see getCoreByCollection in SolrDispatchFilter). It is
clear that this is useful when indexing. But queries should run on replicas
too, shouldn't they? Moreover, the core selection seems to be consistent (that
is, it will never pick the non-first core in a specific arrangement)...

Under the assumption that a core does extra work when serving queries (e.g.,
combining results, processing every non-distributed search component (?)),
and the assumption that multithreading works well here, wouldn't utilizing all
the cores be useful?
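
For context, the filter mentioned below boils down to something like this (a stripped-down
sketch; the core names are hard-coded here only for illustration, the real filter takes
them from CoreContainer.getCoreNames()):

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

public class RandomCoreFilter implements Filter {

    // illustration only: in the real filter this list comes from CoreContainer
    private final List<String> cores = Arrays.asList("core0", "core1");
    private final Random random = new Random();

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest hreq = (HttpServletRequest) req;
        // path relative to the webapp context, e.g. /randomcore/select?q=...
        String path = hreq.getRequestURI().substring(hreq.getContextPath().length());
        if (path.contains("/randomcore/")) {
            String core = cores.get(random.nextInt(cores.size()));
            String target = path.replace("/randomcore/", "/" + core + "/");
            // requires the FORWARD dispatcher to be enabled on SolrRequestFilter in web.xml
            req.getRequestDispatcher(target).forward(req, resp);
            return;
        }
        chain.doFilter(req, resp);
    }

    @Override public void init(FilterConfig cfg) {}
    @Override public void destroy() {}
}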


On Sun, Feb 3, 2013 at 11:49 PM, Shawn Heisey  wrote:

> On 2/3/2013 1:18 PM, Isaac Hebsh wrote:
>
>> Hi.
>>
>> I have a SolrCloud cluster, which contains some servers. each server runs
>> multiple cores.
>>
>> I want to distribute the requests over the running cores on each server,
>> without knowing the cores names in the client.
>>
>> Question 1: Do I have any reason to do this (when indexing? when
>> querying?).
>>
>> All of these cores are sharing the same system resources, but I guess that
>> I still get a better performance if same amount of requests are going to
>> each core. Am I right?
>>
>
> If you are using a cloud-aware API (such as CloudSolrServer from SolrJ),
> your client knows about your zookeeper setup.  Behind the scenes, it
> consults zookeeper about how to find the various servers and cores.  You
> never have to configure any core names on the client.
>
> If you are not using a cloud-aware API, shouldn't you be talking to the
> collection, not the cores?  That is, talk to /solr/test, not
> /solr/test_shard1_replica1 in your program.  That should cause Solr itself
> to figure out where the cores are and forward requests as necessary.
>  Couple that with a load balancer and it approaches what a cloud-aware API
> gives you in terms of reliability.
>
> From my attempts to help people in the IRC channel, I have concluded that
> Solr 4.0 may use the name of the collection as the name of the core on each
> server.  I have not actually used SolrCloud in 4.0, so I cannot say.
>
> Solr 4.1 does not do this.  If you create a collection named test with 2
> shards and 2 replicas with the collections API, you get the following cores
> distributed among your servers:
>
> test_shard1_replica1
> test_shard1_replica2
> test_shard2_replica1
> test_shard2_replica2
>
>
>  Question 2:
>>
>> I've implemented a nice ServletFilter, which replaces the magic name
>> "/randomcore/" with a random core name (retrieved from CoreContainer). I'm
>> using RequestDispatcher.forward, on the new URI. It works, very cool :)
>>
>> But, for making it work, I had to set "FORWARD"
>> on
>> SolrRequestFilter. this setting is explicitly inadvisable in web.xml. Can
>> anyone explain why?
>>
>
> No idea here.
>
> Thanks,
> Shawn
>
>


Re: Servlet Filter for randomizing core names

2013-02-04 Thread Isaac Hebsh
LBHttpSolrServer is only a SolrJ feature, isn't it?

I think that Solr does not balance queries among cores on the same server.
You can claim that it's a non-issue, if a single core can completely serve
multiple queries at the same time and passing requests through different
cores achieves nothing. I feel that we can achieve some improvement in this
case...
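
To be concrete about what I mean by a SolrJ-only feature, this is the client-side load
balancing I'm referring to (a sketch; the URLs are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;

public class LbQueryExample {
    public static void main(String[] args) throws Exception {
        // round-robins requests over the listed cores/nodes, skipping dead ones
        LBHttpSolrServer lb = new LBHttpSolrServer(
                "http://server1:8983/solr/collection1",
                "http://server2:8983/solr/collection1");

        System.out.println(lb.query(new SolrQuery("*:*")).getResults().getNumFound());
    }
}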


On Mon, Feb 4, 2013 at 12:45 AM, Shawn Heisey  wrote:

> On 2/3/2013 3:24 PM, Isaac Hebsh wrote:
>
>> Thanks Shawn for your quick answer.
>>
>> When using collection name, Solr will choose the leader, when available in
>> the current server (see getCoreByCollection in SolrDispatchFilter). It is
>> clear that it's useful when indexing. But queries should run on replicas
>> too, don't they? Moreover, the core selection seems to be consistent (that
>> is, it will never get the non-first core in a specific arrangement)...
>>
>> Under the assumption that a core makes extra work for serving queries
>> (e.g,
>> combining results, processing every non distributed search component (?)),
>> and the assumption that multithreading works well here, Is utilizing all
>> the cores would not be useful?
>>
>
> Here's an excerpt from the SolrCloud wiki page that suggests it handles
> load balancing across the cluster automatically:
>
> 
> Now send a query to any of the servers to query the cluster:
>
> http://localhost:7500/solr/collection1/select?q=*:*
>
> Send this query multiple times and observe the logs from the solr servers.
> You should be able to observe Solr load balancing the requests (done via
> LBHttpSolrServer ?) across replicas, using different servers to satisfy
> each request.
> 
>
> This is near the end of example B.
>
> http://wiki.apache.org/solr/SolrCloud#Example_B:_Simple_two_shard_cluster_with_shard_replicas
>
> Thanks,
> Shawn
>
>


Re: Servlet Filter for randomizing core names

2013-02-04 Thread Isaac Hebsh
Of course I did not mean multiple cores of the same shard...
A normal SolrCloud configuration, let's say 4 shards, on 4 servers, using
replicationFactor=3.
Of course, no matter what core was requested, the request will be forwarded
to one core of each shard.
My question is whether this *first* request should be distributed over
all of the cores on a specific server or not.

The statement "Cores are completely thread safe and can do queries/updates
concurrently" tells me that there is no reason for my idea.


On Mon, Feb 4, 2013 at 9:28 PM, Shawn Heisey  wrote:

> On 2/4/2013 12:06 PM, Isaac Hebsh wrote:
>
>> LBHttpSolrServer is only solrj feature.. doesn't it?
>>
>> I think that Solr does not balance queries among cores in the same server.
>> You can claim that it's a non-issue, if a single core can completely serve
>> multiple queries on the same time,  and passing requests through different
>> cores does nothing.  I feel that we can achieve some improvement in this
>> case...
>>
>
> If LBHttpSolrServer is used as described in the Wiki (whoever wrote that
> wasn't sure, they were asking), then it is being used on the server side,
> not the client.
>
> Multiple copies of a shard on the same server is probably not a generally
> supported config with SolrCloud.  It would use more memory and disk space,
> and I'm not sure that there would be any actual benefit to query speed.
>  Cores are completely thread safe and can do queries/updates concurrently.
>  Whatever concurrency problems exist are likely due to resource (CPU, RAM,
> I/O) utilization rather than code limitations.  If I'm right about that,
> multiple copies would not solve the problem.  Buying a bigger/faster server
> would be the solution to that problem.
>
> Thanks,
> Shawn
>
>


Re: IP Address as number

2013-02-07 Thread Isaac Hebsh
Small addition:
To support querying, I probably have to implement a (query-time) analyzer...
Can an analyzer be configured on a numeric (i.e. non-text) field?
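
The conversion itself is straightforward; something like this (plain Java, independent of
the field-type question) is what I have in mind:

public class Ipv4Codec {

    // "10.1.2.3" -> 167838211L (the DWORD value mentioned above)
    public static long toLong(String dotted) {
        String[] parts = dotted.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("bad octet in: " + dotted);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // 167838211L -> "10.1.2.3", for rebuilding the textual form on retrieval
    public static String toDotted(long value) {
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }
}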


On Thu, Feb 7, 2013 at 6:48 PM, Isaac Hebsh  wrote:

> Hi.
>
> I have to index field which contains an IP address.
> Users want to query this field using RANGE queries. to support this, the
> IP is stored as its DWORD value (assume it is IPv4...). On the other side,
> users supply the IP addresses textually (xxx.xxx.xxx.xxx).
>
> I can write a new field type, extends TrieLongField, which will change the
> textual representation to numeric one.
> But what about the stored field retrieval? I want to return the textual
> form..  may be a search component, which changes the stored fields?
>
> Has anyone encountered this need before?
>


Re: Trying to understand soft vs hard commit vs transaction log

2013-02-08 Thread Isaac Hebsh
Shawn, what about 'flush to disk' behaviour on MMapDirectoryFactory?


On Fri, Feb 8, 2013 at 11:12 AM, Prakhar Birla wrote:

> Great explanation Shawn! BTW soft commited documents will be not be
> recovered on JVM crash.
>
> On 8 February 2013 13:27, Shawn Heisey  wrote:
>
> > On 2/7/2013 9:29 PM, Alexandre Rafalovitch wrote:
> >
> >> Hello,
> >>
> >> What actually happens when using soft (as opposed to hard) commit?
> >>
> >> I understand somewhat very high-level picture (documents become
> available
> >> faster, but you may loose them on power loss).
> >> I don't care about low-level implementation details.
> >>
> >> But I am trying to understand what is happening on the medium level of
> >> details.
> >>
> >> For example what are stages of a document if we are using all available
> >> transaction log, soft commit, hard commit options? It feels like there
> is
> >> three stages:
> >> *) Uncommitted (soft or hard): accessible only via direct real-time get?
> >> *) Soft-committed: accessible through all search operatons? (but not on
> >> disk? but where is it? in memory?)
> >> *) Hard-committed: all the same as soft-committed but it is now on disk
> >>
> >> Similarly,  in performance section of Wiki, it says: "A commit
> (including
> >> a
> >> soft commit) will free up almost all heap memory" - why would soft
> commit
> >> free up heap memory? I thought it was not flushed to disk.
> >>
> >> Also, with soft-commits and transaction log enabled, doesn't transaction
> >> log allows to replay/recover the latest state after crash? I believe
> >> that's
> >> what transaction log does for the database. If not, how does one
> recover,
> >> if at all?
> >>
> >> And where does openSearcher=false fits into that? Does it cause
> >> inconsistent results somehow?
> >>
> >> I am missing something, but I am not sure what or where. Any points in
> the
> >> right direction would be appreciated.
> >>
> >
> > Let's see if I can answer your questions without giving you incorrect
> > information.
> >
> > New indexed content is not searchable until you open a new searcher,
> > regardless of the type of commit that you do.
> >
> > A hard commit will close the current transaction log and start a new one.
> >  It will also instruct the Directory implementation to flush to disk.  If
> > you specify openSearcher=false, then the content that has just been
> > committed will NOT be searchable, as discussed in the previous paragraph.
> >  The existing searcher will remain open and continue to serve queries
> > against the same index data.
> >
> > A soft commit does not flush the new content to disk, but it does open a
> > new searcher.  I'm sure that the amount of memory available for caching
> > this content is not large, so it's possible that if you do a lot of
> > indexing with soft commits and your hard commits are too infrequent,
> you'll
> > end up flushing part of the cached data to disk anyway.  I'd love to hear
> > from a committer about this, because I could be wrong.
> >
> > There's a caveat with that 'flush to disk' operation -- the default
> > Directory implementation in the Solr example config, which is
> > NRTCachingDirectoryFactory, will cache the last few megabytes of indexed
> > data and not flush it to disk even with a hard commit.  If your commits
> are
> > small, then the net result is similar to a soft commit.  If the server or
> > Solr were to crash, the transaction logs would be replayed on Solr
> startup,
> > recovering that last few megabytes.  The transaction log may also recover
> > documents that were soft committed, but I'm not 100% sure about that.
> >
> > To take full advantage of NRT functionality, you can commit as often as
> > you like with soft commits.  On some reasonable interval, say every one
> to
> > fifteen minutes, you can issue a hard commit with openSearcher set to
> > false, to flush things to disk and cycle through transaction logs before
> > they get huge.  Solr will keep a few of the transaction logs around, and
> if
> > they are huge, it can take a long time to replay them.  You'll want to
> > choose a hard commit interval that doesn't create giant transaction logs.
> >
> > If any of the info I've given here is wrong, someone should correct me!
> >
> > Thanks,
> > Shawn
> >
> >
>
>
> --
> Regards,
> Prakhar Birla
> +91 9739868086
>


How to limit queries to specific IDs

2013-02-11 Thread Isaac Hebsh
Hi everyone.

I have queries that should be bound to a set of IDs (the uniqueKey field
of my schema).
My client front-end sends two Solr requests:
In the first one, it wants to get the top X IDs. This result should return
very fast - no time to "waste" on highlighting. This is a very standard
query.
In the second one, it wants to get the highlighting info (corresponding to
the queried fields and terms, of course) for those documents (maybe as some
sequential requests, on small "bulks" of the "full" list).

These two requests are implemented as almost identical calls, to different
requestHandlers.

I thought of appending a filter query to the second request, "id:(1 2 3 4 5)".
Is this a good idea for Solr?
If so, my problem is that I don't want these filters to flood my
filterCache... Is there any way (even if it involves some coding...) to add
a filter query which won't be added to the filterCache (at least, not instead
of "standard" filters)?


Notes:
1. It can't be assured that the the first query will remain in
queryResultsCache...
2. consider index size of 50M documents...


Re: How to limit queries to specific IDs

2013-02-12 Thread Isaac Hebsh
Thank you, Erick! Three great answers!
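
For the archives, the second request now looks roughly like this with the third option
(a SolrJ sketch; the handler name is an assumption from my own setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class HighlightByIds {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("the original user query");
        q.setRequestHandler("/highlightOnly");   // assumption: the second requestHandler
        // restrict to the ids from the first response, without polluting the filterCache
        q.addFilterQuery("{!cache=false}id:(1 2 3 4 5)");
        q.setHighlight(true);
        q.setRows(5);

        System.out.println(solr.query(q).getHighlighting());
    }
}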


On Wed, Feb 13, 2013 at 4:20 AM, Erick Erickson wrote:

> First, it may not be a problem assuming your other filter queries are more
> frequent.
>
> Second, the easiest way to keep these out of the filter cache would be just
> to include them as a MUST clause, like
> +(original query) +id:(1 2 3 4).
>
> Third possibility, see https://issues.apache.org/jira/browse/SOLR-2429,
> but
> the short form is:
> fq={!cache=false}restoffq
>
>
> On Mon, Feb 11, 2013 at 2:41 PM, Isaac Hebsh 
> wrote:
>
> > Hi everyone.
> >
> > I have queries that should be bounded to a set of IDs (the uniqueKey
> field
> > of my schema).
> > My client front-end sends two Solr request:
> > In the first one, it wants to get the top X IDs. This result should
> return
> > very fast. No time to "waste" on highlighting. this is a very standard
> > query.
> > In the aecond one, it wants to get the highlighting info (corresponding
> to
> > the queried fields and terms, of course), on those documents (may be some
> > sequential requests, on small "bulks" of the "full" list).
> >
> > These two requests are implemented as almost identical calls, to
> different
> > requestHandlers.
> >
> > I thought to append a filter query to the second request, "id:(1 2 3 4
> 5)".
> > Is this idea good for Solr?
> > If does, my problem is that I don't want these filters to flood my
> > filterCache... Is there any way (even if it involves some coding...) to
> add
> > a filter query which won't be added to filterCache (at least, not instead
> > of "standard" filters)?
> >
> >
> > Notes:
> > 1. It can't be assured that the the first query will remain in
> > queryResultsCache...
> > 2. consider index size of 50M documents...
> >
>


Re: Timestamp field is changed on update

2013-02-16 Thread Isaac Hebsh
I opened a JIRA for this improvement request (and attached a patch to
DistributedUpdateProcessor).
It's my first JIRA, please review it...
(Or, if someone has an easier solution, tell us...)

https://issues.apache.org/jira/browse/SOLR-4468


On Fri, Feb 15, 2013 at 8:13 AM, Isaac Hebsh  wrote:

> Hi.
>
> I have a 'timestamp' field, which is a date, with a default value of 'NOW'.
> I want it to represent the datetime when the item was inserted (at the
> first time).
>
> Unfortunately, when the item is updated, the timestamp is changed...
>
> How can I implement INSERT TIME automatically?
>


Re: Timestamp field is changed on update

2013-02-16 Thread Isaac Hebsh
Hi,
I do have an externally-created timestamp, but some minutes may pass before
it is sent to Solr.


On Sat, Feb 16, 2013 at 10:39 PM, Walter Underwood wrote:

> Do you really want the time that Solr first saw it or do you want the time
> that the document was really created in the system? I think an external
> create timestamp would be a lot more useful.
>
> wunder
>
> On Feb 16, 2013, at 12:37 PM, Isaac Hebsh wrote:
>
> > I opened a JIRA for this improvement request (attached a patch to
> > DistributedUpdateProcessor).
> > It's my first JIRA. please review it...
> > (Or, if someone has an easier solution, tell us...)
> >
> > https://issues.apache.org/jira/browse/SOLR-4468
> >
> >
> > On Fri, Feb 15, 2013 at 8:13 AM, Isaac Hebsh 
> wrote:
> >
> >> Hi.
> >>
> >> I have a 'timestamp' field, which is a date, with a default value of
> 'NOW'.
> >> I want it to represent the datetime when the item was inserted (at the
> >> first time).
> >>
> >> Unfortunately, when the item is updated, the timestamp is changed...
> >>
> >> How can I implement INSERT TIME automatically?
> >>
>
>
>
>
>


Re: Timestamp field is changed on update

2013-02-16 Thread Isaac Hebsh
The component that sends the document does not know whether it is a new
document or an update. These are my internal constraints... But, guys, I
think this is a basic feature, and it would be better if Solr supported
it without "external help"...


On Sun, Feb 17, 2013 at 12:37 AM, Upayavira  wrote:

> I think what Walter means is make the thing that sends it to Solr set
> the timestamp when it does so.
>
> Upayavira
>
> On Sat, Feb 16, 2013, at 08:56 PM, Isaac Hebsh wrote:
> > Hi,
> > I do have an externally-created timestamp, but some minutes may pass
> > before
> > it will be sent to Solr.
> >
> >
> > On Sat, Feb 16, 2013 at 10:39 PM, Walter Underwood
> > wrote:
> >
> > > Do you really want the time that Solr first saw it or do you want the
> time
> > > that the document was really created in the system? I think an external
> > > create timestamp would be a lot more useful.
> > >
> > > wunder
> > >
> > > On Feb 16, 2013, at 12:37 PM, Isaac Hebsh wrote:
> > >
> > > > I opened a JIRA for this improvement request (attached a patch to
> > > > DistributedUpdateProcessor).
> > > > It's my first JIRA. please review it...
> > > > (Or, if someone has an easier solution, tell us...)
> > > >
> > > > https://issues.apache.org/jira/browse/SOLR-4468
> > > >
> > > >
> > > > On Fri, Feb 15, 2013 at 8:13 AM, Isaac Hebsh 
> > > wrote:
> > > >
> > > >> Hi.
> > > >>
> > > >> I have a 'timestamp' field, which is a date, with a default value of
> > > 'NOW'.
> > > >> I want it to represent the datetime when the item was inserted (at
> the
> > > >> first time).
> > > >>
> > > >> Unfortunately, when the item is updated, the timestamp is changed...
> > > >>
> > > >> How can I implement INSERT TIME automatically?
> > > >>
> > >
> > >
> > >
> > >
> > >
>


Re: Timestamp field is changed on update

2013-02-17 Thread Isaac Hebsh
Thank you Alex.
Atomic Update allows you to "add" new values into a multivalued field, for
example... It means that the original document is read (using
RealTimeGet, which depends on the updateLog).
There is no reason the list of operations (add/set/inc) could not
include a "create-only" operation... I think that pushing this onto the client
is not a good idea, if only because of the required atomicity (which is
handled in the DistributedUpdateProcessor using internal locks).

There is no problem using Atomic Update semantics on a non-existent
document.

Indeed, it will work on stored fields only.


On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch
wrote:

> Unless it is an Atomic Update, right. In which case Solr/Lucene will
> actually look at the existing document and - I assume - will preserve
> whatever field got already populated as long as it is stored. Should work
> for default values as well, right? They get populated on first creation,
> then that document gets partially updated.
>
> But I can't tell from the problem description whether it can be
> reformulated as something that fits Atomic Update. I think if the client
> does not know whether this is a new record or an update one, Solr will
> complain if Atomic Update semantics is used against non-existent document.
>
> Regards,
>Alex.
> P.s. Lots of conjecture here; I haven't tested exactly this use-case.
>
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>
>
> On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood 
> wrote:
> >
> > It is natural part of the update model for Solr (and for many other
> search engines). Solr does not do updates. It does add, replace, and
> delete.
> >
> > Every document is processed as if it was new. If there is already a
> document with that id, then the new document replaces it. The existing
> documents are not read during indexing. This allows indexing to be much
> faster than in a relational database.
> >
> > wunder
>


Re: Timestamp field is changed on update

2013-02-20 Thread Isaac Hebsh
Nobody has responded to my JIRA issue :(
Should I commit this patch to SVN trunk and set the issue to Resolved?


On Sun, Feb 17, 2013 at 9:26 PM, Isaac Hebsh  wrote:

> Thank you Alex.
> Atomic Update allows you to "add" new values into multivalued field, for
> example... It means that the original document is being read (using
> RealTimeGet, which depends on updateLog).
> There is no reason that the list of operations (add/set/inc) will not
> include a "create-only" operation... I think that throwing it to the client
> is not a good idea, and even only because the required atomicity (which is
> handled in the DistributedUpdateProcessor using internal locks).
>
> There is no problem when using Atomic Update semantics on non-existent
> document.
>
> Indeed, it will work on stored fields only.
>
>
> On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch  > wrote:
>
>> Unless it is an Atomic Update, right. In which case Solr/Lucene will
>> actually look at the existing document and - I assume - will preserve
>> whatever field got already populated as long as it is stored. Should work
>> for default values as well, right? They get populated on first creation,
>> then that document gets partially updated.
>>
>> But I can't tell from the problem description whether it can be
>> reformulated as something that fits Atomic Update. I think if the client
>> does not know whether this is a new record or an update one, Solr will
>> complain if Atomic Update semantics is used against non-existent document.
>>
>> Regards,
>>Alex.
>> P.s. Lots of conjecture here; I haven't tested exactly this use-case.
>>
>> Personal blog: http://blog.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all at
>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>
>>
>> On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood > >
>> wrote:
>> >
>> > It is natural part of the update model for Solr (and for many other
>> search engines). Solr does not do updates. It does add, replace, and
>> delete.
>> >
>> > Every document is processed as if it was new. If there is already a
>> document with that id, then the new document replaces it. The existing
>> documents are not read during indexing. This allows indexing to be much
>> faster than in a relational database.
>> >
>> > wunder
>>
>
>


update fails if one doc is wrong

2013-02-26 Thread Isaac Hebsh
Hi.

I add documents to Solr by POSTing them to the UpdateHandler, as bulks of add
commands (DIH is not used).

If one document contains any invalid data (e.g. string data in a numeric
field), Solr returns HTTP 400 Bad Request, and the whole bulk fails.

I'm searching for a way to tell Solr to accept the rest of the documents...
(I'll use RealTimeGet to determine which documents were added).

If there is no standard way to do it, maybe it can be implemented by
splitting the add commands into separate HTTP POSTs. Because we use
auto-soft-commit, can I say that this is almost equivalent? What is the
performance penalty of 100 POST requests (of 1 document each) against 1
request of 100 docs, if a soft commit is eventually done anyway?
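
To make the second option concrete, a rough sketch (assuming the plain XML
update format, with made-up field names). Instead of one POST carrying the
whole bulk:

<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="price">10</field>
  </doc>
  <doc>
    <field name="id">doc-2</field>
    <!-- one bad value here fails the entire request -->
    <field name="price">not-a-number</field>
  </doc>
</add>

each document would be sent in its own request:

<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="price">10</field>
  </doc>
</add>

with commits left to the autoSoftCommit settings in solrconfig.xml rather
than committing per request.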

Thanks in advance...


Re: Timestamp field is changed on update

2013-02-28 Thread Isaac Hebsh
Hoss Man suggested a wonderful solution for this need:
Always send the field you want to keep with update="add", and use
FirstFieldValueUpdateProcessorFactory in the update chain, after
DistributedUpdateProcessorFactory (so the atomic update will put the
existing value before the new one, if one exists).
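
For anyone who finds this thread later, a minimal sketch of what such a chain
might look like in solrconfig.xml (the chain name and the timestamp field
name are my own; adjust them to your schema):

<updateRequestProcessorChain name="keep-first-timestamp">
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <!-- runs after the atomic update has merged in any existing value,
       so only the first (original) timestamp value survives -->
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">timestamp</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>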

This solution exactly covers my case. Thank you!


On Wed, Feb 20, 2013 at 11:33 PM, Isaac Hebsh  wrote:

> Nobody responded my JIRA issue :(
> Should I commit this patch into SVN's trunk, and set the issue as Resolved?
>
>
> On Sun, Feb 17, 2013 at 9:26 PM, Isaac Hebsh wrote:
>
>> Thank you Alex.
>> Atomic Update allows you to "add" new values into multivalued field, for
>> example... It means that the original document is being read (using
>> RealTimeGet, which depends on updateLog).
>> There is no reason that the list of operations (add/set/inc) will not
>> include a "create-only" operation... I think that throwing it to the client
>> is not a good idea, and even only because the required atomicity (which is
>> handled in the DistributedUpdateProcessor using internal locks).
>>
>> There is no problem when using Atomic Update semantics on non-existent
>> document.
>>
>> Indeed, it will work on stored fields only.
>>
>>
>> On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch <
>> arafa...@gmail.com> wrote:
>>
>>> Unless it is an Atomic Update, right. In which case Solr/Lucene will
>>> actually look at the existing document and - I assume - will preserve
>>> whatever field got already populated as long as it is stored. Should work
>>> for default values as well, right? They get populated on first creation,
>>> then that document gets partially updated.
>>>
>>> But I can't tell from the problem description whether it can be
>>> reformulated as something that fits Atomic Update. I think if the client
>>> does not know whether this is a new record or an update one, Solr will
>>> complain if Atomic Update semantics is used against non-existent
>>> document.
>>>
>>> Regards,
>>>Alex.
>>> P.s. Lots of conjecture here; I haven't tested exactly this use-case.
>>>
>>> Personal blog: http://blog.outerthoughts.com/
>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>> - Time is the quality of nature that keeps events from happening all at
>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>>
>>>
>>> On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood <
>>> wun...@wunderwood.org>
>>> wrote:
>>> >
>>> > It is natural part of the update model for Solr (and for many other
>>> search engines). Solr does not do updates. It does add, replace, and
>>> delete.
>>> >
>>> > Every document is processed as if it was new. If there is already a
>>> document with that id, then the new document replaces it. The existing
>>> documents are not read during indexing. This allows indexing to be much
>>> faster than in a relational database.
>>> >
>>> > wunder
>>>
>>
>>
>


Any documentation on Solr MBeans?

2013-03-07 Thread Isaac Hebsh
Hi,

I'm trying to monitor some Solr behaviour, using JMX.
It looks like a great job was done there, but I can't find any
documentation on the MBeans themselves.

For example, the DirectUpdateHandler2 attributes: what is the difference
between "adds" and "cumulative_adds"? Does "adds" count only the last X
seconds? Does "cumulative_adds" survive a core reload?
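
For reference, this is with JMX enabled in solrconfig.xml; a minimal sketch
of the relevant bit (the serviceUrl form is an alternative, and the URL
below is only an example):

<!-- register Solr's MBeans with an existing MBean server -->
<jmx />

<!--
<jmx serviceUrl="service:jmx:rmi:///jndi/rmi://localhost:9999/solr" />
-->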


Solr 4.2 - DocValues on id field

2013-03-13 Thread Isaac Hebsh
Hi,

The example schema.xml in Solr 4.2 does not define the "id" field
with docValues="true".
Is there a good reason? (other than backward compatibility with indexes
from previous versions...)

If my common case is fl=id (and no other fields), DocValues seems like a
classic fit for me. Am I right?
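
For context, this is the kind of declaration I have in mind (a sketch only;
whether it is advisable is exactly my question):

<!-- schema.xml: the stock "id" declaration, with docValues added -->
<field name="id" type="string" indexed="true" stored="true" required="true"
       multiValued="false" docValues="true"/>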