Re: field collapsing performance in sharded environment

2013-11-15 Thread Paul Masurel
That's not the way grouping is done.
In a first round, each shard returns its 10 best groups (represented by
their grouping values).

As a result, grouping takes three rounds instead of the two rounds of a
regular search, so observing an increase in latency is normal, but not in
the realm of what you are seeing here.

Most probably it is due to the performance issue of TermAllGroupsCollector
which you can patch very easily.
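
The first round described above can be sketched roughly as follows. This is
plain Java for illustration only, not Solr's actual code: the class name
FirstRoundMerge, the method mergeTopGroups, and the idea of one numeric sort
value per group (higher is better) are all assumptions of the sketch. The
later rounds are not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: every name here is invented for illustration.
class FirstRoundMerge {
    /**
     * Each map is one shard's reply: grouping value -> best sort value on that shard.
     * Returns the n best grouping values overall (higher sort value is better).
     */
    static List<String> mergeTopGroups(List<Map<String, Double>> perShard, int n) {
        // The same group may be reported by several shards: keep its best sort value.
        Map<String, Double> best = new HashMap<>();
        for (Map<String, Double> shardTop : perShard) {
            for (Map.Entry<String, Double> e : shardTop.entrySet()) {
                best.merge(e.getKey(), e.getValue(), Double::max);
            }
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(best.entrySet());
        entries.sort((a, b) -> {
            int c = Double.compare(b.getValue(), a.getValue()); // best first
            return c != 0 ? c : a.getKey().compareTo(b.getKey()); // deterministic tie-break
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

The point of the sketch is that each shard only sends n entries, so the
coordinator merges at most n entries per shard, not every group.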


On Thu, Nov 14, 2013 at 3:56 PM, Erick Erickson wrote:

> bq: Of the 10k docs, most have a unique near duplicate hash value, so
> there are about 10k unique values for the field that I'm grouping on.
>
> I suspect (but don't know the grouping code well) that this is the issue.
> You're getting the top N groups, right? But in the general case, you can't
> ensure that the topN from shard1 has any relation to the topN from shard2.
> So I _suspect_ that the code returns all of the groups. Say that shard1 for
> group 5 has 3 docs, but shard2 has 3,000 docs. To get the true top N, you
> need to collate all the values from all the groups; you can't just return
> the top 10 groups from each shard and get correct counts.
>
> Since your group cardinality is about 10K/shard, you're pushing 10 packets
> each containing 10K entries back to the originating shard, which has to
> combine/sort them all to get the true top N. At least that's my theory.
>
> Your situation is special in that you say that your groups don't appear on
> more than one shard, so you'd probably have to write something that aborted
> this behavior and returned only the top N, if I'm right.
>
> But that begs the question of why you're doing this. What purpose is served
> by grouping on documents that probably only have 1 member?
>
> Best,
> Erick
>
>
> On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano <
> dtroi...@basistech.com> wrote:
>
> > Hello,
> >
> > I'm hitting a performance issue when using field collapsing in a
> > distributed Solr setup and I'm wondering if others have seen it and if
> > anyone has an idea to work around it.
> >
> > I'm using field collapsing to deduplicate documents that have the same
> > near duplicate hash value, and deduplicating at query time (as opposed to
> > filtering at index time) is a requirement.  I have a sharded setup with 10
> > cores (not SolrCloud), each having ~1000 documents.  Of the 10k docs, most
> > have a unique near duplicate hash value, so there are about 10k unique
> > values for the field that I'm grouping on.  The grouping parameters that
> > I'm using are:
> >
> > group=true
> > group.field=
> > group.main=true
> >
> > I'm attempting distributed queries (&shards=s1,s2,...,s10) where the only
> > difference is the absence or presence of these three grouping parameters
> > and I'm consistently seeing a marked difference in performance (as a
> > representative data point, 200ms latency without grouping and 1600ms with
> > grouping).  Interestingly, if I put all 10k docs on the same core and
> > query that core independently with and without grouping, I don't see much
> > of a latency difference, so the performance degradation seems to exist
> > only in the sharded setup.
> >
> > Is there a known performance issue when field collapsing in a sharded
> > setup (perhaps one that only manifests when the grouping field has many
> > unique values), or have other people observed this?  Any ideas for a
> > workaround?  Note that docs in my sharded setup can only have the same
> > signature if they're in the same shard, so perhaps that can be used to
> > boost perf, though I don't see an exposed way to do so.
> >
> > A follow-on question is whether we're likely to see the same issue if /
> > when we move to SolrCloud.
> >
> > Thanks,
> > Dave
> >
>



-- 
__

 Masurel Paul
 e-mail: paul.masu...@gmail.com


TrieField and FieldCache confusion

2013-07-31 Thread Paul Masurel
Hello everyone,

I have a question about Solr TrieField and Lucene FieldCache.

From my understanding, Solr added the implementation of TrieField to
perform faster range queries. For each value it will index multiple terms,
the n-th term being a masked version of our value, showing only its first
(precisionStep * n) bits.
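
A rough illustration of the multi-term idea, under an assumption that goes
beyond what is stated above: that the terms are obtained by dropping the
0, precisionStep, 2 * precisionStep, ... lowest bits of the value. The class
TrieTermsSketch and the "shift=" labels are made up for this sketch; the
sortable-bits transformation and the actual prefix-coded term format used by
Lucene are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene's encoding.
class TrieTermsSketch {
    /** One entry per precision level: the value with its `shift` lowest bits dropped. */
    static List<String> terms(int value, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<String> out = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            // The shift is part of the term, so terms of different precision never collide.
            out.add("shift=" + shift + ":" + Integer.toBinaryString(value >>> shift));
        }
        return out;
    }
}
```

In this sketch only the shift=0 entry carries the full value, which is the
one a cache of per-document values would have to pick.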

When uninverting the field to populate a FieldCache, the last value with
regard to the lexicographical order will be retained, which from my
understanding should be the term with the highest precision.

Can I expect the FieldCache of Lucene to return the correct values when
working with a TrieField with a precisionStep higher than 0? If not, what
did I get wrong?

Regards,

Paul Masurel
e-mail: paul.masu...@gmail.com


Re: FieldCollapsing issues in SolrCloud 4.4

2013-07-31 Thread Paul Masurel
Do you mean you get different results with group=true?
numFound is supposed to return the number of ungrouped hits.

To get the number of groups, you are expected to set
group.ngroups=true.
Even then, the result will only give you an upper bound
in a distributed environment.
To get the exact number of groups, you need to shard along
your grouping field.

If you have many groups, you may also experience a huge performance
hit, as the current implementation has been heavily optimized for a low
number of groups (e.g. e-commerce categories).

Paul



On Wed, Jul 31, 2013 at 1:59 AM, Ali, Saqib  wrote:

> Hello all,
>
> Is anyone experiencing issues with the numFound when using group=true in
> SolrCloud 4.4?
>
> Sometimes the results are off for us.
>
> I will post more details shortly.
>
> Thanks.
>





Re: FieldCollapsing issues in SolrCloud 4.4

2013-07-31 Thread Paul Masurel
If your issue is that you want to retrieve the number of groups,
group.ngroups will return the sum of the number of groups per shard.

This is not the overall number of groups if some groups are present
on more than one shard.
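
A small sketch of why the sum is only an upper bound. The class
NGroupsSketch and its two methods are hypothetical; each shard is modeled
simply as the set of grouping values it holds.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: shards are modeled as sets of grouping values.
class NGroupsSketch {
    /** Sum of the per-shard group counts, as described above. */
    static int summedGroupCount(List<? extends Set<String>> groupsPerShard) {
        int sum = 0;
        for (Set<String> shardGroups : groupsPerShard) {
            sum += shardGroups.size();
        }
        return sum;
    }

    /** The true number of groups: distinct grouping values across all shards. */
    static int exactGroupCount(List<? extends Set<String>> groupsPerShard) {
        Set<String> all = new HashSet<>();
        for (Set<String> shardGroups : groupsPerShard) {
            all.addAll(shardGroups);
        }
        return all.size();
    }
}
```

With groups {a, b} on one shard and {b, c} on another, the sum is 4 while
the exact count is 3; the two only agree when no group spans shards.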

To make sure that this does not happen, one can choose to distribute
documents so that all the documents with the same group key go to the
same shard.

(Disclaimer: before doing so, you need to make sure that your documents
will still be spread about equally.)

You can check out how to do that here:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
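
The idea of routing by group key can be sketched as below. This is not how
SolrCloud's router is implemented; GroupRouting, shardFor, and the use of
String.hashCode() are assumptions chosen only to keep the example short.

```java
// Hypothetical sketch of "same group key, same shard".
class GroupRouting {
    /** Picks a shard from the grouping value alone, so equal values always land together. */
    static int shardFor(String groupKey, int numShards) {
        // String.hashCode() is only a stand-in; a real router would use its own hash function.
        return Math.floorMod(groupKey.hashCode(), numShards);
    }
}
```

Because the shard depends on the group key only, a few very large groups can
unbalance the shards, which is the reason for the disclaimer above.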





On Wed, Jul 31, 2013 at 8:02 PM, Ali, Saqib  wrote:

> Hello Paul,
>
> Can you please explain what you mean by:
> "To get the exact number of groups, you need to shard along your grouping
> field"
>
> Thanks! :)
>
>


Re: TrieField and FieldCache confusion

2013-08-01 Thread Paul Masurel
Thank you very much for your very fast answer and
all the pointers.

That's what I thought, but then I got confused by the last note on
http://wiki.apache.org/solr/StatsComponent

"TrieFields has to use a precisionStep of -1 to avoid using
UnInvertedField.java.
Consider using one field for doing stats, and one for doing range facetting
on."

I assume it referred to a former version of Solr.




On Wed, Jul 31, 2013 at 7:43 PM, Chris Hostetter
wrote:

>
> : Can I expect the FieldCache of Lucene to return the correct values when
> : working
> : with TrieField with the precisionStep higher than 0. If not, what did I
> get
> : wrong?
>
> Yes -- the code for building FieldCaches from Trie fields is smart enough
> to ensure that only the "real" original values are used to populate the
> Cache
>
> See for example: FieldCache.NUMERIC_UTILS_INT_PARSER and the classes
> linked to from its javadocs...
>
>
> https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/FieldCache.html#NUMERIC_UTILS_INT_PARSER
>
> https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/util/NumericUtils.html
>
> https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/document/IntField.html
>
> (Solr's Trie fields are backed by the various numeric fields in lucene --
> ie: solr:TrieIntField -> lucene:IntField.  the "Trie*" prefix is used in
> solr because it already had classes named IntField, DoubleField, etc...
> when the Trie based impls were added to lucene)
>
>
> -Hoss
>





Re: Unexpected character '<' (code 60) expected '='

2013-08-01 Thread Paul Masurel
You can check the validity of your XML very simply with xmllint.

xmllint 

Does this return an error?
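
If xmllint is not at hand, a similar well-formedness check can be sketched
with the JDK's own parser. The class XmlCheck and the method isWellFormed
are names invented for this example; it checks well-formedness only, not
validity against a schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, shown only as a sketch.
class XmlCheck {
    /** Returns true if the string parses as well-formed XML (no schema validation). */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // Any parse, configuration, or I/O failure counts as "not well-formed" here.
            return false;
        }
    }
}
```

A field value containing a raw "<" makes the check fail, while the same
value written with "&lt;" passes.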



On Thu, Aug 1, 2013 at 9:59 AM, deniz  wrote:

> Vineet Mishra wrote
> > I am using Solr 3.5 with the posting XML file size of just 1Mb.
> >
> >
> > On Wed, Jul 31, 2013 at 8:19 PM, Shawn Heisey wrote:
> >
> >> On 7/31/2013 7:16 AM, Vineet Mishra wrote:
> >> > I checked the file... nothing is there. I mean the formatting is
> >> > correct, it's a valid XML file.
> >>
> >> What version of Solr, and how large is your XML file?
> >>
> >> If Solr is older than version 4.1, then the POST buffer limit is decided
> >> by your container config, which based on your stacktrace, is tomcat.  If
> >> you have 4.1 or later, then the POST buffer limit is decided by Solr,
> >> and defaults to 2048KiB.
> >>
> >> Could that be the problem?
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>
>
> you might need to escape some chars like < to &lt; and so on


Re: Pagination on grouped results

2013-08-01 Thread Paul Masurel
Let me copy paste an answer I wrote yesterday :)

To get the number of groups, you are expected to set
group.ngroups=true.
Even then, the result will only give you an upper bound
in a distributed environment.
To get the exact number of groups, you need to shard along
your grouping field.

If you have many groups, you may also experience a huge performance
hit, as the current implementation has been heavily optimized for a low
number of groups (e.g. e-commerce categories).



On Thu, Aug 1, 2013 at 10:06 AM, Bruno René Santos wrote:

> Hello,
>
> Is there any way to know how many items I will have after grouping results?
> I have the total matches without groups and for each group the quantity of
> items in it, but I cannot find the number of groups I got.
>
> Regards
>
> --
> Bruno René Santos
> Lisboa - Portugal
>





Re: Group and performing statistics on groups

2013-08-01 Thread Paul Masurel
https://issues.apache.org/jira/browse/SOLR-2931

Please add a word on the JIRA describing your need and
keep an eye on the ticket. I might release such a plugin
some time soon.

Regards,

Paul



On Fri, Jul 26, 2013 at 4:16 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> I think no, and I think there is a JIRA issue open for that.
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
>
>
> On Fri, Jul 26, 2013 at 2:32 PM, Vineet Mishra 
> wrote:
> > Hi
> >
> > This is an urgent call. I am grouping the Solr documents by a field name
> > and want to get the range (min and max) value for another field in that
> > group.
> >
> > StatsComponent works fine on all the documents as a whole, rendering the
> > max and min of a field. Is it possible to get the StatsComponent per
> > group?
> >
> >
> > Thanks and Regards
> > Vineet
>





Re: Unexpected behavior when sorting groups

2013-08-04 Thread Paul Masurel
Dear Tony,

The behavior you described is correct, and what you are asking for
is impossible with Solr as it is.

I wouldn't however say it is a limitation of Solr: your problem is actually
difficult and requires some preprocessing.

One solution, if it is feasible for you, is to precompute the lowest price
of each group beforehand and add it as a field to all of the documents of
the group.
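
The precomputation could look roughly like this, done before indexing. The
class MinPricePerGroup and the two parallel lists (one group key and one
price per document) are assumptions of the sketch, not part of any Solr API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical preprocessing step, run before the documents are indexed.
class MinPricePerGroup {
    /** groupKeys.get(i) and prices.get(i) describe document i. */
    static Map<String, Double> lowestPrices(List<String> groupKeys, List<Double> prices) {
        Map<String, Double> min = new HashMap<>();
        for (int i = 0; i < groupKeys.size(); i++) {
            // Keep the smaller of the stored price and the current document's price.
            min.merge(groupKeys.get(i), prices.get(i), Double::min);
        }
        return min;
    }
}
```

Each document would then get the value stored for its group in an extra
field, and sorting on that field in either direction orders groups by their
lowest price.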

The other way to address your problem is to do that within Solr.
This can be done by adding a custom cache holding these values.
You can implement the computation of the min price in the warm method.

You can then add a custom function to return the result stored in this
cache. Function values can be used for sorting.

If it does not exist yet, you may open a ticket. I will try and get
authorization to open-source a solution for this.

Regards,

Paul




On Sat, Aug 3, 2013 at 12:00 AM, Tony Paloma wrote:

> I'm using field collapsing to group documents by a single field and have
> run into something unexpected with how sorting of the groups works. Right
> now I have each group return one document. The documents within each group
> are sorted by a field ("price") in ascending order using group.sort so that
> the document returned for each group in the search results is the cheapest
> document of the group. If I sort the groups amongst themselves using
> sort=price asc, I get what I expect with groups having documents whose
> lowest price value is low show first and groups having documents whose
> lowest price value is high show last.
>
> If I change this to sort on price desc, what happens is not what I would
> expect. I would like the groups to be returned in reverse order from what
> happened when sorting by price asc. Instead, what happens is the groups are
> sorted in descending order according to the highest priced document in each
> group. I want groups to be sorted in descending order according to the
> lowest priced document in each group, but it appears this is not possible.
> In other words, it appears sorting when groups are involved is limited to
> either MAX([field]) DESC or MIN([field]) ASC. The other two combinations
> are not possible. Does anyone know whether or not this is in fact
> impossible, and if not, how I might put in a feature request?
>





Re: Solr grouping performace

2013-08-05 Thread Paul Masurel
Collapsing is not that slow actually. With a high number of groups,
you may just have to leave group.ngroups set to false.

If you need to get the overall number of groups, you may have
to patch Lucene. Martijn's patch, for instance, may work OK for your
range of values:

https://issues.apache.org/jira/browse/LUCENE-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709974#comment-13709974

On Mon, Aug 5, 2013 at 9:11 AM, Alok Bhandari <
alokomprakashbhand...@gmail.com> wrote:

> Hello,
> I need some functionality for which I found that grouping is the most
> suited feature. I want to know about the performance issues associated
> with it. In some posts I found that performance is a bottleneck, but I
> want to know: if I have 3 million records with 0.5 million distinct
> values for group.value, can I expect results to return in 2-3 seconds?
> The grouping field is an "int", and I want only one field per document.
> I can afford to use up to 4GB RAM.


Re: Unexpected behavior when sorting groups

2013-08-06 Thread Paul Masurel
On Mon, Aug 5, 2013 at 2:42 AM, Tony Paloma  wrote:

> Thanks Paul. That's helpful. I'm not familiar with the concept of custom
> caches. Would this be custom Java code or something defined in the
> config/schema? Can you point me to some documentation?
>
>
My solution requires both writing custom Java code and defining things in
your solrconfig.xml.
I'm waiting for approval to release my plugin, but I'm afraid I don't have
any visibility on the length of the process.

There is only the bare minimum in the documentation:
http://wiki.apache.org/solr/SolrCaching

Write a class extending SolrCacheBase:

public class YourCache extends SolrCacheBase implements SolrCache

You just add some XML in your Solr config to instantiate your custom cache.
At each commit, Solr will call warm... You can inline the code to recompute
all your min prices there or delegate it to a CacheRegenerator.

You then need to declare a ValueSource reading from this cache.
You can access your cache in the parse function via the FunctionQParser:

SolrIndexSearcher searcher = fp.getReq().getSearcher();
YourCache cache = (YourCache)searcher.getCache(cacheName);





> Another workaround I was thinking of using was making two Solr queries when
> wanting to sort groups by price desc. One to get the number of total groups
> and then another that gets groups sorted by price asc starting from ngroups
> - (start+rows) and then just flip the ordering to fake sorting by
> min(price) desc, but I was worried about the performance implications of
> that.
>

That should work indeed... But keep in mind it will be extremely expensive
if you start distributing your queries:
if you want to get hits from 100 to 110, shards will be asked to send hits
from 0 to 110.
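
The cost of deep paging in a distributed query can be sketched as follows.
Hits are reduced to plain integers sorted ascending, and DistributedPaging,
shardReply and page are hypothetical names; Solr's real merge logic is more
involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: hits are modeled as integer sort keys, ascending.
class DistributedPaging {
    /** A shard must return its best (start + rows) hits: any of them may land on the page. */
    static List<Integer> shardReply(List<Integer> sortedShardHits, int start, int rows) {
        int count = Math.min(start + rows, sortedShardHits.size());
        return new ArrayList<>(sortedShardHits.subList(0, count));
    }

    /** The coordinator merges all replies and only then skips the first `start` hits. */
    static List<Integer> page(List<List<Integer>> shardHits, int start, int rows) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> hits : shardHits) {
            merged.addAll(shardReply(hits, start, rows));
        }
        Collections.sort(merged);
        int from = Math.min(start, merged.size());
        int to = Math.min(start + rows, merged.size());
        return new ArrayList<>(merged.subList(from, to));
    }
}
```

The amount of data each shard sends grows with start + rows, so paging from
the far end of the result list, as in the workaround, is the worst case.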



> SOLR-2072 has a similar request.
> https://issues.apache.org/jira/browse/SOLR-2072
>
> Bryan's comment is exactly what I'm looking for:
> > I would like to able to use sort and group.sort together such that the
> group.sort is applied with in the group first and the first document of
> each group is then used as the representative document to perform the
> overall sorting of groups.
>
> The latest comment there suggests that it's a bug in distributed mode, but
> I don't think that's the case since I'm only using one instance of Solr
> with no sharding or anything.
>

This is not a bug. If I get some time, I'll try to write a post about how
collapsing works in Solr.
Even though it is counterintuitive, what you are asking for is actually a
difficult problem.

Regards,

Paul




Re: Unexpected behavior when sorting groups

2013-08-06 Thread Paul Masurel
Here is some detail about how grouping is implemented in Solr.
http://fulmicoton.com/posts/grouping-in-solr/





Re: Does MMap works on the Virtual Box?

2013-08-16 Thread Paul Masurel
Hi,

You can mmap a file bigger than your memory without having any problem.
The parts of your file that you don't access often will just not be kept
in RAM.
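
For reference, this is roughly what memory-mapping a file looks like in
plain Java. MmapSketch, readByteAt and demo are names made up for this
sketch; it maps one small temporary file, whereas a real index is far
larger and a single Java mapping cannot exceed Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of reading through a memory-mapped file.
class MmapSketch {
    /** Maps the file read-only and reads one byte; only the pages touched need to be in RAM. */
    static byte readByteAt(Path file, int position) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return buffer.get(position);
        }
    }

    /** Writes three bytes to a temporary file and reads the middle one back through the mapping. */
    static byte demo() {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[] {10, 20, 30});
            return readByteAt(file, 1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The operating system decides which pages of the mapping stay in memory, so
the mapped size is not bounded by the available RAM.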

If you are short on memory, consider deactivating Host I/O Caching, as
it will only be redundant with your guest OS page cache.

Regards,

Paul



On Fri, Aug 16, 2013 at 10:26 PM, Shawn Heisey  wrote:

> On 8/16/2013 1:02 PM, vibhoreng04 wrote:
>
>> I have a big index of 256 GB. Right now it is on one physical box with
>> 256 GB of RAM. I am planning to virtualize it into 8 boxes of 32 GB RAM
>> each. Will MMap still work in this setup?
>>
>
> As far as MMap goes, if the operating system you are running is 64-bit,
> your Java is 64-bit, and the OS supports MMap (which almost every operating
> system does, including Linux and Windows), then you'd be fine.
>
> If you have the option of running Solr on bare metal vs. running on the
> same hardware in a virtualized environment, you should always choose the
> bare metal.
>
> I had a Solr installation with a sharded index.  When I first set it up, I
> used virtual machines, one Solr instance and shard per VM.  Half the VMs
> were running on one physical box, half on another.  For redundancy, I had a
> second pair of physical servers doing the same thing, each with VMs
> representing half the index.
>
> That same setup now runs on bare metal -- the exact same physical
> machines, in fact.  The index arrangement is nearly the same as before,
> except it uses multicore Solr, one instance per machine.
>
> Removing the virtualization layer helped performance quite a bit. Average
> QTimes went way down and it took less time to do a full index rebuild.
>
> Thanks,
> Shawn
>
>

