short-circuit OR operator in lucene/solr

2013-07-21 Thread Deepak Konidena
I understand that Lucene's AND (&&), OR (||) and NOT (!) operators are
shorthand for REQUIRED, OPTIONAL and EXCLUDED clauses respectively, which is
why one can't treat them as boolean operators (adhering to boolean algebra).

I have been trying to construct a simple OR expression, as follows

q = +(field1:value1 OR field2:value2)

hoping for a match on either field1 or field2. But since OR merely marks a
clause optional, documents where both field1:value1 and field2:value2 match
receive a score that reflects a match on both clauses.
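
For reference, Lucene parses that expression into one required group
containing two optional (SHOULD) clauses, so scoring accumulates over every
matching clause instead of stopping at the first match:

    +(field1:value1 field2:value2)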

How do I enforce short-circuiting in this context? In other words, how do I
implement short-circuiting as in boolean algebra, where an expression A || B
|| C returns true as soon as A is true, without even looking at whether B or
C could be true?
-Deepak


Multiple _val_ inside a lucene query.

2013-08-12 Thread Deepak Konidena
One of my previous mails to the group helped me simulate short-circuiting
OR behavior (thanks to Yonik) using:

_val_:"def(query(cond1,cond2,..))"

where, if cond1 matches, the query returns without evaluating the subsequent
conditions.
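
Concretely, the single-attribute version looks something like the sketch
below (the $c1/$c2 parameter names and the trailing 0 default are
placeholders for illustration, not the exact query):

    q=_val_:"def(query($c1),query($c2),0)"
    c1=field1:valueA
    c2=field1:valueB

def() returns the first argument that produces a value for the document, so
the later query() arguments are skipped once an earlier one matches.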

While it works for a single attribute, I am trying to extend it to achieve
the same behavior across multiple attributes. When I use multiple _val_
clauses, the query returns an error.

How do I make a query with multiple _val_ clauses?

-Deepak


Order of fields in a search query.

2013-08-30 Thread Deepak Konidena
Does the order of fields matter in a Lucene query?

For instance,

q = A && B && C

Let's say A appears in a million documents, B in 1, and C in 1000.

While the results would be identical irrespective of the order in which you
AND A, B and C, will the response times of the following queries differ in
any way?

C && B && A
A && B && C

Does Lucene/Solr pick the best query execution plan in terms of both space
and time for a given query?

-Deepak


Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
Hi,

I know that SolrCloud allows you to have multiple shards on different
machines (or a single machine). But it requires a ZooKeeper installation
for doing things like leader election, leader availability, etc.

While SolrCloud may be the ideal solution for my use case eventually, I'd
like to know if there's a way I can point my Solr instance to read Lucene
segments distributed across different disks attached to the same machine.

Thanks!

-Deepak


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
@Greg - Are you suggesting RAID as a replacement for Solr or making Solr
work with RAID? Could you elaborate more on the latter, if that's what you
meant? We make use of Solr's advanced text processing features, which would
be hard to replicate using RAID alone.


-Deepak



On Wed, Sep 11, 2013 at 12:11 PM, Greg Walters  wrote:

> Why not use some form of RAID for your index store? You'd get the
> performance benefit of multiple disks without the complexity of managing
> them via solr.
>
> Thanks,
> Greg
>
>
>
> -----Original Message-----
> From: Deepak Konidena [mailto:deepakk...@gmail.com]
> Sent: Wednesday, September 11, 2013 2:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Distributing lucene segments across multiple disks.
>
> Are you suggesting a multi-core setup, where all the cores share the same
> schema, and the cores lie on different disks?
>
> Basically, I'd like to know if I can distribute shards/segments on a
> single machine (with multiple disks) without the use of zookeeper.
>
>
>
>
>
> -Deepak
>
>
>
> On Wed, Sep 11, 2013 at 11:55 AM, Upayavira  wrote:
>
> > I think you'll find it hard to distribute different segments between
> > disks, as they are typically stored in the same directory.
> >
> > However, instantiating separate cores on different disks should be
> > straight-forward enough, and would give you a performance benefit.
> >
> > I've certainly heard of that done at Amazon, with a separate EBS
> > volume per core giving some performance improvement.
> >
> > Upayavira
> >
> > On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:
> > > Hi,
> > >
> > > I know that SolrCloud allows you to have multiple shards on
> > > different machines (or a single machine). But it requires a
> > > zookeeper installation for doing things like leader election, leader
> > > availability, etc
> > >
> > > While SolrCloud may be the ideal solution for my usecase eventually,
> > > I'd like to know if there's a way I can point my Solr instance to
> > > read lucene segments distributed across different disks attached to
> the same machine.
> > >
> > > Thanks!
> > >
> > > -Deepak
> >
>


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
Are you suggesting a multi-core setup, where all the cores share the same
schema, and the cores lie on different disks?

Basically, I'd like to know if I can distribute shards/segments on a single
machine (with multiple disks) without the use of zookeeper.





-Deepak



On Wed, Sep 11, 2013 at 11:55 AM, Upayavira  wrote:

> I think you'll find it hard to distribute different segments between
> disks, as they are typically stored in the same directory.
>
> However, instantiating separate cores on different disks should be
> straight-forward enough, and would give you a performance benefit.
>
> I've certainly heard of that done at Amazon, with a separate EBS volume
> per core giving some performance improvement.
>
> Upayavira
>
> On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:
> > Hi,
> >
> > I know that SolrCloud allows you to have multiple shards on different
> > machines (or a single machine). But it requires a zookeeper installation
> > for doing things like leader election, leader availability, etc
> >
> > While SolrCloud may be the ideal solution for my usecase eventually, I'd
> > like to know if there's a way I can point my Solr instance to read lucene
> > segments distributed across different disks attached to the same machine.
> >
> > Thanks!
> >
> > -Deepak
>


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
I guess at this point in the discussion, I should probably give some more
background on why I am doing what I am doing. Having a single Solr shard
(multiple segments) on the same disk is posing severe performance problems
under load, in that calls to Solr cause a lot of connection timeouts. When
we looked at the ganglia stats for the Solr box, we saw that while memory,
cpu and network usage were quite normal, the i/o wait spiked. We are unsure
what caused the i/o wait and why there were no spikes in cpu/memory usage.
Since the Solr box is a beefy box (multi-core setup, huge RAM, SSD), we'd
like to distribute the segments across multiple locations (disks) and see
whether this improves performance under load.

@Greg - Thanks for clarifying that.  I just learnt that I can't set them up
using RAID, as some of the disks are SSDs and others are SATA (spinning
disks).

@Shawn Heisey - Could you elaborate more about the "broker" core and
delegating the requests to other cores?


-Deepak



On Wed, Sep 11, 2013 at 1:10 PM, Shawn Heisey  wrote:

> On 9/11/2013 1:07 PM, Deepak Konidena wrote:
>
>> Are you suggesting a multi-core setup, where all the cores share the same
>> schema, and the cores lie on different disks?
>>
>> Basically, I'd like to know if I can distribute shards/segments on a single
>> machine (with multiple disks) without the use of zookeeper.
>>
>
> Sure, you can do it all manually.  At that point you would not be using
> SolrCloud at all, because the way to enable SolrCloud is to tell Solr where
> zookeeper lives.
>
> Without SolrCloud, there is no cluster automation at all.  There is no
> "collection" paradigm, you just have cores.  You have to send updates to
> the correct core; they will not be redirected for you.  Similarly, queries will
> not be load balanced automatically.  For Java clients, the CloudSolrServer
> object can work seamlessly when servers go down.  If you're not using
> SolrCloud, you can't use CloudSolrServer.
>
> You would be in charge of creating the shards parameter yourself.  The way
> that I do this on my index is that I have a "broker" core that has no index
> of its own, but its solrconfig.xml has the shards and shards.qt parameters
> in all the request handler definitions.  You can also include the parameter
> with the query.
>
> You would also have to handle redundancy yourself, either with replication
> or with independently updated indexes.  I use the latter method, because it
> offers a lot more flexibility than replication.
>
> As mentioned in another reply, setting up RAID with a lot of disks may be
> better than trying to split your index up on different filesystems that
> each reside on different disks.  I would recommend RAID10 for Solr, and it
> works best if it's hardware RAID and the controller has battery-backed (or
> NVRAM) cache.
>
> Thanks,
> Shawn
>
>
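
A rough sketch of the kind of handler definition described above for a
"broker" core (the host, core names and handler paths are illustrative
placeholders, not Shawn's actual configuration):

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="shards">localhost:8983/solr/core1,localhost:8983/solr/core2</str>
        <str name="shards.qt">/select</str>
      </lst>
    </requestHandler>

Each core listed in shards holds one slice of the index; the broker core
fans the query out and merges the results.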


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
@Greg - Thanks for the suggestion. Will pass it along to my folks.

@Shawn - That's the link I was looking for 'non-SolrCloud approach to
distributed search'. Thanks for passing that along. Will give it a try.

As far as RAM usage goes, I believe we set the heap size to about 40% of
the RAM, and less than 10% is available for OS caching (since the replica
takes another 40%). Why does unallocated RAM help? How does it impact
performance under load?


-Deepak



On Wed, Sep 11, 2013 at 2:50 PM, Shawn Heisey  wrote:

> On 9/11/2013 2:57 PM, Deepak Konidena wrote:
>
>> I guess at this point in the discussion, I should probably give some more
>> background on why I am doing what I am doing. Having a single Solr shard
>> (multiple segments) on the same disk is posing severe performance problems
>> under load, in that calls to Solr cause a lot of connection timeouts. When
>> we looked at the ganglia stats for the Solr box, we saw that while memory,
>> cpu and network usage were quite normal, the i/o wait spiked. We are unsure
>> what caused the i/o wait and why there were no spikes in cpu/memory usage.
>> Since the Solr box is a beefy box (multi-core setup, huge RAM, SSD), we'd
>> like to distribute the segments across multiple locations (disks) and see
>> whether this improves performance under load.
>>
>> @Greg - Thanks for clarifying that.  I just learnt that I can't set them up
>> using RAID, as some of the disks are SSDs and others are SATA (spinning
>> disks).
>>
>> @Shawn Heisey - Could you elaborate more about the "broker" core and
>> delegating the requests to other cores?
>>
>
> On the broker core - I have a core on my servers that has no index of its
> own.  In the /select handler (and others) I have placed a shards parameter,
> and many of them also have a shards.qt parameter.  The shards parameter is
> how a non-cloud distributed search is done.
>
> http://wiki.apache.org/solr/DistributedSearch
>
> Addressing your first paragraph: You say that you have lots of RAM ... but
> is there a lot of unallocated RAM that the OS can use for caching, or is it
> mostly allocated to processes, such as the java heap for Solr?
>
> Depending on exactly how your indexes are composed, you need up to 100% of
> the total index size available as unallocated RAM.  With SSD, the
> requirement is less, but cannot be ignored.  I personally wouldn't go below
> about 25-50% even with SSD, and I'd plan on 50-100% for regular disks.
>
> There is some evidence to suggest that you only need unallocated RAM equal
> to 10% of your index size for caching with SSD, but that is only likely to
> work if you have a lot of stored (as opposed to indexed) data.  If most of
> your index is unstored, then more would be required.
>
> Thanks,
> Shawn
>
>
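
As an illustration of the non-SolrCloud distributed search described above,
the shards parameter can also be supplied per request instead of in
solrconfig.xml; a hypothetical example against two local cores:

    http://localhost:8983/solr/broker/select
        ?q=field1:value1
        &shards=localhost:8983/solr/core1,localhost:8983/solr/core2

The core receiving the request queries each listed shard and merges the
responses into a single result set.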


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
Very helpful link. Thanks for sharing that.


-Deepak



On Wed, Sep 11, 2013 at 4:34 PM, Shawn Heisey  wrote:

> On 9/11/2013 4:16 PM, Deepak Konidena wrote:
>
>> As far as RAM usage goes, I believe we set the heap size to about 40% of
>> the RAM, and less than 10% is available for OS caching (since the replica
>> takes another 40%). Why does unallocated RAM help? How does it impact
>> performance under load?
>>
>
> Because once the data is in the OS disk cache, reading it is nearly
> instantaneous; it doesn't need to go out to the disk.  Disks are glacial
> compared to RAM.  Even SSD has a far slower response time.  Any recent
> operating system does this caching automatically, including the one from
> Redmond that we all love to hate.
>
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Thanks,
> Shawn
>
>
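
As a made-up sizing illustration of the advice in this thread: on a 64 GB
box holding a 50 GB index on SSD, one might deliberately cap the Solr heap
(the 8 GB figure below is an assumption, not a recommendation from the
thread) so that most of the RAM stays unallocated for the OS page cache:

    java -Xmx8g -jar start.jar

The roughly 50 GB left unallocated is what the OS uses to keep hot index
files cached in memory, which is what keeps i/o wait down under load.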