Re: Replication for SolrCloud

2015-04-18 Thread Jürgen Wagner (DVT)
Replication on the storage layer will provide reliable storage for the
index and other data of Solr. However, this replication does not
guarantee that your index files are consistent at any given time, as
there may be intermediate states that are only partially replicated.
Replication is only a convergent process, not an instant, atomic
operation. With frequent changes, this becomes an issue.

Replication inside SolrCloud as an application will not only maintain
the consistency of the search-level interfaces to your indexes, but also
scale in the sense of the application (query throughput).

Imagine a database: if you change one record, this may also result in an
index change. If the record and the index are stored in different
storage blocks, one will get replicated first. However, the replication
target will only be consistent again when both have been replicated. So,
you would have to suspend all accesses until the entire replication has
completed. That's undesirable. If you replicate on the application
(database management system) level, the application will employ a more
fine-grained approach to replication, guaranteeing application consistency.
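The inconsistency window described above can be sketched as a toy simulation (a hypothetical two-block store copied one block at a time, not real replication code):

```python
# Toy model: a "storage-level" replicator copies blocks one at a time.
# Between the two copies, a reader on the replica sees a record whose
# index entry does not yet match it.

primary = {"record": "v2", "index": "v2"}   # both updated together
replica = {"record": "v1", "index": "v1"}   # stale but self-consistent

def consistent(node):
    # application-level consistency: record and index must agree
    return node["record"] == node["index"]

states = []
for block in ("record", "index"):           # blocks replicate in some order
    replica[block] = primary[block]
    states.append(consistent(replica))

print(states)   # [False, True]: the intermediate state is inconsistent
```

Only the final, converged state is consistent; any read during the copy would have to be suspended, which is exactly the problem application-level replication avoids.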

Consequently, HDFS will allow you to scale storage and possibly even
replicate static indexes that won't change, but it won't help much with
live index replication. That's where SolrCloud jumps in.

Cheers,
--Jürgen

On 18.04.2015 08:44, gengmao wrote:
> I wonder why we need to use SolrCloud replication on HDFS at all, given that
> HDFS already provides replication and availability. The way to optimize
> performance and scalability should be tweaking shards, just like tweaking
> regions in HBase, which doesn't provide "region replication" either, does
> it?
>
> I have had this question for a while and haven't found a clear answer about
> it. Could some experts please explain a bit?
>
> Best regards,
> Mao Geng
>
>


-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com
, URL: www.devoteam.de



Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071




Re: Replication for SolrCloud

2015-04-18 Thread gengmao
On Sat, Apr 18, 2015 at 12:20 AM "Jürgen Wagner (DVT)" <
juergen.wag...@devoteam.com> wrote:

>  Replication on the storage layer will provide reliable storage for the
> index and other data of Solr. However, this replication does not
> guarantee that your index files are consistent at any given time, as there
> may be intermediate states that are only partially replicated. Replication
> is only a convergent process, not an instant, atomic operation. With
> frequent changes, this becomes an issue.
>
Firstly, thanks for your reply. However, I can't agree with you on this.
HDFS guarantees consistency even with replicas: you always read what
you write, and no partially replicated state will be read, which is
guaranteed by the HDFS server and client. Hence HBase can rely on HDFS for
consistency and availability without implementing another replication
mechanism, if I understand correctly.


> Replication inside SolrCloud as an application will not only maintain the
> consistency of the search-level interfaces to your indexes, but also scale
> in the sense of the application (query throughput).
>
Splitting one shard into two can increase the query throughput too.


> Imagine a database: if you change one record, this may also result in an
> index change. If the record and the index are stored in different storage
> blocks, one will get replicated first. However, the replication target will
> only be consistent again when both have been replicated. So, you would have
> to suspend all accesses until the entire replication has completed. That's
> undesirable. If you replicate on the application (database management
> system) level, the application will employ a more fine-grained approach to
> replication, guaranteeing application consistency.
>
In HBase, a region is located on a single region server at any time, which
guarantees its consistency. Because your reads and writes always go to one
region, there is no concern about parallel writes happening on multiple
replicas of the same region.
The replication of HDFS is totally transparent to HBase. When an HDFS write
call returns, HBase knows the data is written and replicated, so losing one
copy of the data won't impact HBase at all.
So HDFS means consistency and reliability for HBase. However, HBase doesn't
use replicas (either its own or HDFS's) to scale reads. If one
region is too "hot" for reads or writes, you split that region into two
regions, so that the reads and writes of that region can be distributed
across two region servers. Hence HBase scales.
I think this is the simplicity and beauty of HBase. Again, I am curious
whether SolrCloud has a better reason to use replication on HDFS. As I
described, HDFS provides consistency and reliability, while scalability can
be achieved via sharding, even without Solr replication.
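The split-to-scale point can be sketched with a toy key-range router (hypothetical key ranges and server names, not the HBase API):

```python
# Toy model of scaling by splitting: one hot key range is divided in two,
# so its load lands on two servers instead of one.

def route(key, regions):
    # regions: list of (low, high, server); route key to the owning region
    for low, high, server in regions:
        if low <= key < high:
            return server
    raise KeyError(key)

before = [(0, 100, "rs1")]                    # one hot region on one server
after  = [(0, 50, "rs1"), (50, 100, "rs2")]   # split across two servers

keys = range(0, 100)
load_before = {s: sum(1 for k in keys if route(k, before) == s)
               for s in ("rs1", "rs2")}
load_after  = {s: sum(1 for k in keys if route(k, after) == s)
               for s in ("rs1", "rs2")}
print(load_before, load_after)
```

After the split, each server carries half of the region's traffic; no replica of the data was needed to get there.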


> Consequently, HDFS will allow you to scale storage and possibly even
> replicate static indexes that won't change, but it won't help much with
> live index replication. That's where SolrCloud jumps in.
>

> Cheers,
> --Jürgen
>
>
> On 18.04.2015 08:44, gengmao wrote:
>
> I wonder why we need to use SolrCloud replication on HDFS at all, given that
> HDFS already provides replication and availability. The way to optimize
> performance and scalability should be tweaking shards, just like tweaking
> regions in HBase, which doesn't provide "region replication" either, does
> it?
>
> I have had this question for a while and haven't found a clear answer about
> it. Could some experts please explain a bit?
>
> Best regards,
> Mao Geng
>
>
>
>
>


Re: Replication for SolrCloud

2015-04-18 Thread Shalin Shekhar Mangar
Some comments inline:

On Sat, Apr 18, 2015 at 2:12 PM, gengmao  wrote:

> On Sat, Apr 18, 2015 at 12:20 AM "Jürgen Wagner (DVT)" <
> juergen.wag...@devoteam.com> wrote:
>
> >  Replication on the storage layer will provide reliable storage for the
> > index and other data of Solr. However, this replication does not
> > guarantee that your index files are consistent at any given time, as
> > there may be intermediate states that are only partially replicated.
> > Replication is only a convergent process, not an instant, atomic
> > operation. With frequent changes, this becomes an issue.
> >
> Firstly, thanks for your reply. However, I can't agree with you on this.
> HDFS guarantees consistency even with replicas: you always read what
> you write, and no partially replicated state will be read, which is
> guaranteed by the HDFS server and client. Hence HBase can rely on HDFS for
> consistency and availability without implementing another replication
> mechanism, if I understand correctly.
>
>
A Lucene index is not one file but a collection of files that are written
independently. So if you replicate them out of order, Lucene might consider
the index corrupted (because of missing files). I don't think HBase
works that way.
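A toy model of that multi-file property (hypothetical file names; real Lucene segment files differ) shows why copy order matters:

```python
# Toy model: an index is a manifest ("segments_2") naming its segment files.
# A file-level copy that ships the manifest before all the files it names
# leaves the replica with a manifest pointing at missing files, i.e. an
# index a reader would consider corrupt.

source = {
    "segments_2": ["_1.cfs", "_2.cfs"],   # manifest names two segment files
    "_1.cfs": "data1",
    "_2.cfs": "data2",
}

def openable(files):
    # an index opens only if the manifest and every file it names exist
    manifest = files.get("segments_2")
    return manifest is not None and all(f in files for f in manifest)

replica = {}
order = ["segments_2", "_1.cfs", "_2.cfs"]   # manifest happens to copy first
snapshots = []
for name in order:
    replica[name] = source[name]
    snapshots.append(openable(replica))

print(snapshots)   # [False, False, True]: only the final state is valid
```

Storage-level replication converges to a valid index, but any reader that opens the replica mid-copy sees the broken intermediate states.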


>
> > Replication inside SolrCloud as an application will not only maintain the
> > consistency of the search-level interfaces to your indexes, but also
> scale
> > in the sense of the application (query throughput).
> >
>  Splitting one shard into two can increase the query throughput too.
>
>
> > Imagine a database: if you change one record, this may also result in an
> > index change. If the record and the index are stored in different storage
> > blocks, one will get replicated first. However, the replication target
> > will only be consistent again when both have been replicated. So, you
> > would have to suspend all accesses until the entire replication has
> > completed. That's undesirable. If you replicate on the application
> > (database management system) level, the application will employ a more
> > fine-grained approach to replication, guaranteeing application
> > consistency.
> >
> In HBase, a region is located on a single region server at any time, which
> guarantees its consistency. Because your reads and writes always go to one
> region, there is no concern about parallel writes happening on multiple
> replicas of the same region.
> The replication of HDFS is totally transparent to HBase. When an HDFS write
> call returns, HBase knows the data is written and replicated, so losing one
> copy of the data won't impact HBase at all.
> So HDFS means consistency and reliability for HBase. However, HBase doesn't
> use replicas (either its own or HDFS's) to scale reads. If one
> region is too "hot" for reads or writes, you split that region into two
> regions, so that the reads and writes of that region can be distributed
> across two region servers. Hence HBase scales.
> I think this is the simplicity and beauty of HBase. Again, I am curious
> whether SolrCloud has a better reason to use replication on HDFS. As I
> described, HDFS provides consistency and reliability, while scalability
> can be achieved via sharding, even without Solr replication.
>
>
That's something that has been considered and may even be on the roadmap
for the Cloudera guys. See https://issues.apache.org/jira/browse/SOLR-6237

But one problem that isn't solved by HDFS replication is near-real-time
indexing, where you want documents to be available to searchers as fast
as possible. SolrCloud replication supports that by replicating documents
as they come in and indexing them on several replicas. A new index searcher
is opened on the flushed index files as well as on the internal data
structures of the index writer. If we switched to relying on HDFS
replication, this would be awfully expensive. However, as Jürgen mentioned,
HDFS can certainly help with replicating static indexes.
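The near-real-time difference can be sketched as a toy model (hypothetical structures, not Lucene internals):

```python
# Toy model: an NRT searcher sees flushed segments plus the writer's
# in-memory buffer; a searcher fed only by file-level replication sees
# flushed segments alone, so recent documents stay invisible until a flush
# and a full copy complete.

flushed = ["doc1", "doc2"]   # segments already written to disk
buffer = ["doc3"]            # indexed but not yet flushed

nrt_view = flushed + buffer        # what an NRT searcher reopen exposes
replicated_view = list(flushed)    # what a file copy of the index exposes

print("doc3" in nrt_view, "doc3" in replicated_view)   # True False
```

Getting `doc3` searchable via storage replication would require flushing and shipping new index files on every reopen, which is the "awfully expensive" part.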


>
> > Consequently, HDFS will allow you to scale storage and possibly even
> > replicate static indexes that won't change, but it won't help much with
> > live index replication. That's where SolrCloud jumps in.
> >
>
> > Cheers,
> > --Jürgen
> >
> >
> > On 18.04.2015 08:44, gengmao wrote:
> >
> > I wonder why we need to use SolrCloud replication on HDFS at all, given
> > that HDFS already provides replication and availability. The way to
> > optimize performance and scalability should be tweaking shards, just like
> > tweaking regions in HBase, which doesn't provide "region replication"
> > either, does it?
> >
> > I have had this question for a while and haven't found a clear answer
> > about it. Could some experts please explain a bit?
> >
> > Best regards,
> > Mao Geng
> >
> >
> >
> >
> >

Re: Solr 5.0, defaultSearchField, defaultOperator ?

2015-04-18 Thread Bruno Mannina

Thx Chris & Ahmet !

On 17/04/2015 at 23:56, Chris Hostetter wrote:

: df and q.op are the ones you are looking for.
: You can define them in defaults section.

specifically...

https://cwiki.apache.org/confluence/display/solr/InitParams+in+SolrConfig
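For reference, a minimal solrconfig.xml sketch along those lines (the handler paths and the field name `text` are placeholders; adjust them to your own schema and handlers):

```xml
<!-- Apply default df and q.op to the matching request handlers -->
<initParams path="/select,/query">
  <lst name="defaults">
    <str name="df">text</str>
    <str name="q.op">AND</str>
  </lst>
</initParams>
```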


:
: Ahmet
:
:
:
: On Friday, April 17, 2015 9:18 PM, Bruno Mannina  wrote:
: Dear Solr users,
:
: As of today I am using Solr 5.0 (I previously used Solr 3.6), so I am
: trying to adapt my old schema for Solr 5.0.
:
: I have two questions:
: - How can I set the defaultSearchField?
: I don't want to use the df tag in the query because I have a lot of
: modifications to make for that in my web project.
:
: - How can I set the defaultOperator (and|or)?
:
: It seems that these "options" are now deprecated in the Solr 5.0 schema.
:
: Thanks a lot for your comment,
:
: Regards,
: Bruno
:
: ---
: This e-mail contains no viruses or malware because avast! Antivirus
: protection is active.
: http://www.avast.com
:

-Hoss
http://www.lucidworks.com/






Re: JSON Facet & Analytics API in Solr 5.1

2015-04-18 Thread Yonik Seeley
Thank you everyone for the feedback!

I've implemented and committed the flatter structure:
https://issues.apache.org/jira/browse/SOLR-7422
So either form can now be used (and I'll be switching to the flatter
method for examples when it actually reduces the levels).

For those who want to try it out, I just made a 5.2-dev snapshot:
https://github.com/yonik/lucene-solr/releases

-Yonik


RE: HttpSolrServer and CloudSolrServer

2015-04-18 Thread Vijay Bhoomireddy
Thanks Andrea and Erick. It helped my understanding.

Thanks & Regards
Vijay

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 17 April 2015 17:27
To: solr-user@lucene.apache.org
Subject: Re: HttpSolrServer and CloudSolrServer

Additionally, when indexing, CloudSolrServer collects up the documents for each
shard and routes them to the leader for that shard, moving that processing away
from whatever node you happen to contact using HttpSolrServer.

Finally, HttpSolrServer is a single point of failure if the node you point to 
goes down, whereas CloudSolrServer will compensate if any node goes down.

Best,
Erick

On Fri, Apr 17, 2015 at 2:39 AM, Andrea Gazzarini  wrote:
> If you're using SolrCloud then you should use CloudSolrServer as it is 
> able to abstract / hide the interaction with the cluster. 
> HttpSolrServer communicates directly with a Solr instance.
>
> Best,
> Andrea
>
>
> On 04/17/2015 10:59 AM, Vijay Bhoomireddy wrote:
>>
>> Hi All,
>>
>>
>> Good Morning!!
>>
>>
>> For SolrCloud deployment, for indexing data through SolrJ, which is 
>> the preferred / correct SolrServer class to use? HttpSolrServer of 
>> CloudSolrServer? In case both can be used, when to use which? Any 
>> help please.
>>
>>
>> Thanks & Regards
>>
>> Vijay
>>
>>
>>
>


-- 
The contents of this e-mail are confidential and for the exclusive use of 
the intended recipient. If you receive this e-mail in error please delete 
it from your system immediately and notify us either by e-mail or 
telephone. You should not copy, forward or otherwise disclose the content 
of the e-mail. The views expressed in this communication may not 
necessarily be the view held by WHISHWORKS.


Re: JSON Facet & Analytics API in Solr 5.1

2015-04-18 Thread Yonik Seeley
Another minor benefit of the flatter structure is that the "smart
merging" of multiple JSON parameters works a little better in
conjunction with facets.

For example, if you already had a "top_genre" facet, you could insert
a "top_author" facet more easily:

json.facet.top_genre.facet.top_author={type:terms, field:author, limit:5}

(For anyone who doesn't know what "smart merging" is,  see
http://yonik.com/solr-json-request-api/ )
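For comparison, a sketch of the same nesting written as a single json.facet body (the parameters of the existing "top_genre" facet are assumed here for illustration):

```json
{
  "top_genre": {
    "type": "terms",
    "field": "genre",
    "limit": 5,
    "facet": {
      "top_author": {"type": "terms", "field": "author", "limit": 5}
    }
  }
}
```

The flat parameter path reaches into this structure and adds the "top_author" sub-facet without restating the enclosing levels.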

-Yonik


On Sat, Apr 18, 2015 at 11:36 AM, Yonik Seeley  wrote:
> Thank you everyone for the feedback!
>
> I've implemented and committed the flatter structure:
> https://issues.apache.org/jira/browse/SOLR-7422
> So either form can now be used (and I'll be switching to the flatter
> method for examples when it actually reduces the levels).
>
> For those who want to try it out, I just made a 5.2-dev snapshot:
> https://github.com/yonik/lucene-solr/releases
>
> -Yonik


Re: Replication for SolrCloud

2015-04-18 Thread Erick Erickson
AFAIK, the HDFS replication of Solr indexes isn't something that was
designed; it just came along for the ride with HDFS replication.
Having a shard with one leader and two followers keep nine copies of the
index around (three Solr replicas times an HDFS replication factor of 3)
_is_ overkill; nobody disputes that at all.

I know the folks at Cloudera (who contributed the original HDFS
implementation) have discussed various options around this. In the
grand scheme of things, there have been other priorities, so nobody has
torn into the guts of Solr and/or HDFS, since disk space is
relatively cheap.

That said, I'm also sure this will get some attention as
priorities change. All patches welcome, of course ;). But if you're
inclined to work on this issue, I'd _really_ discuss it with Mark
Miller etc. before investing too much effort in it. I don't know the
tradeoffs well enough to have an opinion on the right
implementation.

Best
Erick

On Sat, Apr 18, 2015 at 1:59 AM, Shalin Shekhar Mangar
 wrote:

Re: JSON Facet & Analytics API in Solr 5.1

2015-04-18 Thread Lukáš Vlček
Late here, but let me add one more thing: IIRC, the recommendation for JSON
is to never use data as keys in objects. One of the benefits of not using
data as keys in JSON is easier validation using JSON Schema. If one wants
to validate a JSON query for Elasticsearch today, it is necessary to
implement a custom parser (and a grammar first, of course).

Lukas

On Sat, Apr 18, 2015 at 11:46 PM, Yonik Seeley  wrote:

> Another minor benefit of the flatter structure is that the "smart
> merging" of multiple JSON parameters works a little better in
> conjunction with facets.
>
> For example, if you already had a "top_genre" facet, you could insert
> a "top_author" facet more easily:
>
> json.facet.top_genre.facet.top_author={type:terms, field:author, limit:5}
>
> (For anyone who doesn't know what "smart merging" is,  see
> http://yonik.com/solr-json-request-api/ )
>
> -Yonik


Re: JSON Facet & Analytics API in Solr 5.1

2015-04-18 Thread Lukáš Vlček
Oh... and by the way, I think the readability of the JSON will be less and
less important going forward. Queries will grow in size anyway (due to
nested facets), and the ability to quickly validate a query using some
parser will be more useful and practical than relying on the human eye to
do the check.

I assume that both ES and Solr will end up having some higher-level
language for people to express queries and facets/aggregations in readable
form (anyone remember SQL?), and this will be transformed to JSON (or
another native) format down the road. In my opinion, the most important
thing for any non-trivial JSON-based language format now is to make sure it
is parser-friendly and that grammars can be defined easily for it.

On Sun, Apr 19, 2015 at 8:09 AM, Lukáš Vlček  wrote:

> Late here but let me add one more thing: IIRC the recommendation for JSON
> is to never use data as a key in objects. One of the benefits of not using
> data as a keys in JSON is easier validation using JSON schema. If one wants
> to validate JSON query for Elasticsearch today it is necessary to implement
> custom parser (and grammar first of course).
>
> Lukas