TermsComponent/SolrCloud

2012-11-22 Thread Federico Méndez
Does anyone know if the TermsComponent supports distributed search through a
SolrCloud installation? I have a SolrCloud installation that works OK for
regular searches, but the TermsComponent returns empty results when using
[collectionName]/terms?terms.fl=collector_name&terms.prefix=jo. The request
handler configuration is:

<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">true</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>


Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Luis Cappa Banda
Hello,

I've been dealing with the same question these days. In architectural terms,
it's always better to separate services (Solr and Zookeeper, in this case)
than to keep them in a single instance. However, when we have to deal
with cost issues, we are all quite limited and must choose the best option in
terms of architecture, scalability, and single points of failure. As I see it,
the options are:


*1. *Solr servers with Zookeeper embedded.
*2. *Solr servers with external Zookeeper.
*3.* Solr servers with external Zookeeper ensemble.

*Note*: as far as I know, the recommended number of Zookeeper services to
avoid single points of failure is: *ZkNum = 2 * NumShards - 1*. If you have


The best option is the third one. Reasons:

*1. *If one of your Solr servers goes down, the Zookeeper services stay up.
*2.* If one of your Zookeeper services goes down, the Solr servers and the rest
of the Zookeeper services stay up.

Considering that option, we have two ways to implement it in production:

*1. *Each service (Solr and Zookeeper) on separate machines. Let's imagine
that we have 2 shards for a given collection, so we need at least 4 Solr
servers to complete the leader-replica configuration. The best option is to
deploy them on four Amazon instances, one per server. We also need at least
3 Zookeeper services in a Zookeeper ensemble configuration. The optimal way
to install them is on separate machines (a micro instance would be fine for
Zookeeper), so we would have 7 Amazon instances. The reason is that if one
machine goes down (Solr or Zookeeper), the other services stay up
and your production environment remains safe. However, *for me this is the
best case, but it's also the most expensive one*, so in my case it is
impossible to realize.

*2. *As we need at least 4 Solr servers and 3 Zookeeper services up, I
would install three Amazon instances with both Solr and Zookeeper, and one
with only Solr. So we'll have 3 complete Amazon instances (Solr +
Zookeeper) and 1 Amazon instance with only Solr. If one of them goes
down, the production environment will still be safe. This architecture is not
the best one, as I said, but I think it is optimal in terms of
robustness, single point of failure, and cost.
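
For reference, a minimal zoo.cfg for such a 3-node ensemble would look
roughly like this (hostnames are made up, and each server also needs a
matching myid file under dataDir):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888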


It would be a pleasure to hear new suggestions from other people who have
dealt with this kind of issue.

Regards,


- Luis Cappa.


2012/11/21 Marcin Rzewucki 

> Yes, I meant the same (not -zkRun). However, I was asking if it is safe to
> have zookeeper and solr processes running on the same node or better on
> different machines?
>
> On 21 November 2012 21:18, Rafał Kuć  wrote:
>
> > Hello!
> >
> > As I said, I wouldn't use the Zookeeper that is embedded in Solr, but
> > rather set up a standalone one.
> >
> > --
> > Regards,
> >  Rafał Kuć
> >  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
> ElasticSearch
> >
> > > First of all: thank you for your answers. Yes, I meant side by side
> > > configuration. I think the worst case for ZKs here is to lose two of
> > them.
> > > However, I'm going to use 4 availability zones in the same region so at
> least
> > > this will reduce the risk of losing both of them at the same time.
> > > Regards.
> >
> > > On 21 November 2012 17:06, Rafał Kuć  wrote:
> >
> > >> Hello!
> > >>
> > >> Zookeeper by itself is not demanding, but if something happens to your
> > >> nodes that have Solr on it, you'll lose ZooKeeper too if you have
> > >> them installed side by side. However, if you have 4 Solr nodes and
> > >> 3 ZK instances, you can run them side by side.
> > >>
> > >> --
> > >> Regards,
> > >>  Rafał Kuć
> > >>  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
> > ElasticSearch
> > >>
> > >> > Separate is generally nice because then you can restart Solr nodes
> > >> > without consideration for ZooKeeper.
> > >>
> > >> > Performance-wise, I doubt it's a big deal either way.
> > >>
> > >> > - Mark
> > >>
> > >> > On Nov 21, 2012, at 8:54 AM, Marcin Rzewucki 
> > >> wrote:
> > >>
> > >> >> Hi,
> > >> >>
> > >> >> I have 4 solr collections, 2-3mn documents per collection, up to
> 100K
> > >> >> updates per collection daily (roughly). I'm going to create
> > SolrCloud4x
> > >> on
> > >> >> Amazon's m1.large instances (7GB mem,2x2.4GHz cpu each). The
> > question is
> > >> >> what about zookeeper? It's going to be external ensemble, but is it
> > >> better
> > >> >> to use the same nodes as solr or dedicated micro instances? Zookeeper
> > does
> > >> not
> > >> >> seem to be a resource-demanding process, but what would be better in
> > this
> > >> >> case ? To keep it inside of solrcloud or separately (micro
> instances
> > >> seem
> > >> >> to be enough here) ?
> > >> >>
> > >> >> Thanks in advance.
> > >> >> Regards.
> > >>
> > >>
> >
> >
>



-- 

- Luis Cappa


Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Marcin Rzewucki
Yes, this is exactly my case. I prefer the 3rd option too. As I have 2 more
instances available for my purposes (SolrCloud4x + 2 more instances for
loading), it will be easier to configure the zookeeper ensemble (I can use
those 2 additional machines + 1 from SolrCloud) and avoid purchasing and
maintaining more instances.


Re: TermsComponent/SolrCloud

2012-11-22 Thread Tomás Fernández Löbbe
Hi Federico, it should work. Make sure you set the "shards.qt" parameter
too (in your case, it should be shards.qt=/terms)
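
For example, with the handler registered at /terms as in your configuration,
the full distributed request would look something like:

[collectionName]/terms?terms.fl=collector_name&terms.prefix=jo&shards.qt=/terms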




Re: How to use eDismax query parser on a non tokenized field

2012-11-22 Thread Tomás Fernández Löbbe
You can either escape the whitespace with "\" or search as a phrase.

fieldNonTokenized:foo\ bar
...or...
fieldNonTokenized:"foo bar"
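
As a rough illustration, in an actual request URL the backslash, space, and
quotes have to be percent-encoded, so the two forms become:

q=fieldNonTokenized:foo%5C%20bar
q=fieldNonTokenized:%22foo%20bar%22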


On Thu, Nov 22, 2012 at 9:08 AM, Varun Thacker
wrote:

> I have indexed documents using a fieldType which does not break the word
> up. I confirmed this by looking up the index in luke. I can see that the
> words haven't been tokenized.
>
> I use a search handler which uses edismax query parser for searching.
> According to the wiki also
> http://wiki.apache.org/solr/ExtendedDisMax#Query_Structure Extended DisMax
> breaks up the query string into words before searching. Thus no results
> show up.
>
> Example for q=foo bar:
> In the index : fieldNonTokenized:foo bar
>
> And when searching this is the final query getting made
> is: ((fieldNonTokenized:foo:foo)~0.01 (fieldNonTokenized:foo:bar)~0.01)~1
>
> Thus no document matches and returns no result. I can understand why this
> is happening. Is there any way where I can say that the query string should
> not be broken up into words?
>
> --
>
>
> Regards,
> Varun Thacker
> http://www.vthacker.in/
>


Re: TermsComponent/SolrCloud

2012-11-22 Thread Federico Méndez
Thanks Tomás, your suggestion worked!!

<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">true</bool>
    <str name="shards.qt">/terms</str>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>




Re: Suggester for numbers

2012-11-22 Thread Gustav
Hello Illu,
Here you go:

[the schema and config XML for the suggester did not survive the list archive]



Re: SolrCloud and external file fields

2012-11-22 Thread Martin Koch
Mikhail

To avoid freezes we deployed the patches that are now on the 4.1 trunk (bug
3985). But this wasn't good enough, because SOLR would still take a very long
time to restart when that was necessary.

I don't see how we could throw more hardware at the problem without making
it worse, really - the only solution here would be *fewer* shards, not
more.

IMO it would be ideal if the lucene/solr community could come up with a
good way of updating fields in a document without reindexing. This could be
by linking to some external data store, or in the lucene/solr internals. If
it would make things easier, a good first step would be to have dynamically
updateable numerical fields only.

/Martin

On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Martin,
>
> I don't think solrconfig.xml will shed any light on it. I've just found what I
> didn't get in your setup - how to explicitly assign a core to a
> collection. Now I've realized most of the details after all!
> The ball is on your side: let us know whether you have managed to get your
> cores to commit one by one to avoid the freeze, or whether you could
> eliminate the pauses by allocating more hardware?
> Thanks in advance!
>
>
> On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch  wrote:
>
> > Mikhail,
> >
> > PSB
> >
> > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
> > mkhlud...@griddynamics.com> wrote:
> >
> > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch  wrote:
> > >
> > > >
> > > > I wasn't aware until now that it is possible to send a commit to one
> > core
> > > > only. What we observed was the effect of curl
> > > > localhost:8080/solr/update?commit=true but perhaps we should
> experiment
> > > > with solr/coreN/update?commit=true. A quick trial run seems to
> indicate
> > > > that a commit to a single core causes commits on all cores.
> > > >
> > > You should see something like this in the log:
> > > ... SolrCmdDistributor  Distrib commit to: ...
> > >
> > > Yup, a commit towards a single core results in a commit on all cores.
> >
> >
> > > >
> > > >
> > > > Perhaps I should clarify that we are using SOLR as a black box; we do
> > not
> > > > touch the code at all - we only install the distribution WAR file and
> > > > proceed from there.
> > > >
> > > I still don't understand how you deploy/launch Solr. How many jettys
> you
> > > start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the
> > > shards= param for every request and distribute updates yourself? What
> > > collections do you create and with which settings?
> > >
> > > We let SOLR do the sharding using one collection with 16 SOLR cores
> > holding one shard each. We launch only one instance of jetty with the
> > following arguments:
> >
> > -DnumShards=16
> > -DzkHost=
> > -Xmx10G
> > -Xms10G
> > -Xmn2G
> > -server
> >
> > Would you like to see the solrconfig.xml?
> >
> > /Martin
> >
> >
> > > >
> > > >
> > > > > Also, from my POV such deployments should start from at least *16*
> > 4-way
> > > > > vboxes; it's more expensive, but availability should be much better
> > during
> > > > > cpu-consuming operations.
> > > > >
> > > >
> > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
> > > with
> > > > 16 cores? Or am I misunderstanding something :) ?
> > > >
> > > I prefer to start from 16 hosts with 4 cores each.
> > >
> > >
> > > >
> > > >
> > > > > Other details, if you use single jetty for all of them, are you
> sure
> > > that
> > > > > jetty's threadpool doesn't limit requests? is it large enough?
> > > > > You have 60G and set -Xmx=10G. are you sure that total size of
> cores
> > > > index
> > > > > directories is less than 45G?
> > > > >
> > > > > The total index size is 230 GB, so it won't fit in ram, but we're
> > using
> > > > an
> > > > SSD disk to minimize disk access time. We have tried putting the EFF
> > > onto a
> > > > ram disk, but this didn't have a measurable effect.
> > > >
> > > > Thanks,
> > > > /Martin
> > > >
> > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch 
> wrote:
> > > > >
> > > > > > Mikhail
> > > > > >
> > > > > > PSB
> > > > > >
> > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > > > > > mkhlud...@griddynamics.com> wrote:
> > > > > >
> > > > > > > Martin,
> > > > > > >
> > > > > > > Please find additional question from me below.
> > > > > > >
> > > > > > > Simone,
> > > > > > >
> > > > > > > I'm sorry for hijacking your thread. The only thing I've heard
> > about
> > > > it
> > > > > at
> > > > > > > recent ApacheCon sessions is that Zookeeper is supposed to
> > > replicate
> > > > > > those
> > > > > > > files as configs under solr home. And I'm really looking
> forward
> > to
> > > > > know
> > > > > > > how it works with huge files in production.
> > > > > > >
> > > > > > > Thank You, Guys!
> > > > > > >
> > > > > > > 20.11.2012 18:06 пользователь "Martin Koch" 
> > > написал:
> > > > > > > >
> > > > > > > > Hi Mikhail
> > > > > > > >
> > > 

Performance improvement for solr faceting on large index

2012-11-22 Thread Pravin Agrawal
Hi All,

We are using solr 3.4 with the following schema fields.

---

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="..."/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^([0-9. ])*$" replacement="" replace="all"/>
  </analyzer>
  <analyzer type="query">
    ...
  </analyzer>
</fieldType>

<field name="autoSuggestContent" type="..." indexed="true" multiValued="true"/>
<field name="site" type="..." indexed="true"/>
...

---

The index on the above schema is distributed across two solr shards, each with
an index of about 1.2 million documents and about 195GB on disk per shard.

We want to retrieve (site, autoSuggestContent term, frequency of the term)
information from our main solr index above. The site is a field in the document
and contains the name of the site to which that document belongs. The terms are
retrieved from the multivalued field autoSuggestContent, which is created using
shingles from the content and title of the web page.

As of now, we are using a facet query to retrieve (term, frequency of term) for
each site. Below is a sample query (you may ignore the initial part of the query):

http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index

The problem is that as the index grows, this method has started taking a huge
amount of time. It used to take 7 minutes per site with an index of
0.4 million docs, but takes around 60-90 minutes with an index of 2.5
million. At this speed, it will take around 5-6 days to process all 1500
sites. We also expect the index size to grow with more documents and more
sites, and as such the time to get the above information will increase further.

Please let us know if there is any better way to extract the (site, term,
frequency) information compared to the current method.

Thanks,
Pravin Agrawal






Re: Performance improvement for solr faceting on large index

2012-11-22 Thread Yuval Dotan
You could always try the fc facet method and maybe increase the filterCache
size.
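
Concretely, the first suggestion means switching facet.method=enum to
facet.method=fc in the query above; the second helps the enum method, which
caches a filter per unique term, and means enlarging the filterCache in
solrconfig.xml. A sketch, with sizes that are only placeholders to tune:

...&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=fc&facet.sort=index

<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="4096"
             autowarmCount="0"/>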



Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Jack Krupansky
That's a tradeoff for you to make based on your own requirements, but the 
point is that it is LESS SAFE to run zookeeper on the same machine as a Solr 
instance.


Also keep in mind that the goal is to have at least THREE zookeeper 
instances running at any moment, so if you run zookeeper on the same machine 
as a Solr instance, you will need more than three zookeepers. Figure three 
plus the MAXIMUM number of Solr nodes that you expect could be down 
simultaneously.


Also keep in mind that SolrCloud is about scaling,  but the intention is NOT 
to scale the zookeeper ensemble linearly with the number of Solr nodes. That 
means you would have to deal with the messiness of sometimes running 
zookeeper with Solr and sometimes not. So, unless you are running a very 
small SolrCloud cluster, you are much better off keeping zookeeper off your 
Solr machines.


The intent is that there will be a relatively small "ensemble" of zookeepers 
that service a large "army" or "armada" of Solr nodes.


-- Jack Krupansky


Re: From Solr3.1 to SolrCloud

2012-11-22 Thread roySolr
I run a separate Zookeeper instance right now. Works great, nodes are visible
in admin.

Two more questions:

- I change my synonyms.txt on a solr node. How can I get zookeeper and the
other solr nodes in sync without a restart?

- I read some more about the zookeeper ensemble. When I need to run with 4
solr nodes (replicas), I need 3 zookeepers in the ensemble (>50% live). When
zookeeper and solr are separated, it takes 7 servers to get it live. In
the past we only needed 4 servers. Are there some other options? Otherwise the
costs will grow, and 3 zookeeper servers sounds like overkill.

Thanks





Re: Partial results with not enough hits

2012-11-22 Thread Otis Gospodnetic
Hi,

Maybe your goal should be to make your queries faster instead of fighting
with timeouts which are known not to work well.

What is your hardware like?
How about your queries?
What do you see in debugQuery=true output?

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 21, 2012 6:04 PM, "Aleksey Vorona"  wrote:

> In all of my queries I have timeAllowed parameter. My application is ready
> for partial results. However, whenever Solr returns partial result it is a
> very bad result.
>
> For example, I have a test query and here its execution log with the
> strict time allowed:
> WARNING: Query: ; Elapsed time: 120. Exceeded allowed search
> time: 100 ms.
> INFO: [] webapp=/solr path=/select params={&timeAllowed=100}
> hits=189 status=0 QTime=119
> Here it is without such a strict limitation:
> INFO: [] webapp=/solr path=/select params={&timeAllowed=1}
> hits=582 status=0 QTime=124
>
> The total execution time is different by mere 5 ms, but the partial result
> has only about 1/3 of the full result.
>
> Is it the expected behaviour? Does that mean I can never rely on the
> partial results?
>
> I added timeAllowed to protect from too expensive wide queries, but I
> still want to return something relevant to the user. This query returned
> 30% of the full result, but I have other queries in the log where partial
> result is just empty. Am I doing something wrong?
>
> P.S. I am using Solr 3.6.1, index size is 3Gb and easily fits in memory.
> Load Average on the Solr box is very low.
>
> -- Aleksey
>


Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Otis Gospodnetic
If your Solr instances don't max out your ec2 instances you should be fine.
But maybe even micro instances will suffice. Or 1 on demand and 2 spot
ones. If cost is the concern, that is.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm


Re: SolrCloud and external file fields

2012-11-22 Thread Yonik Seeley
On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch  wrote:
> around 7M documents in the index; each document has a 45 character ID.

7M documents isn't that large.  Is there a reason why you need so many
shards (16 in your case) on a single box?

-Yonik
http://lucidworks.com


Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Jack Krupansky
That is an interesting point - what size of instance is needed for a 
zookeeper. Can it run well in a micro?


Another issue I wanted to raise is that maybe questions, advice, and 
guidelines should be relative to the "shirt size" of your cluster - small, 
medium, or large. SolrCloud is clearly more optimized for medium to large 
clusters. Sure, you can use it for small clusters, but then some of the 
features and guidance do seem like overkill. Nonetheless, I would hate to 
see anybody take the compromised guidance for very small clusters (3 or 4 
machines) and apply it to even medium-size clusters (10 to 20 machines), let 
alone large clusters (dozens to 100 or more machines).


-- Jack Krupansky


Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Shawn Heisey


I've never used SolrCloud, so this is all speculation based on what I've 
been reading.  That has been mostly on this list, but also on dev@l.o 
and the IRC channel.


I have a four-node Solr 3.5 deployment with about 80 million documents 
(130GB) in the distributed index.  I think of my installation as small.  
Others might disagree with my opinion, but I know there are a lot of 
indexes out there that make mine look tiny.


If I needed to set a similarly small setup with SolrCloud on four Solr 
servers, what I would pitch to management would be one extra machine 
(cheap, 1U, low-end processor, etc) to act as a standalone zookeeper 
node.  For the other two zookeeper instances, I would run standalone 
zookeeper (separate JVM from Solr) on two of the Solr servers.  I might 
ask for a small boost in RAM and/or CPU on the two servers that serve 
double-duty.  I would not run zookeeper in the same JVM as Solr.


With a little bit of growth in the cluster, I would ask for a second 
standalone zookeeper node, pulling zookeeper off one of the Solr 
servers.  If it continued to grow, then I would ask for the third.  I 
would leave blank spots in the rack for those standalone servers.


Thanks,
Shawn



Re: How to get a list of servers per collection in sorlcloud using java api?

2012-11-22 Thread Luis Cappa Banda
Hello, Joe.

Try something like this using the SolrJ library:

// Your Solr server endpoints, e.g. http://localhost:8080/solr/core1
String[] endpoints = ...;
// Your Zookeeper endpoint(s), e.g. localhost:9000
String zookeeperEndpoints = ...;
// Your collection name, e.g. core1
String collectionName = ...;

LBHttpSolrServer lbSolrServer = new LBHttpSolrServer(endpoints);
CloudSolrServer cloudSolrServer = new CloudSolrServer(zookeeperEndpoints, lbSolrServer);
cloudSolrServer.setDefaultCollection(collectionName);


You have now created a CloudSolrServer instance which can manage Solr
server operations: add a new document, delete, update, etc.

Regards,


 - Luis Cappa.

2012/11/22 joe.cohe...@gmail.com 

> I want to write a function that will go through all the servers that store
> a
> specific collection and perform a task on it, say a RELOAD CORE task.
> How can I get a list of all solr servers/urls that run a specific
> collection?



-- 

- Luis Cappa


Re: From Solr3.1 to SolrCloud

2012-11-22 Thread Tomás Fernández Löbbe
>
> - I change my synonyms.txt on a solr node. How can i get zookeeper in sync
> and the other solr nodes without restart?
>

Well, you can upload the whole collection configuration again with zkClient
(included in the "cloud-scripts" directory); see
http://wiki.apache.org/solr/SolrCloud#Getting_your_Configuration_Files_into_ZooKeeper
Another option, if you only want to upload one file, is to write something
that communicates with zk through any of its APIs. I did this before Solr's
"zkClient" was committed and it is quite simple. Then you can reload the
collection, which is like reloading all the cores of the collection on the
different nodes. See
http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
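
Putting the two steps together, that is roughly (paths and names are
placeholders):

$ ./cloud-scripts/zkcli.sh -cmd upconfig -z <zk host> -confdir <config dir> -confname <config name>
$ curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=<collection>'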

>
> - I read something more about zookeeper ensemble. When i need to run with 4
> solr nodes(replicas) i need 3 zookeepers in ensemble(50% live). When
> zookeeper and solr are separated it will takes 7 servers to get it live. In
> the past we only needed 4 servers. Are there some other options because the
> costs will grow? 3 zookeeper servers sounds like overkill.
>

The number of Solr instances doesn't determine the number of ZK
instances that you need to run. You can effectively run with only one zk
instance; the problem with this is that if that instance dies, your
whole cluster will go down. So you can increase the number of zk instances.
When you create your Zookeeper ensemble, you declare its size (the
number of zk instances it will contain). When you run that ensemble,
Zookeeper requires that N/2+1 of the servers are connected. This means that
if you want your zk ensemble to survive one instance dying, you'll need at
least 3 ZK instances (if you have 2 and one dies, you still need 2 for a
quorum, so it won't work).

There has been some discussions these days in the list about this, but if
the number of physical servers is too much for you, you could run on the
same physical machine an instance of Solr and ZK.

Tomás


Re: Solr Cloud Zookeeper Namespace

2012-11-22 Thread Tomás Fernández Löbbe
You could use Zookeeper's chroot:
http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices

You can use chroot in Solr by specifying it in the zkHost parameter, for
example -DzkHost=localhost:2181/namespace1

In order for this to work, you need to first create the initial path (in
the example above, you should create /namespace1 in zookeeper before
starting Solr)
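
Creating that initial path can be done with ZooKeeper's own CLI, for example:

$ ./zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] create /namespace1 ""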

Tomás


On Thu, Nov 22, 2012 at 2:08 PM, Sandopolus  wrote:

> Is it possible with Solr Cloud 4.0 to specify a namespace for zookeeper so
> that you can run completely isolated Solr Cloud Clusters.
>
> There is the collection.configName property, which puts specific items into
> sub nodes for that collection, but certain things, like clusterstate.json,
> are still shared and in the root directory in Zookeeper.
> What I am looking for is a property which allows me to prepend a namespace to
> all nodes in Zookeeper that Solr Cloud inserts.
>
> Does anyone know if this exists?
>


Re: How to get a list of servers per collection in sorlcloud using java api?

2012-11-22 Thread Luis Cappa Banda
Hello,

As far as I know, you cannot do that at the moment, :-/

Regards,


 - Luis Cappa.


2012/11/22 joe.cohe...@gmail.com 

> Thanks Rakudten.
> I had my question mis-phrased.
> What I need is being able to get the solr servers storing a collection by
> giving the zookeeper server as an input.
>
> something like:
>
> // returns a list of solr servers in the zookeeper ensemble that store the
> given collection
> List getServers(String zkhost, String collectionName)
>



-- 

- Luis Cappa


Re: is there a way to prevent abusing rows parameter

2012-11-22 Thread solr-user
Thanks guys. This is a problem with the front end not validating requests.
I was hoping there might be a simple config value I could enter/change,
rather than going through the long process of migrating a proper fix all the
way up to our production servers. Looks like not, but thx.





Re: Partial results with not enough hits

2012-11-22 Thread Aleksey Vorona

Thank you!

That seems to be the case. I tried executing queries without sorting
and with only one document in the response, and I got execution times in the
same range as before.


-- Aleksey

On 12-11-21 04:07 PM, Jack Krupansky wrote:

It could be that the time to get set up to return even the first result is
high and then each additional document is a minimal increment in time.

Do a query with &rows=1 (or even 0) and see what the minimum query time is
for your query, index, and environment.

-- Jack Krupansky







Re: Partial results with not enough hits

2012-11-22 Thread Aleksey Vorona

Thanks for the response.

I have increased the timeout and it did not increase execution time or
system load. It really is that I misused the timeout.

Just to give you a bit of perspective, we added the timeout to guarantee
some level of QoS from the search engine. Our UI allows the user to
construct very complex queries, and (what is worse) the user doesn't always
really understand what she needs. That may become a problem if we have
lots of users doing that. In this case I do not want to run such a
complex query for seconds; I want to return some result with a warning
to the user that she is doing something wrong. But clearly I set the
timeout too low for that and started to harm even normal queries.

Anyway, thanks everyone for the replies. The issue is fixed and I now
understand much better how the timeout works (which was the reason to post
to this list). Thanks!


-- Aleksey



SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
Hello everyone.

I've started to seriously worry about SolrCloud due to a strange
behavior that I have detected. The situation is the following:

*1.* SolrCloud with one shard and two Solr instances.
*2.* Indexing via SolrJ with CloudSolrServer and a custom
BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute atomic
updates correctly. Check
SOLR-4080.
*3.* An asynchronous process partially updates some document fields. After
that operation I automatically execute a commit, so the index must be
reloaded.

What I have checked is that, both using atomic updates and complete document
reindexations, *random documents are not updated*, *even though I saw while
debugging that the add() and commit() operations executed correctly* *and
without errors*. Has anyone experienced a similar behavior? Is it possible
that if an index update operation didn't finish and CloudSolrServer receives
a new one, this second update operation doesn't complete?
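
For reference, a partial update of the kind described here is built roughly
like this with SolrJ 4.0's atomic update syntax (field names are invented for
illustration):

import java.util.Collections;
import org.apache.solr.common.SolrInputDocument;

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
// A map keyed by "set" replaces the field's value ("add" appends, "inc" increments)
doc.addField("status", Collections.singletonMap("set", "processed"));
cloudSolrServer.add(doc);
cloudSolrServer.commit();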

Thank you in advance.

Regards,

-- 

- Luis Cappa


Re: How to get a list of servers per collection in sorlcloud using java api?

2012-11-22 Thread Sami Siren

You can use ZkStateReader (#getClusterState) to get this info.

--
 Sami Siren
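
A rough sketch of that approach (written from memory against the Solr 4.0
SolrJ API, so exact method names and signatures may need adjusting):

import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

// Connect to ZooKeeper and pull down the current cluster state
ZkStateReader reader = new ZkStateReader(zkHost, 10000, 10000);
reader.createClusterStateWatchersAndUpdate();
ClusterState state = reader.getClusterState();

// Walk every shard of the collection and print the base URL of each replica
for (Slice slice : state.getSlices(collectionName).values()) {
  for (Replica replica : slice.getReplicas()) {
    System.out.println(replica.getStr(ZkStateReader.BASE_URL_PROP));
  }
}
reader.close();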


Reloading config to zookeeper

2012-11-22 Thread Cool Techi
When we make changes to our config files, how do we reload the files into
zookeeper?

Also, I understand that we would need to reload the collection; would we need
to do this at a per-shard level or just at the cloud level?

Regards,
Ayush

  

Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Sami Siren
I think the problem is that even though you were able to work around the
bug in the client, solr still uses the xml format internally, so the atomic
update (with a multivalued field) fails further down the stack. The bug you
filed needs to be fixed to get the problem solved.




Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
Hi, Sami!

But isn't it strange that some documents were updated correctly (atomic
updates) and other ones not? Couldn't it be a more serious problem, like some
kind of index writer lock, or something similar?

Regards,

- Luis Cappa.




-- 

- Luis Cappa


Re: Reloading config to zookeeper

2012-11-22 Thread Marcin Rzewucki
Hi,

I'm using the "cloud-scripts/zkcli.sh" script for reloading configuration, for
example:
$ ./cloud-scripts/zkcli.sh -cmd upconfig -confdir <config dir> -solrhome <solr home> -confname <config name> -z <zk host>

Then I reload the collection on each node in the cloud, but maybe someone
knows a better solution.
Regards.



RE: Reloading config to zookeeper

2012-11-22 Thread Cool Techi
Thanks, but why do we need to specify the -solrhome? 

I am using the following command to load new config,

java -classpath .:/Users/solr-cli-lib/* org.apache.solr.cloud.ZkCLI -cmd 
upconfig -zkhost 
localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185 
-confdir /Users/config-files -confname myconf

So basically reloading is just uploading the configs back again?

Regards,
Ayush


Re: Reloading config to zookeeper

2012-11-22 Thread Marcin Rzewucki
I think solrhome is not mandatory.
Yes, reloading is uploading the config dir again. It's a pity we can't upload
just the modified files.
Regards.



Re: Reload core via CoreAdminRequest doesnt work with solr cloud? (solrj)

2012-11-22 Thread Tomás Fernández Löbbe
If you need to reload all the cores from a given collection you can use the
Collections API:
http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection
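
If you want to trigger that from SolrJ (4.0 has no dedicated request class
for the Collections API), a minimal sketch along these lines should work;
the host and collection name are placeholders:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ReloadCollection {
    public static void main(String[] args) throws Exception {
        // Any node in the cluster can receive the Collections API call.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "RELOAD");
        params.set("name", "mycollection");
        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");
        server.request(request);
        server.shutdown();
    }
}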


On Thu, Nov 22, 2012 at 3:17 PM, joe.cohe...@gmail.com <
joe.cohe...@gmail.com> wrote:

> Hi,
> I'm using solr-4.0.0
> I'm trying to reload all the cores of a given collection in my solr cloud.
> I use it like this:
>
> CloudSolrServer server = new CloudSolrServer("zkserver:port");
> server.setDefaultCollection("collection1");
> CoreAdminRequest req = new CoreAdminRequest();
> req.reloadCore("collection1", server);
>
> This throws an Exception telling me that no live solr servers are available,
> listing the servers like this:
> http://server/solr/collection1
>
> Of course doing other tasks like adding documents to the CloudSolrServer
> above works fine.
> Using reloadCore on a HttpSolrServer also works fine.
>
> Any know issue with CloudSolrServer   and CoreAdminRequest ?
>
>
> Note that I moved to solr-4.0.0 from solr-4.0.0-beta after the same thing
> also failed there, but with a different exception:
> it failed saying it cannot cast String to Map in class ClusterState, in the
> load() method (line 300), because the key "range" gave some String value
> instead of a map object.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Reload-core-via-CoreAdminRequest-doesnt-work-with-solr-cloud-solrj-tp4021882.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Sami Siren
It might even depend on the cluster layout! Let's say you have 2 shards (no
replicas): if the doc belongs to the node you send it to, so that it does not
get forwarded to another node, then the update should work, and in the case
where the doc gets forwarded to another node the problem occurs. With replicas
it could appear even more strange: the leader might have the doc right and the
replica not.

I only briefly looked at the bits that deal with this so perhaps there's
something more involved.


On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda wrote:

> Hi, Sami!
>
> But isn´t strange that some documents were updated (atomic updates)
> correctly and other ones not? Can´t it be a more serious problem like some
> kind of index writer lock, or whatever?
>
> Regards,
>
> - Luis Cappa.
>
> 2012/11/22 Sami Siren 
>
> > I think the problem is that even though you were able to work around the
> > bug in the client solr still uses the xml format internally so the atomic
> > update (with multivalued field) fails later down the stack. The bug you
> > filed needs to be fixed to get the problem solved.
> >
> >
> > On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda  > >wrote:
> >
> > > Hello everyone.
> > >
> > > I´ve starting to seriously worry about with SolrCloud due an strange
> > > behavior that I have detected. The situation is this the following:
> > >
> > > *1.* SolrCloud with one shard and two Solr instances.
> > > *2.* Indexation via SolrJ with CloudServer and a custom
> > > BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute
> correctly
> > > atomic updates. Check
> > > JIRA-4080<
> > >
> >
> https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055
> > > >
> > > *3.* An asynchronous proccess updates partially some document fields.
> > After
> > > that operation I automatically execute a commit, so the index must be
> > > reloaded.
> > >
> > > What I have checked is that both using atomic updates or complete
> > document
> > > reindexations* aleatory documents are not updated* *even if I saw
> > debugging
> > > how the add() and commit() operations were executed correctly* *and
> > without
> > > errors*. Has anyone experienced a similar behavior? Is it posible that
> if
> > > an index update operation didn´t finish and CloudSolrServer receives a
> > new
> > > one this second update operation doesn´t complete?
> > >
> > > Thank you in advance.
> > >
> > > Regards,
> > >
> > > --
> > >
> > > - Luis Cappa
> > >
> >
>
>
>
> --
>
> - Luis Cappa
>


upgrading from 4.0 to 4.1 causes "CorruptIndexException: checksum mismatch in segments file"

2012-11-22 Thread solr-user
hi all

I have been working on moving us from 4.0 to a newer build of 4.1

I am seeing a "CorruptIndexException: checksum mismatch in segments file"
error when I try to use the existing index files.

I did see something in the build log for #119 re "LUCENE-4446" that mentions
"flip file formats to point to 4.1 format"

Do I just need to reindex or is this some other issue (ie do I need to
configure something differently)?

or should I move back a few builds?

note, we are currently using:

solr-spec 4.0.0.2012.04.05.15.05.52
solr-impl 4.0-SNAPSHOT 1310094M - - 2012-04-05 15:05:52
lucene-spec 4.0-SNAPSHOT
lucene-impl 4.0-SNAPSHOT 1309921 - - 2012-04-05 10:25:27

and are considering moving to:

solr-spec 4.1.0.2012.11.03.18.08.42
solr-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:08:42
lucene-spec 4.1-2012-11-03_18-05-49
lucene-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:06:50
(aka apache-solr-4.1-2012-11-03_18-05-49)





--
View this message in context: 
http://lucene.472066.n3.nabble.com/upgrading-from-4-0-to-4-1-causes-CorruptIndexException-checksum-mismatch-in-segments-file-tp4021913.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error: _version_field must exist in schema

2012-11-22 Thread Nick Zadrozny
On Wed, Oct 17, 2012 at 3:20 PM, Dotan Cohen  wrote:

> I do have a Solr 4 Beta index running on Websolr that does not have
> such a field. It works, but throws many "Service Unavailable" and
> "Communication Error" errors. Might the lack of the _version_ field be
> the reason?
>

Belated reply, but this is probably something you should let us know about
directly at supp...@onemorecloud.com if it happens again. Cheers.

-- 
Nick Zadrozny

Cofounder, One More Cloud

websolr.com  • bonsai.io 

Hassle-free hosted full-text search,
powered by Apache Solr and ElasticSearch.


Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
Hello!

I'm using a simple test configuration with nShards=1 without any replica.
CloudSolrServer is supposed to forward those index/update operations
properly, isn't it? I tested with a complete document reindexation, not
atomic updates, using the official LBHttpSolrServer, not my custom
BinaryLBHttpSolrServer, and it doesn't work. I think it is not just a bug
related to atomic updates via CloudSolrServer but a general bug when an
index changes frequently with reindexations/updates.

Regards,

- Luis Cappa.


2012/11/22 Sami Siren 

> It might even depend on the cluster layout! Let's say you have 2 shards (no
> replicas) if the doc belongs to the node you send it to so that it does not
> get forwarded to another node then the update should work and in case where
> the doc gets forwarded to another node the problem occurs. With replicas it
> could appear even more strange: the leader might have the doc right and the
> replica not.
>
> I only briefly looked at the bits that deal with this so perhaps there's
> something more involved.
>
>
> On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda  >wrote:
>
> > Hi, Sami!
> >
> > But isn´t strange that some documents were updated (atomic updates)
> > correctly and other ones not? Can´t it be a more serious problem like
> some
> > kind of index writer lock, or whatever?
> >
> > Regards,
> >
> > - Luis Cappa.
> >
> > 2012/11/22 Sami Siren 
> >
> > > I think the problem is that even though you were able to work around
> the
> > > bug in the client solr still uses the xml format internally so the
> atomic
> > > update (with multivalued field) fails later down the stack. The bug you
> > > filed needs to be fixed to get the problem solved.
> > >
> > >
> > > On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda  > > >wrote:
> > >
> > > > Hello everyone.
> > > >
> > > > I´ve starting to seriously worry about with SolrCloud due an strange
> > > > behavior that I have detected. The situation is this the following:
> > > >
> > > > *1.* SolrCloud with one shard and two Solr instances.
> > > > *2.* Indexation via SolrJ with CloudServer and a custom
> > > > BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute
> > correctly
> > > > atomic updates. Check
> > > > JIRA-4080<
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055
> > > > >
> > > > *3.* An asynchronous proccess updates partially some document fields.
> > > After
> > > > that operation I automatically execute a commit, so the index must be
> > > > reloaded.
> > > >
> > > > What I have checked is that both using atomic updates or complete
> > > document
> > > > reindexations* aleatory documents are not updated* *even if I saw
> > > debugging
> > > > how the add() and commit() operations were executed correctly* *and
> > > without
> > > > errors*. Has anyone experienced a similar behavior? Is it posible
> that
> > if
> > > > an index update operation didn´t finish and CloudSolrServer receives
> a
> > > new
> > > > one this second update operation doesn´t complete?
> > > >
> > > > Thank you in advance.
> > > >
> > > > Regards,
> > > >
> > > > --
> > > >
> > > > - Luis Cappa
> > > >
> > >
> >
> >
> >
> > --
> >
> > - Luis Cappa
> >
>



-- 

- Luis Cappa


Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
For more details, my indexation App is:

1. Multithreaded.
2. NRT indexation.
3. It's a Web App with a REST API. It receives asynchronous requests that
produce the atomic updates / document reindexations I mentioned before.

I'm pretty sure that the wrong behavior is related to CloudSolrServer and
to the fact that the index may be modified while another index update is
still in progress.

Regards,


- Luis Cappa.


2012/11/22 Luis Cappa Banda 

> Hello!
>
> I´m using a simple test configuration with nShards=1 without any replica.
> SolrCloudServer is suposed to forward properly those index/update
> operations, isn´t it? I test with a complete document reindexation, not
> atomic updates, using the official LBHttpSolrServer, not my custom
> BinaryLBHttpSolrServer, and it dosn´t work. I think is not just a bug
> related with atomic updates via CloudSolrServer but a general bug when an
> index changes with reindexations/updates frequently.
>
> Regards,
>
> - Luis Cappa.
>
>
> 2012/11/22 Sami Siren 
>
>> It might even depend on the cluster layout! Let's say you have 2 shards
>> (no
>> replicas) if the doc belongs to the node you send it to so that it does
>> not
>> get forwarded to another node then the update should work and in case
>> where
>> the doc gets forwarded to another node the problem occurs. With replicas
>> it
>> could appear even more strange: the leader might have the doc right and
>> the
>> replica not.
>>
>> I only briefly looked at the bits that deal with this so perhaps there's
>> something more involved.
>>
>>
>> On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda > >wrote:
>>
>> > Hi, Sami!
>> >
>> > But isn´t strange that some documents were updated (atomic updates)
>> > correctly and other ones not? Can´t it be a more serious problem like
>> some
>> > kind of index writer lock, or whatever?
>> >
>> > Regards,
>> >
>> > - Luis Cappa.
>> >
>> > 2012/11/22 Sami Siren 
>> >
>> > > I think the problem is that even though you were able to work around
>> the
>> > > bug in the client solr still uses the xml format internally so the
>> atomic
>> > > update (with multivalued field) fails later down the stack. The bug
>> you
>> > > filed needs to be fixed to get the problem solved.
>> > >
>> > >
>> > > On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda <
>> luisca...@gmail.com
>> > > >wrote:
>> > >
>> > > > Hello everyone.
>> > > >
>> > > > I´ve starting to seriously worry about with SolrCloud due an strange
>> > > > behavior that I have detected. The situation is this the following:
>> > > >
>> > > > *1.* SolrCloud with one shard and two Solr instances.
>> > > > *2.* Indexation via SolrJ with CloudServer and a custom
>> > > > BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute
>> > correctly
>> > > > atomic updates. Check
>> > > > JIRA-4080<
>> > > >
>> > >
>> >
>> https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055
>> > > > >
>> > > > *3.* An asynchronous proccess updates partially some document
>> fields.
>> > > After
>> > > > that operation I automatically execute a commit, so the index must
>> be
>> > > > reloaded.
>> > > >
>> > > > What I have checked is that both using atomic updates or complete
>> > > document
>> > > > reindexations* aleatory documents are not updated* *even if I saw
>> > > debugging
>> > > > how the add() and commit() operations were executed correctly* *and
>> > > without
>> > > > errors*. Has anyone experienced a similar behavior? Is it posible
>> that
>> > if
>> > > > an index update operation didn´t finish and CloudSolrServer
>> receives a
>> > > new
>> > > > one this second update operation doesn´t complete?
>> > > >
>> > > > Thank you in advance.
>> > > >
>> > > > Regards,
>> > > >
>> > > > --
>> > > >
>> > > > - Luis Cappa
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> >
>> > - Luis Cappa
>> >
>>
>
>
>
> --
>
> - Luis Cappa
>
>


-- 

- Luis Cappa


Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.

2012-11-22 Thread Luis Cappa Banda
More info:

- I'm trying to update the document by re-indexing the whole document again:
I first retrieve the document querying by its id, then delete it by its
id, and re-index it including the new changes (see the sketch below).
- At the same time there are other index writing operations.

*RESULT*: in most cases the document wasn't updated. Bad news... it smells
like a critical bug.
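
A minimal SolrJ sketch of the cycle I'm describing (the ZooKeeper host,
collection, id and field names are placeholder assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ReindexCycle {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost:2181");
        server.setDefaultCollection("collection1");

        // 1. Retrieve the current version of the document by its id.
        SolrDocument old = server.query(new SolrQuery("id:doc-1"))
                .getResults().get(0);

        // 2. Delete it by id.
        server.deleteById("doc-1");

        // 3. Re-index it with the new changes and commit.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", old.getFieldValue("id"));
        doc.addField("title", "new value"); // the changed field
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}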

Regards,


- Luis Cappa.

2012/11/22 Luis Cappa Banda 

> For more details, my indexation App is:
>
> 1. Multithreaded.
> 2. NRT indexation.
> 3. It´s a Web App with a REST API. It receives asynchronous requests that
> produces those atomic updates / document reindexations I told before.
>
> I´m pretty sure that the wrong behavior is related with CloudSolrServer
> and with the fact that maybe you are trying to modify the index while an
> index update is in course.
>
> Regards,
>
>
> - Luis Cappa.
>
>
> 2012/11/22 Luis Cappa Banda 
>
>> Hello!
>>
>> I´m using a simple test configuration with nShards=1 without any replica.
>> SolrCloudServer is suposed to forward properly those index/update
>> operations, isn´t it? I test with a complete document reindexation, not
>> atomic updates, using the official LBHttpSolrServer, not my custom
>> BinaryLBHttpSolrServer, and it dosn´t work. I think is not just a bug
>> related with atomic updates via CloudSolrServer but a general bug when an
>> index changes with reindexations/updates frequently.
>>
>> Regards,
>>
>> - Luis Cappa.
>>
>>
>> 2012/11/22 Sami Siren 
>>
>>> It might even depend on the cluster layout! Let's say you have 2 shards
>>> (no
>>> replicas) if the doc belongs to the node you send it to so that it does
>>> not
>>> get forwarded to another node then the update should work and in case
>>> where
>>> the doc gets forwarded to another node the problem occurs. With replicas
>>> it
>>> could appear even more strange: the leader might have the doc right and
>>> the
>>> replica not.
>>>
>>> I only briefly looked at the bits that deal with this so perhaps there's
>>> something more involved.
>>>
>>>
>>> On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda >> >wrote:
>>>
>>> > Hi, Sami!
>>> >
>>> > But isn´t strange that some documents were updated (atomic updates)
>>> > correctly and other ones not? Can´t it be a more serious problem like
>>> some
>>> > kind of index writer lock, or whatever?
>>> >
>>> > Regards,
>>> >
>>> > - Luis Cappa.
>>> >
>>> > 2012/11/22 Sami Siren 
>>> >
>>> > > I think the problem is that even though you were able to work around
>>> the
>>> > > bug in the client solr still uses the xml format internally so the
>>> atomic
>>> > > update (with multivalued field) fails later down the stack. The bug
>>> you
>>> > > filed needs to be fixed to get the problem solved.
>>> > >
>>> > >
>>> > > On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda <
>>> luisca...@gmail.com
>>> > > >wrote:
>>> > >
>>> > > > Hello everyone.
>>> > > >
>>> > > > I´ve starting to seriously worry about with SolrCloud due an
>>> strange
>>> > > > behavior that I have detected. The situation is this the following:
>>> > > >
>>> > > > *1.* SolrCloud with one shard and two Solr instances.
>>> > > > *2.* Indexation via SolrJ with CloudServer and a custom
>>> > > > BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute
>>> > correctly
>>> > > > atomic updates. Check
>>> > > > JIRA-4080<
>>> > > >
>>> > >
>>> >
>>> https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055
>>> > > > >
>>> > > > *3.* An asynchronous proccess updates partially some document
>>> fields.
>>> > > After
>>> > > > that operation I automatically execute a commit, so the index must
>>> be
>>> > > > reloaded.
>>> > > >
>>> > > > What I have checked is that both using atomic updates or complete
>>> > > document
>>> > > > reindexations* aleatory documents are not updated* *even if I saw
>>> > > debugging
>>> > > > how the add() and commit() operations were executed correctly* *and
>>> > > without
>>> > > > errors*. Has anyone experienced a similar behavior? Is it posible
>>> that
>>> > if
>>> > > > an index update operation didn´t finish and CloudSolrServer
>>> receives a
>>> > > new
>>> > > > one this second update operation doesn´t complete?
>>> > > >
>>> > > > Thank you in advance.
>>> > > >
>>> > > > Regards,
>>> > > >
>>> > > > --
>>> > > >
>>> > > > - Luis Cappa
>>> > > >
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> >
>>> > - Luis Cappa
>>> >
>>>
>>
>>
>>
>> --
>>
>> - Luis Cappa
>>
>>
>
>
> --
>
> - Luis Cappa
>
>


-- 

- Luis Cappa


Re: upgrading from 4.0 to 4.1 causes "CorruptIndexException: checksum mismatch in segments file"

2012-11-22 Thread Jack Krupansky
Moving from the final release of 4.0 to 4.1 should be fine, but you appear
to be using a snapshot of 4.0 that is even older than the 4.0 ALPHA release,
and a number of format changes occurred last spring. So, yeah, you will
have to re-index.


-- Jack Krupansky

-Original Message- 
From: solr-user

Sent: Thursday, November 22, 2012 2:03 PM
To: solr-user@lucene.apache.org
Subject: upgrading from 4.0 to 4.1 causes "CorruptIndexException: checksum 
mismatch in segments file"


hi all

I have been working on moving us from 4.0 to a newer build of 4.1

I am seeing a "CorruptIndexException: checksum mismatch in segments file"
error when I try to use the existing index files.

I did see something in the build log for #119 re "LUCENE-4446" that mentions
"flip file formats to point to 4.1 format"

Do I just need to reindex or is this some other issue (ie do I need to
configure something differently)?

or should I move back a few builds?

note, we are currently using:

solr-spec 4.0.0.2012.04.05.15.05.52
solr-impl 4.0-SNAPSHOT 1310094M - - 2012-04-05 15:05:52
lucene-spec 4.0-SNAPSHOT
lucene-impl 4.0-SNAPSHOT 1309921 - - 2012-04-05 10:25:27

and are considering moving to:

solr-spec 4.1.0.2012.11.03.18.08.42
solr-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:08:42
lucene-spec 4.1-2012-11-03_18-05-49
lucene-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:06:50
(aka apache-solr-4.1-2012-11-03_18-05-49)





--
View this message in context: 
http://lucene.472066.n3.nabble.com/upgrading-from-4-0-to-4-1-causes-CorruptIndexException-checksum-mismatch-in-segments-file-tp4021913.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Find the matched field in each matched document

2012-11-22 Thread Jack Krupansky
No, not directly, but indirectly you can - add &debugQuery=true to your 
request and the "explain" section will detail which terms matched in which 
fields.
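
For example, with the fields from your documents (host and core are
placeholders):

http://localhost:8983/solr/select?q=title:%22robert+de+niro%22+OR+actors:%22robert+de+niro%22&debugQuery=true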


You could probably also implement a custom search component which annotated 
each document with the matched field names. In that sense, Solr CAN do it.
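
A very rough sketch of that idea (Solr 4.x API; the class name and the way
the matched fields get computed are assumptions, not a ready-made
implementation):

import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class MatchedFieldsComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Nothing to set up before the query component runs.
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // When registered after QueryComponent, rb.getResults().docList
        // holds the matching docs. For each doc you could re-test the
        // per-field clauses of the query against rb.req.getSearcher(),
        // collect the names of the fields that matched, and attach them
        // to the response:
        rb.rsp.add("matchedFields", "per-document field names go here");
    }

    @Override
    public String getDescription() {
        return "Annotates results with the fields that matched";
    }

    @Override
    public String getSource() {
        return "";
    }
}

It would be registered as a <searchComponent> in solrconfig.xml and appended
to the request handler's component list, after the query component.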


-- Jack Krupansky

-Original Message- 
From: Alireza Salimi

Sent: Thursday, November 22, 2012 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Find the matched field in each matched document

Maybe I should say it in a different way:

Given documents like the above, I want to know what "Robert De Niro" is:
is it an actor or a movie title?

You can just tell me whether Solr can do it or not; that will be enough.

Thanks



On Thu, Nov 22, 2012 at 1:57 PM, Alireza Salimi 
wrote:



Hi,

I apologize if I'm asking a duplicate question, but I haven't found any
good answer for my problem.
My question is: how can I find out which fields matched the search criteria
when I search over multiple fields?

Assume I have documents like this:
{"title": "Robert De Niro", "actors": []}
{"title": "ronin", "actors": ["robert de niro", "jean reno"]}
{"title": "casino", "actors": ["robert de niro", "Joe Pesci"]}

Here's the schema:

<field name="actors" indexed="true" multiValued="true" stored="true"
       termPositions="true" termOffsets="true" termVectors="true"
       type="text_general" />

<field name="title" indexed="true" multiValued="false" stored="true"
       type="text_general" />
Now, after searching for "robert de niro" in both "title" and "actors",
I will have some matches, but my question is: How can I find out
what "robert de niro" is? Is he "an actor" or a "movie title"?


Thanks in advance



--
Alireza Salimi
Java EE Developer






--
Alireza Salimi
Java EE Developer 



Re: Find the matched field in each matched document

2012-11-22 Thread Alireza Salimi
Hi Jack,

Thanks for the reply.

I'm not sure about the debug component; I thought it slows down query time.
Can you explain more about the custom search component?

Thanks


On Thu, Nov 22, 2012 at 7:02 PM, Jack Krupansky wrote:

> No, not directly, but indirectly you can - add &debugQuery=true to your
> request and the "explain" section will detail which terms matched in which
> fields.
>
> You could probably also implement a custom search component which
> annotated each document with the matched field names. In that sense, Solr
> CAN do it.
>
> -- Jack Krupansky
>
> -Original Message- From: Alireza Salimi
> Sent: Thursday, November 22, 2012 6:11 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Find the matched field in each matched document
>
>
> Maybe I should say it in different way:
>
> By having documents like above, I want to know what "Robert De Niro" is?
> Is it an actor or a movie title.
>
> you can just tell me if Solr can do it or not, it will be enough.
>
> Thanks
>
>
>
> On Thu, Nov 22, 2012 at 1:57 PM, Alireza Salimi 
> **wrote:
>
>  Hi,
>>
>> I apologize if i'm asking a duplicate question but I haven't found any
>> good answer for my problem.
>> My question is: How can I find out the type of fields that are matched to
>> the search criteria,
>> when I search over multip fields.
>>
>> Assume I have documents like this:
>> {"title": "Robert De Niro", "actors": []}
>> {"title": "ronin", "actors": ["robert de niro", "jean reno"]}
>> {"title": "casino", "actors": ["robert de niro", "Joe Pesci"]}
>>
>> Here's the schema:
>>
>> <field name="actors"
>> indexed="true"
>> multiValued="true"
>> stored="true"
>> termPositions="true"
>> termOffsets="true"
>> termVectors="true"
>> type="text_general" />
>>
>> <field name="title"
>> indexed="true"
>> multiValued="false"
>> stored="true"
>> type="text_general" />
>>
>> Now after search for "robert de niro" in both "title" and "Actors",
>> I will have some matches, but my question is: How can I find out
>> what "robert de niro" is? Is he "an actor" or a "movie title"?
>>
>>
>> Thanks in advance
>>
>>
>>
>> --
>> Alireza Salimi
>> Java EE Developer
>>
>>
>>
>>
>
> --
> Alireza Salimi
> Java EE Developer
>



-- 
Alireza Salimi
Java EE Developer


Re: Performance improvement for solr faceting on large index

2012-11-22 Thread Otis Gospodnetic
Hi,

I don't quite follow what you are trying to do, but it almost sounds
like you may be better off using something other than Solr if all you are
doing is filtering by site and counting something.
I see unigrams in what looks like it could be a big field, and that's a red
flag.
Your index is quite big - how much memory have you got?  Do those queries
produce a lot of disk IO? I have a feeling they do. If so, your shards may
be too large for your hardware.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 22, 2012 7:53 AM, "Pravin Agrawal" 
wrote:

> Hi All,
>
> We are using solr 3.4 with following schema fields.
>
>
> ---
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="..."/>
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="5"
>             outputUnigrams="true"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="^([0-9. ])*$"
>             replacement="" replace="all"/>
>   </analyzer>
> </fieldType>
>
> <field name="autoSuggestContent" type="..." indexed="true" multiValued="true"/>
> <field name="site" type="..." indexed="true"/>
> ...
>
>
> ---
>
> The index on the above schema is distributed over two solr shards, each
> with about 1.2 million docs and about 195GB on disk.
>
> We want to retrieve (site, autoSuggestContent term, frequency of the term)
> information from our above main solr index. The site is a field in document
> and contains name of site to which that document belongs. The terms are
> retrieved from multivalued field autoSuggestContent which is created using
> shingles from content and title of the web page.
>
> As of now, we are using facet query to retrieve (term, frequency of term)
>  for each site. Below is a sample query (you may ignore initial part of
> query)
>
>
> http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
>
> The problem is that with the increase in index size, this method has started
> taking a huge amount of time. It used to take 7 minutes per site with an
> index of 0.4 million docs, but takes around 60-90 minutes with an index of
> 2.5 million. At this speed, it will take around 5-6 days to process all
> 1500 sites. Also, we expect the index size to grow with more
> documents and more sites, and as such the time to get the above information
> will increase further.
>
> Please let us know if there is any better way to extract (site, term,
> frequency) information compare to current method.
>
> Thanks,
> Pravin Agrawal
>
>
>
>
> DISCLAIMER
> ==
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>


Re: SolrCloud and external Zookeeper ensemble

2012-11-22 Thread Otis Gospodnetic
Note the number of zookeeper nodes is independent of number of shards.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 22, 2012 4:19 AM, "Luis Cappa Banda"  wrote:

> Hello,
>
> I´ve been dealing with the same question these days. In architecture terms,
> it´s always better to separate services (Solr and Zookeeper, in this case)
> rather to keep them in a single instance. However, when we have to deal
> with costs issues, all of use we are quite limitated and we must elect the
> best architecture/scalable/single point of failure option. As I see, the
> options are:
>
>
> *1. *Solr servers with Zookeeper embeded.
> *2. *Solr servers with external Zookeeper.
> *3.* Solr servers with external Zookeeper ensemble.
>
> *Note*: as far as I know, the recommended number of Zookeeper services to
> avoid single points of failure is:* ZkNum = 2 * Numshards - 1*. If you have
>
>
> The best option is the third one. Reasons:
>
> *1. *If one of your Solr servers goes down, Zookeeper services still up.
> *2.* If one of your Zookeeper services goes down, Solr servers and the rest
> of Zookeeper services still up.
>
> Considering that option, we have two ways to implement it in production:
>
> *1. *Each service (Solr and Zookeeper) in separate machines. Let´s imagine
> that we have 2 shards for a given collection, so we need at least 4 Solr
> servers to complete the leader-replica configuration. The best option is to
> deploy them in for Amazon instances, one per each server. We need at least
> 3 Zookeeper services in a Zookeeper ensemble configuration. The optimal way
> to install them is in separates machines (micro instance will be nice for
> Zookeeper), so we will have 7 Amazon instances. The reason is that if one
> machine goes down (Solr or Zookeeper one) the others services may still up
> and your production environment will be safe. However,* for me this is the
> best case, but it´s the more expensive one*, so in my case is imposible to
> make real.
>
> *2. *As wee need at least 4 Solr servers and 3 Zookeeper services up, I
> would install three Amazon instances with Solr and Zookeeper, and one of
> them only with Solr. So we´ll have: 3 complete Amazon instances (Solr +
> Zookeeper) and 1 single Amazon instance  (only Solr). If one of them goes
> down, the production environment will be safe. This architecture is not the
> best one, as I told you, but I think that is optimal in terms of
> robustness, single point of failure and costs.
>
>
> It would be a pleasure to hear new suggestions from other people that
> dealed with this kind of issues.
>
> Regards,
>
>
> - Luis Cappa.
>
>
> 2012/11/21 Marcin Rzewucki 
>
> > Yes, I meant the same (not -zkRun). However, I was asking if it is safe
> to
> > have zookeeper and solr processes running on the same node or better on
> > different machines?
> >
> > On 21 November 2012 21:18, Rafał Kuć  wrote:
> >
> > > Hello!
> > >
> > > As I told I wouldn't use the Zookeeper that is embedded into Solr, but
> > > rather setup a standalone one.
> > >
> > > --
> > > Regards,
> > >  Rafał Kuć
> > >  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
> > ElasticSearch
> > >
> > > > First of all: thank you for your answers. Yes, I meant side by side
> > > > configuration. I think the worst case for ZKs here is to loose two of
> > > them.
> > > > However, I'm going to use 4 availability zones in same region so at
> > least
> > > > this will reduce the risk of loosing both of them at the same time.
> > > > Regards.
> > >
> > > > On 21 November 2012 17:06, Rafał Kuć  wrote:
> > >
> > > >> Hello!
> > > >>
> > > >> Zookeeper by itself is not demanding, but if something happens to
> your
> > > >> nodes that have Solr on it, you'll loose ZooKeeper too if you have
> > > >> them installed side by side. However if you will have 4 Solr nodes
> and
> > > >> 3 ZK instances you can get them running side by side.
> > > >>
> > > >> --
> > > >> Regards,
> > > >>  Rafał Kuć
> > > >>  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
> > > ElasticSearch
> > > >>
> > > >> > Separate is generally nice because then you can restart Solr nodes
> > > >> > without consideration for ZooKeeper.
> > > >>
> > > >> > Performance-wise, I doubt it's a big deal either way.
> > > >>
> > > >> > - Mark
> > > >>
> > > >> > On Nov 21, 2012, at 8:54 AM, Marcin Rzewucki  >
> > > >> wrote:
> > > >>
> > > >> >> Hi,
> > > >> >>
> > > >> >> I have 4 solr collections, 2-3mn documents per collection, up to
> > 100K
> > > >> >> updates per collection daily (roughly). I'm going to create
> > > SolrCloud4x
> > > >> on
> > > >> >> Amazon's m1.large instances (7GB mem,2x2.4GHz cpu each). The
> > > question is
> > > >> >> what about zookeeper? It's going to be external ensemble, but is
> it
> > > >> better
> > > >> >> to use same nodes as solr or dedicated micro instances? Zookeeper
> > > does
> > > >> not
> > > >> >> seem to be resources demanding process, but what would be better
> in
> > > this
> > 

User context based search in apache solr

2012-11-22 Thread sagarzond
In our application we provide product master data search with SOLR. Now
our requirement is to provide user-context-based search (meaning we show
top search results based on user history).

For that I have created a score table with the following fields:

1)product_id

2)user_id

3)score_value

As soon as a user clicks on any product, an entry is created in this table
(or score_value is increased if an entry for that product and user already
exists). We are planning to use a boost field and eDisMax from SOLR to
improve the search results (see the example below), but for this I need a
one-to-many mapping between the score and product tables (because one
product has a different score value for each user), and solr does not
provide one-to-many mappings.
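
For example, the kind of query we have in mind, assuming a per-user score
field such as score_u42 denormalized into each product document (the field
name and host are just illustrations):

http://localhost:8080/solr/select?defType=edismax&q=laptop&bf=score_u42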

We could solve this issue (handling the one-to-many mapping) by
de-normalizing the structure, i.e. having multiple entries per product with
a different score value for each user, but that results in a huge amount of
redundant data.

Is this (de-normalized structure) the correct way to handle it, or is there
another way to handle such context-based search?

Please help me.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/User-context-based-search-in-apache-solr-tp4021964.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error: _version_field must exist in schema

2012-11-22 Thread Dotan Cohen
On Thu, Nov 22, 2012 at 9:26 PM, Nick Zadrozny  wrote:
> Belated reply, but this is probably something you should let us know about
> directly at supp...@onemorecloud.com if it happens again. Cheers.
>

Hi Nick. This particular issue was on a Solr 4 instance on AWS, not on
the Websolr account. But I commend you taking notice and taking an
interest. Thank you!

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


RE: Solr UIMA with KEA

2012-11-22 Thread Markus Jelsma
See: 
http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html

 
-Original message-
> From:nutchsolruser 
> Sent: Fri 23-Nov-2012 06:53
> To: solr-user@lucene.apache.org
> Subject: Solr UIMA with KEA
> 
> Is there any way we can extract tags or keyphrases from a solr document at
> index time?
> 
> I know we can use the solr UIMA library to enrich solr documents with
> metadata, but it requires an Alchemy API key (which we have to purchase for
> commercial use). Can we wrap the KeyPhraseExtractor (KEA) in UIMA for this
> purpose? If yes, then let me know some useful pointers for doing this.
> 
> Thank you ,
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-UIMA-with-KEA-tp4021962.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


RE: Solr UIMA with KEA

2012-11-22 Thread Markus Jelsma
Sorry, wrong list :) 
 
-Original message-
> From:Markus Jelsma 
> Sent: Fri 23-Nov-2012 08:32
> To: solr-user@lucene.apache.org
> Subject: RE: Solr UIMA with KEA
> 
> See: 
> http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> 
>  
> -Original message-
> > From:nutchsolruser 
> > Sent: Fri 23-Nov-2012 06:53
> > To: solr-user@lucene.apache.org
> > Subject: Solr UIMA with KEA
> > 
> > Is there any way we can extract tags or keyphrases from solr document at
> > index time?
> > 
> > I know we can use solr UIMA library  to enrich solr document with metadata
> > but it require alchemy API key (which we have to purchase for commercial
> > use) . Can we wrap KeyPhraseExtractor(KEA) in UIMA for this purpose  if yes
> > then then let me know some useful pointers for doing this.
> > 
> > Thank you ,
> > 
> > 
> > 
> > --
> > View this message in context: 
> > http://lucene.472066.n3.nabble.com/Solr-UIMA-with-KEA-tp4021962.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> > 
>