[ANNOUNCE] Apache Solr 4.7.2 released.

2014-04-15 Thread Robert Muir
April 2014, Apache Solr™ 4.7.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.7.2

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.7.2 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.7.2 includes 2 bug fixes, as well as Lucene 4.7.2 and its bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.


Re: filter capabilities are limited?

2014-04-15 Thread horot
The variables over which the comparison is done are of a string data type. I cannot
apply mathematical functions to them, nor perform the type conversion that would be
needed (string to integer). Will I be able to build such a scheme without changing a filter?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/filter-capabilities-are-limited-tp4130458p4131174.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr join and lucene scoring

2014-04-15 Thread mm

Thank you for the clarification.
We really need scoring with solr joins, but as you can see I'm not a  
specialist in solr development.
We would like to hire somebody with more experience to write a qparser  
plugin for scoring in joins and donate the source code to the community.


Any suggestions where we could find somebody with the fitting experience?


Quoting Mikhail Khludnev:


On Wed, Apr 9, 2014 at 1:33 PM,  wrote:


Hello Mikhail,

thx for the clarification. I'm a little bit confused by the answer of
Alvaro, but my own tests didn't result in a proper score, so I think you're
right and it's still not implemented.

What do you mean with the "impedance between Lucene and Solr"?


It's an old story, and unfortunately obvious. Using Lucene's code in Solr
might not be straightforward. I haven't looked at this problem
particularly, it's just a caveat.



Why isn't scoring in joins implemented in Solr anyway, when Lucene offers a
solution for it?


As you can see, these are two separate implementations. It seems like the Solr
folks just didn't care about scoring (and here I share their point). It's
just an exercise for someone who needs it.




Best regards,
Moritz

Quoting Mikhail Khludnev:

 On Thu, Apr 3, 2014 at 1:42 PM,  wrote:


 Hello,


referencing to this issue:
https://issues.apache.org/jira/browse/SOLR-4307

Is it still not possible with the solr query time join to use scoring?

 It's still not implemented.

https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/JoinQParserPlugin.java#L549


 Do I still have to write my own plugin or is there a plugin somewhere I

could use?

I never wrote a plugin for solr before, so I would prefer if I don't have
to start from scratch.

 The right approach from my POV is to use Lucene's join
(https://github.com/apache/lucene-solr/blob/trunk/lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java)
in a new QParser, but solving the impedance between Lucene and Solr might be
tricky.
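
For illustration, a rough sketch of the core of such a QParser built on Lucene's JoinUtil as suggested above. This is not an existing Solr plugin; the field names, the fromQuery and the choice of ScoreMode.Max are illustrative assumptions:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.join.JoinUtil;
    import org.apache.lucene.search.join.ScoreMode;

    public class ScoreJoinSketch {
        // Builds a join query whose hits on the "to" side carry scores from the "from" side.
        public static Query buildScoringJoin(IndexReader reader) throws Exception {
            IndexSearcher fromSearcher = new IndexSearcher(reader);
            Query fromQuery = new TermQuery(new Term("type", "parent")); // hypothetical from-side query
            // "from_field"/"to_field" are the join keys; ScoreMode.Max keeps the best from-side score
            return JoinUtil.createJoinQuery("from_field", false, "to_field",
                    fromQuery, fromSearcher, ScoreMode.Max);
        }
    }

A Solr QParserPlugin would build the fromQuery from its local params and return the resulting Query from parse().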





THX,
Moritz





--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 









--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 






Autocomplete with Case-insensitive feature

2014-04-15 Thread Sunayana
Hi All,

I have been trying out the autocomplete feature in Solr 4.7.1 using the
Suggester. I have configured it to display phrase suggestions as well. The problem is:
if I type "game" I get suggestions such as "game" or phrases containing "game",
but if I type "Game" *no suggestion is displayed at all*. How can I make the
suggestions case-insensitive?
I have defined fields in schema.xml like this:

[schema.xml field and fieldType definitions stripped by the mailing-list archive;
some surviving attributes appear in the quoted copy in the reply below]

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Autocomplete-with-Case-insensitive-feature-tp4131182.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Autocomplete with Case-insensitive feature

2014-04-15 Thread Dmitry Kan
Hi,

Configure LowerCaseFilterFactory into the "query" side of your type config.
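
A minimal sketch of what that could look like; the type name, tokenizer and shingle settings are illustrative, not the poster's exact schema:

    <fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- lowercase at index time -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4"
                outputUnigrams="true" outputUnigramsIfNoShingles="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- lowercase at query time too, so "Game" matches suggestions built from "game" -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>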

Dmitry


On Tue, Apr 15, 2014 at 10:50 AM, Sunayana  wrote:

> Hi All,
>
> I have been trying out this autocomplete feature in Solr4.7.1 using
> Suggester.I have configured it to display phrase suggestions also.Problem
> is
> If I type "game" I get suggestions as "game" or phrases containing "game".
> But If I type "Game" *no suggestion is displayed at all*.How can I get
> suggestions case-insensitive?
> I have defined in schema.xml fields like this:
> [schema excerpt stripped by the archive; the surviving attributes show a field
> with stored="true" multiValued="true" and a fieldType with
> positionIncrementGap="100" whose analyzer includes
> solr.LowerCaseFilterFactory and a shingle filter with minShingleSize="2",
> maxShingleSize="4", outputUnigrams="true", outputUnigramsIfNoShingles="true"]
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Autocomplete-with-Case-insensitive-feature-tp4131182.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: More Robust Search Timeouts (to Kill Zombie Queries)?

2014-04-15 Thread Salman Akram
Looking at this, sharding seems to be the best and simplest option to handle
such queries.


On Wed, Apr 2, 2014 at 1:26 AM, Mikhail Khludnev  wrote:

> Hello Salman,
> Let me drop a few thoughts on
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
>
> There are two aspects to this question:
> 1. dealing with long-running processing (thread divergence actions,
> http://docs.oracle.com/javase/specs/jls/se5.0/html/memory.html#65310), and
> 2. the actual time checking.
> "Terminating" or "aborting" a thread (2.) is just a way of tracking time
> externally and sending interrupt(), which the thread should react to (which
> it doesn't do now), so we return to the core issue (1.).
>
> Solr's timeAllowed is the proper way to handle these things; the only
> problem is that it assumes only the core search is long-running, but in
> your case rewriting MultiTermQuery-s takes a huge amount of time.
> Let's consider this problem. First of all, MultiTermQuery.rewrite() is
> nearly a design issue: after a heavy rewrite occurs, the result is thrown
> away once the search is done. I think the most straightforward way to
> address this issue is by caching these expensive queries. Solr does it well
> with http://wiki.apache.org/solr/CommonQueryParameters#fq. However, that
> only works for http://en.wikipedia.org/wiki/Conjunctive_normal_form style
> queries; there is a workaround that allows caching disjunction legs, see
> http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html
> If you still want to run expensively rewritten queries, you need to
> implement a timeout check (similar to TimeLimitingCollector) for the
> TermsEnum returned from MultiTermQuery.getTermsEnum(). Wrapping the actual
> TermsEnum is a good way; to make queries use the time-limiting wrapper
> TermsEnum, you might consider overriding methods like
> SolrQueryParserBase.newWildcardQuery(Term) or post-processing the query
> tree after parsing.
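
For illustration, a rough sketch of the kind of wrapper described above, assuming Lucene 4.x. It is not an existing Lucene or Solr class; the timeout handling and exception are placeholders:

    import java.io.IOException;
    import org.apache.lucene.index.FilterAtomicReader.FilterTermsEnum;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    // Delegates to the real TermsEnum but aborts term enumeration (and hence the
    // MultiTermQuery rewrite driving it) once a deadline has passed.
    public class TimeLimitingTermsEnum extends FilterTermsEnum {
        private final long deadlineNanos;

        public TimeLimitingTermsEnum(TermsEnum in, long timeoutMillis) {
            super(in);
            this.deadlineNanos = System.nanoTime() + timeoutMillis * 1000000L;
        }

        @Override
        public BytesRef next() throws IOException {
            if (System.nanoTime() > deadlineNanos) {
                throw new RuntimeException("Term enumeration exceeded the time limit");
            }
            return in.next();
        }
    }

Such a wrapper would have to be returned from getTermsEnum() of a custom query, or injected by overriding methods like SolrQueryParserBase.newWildcardQuery(Term), as described above.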
>
>
>
> On Mon, Mar 31, 2014 at 2:24 PM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
> > Anyone?
> >
> >
> > On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram <
> > salman.ak...@northbaysolutions.net> wrote:
> >
> > > With reference to this thread:
> > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
> > > I wanted to know if there was any response to that, or if Chris Harris
> > > himself can comment on what he ended up doing; that would be great!
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Salman Akram
> > >
> > >
> >
> >
> > --
> > Regards,
> >
> > Salman Akram
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
>  
>



-- 
Regards,

Salman Akram


Re: Autocomplete with Case-insensitive feature

2014-04-15 Thread Sunayana
Hi,

Did you mean changing the field type as follows?

[modified fieldType definition stripped by the mailing-list archive]

This did not work out for me. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Autocomplete-with-Case-insensitive-feature-tp4131182p4131198.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing Big Data With or Without Solr

2014-04-15 Thread Vineet Mishra
Hi All,

I have worked with Solr 3.5 to implement real-time search on some 100GB of
data. That worked fine but was a little slow on complex queries (multiple
group/join queries).
But now I want to index some real Big Data (around 4 TB or even more). Can
SolrCloud be the solution for it? If not, what could be the best possible
solution in this case?

*Stats for the previous implementation:*
It was a master-slave architecture with multiple standalone instances
of Solr 3.5. There were around 12 Solr instances running on different
machines.

*Things to consider for the next implementation:*
Since all the data is sensor data, duplication and uniqueness are factors.

*This is really urgent; please treat it as a priority and suggest a set of
feasible solutions.*

Regards




Re: Class not found ICUFoldingFilter (SOLR-4852)

2014-04-15 Thread ronak kirit
Hello Shawn,

Thanks for your reply.

Yes, I have defined ${solr.solr.home} explicitly, and all the mentioned
jars are present in ${solr.solr.home}/lib. solr.log also shows that those files
are getting added once (grep "icu4" solr.log). I can see the following lines in
the log:

INFO  - 2014-04-15 15:40:21.448; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/icu4j-49.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.454; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-icu-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.454; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-morfologik-4.3.1.jar' to
classloader
INFO  - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-smartcn-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-stempel-4.3.1.jar' to classloader
INFO  - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/solr/lib/lucene-analyzers-uima-4.3.1.jar' to classloader

But I still get the same exception that ICUFoldingFilter is not found. However,
copying those files to WEB-INF/lib works fine for me.

Thanks,
Ronak


On Fri, Apr 11, 2014 at 3:14 PM, ronak kirit  wrote:

> Hello,
>
> I am facing the same issue discussed at SOLR-4852. I am getting below
> error:
>
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.lucene.analysis.icu.ICUFoldingFilter
> at
> org.apache.lucene.analysis.icu.ICUFoldingFilterFactory.create(ICUFoldingFilterFactory.java:50)
>   at
> org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:67)
>
>
> I am using solr-4.3.1. As discussed at SOLR-4852, I had all the jars at
> (SOLR_HOME)/lib and there is no reference to lib via any of solrconfig.xml
> or schema.xml.
>
> I have also tried with setting "sharedLib=foo", but that also didn't work.
> However, if  I removed all the below files:
>
> icu4j-49.1.jar
>
> lucene-analyzers-morfologik-4.3.1.jar
>
> lucene-analyzers-stempel-4.3.1.jar
>
> solr-analysis-extras-4.3.1.jar
>
> lucene-analyzers-icu-4.3.1.jar
>
> lucene-analyzers-smartcn-4.3.1.jar
>
> lucene-analyzers-uima-4.3.1.jar
>
> from $(solrhome)/lib and move them to solr-webapp/webapp/WEB-INF/lib, things are
> working fine.
>
> Any guess? Any help?
>
> Thanks,
>
> Ronak
>


Re: Error Arising from when I start to crawl

2014-04-15 Thread Cihad Guzel
Hi Ridwan,

This error is not related to Solr. Solr is used in the "IndexerJob" for
Nutch. This error is thrown from the "InjectorJob" and is related to Nutch and
Gora. Check your HBase and Nutch configuration, and ensure that HBase is
running correctly and that you are using the correct version. For more accurate
information, you should ask on the Nutch user list with more
details.


2014-04-14 5:11 GMT+03:00 Alexandre Rafalovitch :

> This is most definitely not a Solr issue, so you may want to check with
> Gora's list.
>
> However, as a quick general hint, your problem seems to be in this
> part: 3530@engr-MacBookProlocalhost . I assume it should be a server name
> there, but it seems to be two names joined together. So I would check where
> that (possibly the hbase listen address) is defined and ensure it is correct.
>
> Regards,
>  Alex
> On 14/04/2014 8:46 am, "Ridwan Naibi"  wrote:
>
> > Hi there,
> >
> > I get the following error after I run the following command. Can you
> > please let me know what the problem is? I have exhausted online tutorials
> > trying to solve this issue. Thanks
> >
> > engr@engr-MacBookPro:~/NUTCH_HOME/apache-nutch-2.2.1/runtime/local$
> > bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
> > InjectorJob: starting at 2014-04-14 02:28:56
> > InjectorJob: Injecting urlDir: urls/seed.txt
> > InjectorJob: org.apache.gora.util.GoraException:
> > java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a
> > host:port pair: � 3530@engr-MacBookProlocalhost,43200,1397436949832
> > at org.apache.gora.store.DataStoreFactory.createDataStore(
> > DataStoreFactory.java:167)
> > at org.apache.gora.store.DataStoreFactory.createDataStore(
> > DataStoreFactory.java:135)
> > at org.apache.nutch.storage.StorageUtils.createWebStore(
> > StorageUtils.java:75)
> > at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
> > at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
> > at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
> > Caused by: java.lang.RuntimeException:
> java.lang.IllegalArgumentException:
> > Not a host:port pair: � 3530@engr-MacBookProlocalhost
> ,43200,1397436949832
> > at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:127)
> > at org.apache.gora.store.DataStoreFactory.initializeDataStore(
> > DataStoreFactory.java:102)
> > at org.apache.gora.store.DataStoreFactory.createDataStore(
> > DataStoreFactory.java:161)
> > ... 7 more
> > Caused by: java.lang.IllegalArgumentException: Not a host:port pair: �
> > 3530@engr-MacBookProlocalhost,43200,1397436949832
> > at org.apache.hadoop.hbase.HServerAddress.(HServerAddress.java:60)
> > at org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(
> > MasterAddressTracker.java:63)
> > at org.apache.hadoop.hbase.client.HConnectionManager$
> > HConnectionImplementation.getMaster(HConnectionManager.java:354)
> > at org.apache.hadoop.hbase.client.HBaseAdmin.(HBaseAdmin.java:94)
> > at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:109)
> > ... 9 more
> >
> >
>


Re: Analysis Tool Not Working for CharFilterFactory?

2014-04-15 Thread Alexandre Rafalovitch
Which version of Solr? I think there was a bug in the UI. You can check network
traffic to confirm.
On 15/04/2014 5:32 pm, "Steve Huckle"  wrote:

>  I have used a CharFilterFactory in my schema.xml for fieldType
> text_general, so that queries for cafe and café return the same results. It
> works correctly. Here's the relevant part of my schema.xml:
>
> [fieldType definition stripped by the archive; the surviving attributes show
> positionIncrementGap="100" and two analyzer sections, each with a char filter
> mapping="mapping-ISOLatin1Accent.txt" and a stop filter words="stopwords.txt";
> the second section also has a synonym filter with ignoreCase="true" expand="true"]
>
> However, using the analysis tool within the admin ui, if I analyse
> text_general with any field values for index and query, the output for ST,
> SF and LCF are all empty. Is this a bug?
>
>
> --
> Steve Huckle
>
> If you print this email, eventually you'll want to throw it away. But there 
> is no away. So don't print this email, even if you have to.
>
>


Re: multiple analyzers for one field

2014-04-15 Thread Michael Sokolov
A blog post is a great idea, Alex!  I think I should wait until I have a 
complete end-to-end implementation done before I write about it though, 
because I'd also like to include some tips about configuring the new 
suggesters with Solr (the documentation on the wiki hasn't quite caught 
up yet, I think), and I don't have that working as I'd like just yet.  
But I will follow up with something soon; probably I will be able to 
share code on a public repo.


-Mike

On 04/14/2014 10:01 PM, Alexandre Rafalovitch wrote:

Hi Mike,

Glad I was able to help. Good note about the PoolingReuseStrategy, I
did not think of that either.

  Is there a blog post or a GitHub repository coming with more details
on that? Sounds like something others may benefit from as well.

Regards,
Alex.
P.s. If you don't have your own blog, I'll be happy to host such
article on mine.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Tue, Apr 15, 2014 at 8:52 AM, Michael Sokolov
 wrote:

I lost the original thread; sorry for the new / repeated topic, but thought
I would follow up to let y'all know that I ended up implementing Alex's idea
to implement an UpdateRequestProcessor in order to apply different analysis
to different fields when doing something analogous to copyFields.

It was pretty straightforward except that when there are multiple values, I
ended up needing multiple copies of the same Analyzer.  I had to implement a
new PoolingReuseStrategy for the Analyzer to handle this, which I hadn't
foreseen.

-Mike




Re: multiple analyzers for one field

2014-04-15 Thread Alexandre Rafalovitch
Your call, though from experience this sounds like either two or no blog
posts. I certainly have killed a bunch of good articles by waiting for
perfection :-)
On 15/04/2014 7:01 pm, "Michael Sokolov" 
wrote:

> A blog post is a great idea, Alex!  I think I should wait until I have a
> complete end-to-end implementation done before I write about it though,
> because I'd also like to include some tips about configuring the new
> suggesters with Solr (the documentation on the wiki hasn't quite caught up
> yet, I think), and I don't have that working as I'd like just yet.  But I
> will follow up with something soon; probably I will be able to share code
> on a public repo.
>
> -Mike
>
> On 04/14/2014 10:01 PM, Alexandre Rafalovitch wrote:
>
>> Hi Mike,
>>
>> Glad I was able to help. Good note about the PoolingReuseStrategy, I
>> did not think of that either.
>>
>>   Is there a blog post or a GitHub repository coming with more details
>> on that? Sounds like something others may benefit from as well.
>>
>> Regards,
>> Alex.
>> P.s. If you don't have your own blog, I'll be happy to host such
>> article on mine.
>>
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> proficiency
>>
>>
>> On Tue, Apr 15, 2014 at 8:52 AM, Michael Sokolov
>>  wrote:
>>
>>> I lost the original thread; sorry for the new / repeated topic, but
>>> thought
>>> I would follow up to let y'all know that I ended up implementing Alex's
>>> idea
>>> to implement an UpdateRequestProcessor in order to apply different
>>> analysis
>>> to different fields when doing something analogous to copyFields.
>>>
>>> It was pretty straightforward except that when there are multiple
>>> values, I
>>> ended up needing multiple copies of the same Analyzer.  I had to
>>> implement a
>>> new PoolingReuseStrategy for the Analyzer to handle this, which I hadn't
>>> foreseen.
>>>
>>> -Mike
>>>
>>
>


Re: Indexing Big Data With or Without Solr

2014-04-15 Thread Furkan KAMACI
Hi Vineet;

I've been using SolrCloud for this kind of Big Data and I think that you
should consider using it. If you have any problems you can ask them here.

Thanks;
Furkan KAMACI


2014-04-15 13:20 GMT+03:00 Vineet Mishra :

> Hi All,
>
> I have worked with Solr 3.5 to implement real time search on some 100GB
> data, that worked fine but was little slow on complex queries(Multiple
> group/joined queries).
> But now I want to index some real Big Data(around 4 TB or even more), can
> SolrCloud be solution for it if not what could be the best possible
> solution in this case.
>
> *Stats for the previous Implementation:*
> It was Master Slave Architecture with normal Standalone multiple instance
> of Solr 3.5. There were around 12 Solr instance running on different
> machines.
>
> *Things to consider for the next implementation:*
> Since all the data is sensor data hence it is the factor of duplicity and
> uniqueness.
>
> *Really urgent, please take the call on priority with set of feasible
> solution.*
>
> Regards
>


Bug within the solr query parser (version 4.7.1)

2014-04-15 Thread Johannes Siegert

Hi,

I have updated my Solr instance from 4.5.1 to 4.7.1. Now the parsed
query no longer seems to be correct.


Query: q=*:*&fq=title:T&E&debug=true

Before the update the parsed filter query was "+title:t&e +title:t
+title:e". After the update the parsed filter query is "+((title:t&e
title:t)/no_coord) +title:e". It seems like a bug within the query parser.


I have also validated the parsed filter query with the analysis
component. The result was "+title:t&e +title:t +title:e".


The behavior is the same for all special characters that split words into
two parts.


I use the following WordDelimiterFilter on the query side:

<filter class="solr.WordDelimiterFilterFactory" ...
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
        preserveOriginal="1"/>


Thanks.

Johannes


Additional information:

Debug before the update:

  rawquerystring: *:*
  querystring: *:*
  parsedquery: MatchAllDocsQuery(*:*)
  parsedquery_toString: *:*
  QParser: LuceneQParser
  filter_queries: (title:((T&E)))
  parsed_filter_queries: +title:t&e +title:t +title:e
...

Debug after the update:

  rawquerystring: *:*
  querystring: *:*
  parsedquery: MatchAllDocsQuery(*:*)
  parsedquery_toString: *:*
  QParser: LuceneQParser
  filter_queries: (title:((T&E)))
  parsed_filter_queries: +((title:t&e title:t)/no_coord) +title:e
...

"title"-field definition:

[fieldType definition stripped by the archive; the surviving attributes show
positionIncrementGap="100" omitNorms="true", two analyzer sections each with a
char filter mapping="mapping.txt", an index-side WordDelimiterFilter with
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
preserveOriginal="1" stemEnglishPossessive="0", and a query-side section with a
synonym filter synonyms="synonyms.txt" ignoreCase="true" expand="false" and the
WordDelimiterFilter quoted above]



clusterstate.json does not reflect current state of down versus active

2014-04-15 Thread Rich Mayfield
Solr 4.7.1

I am trying to orchestrate a fast restart of a SolrCloud (4.7.1) cluster. I was
hoping that clusterstate.json would reflect the up/down state of each
core as well as whether or not a given core was the leader.

clusterstate.json is not kept up to date with what I see going on in my
logs though - I see the leader election process play out. I would expect
that "state" would show "down" immediately for replicas on the node that I
have shut down.

Eventually, after about 30 minutes, all of the leader election processes
complete and clusterstate.json gets updated to the true state for each
replica.

Why does it take so long for clusterstate.json to reflect the correct
state? Is there a better way to determine the state of the system?

(In my case, each node has upwards of 1,000 1-shard collections. There are
two nodes in the cluster - each collection has 2 replicas.)

Thanks much.
rich


Re: clusterstate.json does not reflect current state of down versus active

2014-04-15 Thread Shawn Heisey
On 4/15/2014 8:58 AM, Rich Mayfield wrote:
> I am trying to orchestrate a fast restart of a SolrCloud (4.7.1). I was
> hoping to use clusterstate.json would reflect the up/down state of each
> core as well as whether or not a given core was leader.
>
> clusterstate.json is not kept up to date with what I see going on in my
> logs though - I see the leader election process play out. I would expect
> that "state" would show "down" immediately for replicas on the node that I
> have shut down.
>
> Eventually, after about 30 minutes, all of the leader election processes
> complete and clusterstate.json gets updated to the true state for each
> replica.
>
> Why does it take so long for clusterstate.json to reflect the correct
> state? Is there a better way to determine the state of the system?
>
> (In my case, each node has upwards of 1,000 1-shard collections. There are
> two nodes in the cluster - each collection has 2 replicas.)

First, I'll admit that my experience with SolrCloud is not as extensive
as my experience with non-cloud installs.  I do have a SolrCloud (4.2.1)
install, but it's the smallest possible redundant setup -- three
servers, two run Solr and Zookeeper, the third runs Zookeeper only.

What are you trying to achieve with your restart?  Can you just reload
the collections one by one instead?

Assuming that reloading isn't going to work for some reason (rebooting
for OS updates is one possibility), we need to determine why it takes so
long for a node to stabilize.

Here's a bunch of info about performance problems with Solr.  I wrote
it, so we can discuss it in depth if you like:

http://wiki.apache.org/solr/SolrPerformanceProblems

I have three possible suspicions for the root of your problem.  It is
likely to be one of them, but it could be a combination of any or all of
them.  Because this happens at startup, I don't think it's likely that
you're dealing with a GC problem caused by a very large heap.

1) The system is replaying 1000 transaction logs (possibly large, one
for each core) at startup, and also possibly initiating index recovery
using replication.  2) You don't have enough RAM to cache your index
effectively.  3) Your java heap is too small.

If your zookeeper ensemble does not use separate disks from your Solr
data (or separate servers), there could be an issue with zookeeper
client timeouts that's completely separate from any other problems.

I haven't addressed the fact that your cluster state doesn't update
quickly.  This might be a bug, but if we can deal with the slow
startup/stabilization first, then we can see whether there's anything
left to deal with on the cluster state.

Thanks,
Shawn



Re: Empty documents in Solr\lucene 3.6

2014-04-15 Thread Shawn Heisey
On 4/15/2014 9:41 AM, Alexey Kozhemiakin wrote:
> We've faced a strange data corruption issue with one of our clients old solr 
> setup (3.6).
>
> When we do a query (id:X OR id:Y) we get 2 nodes, one contains normal doc 
> data, another is empty ().
> We've looked inside lucene index using Luke - same story, one of documents is 
> empty.
> When we click on 1st document - it shows nothing.
> http://snag.gy/O5Lgq.jpg
>
>
> Probably files for stored data were corrupted? But luke index check says OK.
> Any clues how to troubleshoot root cause?

Do you know for sure that the index was OK at some point?  Do you know
what might have happened when it became not OK, like a system crash?

If you have Solr logs from whatever event caused the problem, we might
be able to figure it out ... but if you don't know when it happened or
you don't have logs, it might not be possible to know what happened. 
The document may have simply been indexed incorrectly.

Thanks,
Shawn



Empty documents in Solr\lucene 3.6

2014-04-15 Thread Alexey Kozhemiakin
Dear Community,

We've faced a strange data corruption issue with one of our clients old solr 
setup (3.6).

When we do a query (id:X OR id:Y) we get 2 nodes, one contains normal doc data, 
another is empty ().
We've looked inside the Lucene index using Luke - same story, one of the documents
is empty.
When we click on 1st document - it shows nothing.
http://snag.gy/O5Lgq.jpg


Probably files for stored data were corrupted? But luke index check says OK.
Any clues how to troubleshoot root cause?

Best regards,
Alexey



Race condition in Leader Election

2014-04-15 Thread Rich Mayfield
I see something similar where, given ~1000 shards, both nodes spend a LOT of 
time sorting through the leader election process. Roughly 30 minutes.

I too am wondering - if I force all leaders onto one node, then shut down both, 
then start up the node with all of the leaders on it first, then start up the 
other node, then I think I would have a much faster startup sequence.

Does that sound reasonable? And if so, is there a way to trigger the leader 
election process without taking the time to unload and recreate the shards?

> Hi
> 
>   When restarting a node in solrcloud, i run into scenarios where both the
> replicas for a shard get into "recovering" state and never come up causing
> the error "No servers hosting this shard". To fix this, I either unload one
> core or restart one of the nodes again so that one of them becomes the
> leader.
> 
> Is there a way to "force" leader election for a shard for solrcloud? Is
> there a way to break ties automatically (without restarting nodes) to make
> a node as the leader for the shard?
> 
> 
> Thanks
> Nitin


RE: Empty documents in Solr\lucene 3.6

2014-04-15 Thread Alexey Kozhemiakin
The system was up and running for a long time (months) without any updates.
There were no crashes for sure, at least the support team says so.
Logs indicate that at some point there was not enough disk space (caused by 
weekend index optimization).


Were there any other similar cases, or is it unique to us?


Alexey.
-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Tuesday, April 15, 2014 18:50
To: solr-user@lucene.apache.org
Subject: Re: Empty documents in Solr\lucene 3.6

Do you know for sure that the index was OK at some point?  Do you know what 
might have happened when it became not OK, like a system crash?

If you have Solr logs from whatever event caused the problem, we might be able 
to figure it out ... but if you don't know when it happened or you don't have 
logs, it might not be possible to know what happened. 
The document may have simply been indexed incorrectly.

Thanks,
Shawn



Re: Race condition in Leader Election

2014-04-15 Thread Mark Miller
We have to fix that then.

-- 
Mark Miller
about.me/markrmiller

On April 15, 2014 at 12:20:03 PM, Rich Mayfield (mayfield.r...@gmail.com) wrote:

I see something similar where, given ~1000 shards, both nodes spend a LOT of 
time sorting through the leader election process. Roughly 30 minutes.  

I too am wondering - if I force all leaders onto one node, then shut down both, 
then start up the node with all of the leaders on it first, then start up the 
other node, then I think I would have a much faster startup sequence.  

Does that sound reasonable? And if so, is there a way to trigger the leader 
election process without taking the time to unload and recreate the shards?  

> Hi  
>  
> When restarting a node in solrcloud, i run into scenarios where both the  
> replicas for a shard get into "recovering" state and never come up causing  
> the error "No servers hosting this shard". To fix this, I either unload one  
> core or restart one of the nodes again so that one of them becomes the  
> leader.  
>  
> Is there a way to "force" leader election for a shard for solrcloud? Is  
> there a way to break ties automatically (without restarting nodes) to make  
> a node as the leader for the shard?  
>  
>  
> Thanks  
> Nitin  


Re: Empty documents in Solr\lucene 3.6

2014-04-15 Thread Shawn Heisey
On 4/15/2014 10:22 AM, Alexey Kozhemiakin wrote:
> The system was up and running for long time(months) without any updates.
> There was no crashes for sure, at least support team says so.
> Logs indicate that at some point there was not enough disk space (caused by 
> weekend index optimization).

Software behavior becomes very difficult to define when a resource (RAM,
disk space, etc) is completely exhausted.  Even if Lucene's behavior is
well defined (which I think it might be -- the index itself is NOT
corrupt), Solr is another layer here, and I don't know whether its
behavior is well defined.  I suspect that it's not.  This might explain
what you're seeing.  That might be the only information you'll get, if
there's nothing else in the logs besides the inability to write to the disk.

Thanks,
Shawn



Re: What's the actual story with new morphline and hadoop contribs?

2014-04-15 Thread Wolfgang Hoschek
The solr morphline jars are integrated with solr by way of the solr specific 
solr/contrib/map-reduce module.

Ingestion from Flume into Solr is available here: 
http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink

FWIW, for our purposes we see no role for DataImportHandler anymore.

Wolfgang.

On Apr 15, 2014, at 6:01 AM, Alexandre Rafalovitch  wrote:

> The use case I keep thinking about is Flume/Morphline replacing
> DataImportHandler. So, when I saw morphline shipped with Solr, I tried
> to understand whether it is a step towards it.
> 
> As it is, I am still not sure I understand why those jars are shipped
> with Solr, if it is not actually integrating into Solr.
> 
> Regards,
>   Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr 
> proficiency
> 
> 
> On Mon, Apr 14, 2014 at 8:36 PM, Wolfgang Hoschek  
> wrote:
>> Currently all Solr morphline use cases I’m aware of run in processes outside 
>> of the Solr JVM, e.g. in Flume, in MapReduce, in HBase Lily Indexer, etc. 
>> These ingestion processes generate Solr documents for Solr updates. Running 
>> in external processes is done to improve scalability, reliability, 
>> flexibility and reusability. Not everything needs to run inside of the Solr 
>> JVM.
>> 
>> We haven’t found a use case for it so far, but it would be easy to add an 
>> UpdateRequestProcessor that runs a morphline inside of the Solr JVM.
>> 
>> Here is more background info:
>> 
>> http://kitesdk.org/docs/current/kite-morphlines/index.html
>> 
>> http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
>> 
>> http://files.meetup.com/5139282/SHUG10%20-%20Search%20On%20Hadoop.pdf
>> 
>> Wolfgang.
>> 
>> On Apr 14, 2014, at 2:26 PM, Alexandre Rafalovitch  
>> wrote:
>> 
>>> Hello,
>>> 
>>> I saw that 4.7.1 has morphline and hadoop contribution libraries, but
>>> I can't figure out the degree to which they are useful to _Solr_
>>> users. I found one hadoop example in the readme that does some sort
>>> injection into Solr. Is that the only use case supported?
>>> 
>>> I thought that maybe there is a UpdateRequestProcessor or Handler
>>> end-point or something that hooks into morphline to do
>>> similar/alternative work to DataImportHandler. But I can't see any
>>> entry points or examples for that.
>>> 
>>> Anybody knows what the story is and/or what the future holds?
>>> 
>>> Regards,
>>>   Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr 
>>> proficiency
>> 



Re: What is Overseer?

2014-04-15 Thread Chris Hostetter

: So, is Overseer really only an "implementation detail" or something that Solr
: Ops guys need to be very aware of?

Most people don't ever need to worry about the overseer - it's magic and 
it will take care of itself.  

The recent work on adding support for an "overseer role" in 4.7 was 
specifically for people who *want* to worry about it.

I've updated several places in the Solr ref guide to remove some
misleading claims about the overseer (some old docs equated it to running
embedded zookeeper) and add some more info to the glossary.

https://cwiki.apache.org/confluence/display/solr/Solr+Glossary#SolrGlossary-Overseer
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api15AddRole



-Hoss
http://www.lucidworks.com/


cache warming questions

2014-04-15 Thread Matt Kuiper
Hello,

I have a few questions regarding how Solr caches are warmed.

My understanding is that there are two ways to warm internal Solr caches (only 
one way for document cache and lucene FieldCache):

Auto warming - occurs when there is a current searcher handling requests and a
new searcher is being prepared. "When a new searcher is opened, its caches may
be prepopulated or "autowarmed" with cached objects from caches in the old
searcher. autowarmCount is the number of cached items that will be regenerated
in the new searcher." (http://wiki.apache.org/solr/SolrCaching#autowarmCount)

Explicit warming - where the static warming queries specified in Solrconfig.xml 
for newSearcher and firstSearcher listeners are executed when a new searcher is 
being prepared.
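
For reference, the two mechanisms above live in solrconfig.xml roughly as follows. This is only an illustrative sketch; the cache sizes and the warming query are made up:

    <!-- auto warming: up to 128 entries of the old searcher's filterCache are regenerated -->
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

    <!-- explicit warming: static queries run against every new searcher -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">popular query</str><str name="fq">category:books</str></lst>
      </arr>
    </listener>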

What does it mean that items will be regenerated or prepopulated from the 
current searcher's cache to the new searcher's cache?  I doubt it means copy, 
as the index has likely changed with a commit and possibly invalidated some 
contents of the cache.  Are the queries, or filters, that define the contents 
of the current caches re-executed for the new searcher's caches?

For the case where auto warming is configured, a current searcher is active,
and static warming queries are defined, how do auto warming and explicit
warming work together? Or do they? Is only one type of warming activated to
fill the caches?

Thanks,
Matt


Re: Question regarding solrj

2014-04-15 Thread Prashant Golash
Sorry for not replying!!!
It was a wrong version of solrj that the client was using (as it was third-party
code, we couldn't find that out earlier). After fixing the version, things seem
to be working fine.

Thanks for your response!!!


On Sun, Apr 13, 2014 at 7:26 PM, Erick Erickson wrote:

> You say "I can't change the client". What is the client written in?
> What does it expect? Does it use the same version of SolrJ?
>
> Best,
> Erick
>
> On Sun, Apr 13, 2014 at 6:40 AM, Prashant Golash
>  wrote:
> > Thanks for your feedback. Following are some more details
> >
> > Version of solr : 4.3.0
> > Version of solrj : 4.3.0
> >
> > The way I am returning response to client:
> >
> >
> > Request Holder is the object containing post process request from client
> > (After renaming few of the fields, and internal to external mapping of
> the
> > fields)
> >
> > **
> >
> > WS.WSRequestHolder requestHolder = WS.url(url);
> > // requestHolder processing of few fields
> > return requestHolder.get().map(
> > new F.Function() {
> > @Override
> > public Result apply(WS.Response response)
> > throws Throwable {
> > System.out.println("Response header:
> "
> > + response.getHeader("Content-Type"));
> > System.out.println("Response: " +
> > response.getBody());
> > *return
> > ok(response.asByteArray()).as(response.getHeader("Content-Type"));*
> > }
> > }
> > );
> >
> > Thanks,
> > Prashant
> >
> >
> > On Sun, Apr 13, 2014 at 3:35 AM, Furkan KAMACI  >wrote:
> >
> >> Hi;
> >>
> >> If you had a chance to change the code at client side I would suggest to
> >> try that:
> >>
> >>
> http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html#setParser(org.apache.solr.client.solrj.ResponseParser)
> >> There
> >> maybe a problem about character encoding of your Play App and here is
> the
> >> information:
> >>
> >> Javabin is a custom binary format used to write out Solr's response in a
> >> fast and efficient manner. As of Solr 3.1, the JavaBin format has
> changed
> >> to version 2. Version 2 serializes strings differently: instead of
> writing
> >> the number of UTF-16 characters followed by the bytes in Modified UTF-8
> it
> >> writes the number of UTF-8 bytes followed by the bytes in UTF-8.
> >>
> >> Which version of Solr and Solrj do you use respectively? On the other
> hand
> >> if you give us more information I can help you because there may be any
> >> other interesting thing as like here:
> >> https://issues.apache.org/jira/browse/SOLR-5744
> >>
> >> Thanks;
> >> Furkan KAMACI
> >>
> >>
> >> 2014-04-12 22:18 GMT+03:00 Prashant Golash :
> >>
> >> > Hi Solr Gurus,
> >> >
> >> > I have some doubt related to solrj client.
> >> >
> >> > My scenario is like this:
> >> >
> >> >- There is a proxy server (Play App) which internally queries solr.
> >> >- The proxy server is called from client side, which uses Solrj
> >> library.
> >> >The issue is that I can't change client code. I can only change
> >> >configurations to call different servers, hence I need to use
> SolrJ.
> >> >- Results are successfully returned from my play app in
> >> > *java-bin*format without modify them, but on client side, I am
> >> > receiving this
> >> >exception:
> >> >
> >> > Caused by: java.lang.NullPointerException
> >> > * at
> >> >
> >> >
> >>
> org.apache.solr.common.util.JavaBinCodec.readExternString(JavaBinCodec.java:689)*
> >> > * at
> >> >
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)*
> >> > * at
> >> >
> >>
> org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)*
> >> > * at
> >> >
> >> >
> >>
> org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)*
> >> > * at
> >> >
> >> >
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)*
> >> > * at
> >> >
> >> >
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)*
> >> > * at
> >> >
> >> >
> >>
> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)*
> >> > * at
> org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:310)*
> >> > * at
> >> >
> >> >
> >>
> com.ibm.commerce.foundation.internal.server.services.search.util.SearchQueryHelper.query(SearchQueryHelper.java:125)*
> >> > * at
> >> >
> >> >
> >>
> com.ibm.commerce.foundation.server.services.rest.search.processor.solr.SolrRESTSearchExpressionProcessor.performSearch(SolrRESTSearchExpressionProcessor.java:506)*
> >> > * at
> >> >
> >> >
> >>
> com.ibm.commerce.foundation.server.services.search.SearchServiceFacade.performSearch(SearchS*
> >> > erviceFacade.java:193)
> >> >
> >> > I am not sure, if this exception is relate

Distributed commits in CloudSolrServer

2014-04-15 Thread Peter Keegan
I have a SolrCloud index, 1 shard, with a leader and one replica, and 3
ZKs. The Solr indexes are behind a load balancer. There is one
CloudSolrServer client updating the indexes. The index schema includes 3
ExternalFileFields. When the CloudSolrServer client issues a hard commit, I
observe that the commits occur sequentially, not in parallel, on the leader
and replica. The duration of each commit is about a minute. Most of this
time is spent reloading the 3 ExternalFileField files. Because of the
sequential commits, there is a period of time (1 minute+) when the index
searchers will return different results, which can cause a bad user
experience. This will get worse as replicas are added to handle
auto-scaling. The goal is to keep all replicas in sync w.r.t. the user
queries.

My questions:

1. Is there a reason that the distributed commits are done in sequence, not
in parallel? Is there a way to change this behavior?

2. If instead, the commits were done in parallel by a separate client via a
GET to each Solr instance, how would this client get the host/port values
for each Solr instance from zookeeper? Are there any downsides to doing
commits this way?

Thanks,
Peter


Re: multiple analyzers for one field

2014-04-15 Thread Michael Sokolov

Ha! You were right.  Thanks for the nudge; here's my post:

http://blog.safariflow.com/2014/04/15/search-suggestions-with-solr-2/

there's code at http://github.com/safarijv/ifpress-solr-plugin

cheers

-Mike

On 04/15/2014 08:18 AM, Alexandre Rafalovitch wrote:

Your call, though from experience thus sounds like either two or no blog
posts. I certainly have killed a bunch of good articles by waiting for
perfection:-)
On 15/04/2014 7:01 pm, "Michael Sokolov" 
wrote:


A blog post is a great idea, Alex!  I think I should wait until I have a
complete end-to-end implementation done before I write about it though,
because I'd also like to include some tips about configuring the new
suggesters with Solr (the documentation on the wiki hasn't quite caught up
yet, I think), and I don't have that working as I'd like just yet.  But I
will follow up with something soon; probably I will be able to share code
on a public repo.

-Mike

On 04/14/2014 10:01 PM, Alexandre Rafalovitch wrote:


Hi Mike,

Glad I was able to help. Good note about the PoolingReuseStrategy, I
did not think of that either.

   Is there a blog post or a GitHub repository coming with more details
on that? Sounds like something others may benefit from as well.

Regards,
 Alex.
P.s. If you don't have your own blog, I'll be happy to host such
article on mine.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency


On Tue, Apr 15, 2014 at 8:52 AM, Michael Sokolov
 wrote:


I lost the original thread; sorry for the new / repeated topic, but
thought
I would follow up to let y'all know that I ended up implementing Alex's
idea
to implement an UpdateRequestProcessor in order to apply different
analysis
to different fields when doing something analogous to copyFields.

It was pretty straightforward except that when there are multiple
values, I
ended up needing multiple copies of the same Analyzer.  I had to
implement a
new PoolingReuseStrategy for the Analyzer to handle this, which I hadn't
foreseen.

-Mike





Re: Empty documents in Solr\lucene 3.6

2014-04-15 Thread Dmitry Kan
Alexey,

1. Can you take a backup of the index and run the index checker with the -fix
option (see the sketch below)? Will it modify the index at all?
2. Are all the missing fields configured as stored? Are they marked as
required in the schema or optional?
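
For reference on point 1, a sketch of running the checker programmatically against a Lucene 3.6 index. The org.apache.lucene.index.CheckIndex tool can also be run from the command line with the index directory and -fix; the path below is a placeholder, and -fix drops unrecoverable segments, so always work on a copy:

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexCheck {
        public static void main(String[] args) throws Exception {
            // Point this at a *copy* of the index directory.
            Directory dir = FSDirectory.open(new File("/path/to/index-copy"));
            CheckIndex checker = new CheckIndex(dir);
            CheckIndex.Status status = checker.checkIndex();
            System.out.println("index clean? " + status.clean);
            if (!status.clean) {
                // Removes broken segments; any documents in them are lost.
                checker.fixIndex(status);
            }
            dir.close();
        }
    }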

Dmitry


On Tue, Apr 15, 2014 at 7:22 PM, Alexey Kozhemiakin <
alexey_kozhemia...@epam.com> wrote:

> The system was up and running for long time(months) without any updates.
> There was no crashes for sure, at least support team says so.
> Logs indicate that at some point there was not enough disk space (caused
> by weekend index optimization).
>
>
> Were there any other similar cases or it's unique for us?
>
>
> Alexey.
> -Original Message-
> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: Tuesday, April 15, 2014 18:50
> To: solr-user@lucene.apache.org
> Subject: Re: Empty documents in Solr\lucene 3.6
>
> Do you know for sure that the index was OK at some point?  Do you know
> what might have happened when it became not OK, like a system crash?
>
> If you have Solr logs from whatever event caused the problem, we might be
> able to figure it out ... but if you don't know when it happened or you
> don't have logs, it might not be possible to know what happened.
> The document may have simply been indexed incorrectly.
>
> Thanks,
> Shawn
>
>


-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: What is Overseer?

2014-04-15 Thread Jack Krupansky
I should have suggested three levels in my question: 1) important to average 
users, 2) expert-only, and 3) internal implementation detail. Yes, 
expert-only does have a place, but it is good to mark features as such.


-- Jack Krupansky

-Original Message- 
From: Chris Hostetter

Sent: Tuesday, April 15, 2014 1:48 PM
To: solr-user@lucene.apache.org
Subject: Re: What is Overseer?


: So, is Overseer really only an "implementation detail" or something that 
Solr

: Ops guys need to be very aware of?

Most people don't ever need to worry about the overseer - it's magic and
it will take care of itself.

The recent work on adding support for an "overseer role" in 4.7 was
specifically for people who *want* to worry about it.

I've updated several places in the solr ref guide to remove some
misleading claims about the overseer (some old docs equated it to running
embedded zookeeper) and add some more info to the glossary.

https://cwiki.apache.org/confluence/display/solr/Solr+Glossary#SolrGlossary-Overseer
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api15AddRole



-Hoss
http://www.lucidworks.com/ 



Re: Distributed commits in CloudSolrServer

2014-04-15 Thread Mark Miller
Inline responses below.
-- 
Mark Miller
about.me/markrmiller

On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com) wrote:

I have a SolrCloud index, 1 shard, with a leader and one replica, and 3 
ZKs. The Solr indexes are behind a load balancer. There is one 
CloudSolrServer client updating the indexes. The index schema includes 3 
ExternalFileFields. When the CloudSolrServer client issues a hard commit, I 
observe that the commits occur sequentially, not in parallel, on the leader 
and replica. The duration of each commit is about a minute. Most of this 
time is spent reloading the 3 ExternalFileField files. Because of the 
sequential commits, there is a period of time (1 minute+) when the index 
searchers will return different results, which can cause a bad user 
experience. This will get worse as replicas are added to handle 
auto-scaling. The goal is to keep all replicas in sync w.r.t. the user 
queries. 

My questions: 

1. Is there a reason that the distributed commits are done in sequence, not 
in parallel? Is there a way to change this behavior? 


The reason is that updates are currently done this way - it’s the only safe way 
to do it without solving some more problems. I don’t think you can easily 
change this. I think we should probably file a JIRA issue to track a better 
solution for commit handling. I think there are some complications because of 
how commits can be added on update requests, but it's something we probably want 
to try and solve before tackling *all* updates to replicas in parallel with the 
leader.



2. If instead, the commits were done in parallel by a separate client via a 
GET to each Solr instance, how would this client get the host/port values 
for each Solr instance from zookeeper? Are there any downsides to doing 
commits this way? 

Not really, other than the extra management.
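
On the second question, a sketch of how a separate client could read the replica addresses from ZooKeeper through SolrJ and send a commit to each core directly. The ZooKeeper address and collection name are placeholders, and note that in SolrCloud an explicit commit may itself be forwarded to other replicas by the update chain, so the exact behaviour is worth verifying:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class CommitAllReplicas {
        public static void main(String[] args) throws Exception {
            CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            cloud.connect();
            ClusterState state = cloud.getZkStateReader().getClusterState();
            for (Slice slice : state.getSlices("collection1")) {       // placeholder collection
                for (Replica replica : slice.getReplicas()) {
                    // base_url is e.g. http://host:8983/solr, core is the core name
                    String url = replica.getStr(ZkStateReader.BASE_URL_PROP)
                            + "/" + replica.getStr(ZkStateReader.CORE_NAME_PROP);
                    HttpSolrServer core = new HttpSolrServer(url);
                    core.commit();      // could be issued from its own thread to run in parallel
                    core.shutdown();
                }
            }
            cloud.shutdown();
        }
    }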





Thanks, 
Peter 


Transformation on a numeric field

2014-04-15 Thread Jean-Sebastien Vachon
Hi All,

I am looking for a way to index a numeric field and its value divided by 1 000 
into another numeric field.
I thought about using a CopyField with a PatternReplaceFilterFactory to keep 
only the first few digits (cutting the last three).

Solr complains that I can not have an analysis chain on a numeric field:

Core: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Plugin init failure for [schema.xml] fieldType "truncated_salary": FieldType: 
TrieIntField (truncated_salary) does not support specifying an analyzer. Schema 
file is /data/solr/solr-no-cloud/Core1/schema.xml


Is there a way to accomplish this ?

Thanks


Re: Transformation on a numeric field

2014-04-15 Thread Rafał Kuć
Hello!

You can achieve that using an update processor; for example, look here:
http://wiki.apache.org/solr/ScriptUpdateProcessor

What you would have to do, in general, is create a script that takes the
value of the field, divides it by 1000 and puts it in another
field, the target numeric field.
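
A minimal sketch of such a script, as it might be wired up through StatelessScriptUpdateProcessorFactory in an updateRequestProcessorChain; the field names "salary" and "truncated_salary" are only examples:

    // update-script.js: copy value/1000 into another field at index time
    function processAdd(cmd) {
      var doc = cmd.solrDoc;                        // org.apache.solr.common.SolrInputDocument
      var value = doc.getFieldValue("salary");      // example source field
      if (value != null) {
        // depending on the target field type, an explicit integer conversion may be needed
        doc.setField("truncated_salary", Math.floor(value / 1000));
      }
    }
    function processDelete(cmd) { }
    function processMergeIndexes(cmd) { }
    function processCommit(cmd) { }
    function processRollback(cmd) { }
    function finish() { }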

-- 
Regards,
 Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


> Hi All,

> I am looking for a way to index a numeric field and its value
> divided by 1 000 into another numeric field.
> I thought about using a CopyField with a
> PatternReplaceFilterFactory to keep only the first few digits (cutting the 
> last three).

> Solr complains that I can not have an analysis chain on a numeric field:

> Core:
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
> Plugin init failure for [schema.xml] fieldType "truncated_salary":
> FieldType: TrieIntField (truncated_salary) does not support
> specifying an analyzer. Schema file is
> /data/solr/solr-no-cloud/Core1/schema.xml


> Is there a way to accomplish this ?

> Thanks



Re: Transformation on a numeric field

2014-04-15 Thread Jack Krupansky
You can use an update processor. The stateless script update processor will 
let you write arbitrary JavaScript code, which can do this calculation.


You should be able to figure it  out from the wiki:
http://wiki.apache.org/solr/ScriptUpdateProcessor

My e-book has plenty of script examples for this processor as well.

We could also write a generic script that takes a source and destination 
field name and then does a specified operation on it, like adding an offset or 
multiplying by a scale factor.


-- Jack Krupansky

-Original Message- 
From: Jean-Sebastien Vachon

Sent: Tuesday, April 15, 2014 3:57 PM
To: 'solr-user@lucene.apache.org'
Subject: Transformation on a numeric field

Hi All,

I am looking for a way to index a numeric field and its value divided by 1 
000 into another numeric field.
I thought about using a CopyField with a PatternReplaceFilterFactory to keep 
only the first few digits (cutting the last three).


Solr complains that I can not have an analysis chain on a numeric field:

Core: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Plugin init failure for [schema.xml] fieldType "truncated_salary": 
FieldType: TrieIntField (truncated_salary) does not support specifying an 
analyzer. Schema file is /data/solr/solr-no-cloud/Core1/schema.xml



Is there a way to accomplish this ?

Thanks 



Odd extra character duplicates in spell checking

2014-04-15 Thread Ed Smiley
Hi,
I am going to make this question pretty short, so I don’t overwhelm with 
technical details until  the end.
I suspect that some folks may be seeing this issue without the particular 
configuration we are using.

What our problem is:

  1.  Correctly spelled words are returning as not spelled correctly, with the 
original, correctly spelled word with a single oddball character appended as 
multiple suggestions.
  2.  Incorrectly spelled words are returning correct spelling suggestions with 
a single oddball character appended as multiple suggestions.
  3.  We’re seeing this in Solr 4.5x and 4.7x.

Example:

The return values are all a single character (unicode shown in square brackets).

correction=attitude[2d]
correction=attitude[2f]
correction=attitude[2026]

Spurious characters:

  *   Unicode Character 'HYPHEN-MINUS' (U+002D)
  *   Unicode Character 'SOLIDUS' (U+002F)
  *   Unicode Character 'HORIZONTAL ELLIPSIS' (U+2026)

Anybody see anything like this?  Anybody fix something like this?

Thanks!
—Ed


OK, here’s the gory details:


What we are doing:
We have developed an application that returns "did you mean" spelling
alternatives against a specific (presumably misspelled) word.
We're using the vocabulary of indexed pages of a specified book as the source
of the alternatives, so this is not a general dictionary spell check; we are
returning only matching alternatives.
So when I say "correctly spelled" I mean they are words found on at least one
page.  We are using the collations so that we restrict ourselves to those
pages in one book.
We are having to check for and “fix up” these faulty results.  That’s not a 
robust or desirable solution.

We are using SolrJ to get the collations,
  private static final String DID_YOU_MEAN_REQUEST_HANDLER =
"/spell";
….
SolrQuery query = new SolrQuery(q);
query.set("spellcheck", true);
query.set(SpellingParams.SPELLCHECK_COUNT, 10);
query.set(SpellingParams.SPELLCHECK_COLLATE, true);
query.set(SpellingParams.SPELLCHECK_COLLATE_EXTENDED_RESULTS, true);
query.set("wt", "json");
query.setRequestHandler(DID_YOU_MEAN_REQUEST_HANDLER);
query.set("shards.qt", DID_YOU_MEAN_REQUEST_HANDLER);
query.set("shards.tolerant", "true");
etc……

but we can duplicate the behavior without SolrJ with the collations /
misspellingsAndCorrections below, e.g.:
solr/pg1/spell?q=+doc-id:(810500)+AND+attitudex&spellcheck=true&spellcheck.count=10&spellcheck.collate=true&spellcheck.collateExtendedResults=true&wt=json&qt=%2Fspell&shards.qt=%2Fspell&shards.tolerant=true


{"responseHeader":{"status":0,"QTime":60},"response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]},"spellcheck":{"suggestions":["attitudex",{"numFound":6,"startOffset":21,"endOffset":30,"origFreq":0,"suggestion":[{"word":"attitudes","freq":362486},{"word":"attitu
 dex","freq":4819},{"word":"atti tudex","freq":3254},{"word":"attit 
udex","freq":159},{"word":"attitude-","freq":1080},{"word":"attituden","freq":261}]},"correctlySpelled",false,"collation",["collationQuery","
 doc-id:(810500) AND 
attitude-","hits",2,"misspellingsAndCorrections",["attitudex","attitude-"]],"collation",["collationQuery","
 doc-id:(810500) AND 
attitude/","hits",2,"misspellingsAndCorrections",["attitudex","attitude/"]],"collation",["collationQuery","
 doc-id:(810500) AND 
attitude…","hits",2,"misspellingsAndCorrections",["attitudex","attitude…"]]]}}

The configuration is:





  text

  default

  wordbreak

  on

  true

  10

  5

  5

  true

  true

  10

  5

name="last-components">

  spellcheck



  




  wordbreak

  solr.WordBreakSolrSpellChecker

  text

  true

  true

  25

  3






  default

  text

  solr.DirectSolrSpellChecker

  internal

  0.2

  2

  1

  25

  4

  1
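
For reference, a spellcheck setup using the values above would typically be
laid out as below in solrconfig.xml; the element and parameter names here are
assumptions modeled on the stock Solr 4.x example configuration, not
necessarily what is in our actual file:

  <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="df">text</str>
      <str name="spellcheck.dictionary">default</str>
      <str name="spellcheck.dictionary">wordbreak</str>
      <str name="spellcheck">on</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.alternativeTermCount">5</str>
      <str name="spellcheck.maxResultsForSuggest">5</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.collateExtendedResults">true</str>
      <str name="spellcheck.maxCollationTries">10</str>
      <str name="spellcheck.maxCollations">5</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">wordbreak</str>
      <str name="classname">solr.WordBreakSolrSpellChecker</str>
      <str name="field">text</str>
      <str name="combineWords">true</str>
      <str name="breakWords">true</str>
      <int name="maxChanges">25</int>
      <int name="minBreakLength">3</int> <!-- parameter name assumed -->
    </lst>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">text</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="distanceMeasure">internal</str>
      <float name="accuracy">0.2</float>
      <int name="maxEdits">2</int>
      <int name="minPrefix">1</int>
      <int name="maxInspections">25</int>
      <int name="minQueryLength">4</int>
      <float name="maxQueryFrequency">1</float>
    </lst>
  </searchComponent>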



--

Ed Smiley, Senior Software Architect, eBooks
ProQuest | 161 E Evelyn Ave|
Mountain View, CA 94041 | USA |
+1 650 475 8700 extension 3772
ed.smi...@proquest.com
www.proquest.com | 
www.ebrary.com | www.eblib.com
ebrary and EBL, ProQuest businesses.


Re: svn vs GIT

2014-04-15 Thread Jeff Wartes

I guess I should've double-checked it was still the case before saying
anything, but I'm glad to be proven wrong.
Yes, it worked nicely for me when I tried today, which should simplify my
life a bit.
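
Concretely, the workflow discussed below boils down to something like this
(using the github mirror; the exact clone URL is my assumption):

  git clone https://github.com/apache/lucene-solr.git
  cd lucene-solr
  ant compile          # works from a git working copy
  ant -f solr dist
  ant test
  ant precommit        # works too, apart from the -check-svn-working-copy step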


On 4/14/14, 4:35 PM, "Shawn Heisey"  wrote:

>On 4/14/2014 12:56 PM, Ramkumar R. Aiyengar wrote:
>> ant compile / ant -f solr dist / ant test certainly work, I use them
>>with a
>> git working copy. You trying something else?
>> On 14 Apr 2014 19:36, "Jeff Wartes"  wrote:
>>
>>> I vastly prefer git, but last I checked, (admittedly, some time ago)
>>>you
>>> couldn't build the project from the git clone. Some of the build
>>>scripts
>>> assumed some svn commands will work.
>
>The nightly-smoke build target uses svn.  There is a related smoketest
>script that uses provided URL parameters (or svn if it's a checkout from
>svn and the parameters are not supplied) to obtain artifacts for
>testing.  This may not be the only build target that uses facilities not
>available from git, but it's the only one that I know about for sure.
>
>Ordinary people should be able to use repositories cloned from the
>git.apache.org or github mirrors with no problem if they are not using
>exotic build targets or build scripts.
>
>When I tried 'ant precommit' it worked, but it did say at least once in
>what scrolled by that this was not an SVN checkout, so the
>'-check-svn-working-copy' build target (which is part of precommit)
>didn't work.
>
>Thanks,
>Shawn
>



Re: cache warming questions

2014-04-15 Thread Erick Erickson
bq: What does it mean that items will be regenerated or prepopulated
from the current searcher's cache...

You're right, the values aren't cached. They can't be since the
internal Lucene document id is used to identify docs, and due to
merging the internal ID may bear no relation to the old internal ID
for a particular document.

I find it useful to think of Solr's caches as a  map where the key is
the "query" and the value is some representation of the found
documents. The details of the value don't matter, so I'll skip them.

What matters is the key. Consider the filter cache. You put something
like &fq=price:[0 TO 100] on a URL. Solr then uses the fq  clause as
the key to the filterCache.

Here's the sneaky bit: when you specify an autowarm count of N for the
filterCache and a new searcher is opened, the first N keys from the
map are re-executed in the new searcher's context and the results are put
into the new searcher's filterCache.

bq:  ...how does auto warming and explicit warming work together?

They're orthogonal. IOW, the autowarming for each cache is executed as
well as the newSearcher static warming queries. Use the static queries
to do things like fill the sort caches etc.

Incidentally, this bears on why there's a "firstSearcher" and
"newSearcher". The newSearcher queries are run in addition to the
cache autowarms. firstSearcher static queries are only run when a Solr
server is started the first time, and there are no cache entries to
autowarm. So the firstSearcher queries might be quite a bit more
complex than newSearcher queries.
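
As a rough solrconfig.xml sketch (the cache sizes, fields and queries here are
made up, not a recommendation):

  <filterCache class="solr.FastLRUCache"
               size="512" initialSize="512" autowarmCount="128"/>

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- cheap query that fills the sort cache -->
      <lst><str name="q">*:*</str><str name="sort">price asc</str></lst>
    </arr>
  </listener>

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- heavier query, only runs at first startup when there is nothing to autowarm -->
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">category</str>
      </lst>
    </arr>
  </listener>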

HTH,
Erick

On Tue, Apr 15, 2014 at 1:55 PM, Matt Kuiper  wrote:
> Hello,
>
> I have a few questions regarding how Solr caches are warmed.
>
> My understanding is that there are two ways to warm internal Solr caches 
> (only one way for document cache and lucene FieldCache):
>
> Auto warming - occurs when there is a current searcher handling requests and 
> new searcher is being prepared.  "When a new searcher is opened, its caches 
> may be prepopulated or "autowarmed" with cached object from caches in the old 
> searcher. autowarmCount is the number of cached items that will be 
> regenerated in the new searcher."
> http://wiki.apache.org/solr/SolrCaching#autowarmCount
>
> Explicit warming - where the static warming queries specified in 
> Solrconfig.xml for newSearcher and firstSearcher listeners are executed when 
> a new searcher is being prepared.
>
> What does it mean that items will be regenerated or prepopulated from the 
> current searcher's cache to the new searcher's cache?  I doubt it means copy, 
> as the index has likely changed with a commit and possibly invalidated some 
> contents of the cache.  Are the queries, or filters, that define the contents 
> of the current caches re-executed for the new searcher's caches?
>
> For the case where auto warming is configured, a current searcher is active, 
> and static warming queries are defined how does auto warming and explicit 
> warming work together? Or do they?  Is only one type of warming activated to 
> fill the caches?
>
> Thanks,
> Matt


Re: More Robust Search Timeouts (to Kill Zombie Queries)?

2014-04-15 Thread Steve Davids
I have also experienced a similar problem on our cluster, so I went ahead and 
opened SOLR-5986 to track the issue. I know Apache Blur has implemented a 
mechanism to kill these long-running term enumerations; it would be fantastic if 
Solr could get a similar mechanism.

-Steve

On Apr 15, 2014, at 5:23 AM, Salman Akram  
wrote:

> Looking at this, sharding seems to be best and simple option to handle such
> queries.
> 
> 
> On Wed, Apr 2, 2014 at 1:26 AM, Mikhail Khludnev > wrote:
> 
>> Hello Salman,
>> Let's me drop few thoughts on
>> 
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
>> 
>> There are two aspects of this question:
>> 1. dealing with long running processing (thread divergence actions
>> http://docs.oracle.com/javase/specs/jls/se5.0/html/memory.html#65310) and
>> 2. the actual time checking.
>> "Terminating" or "aborting" a thread (2.) is just a way of tracking time
>> externally and sending interrupt(), which the thread should react to - which
>> they don't do now - so we are back to the core issue (1.)
>> 
>> Solr's timeAllowed is the proper way to handle these things; the only
>> problem is that it expects that only the core search is long running, but in
>> your case rewriting MultiTermQuery-s takes a huge amount of time.
>> Let's consider this problem. First of all, MultiTermQuery.rewrite() is nearly
>> a design issue: after a heavy rewrite occurs, the result is thrown away once
>> the search is done. I think the most straightforward way to address this
>> issue is by caching these expensive queries. Solr does it well with
>> http://wiki.apache.org/solr/CommonQueryParameters#fq but only for
>> http://en.wikipedia.org/wiki/Conjunctive_normal_form like queries; there is
>> a workaround that allows caching disjunction legs, see
>> http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html
>> If you still want to run expensively rewritten queries you need to
>> implement a timeout check (similar to TimeLimitingCollector) for the TermsEnum
>> returned from MultiTermQuery.getTermsEnum(); wrapping the actual TermsEnum
>> is the good way to do it. To inject the time-limiting wrapper TermsEnum into
>> queries, you might consider overriding methods like
>> SolrQueryParserBase.newWildcardQuery(Term) or post-processing the query tree
>> after parsing.
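>> 
>> A rough sketch of such a wrapper (nothing like this ships in Lucene/Solr
>> out of the box; the class name and exception handling are made up for
>> illustration):
>> 
>>   import java.io.IOException;
>>   import org.apache.lucene.index.FilterAtomicReader.FilterTermsEnum;
>>   import org.apache.lucene.index.TermsEnum;
>>   import org.apache.lucene.util.BytesRef;
>> 
>>   public class TimeLimitingTermsEnum extends FilterTermsEnum {
>>     private final long deadline;
>> 
>>     public TimeLimitingTermsEnum(TermsEnum in, long timeoutMillis) {
>>       super(in);
>>       this.deadline = System.currentTimeMillis() + timeoutMillis;
>>     }
>> 
>>     @Override
>>     public BytesRef next() throws IOException {
>>       // checked on every enumerated term, so a runaway rewrite is cut short
>>       if (System.currentTimeMillis() > deadline) {
>>         throw new RuntimeException("term enumeration exceeded time limit");
>>       }
>>       return super.next();
>>     }
>>   }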
>> 
>> 
>> 
>> On Mon, Mar 31, 2014 at 2:24 PM, Salman Akram <
>> salman.ak...@northbaysolutions.net> wrote:
>> 
>>> Anyone?
>>> 
>>> 
>>> On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram <
>>> salman.ak...@northbaysolutions.net> wrote:
>>> 
 With reference to this thread<
>>> 
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E
>>> I
>>> wanted to know if there was any response to that or if Chris Harris
 himself can comment on what he ended up doing, that would be great!
 
 
 --
 Regards,
 
 Salman Akram
 
 
>>> 
>>> 
>>> --
>>> Regards,
>>> 
>>> Salman Akram
>>> 
>> 
>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Regards,
> 
> Salman Akram



Tipping point of solr shards (Num of docs / size)

2014-04-15 Thread Mukesh Jha
Hi Gurus,

In my Solr cluster I've got multiple shards, each shard containing
~500,000,000 documents, with the total index size being ~1 TB.

I was just wondering how much more I can keep on adding to a shard before
we reach a tipping point and the performance starts to degrade.

Also, as a best practice, what is the recommended number of docs / size per shard?

Txz in advance :)

-- 
Thanks & Regards,

*Mukesh Jha *


Re: deleting large amount data from solr cloud

2014-04-15 Thread Vinay Pothnis
Another update:

I removed the replicas - to avoid the replication doing a full copy. I am now
able to delete sizeable chunks of data.
But the overall index size remains the same even after the deletes. It does
not seem to go down.

I understand that Solr would do this in the background - but I don't see a
decrease in overall index size even after 1-2 hours.
I can see a bunch of ".del" files in the index directory, but they do not
seem to get cleaned up. Is there any way to monitor/follow the progress
of index compaction?

Also, does triggering "optimize" from the admin UI help to compact the
index size on disk?
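
For example (a rough SolrJ sketch of what I mean - the host/core names are
made up; a commit with expungeDeletes=true would presumably be the
lighter-weight alternative to a full optimize):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  SolrServer solr = new HttpSolrServer("http://host1:8983/solr/core1_shard1_replica2");
  // merge down to one segment, dropping the documents flagged in the .del files
  solr.optimize(true, true, 1);   // waitFlush, waitSearcher, maxSegments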

Thanks
Vinay


On 14 April 2014 12:19, Vinay Pothnis  wrote:

> Some update:
>
> I removed the auto warm configurations for the various caches and reduced
> the cache sizes. I then issued a call to delete a day's worth of data (800K
> documents).
>
> There was no out of memory this time - but some of the nodes went into
> recovery mode. Was able to catch some logs this time around and this is
> what i see:
>
> 
> *WARN  [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync]
> PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr
>  too many updates received since start -
> startingUpdates no longer overlaps with our currentUpdates*
> *INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
> PeerSync Recovery was not successful - trying replication.
> core=core1_shard1_replica2*
> *INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
> Starting Replication Recovery. core=core1_shard1_replica2*
> *INFO  [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy]
> Begin buffering updates. core=core1_shard1_replica2*
> *INFO  [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy]
> Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/
> . core=core1_shard1_replica2*
> *INFO  [2014-04-14 18:11:00.536]
> [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http
> client,
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false*
> *INFO  [2014-04-14 18:11:01.964]
> [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http
> client,
> config:connTimeout=5000&socketTimeout=2&allowCompression=false&maxConnections=1&maxConnectionsPerHost=1*
> *INFO  [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller]  No
> value set for 'pollInterval'. Timer Task not started.*
> *INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
> Master's generation: 1108645*
> *INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
> Slave's generation: 1108627*
> *INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
> Starting replication process*
> *INFO  [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller]
> Number of files in latest index in master: 814*
> *INFO  [2014-04-14 18:11:02.007]
> [org.apache.solr.core.CachingDirectoryFactory] return new directory for
> /opt/data/solr/core1_shard1_replica2/data/index.20140414181102007*
> *INFO  [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller]
> Starting download to
> NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
> lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe;
> maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true*
>
> 
>
>
> So, it looks like the number of updates is too huge for the regular
> replication and then it goes into full copy of index. And since our index
> size is very huge (350G), this is causing the cluster to go into recovery
> mode forever - trying to copy that huge index.
>
> I also read in some thread
> http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html
> that there is a limit of 100 documents.
>
> I wonder if this has been updated to make that configurable since that
> thread. If not, the only option I see is to do a "trickle" delete of 100
> documents per second or something.
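>
> Roughly what I have in mind (a sketch only - the batch size, pause, and
> host/core names are arbitrary):
>
>   import java.util.List;
>   import org.apache.solr.client.solrj.SolrServer;
>   import org.apache.solr.client.solrj.impl.HttpSolrServer;
>
>   public class TrickleDelete {
>     // delete ids in small batches with a pause, so the replicas' peer sync
>     // can keep up and we never fall back to a full index copy
>     static void trickleDelete(SolrServer solr, List<String> ids) throws Exception {
>       int batchSize = 100;
>       for (int i = 0; i < ids.size(); i += batchSize) {
>         solr.deleteById(ids.subList(i, Math.min(i + batchSize, ids.size())));
>         solr.commit();           // keep the transaction log small
>         Thread.sleep(1000);      // ~100 docs per second
>       }
>     }
>   }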
>
> Also - the other suggestion of using "distributed=false" might not help
> because the issue currently is that the replication is going to "full copy".
>
> Any thoughts?
>
> Thanks
> Vinay
>
>
>
>
>
>
>
> On 14 April 2014 07:54, Vinay Pothnis  wrote:
>
>> Yes, that is our approach. We did try deleting a day's worth of data at a
>> time, and that resulted in OOM as well.
>>
>> Thanks
>> Vinay
>>
>>
>> On 14 April 2014 00:27, Furkan KAMACI  wrote:
>>
>>> Hi;
>>>
>>> I mean you can divide the range (i.e. one week at each delete instead of
>>> one month) and try to check whether you still get an OOM or not.
>>>
>>> Thanks;
>>> Furkan KAMACI
>>>
>>>
>>> 2014-04-14 7:09 GMT+03:00 Vinay Pothnis :
>>>
>>> > Aman,
>>> > Yes - Will do!
>>> >
>>> > Furkan,
>>> > How do you mean by 'bulk delete'?
>>> >
>>> > -Thanks
>>> > Vinay
>>> >
>>> >
>>>

Re: Tipping point of solr shards (Num of docs / size)

2014-04-15 Thread Vinay Pothnis
You could look at this link to understand the factors that affect
SolrCloud performance: http://wiki.apache.org/solr/SolrPerformanceProblems

Especially the sections about RAM and disk cache. If the index grows too
big for one node, it can lead to performance issues. From the looks of it,
500 million docs per shard may already be pushing it. How much does that
translate to in terms of index size on disk per shard?

-vinay


On 15 April 2014 21:44, Mukesh Jha  wrote:

> Hi Gurus,
>
> In my solr cluster I've multiple shards and each shard containing
> ~500,000,000 documents total index size being ~1 TB.
>
> I was just wondering how much more can I keep on adding to the shard before
> we reach a tipping point and the performance starts to degrade?
>
> Also as best practice what is the recomended no of docs / size of shards .
>
> Txz in advance :)
>
> --
> Thanks & Regards,
>
> *Mukesh Jha *
>


Re: Tipping point of solr shards (Num of docs / size)

2014-04-15 Thread Mukesh Jha
My index size per shard varies between 250 GB and 1 TB.
The cluster is performing well even now, but I thought it's high time to
change it so that a shard doesn't get too big.


On Wed, Apr 16, 2014 at 10:25 AM, Vinay Pothnis  wrote:

> You could look at this link to understand about the factors that affect the
> solrcloud performance: http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Especially, the sections about RAM and disk cache. If the index grows too
> big for one node, it can lead to performance issues. From the looks of it,
> 500mil docs per shard - may be already pushing it. How much does that
> translate to in terms of index size on disk per shard?
>
> -vinay
>
>
> On 15 April 2014 21:44, Mukesh Jha  wrote:
>
> > Hi Gurus,
> >
> > In my solr cluster I've multiple shards and each shard containing
> > ~500,000,000 documents total index size being ~1 TB.
> >
> > I was just wondering how much more can I keep on adding to the shard
> before
> > we reach a tipping point and the performance starts to degrade?
> >
> > Also as best practice what is the recomended no of docs / size of shards
> .
> >
> > Txz in advance :)
> >
> > --
> > Thanks & Regards,
> >
> > *Mukesh Jha *
> >
>



-- 


Thanks & Regards,

*Mukesh Jha *