How can I pass in query request parameter at search time and know of it in my query analyzer/tokenizer?

2014-09-30 Thread Michael
Hi all,

I'm using Solr 4.7.2 to implement multilingual search in my application.

I have a need to pass in query locale on search request and to choose
between custom tokenizers dynamically based on provided locale value.

In Solr In Action - Chapter 14 (Multilingual Search), Listing 14.9 -
*Indexing and querying multiple languages within the same field*
the MultiFieldTextTokenizer allows specifying the query languages as a prefix on the
terms that go into the analyzer. For example: q=en,fr,es|abandon AND
en,fr,es|understanding AND en,fr,es|sagess.

From one side I have only one language per query, and from the other side I
allow users to use Lucene query syntax in queries, including multi-term
queries. Therefore it seems that I have to do the nontrivial work of
parsing the user query according to the query parser rules and adding the
prefix everywhere it's needed.
For example, consider this user-entered query: *one AND (two OR
field2:three)*
This will need to be non-trivially translated into:

*en|one AND (en|two OR field2:en|three)*
Is there another conventional way to pass in a language string (one per search
request) to the query analyzer/tokenizer?

Thanks in advance,
Michael


Re: Autosuggest/autocomplete/spellcheck phrases

2010-06-17 Thread Michael
Blargy,

I've been experimenting with this myself for a work project. What I
did was use a combination of the two, running the indexed terms through
the shingle factory and then through the edge n-gram filter. I did
this in order to be able to match terms like:

.net asp c#
asp .net c#
c# asp .net
c# asp.net
 for a query like
asp c# .net

The edge n-grams are good, but they can also fail to match queries
when the words in the index are in a different order than those in the
query.

My setup in schema.xml looks roughly like this (reconstructed -- the
original markup was stripped in archiving; the essential parts are the
shingle filter followed by the edge n-gram filter at index time):

<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Let me know how this works for you.

On Thu, Jun 17, 2010 at 11:05 AM, Blargy  wrote:
>
> How can I preserve phrases for either autosuggest/autocomplete/spellcheck?
>
> For example we have a bunch of product listings and I want if someone types:
> "louis" for it to come up with "Louis Vuitton". "World" ... "World Cup".
>
> Would I need n-grams? Shingling? Thanks
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Autsuggest-autocomplete-spellcheck-phrases-tp902951p902951.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Autosuggest/autocomplete/spellcheck phrases

2010-06-17 Thread Michael
We base the auto-suggest on popular searches. Our site logs the search
terms in a database and a simple query can give us a summary counting
the number of times the search was entered and the number of results
it returned, similar to the criteria used in the Lucid Imagination
article you cite. Each record includes the search terms, the total
number of times it was entered and the maximum number of hits
returned. Each record is fed in as a document. On a regular interval,
older documents are deleted and newer ones are added.
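
For illustration, such a record might be fed in with an add command along
these lines (a sketch -- the field names here are assumptions, not our
actual schema):

<add>
  <doc>
    <!-- the search phrase itself, analyzed for autosuggest -->
    <field name="searchterms">louis vuitton</field>
    <!-- how many times users entered it -->
    <field name="timesentered">4231</field>
    <!-- maximum number of hits it returned -->
    <field name="maxhits">187</field>
  </doc>
</add>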

On Thu, Jun 17, 2010 at 12:29 PM, Blargy  wrote:
>
> Thanks for the reply Michael. I'll definitely try that out and let you know
> how it goes. Your solution sounds similar to the one I've read here:
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>
> There are some good comments in there too.
>
> I think I am having the biggest trouble distinguishing what needs to be done
> for autocomplete/autosuggestion (google like behavior) and a separate issue
> involving spellchecking (Did you mean...). I guess I originally thought
> those 2 distinct features would involve the same solution but it appears
> that they are completely different. Your solution sounds like it works best
> for autocomplete and I will be using it for that exact purpose ;) One
> question though... how do you handle more popular words/documents over
> others?
>
> Now my next question is, how would I get spellchecker to work with phrases.
> So if I typed "vitton" it would come back with something like: "Did you
> mean: 'Louis Vuitton'?" Will this also require a combination of ngrams and
> shingles?
>
> Thanks
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Autsuggest-autocomplete-spellcheck-phrases-tp902951p903225.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Preserving "C++" and other weird tokens

2009-08-06 Thread Michael _
Hi everyone,
I'm indexing several documents that contain words that the StandardTokenizer
cannot detect as tokens.  These are words like
  C#
  .NET
  C++
which are important for users to be able to search for, but get treated as
"C", "NET", and "C".

How can I create a list of words that should be understood to be indivisible
tokens?  Is my only option somehow stringing together a lot of
PatternTokenizers?  I'd love to do something like <tokenizer class="StandardTokenizer" tokenwhitelist=".NET C++ C#" />.

Thanks in advance!


Is kill -9 safe or not?

2009-08-07 Thread Michael _
I've seen several threads that are one or two years old saying that
performing "kill -9" on the java process running Solr either CAN, or CAN NOT
corrupt your index.  The more recent ones seem to say that it CAN NOT, but
before I bake a kill -9 into my control script (which first tries a normal
"kill", of course), I'd like to hear the answer straight from the horse's
mouth...
I'm using Solr 1.4 nightly from about a month ago.  Can I kill -9 without
fear of having to rebuild my index?

Thanks!
Michael


Re: Preserving "C++" and other weird tokens

2009-08-07 Thread Michael _
On Thu, Aug 6, 2009 at 11:38 AM, Michael _  wrote:

> Hi everyone,
> I'm indexing several documents that contain words that the
> StandardTokenizer cannot detect as tokens.  These are words like
>   C#
>   .NET
>   C++
> which are important for users to be able to search for, but get treated as
> "C", "NET", and "C".
>
> How can I create a list of words that should be understood to be
> indivisible tokens?  Is my only option somehow stringing together a lot of
> PatternTokenizers?  I'd love to do something like <tokenizer
> class="StandardTokenizer" tokenwhitelist=".NET C++ C#" />.
>
> Thanks in advance!
>

By the way, in case it wasn't clear: I'm not particularly tied to using the
StandardTokenizer.  Any tokenizer would be fine, if it did a reasonable job
of splitting up the input text while preserving special cases.

I'm also not averse to passing in a list of regexes, if I had to, but I'm
suspicious that that would be redoing a lot of the work done by the parser
inside the Tokenizer.

Thanks,
Michael
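
For posterity, one configuration that might cover cases like these: keep a
whitespace tokenizer and add WordDelimiterFilterFactory with a
protected-words file, so any token listed there passes through unsplit. A
sketch, assuming a protwords.txt that lists .NET, C++ and C# one per line:

<fieldType name="text_code" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- tokens found in protwords.txt are protected from splitting -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="1" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>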


Boosting relevance as terms get nearer to each other

2009-08-13 Thread Michael _
Hello,
I'd like to score documents higher that have the user's search terms nearer
each other.  For example, if a user searches for

  a AND b AND c

the standard query handler should return all documents with [a] [b] and [c]
in them, but documents matching the phrase "a b c" should get a boost over
those with "a x b c" over those with "b x y c z a", etc.

To accomplish this, I thought I might replace the user's query with

  "a b c"~10

hoping that the slop term gets a higher and higher score the closer together
[a] [b] and [c] appear.  This doesn't seem to be the case in my experiments;
when I debug the query, there's no component of the score based on how close
together [a] [b] and [c] are.  And I'm suspicious that this would make my
queries a whole lot slower -- in reality my users' queries get expanded
quite a bit already, and I'd thus need to add many slop terms.

Perhaps instead I could modify the Standard query handler to examine the
distance between all ANDed tokens, and boost proportionally to the inverse
of their average distance apart.  I've never modified a query handler before
so I have no idea if this is possible.

Any suggestions on what approach I should take?  The less I have to modify
Solr, the better -- I'd prefer a query-side solution over writing a plugin
over forking the standard query handler.

Thanks in advance!
Michael


Re: Boosting relevance as terms get nearer to each other

2009-08-17 Thread Michael
Anybody have any suggestions or hints?  I'd love to score my queries in a
way that pays attention to how close together terms appear.
Michael

On Thu, Aug 13, 2009 at 12:01 PM, Michael  wrote:

> Hello,
> I'd like to score documents higher that have the user's search terms nearer
> each other.  For example, if a user searches for
>
>   a AND b AND c
>
> the standard query handler should return all documents with [a] [b] and [c]
> in them, but documents matching the phrase "a b c" should get a boost over
> those with "a x b c" over those with "b x y c z a", etc.
>
> To accomplish this, I thought I might replace the user's query with
>
>   "a b c"~10
>
> hoping that the slop term gets a higher and higher score the closer
> together [a] [b] and [c] appear.  This doesn't seem to be the case in my
> experiments; when I debug the query, there's no component of the score based
> on how close together [a] [b] and [c] are.  And I'm suspicious that this
> would make my queries a whole lot slower -- in reality my users' queries get
> expanded quite a bit already, and I'd thus need to add many slop terms.
>
> Perhaps instead I could modify the Standard query handler to examine the
> distance between all ANDed tokens, and boost proportionally to the inverse
> of their average distance apart.  I've never modified a query handler before
> so I have no idea if this is possible.
>
> Any suggestions on what approach I should take?  The less I have to modify
> Solr, the better -- I'd prefer a query-side solution over writing a plugin
> over forking the standard query handler.
>
> Thanks in advance!
> Michael
>


Re: Boosting relevance as terms get nearer to each other

2009-08-17 Thread Michael
Thanks for the suggestion.  Unfortunately, my implementation requires the
Standard query parser -- I sanitize and expand user queries into deeply
nested queries with custom boosts and other bells and whistles that make
Dismax unappealing.
I see from the docs that Similarity.sloppyFreq() is a method for returning a
higher score for small edit distances, but it's not clear when that is used.
 If I make a (Standard) query like
  a AND b AND c AND "a b c"~100
does that imply that during the computation of the score for "a b
c"~100, sloppyFreq() will be called?  That's great for my needs,
assuming the 100 slop doesn't increase query time horribly.
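
(For reference: Lucene's DefaultSimilarity implements sloppyFreq(distance)
as 1/(distance+1), so an exact phrase match contributes the most and the
contribution decays as the matched terms spread apart.)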

Michael

On Mon, Aug 17, 2009 at 10:15 AM, Mark Miller  wrote:

> Dismax QueryParser with pf and ps params?
>
> http://wiki.apache.org/solr/DisMaxRequestHandler
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> Michael wrote:
>
>> Anybody have any suggestions or hints?  I'd love to score my queries in a
>> way that pays attention to how close together terms appear.
>> Michael
>>
>> On Thu, Aug 13, 2009 at 12:01 PM, Michael  wrote:
>>
>>
>>
>>> Hello,
>>> I'd like to score documents higher that have the user's search terms
>>> nearer
>>> each other.  For example, if a user searches for
>>>
>>>  a AND b AND c
>>>
>>> the standard query handler should return all documents with [a] [b] and
>>> [c]
>>> in them, but documents matching the phrase "a b c" should get a boost
>>> over
>>> those with "a x b c" over those with "b x y c z a", etc.
>>>
>>> To accomplish this, I thought I might replace the user's query with
>>>
>>>  "a b c"~10
>>>
>>> hoping that the slop term gets a higher and higher score the closer
>>> together [a] [b] and [c] appear.  This doesn't seem to be the case in my
>>> experiments; when I debug the query, there's no component of the score
>>> based
>>> on how close together [a] [b] and [c] are.  And I'm suspicious that this
>>> would make my queries a whole lot slower -- in reality my users' queries
>>> get
>>> expanded quite a bit already, and I'd thus need to add many slop terms.
>>>
>>> Perhaps instead I could modify the Standard query handler to examine the
>>> distance between all ANDed tokens, and boost proportionally to the
>>> inverse
>>> of their average distance apart.  I've never modified a query handler
>>> before
>>> so I have no idea if this is possible.
>>>
>>> Any suggestions on what approach I should take?  The less I have to
>>> modify
>>> Solr, the better -- I'd prefer a query-side solution over writing a
>>> plugin
>>> over forking the standard query handler.
>>>
>>> Thanks in advance!
>>> Michael
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
>


Re: Boosting relevance as terms get nearer to each other

2009-08-17 Thread Michael
Great, thank you Mark!
Michael

On Mon, Aug 17, 2009 at 10:48 AM, Mark Miller  wrote:

> PhraseQueries do score higher if the terms are found closer together.
>
>  does that imply that during the computation of the score for "a b
>>> c"~100, sloppyFreq() will be called?
>>>
>>
> Yes. PhraseQuery uses PhraseWeight, which creates a SloppyPhraseScorer,
> which takes into account Similarity.sloppyFreq(matchLength).
>
>
>
> Michael wrote:
>
>> Thanks for the suggestion.  Unfortunately, my implementation requires the
>> Standard query parser -- I sanitize and expand user queries into deeply
>> nested queries with custom boosts and other bells and whistles that make
>> Dismax unappealing.
>> I see from the docs that Similarity.sloppyFreq() is a method for returning
>> a
>> higher score for small edit distances, but it's not clear when that is
>> used.
>>  If I make a (Standard) query like
>>  a AND b AND c AND "a b c"~100
>> does that imply that during the computation of the score for "a b
>> c"~100, sloppyFreq() will be called?  That's great for my needs,
>> assuming the 100 slop doesn't increase query time horribly.
>>
>> Michael
>>
>> On Mon, Aug 17, 2009 at 10:15 AM, Mark Miller 
>> wrote:
>>
>>
>>
>>> Dismax QueryParser with pf and ps params?
>>>
>>> http://wiki.apache.org/solr/DisMaxRequestHandler
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
>>>
>>>
>>>
>>> Michael wrote:
>>>
>>>
>>>
>>>> Anybody have any suggestions or hints?  I'd love to score my queries in
>>>> a
>>>> way that pays attention to how close together terms appear.
>>>> Michael
>>>>
>>>> On Thu, Aug 13, 2009 at 12:01 PM, Michael  wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Hello,
>>>>> I'd like to score documents higher that have the user's search terms
>>>>> nearer
>>>>> each other.  For example, if a user searches for
>>>>>
>>>>>  a AND b AND c
>>>>>
>>>>> the standard query handler should return all documents with [a] [b] and
>>>>> [c]
>>>>> in them, but documents matching the phrase "a b c" should get a boost
>>>>> over
>>>>> those with "a x b c" over those with "b x y c z a", etc.
>>>>>
>>>>> To accomplish this, I thought I might replace the user's query with
>>>>>
>>>>>  "a b c"~10
>>>>>
>>>>> hoping that the slop term gets a higher and higher score the closer
>>>>> together [a] [b] and [c] appear.  This doesn't seem to be the case in
>>>>> my
>>>>> experiments; when I debug the query, there's no component of the score
>>>>> based
>>>>> on how close together [a] [b] and [c] are.  And I'm suspicious that
>>>>> this
>>>>> would make my queries a whole lot slower -- in reality my users'
>>>>> queries
>>>>> get
>>>>> expanded quite a bit already, and I'd thus need to add many slop terms.
>>>>>
>>>>> Perhaps instead I could modify the Standard query handler to examine
>>>>> the
>>>>> distance between all ANDed tokens, and boost proportionally to the
>>>>> inverse
>>>>> of their average distance apart.  I've never modified a query handler
>>>>> before
>>>>> so I have no idea if this is possible.
>>>>>
>>>>> Any suggestions on what approach I should take?  The less I have to
>>>>> modify
>>>>> Solr, the better -- I'd prefer a query-side solution over writing a
>>>>> plugin
>>>>> over forking the standard query handler.
>>>>>
>>>>> Thanks in advance!
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>


Re: Release Date Solr 1.4

2009-08-18 Thread Michael
I think this link gets you the exact bug count: it's Constantijn's link,
filtered to Unresolved Solr issues marked for fixing in 1.4:
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&&pid=12310230&fixfor=12313351&resolution=-1&sorter/field=issuekey&sorter/order=DESC

Michael

On Tue, Aug 18, 2009 at 9:05 AM, Constantijn Visinescu
wrote:

> Last I heard the ETA was approx a month, but they won't release it until
> it's ready.
>
> Check JIRA here for the list of open issues that need fixing before 1.4
>
> https://issues.apache.org/jira/secure/IssueNavigator.jspa?sorter/field=updated&sorter/order=DESC
>
> Constantijn Visinescu
>
> On Tue, Aug 18, 2009 at 2:57 PM, Daniel Knapp <
> daniel.kn...@mni.fh-giessen.de> wrote:
>
> > Hello Mailinglist,
> >
> >
> > does anyone know the release date of Solr 1.4?
> >
> > Thanks for your reply.
> >
> >
> > Regards,
> > Daniel
> >
>


Is caching worth it when my whole index is in RAM?

2009-08-31 Thread Michael
Hi,
If I've got my entire 20G 4MM document index in RAM (on a ramdisk), do I
have a need for the document cache?  Or should I set it to 0 items, because
pulling field values from an index in RAM is so fast that the document cache
would be a duplication of effort?

Are there any other caches that I should turn off if I can get my entire
index in RAM?  Filter cache, query results cache, etc?

Thanks!
Michael


Re: filtering facets

2009-08-31 Thread Michael
You could post-process the response and remove urls that don't match your
domain pattern.

On Mon, Aug 31, 2009 at 9:45 AM, Olivier H. Beauchesne wrote:

> Hi Mike,
>
> No, my problem is that the field article_outlinks is multivalued thus it
> contains several urls not related to my search. I would like to facet only
> urls matching my query.
>
> For example (only on one document, but my search targets over 1M docs):
>
> Doc1:
> article_url:
> url1.com/1
> url2.com/2
> url1.com/1
> url1.com/3
>
> And my query is: article_url:url1.com* and I facet by article_url and I
> want it to give me:
> url1.com/1 (2)
> url1.com/3 (1)
>
> But right now, because url2.com/2 is contained in a multivalued field with
> the matching urls, I get this:
> url1.com/1 (2)
> url1.com/3 (1)
> url2.com/2 (1)
>
> I can use facet.prefix to filter, but it's not very flexible if my url
> contains a subdomain as facet.prefix doesn't support wildcards.
>
> Thank you,
>
> Olivier
>
> Mike Topper wrote:
>
>  Hi Olivier,
>>
>> are the facet counts on the urls you don't want 0?
>>
>> if so you can use facet.mincount to only return results greater than 0.
>>
>> -Mike
>>
>> Olivier H. Beauchesne wrote:
>>
>>
>>> Hi,
>>>
>>> Long time lurker, first time poster.
>>>
>>> I have a multi-valued field, let's call it article_outlinks containing
>>> all outgoing urls from a document. I want to get all matching urls
>>> sorted by counts.
>>>
>>> For example, I want to get all outgoing Wikipedia URLs in my documents
>>> sorted by counts.
>>>
>>> So I execute a query like this:
>>> q=article_outlinks:http*wikipedia.org*  and I facet on article_outlinks
>>>
>>> But I get facets containing the other urls in the documents. I can get
>>> something close by using facet.prefix=http://en.wikipedia.org but I
>>> want to include other subdomains on wikipedia (ex: fr.wikipedia.org).
>>>
>>> Is there a way to do a search and getting facets only matching my query?
>>>
>>> I know facet.prefix isn't a query, but is there a way to get that
>>> behavior?
>>>
>>> Is it easy to extend solr to do something like that?
>>>
>>> Thank you,
>>>
>>> Olivier
>>>
>>> Sorry for my English.
>>>
>>>
>>>
>>
>>
>>
>>
>


Re: Is caching worth it when my whole index is in RAM?

2009-09-01 Thread Michael
Thanks, Avlesh!  I'll try the filter cache.
Anybody familiar enough with the caching implementation to chime in?

Michael

On Mon, Aug 31, 2009 at 10:02 PM, Avlesh Singh  wrote:

> Good question!
> The application level cache, say filter cache, would still help because it
> not only caches values but also the underlying computation. Even with all
> the data in your RAM you will still end up doing the computations every
> time.
>
> Looking for responses from the more knowledgeable.
>
> Cheers
> Avlesh
>
> On Mon, Aug 31, 2009 at 8:25 PM, Michael  wrote:
>
> > Hi,
> > If I've got my entire 20G 4MM document index in RAM (on a ramdisk), do I
> > have a need for the document cache?  Or should I set it to 0 items,
> because
> > pulling field values from an index in RAM is so fast that the document
> > cache
> > would be a duplication of effort?
> >
> > Are there any other caches that I should turn off if I can get my entire
> > index in RAM?  Filter cache, query results cache, etc?
> >
> > Thanks!
> > Michael
> >
>
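
For reference, the kind of solrconfig.xml cache settings under discussion
might look like this for a fully-in-RAM index (classes and sizes are
illustrative assumptions, not a recommendation):

<!-- document and query result caches mainly save I/O, which is already
     RAM-speed here -->
<documentCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<!-- the filter cache saves recomputation, so it stays generous -->
<filterCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>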


Parallel requests to Tomcat

2009-09-22 Thread Michael
Hi,
I have a Solr+Tomcat installation on an 8 CPU Linux box, and I just tried
sending parallel requests to it and measuring response time.  I would expect
that it could handle up to 8 parallel requests without significant slowdown
of any individual request.

Instead, I found that Tomcat is serializing the requests.

For example, the response time for each of 2 parallel requests is nearly 2
times that for a single request, and the time for each of 8 parallel
requests is about 4 times that of a single request.

I am pretty sure this is a Tomcat issue, for when I started 8 identical
instances of Solr+Tomcat on the machine (on 8 different ports), I could send
one request to each in parallel with only a 20% slowdown (compared to 300%
in a single Tomcat.)

I'm using the stock Tomcat download with minimal configuration changes,
except that I disabled all logging (in case the logger was blocking for each
request, serializing them.)  I'm giving 2G RAM to each JVM.

Does anyone more familiar with Tomcat know what's wrong?  I can't imagine
that Tomcat really can't handle parallel requests.
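
For context, the Tomcat-side knobs that govern request concurrency live on
the HTTP Connector in conf/server.xml; a stock configuration looks roughly
like this (values are typical defaults, not a tested fix):

<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="150"
           acceptCount="100"
           connectionTimeout="20000"/>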


Re: Parallel requests to Tomcat

2009-09-23 Thread Michael
I'm using a Solr 1.4 nightly from around July.  Is that recent enough to
have the improved reader implementation?
I'm not sure whether you'd call my operations IO heavy -- each query has so
many terms (~50) that even against a 45K document index a query takes 130ms,
but the entire index is in a ramfs.
- Michael

On Tue, Sep 22, 2009 at 8:08 PM, Yonik Seeley wrote:

> What version of Solr are you using?
> Solr1.3 and Lucene 2.4 defaulted to an index reader implementation
> that had to synchronize, so search operations that are IO "heavy"
> can't proceed in parallel.  You shouldn't see this with 1.4
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Tue, Sep 22, 2009 at 4:03 PM, Michael  wrote:
> > Hi,
> > I have a Solr+Tomcat installation on an 8 CPU Linux box, and I just tried
> > sending parallel requests to it and measuring response time.  I would
> expect
> > that it could handle up to 8 parallel requests without significant
> slowdown
> > of any individual request.
> >
> > Instead, I found that Tomcat is serializing the requests.
> >
> > For example, the response time for each of 2 parallel requests is nearly
> 2
> > times that for a single request, and the time for each of 8 parallel
> > requests is about 4 times that of a single request.
> >
> > I am pretty sure this is a Tomcat issue, for when I started 8 identical
> > instances of Solr+Tomcat on the machine (on 8 different ports), I could
> send
> > one request to each in parallel with only a 20% slowdown (compared to
> 300%
> > in a single Tomcat.)
> >
> > I'm using the stock Tomcat download with minimal configuration changes,
> > except that I disabled all logging (in case the logger was blocking for
> each
> > request, serializing them.)  I'm giving 2G RAM to each JVM.
> >
> > Does anyone more familiar with Tomcat know what's wrong?  I can't imagine
> > that Tomcat really can't handle parallel requests.
> >
>


Re: Parallel requests to Tomcat

2009-09-23 Thread Michael
Hi Fuad, thanks for the reply.
My queries are heavy enough that the difference in performance is obvious.
 I am using a home-grown load testing script that sends 1000 realistic
queries to the server and takes the average response time.  My index is on a
ramfs which I've shown makes the QR and doc caches unnecessary; I am warming
up the filter and fieldvalue caches before beginning the test.  There's no
appreciable difference between query times at the beginning, middle, or end
of the test, so I can't blame the hotspot or the Tomcat thread pool for not
being warmed up.

The queries I'm using are complex enough that they take a long time to run.
 8 queries against 1 Tomcat average 600ms per query, while 8 queries against
8 Tomcats average 190ms per query (on a dedicated 8 CPU server w 32G RAM).
 I don't see how to interpret these numbers except that Tomcat is not
multithreading as well as it should :)

Your thoughts?
Michael


On Wed, Sep 23, 2009 at 10:48 AM, Fuad Efendi  wrote:

> For 8-CPU load-stress testing of Tomcat you are probably making a mistake:
> - you should execute load-stress software and wait 5-30 minutes (depends on
> index size) BEFORE taking measurements.
>
> 1. JVM HotSpot need to compile everything into native code
> 2. Tomcat Thread Pool needs warm up
> 3. SOLR caches need warm up(!)
> And etc.
>
> 8 parallel requests are too few for default Tomcat; it uses 150 threads
> (default for old versions), and the new Concurrent package from Java 5.
>
> You should not test manually; use software such as The Grinder etc., also
> note please: there is a difference between mean time and response time,
> between average (successful) requests per second and average response
> time...
>
> > Tomcat is serializing the requests
> - doesn't mean anything for performance... yes, it has dedicated Listener
> on
> dedicated port dispatching requests to worker threads... and LAN NIC card
> serializes everything too...
>
>
>
> Fuad Efendi
> http://www.linkedin.com/in/liferay
>
>
>
> > -Original Message-
> > From: Michael [mailto:solrco...@gmail.com]
> > Sent: September-22-09 4:04 PM
> > To: solr-user@lucene.apache.org
> > Subject: Parallel requests to Tomcat
> >
> > Hi,
> > I have a Solr+Tomcat installation on an 8 CPU Linux box, and I just tried
> > sending parallel requests to it and measuring response time.  I would
> expect
> > that it could handle up to 8 parallel requests without significant
> slowdown
> > of any individual request.
> >
> > Instead, I found that Tomcat is serializing the requests.
> >
> > For example, the response time for each of 2 parallel requests is nearly
> 2
> > times that for a single request, and the time for each of 8 parallel
> > requests is about 4 times that of a single request.
> >
> > I am pretty sure this is a Tomcat issue, for when I started 8 identical
> > instances of Solr+Tomcat on the machine (on 8 different ports), I could
> send
> > one request to each in parallel with only a 20% slowdown (compared to
> 300%
> > in a single Tomcat.)
> >
> > I'm using the stock Tomcat download with minimal configuration changes,
> > except that I disabled all logging (in case the logger was blocking for
> each
> > request, serializing them.)  I'm giving 2G RAM to each JVM.
> >
> > Does anyone more familiar with Tomcat know what's wrong?  I can't imagine
> > that Tomcat really can't handle parallel requests.
>
>
>


Re: Parallel requests to Tomcat

2009-09-23 Thread Michael
On Wed, Sep 23, 2009 at 11:26 AM, Fuad Efendi  wrote:
>
> - something obviously wrong in your case, 130ms is too high. Is it
> dedicated
> server? Disk swapping? Etc.
>

It's that my queries are ridiculously complex.  My users are very familiar
with boolean searching, and I'm doing a lot of processing outside of Solr
that increases the query size by something like 50x.

I'm OK with the individual query time -- I can always shave terms off if I
must.  It's the difference between 1 Tomcat and 8 Tomcats that is the
problem: I'd like to be able to harness all 8 CPUs!  While my test corpus is
45K docs, my actual corpus will be 30MM, and so I'd like to get all the
performance I can out of my box.

Michael


Re: Parallel requests to Tomcat

2009-09-23 Thread Michael
Hi Fuad,

On Wed, Sep 23, 2009 at 11:37 AM, Fuad Efendi  wrote:

> >  8 queries against 1 Tomcat average 600ms per query, while 8 queries
> against
> > 8 Tomcats average 190ms per query (on a dedicated 8 CPU server w 32G
> RAM).
> >  I don't see how to interpret these numbers except that Tomcat is not
> > multithreading as well as it should :)
>
> Hi Michael, I think it is very natural; 8 single processes not sharing
> anything are faster than 8 threads sharing something.
>

8 threads sharing something may have *some* overhead versus 8 processes, but
as you say, 410ms overhead points to a different problem.


However, 600ms is too high.
>
> >My index is on a
> >ramfs which I've shown makes the QR and doc caches unnecessary;
>
> However, SOLR is faster than pure Lucene, try SOLR caches!
>

I have.  In a separate test, I verified that the caches that save disk I/O
(QR and doc) make no difference to query time, because my index is on a
ramfs.  The caches that save CPU cycles (filter and fieldvalue, because I'm
doing heavy faceting) DO help and I do have them turned on.

Michael


Re: Parallel requests to Tomcat

2009-09-23 Thread Michael
Hi Yonik,

On Wed, Sep 23, 2009 at 11:42 AM, Yonik Seeley
wrote:

>
> This could well be IO bound - lots of seeks and reads.
>

If this were IO bound, wouldn't I see the same results when sending my 8
requests to 8 Tomcats?  There's only one "disk" (well, RAM) whether I'm
querying 8 processes or 8 threads in 1 process, right?

Michael


Re: Parallel requests to Tomcat

2009-09-23 Thread Michael
Thanks for the suggestion, Walter!  I've been using Gaze 1.0 for a while
now, but when I moved to a multicore approach (which was the impetus behind
all of this testing) Gaze failed to start and I had to comment it out of
solrconfig.xml to get Solr to start.  Are you aware whether Gaze is able to
work in a multicore environment?
Michael

On Wed, Sep 23, 2009 at 11:55 AM, Walter Underwood wrote:

> This sure seems like a good time to try LucidGaze for Solr. That would give
> some Solr-specific profiling data.
>
> http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr
>
> wunder
>
>
> On Sep 23, 2009, at 8:47 AM, Michael wrote:
>
>  Hi Yonik,
>>
>> On Wed, Sep 23, 2009 at 11:42 AM, Yonik Seeley
>> wrote:
>>
>>
>>> This could well be IO bound - lots of seeks and reads.
>>>
>>>
>> If this were IO bound, wouldn't I see the same results when sending my 8
>> requests to 8 Tomcats?  There's only one "disk" (well, RAM) whether I'm
>> querying 8 processes or 8 threads in 1 process, right?
>>
>> Michael
>>
>
>


Re: Parallel requests to Tomcat

2009-09-23 Thread Michael
On Wed, Sep 23, 2009 at 12:05 PM, Yonik Seeley
wrote:

> On Wed, Sep 23, 2009 at 11:47 AM, Michael  wrote:
> > If this were IO bound, wouldn't I see the same results when sending my 8
> > requests to 8 Tomcats?  There's only one "disk" (well, RAM) whether I'm
> > querying 8 processes or 8 threads in 1 process, right?
>
> Right - I was thinking IO bound at the Lucene Directory level - which
> synchronized in the past and led to poor concurrency.  But your Solr
> version is recent enough to use the newer unsynchronized method by
> default (on non-windows)
>

Ah, OK.  So it looks like comparing to Jetty is my only next step.  Although
I'm not sure what I'm going to do based on the result of that test -- if
Jetty behaves differently, then I still don't know why the heck Tomcat is
behaving badly! :)

Michael


Re: Can we point a Solr server to index directory dynamically at runtime..

2009-09-24 Thread Michael
Using a multicore approach, you could send a "create a core named
'core3weeksold' pointing to '/datadirs/3weeksold' " command to a live Solr,
which would spin it up on the fly.  Then you query it, and maybe keep it
spun up until it's not queried for 60 seconds or something, then send a
"remove core 'core3weeksold' " command.
See http://wiki.apache.org/solr/CoreAdmin#CoreAdminHandler .
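
For example, the create and unload steps might look like this (host, port
and paths are made up; see the wiki page above for the parameters):

http://localhost:8983/solr/admin/cores?action=CREATE&name=core3weeksold&instanceDir=.&dataDir=/datadirs/3weeksold
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core3weeksold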

Michael

On Thu, Sep 24, 2009 at 12:31 AM, Silent Surfer wrote:

> Hi,
>
> Is there any way to dynamically point the Solr servers to an index/data
> directories at run time?
>
> We are generating 200 GB worth of index per day and we want to retain the
> index for approximately 1 month. So our idea is to keep the first 1 week of
> index available at any time for the users, i.e. have a set of Solr servers up and
> running to handle requests for the past 1 week of data.
>
> But when user tries to query data which is older than 7 days old, we want
> to dynamically point the existing Solr instances to the inactive/dormant
> indexes and get the results.
>
> The main intention is to limit the number of Solr Slave instances and there
> by limit the # of Servers required.
>
> If the index directory and Solr instances are tightly coupled, then most of
> the Solr instances are just up and running but may be hardly used, as most of
> the users are mainly interested in past 1 week data and not beyond that.
>
> Any thoughts or any other approaches to tackle this would be greatly
> appreciated.
>
> Thanks,
> sS
>
>
>
>
>


Re: Can we point a Solr server to index directory dynamically at runtime..

2009-09-25 Thread Michael
Are you storing (in addition to indexing) your data?  Perhaps you could turn
off storage on data older than 7 days (requires reindexing), thus losing the
ability to return snippets but cutting down on your storage space and server
count.  I've experienced 10x decrease in space requirements and a large
boost in speed after cutting extraneous storage from Solr -- the stored data
is mixed in with the index data and so it slows down searches.
You could also put all 200G onto one Solr instance rather than 10 for >7days
data, and accept that those searches will be slower.
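
Concretely, turning off storage is a per-field attribute in schema.xml; a
sketch (field and type names assumed):

<field name="body" type="text" indexed="true" stored="false"/>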

Michael

On Fri, Sep 25, 2009 at 1:34 AM, Silent Surfer wrote:

> Hi,
>
> Thank you Michael and Chris for the response.
>
> Today after the mail from Michael, we tested with the dynamic loading of
> cores and it worked well. So we need to go with the hybrid approach of
> Multicore and Distributed searching.
>
> As per our testing, we found that a Solr instance with 20 GB of
> index (single index or spread across multiple cores) can provide better
> performance when compared to having a Solr instance with, say, 40 or 50 GB of
> index (single index or index spread across cores).
>
> So the 200 GB of index on day 1 will be spread across 200/20=10 Solr slave
> instances.
>
> On day 2, 10 more Solr slave servers are required; cumulative Solr
> Slave instances = 200*2/20=20
> ...
> ..
> ..
> On day 30, 10 more Solr slave servers are required; cumulative Solr
> Slave instances = 200*30/20=300
>
> So with the above approach, we may need ~300 Solr slave instances, which
> becomes very unmanageable.
>
> But we know that most of the queries is for the past 1 week, i.e we
> definitely need 70 Solr Slaves containing the last 7 days worth of data up
> and running.
>
> Now for the rest of the 230 Solr instances, do we need to keep them running
> for the odd query, that can span across the 30 days of data (30*200 GB=6 TB
> data) which can come up only a couple of times a day.
> This linear increase of Solr servers with the retention period doesn't
> seem to be a very scalable solution.
>
> So we are looking for something more simpler approach to handle this
> scenario.
>
> Appreciate any further inputs/suggestions.
>
> Regards,
> sS
>
> --- On Fri, 9/25/09, Chris Hostetter  wrote:
>
> > From: Chris Hostetter 
> > Subject: Re: Can we point a Solr server to index directory dynamically
> at  runtime..
> > To: solr-user@lucene.apache.org
> > Date: Friday, September 25, 2009, 4:04 AM
> > : Using a multicore approach, you
> > could send a "create a core named
> > : 'core3weeksold' pointing to '/datadirs/3weeksold' "
> > command to a live Solr,
> > : which would spin it up on the fly.  Then you query
> > it, and maybe keep it
> > : spun up until it's not queried for 60 seconds or
> > something, then send a
> > : "remove core 'core3weeksold' " command.
> > : See http://wiki.apache.org/solr/CoreAdmin#CoreAdminHandler
> > .
> >
> > something that seems implicit in the question is what to do
> > when the
> > request spans all of the data ... this is where (in theory)
> > distributed
> > searching could help you out.
> >
> > index each days worth of data into it's own core, that
> > makes it really
> > easy to expire the old data (just UNLOAD and delete an
> > entire core once
> > it's more than 30 days old); if your user is only searching
> > "current" data
> > then your app can directly query the core containing the
> > most current data
> > -- but if they want to query the last week, or last two
> > weeks worth of
> > data, you do a distributed request for all of the shards
> > needed to search
> > the appropriate amount of data.
> >
> > Between the ALIAS and SWAP commands on the CoreAdmin
> > screen it should
> > be pretty easy to have cores with names like
> > "today","1dayold","2dayold" so
> > that your app can configure simple shard params for all the
> > permutations
> > you'll need to query.
> >
> >
> > -Hoss
> >
> >
>
>
>
>
>
>


Re: Parallel requests to Tomcat

2009-09-25 Thread Michael
Thank you Grant and Lance for your comments -- I've run into a separate snag
which puts this on hold for a bit, but I'll return to finish digging into
this and post my results. - Michael
On Thu, Sep 24, 2009 at 9:23 PM, Lance Norskog  wrote:

> Are you on Java 5, 6 or 7? Each release sees some tweaking of the Java
> multithreading model as well as performance improvements (and bug
> fixes) in the Sun HotSpot runtime.
>
> You may be tripping over the TCP/IP multithreaded connection manager.
> You might wish to create each client thread with a separate socket.
>
> Also, here is a standard bit of benchmarking advice: include "think
> time". This means that instead of sending requests constantly, each
> thread should time out for a few seconds before sending the next
> request. This simulates a user "stopping and thinking" before clicking
> the mouse again. This helps simulate the quantity of threads, etc.
> which are stopped and waiting at each stage of the request pipeline.
> As it is, you are trying to simulate the throughput behaviour without
> simulating the horizontal volume. (Benchmarking is much harder than it
> looks.)
>
> On Wed, Sep 23, 2009 at 9:43 AM, Grant Ingersoll 
> wrote:
> >
> > On Sep 23, 2009, at 12:09 PM, Michael wrote:
> >
> >> On Wed, Sep 23, 2009 at 12:05 PM, Yonik Seeley
> >> wrote:
> >>
> >>> On Wed, Sep 23, 2009 at 11:47 AM, Michael  wrote:
> >>>>
> >>>> If this were IO bound, wouldn't I see the same results when sending my
> 8
> >>>> requests to 8 Tomcats?  There's only one "disk" (well, RAM) whether
> I'm
> >>>> querying 8 processes or 8 threads in 1 process, right?
> >>>
> >>> Right - I was thinking IO bound at the Lucene Directory level - which
> >>> synchronized in the past and led to poor concurrency.  But your Solr
> >>> version is recent enough to use the newer unsynchronized method by
> >>> default (on non-windows)
> >>>
> >>
> >> Ah, OK.  So it looks like comparing to Jetty is my only next step.
> >>  Although
> >> I'm not sure what I'm going to do based on the result of that test -- if
> >> Jetty behaves differently, then I still don't know why the heck Tomcat
> is
> >> behaving badly! :)
> >
> >
> > Have you done any profiling to see where hotspots are?  Have you looked
> at
> > garbage collection?  Do you have any full collections occurring?  What
> > garbage collector are you using?  How often are you updating/committing,
> > etc?
> >
> >
> > --
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> > Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Parallel requests to Tomcat

2009-09-28 Thread Michael
Great news for Solr -- a third party library that I'm calling is serialized.
 Silly me, I made a mistake when ruling out that library as the culprit
earlier.  Solr itself scales just great as I add threads.  JProfiler helped me
find the problem.
Sorry for the false alarm, and thanks for the suggestions!
Michael

On Fri, Sep 25, 2009 at 10:02 AM, Michael  wrote:

> Thank you Grant and Lance for your comments -- I've run into a separate
> snag which puts this on hold for a bit, but I'll return to finish digging
> into this and post my results. - Michael
>
> On Thu, Sep 24, 2009 at 9:23 PM, Lance Norskog  wrote:
>
>> Are you on Java 5, 6 or 7? Each release sees some tweaking of the Java
>> multithreading model as well as performance improvements (and bug
>> fixes) in the Sun HotSpot runtime.
>>
>> You may be tripping over the TCP/IP multithreaded connection manager.
>> You might wish to create each client thread with a separate socket.
>>
>> Also, here is a standard bit of benchmarking advice: include "think
>> time". This means that instead of sending requests constantly, each
>> thread should time out for a few seconds before sending the next
>> request. This simulates a user "stopping and thinking" before clicking
>> the mouse again. This helps simulate the quantity of threads, etc.
>> which are stopped and waiting at each stage of the request pipeline.
>> As it is, you are trying to simulate the throughput behaviour without
>> simulating the horizontal volume. (Benchmarking is much harder than it
>> looks.)
>>
>> On Wed, Sep 23, 2009 at 9:43 AM, Grant Ingersoll 
>> wrote:
>> >
>> > On Sep 23, 2009, at 12:09 PM, Michael wrote:
>> >
>> >> On Wed, Sep 23, 2009 at 12:05 PM, Yonik Seeley
>> >> wrote:
>> >>
>> >>> On Wed, Sep 23, 2009 at 11:47 AM, Michael 
>> wrote:
>> >>>>
>> >>>> If this were IO bound, wouldn't I see the same results when sending
>> my 8
>> >>>> requests to 8 Tomcats?  There's only one "disk" (well, RAM) whether
>> I'm
>> >>>> querying 8 processes or 8 threads in 1 process, right?
>> >>>
>> >>> Right - I was thinking IO bound at the Lucene Directory level - which
>> >>> synchronized in the past and led to poor concurrency.  But your Solr
>> >>> version is recent enough to use the newer unsynchronized method by
>> >>> default (on non-windows)
>> >>>
>> >>
>> >> Ah, OK.  So it looks like comparing to Jetty is my only next step.
>> >>  Although
>> >> I'm not sure what I'm going to do based on the result of that test --
>> if
>> >> Jetty behaves differently, then I still don't know why the heck Tomcat
>> is
>> >> behaving badly! :)
>> >
>> >
>> > Have you done any profiling to see where hotspots are?  Have you looked
>> at
>> > garbage collection?  Do you have any full collections occurring?  What
>> > garbage collector are you using?  How often are you updating/committing,
>> > etc?
>> >
>> >
>> > --
>> > Grant Ingersoll
>> > http://www.lucidimagination.com/
>> >
>> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> > Solr/Lucene:
>> > http://www.lucidimagination.com/search
>> >
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>
>


Conditional deduplication

2009-09-30 Thread Michael
If I index a bunch of email documents, is there a way to say "show me all
email documents, but only one per To: email address"
so that if there are a total of 10 distinct To: addresses in the corpus, I get
back 10 email documents?

I'm aware of http://wiki.apache.org/solr/Deduplication but I want to retain
the ability to search across all of my email documents most of the time, and
only occasionally search for the distinct ones.

Essentially I want to do a
SELECT DISTINCT to_field FROM documents
where a normal search is a
SELECT * FROM documents

Thanks for any pointers.
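
If just the distinct To: values (rather than one full document per value)
would do, plain faceting gets close to the SELECT DISTINCT; a sketch,
assuming the field is named to_field:

q=*:*&rows=0&facet=true&facet.field=to_field&facet.limit=-1&facet.mincount=1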


Default query parameter for one core

2009-10-07 Thread Michael
I'd like to have 5 cores on my box.  core0 should automatically shard to
cores 1-4, which each have a quarter of my corpus.
I tried this in my solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- reconstructed; the archive stripped the markup. The key line is a
       shards default referencing a per-core property: -->
  <lst name="defaults">
    <str name="shards">${solr.core.shardsParam:}</str>
  </lst>
</requestHandler>

and this in my solr.xml:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- reconstructed; per the follow-ups, shardsParam was (mistakenly)
         written as an attribute on the core tag -->
    <core name="core1" instanceDir="1"/>
    <core name="core2" instanceDir="2"/>
    <core name="core3" instanceDir="3"/>
    <core name="core4" instanceDir="4"/>
    <core name="core0" instanceDir="0"
          shardsParam="localhost:9990/core1,localhost:9990/core2,localhost:9990/core3,localhost:9990/core4"/>
  </cores>
</solr>

Unfortunately, this doesn't work, because cores 1 through 4 end up
specifying a blank shards param, which is different from no shards param at
all -- it results in a NullPointerException.

Is there a way to not have the shards param at all for most cores, and for
core0 to specify it?


Re: Default query parameter for one core

2009-10-08 Thread Michael
On Wed, Oct 7, 2009 at 1:46 PM, Michael  wrote:
> Is there a way to not have the shards param at all for most cores, and for 
> core0 to specify it?

E.g. core0 requests always get a "&shards=foo" appended, while other
cores don't have an "&shards" param at all.

Or, barring that, is there a way to tell one core "use this chunk of
XML for your <requestHandler> tag", and tell the other cores "use this other
chunk of XML for your <requestHandler> tag"?


Re: Default query parameter for one core

2009-10-09 Thread Michael
On Fri, Oct 9, 2009 at 6:03 AM, Shalin Shekhar Mangar
 wrote:
> Michael, the last line does not seem right. The <core> tag has nothing
> called shardsParam. If you want to add a core property called shardsParam, you
> need to add something like this:
>
> <core name="core0" instanceDir="core0">
>   <property name="shardsParam"
>     value="localhost:9990/core1,localhost:9990/core2,localhost:9990/core3,localhost:9990/core4"/>
> </core>

Thanks, Shalin!  I had seen a solrconfig.xml referencing
${solr.core.instanceDir} which *is* defined as a <core> attribute, so
I falsely assumed it would accept arbitrary properties as attributes
on the <core> tag.


Re: Default query parameter for one core

2009-10-09 Thread Michael
Hm... still no success.  Can anyone point me to a doc that explains
how to define and reference core properties?  I've had no luck
searching Google.

Shalin, I gave an identical '<property name="shardsParam" value="..."/>' tag to
each of my cores, and referenced ${solr.core.shardsParam} (with no
default specified via a colon) in solrconfig.xml.  I get an error on
startup:

SEVERE: Error in solrconfig.xml:org.apache.solr.common.SolrException:
No system property or default value specified for
solr.core.shardsParam

(Not to mention that even if this *did* work, it wouldn't help me with
making most cores *not* specify a &shards query parameter.)  Clearly
I'm doing something wrong, but I'm in the dark as to how to do it
right.

Any help would be appreciated!
Michael

On Fri, Oct 9, 2009 at 10:13 AM, Michael  wrote:
> On Fri, Oct 9, 2009 at 6:03 AM, Shalin Shekhar Mangar
>  wrote:
>> Michael, the last line does not seem right. The <core> tag has nothing
>> called shardsParam. If you want to add a core property called shardsParam, you
>> need to add something like this:
>>
>> <core name="core0" instanceDir="core0">
>>   <property name="shardsParam"
>>     value="localhost:9990/core1,localhost:9990/core2,localhost:9990/core3,localhost:9990/core4"/>
>> </core>
>
> Thanks, Shalin!  I had seen a solrconfig.xml referencing
> ${solr.core.instanceDir} which *is* defined as a <core> attribute, so
> I falsely assumed it would accept arbitrary properties as attributes
> on the <core> tag.
>


Re: Default query parameter for one core

2009-10-09 Thread Michael
On Fri, Oct 9, 2009 at 10:26 AM, Michael  wrote:
> Hm... still no success.  Can anyone point me to a doc that explains
> how to define and reference core properties?  I've had no luck
> searching Google.

OK, definition is described here:
http://wiki.apache.org/solr/CoreAdmin#property -- a page I've visited
many times before but somehow had trouble finding.

Still not sure whether I'm supposed to use "solr.core.shardsParam", or
just "shardsParam", and how to not have a defaulted &shards parameter
for most cores.

Michael


Re: Default query parameter for one core

2009-10-09 Thread Michael
For posterity...

After reading through http://wiki.apache.org/solr/SolrConfigXml and
http://wiki.apache.org/solr/CoreAdmin and
http://issues.apache.org/jira/browse/SOLR-646, I think there's no way
for me to make only one core specify &shards=foo, short of duplicating
my solrconfig.xml for that core and adding one line:

- I can't use a variable like ${shardsParam} in a single shared
solrconfig.xml, because the line
    <str name="shards">${shardsParam}</str>
  has to be in there, and that forces a (possibly empty) &shards
parameter onto cores that *don't* need one, causing a
NullPointerException.

- I can't suck in just that one <str> line via a SOLR-646-style import,
roughly like this (reconstructed; names approximate):

# solrconfig.xml
<lst name="defaults">
  <import file="${solr.core.shardsFile}"/>
</lst>

# solr.xml
<core name="core0" ... shardsFile="shards-core0.xml"/>
<core name="core1" ... shardsFile="shards-empty.xml"/>

  because SOLR-646's <import> feature got cut.

So I think my best bet is to make two mostly-identical
solrconfig.xmls, and point core0 to the one specifying a &shards=
parameter:

<core name="core0" instanceDir="0" config="solrconfig-shards.xml"/>

I don't like the duplication of config, but at least it accomplishes my goal!
Michael


On Fri, Oct 9, 2009 at 10:37 AM, Michael  wrote:
> On Fri, Oct 9, 2009 at 10:26 AM, Michael  wrote:
>> Hm... still no success.  Can anyone point me to a doc that explains
>> how to define and reference core properties?  I've had no luck
>> searching Google.
>
> OK, definition is described here:
> http://wiki.apache.org/solr/CoreAdmin#property -- a page I've visited
> many times before but somehow had trouble finding.
>
> Still not sure whether I'm supposed to use "solr.core.shardsParam", or
> just "shardsParam", and how to not have a defaulted &shards parameter
> for most cores.
>
> Michael
>


Re: Default query parameter for one core

2009-10-12 Thread Michael
Thanks for your input, Shalin.

On Sun, Oct 11, 2009 at 12:30 AM, Shalin Shekhar Mangar
 wrote:
>> - I can't use a variable like ${shardsParam} in a single shared
>> solrconfig.xml, because the line
>>    <str name="shards">${shardsParam}</str>
>>  has to be in there, and that forces a (possibly empty) &shards
>> parameter onto cores that *don't* need one, causing a
>> NullPointerException.
>>
>>
> Well, we can fix the NPE :)  Please raise an issue.

The NPE may be the "correct" behavior -- I'm causing an empty &shards=
parameter, which doesn't have a defined behavior AFAIK.  The
deficiency I was pointing out was that using ${shardsParam} doesn't
help me achieve my real goal, which is to have the entire <str> tag
disappear for some cores.

>> So I think my best bet is to make two mostly-identical
>> solrconfig.xmls, and point core0 to the one specifying a &shards=
>> parameter:
>>    <core name="core0" instanceDir="0" config="solrconfig-shards.xml"/>
>>
>> I don't like the duplication of config, but at least it accomplishes my
>> goal!
>>
>>
> There is another way too. Each plugin in Solr now supports a configuration
> attribute named "enable" which can be true or false. You can control the
> value (true/false) through a variable. So you can duplicate just the handler
> instead of the complete solrconfig.xml

I had looked into this, but thought it doesn't help because I'm not
disabling an entire plugin -- just a <str> tag specifying a default
parameter to a <requestHandler>.  Individual <str> tags don't have an
"enable" flag for me to conditionally set to false.  Maybe I'm
misunderstanding what you're suggesting?

Thanks again,
Michael


Re: Default query parameter for one core

2009-10-12 Thread Michael
OK, a hacky but working solution to making one core shard to all
others: have the default parameter *name* vary, so that one core gets
"&shards=foo" and all other cores get "&dummy=foo".

# solr.xml (reconstructed; exact property names were lost in archiving)

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="0">
      <property name="shardsParamName" value="shards"/>
      <property name="shardsValue"
        value="localhost:9990/core1,localhost:9990/core2,localhost:9990/core3,localhost:9990/core4"/>
    </core>
    <core name="core1" instanceDir="1">
      <property name="shardsParamName" value="dummy"/>
      <property name="shardsValue" value="foo"/>
    </core>
    ...
  </cores>
</solr>

# solrconfig.xml

<lst name="defaults">
  <str name="${shardsParamName}">${shardsValue}</str>
  ...
</lst>

Michael

On Mon, Oct 12, 2009 at 12:00 PM, Michael  wrote:
> Thanks for your input, Shalin.
>
> On Sun, Oct 11, 2009 at 12:30 AM, Shalin Shekhar Mangar
>  wrote:
>>> - I can't use a variable like ${shardsParam} in a single shared
>>> solrconfig.xml, because the line
>>>    <str name="shards">${shardsParam}</str>
>>>  has to be in there, and that forces a (possibly empty) &shards
>>> parameter onto cores that *don't* need one, causing a
>>> NullPointerException.
>>>
>>>
>> Well, we can fix the NPE :)  Please raise an issue.
>
> The NPE may be the "correct" behavior -- I'm causing an empty &shards=
> parameter, which doesn't have a defined behavior AFAIK.  The
> deficiency I was pointing out was that using ${shardsParam} doesn't
> help me achieve my real goal, which is to have the entire <str> tag
> disappear for some cores.
>
>>> So I think my best bet is to make two mostly-identical
>>> solrconfig.xmls, and point core0 to the one specifying a &shards=
>>> parameter:
>>>    <core name="core0" instanceDir="0" config="solrconfig-shards.xml"/>
>>>
>>> I don't like the duplication of config, but at least it accomplishes my
>>> goal!
>>>
>>>
>> There is another way too. Each plugin in Solr now supports a configuration
>> attribute named "enable" which can be true or false. You can control the
>> value (true/false) through a variable. So you can duplicate just the handler
>> instead of the complete solrconfig.xml
>
> I had looked into this, but thought it doesn't help because I'm not
> disabling an entire plugin -- just a <str> tag specifying a default
> parameter to a <requestHandler>.  Individual <str> tags don't have an
> "enable" flag for me to conditionally set to false.  Maybe I'm
> misunderstanding what you're suggesting?
>
> Thanks again,
> Michael
>


Re: Letters with accent in query

2009-10-12 Thread Michael
What tokenizer and filters are you using in what order?  See schema.xml.

Also, you may wish to use ASCIIFoldingFilter, which covers more cases
than ISOLatin1AccentFilter.
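
A sketch of the analyzer chain this implies (the folding filter has to run
at both index and query time, which a single <analyzer> section gives you):

<fieldType name="text_folded" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- maps accented characters to ASCII equivalents, e.g. café -> cafe -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>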

Michael

On Mon, Oct 12, 2009 at 12:42 PM, R. Tan  wrote:
> Hi,
> I'm querying with an accented keyword such as "café" but the debug info
> shows that it is only searching for "caf". I'm using the ISOLatin1Accent
> filter as well.
>
> Query:
> http://localhost:8983/solr/select?q=%E9&debugQuery=true
>
> Params return shows this:
> <lst name="params">
>   <str name="q"/>
>   <str name="debugQuery">true</str>
> </lst>
>
> What am I missing here?
>
> Rih
>


Opaque replication failures

2009-10-14 Thread Michael
Hi,

I have a multicore Solr 1.4 setup.  core_master is a 3.7G master for
replication, and core_slave is a 500 byte slave pointing to the
master.  I'm using the example replication configuration from
solrconfig.xml, with ${enable.master} and ${enable.slave} properties
so that the master and slave can use the same solrconfig.xml.

When I attempt to replicate (every 60 seconds or by pressing the
button on the slave replication admin page), it doesn't work.
Unfortunately, neither the admin page nor the REST API "details"
command show anything useful, and the logs show no errors.

How can I get insight into what is causing the failure?  I assume it's
some configuration problem but don't know where to start.

Thanks in advance for any help!  Config files are below.
Michael



Here is my solr.xml:

<solr persistent="true">
  <!-- reconstructed from context: enable.master / enable.slave are set
       per core so both cores can share one solrconfig.xml -->
  <cores adminPath="/admin/cores">
    <core name="core_master" instanceDir="5">
      <property name="enable.master" value="true"/>
    </core>
    <core name="core_slave" instanceDir="1">
      <property name="enable.slave" value="true"/>
    </core>
  </cores>
</solr>

And here's the relevant chunk of my solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
        <str name="enable">${enable.master:false}</str>
        <str name="replicateAfter">commit</str>
    </lst>
    <lst name="slave">
        <str name="enable">${enable.slave:false}</str>
        <str name="masterUrl">http://localhost:31000/solr/core_master/replication</str>
        <str name="pollInterval">00:00:60</str>
    </lst>
</requestHandler>

Here's what the "details" command on the slave has to say -- nothing
explanatory that I can see.  Is the "isReplicating=false" worrying?



<lst name="details">
  <!-- element names approximate; the archive stripped the markup -->
  <str name="indexSize">589 bytes</str>
  <str name="indexPath">/home/search/solr/data/1/index</str>
  <str name="isMaster">false</str>
  <str name="isSlave">true</str>
  <long name="indexVersion">1254772638413</long>
  <long name="generation">2</long>

  <lst name="slave">
    <lst name="masterDetails">
      <str name="indexSize">3.75 GB</str>
      <str name="indexPath">/home/search/solr/data/5/index</str>
      <str name="isMaster">true</str>
      <str name="isSlave">false</str>
      <long name="indexVersion">1254772639291</long>
      <long name="generation">156</long>
    </lst>
    <str name="masterUrl">http://localhost:31000/solr/core_master/replication</str>
    <str name="pollInterval">00:00:60</str>
    <str name="nextExecutionAt">Wed Oct 14 14:25:22 EDT 2009</str>

    <arr name="indexReplicatedAtList">
      <str>Wed Oct 14 14:25:22 EDT 2009</str>
      <str>Wed Oct 14 14:25:22 EDT 2009</str>
      <str>Wed Oct 14 14:25:21 EDT 2009</str>
      <str>Wed Oct 14 14:24:27 EDT 2009</str>
      (etc)
    </arr>
    <arr name="replicationFailedAtList">
      <str>Wed Oct 14 14:25:22 EDT 2009</str>
      <str>Wed Oct 14 14:25:22 EDT 2009</str>
      <str>Wed Oct 14 14:25:21 EDT 2009</str>
      <str>Wed Oct 14 14:24:27 EDT 2009</str>
      (etc)
    </arr>

    <str name="timesIndexReplicated">1481</str>
    <str name="confFilesReplicated">0</str>
    <str name="timesFailed">1481</str>
    <str name="replicationFailedAt">Wed Oct 14 14:25:22 EDT 2009</str>
    <str name="previousCycleTimeInSeconds">0</str>
    <str name="isReplicating">false</str>
  </lst>
</lst>


Re: Opaque replication failures

2009-11-03 Thread Michael
OK, I've solved this.  For posterity:

The master doesn't make anything available for replication unless you
set replicateAfter="startup", or unless you set
replicateAfter="commit" and then both add a document and execute a
commit.  If you don't do one of those, even manually clicking
"Replicate Now" on the slave will show failures without explaining
why.

With replicateAfter="startup" and "commit" I was able to get a slave
core in the same Solr instance to replicate upon startup and upon
add-doc-and-commit.
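
Concretely, the working master section ends up something like this (a
sketch; multiple replicateAfter entries are allowed):

<lst name="master">
  <str name="enable">${enable.master:false}</str>
  <str name="replicateAfter">startup</str>
  <str name="replicateAfter">commit</str>
</lst>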

Michael

On Tue, Nov 3, 2009 at 11:53 AM, Michael  wrote:
> I just tried setting up replication between two cores with the Nov 2
> nightly, and got the same result as below: replication reports as
> failed but doesn't tell me why.
>
> Is replication not allowed from a master core to a slave core within
> the same Solr instance?  Or is there a way for me to find out if there
> is something wrong with my index (which otherwise appears OK)?
>
> Thanks,
> Michael
>
> On Wed, Oct 14, 2009 at 1:33 PM, Michael  wrote:
>> [ ... snip ... ]
>


default a parameter to a core's name

2009-11-09 Thread Michael
Hi,

Is there a way for me to set up my solr.xml so that slave cores
replicate from the master's identically-named cores by default, but I
can override the core to replicate from if I wish? Something like
this, where core1 and core2 on the slave default to replicating from
foo/solr/core1 and foo/solr/core2, but core3 replicates from
foo/solr/core15.

# slave's solrconfig.xml
<str name="masterUrl">http://foo/solr/${replicateCore}/replication</str>

# slave's solr.xml
<cores adminPath="/admin/cores">
  <property name="replicateCore" value="${solr.core.name}"/>
  <core name="core1" instanceDir="core1"/>
  <core name="core2" instanceDir="core2"/>
  <core name="core3" instanceDir="core3">
    <property name="replicateCore" value="core15"/>
  </core>
</cores>

This doesn't quite work because solr.core.name is not a valid variable
outside the <core> section.  I also tried putting
"${replicateCore:${solr.core.name}}" in the solrconfig.xml, but the
default in that case is literally "${solr.core.name}" -- the variable
expansion isn't recursive.

Thanks in advance for any pointers.

Michael


Stop solr without losing documents

2009-11-12 Thread Michael
I've got a process external to Solr that is constantly feeding it new
documents, retrying if Solr is nonresponding.  What's the right way to
stop Solr (running in Tomcat) so no documents are lost?

Currently I'm committing all cores and then running catalina's stop
script, but between my commit and the stop, more documents can come in
that would need *another* commit...

Lots of people must have had this problem already, so I know the
answer is simple; I just can't find it!

Thanks.
Michael


Re: Stop solr without losing documents

2009-11-13 Thread Michael
On Fri, Nov 13, 2009 at 4:32 AM, gwk  wrote:
> I don't know if this is the best solution, or even if it's applicable to
> your situation but we do incremental updates from a database based on a
> timestamp, (from a simple seperate sql table filled by triggers so deletes

Thanks, gwk!  This doesn't exactly meet our needs, but helped us get
to a solution.  In short, we are manually committing in our outside
updater process (instead of letting Solr autocommit), and marking
which documents have been updated before a successful commit.  Now
stopping solr is as easy as kill -9.
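
For the archives, here's a minimal SolrJ sketch of that pattern (the URL and
the fetchPending()/markCommitted() helpers are placeholders for whatever
queue and bookkeeping your updater already has):

    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SafeUpdater {
        public static void pushBatch() throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            List<SolrInputDocument> batch = fetchPending();  // placeholder: your update queue
            solr.add(batch);
            solr.commit();           // explicit commit instead of relying on autocommit
            markCommitted(batch);    // placeholder: record durability only after commit succeeds
        }

        private static List<SolrInputDocument> fetchPending() { return null; /* your queue */ }
        private static void markCommitted(List<SolrInputDocument> batch) { /* your bookkeeping */ }
    }

With that in place, anything not yet marked committed is simply re-sent after
a crash or a kill, so shutdown order stops mattering.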

Michael


Re: Stop solr without losing documents

2009-11-16 Thread Michael
On Fri, Nov 13, 2009 at 4:09 PM, Chris Hostetter
 wrote:
> please don't kill -9 ... it's grossly overkill, and doesn't give your
[ ... snip ... ]
> Alternately, you could take advantage of the "enabled" feature from your
> client (just have it test the enabled url every N updates or so) and when
> it sees that you have disabled the port it can send one last commit and
> then stop sending updates until it sees the enabled URL work again -- as
> soon as you see the updates stop, you can safely shut down the port.

Thanks, Hoss.  I'll use Catalina stop instead of kill -9.

It's good to know about the enabled feature -- my team was just
discussing whether something like that existed that we could use --
but as we'd also like to recover cleanly from power failures and other
Solr terminations, I think we'll track which docs are uncommitted
outside of Solr.

Michael


Re: Stop solr without losing documents

2009-11-16 Thread Michael
On Fri, Nov 13, 2009 at 11:02 PM, Otis Gospodnetic
 wrote:
> So I think the question is really:
> "If I stop the servlet container, does Solr issue a commit in the shutdown 
> hook in order to ensure all buffered docs are persisted to disk before the 
> JVM exits".

Exactly right, Otis.

> I don't have the Solr source handy, but if I did, I'd look for "Shutdown", 
> "Hook" and "finalize" in the code.

Thanks for the direction.  There was some talk of close()ing a
SolrCore that I found, but I don't believe this meant a commit.

I somehow hadn't thought of actually *trying* to add a doc and then
shut down a Solr instance; shame on me.  Unfortunately, when I test
this via
 * make a new solr
 * add a doc
 * commit
 * verify it shows up in a search -- it does
 * add a 2nd doc
 * shutdown
solr doesn't stop.  It stops accepting connections, but java refuses
to actually die.  Not sure what we're doing wrong on our end, but I
see this frequently and end up having to do a kill (usually not -9!).
I guess we'll stick with externally tracking which docs have
committed, so that when we inevitably have to kill Solr it doesn't
cause a problem.

Michael


Re: Stop solr without losing documents

2009-11-16 Thread Michael
On Fri, Nov 13, 2009 at 11:45 PM, Lance Norskog  wrote:
> I would go with polling Solr to find what is not yet there. In
> production, it is better to assume that things will break, and have
> backstop janitors that fix them. And then test those janitors
> regularly.

Good idea, Lance.  I certainly agree with the idea of backstop
janitors.  We don't have a good way of polling Solr for what's in
there or not -- we have a kind of asynchronous, multithreaded updating
system sending docs to Solr -- but we always can find out *externally*
which docs have been committed or not.

Michael


Re: Solr 1.3 query and index perf tank during optimize

2009-11-20 Thread Michael
Hoss,

Using Solr 1.4, I see constant index growth until an optimize.  I
commit (hundreds of updates) every 5 minutes and have a mergefactor of
10, but every 50 minutes I don't see the index collapse down to its
original size -- it's slightly larger.

Over the course of a week, the index grew from 4.5 gigs to 6 gigs,
growing and shrinking in file size and count but generally upward.
Only when I manually optimized did the index return to 4.5 gigs.

So -- I thought I understood you to mean that if I frequently merge,
it's basically the same as an optimize, and cruft will get purged.  Am
I misunderstanding you?

Michael
PS: The extra 1.5G actually matters, as this is one of 8 cores and I'm
trying to keep it all in RAM.

On Tue, Nov 17, 2009 at 2:37 PM, Israel Ekpo  wrote:
> On Tue, Nov 17, 2009 at 2:24 PM, Chris Hostetter
> wrote:
>
>>
>> : Basically, search entries are keyed to other documents.  We have finite
>> : storage,
>> : so we purge old documents.  My understanding was that deleted documents
>> : still
>> : take space until an optimize is done.  Therefore, if I don't optimize,
>> the
>> : index
>> : size on disk will grow without bound.
>> :
>> : Am I mistaken?  If I don't ever have to optimize, it would make my life
>> : easier.
>>
>> deletions are purged as segments get merged.  if you want to force
>> deleted documents to be purged, the only way to do that at the
>> moment is to optimize (which merges all segments).  but if you are
>> continually deleting/adding documents, the deletions will eventually get
>> purged even if you never optimize.
>>
>>
>>
>>
>> -Hoss
>>
>>
>
> Chris,
>
> Since the mergeFactor controls the segment merge frequency and size and the
> number of segments is limited to mergeFactor - 1.
>
> Would one be correct to state that if some documents have been deleted from
> the index and the changes finalized with a call to commit, as more documents
> are added to the index, eventually the index will be  implicitly "*optimized
> *" and the deleted documents will be purged even without explicitly issuing
> an optimize statement?
>
>
> --
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
>


Re: Solr 1.3 query and index perf tank during optimize

2009-11-20 Thread Michael
On Fri, Nov 20, 2009 at 12:35 PM, Yonik Seeley
 wrote:
> On Fri, Nov 20, 2009 at 12:24 PM, Michael  wrote:
>> So -- I thought I understood you to mean that if I frequently merge,
>> it's basically the same as an optimize, and cruft will get purged.  Am
>> I misunderstanding you?
>
> That only applies to the segments involved in the merge.  The deleted
> documents are left behind when old segments are merged into a new
> segment.

Your statement is leading me to believe that I have misunderstood the
merge process.  I thought that every time there are 10 segments, they
get merged down to 1.  Therefore, every time a merge happens, every
single segment in my entire index is "involved in the merge".  9
segments later, we're back to 10 segments, and they're merged into 1.
9 segments later, we're back to 10 segments once again, and they're
merged into 1.

Maybe I have misunderstood the mergeFactor docs.  Maybe instead it's like this?
1. Segment A1 fills with N docs, and a new segment A2 is created.
2. A2 fills with N docs, and A3 is created; A3 fills with N docs, etc.
3. A9 fills with N docs, and merging occurs: Segment B1 is created
with 10*N docs, segments A1-A9 are deleted.
4. A new segment A1 fills with N docs, and a new segment A2 is
created; B1 is still sitting with 10*N docs.
5. Eventually A1 through A9 each have N docs, and then merging occurs:
Segment B2 is created, with 10*N docs.
6. Eventually Segments B1 through B9 each have 10*N docs, and merging
occurs: Segment C1 is created, with 100*N docs.  Segments B1-B9 are
deleted.
7. A new A1 starts filling again.

Some time down the line I might have 4 D segments with 1000*N docs
each, 6 C segments with 100*N docs each, 8 B segments with 10*N docs
each, 2 A segments with N docs each, and an open A3 segment filling
up.

If this is right, then your statement above means that yes, each merge
of many As into 1 B purges all the deleted docs in A1-A9, but All my
Ds, Cs, and Bs aren't updated to purge deleted docs yet.  Only when
B1-B9 merge into a new C do their deleted docs get purged; only when
C1-C9 merge into a new D do their deleted docs get purged; etc.

Is this right?  Sorry it was so verbose!
Michael


Modifying a stored field after analyzing it?

2009-07-10 Thread Michael _
Hello,
I've got a stored, indexed field that contains some actual text, and some
metainfo, like this:

   one two three four [METAINFO] oneprime twoprime threeprime fourprime

I have written a Tokenizer that skips past the [METAINFO] marker and uses
the last four words as the tokens for the field, mapping to the first four
words.  E.g. "twoprime" is the second token, with startposition=4 and
endposition=8.

When someone searches for "twoprime", therefore, they get back a highlighted
result like

   one two three ...

This is great and serves my needs, but I hate that I'm storing all that
METAINFO uselessly (there's actually a good deal more than in this
simplified example).  After I've used it to make my tokens, I'd really like
to convert the stored field to just

   one two three four

and store that.

I thought about using an UpdateRequestProcessor to do this, but that happens
*before* the Analyzers run, so if I strip the [METAINFO] there I can't use
it to build my tokens.  I also thought about sending the data in in two
fields, like

   f1: one two three four
   f1_meta: oneprime twoprime threeprime fourprime

but I can't figure out a way for f1's analyzer to grab the stream from
f1_meta.

Is there some clever way that I'm missing to build my token stream outside
of Solr, and store just the original text and index my token stream?

Thanks in advance!


NPE when sharding multiple layers

2010-01-07 Thread Michael
Hi all,

I've got an index split across 28 cores -- 4 cores on each of 7 boxes
(multiple cores per box in order to use more of its CPUs.)

When I configure a "toplevel" core to fan out to all 28 index cores,
it works, but is slower than I'd have expected:
 Toplevel core ==> all 28 index cores

In case it is the aggregation of 28 shards that is slow, I wanted to
try 2 layers of sharding.  I changed the toplevel core to shard to 1
"midlevel" core per box, which in turn shards to the 4 index cores on
localhost:
 Toplevel core ==> 7 midlevel cores, 1 per box ==> 4 localhost index cores

If I search for *:*, this works.  If I search for an actual
field:value, the midlevel cores throw an NPE.

I am configuring toplevel and midlevel cores' &shards= parameters via
default values in their solrconfigs, so my request URL just looks like
host/solr/toplevel/select/&q=field:value.

Is this a known bug, or am I just doing something wrong?

Thanks in advance!
- Michael

PS: The NPE, which is thrown by the midlevel cores:

Jan 7, 2010 4:01:02 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
    at org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardDoc.java:210)
    at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:134)
    at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:255)
    at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:114)
    at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:156)
    at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:141)
    at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:445)
    at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
    at java.lang.Thread.run(Thread.java:619)


Re: NPE when sharding multiple layers

2010-01-07 Thread Michael
Thanks, Yonik.

Does "not supported" mean "we can't guarantee whether it will work or
not", or "you may be able to figure it out on your own"?  Apparently I
am able to get *some* queries through, just not those that pass
through the field type that I really need (a complex analyzer).  When I
search for foo:value, where foo is a field whose analyzer uses

  StandardTokenizer
  LowerCaseFilter
  WordDelimiterFilter
  TrimFilter

I *don't* get an NPE.

Thanks,
Michael

On Thu, Jan 7, 2010 at 4:25 PM, Yonik Seeley  wrote:
> On Thu, Jan 7, 2010 at 4:17 PM, Michael  wrote:
>>  I wanted to try 2 layers of sharding.
>
> Distrib search was written with multi-level in mind, but it's not supported 
> yet.
>
> -Yonik
> http://www.lucidimagination.com
>


Re: Spatial / Local Solr radius

2010-03-30 Thread Michael
Mauricio,

I was wondering whether you had heard anything back from jteam
regarding this issue. I have also noticed it and was wondering why It
was happening.

One thing I noticed is that this problem only appears for "sparse"
datasets as compared to "dense" ones. For example, I have two datasets
I've been testing with - one with 56 U.S. cities (the "sparse" set)
and one with over 197000 towns and cities (the "dense" set). The dense
set exhibited no problems with consistency searching at various radii,
but the sparse set exhibited the same issues you experienced.

Michael D.

On Mon, Dec 28, 2009 at 7:39 PM, Mauricio Scheffer
 wrote:
> It's jteam's plugin ( http://www.jteam.nl/news/spatialsolr ) which AFAIK is
> just the latest patch for SOLR-773 packaged as a stand-alone plugin.
>
> I'll try to contact jteam directly.
>
> Thanks
> Mauricio
>
> On Mon, Dec 28, 2009 at 8:02 PM, Grant Ingersoll wrote:
>
>>
>> On Dec 28, 2009, at 11:47 AM, Mauricio Scheffer wrote:
>>
>> > q={!spatial lat=43.705 long=116.3635 radius=100}*:*
>>
>> What's QParser is the "spatial" plugin? I don't know of any such QParser in
>> Solr.  Is this a third party tool?  If so, I'd suggest asking on that list.
>>
>> >
>> > with no other parameters.
>> > When changing the radius to 250 I get no results.
>> >
>> > In my config I have startTier = 9 and endTier = 17 (default values)
>> >
>> >
>> > On Mon, Dec 28, 2009 at 1:24 PM, Grant Ingersoll 
>> wrote:
>> >
>> >> What do your queries look like?
>> >>
>> >> On Dec 28, 2009, at 9:30 AM, Mauricio Scheffer wrote:
>> >>
>> >>> Hi everyone,
>> >>> I'm getting inconsistent behavior from Spatial Solr when searching with
>> >>> different radii. For the same lat/long I get:
>> >>>
>> >>> radius=1 -> 1 result
>> >>> radius=10 -> 0 result
>> >>> radius=25 -> 2 results
>> >>> radius=100 -> 2 results
>> >>> radius=250 -> 0 results
>> >>>
>> >>> I don't understand why radius=10 and 250 return no results. Is this a
>> >> known
>> >>> bug? I'm using the default configuration as specified in the PDF.
>> >>> BTW I also tried LocalSolr with the same results.
>> >>>
>> >>> Thanks
>> >>> Mauricio
>> >>
>> >>
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>


Re: Spatial / Local Solr radius

2010-03-31 Thread Michael
Mauricio,

I hooked up the spatial solr plugin to the Eclipse debugger and
narrowed the problem down to CartesianShapeFilter.getBoxShape(). The
algorithm used in the method can produce values of startX that are
greater than endX depending on the tier level returned by
CartesianTierPlotter.bestFit().  In this case, the "for" loop is
skipped altogether and the method returns a CartesianShape object with
an empty boxIds list.

I notice the problem when I have small, geographically sparse datasets.

I'm going to shoot the jteam an email regarding this.

Michael D.
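
For reference, the kind of guard that would avoid the symptom -- purely
illustrative, and probably not the real fix (which likely needs to split the
box into two ranges at the wrap-around rather than swap):

    // Inside getBoxShape(), after computing the box bounds for the tier
    // chosen by bestFit(): if startX > endX, the loop that collects box
    // ids never executes and the filter silently matches nothing.
    if (startX > endX) {
        double tmp = startX;  // naive swap just so the loop runs at all
        startX = endX;
        endX = tmp;
    }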


On Tue, Mar 30, 2010 at 5:10 PM, Mauricio Scheffer
 wrote:
> Hi Michael
>
> I exchanged a few mails with jteam, ultimately I realized my longitudes'
> signs were inverted so I was mapping to China instead of U.S. Still a bug,
> but inverting those longitudes "fixed" the problem in my case since I'm not
> running world-wide searches.
> Before that I ran a test to determine  what radii failed for a grid of 3x3
> lat/long with radius between 10 and 2500, if you're interested I can send
> you the results to compare.
> Also I'm running RC3, I see RC4 is out but haven't tried it.
> It would be interesting to see if this happens with the new spatial
> functions in trunk.
>
> --
> Mauricio
>
>
> On Tue, Mar 30, 2010 at 4:00 PM, Michael  wrote:
>
>> [ ... snip ... ]
>


Re: Spatial / Local Solr radius

2010-03-31 Thread Michael
Sean,

The JTeam code looks like it is based on an older version of the Local
Lucene code. The Lucene code appears to do a bit of sanity checking
that is not currently done in the JTeam code.

The problem also seems to happen more often with positive lat/long
values (as noted by Mauricio). I ran the same test on two sets of
data, one searching progressively outward from a point in the US and
from one in Russia. The Russia test showed the inconsistent results
while the U.S. didn't.

Mike D.

On Wed, Mar 31, 2010 at 4:57 PM, Mccleese, Sean W (388A)
 wrote:
> Michael,
>
> This was a problem I encountered as well, sometime late summer last year. My 
> memory is a bit hazy on the details, but as far as I remember the problem 
> centered around the tier level being set incorrectly. Additionally, I think 
> there's a JUnit test (perhaps CartesianShapeFilterTest?) that would indicate 
> the source of the problem but large sections of the test are 
> invalidated/commented out for the spatial change(s).
>
> Again, I haven't touched this code in several months but that's my 
> recollection on the issue. Either way, it's certainly not an isolated 
> problem, though my test datasets were also sparse and geographically distant.
>
> -Sean
>
> -- Forwarded Message
> From: Michael 
> Reply-To: 
> Date: Wed, 31 Mar 2010 13:33:39 -0700
> To: 
> Subject: Re: Spatial / Local Solr radius
>
> [ ... snip ... ]

Tutorials for developing filter plugins.

2010-04-08 Thread Michael
Hi all,

I was wondering whether any of you know any good tutorials online
describing how to develop a custom filter plugin. I have been trying
to create a latitude/longitude bounding box filter using two
NumericRangeFilters in a ChainedFilter object to no avail. No
documents are returned by getDocIdSet it seems. I also tried using
NumericRangeQueries in a QueryWrapperFilter but still no luck. Do I
have to customize the getDocIdSet methods in some way? Any help or
pointers in the right direction would be appreciated.

Regards,

Michael Donelan


Re: Tutorials for developing filter plugins.

2010-04-12 Thread Michael
MitchK, I need a range filter, not a token filter. I'm trying to
filter on ranges. If you have code, I'd love to see it.

Israel, I've been using the JTeam spatial plugin as a model for my
work. I initially tried to use the plugin, but there were some serious
bugs in the code. I was able to make some changes based on the Lucene
spatial code, but other problems popped up.

I tried replacing the cartesian filter with a bounding box using
simple latitudes and longitudes by using a ChainedFilter with 2
numeric range filters and then passing it in to a lat/long distance
filter, but neither seemed to work out of the box. It's as if I need
to override the default getDocIdSet methods in order to actually get a
result.

Is there a need to override that method? One would think they should work as-is.

Thank you both for your help.
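
One thing worth checking: NumericRangeFilter only matches values that were
indexed as trie numerics (Lucene NumericField, or Solr's trie field types);
against plain text-indexed numbers it silently matches nothing, which looks
exactly like the empty getDocIdSet() described above. A minimal sketch of the
two-filter bounding box, assuming "lat" and "lng" are trie-indexed doubles:

    import org.apache.lucene.misc.ChainedFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.NumericRangeFilter;

    public class BoundingBoxFilterFactory {
        // Builds a lat/lng bounding box; no getDocIdSet() override is needed.
        public static Filter build(double minLat, double maxLat,
                                   double minLng, double maxLng) {
            Filter lat = NumericRangeFilter.newDoubleRange("lat", minLat, maxLat, true, true);
            Filter lng = NumericRangeFilter.newDoubleRange("lng", minLng, maxLng, true, true);
            return new ChainedFilter(new Filter[] { lat, lng }, ChainedFilter.AND);
        }
    }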

On Sun, Apr 11, 2010 at 4:34 AM, Israel Ekpo  wrote:
> He is referring to the org.apache.lucene.search.Filter classes.
>
> Michael,
>
> I did a search too and I could not really find any useful tutorials on the
> subject.
>
> You can take a look at how this is implemented in the Spatial Solr Plugin by
> the JTeam
>
> http://www.jteam.nl/news/spatialsolr.html
>
> Their code, I believe, uses the bits() method which has been deprecated in
> Lucene 2.9 and removed in 3.0.
>
> The getDocIdSet() method returns a DocIdSet object, which you can prepare
> from org.apache.lucene.util.OpenBitSet
>
> I think there is probably some example in the new version (2nd Edition) of
> the *Lucene In Action *book on how to do something similar.
>
> You should check it out from the Manning Early Access Program page.
>
> http://www.manning.com/hatcher3/
>
> You should also check out the Solr 1.5 source code for how some of the
> Lucene Filter classes are designed.
>
>
>
> On Sat, Apr 10, 2010 at 5:23 AM, MitchK  wrote:
>
>>
>> Hi Michael,
>>
>> do you mean a TokenFilter like StopWordFilter?
>>
>> If you like, you could post some code, so one can help you.
>> It's really easy to develop some TokenFilters, if you have a look at
>> already
>> implemented ones.
>>
>> Kind regards
>> - Mitch
>> --
>> View this message in context:
>> http://n3.nabble.com/Tutorials-for-developing-filter-plugins-tp706874p709897.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
> http://www.israelekpo.com/
>


Re: JTeam Spatial Plugin

2010-05-11 Thread Michael
Try using "geo_distance" in the return fields.

On Thu, Apr 29, 2010 at 9:26 AM, Jean-Sebastien Vachon
 wrote:
> Hi All,
>
> I am using JTeam's Spatial Plugin RC3 to perform spatial searches on my index 
> and it works great. However, I can't seem to get it to return the computed 
> distances.
>
> My query component is run before the geoDistanceComponent and the 
> distanceField is set to "distance"
> Fields for lat/long are defined as well and the different tiers field are in 
> the results. Increasing the radius cause the number of matches to increase so 
> I guess that my setup is working...
>
> Here is sample query and its output (I removed some of the fields to keep it 
> short):
>
> /select?passkey=sample&q={!spatial%20lat=40.27%20long=-76.29%20radius=22%20calc=arc}title:engineer&wt=json&indent=on&fl=*,distance
>
> 
>
> {
>  "responseHeader":{
>  "status":0,
>  "QTime":69,
>  "params":{
>        "fl":"*,distance",
>        "indent":"on",
>        "q":"{!spatial lat=40.27 long=-76.29 radius=22 
> calc=arc}title:engineer",
>        "wt":"json"}},
>  "response":{"numFound":223,"start":0,"docs":[
>        {
>
>         "title":"Electrical Engineer",
>        "long":-76.3054962158203,
>         "lat":40.037899017334,
>         "_tier_9":-3.004,
>         "_tier_10":-6.0008,
>         "_tier_11":-12.0016,
>         "_tier_12":-24.0031,
>         "_tier_13":-47.0061,
>         "_tier_14":-93.00122,
>         "_tier_15":-186.00243,
>         "_tier_16":-372.00485},
> }}
>
> This output suggests to me that everything is in place. Does anyone know how
> to fetch the computed distance? I tried adding the field 'distance' to my list
> of fields but it didn't work.
>
> Thanks
>


Rolling upgrade to 5.4 from 5.0 - "bug" caused by leader changes - is there a workaround?

2016-01-19 Thread Michael Joyner

Hello all,

I downloaded 5.4 and started doing a rolling upgrade from a 5.0 
solrcloud cluster, and discovered that there seems to be a compatibility 
issue: a rolling upgrade from pre-5.4 causes the 5.4 nodes to fail with 
"unable to determine leader" errors.


Is there a work around that does not require taking the cluster down to 
upgrade to 5.4? Should I just stay with 5.3 for now? I need to implement 
programmatic schema changes in our collection via solrj, and based on 
what I'm reading this is a very new feature and requires the latest (or 
near latest) solrcloud.


Thanks!

-Mike


Re: Rolling upgrade to 5.4 from 5.0 - "bug" caused by leader changes - is there a workaround?

2016-01-19 Thread Michael Joyner

ok,

I just found the 5.4.1 RC2 download, it seems to work ok for a rolling 
upgrade.


I will see about downgrading back to 5.4.0 afterwards to be on an 
official release ...



On 01/19/2016 04:27 PM, Michael Joyner wrote:

Hello all,

I downloaded 5.4 and started doing a rolling upgrade from a 5.0 
solrcloud cluster and discovered that there seems to be a 
compatibility issue where doing a rolling upgrade from pre-5.4 which 
causes the 5.4 to fail with unable to determine leader errors.


Is there a work around that does not require taking the cluster down 
to upgrade to 5.4? Should I just stay with 5.3 for now? I need to 
implement programmatic schema changes in our collection via solrj, and 
based on what I'm reading this is a very new feature and requires the 
latest (or near latest) solrcloud.


Thanks!

-Mike




Re: Rolling upgrade to 5.4 from 5.0 - "bug" caused by leader changes - is there a workaround?

2016-01-20 Thread Michael Joyner

Unfortunately, it really couldn't wait.

I did a rolling upgrade to the 5.4.1RC2 then downgraded everything to 
5.4.0 and so far everything seems fine.


Couldn't take the cluster down.

On 01/19/2016 05:03 PM, Anshum Gupta wrote:

If you can wait, I'd suggest to be on the bug fix release. It should be out
around the weekend.

On Tue, Jan 19, 2016 at 1:48 PM, Michael Joyner  wrote:


ok,

I just found the 5.4.1 RC2 download, it seems to work ok for a rolling
upgrade.

I will see about downgrading back to 5.4.0 afterwards to be on an official
release ...



On 01/19/2016 04:27 PM, Michael Joyner wrote:


Hello all,

I downloaded 5.4 and started doing a rolling upgrade from a 5.0 solrcloud
cluster and discovered that there seems to be a compatibility issue where
doing a rolling upgrade from pre-5.4 which causes the 5.4 to fail with
unable to determine leader errors.

Is there a work around that does not require taking the cluster down to
upgrade to 5.4? Should I just stay with 5.3 for now? I need to implement
programmatic schema changes in our collection via solrj, and based on what
I'm reading this is a very new feature and requires the latest (or near
latest) solrcloud.

Thanks!

-Mike









Re: Rolling upgrade to 5.4 from 5.0 - "bug" caused by leader changes - is there a workaround?

2016-01-21 Thread Michael Joyner

On 01/21/2016 01:22 PM, Ishan Chattopadhyaya wrote:

Perhaps you could stay on 5.4.1 RC2, since that is what 5.4.1 will be
(unless there are last moment issues).

On Wed, Jan 20, 2016 at 7:50 PM, Michael Joyner  wrote:


Unfortunately, it really couldn't wait.

I did a rolling upgrade to the 5.4.1RC2 then downgraded everything to
5.4.0 and so far everything seems fine.

Couldn't take the cluster down.




I can wait for 5.4.1 to be official, as we are on the now-official 
5.4.0 and so far all is well.


Though, I do find it odd that you can add a copy field declaration that 
duplicates an already existing declaration and one ends up with 
duplicate declarations...




Custom plugin to handle proprietary binary input stream

2016-02-11 Thread michael dürr
I'm looking for an option to write a Solr plugin which can deal with a
custom binary input stream. Unfortunately Solr's javabin as a protocol is
not an option for us.

I already had a look at some possibilities like writing a custom request
handler, but it seems like the classes/interfaces one would need to
implement are not "generic" enough (e.g. the
SolrRequestHandler#handleRequest() method expects objects of classes
SolrQueryRequest and SolrQueryResponse rsp)

It would be of great help if you could direct me to any "pluggable"
solution which allows to receive and parse a proprietary binary stream at a
Solr server so that we do not have to provide an own customized binary solr
server.

Background:
Our problem is that we use a proprietary protocol to transfer our Solr
queries together with some other Java objects to our Solr server (at
present 3.6). The reason for this is that we have some logic at the Solr
server which heavily depends on these other Java objects. Unfortunately we
cannot easily shift that logic to the client side.

Thank you!

Michael
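
In case it helps: the raw bytes of a POST do reach a custom handler through
Solr's content streams, so one pluggable option is a plain RequestHandlerBase
subclass registered in solrconfig.xml, with your own parser on top of the
stream. A rough sketch (the class name is made up, and the exact set of
abstract methods varies by Solr version):

    import java.io.InputStream;
    import org.apache.solr.common.util.ContentStream;
    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;

    public class BinaryProtocolHandler extends RequestHandlerBase {
        @Override
        public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
                throws Exception {
            for (ContentStream cs : req.getContentStreams()) {
                InputStream in = cs.getStream();  // the raw bytes the client POSTed
                // ... decode the proprietary protocol here and fill rsp ...
            }
        }

        @Override
        public String getDescription() {
            return "Handler for a proprietary binary protocol";
        }
        // Depending on the Solr version, RequestHandlerBase may also require
        // getSource()/getVersion()-style SolrInfoMBean methods.
    }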


Solr stops working...randomly

2016-02-24 Thread Michael Beccaria
We're running Solr 4.4.0 in this software 
(https://github.com/CDRH/nebnews - a Django-based newspaper site). Solr is 
running on Ubuntu 12.04 in Jetty. The site occasionally (once a day) goes down 
with a Connection Refused error. I’m having a hard time troubleshooting the 
issue and was looking for help on next steps to find out why it is failing.



After debugging it turns out that it is solr that is refusing the connection 
(restarting Jetty fixes it every time). It randomly fails.



things I've tried:

running sudo service jetty check

Says the service is running



Opened up the port on the server and tried going to the solr admin page. This 
failed until I restarted jetty, then it works.



Checked the solr.log files and no errors are found. The jetty log level is set 
to INFO and I'm hesitant to put it to Debug because of file size growth and the 
long time between failures. The time between failures in the logs simply has a 
normal query at one time followed by a startup log sequence when I restart 
jetty.



Apache logs show tons of traffic (it's still running) from Google bots and 
maybe this is causing issues but I would still expect to find some sort of 
error. There is a mix of 200, 500 and 404 codes. Here's a small sample:


GET /lccn/sn85053037/1981-09-15/ed-1/seq-13/ocr/ HTTP/1.1  500  14814  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn86075296/1910-10-27/ed-1/seq-1/ HTTP/1.1  500  14884  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn84036028/1925-05-22/ed-1/seq-6/ocr/ HTTP/1.1  500  14791  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn84036028/1917-10-28/ed-1/seq-1/ocr.xml HTTP/1.1  200  400827  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/TheRetort/2011-10-07/ed-1/seq-10/ocr/ HTTP/1.1  500  14798  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/TheRetort/1979-05-10/ed-1/seq-8/ocr.xml HTTP/1.1  200  193883  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn84036124/1977-02-23/ed-1/seq-12/ocr/ HTTP/1.1  500  14790  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/Emcoe/1958-11-21/ed-1/seq-3/ocr/ HTTP/1.1  500  14760  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn85053252/1909-10-08/ed-1/.rdf HTTP/1.1  404  3051  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)






I could simply restart jetty nightly I guess but that seems to be putting a 
bandaid on the issue and I'm not sure how to proceed on this one. Any ideas?

Mike


Mike Beccaria
Director of Library Services
Paul Smith’s College
7833 New York 30
Paul Smiths, NY 12970
518.327.6376
mbecca...@paulsmiths.edu
www.paulsmiths.edu



-Original Message-
From: roland.sz...@booknwalk.com [mailto:roland.sz...@booknwalk.com] On Behalf Of Szucs Roland
Sent: Wednesday, February 24, 2016 1:10 PM
To: solr-user@lucene.apache.org
Subject: Re: very slow frequent updates

Thanks again Jeff. I will check the documentation of join queries because I
never used it before.

Regards

Roland

2016-02-24 19:07 GMT+01:00 Jeff Wartes :

> I suspect your problem is the intersection of “very large document”
> and “high rate of change”. Either of those alone would be fine.
>
> You’re correct, if the thing you need to search or sort by is the
> thing with a high change rate, you probably aren’t going to be able to
> peel those things out of your index.
>
> Perhaps you could work something out with join queries? So you have
> two kinds of documents - book content and book price - and your
> high-frequency change is limited to documents with very little data.
>
> On 2/24/16, 4:01 AM, "roland.sz...@booknwalk.com on behalf of Szűcs
> Roland" <szucs.rol...@bookandwalk.hu> wrote:
>
> >I have checked it already in the ref. guide. It is stated that you
> >can not search in external fields:
> >
> >https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes
> >
> >Really I am very curious that my problem is not a usual one, or the
> >case is that SOLR mainly focuses on search and not a kind of end-to-end support.
> >How does this approach work with 1 million documents with frequently
> >changing prices?
> >
> >Thanks for your time,
> >
> >Roland
> >
> >2016-02-24 12:39 GMT+01:00 Stefan Matheis :
> >
> >> Depending on what features you actually need, might be worth a
> >> look at "External File Fields", Roland?
> >>
> >> -Stefan
> >>
> >> On Wed, Feb 24, 2016 at 12:24 PM, Szűcs Roland
> >>  wrote:
> >> > Thanks Jeff for your help,
> >> >
> >> > Can it work in a production environment? Imagine when my customer
> >> > initiates a query having 1 000 docs in the result set. I can not us

Fwd: Standard highlighting doesn't work for Block Join

2016-03-01 Thread michael solomon
Hi,
I have Solr 5.4.1 and I'm trying to use the Block Join Query Parser to search
in children and return the parent.
I want to apply highlighting to the children, but it comes back empty.
My q parameter: "q={!parent which="is_parent:true"} normal_text:(account)"
highlight parameters:
"hl=true&hl.fl=normal_text&hl.simple.pre=&hl.simple.post="

and return:

> "highlighting": { "chikora.com": {} }
>

("chikora.com" it's the id of the parent document)
it's looks this already solved here:
https://issues.apache.org/jira/browse/LUCENE-5929
but I don't understand how to use it.

Thanks,
Michael
P.S: sorry about my English.. working on it :)


understand scoring

2016-03-01 Thread michael solomon
Hi all,
I'm struggling to understand Solr scoring and can't understand why I get
those results:
[screenshot of the query results omitted]

I expected that the order would be 1, 3, 2 (because 1 is the shortest field [4
words], and 3 before 2 because of the distance between the words...)
Thank you,
Michael


Re: understand scoring

2016-03-02 Thread michael solomon
Thank you, @Doug Turnbull. I tried http://splainer.io but it didn't work for
my query (no explain output for the docs).
Here's the picture again:
https://drive.google.com/file/d/0B-7dnH4rlntJc2ZWdmxMS3RDMGc/view?usp=sharing

On Tue, Mar 1, 2016 at 10:06 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Supposedly Late April, early May. But don't hold me to it until I see copy
> edits :) Of course looks like now you can read at least the full ebook in
> MEAP form.
>
> -Doug
>
> On Tue, Mar 1, 2016 at 2:57 PM, shamik  wrote:
>
> > Doug, do we've a date for the hard copy launch?
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/understand-scoring-tp4260837p4260860.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search 
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>


Standard highlighting doesn't work for Block Join

2016-03-02 Thread michael solomon
IT WAS MY FIRST POST IN THE MAILING LIST, SO I'M NOT SURE IF YOU GOT IT, SO
I'M SENDING IT AGAIN

Hi,
I have Solr 5.4.1 and I'm trying to use the Block Join Query Parser to search
in children and return the parent.
I want to apply highlighting to the children, but it comes back empty.
My q parameter: "q={!parent which="is_parent:true"} normal_text:(account)"
highlight parameters:
"hl=true&hl.fl=normal_text&hl.simple.pre=&hl.simple.post="

and return:

> "highlighting": { "chikora.com": {} }
>

("chikora.com" it's the id of the parent document)
it's looks this already solved here:
https://issues.apache.org/jira/browse/LUCENE-5929
but I don't understand how to use it.

Thanks,
Michael
P.S: sorry about my English.. working on it :)


Re: understand scoring

2016-03-02 Thread michael solomon
Hi Emir,
This morning I deleted those documents and now added them again to re-run the
query.. and now it behaves as I expect (0_0), and I can't reproduce the
problem... this is weird.. :\

On Wed, Mar 2, 2016 at 11:38 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Michael,
> Can you please run query with debug and share title field configuration.
>
> Thanks,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
> On 02.03.2016 09:14, michael solomon wrote:
>
>> [ ... snip ... ]


Query solrcloud questions

2016-03-02 Thread michael solomon
Hi,
I installed 3 instances of SolrCloud 5.4.1.
I'm building a little search engine for websites, and I store their info as
nested documents (one document for the website's general information, and its
children are the pages inside the website).
So when querying this collection I use a BlockJoin parser ({!parent
which="is_parent:true" score="max"}).
I have several questions:

   1. How do I use highlighting with BlockJoin? When I tried it, my
   result's highlight section came back empty.
   2. How do I return the relevant child, i.e. the child that caused the
   parent to be returned as a result (the child with the highest score)?
   3. Am I boosting right?:

>{!parent which="is_parent:true" score="max"}
>(
>normal_text:("clients home"~1000)
>h_titles:("clients home"~1000)^3
>title:("clients home"~1000)^5
>depth:0^1.1
>)


Thank you,
Michael


Very slow updates

2016-03-10 Thread michael solomon
Hi,
I have a collection with one shard in SolrCloud (for development, before
scaling out), and when I try to index new documents it takes about 20 seconds
for 12 MB of data.
What's wrong with my config?

VM RAM - 28gb
JVM-Memory - 10gb

What else can I do?

Thanks,
Michael


Re: Very slow updates

2016-03-11 Thread michael solomon
It turned out to be a cloud (Azure) problem. If you build a 'DS'-series
machine in classic mode, your network throughput dies (about 300 KB/s). I'm
very upset about that.. it took me about a day to track down this bug...
On Mar 10, 2016 9:52 PM, "Erick Erickson"  wrote:

> This really doesn't have much information to go on.
>
> Have you reviewed: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> ?
>
> What is "slow"? How are you updating? Are you batching updates? Are
> you committing often?
>
> Details matter.
>
> Best,
> Erick
>
> On Thu, Mar 10, 2016 at 2:41 AM, michael solomon 
> wrote:
> > Hi,
> > I have a collection with one shard in solrcloud (for development before
> > scaling) and when I'm trying to update new documents it's take about 20
> sec
> > for 12mb of data.
> > What wrong with my config?
> >
> > VM RAM - 28gb
> > JVM-Memory - 10gb
> >
> > What else can I do?
> >
> > Thanks,
> > Michael
>


return and highlight the most relevant child with BlockJoinQuery

2016-03-14 Thread michael solomon
Hi,
how can I *highlight* and *return* the most relevant child with a
BlockJoinQuery?
For this:

> {!parent which="is_parent:*" score=max}(title:(terms)


I expect to get:

.
.
.
docs:[

{
   doc parent
   _childDocuments_:{the most relevant child}
}
{
   doc parent2
   _childDocuments_:{the most relevant child}
}
.
.
.

]
highlight:{

{doc parent: highlight from the children}
{doc parent: highlight from the children}

}

Thanks a lot,
Michae


Re: return and highlight the most relevant child with BlockJoinQuery

2016-03-15 Thread michael solomon
Thanks Mikhail,
Regarding the former :) can you elaborate? I didn't understand the
context of the JIRA issue you mentioned (SOLR-8202).

Regarding highlighting, I think it's possible because:
https://issues.apache.org/jira/browse/LUCENE-5929
BUT HOW?

On Mon, Mar 14, 2016 at 7:28 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Michae,
> Regarding the former, it's not a feature of [child] result transformer, it
> might be separately requested, but I prefer to provide via generic
> SOLR-8202.
>
> Regarding highlighting, I can't comment, I only saw that there is some
> highlighting case for {!parent} queries. Sorry.
>
> On Mon, Mar 14, 2016 at 6:13 PM, michael solomon 
> wrote:
>
> > [ ... snip ... ]
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> 
>


Re: No live SolrServers available to handle this request

2016-03-19 Thread michael solomon
What query did you try?

On Thu, Mar 17, 2016 at 12:22 PM, Anil  wrote:

> HI,
>
> We are using solrcloud with zookeeper and each collection has 5 shareds and
> 2 replicas.
> we are seeing "org.apache.solr.client.solrj.SolrServerException: No live
> SolrServers available to handle this request". i dont see any issues with
> replicas.
>
> what would be root cause of the exception ? Thanks.
>
> Regards,
> Anil
>


score mixing

2016-03-27 Thread michael solomon
Hi,
I have nested documents and use the BlockJoinQueryParser.
In the parent documents I have a "rank" field that gives an arbitrary score
for each parent.
Is it possible to mix the original scoring with mine? i.e.:
SolrScore + rank = final score
or (proportional scoring): SolrScore/MaxScore + rank/MaxRank = final
score (between 0 and 1)
Thanks,
Michael
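
One way to express the additive variant, assuming "rank" is a single-valued
indexed numeric field on the parents, is to wrap the block-join query in a
function query and add the field value to its score (for the normalized
variant you would have to know MaxScore/MaxRank up front and divide
accordingly):

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery();
    q.set("q", "{!func}sum(query($bjq), field(rank))");  // SolrScore + rank
    q.set("bjq", "{!parent which=is_parent:true score=max}child_field:bla");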


most popular collate spellcheck

2016-03-31 Thread michael solomon
Hi,
Is it possible to return the most popular collation?
i.e:
spellcheck.q = prditive analytiycs
spellcheck.maxCollations = 5
spellcheck.count=0
response:

<lst name="spellcheck">
  <lst name="suggestions"/>
  <bool name="correctlySpelled">false</bool>
  <lst name="collations">
    <str name="collation">positive analytic</str>
    <str name="collation">positive analytics</str>
    <str name="collation">predictive analytics</str>
    <str name="collation">primitive analytics</str>
    <str name="collation">punitive analytic</str>
  </lst>
</lst>

I want the collations to be ordered by numFound, and obviously
"predictive analytics" has more results than "positive analytic".
Thanks,
Michael
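
One client-side workaround I know of: with spellcheck.maxCollationTries > 0
and spellcheck.collateExtendedResults=true, each returned collation carries a
hits count, so the client can re-sort the collations itself. A sketch:

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery q = new SolrQuery();
    q.set("spellcheck", "true");
    q.set("spellcheck.q", "prditive analytiycs");
    q.set("spellcheck.maxCollations", "5");
    q.set("spellcheck.maxCollationTries", "10");         // actually test candidate collations
    q.set("spellcheck.collateExtendedResults", "true");  // adds a "hits" count per collation
    // execute the query, then sort the returned collations by "hits" client-side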


index time boost nested documents JSON format

2016-03-31 Thread michael solomon
Hi,
how can I apply index-time boosts to nested documents in JSON format?


update log not in ACTIVE or REPLAY state

2016-03-31 Thread michael dürr
Hello,

when I launch my two nodes in Solr cloud, I always get the following error
at node2:

PeerSync: core=portal_shard1_replica2 url=http://127.0.1.1:8984/solr
ERROR, update log not in ACTIVE or REPLAY state.
FSUpdateLog{state=BUFFERING, tlog=null}

Actually, I don't experience any problems, but before going to production
I wanted to understand why I get this error.

I'm running two nodes (node1 and node2) in a Solr Cloud cluster (5.4.1).
node1 is started with embedded zookeeper and listens to port 8983. Node2
listens on port 8984 and registers with the embedded zookeeper of node1 at
port 9983.
I have one collection "portal" (1 shard, 2 replicas), where each node
serves one replica.
The settings for commit on both nodes are:

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

and

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

Can you give me some advice on how to get rid of this error? Should I
simply ignore it?

Thanks,
Michael


Re: most popular collate spellcheck

2016-04-02 Thread michael solomon
Thanks, and what can we do about that?
On Apr 2, 2016 5:28 PM, "Reth RM"  wrote:

> Afaik, such feature doesn't exist currently, but looks like nice to have.
>
>
>
>
> On Thu, Mar 31, 2016 at 8:33 PM, michael solomon 
> wrote:
>
> > [ ... snip ... ]
> >
>


spellcheck return wrong collation

2016-04-03 Thread michael solomon
Hi,
image:
http://s24.postimg.org/u457bhzr9/Untitled.png

why the suggestion return "analytics" (great!) but the collation take
"analtics"?
Thanks,
Michael


Re: most popular collate spellcheck

2016-04-03 Thread michael solomon
done :)
https://issues.apache.org/jira/browse/SOLR-8934

On Sun, Apr 3, 2016 at 2:08 PM, Reth RM  wrote:

> May be open a jira under improvement.
> https://issues.apache.org/jira/login.jsp?
>
>
> On Sat, Apr 2, 2016 at 11:30 PM, michael solomon 
> wrote:
>
> > Thanks, and what we can do about that?
> > On Apr 2, 2016 5:28 PM, "Reth RM"  wrote:
> >
> > > Afaik, such feature doesn't exist currently, but looks like nice to
> have.
> > >
> > >
> > >
> > >
> > > On Thu, Mar 31, 2016 at 8:33 PM, michael solomon  >
> > > wrote:
> > >
> > > > [ ... snip ... ]
> > >
> >
>


Expand or ChildDocTransformerFactory

2016-04-07 Thread michael solomon
Hi,
I'm using a block join query - {!parent which="is_parent:true"}.
I need to return the most relevant child for each parent.
I saw on Google that there are two options for this:

   1. Expand
   2. ChildDocTransformerFactory

What is the difference between them? Which one do I need to use?
Thanks a lot,
Michael
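[Editor's sketch for readers of this thread: both approaches can return
children next to parents. Assuming the is_parent/normal_text fields used
elsewhere in this archive and parent/child block indexing, they look roughly
like this; check the exact syntax against your Solr version:]

ChildDocTransformerFactory attaches matching children to each parent hit:

q={!parent which="is_parent:true" score=max}normal_text:bla
&fl=id,score,[child parentFilter=is_parent:true childFilter=normal_text:bla limit=1]

The ExpandComponent instead collapses child matches onto their block (via the
built-in _root_ field) and then expands each group:

q=normal_text:bla
&fq={!collapse field=_root_}
&expand=true&expand.rows=1

An assumed caveat: the [child] transformer returns children in index order,
while expand sorts each group by score by default, so expand is the closer
fit for "most relevant child".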


Limiting regex queries

2016-04-10 Thread Michael Harkins
Hey all,

I am using lucene and solr version 4.2, and was wondering what would
be the best way to not allow regex queries with very large numbers.
Something like blah{1234567} or blah{1234, 123445678}


Re: Limiting regex queries

2016-04-10 Thread Michael Harkins
Well, the original architecture is out of my hands, but when someone
sends in a query like that with a large repetition count, my system
basically shuts down and the CPU spikes with a large increase in
memory usage. The queries are for strings. The query itself was an
accident, but I want to be able to prevent an accident from bringing
down the index.
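[Editor's sketch: one client-side guard is to scan the raw query string for
repetition quantifiers above a threshold and reject it before it ever reaches
the query parser. Everything below - class name, pattern, limit - is an
assumption to adapt:]

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexQueryGuard {

    // Matches {n} or {n,m} repetition quantifiers in the raw query string.
    private static final Pattern QUANTIFIER =
            Pattern.compile("\\{(\\d+)(?:\\s*,\\s*(\\d+))?\\}");

    // Assumed ceiling; tune to whatever your index can tolerate.
    private static final long MAX_REPETITIONS = 1000L;

    public static boolean isSafe(String rawQuery) {
        Matcher m = QUANTIFIER.matcher(rawQuery);
        while (m.find()) {
            try {
                long lower = Long.parseLong(m.group(1));
                long upper = m.group(2) != null ? Long.parseLong(m.group(2)) : lower;
                if (lower > MAX_REPETITIONS || upper > MAX_REPETITIONS) {
                    return false; // reject before the query parser ever sees it
                }
            } catch (NumberFormatException e) {
                return false; // number too large to even parse: definitely reject
            }
        }
        return true;
    }
}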


> On Apr 10, 2016, at 12:34 PM, Erick Erickson  wrote:
>
> OK, why is this a problem? This smells like an XY problem,
> you want to take some specific action, but it's not at all
> clear what the problem is. There might be other ways
> of doing this.
>
> If you're allowing regexes on numeric fields, using real
> number fields (trie) and using range queries is a much
> better way to go.
>
> Best,
> Erick
>
>> On Sun, Apr 10, 2016 at 9:28 AM, Michael Harkins  wrote:
>> Hey all,
>>
>> I am using lucene and solr version 4.2, and was wondering what would
>> be the best way to not allow regex queries with very large numbers.
>> Something like blah{1234567} or blah{1234, 123445678}


boost parent fields BlockJoinQuery

2016-04-11 Thread michael solomon
Hi,
I'm using the BlockJoin parent query parser to return the parent of the
relevant child, i.e.:
{!parent which="is_parent:true" score=max}(child_field:bla)

Is it possible to boost the parent? Something like:

{!parent which="is_parent:true" score=max}(child_field:bla)
parent_field:"bla bla"^10
Thanks,
Michael


Re: boost parent fields BlockJoinQuery

2016-04-12 Thread michael solomon
Thanks,
when I'm trying:
city:"walla walla"^10 {!parent which="is_parent:true"
score=max}(normal_text:walla)
I get:

> "msg": "org.apache.solr.search.SyntaxError: Cannot parse
> '(normal_text:walla': Encountered \"\" at line 1, column 18.\nWas
> expecting one of:\n ...\n ...\n ...\n\"+\"
> ...\n\"-\" ...\n ...\n\"(\" ...\n\")\" ...\n
> \"*\" ...\n\"^\" ...\n ...\n ...\n
>  ...\n ...\n ...\n
>  ...\n\"[\" ...\n\"{\" ...\n ...\n
> \"filter(\" ...\n ...\n"


On Tue, Apr 12, 2016 at 1:30 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello,
>
> It's usually
> parent_field:"bla bla"^10 {!parent which="is_parent:true"
> score=max}(child_field:bla)
> or
> parent_field:"bla bla"^10 +{!parent which="is_parent:true"
> score=max}(child_field:bla)
>
> there should be no spaces in the child clause; otherwise extract it to a
> param and refer to it via v=$param
>
>
> On Tue, Apr 12, 2016 at 9:56 AM, michael solomon 
> wrote:
>
> > Hi,
> > I'm using the BlockJoin parent query parser to return the parent of the
> > relevant child, i.e.:
> > {!parent which="is_parent:true" score=max}(child_field:bla)
> >
> > Is it possible to boost the parent? Something like:
> >
> > {!parent which="is_parent:true" score=max}(child_field:bla)
> > parent_field:"bla bla"^10
> > Thanks,
> > Michael
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> 
>


Re: boost parent fields BlockJoinQuery

2016-04-13 Thread michael solomon
Thank you for the response.
This query worked without errors:

> (city:"tucson"^1000) +{!parent which="is_parent:true"
> score=max}(normal_text:"silver ring")
>
However, this is not exactly what I was looking for.
I got back from Solr all the documents whose city field has the value
Tucson, but I wanted only to BOOST on city:"Tucson", not to search in this field.
Thank you a lot,
Michael

On Tue, Apr 12, 2016 at 10:41 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Going by the error message, you under-copy-pasted the search query and
> omitted the closing bracket.
>
> On Tue, Apr 12, 2016 at 3:30 PM, michael solomon 
> wrote:
>
> > Thanks,
> > when I'm trying:
> > city:"walla walla"^10 {!parent which="is_parent:true"
> > score=max}(normal_text:walla)
> > I get:
> >
> > > "msg": "org.apache.solr.search.SyntaxError: Cannot parse
> > > '(normal_text:walla': Encountered \"\" at line 1, column 18.\nWas
> > > expecting one of:\n ...\n ...\n ...\n
> \"+\"
> > > ...\n\"-\" ...\n ...\n\"(\" ...\n\")\" ...\n
> > > \"*\" ...\n\"^\" ...\n ...\n ...\n
> > >  ...\n ...\n ...\n
> > >  ...\n\"[\" ...\n\"{\" ...\n ...\n
> > > \"filter(\" ...\n ...\n"
> >
> >
> > On Tue, Apr 12, 2016 at 1:30 PM, Mikhail Khludnev <
> > mkhlud...@griddynamics.com> wrote:
> >
> > > Hello,
> > >
> > > It's usually
> > > parent_field:"bla bla"^10 {!parent which="is_parent:true"
> > > score=max}(child_field:bla)
> > > or
> > > parent_field:"bla bla"^10 +{!parent which="is_parent:true"
> > > score=max}(child_field:bla)
> > >
> > > there should be no spaces in the child clause; otherwise extract it to
> > > a param and refer to it via v=$param
> > >
> > >
> > > On Tue, Apr 12, 2016 at 9:56 AM, michael solomon  >
> > > wrote:
> > >
> > > > Hi,
> > > > I'm using the BlockJoin parent query parser to return the parent of
> > > > the relevant child, i.e.:
> > > > {!parent which="is_parent:true" score=max}(child_field:bla)
> > > >
> > > > Is it possible to boost the parent? Something like:
> > > >
> > > > {!parent which="is_parent:true" score=max}(child_field:bla)
> > > > parent_field:"bla bla"^10
> > > > Thanks,
> > > > Michael
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > <http://www.griddynamics.com>
> > > 
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> 
>
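[Editor's note on the resolution: with the default OR operator, marking the
block-join clause as mandatory with + turns every other clause into a pure
score boost. A hedged sketch using the v=$param form Mikhail mentions, so the
child clause may contain spaces:]

q=+{!parent which="is_parent:true" score=max v=$childq} city:"tucson"^1000
&childq=normal_text:"silver ring"

Only parents whose children match $childq are returned; city:"tucson"^1000
then merely lifts the score of Tucson documents instead of adding matches of
its own.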


update log not in ACTIVE or REPLAY state

2016-04-13 Thread michael dürr
Hello,

when I launch my two nodes in Solr cloud, I always get the following error
at node2:

PeerSync: core=portal_shard1_replica2 url=http://127.0.1.1:8984/solr ERROR,
update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
tlog=null}

Actually, I haven't experienced any problems, but before going to production
I wanted to know why I get this error.

I'm running two nodes (node1 and node2) in a Solr Cloud cluster (5.4.1).
node1 is started with embedded zookeeper and listens to port 8983. Node2
listens on port 8984 and registers with the embedded zookeeper of node1 at
port 9983.
I have one collection "portal" (1 shard, 2 replicas), where each node
serves one replica.
The settings for commit on both nodes are:

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

and

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

Can you give me some advice on how to get rid of this error?
Should I simply ignore it?

Thanks,
Michael


Issue when zookeeper session expires during shard leader election.

2015-07-27 Thread Michael Roberts
Hey,

I am encountering an issue which looks a lot like 
https://issues.apache.org/jira/browse/SOLR-6763.

However, it seems like the fix for that does not address the entire problem. 
That fix will only work if we hit the zkClient.getChildren() call before the 
reconnect logic has finished reconnecting us to ZooKeeper (I can reproduce 
scenarios where it doesn’t in 4.10.4). If the reconnect has already happened, 
we won’t get the session timeout exception.

The specific problem I am seeing is slightly different SOLR-6763, but the root 
cause appears to be the same. The issue that I am seeing is; during startup the 
collections are registered and there is one coreZkRegister-1-thread-* per 
collection. The elections are started on this thread, the 
/collections//leader_elect ZNodes are created, and then the thread blocks 
waiting for the peers to become available. During the block the ZooKeeper 
session times out.

Once we finish blocking, the reconnect logic calls register() for each 
collection, which restarts the election process (although serially this time). 
At a later point, we can have two threads that are trying to register the same 
collection.

This is incorrect, because the coreZkRegister-1-thread-’s are assuming they are 
leader with no verification from zookeeper. The ephemeral leader_elect nodes 
they created were removed when the session timed out. If another host started 
in the interim (or any point after that actually), it would see no leader, and 
would attempt to become leader of the shard itself. This leads to some 
interesting race conditions, where you can end up with two leaders for a shard.

It seems like a more complete fix would be to actually close the 
ElectionContext upon reconnect. This would break us out of the wait for peers 
loop, and stop the threads from processing the rest of the leadership logic. 
The reconnection logic would then continue to call register() again for each 
Collection, and if the ZK state indicates it should be leader it can re-run the 
leadership logic.

I have a patch in testing that does this, and I think addresses the problem.

What is the general process for this? I didn't want to reopen a closed Jira
item. Should I create a new one so the issue and the proposed fix can be 
discussed?

Thanks.

Mike.




Re: Lucene/Solr Git Mirrors 5 day lag behind SVN?

2015-10-24 Thread Michael McCandless
I added a comment on the INFRA issue.

I don't understand why it periodically "gets stuck".

Mike McCandless

http://blog.mikemccandless.com


On Fri, Oct 23, 2015 at 11:27 AM, Kevin Risden
 wrote:
> It looks like both Apache Git mirror (git://git.apache.org/lucene-solr.git)
> and GitHub mirror (https://github.com/apache/lucene-solr.git) are 5 days
> behind SVN. This seems to have happened before:
> https://issues.apache.org/jira/browse/INFRA-9182
>
> Is this a known issue?
>
> Kevin Risden


Re: Solr suggest is related to second letter, not to initial letter

2015-02-15 Thread Michael Sokolov
StandardTokenizer splits your text into tokens, and the suggester 
suggests tokens independently.  It sounds as if you want the suggestions 
to be based on the entire text (not just the current word), and that 
only adjacent words in the original should appear as suggestions.  
Assuming that's what you are after (it's a little hard to tell from your 
e-mail -- you might want to clarify by providing a few example of how 
you *do* want it to work instead of just examples of how you *don't* 
want it to work), you have a couple of choices:


1) don't use StandardTokenizer, use KeywordTokenizer instead - this will 
preserve the entire original text and suggest complete texts, rather 
than words
2) maybe consider using a shingle filter along with the standard tokenizer, 
so that your tokens include multi-word shingles (a sketch follows after this 
list)
3) Use a suggester with better support for a statistical language model, 
like this one: 
http://blog.mikemccandless.com/2014/01/finding-long-tail-suggestions-using.html, 
but to do this you will probably need to do some java programming since 
it isn't well integrated into solr
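[Editor's sketch of option 2; the factory names are the stock Solr ones, but
treat the exact attributes as assumptions to tune:]

<fieldType name="text_shingle_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>

A suggester built on this field sees two- and three-word shingles as single
tokens, so only words that were adjacent in the original text get suggested
together.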


-Mike

On 2/14/2015 3:44 AM, Volkan Altan wrote:

Any idea?



On 12 Şub 2015, at 11:12, Volkan Altan  wrote:

Hello Everyone,

All I want from the Solr suggester is that the suggestions offered for the
second term (typed after the initial term) are actually related to the
initial term itself. But, just like the initial terms, the second terms are
suggested independently.


Example:
http://localhost:8983/solr/solr/suggest?q=facet_suggest_data:"adidas+s"

Query text: adidas s

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="s">
        <int name="numFound">1</int>
        <int name="startOffset">27</int>
        <int name="endOffset">28</int>
        <arr name="suggestion">
          <str>samsung</str>
        </arr>
      </lst>
      <lst name="collation">
        <str name="collationQuery">facet_suggest_data:"adidas samsung"</str>
        <int name="hits">0</int>
        <lst name="misspellingsAndCorrections">
          <str name="adidas">adidas</str>
          <str name="s">samsung</str>
        </lst>
      </lst>
    </lst>
  </lst>
</response>

The terms "Adidas" and "Samsung" are available only within separate
documents. A common document in which both of them occur cannot be found.

How can I solve that problem?



schema.xml

[field type definition stripped from the archive]



Best







Re: Solr suggest is related to second letter, not to initial letter

2015-02-18 Thread Michael Sokolov


On 02/17/2015 03:46 AM, Volkan Altan wrote:
> First of all thank you for your answer.

You're welcome - thanks for sending a more complete example of your 
problem and expected behavior.

> I don't want to use KeywordTokenizer. Because, as long as the compound words
> written by the user are available in any document, I am able to receive a
> conclusion. I just don't want "q=galaxy + samsung" to appear, because it is an
> inappropriate suggestion and it doesn't work.
>
> Many Thanks Ahead of Time!

Did you try the other suggestions in my earlier reply?

-Mike



Re: highlighting the boolean query

2015-02-24 Thread Michael Sokolov
There is also PostingsHighlighter -- I recommend it, if only for the 
performance improvement, which is substantial, but I'm not completely 
sure how it handles this issue.  The one drawback I *am* aware of is 
that it is insensitive to positions (so words from phrases get 
highlighted even in isolation)


-Mike
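[Editor's sketch of wiring up the PostingsHighlighter in Solr 4.x, for
readers who want to try it; the field must index offsets, and the class and
attribute names below should be verified against your version:]

In schema.xml, on the highlighted field:

<field name="Contents" type="text_general" indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>

In solrconfig.xml:

<searchComponent class="solr.PostingsSolrHighlighter" name="highlight"/>

Re-indexing is required after adding storeOffsetsWithPositions.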


On 02/24/2015 12:46 PM, Erik Hatcher wrote:

BooleanQuery’s extractTerms looks like this:

public void extractTerms(Set<Term> terms) {
  for (BooleanClause clause : clauses) {
    if (clause.isProhibited() == false) {
      clause.getQuery().extractTerms(terms);
    }
  }
}
that’s generally the method called by the Highlighter for what terms should be 
highlighted.  So even if a term didn’t match the document, the query that the 
term was in matched the document and it just blindly highlights all the terms 
(minus prohibited ones).   That at least explains the behavior you’re seeing, 
but it’s not ideal.  I’ve seen specialized highlighters that convert to spans, 
which are accurate to the exact matches within the document.  Been a while 
since I dug into the HighlightComponent, so maybe there’s some other options 
available out of the box?

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 





On Feb 24, 2015, at 3:16 AM, Dmitry Kan  wrote:

Erick,

Our default operator is AND.

Both queries below parse the same:

a OR (b c) OR d
a OR (b AND c) OR d

The parsed query:

Contents:a (+Contents:b +Contents:c)
Contents:d

So this part is consistent with our expectation.



> I'm a bit puzzled by your statement that "c" didn't contribute to the
> score.

What I meant was that the term c was not hit by the scorer: the explain
section does not refer to it. I'm using made-up terms here, but the
concept holds.

The code suggests that we could benefit from storing term offsets and
positions:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470

Is that a correct assumption?

On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson 
wrote:


Highlighting is such a pain...

what does the parsed query look like? If the default operator is OR,
then this seems correct as both 'd' and 'c' appear in the doc. So
I'm a bit puzzled by your statement that "c" didn't contribute to the
score.

If the parsed query is, indeed
a +b +c d

then it does look like something with the highlighter. Whether other
highlighters are better for this case.. no clue ;(

Best,
Erick

On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan  wrote:

Erick,

nope, we are using std lucene qparser with some customizations, that do

not

affect the boolean query parsing logic.

Should we try some other highlighter?

On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson  wrote:
Are you using edismax?

On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan  wrote:

Hello!

In Solr 4.3.1 there seems to be some inconsistency with the highlighting of
the boolean query:

a OR (b c) OR d

This returns a proper hit, which shows that only d was included into the
document score calculation.

But the highlighter returns both d and c in <em> tags.

Is this a known issue of the standard highlighter? Can it be mitigated?


--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info



--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info



--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info






[ANNOUNCE] Apache Solr 4.10.4 released

2015-03-05 Thread Michael McCandless
March 2015, Apache Solr™ 4.10.4 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.10.4

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.10.4 is available for immediate download at:

http://www.apache.org/dyn/closer.cgi/lucene/solr/4.10.4

Solr 4.10.4 includes 24 bug fixes, as well as Lucene 4.10.4 and its 13
bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


Best way to monitor Solr regarding crashes

2015-03-28 Thread Michael Bakonyi
Hi,

we were using Solr for about 3 months without problems until a few days ago it 
crashed one time and we don't know why. After a restart everything was fine 
again, but we want to be better prepared the next time this could happen. So I'd 
like to know what's the best way to monitor a single Solr instance and what 
logging configuration you think is useful for this kind of monitoring. Maybe 
there's a possibility to automatically restart Solr after it crashes + to see 
in detail in the logs what happened right before the crash?

Can you give me any hints? We're using Tomcat 6.X with Solr 4.8.X

Cheers,
Michael
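[Editor's sketch: one common building block here is Solr's ping handler,
which an external monitor can poll and, on repeated failure, restart Tomcat.
The handler below is the stock 4.x configuration; the monitoring side is
whatever your ops tooling provides:]

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">solrpingquery</str>
  </lst>
</requestHandler>

The monitor polls http://host:8080/solr/<core>/admin/ping and treats a
non-200 response or a timeout as the signal to restart and alert.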

Is it possible to facet on the results of a custom solr function?

2015-04-13 Thread Motulewicz, Michael
Hi,

  I’m attempting to facet on the results of a custom solr function. I’ve been 
trying all kinds of combinations that I think would work, but keep getting 
errors. I’m starting to wonder if it is possible.

   I’m using Solr 4.0 and here is how I am calling:
&facet.query={!func}myCustomSolrQuery(param1, param2, param3)
   (My function returns an array of results like {“M:3”, “D:4”})

   This returns the same number as the total results.

  I’ve tried adding to the end of the query to break out the results like this:
&facet.query={!func}myCustomSolrQuery(param1, param2, param3) : (“M:3”)
  No matter what I put, I get parsing exceptions

Thanks for any help!
Mike



Re: Is it possible to facet on the results of a custom solr function?

2015-04-20 Thread Motulewicz, Michael
Solved my own problem.

Using multiple function range query parsers works fine against my custom 
function

&facet.query={!frange l=1 u=1} MyCustomSolrQuery(param1,param2, param3)
&facet.query={!frange l=2 u=2} MyCustomSolrQuery(param1,param2, param3)
Etc…

Gives me the counts for 1 then 2 etc

Not sure if there’s a better way, but this works



From: Michael Motulewicz <michael.motulew...@healthsparq.com>
Reply-To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Date: Monday, April 13, 2015 at 11:40 AM
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Subject: Is it possible to facet on the results of a custom solr function?

Hi,

  I’m attempting to facet on the results of a custom solr function. I’ve been 
trying all kinds of combinations that I think would work, but keep getting 
errors. I’m starting to wonder if it is possible.

   I’m using Solr 4.0 and here is how I am calling:
&facet.query={!func}myCustomSolrQuery(param1, param2, param3)
   (My function returns an array of results like {“M:3”, “D:4”})

   This returns the same number as the total results.

  I’ve tried adding to the end of the query to break out the results like this:
&facet.query={!func}myCustomSolrQuery(param1, param2, param3) : (“M:3”)
  No matter what I put, I get parsing exceptions

Thanks for any help!
Mike




Replication not triggered

2015-04-27 Thread Michael Lackhoff
We have old fashioned replication configured between one master and one
slave. Everything used to work but today I noticed that recent records
were not present in the slave (the same query gives hits on master but none
on slave).
The replication communication seems to work. This is what I get in the logs:

INFO: [default] webapp=/solr path=/replication
params={command=fetchindex&_=1430136325501&wt=json} status=0 QTime=0
Apr 27, 2015 2:05:25 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
Apr 27, 2015 2:05:25 PM org.apache.solr.core.SolrCore execute
INFO: [default] webapp=/solr path=/replication
params={command=details&_=1430136325600&wt=json} status=0 QTime=21

It says both are in sync, but obviously they are not and even the
replication page of the admin view mentions different Version, Gen and size:
Master (Searching)  1430107573634 27 287.19 GB
Master (Replicable) 1430107573634 27 -
Slave (Searching)   1429762011916 23 287.14 GB

Any idea why the replication is not triggered here or what I could try
to fix it?
Solr Version is 4.10.3.

-Michael
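[Editor's note: one way to kick a stuck slave is to force a fetch by hand;
the replication handler's fetchindex command accepts an explicit masterUrl.
A sketch, assuming the "default" core from the logs above:]

http://slave:8983/solr/default/replication?command=fetchindex&masterUrl=http://master:8983/solr/default/replication

If a forced fetch succeeds while polling does not, the slave's polling
configuration (masterUrl and pollInterval in solrconfig.xml) is the place to
look.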


Sync failure after shard leader election when adding new replica.

2015-05-26 Thread Michael Roberts
Hi,

I have a SolrCloud setup, running 4.10.3. The setup consists of several cores, 
each with a single shard and initially each shard has a single replica (so, 
basically, one machine). I am using core discovery, and my deployment tools 
create an empty core on newly provisioned machines.

The scenario that I am testing is, Machine 1 is running and writes are 
occurring from my application to Solr. At some point, I stop Machine 1, and 
reconfigure my application to add Machine 2. Both machines are then started.

What I would expect to happen at this point is that Machine 2 cannot become 
leader because it is behind compared to Machine 1. Machine 2 would then restore 
from Machine 1.

However, looking at the logs. I am seeing Machine 2 become elected leader and 
fail the PeerRestore

2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to 
continue.
2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - 
try and sync
2015-05-24 17:20:25.997 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.update.PeerSync - PeerSync: core=project 
url=http://10.32.132.64:11000/solr START 
replicas=[http://jchar-1:11000/solr/project/] nUpdates=100
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.update.PeerSync - PeerSync: core=project 
url=http://10.32.132.64:11000/solr DONE.  We have no versions.  sync failed.
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have 
no versions - we can't sync in that case - we were active before, so become 
leader anyway
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: 
http://10.32.132.64:11000/solr/project/ shard1

What is the expected behavior here? What’s the best practice for adding a new 
replica? Should I have the SolrCloud running and do it via the Collections API 
or can I continue to use core discovery?

Thanks.




Peer Sync fails when newly added node is elected leader.

2015-06-04 Thread Michael Roberts
Hi,

I am seeing some unexpected behavior when adding a new machine to my cluster. I 
am running 4.10.3.

My setup has multiple collections, each collection has a single shard. I am 
using core auto discovery on the hosts (my deployment mechanism ensures that 
the directory structure is created and the core.properties file is in the right 
place).

To add a new machine I have to stop the cluster.

If I add a new machine, and start the cluster, if this new machine is elected 
leader for the shard, peer recovery fails. So, now I have a leader with no 
content, and replicas with content. Depending on where the read request is 
sent, I may or may not get the response I am expecting.

2015-06-04 14:26:09.595 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader process 
for shard shard1
2015-06-04 14:26:09.607 -0700 (,,,) coreZkRegister-1-thread-9 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - Waiting until we see more 
replicas up for shard shard1: total=2 found=1 timeoutin=1.14707356E15ms
2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to 
continue.
2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - 
try and sync
2015-06-04 14:26:10.115 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.update.PeerSync - PeerSync: core=domain 
url=http://10.36.9.70:11000/solr START 
replicas=[http://mlim:11000/solr/domain/] nUpdates=100
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.update.PeerSync - PeerSync: core=domain 
url=http://10.36.9.70:11000/solr DONE.  We have no versions.  sync failed.
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have 
no versions - we can't sync in that case - we were active before, so become 
leader anyway
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: 
http://10.36.9.70:11000/solr/domain/ shard1
2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ZkController - No LogReplay needed for core=domain 
baseURL=http://10.36.9.70:11000/solr
2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ZkController - I am the leader, no recovery necessary

This seems like a fairly common scenario. So I suspect, either I am doing 
something incorrectly, or I have an incorrect assumption about how this is 
supposed to work.

Does anyone have any suggestions?

Thanks

Mike.


Re: Peer Sync fails when newly added node is elected leader.

2015-06-05 Thread Michael Roberts
Thanks, that was the response I was expecting unfortunately.

We have to stop the cluster to add a node, because Solr is part of a larger 
system and we don’t support either partial shutdown, or dynamic addition within 
the larger system.

“it waits for some time to see other nodes but if it finds none then it goes 
ahead and becomes the leader.”

That is not what I am seeing happen, though. In my example, I had two machines: 
A (which had been running previously) and B (which was newly added). Both A & B 
participated in the election, and B was elected. It wasn't a case of only B being 
available. It would seem that B shouldn't be elected when there was a better 
candidate (A), or that, if elected, B should ensure it's caught up to its peers 
before marking itself as active.

On 6/4/15, 8:31 PM, "Shalin Shekhar Mangar"  wrote:



>Why do you stop the cluster while adding a node? This is the reason why
>this is happening. When the first node of a solr cluster starts up, it
>waits for some time to see other nodes but if it finds none then it goes
>ahead and becomes the leader. If other nodes were up and running then peer
>sync and replication recovery will make sure that the node with data
>becomes the leader. So just keep the cluster running while adding a new
>node.
>
>Also, stop relying on core discovery for setting up a node. At some point
>we will stop supporting this feature. Use the collection API to add new
>replicas.
>
>On Fri, Jun 5, 2015 at 5:01 AM, Michael Roberts 
>wrote:
>
>> Hi,
>>
>> I am seeing some unexpected behavior when adding a new machine to my
>> cluster. I am running 4.10.3.
>>
>> My setup has multiple collections, each collection has a single shard. I
>> am using core auto discovery on the hosts (my deployment mechanism ensures
>> that the directory structure is created and the core.properties file is in
>> the right place).
>>
>> To add a new machine I have to stop the cluster.
>>
>> If I add a new machine, and start the cluster, if this new machine is
>> elected leader for the shard, peer recovery fails. So, now I have a leader
>> with no content, and replicas with content. Depending on where the read
>> request is sent, I may or may not get the response I am expecting.
>>
>> 2015-06-04 14:26:09.595 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader
>> process for shard shard1
>> 2015-06-04 14:26:09.607 -0700 (,,,) coreZkRegister-1-thread-9 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - Waiting until we see
>> more replicas up for shard shard1: total=2 found=1 timeoutin=1.14707356E15ms
>> 2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to
>> continue.
>> 2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader
>> - try and sync
>> 2015-06-04 14:26:10.115 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.update.PeerSync - PeerSync: core=domain url=
>> http://10.36.9.70:11000/solr START replicas=[
>> http://mlim:11000/solr/domain/] nUpdates=100
>> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.update.PeerSync - PeerSync: core=domain url=
>> http://10.36.9.70:11000/solr DONE.  We have no versions.  sync failed.
>> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we
>> have no versions - we can't sync in that case - we were active before, so
>> become leader anyway
>> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader:
>> http://10.36.9.70:11000/solr/domain/ shard1
>> 2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ZkController - No LogReplay needed for core=domain
>> baseURL=http://10.36.9.70:11000/solr
>> 2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ZkController - I am the leader, no recovery necessary
>>
>> This seems like a fairly common scenario. So I suspect, either I am doing
>> something incorrectly, or I have an incorrect assumption about how this is
>> supposed to work.
>>
>> Does anyone have any suggestions?
>>
>> Thanks
>>
>> Mike.
>>
>
>
>
>-- 
>Regards,
>Shalin Shekhar Mangar.
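[Editor's sketch of the Collections API call Shalin recommends in place of
core discovery, reusing the collection and shard names from the logs above;
the node name is a placeholder:]

http://host:11000/solr/admin/collections?action=ADDREPLICA&collection=domain&shard=shard1&node=host2:11000_solr

ADDREPLICA creates the core on the target node and puts it straight into
recovery, so it replicates from the current leader rather than competing with
it in an election.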


Re: CheckIndex failed for Solr 4.7.2 index

2015-06-09 Thread Michael McCandless
IBM's J9 JVM unfortunately still has a number of nasty bugs affecting
Lucene; most likely you are hitting one of these.  We used to test J9
in our continuous Jenkins jobs, but there were just too many
J9-specific failures and we couldn't get IBM's attention to resolve
them, so we stopped.  For now you should switch to Oracle JDK, or
OpenJDK.

But there's some good news!  Recently, a member from the IBM JDK team
replied to this Elasticsearch thread:
https://discuss.elastic.co/t/need-help-with-ibm-jdk-issues-with-es-1-4-5/1748/3

And then Robert Muir ran Lucene's tests with the latest J9 and opened
several issues; see the 2nd bullet under Apache Lucene at
https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2015-06-09
and at least one of the issues seems to be making progress
(https://issues.apache.org/jira/browse/LUCENE-6522).

So there is hope for the future, but for today it's too dangerous to
use J9 with Lucene/Solr/Elasticsearch.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jun 9, 2015 at 12:23 PM, Guy Moshkowich  wrote:
> We are using Solr 4.7.2 and we found that when we run
> CheckIndex.checkIndex on one of the Solr shards we are getting the error
> below.
> Both replicas of the shard had the same error.
> The shard index looked healthy:
> 1) It appeared active in the Solr admin page.
> 2) We could run searches against it.
> 3) No relevant errors where found in Solr logs.
> 4) After we optimized the index in LUKE, CheckIndex did not report any
> error.
>
> My questions:
> 1) Is this is a real issue or a known bug in CheckIndex code that cause
> false negative ?
> 2) Is there a known fix for this issue?
>
> Here is the error we got:
>  validateIndex Segments file=segments_bhe numSegments=15 version=4.7
> format= userData={commitTimeMSec=1432689607801}
>   1 of 15: name=_6cth docCount=248744
> codec=Lucene46
> compound=false
> numFiles=11
> size (MB)=86.542
> diagnostics = {timestamp=1428883354605, os=Linux,
> os.version=2.6.32-431.23.3.el6.x86_64, mergeFactor=10, source=merge,
> lucene.version=4.7.2 1586229 - rmuir - 2014-04-10 09:00:35, os.arch=amd64,
> mergeMaxNumSegments=-1, java.version=1.7.0, java.vendor=IBM Corporation}
> has deletions [delGen=3174]
> test: open reader.FAILED
> WARNING: fixIndex() would remove reference to this segment; full
> exception:
> java.lang.RuntimeException: liveDocs count mismatch: info=156867, vs
> bits=156872
> at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:581)
> at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:372)
>
> Appreciate yout help,
> Guy.
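[Editor's note: CheckIndex can be run straight from the Lucene core jar.
-fix permanently drops unreadable segments and their documents, so back the
index up first; the paths below are placeholders:]

java -cp lucene-core-4.7.2.jar org.apache.lucene.index.CheckIndex /var/solr/data/index

java -cp lucene-core-4.7.2.jar org.apache.lucene.index.CheckIndex /var/solr/data/index -fix

The first form only reports; the second removes references to broken
segments. That an optimize in Luke (a full segment rewrite) made the report
clean is consistent with the liveDocs metadata, rather than the data itself,
being off.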

