How to override a QueryComponent

2008-01-07 Thread Brendan Grainger

Hi,

I'm using a solr nightly build and I have created my own  
QueryComponent which is just a subclass of the default  
QueryComponent. FYI, in most cases I just delegate to the superclass,  
but I also allow a parameter to be used which will cause some custom  
filtering (which is why I'm doing all this in the first place).


Anyway, it's working great. The only question I have is whether I've
registered QueryComponent incorrectly in solrconfig.xml, as I get this
exception when I start Solr:


Jan 7, 2008 11:41:22 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Multiple  
searchComponent registered to the same name: query ignoring:  
[EMAIL PROTECTED]
at org.apache.solr.util.plugin.AbstractPluginLoader.load 
(AbstractPluginLoader.java:153)
at org.apache.solr.core.SolrCore.loadSearchComponents 
(SolrCore.java:504)

at org.apache.solr.core.SolrCore.<init>(SolrCore.java:333)
at org.apache.solr.servlet.SolrDispatchFilter.init 
(SolrDispatchFilter.java:85)
at org.mortbay.jetty.servlet.FilterHolder.doStart 
(FilterHolder.java:99)
at org.mortbay.component.AbstractLifeCycle.start 
(AbstractLifeCycle.java:40)


This is how I've registered the QueryComponent (with the
relevant note from the example solrconfig.xml):


solrconfig.xml:
...
  <searchComponent name="query"
                   class="com.mybiz.solr.handler.component.MyQueryComponent" />
...


It doesn't cause any other issue or problem, it's just a bit scary for
people looking at the logs. Is this normal, or have I incorrectly
registered my QueryComponent?


Thanks!




Tomcat and Solr - out of memory

2008-01-07 Thread Jae Joo
Hi,

What happens if the Solr application hits the maximum heap memory assigned?

Will it die or just slow down?

Jae


Re: How to override a QueryComponent

2008-01-07 Thread Ryan McKinley
You are doing things correctly, thanks for pointing this out.  I just
changed the initialization process to only add the default components
that are not explicitly specified:

http://svn.apache.org/viewvc?view=rev&revision=609717

thanks!
ryan


Brendan Grainger wrote:
> [...]






Query - multiple

2008-01-07 Thread Jae Joo
If the number of results > 2500, then sort by company_name;
otherwise, sort by revenue.

Do I have to query twice? Once to get the number of results, and a
second time for the sort? The second query should only be issued when
necessary.

Any efficient way?
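One common pattern is to first issue a cheap rows=0 request just to get numFound, then run the real query with the sort chosen from that count. The selection logic itself is trivial; this is a stdlib-only sketch (the threshold and field names come from the question above, the class and method names are made up for illustration, and this is not a Solr API):

```java
public class SortChooser {
    // Threshold from the question: more than 2500 results -> sort by company_name.
    static final int THRESHOLD = 2500;

    /** Pick the sort clause for the second query based on numFound. */
    static String chooseSort(long numFound) {
        return numFound > THRESHOLD ? "company_name asc" : "revenue desc";
    }

    public static void main(String[] args) {
        System.out.println(chooseSort(3000)); // large result set -> company_name asc
        System.out.println(chooseSort(100));  // small result set -> revenue desc
    }
}
```

The rows=0 request avoids fetching any stored fields, so the extra round trip is cheap even on a large index.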

Thanks,

Jae


Problem with camelCase but not casing in general

2008-01-07 Thread Benjamin Higgins
Hi all, I am using a mostly out-of-the-box install of Solr that I'm
using to search through our code repositories.  I've run into a funny
problem where searches for text that is camelCased aren't returning
results unless the casing is exactly the same.  

For example, a query for "getElementById" returns 364 results, but
"getelementbyid" returns 0.

There isn't a problem with all casings, though.  For example, "function"
and "Function" both return the same number of results, as do
"FUNCTION" and "FUNCtion" (6,278 with my docs).  However, "funcTION"
returns only a few results, and only where the word is actually split up
(e.g. "func tion")!

So it seems that something may be tokenizing words where casing appears
in the middle of them!

How can I get this to stop?

Thanks!

Ben


Here's the definition for the text field type in my schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>



Re: Problem with camelCase but not casing in general

2008-01-07 Thread Brendan Grainger
I think your problem is happening because splitOnCaseChange is 1 in
your WordDelimiterFilterFactory:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>
So "getElementById" is tokenized to:

(get,0,3)
(Element,3,10)
(By,10,12)
(Id,12,14)
(getElementById,0,14,posIncr=0)

However getelementbyid is tokenized to:

(getelementbyid,0,14)

which wouldn't be a term in the index??

I'm sure someone who knows more about solr will answer, but maybe  
that will help.
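The case-change split described above can be reproduced with a few lines of stdlib-only Java. This is just an illustration of the behaviour, not Solr's actual WordDelimiterFilter code, and the class name is made up:

```java
import java.util.ArrayList;
import java.util.List;

public class CaseChangeSplit {
    /** Split a token at lower-to-upper case transitions, roughly what
     *  splitOnCaseChange="1" does (illustration only). */
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            if (Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(token.charAt(i))) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        parts.add(token.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("getElementById")); // [get, Element, By, Id]
        System.out.println(split("getelementbyid")); // [getelementbyid]
    }
}
```

The offsets Brendan lists (get 0-3, Element 3-10, By 10-12, Id 12-14) correspond exactly to these split points.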


On Jan 7, 2008, at 5:15 PM, Benjamin Higgins wrote:







Re: Problem with camelCase but not casing in general

2008-01-07 Thread Yonik Seeley
On Jan 7, 2008 5:15 PM, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> Hi all, I am using a mostly out-of-the-box install of Solr that I'm
> using to search through our code repositories.  I've run into a funny
> problem where searches for text that is camelCased aren't returning
> results unless the casing is exactly the same.
>
> For example, a query for "getElementById" returns 364 results, but
> "getelementbyid" returns 0.
>
> There isn't a problem with all casings, though.  For example, "function"
> and "Function" both return the same number of results, as does
> "FUNCTION" and "FUNCtion" (6,278 with my docs).  However, "funcTION"
> returns only a few results--and it's where the word is actually split up
> (e.g. "func tion")!
>
> So it seems that something may be tokenizing words where casing appears
> in the middle of them!
>
> How can I get this to stop?

remove WordDelimiterFilter.

It's funny though, since WordDelimiterFilter should not have caused
this to happen (a query of getelementbyid should have matched a doc
with getElementById).

-Yonik

> Thanks!
>
> Ben
>
>
> Here's the definition for the text field type in my schema.xml:
>
>  positionIncrementGap="100">
>   
> 
> 
>  words="stopwords.txt"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>   
>   
> 
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>  words="stopwords.txt"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>   
> 
>
>


Re: Problem with camelCase but not casing in general

2008-01-07 Thread Yonik Seeley
On Jan 7, 2008 5:26 PM, Brendan Grainger <[EMAIL PROTECTED]> wrote:
> I think your problem is happening because splitOnCaseChange is 1 in
> your WordDelimiterFilterFactory:
>
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>
> So "getElementById" is tokenized to:
>
> (get,0,3)
> (Element,3,10)
> (By,10,12)
> (Id,12,14)
> (getElementById,0,14,posIncr=0)
>
> However getelementbyid is tokenized to:
>
> (getelementbyid,0,14)
>
> which wouldn't be a term in the index??

It would be a term in the index since both go through the lowercase filter.

Anyway, if splitting on capitalization changes is not desired, getting
rid of the WordDelimiterFilter in both the index and query analyzers
is the right thing to do.

-Yonik
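Yonik's point about the lowercase filter can be made concrete: with catenateWords="1" at index time, the joined token also survives the split, and after lowercasing it is exactly the term a lowercased query would look up. The sketch below approximates that index-time chain (an illustration with a made-up class name, not Solr's actual filter code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class IndexTimeTerms {
    /** Approximate the index-time chain from the schema: split on case
     *  change, catenateWords="1", then lowercase (illustration only). */
    static List<String> terms(String token) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            if (Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(token.charAt(i))) {
                out.add(token.substring(start, i));
                start = i;
            }
        }
        out.add(token.substring(start));
        if (out.size() > 1) {
            out.add(token); // catenateWords="1": also keep the joined word
        }
        out.replaceAll(s -> s.toLowerCase(Locale.ROOT));
        return out;
    }

    public static void main(String[] args) {
        // The catenated, lowercased term ends up in the index...
        System.out.println(terms("getElementById"));
        // ...so a lowercased query term should have matched it.
        System.out.println(terms("getElementById").contains("getelementbyid"));
    }
}
```

This is why the reported behaviour is surprising: given the posted schema, the query "getelementbyid" should find documents indexed with "getElementById".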


Re: Problem with camelCase but not casing in general

2008-01-07 Thread Mike Klaas


On 7-Jan-08, at 2:35 PM, Yonik Seeley wrote:


Anyway, if splitting on capitalization changes is not desired, getting
rid of the WordDelimiterFilter in both the index and query analyzers
is the right thing to do.


Well, he might want to split on punctuation.

self.object.frobulation.method()

probably shouldn't be one token.
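The punctuation-splitting behaviour Mike wants to keep can be sketched in a few lines of stdlib Java (a rough illustration of what the word-delimiter step does for punctuation, not Solr's implementation; the class name is made up):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PunctuationSplit {
    /** Split on runs of non-alphanumeric characters (illustration only). */
    static List<String> split(String token) {
        return Arrays.stream(token.split("[^A-Za-z0-9]+"))
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("self.object.frobulation.method()"));
        // [self, object, frobulation, method]
    }
}
```

Keeping the filter but disabling only splitOnCaseChange (as Ben does below) preserves exactly this behaviour.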

The OP's problem might have to do with index/query-time analyzer  
mismatch.  We'd know more if he posted the schema definitions.


-Mike


RE: Problem with camelCase but not casing in general

2008-01-07 Thread Benjamin Higgins
> Well, he might want to split on punctuation.

I do, so I just turned off splitOnCaseChange instead of removing
WordDelimiterFilterFactory completely.

It's looking good now!

> The OP's problem might have to do with index/query-time analyzer  
> mismatch.  We'd know more if he posted the schema definitions.

I did post a portion of my schema in my original email.  I think I'm OK
there, since I don't recall fiddling with it any.

Thanks everyone.

Ben


Re: Problem with camelCase but not casing in general

2008-01-07 Thread Mike Klaas

On 7-Jan-08, at 3:21 PM, Benjamin Higgins wrote:


Well, he might want to split on punctuation.


I do, so I just turned off splitOnCaseChange instead of removing
WordDelimiterFilterFactory completely.

It's looking good now!


The OP's problem might have to do with index/query-time analyzer
mismatch.  We'd know more if he posted the schema definitions.


I did post a portion of my schema in my original email.  I think I'm OK
there, since I don't recall fiddling with it any.


Ah, I see it now.  How very odd that that was a problem given that  
schema.


You also might want to consider turning off the stemming for code  
search.


-Mike


Re: solr with hadoop

2008-01-07 Thread Stu Hood
As Mike suggested, we use Hadoop to organize our data en route to Solr. Hadoop 
allows us to load balance the indexing stage, and then we use the raw Lucene 
IndexWriter.addIndexes method to merge the data to be hosted on Solr
instances.

Thanks,
Stu



-Original Message-
From: Mike Klaas <[EMAIL PROTECTED]>
Sent: Friday, January 4, 2008 3:04pm
To: solr-user@lucene.apache.org
Subject: Re: solr with hadoop

On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:

> I have a huge index (about 110 million documents, 100 fields
> each), but the size of the index is reasonable: about 70 GB.
> All I need is to increase performance, since some queries, which match
> a big number of documents, are running slow.
> So I was thinking: are there any benefits to using Hadoop for this? And
> if so, what direction should I go? Has anybody done something to
> integrate Solr with Hadoop? Does it give any performance boost?
>
Hadoop might be useful for organizing your data en route to Solr, but  
I don't see how it could be used to boost performance over a huge  
Solr index.  To accomplish that, you need to split it up over two  
machines (for which you might find hadoop useful).

-Mike




Newbie question: facets and filter query?

2008-01-07 Thread solruser2

I have two categories, CDs and DVDs, and two request handlers, something
like this:

<requestHandler name="..." class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="qf">
      disc_name^2
      disc_year
    </str>
    <str name="...">1</str>
  </lst>
  <lst name="appends">
    <str name="facet">true</str>
    <str name="facet.field">category</str>
  </lst>
</requestHandler>

<requestHandler name="cd" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="qf">
      disc_name^2
      disc_year
      disc_artist
    </str>
    <str name="...">1</str>
  </lst>
  <lst name="appends">
    <str name="fq">category:cd</str>
    <str name="facet">true</str>
    <str name="facet.field">type</str>
  </lst>
</requestHandler>

The problem is that when I use the 'cd' request handler, the facet count for
'dvd' provided in the response is 0 because of the filter query used to only
show the 'cd' facet results. How do I retrieve facet counts for both
categories while only retrieving the results for one category?
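A toy model makes the symptom clear: faceting runs over the filtered document set, so once fq=category:cd is applied there are no dvd documents left to count. Counting over the unfiltered match set (e.g. via a second request without the fq) keeps both counts. This sketch is an illustration of that arithmetic only, with made-up names, not Solr's faceting code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FacetCounts {
    /** Count the values of a field across a set of matched documents. */
    static Map<String, Integer> facet(List<String> categories) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : categories) {
            counts.merge(c, 1, Integer::sum); // increment this value's count
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> matched = List.of("cd", "cd", "dvd");  // docs matching q
        List<String> filtered = List.of("cd", "cd");        // after fq=category:cd
        System.out.println(facet(matched));   // counts for both categories
        System.out.println(facet(filtered));  // dvd count is gone entirely
    }
}
```

So one workaround, assuming nothing smarter is available in your Solr version, is two requests: a filtered one for the result page and an unfiltered, rows=0 one for the cross-category facet counts.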
-- 
View this message in context: 
http://www.nabble.com/Newbie-question%3A-facets-and-filter-query--tp14680213p14680213.html
Sent from the Solr - User mailing list archive at Nabble.com.



How do i normalize diff information (different type of documents) in the index ?

2008-01-07 Thread s d
e.g. if the index has field1 and field2, and documents of type (A) always
have information for field1 AND for field2, while documents of type (B)
always have information for field1 but NEVER for field2.
The problem is that the scoring formula will sum field1 and field2, hence
skewing in favour of documents of type (A).
If I combine the 2 fields into 1 field (in an attempt to normalize) I will
obviously skew the statistics.
Please advise,
Thanks,


Re: How do i normalize diff information (different type of documents) in the index ?

2008-01-07 Thread Mike Klaas


On 7-Jan-08, at 9:02 PM, s d wrote:

e.g. if the index has field1 and field2, and documents of type (A) always
have information for field1 AND for field2, while documents of type (B)
always have information for field1 but NEVER for field2.
The problem is that the scoring formula will sum field1 and field2, hence
skewing in favour of documents of type (A).
If I combine the 2 fields into 1 field (in an attempt to normalize) I will
obviously skew the statistics.


Try the dismax handler.  Its main goal is to query multiple fields
while only counting the score of the highest-scoring one (mostly).
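The "(mostly)" refers to the tie-breaker idea behind disjunction-max scoring: take the best field score, plus a small fraction of the rest. This is a sketch of that combination rule only (made-up class name, not Lucene's DisjunctionMaxQuery):

```java
import java.util.List;

public class DisMaxScore {
    /** Disjunction-max combination: the max field score plus a small
     *  "tie" fraction of the remaining scores (illustration only). */
    static double combine(List<Double> fieldScores, double tie) {
        double max = 0.0, sum = 0.0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        // A type (B) doc matching one field strongly is not drowned out
        // by a type (A) doc matching two fields weakly:
        System.out.println(combine(List.of(2.0, 0.0), 0.1)); // type (B) doc
        System.out.println(combine(List.of(1.0, 1.0), 0.1)); // type (A) doc
    }
}
```

With tie between 0 and 1 you can dial in how much the non-maximum fields still contribute, which also answers the follow-up below about not wanting to ignore part of the information.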


-Mike


Re: How do i normalize diff information (different type of documents) in the index ?

2008-01-07 Thread s d
Isn't there a better way to take the information into account but still
normalize? Taking the score of only one of the fields doesn't sound like
the best thing to do (it's basically ignoring part of the information).

On Jan 7, 2008 9:20 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:

>
> On 7-Jan-08, at 9:02 PM, s d wrote:
>
> > e.g. if the index is field1 and field2 and documents of type (A)
> > always have
> > information for field1 AND information for field2 while document of
> > type (B)
> > always have information for field1 but NEVER information for field2.
> > The problem is that the formula will sum field1 and field2 hence
> > skewing in
> > favour of documents of type (A).
> > If i combine the 2 fields into 1 field (in an attempt to normalize)
> > i will
> > obviously skew the statistics.
>
> Try the dismax handler.  Its main goal is to query multiple fields
> while only counting the score of the highest-scoring one (mostly).
>
> -Mike
>


Re: solr with hadoop

2008-01-07 Thread Otis Gospodnetic
Stu,

Interesting!  Can you provide more details about your setup?  By "load balance 
the indexing stage" you mean "distribute the indexing process", right?  Do you 
simply take your content to be indexed, split it into N chunks where N matches 
the number of TaskNodes in your Hadoop cluster and provide a map function that 
does the indexing?  What does the reduce function do?  Does that call 
IndexWriter.addIndexes or do you do that outside Hadoop?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Stu Hood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, January 7, 2008 7:14:20 PM
Subject: Re: solr with hadoop

As Mike suggested, we use Hadoop to organize our data en route to Solr.
Hadoop allows us to load balance the indexing stage, and then we use
the raw Lucene IndexWriter.addIndexes method to merge the data to be
hosted on Solr instances.

Thanks,
Stu


