How to override a QueryComponent
Hi, I'm using a Solr nightly build and I have created my own QueryComponent, which is just a subclass of the default QueryComponent. FYI, in most cases I just delegate to the superclass, but I also allow a parameter that triggers some custom filtering (which is why I'm doing all this in the first place). Anyway, it's working great. The only question I have is whether I have registered the QueryComponent incorrectly in solrconfig.xml, as I get this exception when I start Solr:

Jan 7, 2008 11:41:22 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Multiple searchComponent registered to the same name: query ignoring: [EMAIL PROTECTED]
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:153)
        at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:504)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:333)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:85)
        at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)

This is how I've registered the QueryComponent (with the relevant note from the example solrconfig.xml):

solrconfig.xml
...
<searchComponent name="query" class="com.mybiz.solr.handler.component.MyQueryComponent" />
...

It doesn't cause any other issue or problem, it's just a bit scary for people looking at the logs. Is this normal, or have I registered my QueryComponent incorrectly? Thanks!
Tomcat and Solr - out of memory
Hi, what happens if the Solr application hits the maximum heap memory assigned? Will it die or just slow down? Jae
Re: How to override a QueryComponent
You are doing things correctly; thanks for pointing this out. I just changed the initialization process to only add the default components that are not already specified: http://svn.apache.org/viewvc?view=rev&revision=609717

thanks!
ryan

Brendan Grainger wrote:
> Hi, I'm using a solr nightly build and I have created my own QueryComponent which is just a subclass of the default QueryComponent. [...]
> It doesn't cause any other issue or problem, just a bit scary for people looking at the logs. Is this normal or have I incorrectly registered my QueryComponent? Thanks!
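For reference, overriding the built-in component comes down to a single registration in solrconfig.xml. A minimal sketch using the poster's class name; the surrounding config and the classpath setup are assumptions:

```xml
<!-- Replaces the built-in "query" component with a custom subclass.
     The class must be on Solr's classpath (e.g. a jar in the lib dir). -->
<searchComponent name="query"
                 class="com.mybiz.solr.handler.component.MyQueryComponent" />
```

Before the linked fix, Solr registered the default "query" component unconditionally, so a user-supplied one with the same name triggered the (harmless) warning above.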
Query - multiple
If the number of results is > 2500, sort by company_name; otherwise, sort by revenue. Do I have to query twice? One query to get the number of results, and a second one for the sort. The second query would only be issued when necessary. Is there a more efficient way? Thanks, Jae
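One straightforward approach is the two-step flow the question describes: a cheap rows=0 request just to read numFound, then the real request with the chosen sort. A sketch of that logic; the URLs, sort directions, and helper names are hypothetical illustrations, not Solr API:

```python
# Sketch of "count first, then sort". choose_sort is the decision rule;
# the URLs show roughly how the two Solr requests might look.

def choose_sort(num_found, threshold=2500):
    """Pick the sort field based on the result count (directions assumed)."""
    return "company_name asc" if num_found > threshold else "revenue desc"

def build_queries(q):
    # First request: rows=0 just to read numFound cheaply (no docs fetched).
    count_url = f"/select?q={q}&rows=0"
    # Second request, built after parsing numFound from the first response:
    def second(num_found):
        return f"/select?q={q}&sort={choose_sort(num_found)}"
    return count_url, second

count_url, second = build_queries("ibm")
print(choose_sort(3000))  # company_name asc
print(choose_sort(100))   # revenue desc
```

Since the rows=0 query does no document retrieval, the overhead of the extra round trip is usually small compared to the main query.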
Problem with camelCase but not casing in general
Hi all, I am using a mostly out-of-the-box install of Solr that I'm using to search through our code repositories. I've run into a funny problem where searches for text that is camelCased aren't returning results unless the casing is exactly the same. For example, a query for "getElementById" returns 364 results, but "getelementbyid" returns 0. There isn't a problem with all casings, though. For example, "function" and "Function" both return the same number of results, as does "FUNCTION" and "FUNCtion" (6,278 with my docs). However, "funcTION" returns only a few results--and it's where the word is actually split up (e.g. "func tion")! So it seems that something may be tokenizing words where casing appears in the middle of them! How can I get this to stop? Thanks! Ben Here's the definition for the text field type in my schema.xml:
Re: Problem with camelCase but not casing in general
I think your problem is happening because splitOnCaseChange is 1 in your WordDelimiterFilterFactory. So "getElementById" is tokenized to:

(get,0,3)
(Element,3,10)
(By,10,12)
(Id,12,14)
(getElementById,0,14,posIncr=0)

However, "getelementbyid" is tokenized to:

(getelementbyid,0,14)

which wouldn't be a term in the index?? I'm sure someone who knows more about Solr will answer, but maybe that will help.

On Jan 7, 2008, at 5:15 PM, Benjamin Higgins wrote:
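The splitting behavior described above can be approximated in a few lines. This regex is a simplified assumption, not the real WordDelimiterFilter (which has many more rules), but it reproduces the token breakdown in this thread:

```python
import re

# Rough approximation of splitOnCaseChange: an all-caps run holds together
# until a lowercase letter follows (its last capital starts the next word),
# otherwise words break at lower->upper transitions.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+")

def split_on_case_change(token):
    return _PART.findall(token)

print(split_on_case_change("getElementById"))  # ['get', 'Element', 'By', 'Id']
print(split_on_case_change("getelementbyid"))  # ['getelementbyid']
print(split_on_case_change("FUNCtion"))        # ['FUN', 'Ction']
```

Note that in the real filter the original token can also be kept at the same position (the posIncr=0 entry above), which is what catenateWords/catenateAll control.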
Re: Problem with camelCase but not casing in general
On Jan 7, 2008 5:15 PM, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> Hi all, I am using a mostly out-of-the-box install of Solr that I'm
> using to search through our code repositories. I've run into a funny
> problem where searches for text that is camelCased aren't returning
> results unless the casing is exactly the same.
>
> For example, a query for "getElementById" returns 364 results, but
> "getelementbyid" returns 0.
>
> There isn't a problem with all casings, though. For example, "function"
> and "Function" both return the same number of results, as does
> "FUNCTION" and "FUNCtion" (6,278 with my docs). However, "funcTION"
> returns only a few results--and it's where the word is actually split up
> (e.g. "func tion")!
>
> So it seems that something may be tokenizing words where casing appears
> in the middle of them!
>
> How can I get this to stop?

remove WordDelimiterFilter. It's funny though, since WordDelimiterFilter should not have caused this to happen (a query of getelementbyid should have matched a doc with getElementById).

-Yonik

> Thanks!
>
> Ben
>
> Here's the definition for the text field type in my schema.xml:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
Re: Problem with camelCase but not casing in general
On Jan 7, 2008 5:26 PM, Brendan Grainger <[EMAIL PROTECTED]> wrote:
> I think your problem is happening because splitOnCaseChange is 1 in
> your WordDelimiterFilterFactory:
>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>
> So "getElementById" is tokenized to:
>
> (get,0,3)
> (Element,3,10)
> (By,10,12)
> (Id,12,14)
> (getElementById,0,14,posIncr=0)
>
> However getelementbyid is tokenized to:
>
> (getelementbyid,0,14)
>
> which wouldn't be a term in the index??

It would be a term in the index, since both go through the lowercase filter. Anyway, if splits on capitalization changes are not desired, getting rid of the WordDelimiterFilter in both the index and query analyzers is the right thing to do.

-Yonik
Re: Problem with camelCase but not casing in general
On 7-Jan-08, at 2:35 PM, Yonik Seeley wrote:
> Anyway, if splits on capitalization changes are not desired, getting
> rid of the WordDelimiterFilter in both the index and query analyzers
> is the right thing to do.

Well, he might want to split on punctuation. self.object.frobulation.method() probably shouldn't be one token.

The OP's problem might have to do with index/query-time analyzer mismatch. We'd know more if he posted the schema definitions.

-Mike
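Mike's point can be illustrated with a quick sketch of punctuation-only splitting, roughly what WordDelimiterFilter still provides once splitOnCaseChange is turned off. The regex is a simplified assumption, not the filter's actual implementation:

```python
import re

def split_on_punctuation(token):
    # Break only at non-alphanumeric characters, leaving camelCase intact.
    return [part for part in re.split(r"[^A-Za-z0-9]+", token) if part]

print(split_on_punctuation("self.object.frobulation.method()"))
# ['self', 'object', 'frobulation', 'method']
print(split_on_punctuation("getElementById"))
# ['getElementById']
```

So code identifiers remain searchable as whole tokens, while dotted call chains are still broken into their useful parts.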
RE: Problem with camelCase but not casing in general
> Well, he might want to split on punctuation. I do, so I just turned off splitOnCaseChange instead of removing WordDelimiterFilterFactory completely. It's looking good now! > The OP's problem might have to do with index/query-time analyzer > mismatch. We'd know more if he posted the schema definitions. I did post a portion of my schema in my original email. I think I'm OK there, since I don't recall fiddling with it any. Thanks everyone. Ben
Re: Problem with camelCase but not casing in general
On 7-Jan-08, at 3:21 PM, Benjamin Higgins wrote:
>> Well, he might want to split on punctuation.
>
> I do, so I just turned off splitOnCaseChange instead of removing
> WordDelimiterFilterFactory completely. It's looking good now!
>
>> The OP's problem might have to do with index/query-time analyzer
>> mismatch. We'd know more if he posted the schema definitions.
>
> I did post a portion of my schema in my original email. I think I'm OK
> there, since I don't recall fiddling with it any.

Ah, I see it now. How very odd that that was a problem given that schema. You also might want to consider turning off stemming for code search.

-Mike
Re: solr with hadoop
As Mike suggested, we use Hadoop to organize our data en route to Solr. Hadoop allows us to load balance the indexing stage, and then we use the raw Lucene IndexWriter.addAllIndexes method to merge the data to be hosted on Solr instances.

Thanks,
Stu

-----Original Message-----
From: Mike Klaas <[EMAIL PROTECTED]>
Sent: Friday, January 4, 2008 3:04pm
To: solr-user@lucene.apache.org
Subject: Re: solr with hadoop

On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:
> I have a huge index base (about 110 million documents, 100 fields
> each), but the size of the index base is reasonable; it's about 70 Gb.
> All I need is to increase performance, since some queries, which match
> a big number of documents, are running slow.
> So I was thinking, are there any benefits to using Hadoop for this? And
> if so, what direction should I go? Has anybody done something to
> integrate Solr with Hadoop? Does it give any performance boost?

Hadoop might be useful for organizing your data en route to Solr, but I don't see how it could be used to boost performance over a huge Solr index. To accomplish that, you need to split it up over two machines (for which you might find hadoop useful).

-Mike
Newbie question: facets and filter query?
I have two categories, CDs and DVDs, with request handlers configured something like this:

<requestHandler name="dvd" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="qf">disc_name^2 disc_year</str>
    <str name="mm">1</str>
    <str name="facet">true</str>
    <str name="facet.field">category</str>
  </lst>
</requestHandler>

<requestHandler name="cd" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="qf">disc_name^2 disc_year disc_artist</str>
    <str name="mm">1</str>
    <str name="fq">category:cd</str>
    <str name="facet">true</str>
    <str name="facet.field">type</str>
  </lst>
</requestHandler>

The problem is that when I use the 'cd' request handler, the facet count for 'dvd' provided in the response is 0 because of the filter query used to only show the 'cd' facet results. How do I retrieve facet counts for both categories while only retrieving the results for one category?
How do i normalize diff information (different type of documents) in the index ?
E.g., if the index has field1 and field2, documents of type (A) always have information in field1 AND in field2, while documents of type (B) always have information in field1 but NEVER in field2. The problem is that the scoring formula will sum field1 and field2, hence skewing in favour of documents of type (A). If I combine the 2 fields into 1 field (in an attempt to normalize) I will obviously skew the statistics. Please advise. Thanks,
Re: How do i normalize diff information (different type of documents) in the index ?
On 7-Jan-08, at 9:02 PM, s d wrote:
> e.g. if the index is field1 and field2 and documents of type (A) always
> have information for field1 AND information for field2 while document
> of type (B) always have information for field1 but NEVER information
> for field2. The problem is that the formula will sum field1 and field2
> hence skewing in favour of documents of type (A). If i combine the 2
> fields into 1 field (in an attempt to normalize) i will obviously skew
> the statistics.

Try the dismax handler. Its main goal is to query multiple fields while only counting the score of the highest-scoring one (mostly).

-Mike
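The "(mostly)" refers to dismax's tie parameter: the maximum per-field score wins, and the other fields contribute only a tie-weighted fraction. A small sketch of that scoring rule (simplified; real Lucene scoring involves many more factors):

```python
def dismax_score(field_scores, tie=0.0):
    # DisjunctionMaxQuery-style combination: the best field dominates;
    # the rest contribute tie * score (tie=0 -> pure max, tie=1 -> pure sum).
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

print(dismax_score([2.0, 1.0], tie=0.0))  # 2.0
print(dismax_score([2.0, 1.0], tie=0.1))  # roughly 2.1
```

This is why a type (B) document missing field2 is not automatically outscored by a type (A) document: the missing field simply contributes nothing instead of dragging a sum down or inflating it.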
Re: How do i normalize diff information (different type of documents) in the index ?
Isn't there a better way to take the information into account but still normalize? Taking the score of only one of the fields doesn't sound like the best thing to do (it's basically ignoring part of the information).

On Jan 7, 2008 9:20 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
> On 7-Jan-08, at 9:02 PM, s d wrote:
> > e.g. if the index is field1 and field2 and documents of type (A)
> > always have information for field1 AND information for field2 while
> > document of type (B) always have information for field1 but NEVER
> > information for field2. The problem is that the formula will sum
> > field1 and field2 hence skewing in favour of documents of type (A).
> > If i combine the 2 fields into 1 field (in an attempt to normalize)
> > i will obviously skew the statistics.
>
> Try the dismax handler. It's main goal is to query multiple fields
> while only counting the score of the highest-scoring one (mostly).
>
> -Mike
Re: solr with hadoop
Stu,

Interesting! Can you provide more details about your setup?

By "load balance the indexing stage" you mean "distribute the indexing process", right? Do you simply take your content to be indexed, split it into N chunks where N matches the number of TaskNodes in your Hadoop cluster, and provide a map function that does the indexing? What does the reduce function do? Does it call IndexWriter.addAllIndexes, or do you do that outside Hadoop?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Stu Hood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, January 7, 2008 7:14:20 PM
Subject: Re: solr with hadoop

> As Mike suggested, we use Hadoop to organize our data en route to Solr.
> Hadoop allows us to load balance the indexing stage, and then we use
> the raw Lucene IndexWriter.addAllIndexes method to merge the data to be
> hosted on Solr instances.
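Otis's questions outline the shape of such a job: split the content into chunks, index each chunk independently in the map phase, and merge the partial indexes in the reduce phase. A toy, pure-Python simulation of that flow (no actual Hadoop or Lucene; the function names and the inverted-index-as-dict representation are assumptions for illustration):

```python
# Toy "distribute indexing with map, merge with reduce" flow. Each
# map_index call stands in for building one Lucene index on a task node;
# reduce_merge stands in for IndexWriter.addIndexes over those segments.

def map_index(doc_chunk):
    index = {}
    for doc_id, text in doc_chunk:
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def reduce_merge(partial_indexes):
    merged = {}
    for part in partial_indexes:
        for term, postings in part.items():
            merged.setdefault(term, set()).update(postings)
    return merged

docs = [(1, "Solr with Hadoop"), (2, "Hadoop indexing"), (3, "Solr rocks")]
chunks = [docs[0:2], docs[2:3]]             # split content into N chunks
partials = [map_index(c) for c in chunks]   # "map" phase, parallelizable
full = reduce_merge(partials)               # "reduce"/merge phase
print(sorted(full["solr"]))    # [1, 3]
print(sorted(full["hadoop"]))  # [1, 2]
```

The heavy CPU cost (analysis and inversion) lands in the parallel map phase, while the merge is mostly sequential I/O, which is why distributing the indexing stage pays off even when the final index is served from a single set of Solr instances.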