Re: ranged query on multivalued field doesnt seem to work

2009-01-30 Thread zqzuk

Hi,

I am still struggling with this... but could it be because some documents
have maximum integer values in the fields "start_year" and "end_year",
like "2.14748365E9", which Solr does not recognise as "sfloat" because of
the "E" exponent notation?

In terms of doing ranged queries on multivalued fields, I think it should
be ok, because I have another two fields that use sfloat and are
multivalued, and ranged queries on those work fine.

Any hints are appreciated! thanks!




zqzuk wrote:
> 
> Hi all,
> 
> in my schema I have two multivalued fields as
> 
> <field name="start_year" type="sfloat" indexed="true" stored="true" multiValued="true"/>
> <field name="end_year" type="sfloat" indexed="true" stored="true" multiValued="true"/>
> 
> and I issued a query as: start_year:[400 TO *]. The results seem to be
> incorrect because I got some records with start year = -3000... and also
> start year = -2147483647 (Integer.MIN_VALUE). Also when I combine start_year
> with end_year, it produces wrong results...
> 
> what could be wrong? is it because I used the wrong field type "sfloat",
> which should be integer?
> 
> Any hints would be very much appreciated!
> 
> many thanks!
> 

-- 
View this message in context: 
http://www.nabble.com/ranged-query-on-multivalued-field-doesnt-seem-to-work-tp21731778p21743688.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: WebLogic 10 Compatibility Issue - StackOverflowError

2009-01-30 Thread Ilan Rabinovitch

I created a wiki page shortly after posting to the list:

http://wiki.apache.org/solr/SolrWeblogic

From what we could tell, Solr itself was fully functional; only
the admin tools were failing.


Regards,
Ilan Rabinovitch

---
SCALE 7x: 2009 Southern California Linux Expo
Los Angeles, CA
http://www.socallinuxexpo.org


On 1/29/09 4:34 AM, Mark Miller wrote:

We should get this on the wiki.

- Mark


Ilan Rabinovitch wrote:


We were able to deploy Solr 1.3 on Weblogic 10.0 earlier today. Doing
so required two changes:

1) Creating a weblogic.xml file in solr.war's WEB-INF directory. The
weblogic.xml file is required to disable Solr's filter on FORWARD.

The contents of weblogic.xml should be:


<?xml version="1.0" encoding="UTF-8"?>
<weblogic-web-app xmlns="http://www.bea.com/ns/weblogic/90"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.bea.com/ns/weblogic/90
    http://www.bea.com/ns/weblogic/90/weblogic-web-app.xsd">
  <container-descriptor>
    <filter-dispatched-requests-enabled>false</filter-dispatched-requests-enabled>
  </container-descriptor>
</weblogic-web-app>


2) Remove the pageEncoding attribute from line 1 of solr/admin/header.jsp




On 1/17/09 2:02 PM, KSY wrote:

I hit a major roadblock while trying to get Solr 1.3 running on WebLogic
10.0.

A similar message was posted before - (
http://www.nabble.com/Solr-1.3-stack-overflow-when-accessing-solr-admin-page-td20157873.html
) - but it seems like it hasn't been resolved yet, so I'm re-posting
here.

I am sure I configured everything correctly because it's working fine on
Resin.

Has anyone successfully run Solr 1.3 on WebLogic 10.0 or higher? Thanks.


SUMMARY:

When accessing /solr/admin page, StackOverflowError occurs due to an
infinite recursion in SolrDispatchFilter


ENVIRONMENT SETTING:

Solr 1.3.0
WebLogic 10.0
JRockit JVM 1.5


ERROR MESSAGE:

SEVERE: javax.servlet.ServletException: java.lang.StackOverflowError
    at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:276)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
    at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
    at weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
    at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
    at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
    at weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
    at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
    at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
    at weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
    at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
    ...














Re: How to handle database replication delay when using DataImportHandler?

2009-01-30 Thread Shalin Shekhar Mangar
On Fri, Jan 30, 2009 at 12:27 AM, Gregg Donovan  wrote:

> Noble,
>
> Thanks for the suggestion. The unfortunate thing is that we really don't
> know ahead of time what sort of replication delay we're going to encounter
> -- it could be one millisecond or it could be one hour. So, we end up
> needing to do something like:
>
> For delta-import run N:
> 1. query DB slave for "seconds_behind_master", use this to calculate
> Date(N).
> 2. query DB slave for records updated since Date(N - 1)
>
> I see there are plugin points for EventListener classes (onImportStart,
> onImportEnd). Would those be the right spot to calculate these dates so
> that
> I could expose them to my custom function at query time?
>

Unfortunately, the Context object (which carries the context information and
way to pass messages to other components) is not exposed to Evaluator. We
should expose this information to be consistent with other DIH components.

I've opened an issue to track this at
https://issues.apache.org/jira/browse/SOLR-996
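As an aside, a minimal Java sketch of the lag arithmetic Gregg describes,
assuming a MySQL slave (the class and method names here are made up for
illustration):

  import java.sql.Connection;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import java.util.Date;

  public class ReplicationAwareClock {
      /** Latest timestamp that is safe to treat as fully replicated. */
      public static Date safeHighWaterMark(Connection slave) throws Exception {
          Statement st = slave.createStatement();
          ResultSet rs = st.executeQuery("SHOW SLAVE STATUS");
          long lagSeconds = 0;
          if (rs.next()) {
              lagSeconds = rs.getLong("Seconds_Behind_Master"); // 0 if NULL
          }
          rs.close();
          st.close();
          return new Date(System.currentTimeMillis() - lagSeconds * 1000L);
      }
  }

Run N would use this value as Date(N) and import everything updated since
Date(N - 1).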

-- 
Regards,
Shalin Shekhar Mangar.


Re: got background_merge_hit_exception during optimization

2009-01-30 Thread Yonik Seeley
What system and JVM was this using?
Also, could you get the stack trace directly from the Solr logs and post it?

-Yonik

On Thu, Jan 29, 2009 at 4:06 PM, Qingdi  wrote:
>
> We got the following background_merge_hit_exception during optimization:
> exception:
> background merge hit exception: _4zsg:C136887658 _50nf:C995992 _51i9:C995977
> _52d5:C995968 _537y:C995999 _54xm:C1892345 _54xl:C99593 into _54xn [optimize]
> java.io.IOException: background merge hit exception
>   at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
>   at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280)
>   at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
>   at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:77)
>   at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:104)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:113)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>   at org.mortbay.jetty.Server.handle(Server.java:285)
>   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>   ...
>
> Does anyone know what could be the cause of the exception? what should we do
> to prevent this type of exception?
>
> Some posts in the Lucene forum say the exception is usually related with
> disk space issue. But there should be enough disk space in our system. Our
> index size was about 56G. And before optimization, the disk had about 360G
> free space.
>
> After the above background_merge_hit_exception raised, solr kept generating
> new segment files, which ate up all the CPU time and the disk space, so we
> had to kill the solr server.
>
> Thanks for your help.
>
> Qingdi
>
>
> --
> View this message in context: 
> http://www.nabble.com/got-background_merge_hit_exception-during-optimization-tp21735847p21735847.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


MultiValue DynamicFields?

2009-01-30 Thread Bruno Aranda
Hi, is it possible to create a dynamic field that is multivalued?

Cheers,

Bruno


Re: MultiValue DynamicFields?

2009-01-30 Thread Alexander Ramos Jardim
Yes. It's totally acceptable.
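For illustration, a declaration along these lines in schema.xml (the *_txt
pattern and its attributes are just an example, not from this thread):

  <dynamicField name="*_txt" type="text" indexed="true" stored="true"
                multiValued="true"/>

Any incoming field whose name matches the pattern can then hold multiple
values per document.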

2009/1/30 Bruno Aranda 

> Hi, is it possible to create a dynamic field that is multivalued?
>
> Cheers,
>
> Bruno
>



-- 
Alexander Ramos Jardim


Re: Optimizing & Improving results based on user feedback

2009-01-30 Thread Ryan McKinley
It may not be as fine-grained as you want, but also check the  
QueryElevationComponent.  This takes a preconfigured list of what the  
top results should be for a given query and makes those documents the
top results.


Presumably, you could use click logs to determine what the top result  
should be.



On Jan 29, 2009, at 7:45 PM, Walter Underwood wrote:


"A Decision Theoretic Framework for Ranking using Implicit Feedback"
uses clicks, but the best part of that paper is all the side comments
about difficulties in evaluation. For example, if someone clicks on
three results, is that three times as good or two failures and a
success? We have to know the information need to decide. That paper
is in the LR4IR 2008 proceedings.

Both Radlinski and Joachims seem to be focusing on click data.

I'm thinking of something much simpler, like taking the first
N hits and reordering those before returning. Brute force, but
would get most of the benefit. Usually, you only have reliable
click data for a small number of documents on each query, so
it is a waste of time to rerank the whole list. Besides, if you
need to move something up 100 places on the list, you should
probably be tuning your regular scoring rather than patching
it with click data.

wunder

On 1/29/09 3:43 PM, "Matthew Runo"  wrote:


Agreed, it seems that a lot of the algorithms in these papers would
almost be a whole new RequestHandler ala Dismax. Luckily a lot of them
seem to be built on Lucene (at least the ones that I looked at that
had code samples).

Which papers did you see that actually talked about using clicks? I
don't see those, beyond "Addressing Malicious Noise in Clickthrough
Data" by Filip Radlinski and also his "Query Chains: Learning to Rank
from Implicit Feedback" - but neither is really on topic.

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Jan 29, 2009, at 11:36 AM, Walter Underwood wrote:


Thanks, I didn't know there was so much research in this area.
Most of the papers at those workshops are about tuning the
entire ranking algorithm with machine learning techniques.

I am interested in adding one more feature, click data, to an
existing ranking algorithm. In my case, I have enough data to
use query-specific boosts instead of global document boosts.
We get about 2M search clicks per day from logged in users
(little or no click spam).

I'm checking out some papers from Thorsten Joachims and from
Microsoft Research that are specifically about clickthrough
feedback.

wunder

On 1/27/09 11:15 PM, "Neal Richter"  wrote:

OK I've implemented this before, written academic papers and  
patents

related to this task.

Here are some hints:
 - you're on the right track with the editorial boosting elevators
 - http://wiki.apache.org/solr/UserTagDesign
 - be darn careful about assuming that one click is enough evidence
   to boost a long 'distance'
 - first page effects in search will skew the learning badly if you
   don't compensate. 95% of users never go past the first page of
   results, 1% go past the second page. So perfectly good results
   on the second page get permanently locked out
 - consider forgetting what you learn under some condition

In fact this whole area is called 'learning to rank' and is a hot
research topic in IR.
http://web.mit.edu/shivani/www/Ranking-NIPS-05/
http://research.microsoft.com/en-us/um/people/lr4ir-2007/
https://research.microsoft.com/en-us/um/people/lr4ir-2008/

- Neal Richter


On Tue, Jan 27, 2009 at 2:06 PM, Matthew Runo 
wrote:

Hello folks!

We've been thinking about ways to improve organic search results
for a while
(really, who hasn't?) and I'd like to get some ideas on ways to
implement a
feedback system that uses user behavior as input. Basically, it'd
work on
the premise that what the user actually clicked on is probably a
really good
match for their search, and should be boosted up in the results
for that
search.

For example, if I search for "rain boots", and really love the
10th result
down (and show it by clicking on it), then we'd like to capture
this and use
the data to boost up that result //for that search//. We've
thought about
using index time boosts for the documents, but that'd boost it
regardless of
the search terms, which isn't what we want. We've thought about
using the
Elevator handler, but we don't really want to force a product to
the top -
we'd prefer it slowly rises over time as more and more people
click it from
the same search terms. Another way might be to stuff the keyword
into the
document, the more times it's in the document the higher it'd
score - but
there's gotta be a better way than that.

Obviously this can't be done 100% in solr - but if anyone had some
clever
ideas about how this might be possible it'd be interesting to hear
them.

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833












Re: query with stemming, prefix and fuzzy?

2009-01-30 Thread Gert Brinkmann

Thanks, Mark, for your answer,

Mark Miller wrote:
> Truncation queries and stemming are difficult partners. You likely have
> to accept compromise. You can try using multiple fields like you are,

I already have multiple fields, one per language, to be able to use
different stemmers. Wouldn't this become too much?

> you can try indexing the full term at the same position as the stemmed
> term,

what does this mean "at the same position" and how could I do this?

> or you can accept the weirdness that comes from matching on a
> stemmed form (potentially very confusing for a user).

Currently I am thinking about dropping the stemming and only using
prefix-search. But as highlighting does not work with a prefix "house*"
this is a problem for me. The hint to use "house?*" instead does not
work here.

> In any case though, a queryparser that supports fuzzyquery should not be
> analyzing it. What parser are you using? If it is analyzing the fuzzy
> syntax, it doesn't likely support it.

I am using the following definitions (testing it with and without stemming):
> <fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory"
>             ignoreCase="true"
>             words="stopwords_de_de.txt"
>             enablePositionIncrements="true"
>             />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
>             splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="German"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de_de.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de_de.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
>             splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="German"/>
>   </analyzer>
> </fieldType>

and, well, the parser? Where is the parser specified? Do you mean the
request handler "qt" (that will be "standard", as I do not set it yet)?


> The prefix length determines how many terms are enumerated - with the

Can the prefix length be set in Solr? I could not find such an option.

> The latest trunk build on Lucene will let us switch fuzzy query to use a
> constant score mode - this will eliminate the booleanquery and should
> perform much better on a large index. Solr already uses a constant score
> mode for Prefix and Wildcard queries.

much better performance is always good. When will this feature be
available in Solr?

> How big is your index? If it's not that big, it may be odd that you're
> seeing things that slow (number of unique terms in the index will play a
> large role).

Well, the index currently contains about 5000 documents. These are
HTML pages, some of them concatenated with PDF/DOC downloads (linked
from the HTML page) converted to text. The index data is about
11MB (optimized). So I think this is just a small index.

Greetings,
Gert


Re: Optimizing & Improving results based on user feedback

2009-01-30 Thread Sean Timm

Matthew Runo wrote:
Which papers did you see that actually talked about using clicks? I 
don't see those, beyond "Addressing Malicious Noise in Clickthrough 
Data" by Filip Radlinski and also his "Query Chains: Learning to Rank 
from Implicit Feedback" - but neither is really on topic.

Here are three that I've found useful:

P. Young, C. Clarke, et al. Improving Retrieval Accuracy by Weighting
Document Types with Clickthrough Data. SIGIR 2007.

E. Agichtein, E. Brill, and S. Dumais. Improving Web Search Ranking by
Incorporating User Behavior Information. SIGIR 2006.


T. Joachims, L. Granka, and B. Pan. Accurately Interpreting Clickthrough 
Data as Implicit Feedback. SIGIR 2005.


-Sean


Re: Optimizing & Improving results based on user feedback

2009-01-30 Thread Matthew Runo
I've thought about patching the QueryElevationComponent to apply
boosts rather than a specific sort. Then the file might look like:

<elevate>
  <query text="...">
    <doc id="..." boost="..."/>
  </query>
</elevate>
And I could write a script that looks at click data once a day to fill  
out this file.

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Jan 30, 2009, at 6:37 AM, Ryan McKinley wrote:

It may not be as fine-grained as you want, but also check the  
QueryElevationComponent.  This takes a preconfigured list of what  
the top results should be for a given query and makes those
documents the top results.


Presumably, you could use click logs to determine what the top  
result should be.



On Jan 29, 2009, at 7:45 PM, Walter Underwood wrote:


"A Decision Theoretic Framework for Ranking using Implicit Feedback"
uses clicks, but the best part of that paper is all the side comments
about difficulties in evaluation. For example, if someone clicks on
three results, is that three times as good or two failures and a
success? We have to know the information need to decide. That paper
is in the LR4IR 2008 proceedings.

Both Radlinski and Joachims seem to be focusing on click data.

I'm thinking of something much simpler, like taking the first
N hits and reordering those before returning. Brute force, but
would get most of the benefit. Usually, you only have reliable
click data for a small number of documents on each query, so
it is a waste of time to rerank the whole list. Besides, if you
need to move something up 100 places on the list, you should
probably be tuning your regular scoring rather than patching
it with click data.

wunder

On 1/29/09 3:43 PM, "Matthew Runo"  wrote:


Agreed, it seems that a lot of the algorithms in these papers would
almost be a whole new RequestHandler ala Dismax. Luckily a lot of them
seem to be built on Lucene (at least the ones that I looked at that
had code samples).

Which papers did you see that actually talked about using clicks? I
don't see those, beyond "Addressing Malicious Noise in Clickthrough
Data" by Filip Radlinski and also his "Query Chains: Learning to  
Rank

from Implicit Feedback" - but neither is really on topic.

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Jan 29, 2009, at 11:36 AM, Walter Underwood wrote:


Thanks, I didn't know there was so much research in this area.
Most of the papers at those workshops are about tuning the
entire ranking algorithm with machine learning techniques.

I am interested in adding one more feature, click data, to an
existing ranking algorithm. In my case, I have enough data to
use query-specific boosts instead of global document boosts.
We get about 2M search clicks per day from logged in users
(little or no click spam).

I'm checking out some papers from Thorsten Joachims and from
Microsoft Research that are specifically about clickthrough
feedback.

wunder

On 1/27/09 11:15 PM, "Neal Richter"  wrote:

OK I've implemented this before, written academic papers and patents
related to this task.

Here are some hints:
- you're on the right track with the editorial boosting elevators
- http://wiki.apache.org/solr/UserTagDesign
- be darn careful about assuming that one click is enough evidence
  to boost a long 'distance'
- first page effects in search will skew the learning badly if you
  don't compensate. 95% of users never go past the first page of
  results, 1% go past the second page. So perfectly good results
  on the second page get permanently locked out
- consider forgetting what you learn under some condition

In fact this whole area is called 'learning to rank' and is a hot
research topic in IR.
http://web.mit.edu/shivani/www/Ranking-NIPS-05/
http://research.microsoft.com/en-us/um/people/lr4ir-2007/
https://research.microsoft.com/en-us/um/people/lr4ir-2008/

- Neal Richter


On Tue, Jan 27, 2009 at 2:06 PM, Matthew Runo 
wrote:

Hello folks!

We've been thinking about ways to improve organic search results
for a while
(really, who hasn't?) and I'd like to get some ideas on ways to
implement a
feedback system that uses user behavior as input. Basically, it'd
work on
the premise that what the user actually clicked on is probably a
really good
match for their search, and should be boosted up in the results
for that
search.

For example, if I search for "rain boots", and really love the
10th result
down (and show it by clicking on it), then we'd like to capture
this and use
the data to boost up that result //for that search//. We've
thought about
using index time boosts for the documents, but that'd boost it
regardless of
the search terms, which isn't what we want. We've thought about
using the
Elevator handler, but we don't really want to force a product to
the top -
we'd prefer it slowly rises over time as more and more people
click it from
the same search terms. Another way might be to stuff the keyword
into the
document, the more times it's in the document the higher it'd
score - but there's gotta be a better way than that.

Re: Rsyncd start and stop for multiple instances

2009-01-30 Thread sunnyfr

Hi,

How can I hack the existing script to support multiple rsync modules?

Here is my rsyncd.conf file:

uid = root
gid = root
use chroot = no
list = no
pid file = /data/solr/book/logs/rsyncd.pid
log file = /data/solr/book/logs/rsyncd.log
[solr]
path = /data/solr/book/data
comment = Solr

How do I do this for /data/solr/user?
thanks a lot
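For illustration only (the module names here are made up), one rsyncd.conf
can export one module per index:

  uid = root
  gid = root
  use chroot = no
  list = no
  pid file = /data/solr/rsyncd.pid
  log file = /data/solr/rsyncd.log
  [solr-book]
  path = /data/solr/book/data
  comment = Solr book index
  [solr-user]
  path = /data/solr/user/data
  comment = Solr user index

Note that, as Bill says below, the stock scripts rely on a single module
named "solr", so snappuller and friends would need the module name made
configurable for this to work.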









Bill Au wrote:
> 
> You can either use a dedicated rsync port for each instance or hack the
> existing scripts to support multiple rsync modules.  Both ways should
> work.
> 
> Bill
> 
> On Tue, Jul 1, 2008 at 3:49 AM, Jacob Singh  wrote:
> 
>> Hi Bill and Others:
>>
>>
>> Bill Au wrote:
>> > The rsyncd-start scripts gets the data_dir path from the command line
>> and
>> > create a rsyncd.conf on the fly exporting the path as the rsync module
>> named
>> > "solr".  The salves need the data_dir path on the master to look for
>> the
>> > latest snapshot.  But the rsync command used by the slaves relies on
>> the
>> > rsync module name "solr" to do the file transfer using rsyncd.
>>
>> So is the answer that replication simply won't work for multiple
>> instances unless I have a dedicated port for each one?
>>
>> Or is the answer that I have to hack the existing scripts?
>>
>> I'm a little confused when you say that slave needs to know the master's
>> data dir, but, no matter what it sends, it needs to match the one known
>> by the master when it starts rsyncd...
>>
>> Sorry if my questions are newbie, I've not actually used rsyncd, but
>> I've read up quite a bit now.
>>
>> Thanks,
>> Jacob
>>
>> >
>> > Bill
>> >
>> > On Tue, Jun 10, 2008 at 4:24 AM, Jacob Singh 
>> wrote:
>> >
>> >> Hey folks,
>> >>
>> >> I'm messing around with running multiple indexes on the same server
>> >> using Jetty contexts.  I've got the running groovy thanks to the
>> >> tutorial on the wiki, however I'm a little confused how the collection
>> >> distribution stuff will work for replication.
>> >>
>> >> The rsyncd-enable command is simple enough, but the rsyncd-start
>> command
>> >> takes a -d (data dir) as an argument... Since I'm hosting 4 different
>> >> instances, all with their own data dirs, how do I do this?
>> >>
>> >> Also, you have to specify the master data dir when you are connecting
>> >> from the slave anyway, so why does it need to be specified when I
>> start
>> >> the daemon?  If I just start it with any old data dir will it work for
>> >> anything the user running it has perms on?
>> >>
>> >> Thanks,
>> >> Jacob
>> >>
>> >
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Rsyncd-start-and-stop-for-multiple-instances-tp17750242p21750131.html
Sent from the Solr - User mailing list archive at Nabble.com.



User tag design for read-only index

2009-01-30 Thread Ryan McKinley
I am building a system that indexes a bunch of data and then will let
users manually put the data in lists.  I have seen http://wiki.apache.org/solr/UserTagDesign


The behavior I would like is identical to 'tagging' each document with  
the list-id/user/order and then using standard faceting to show what  
lists documents are in and what users have put the docs into a list.


But - I would like the main index to be read only.  The index needs to  
be shared across many installations that should not have access to  
other users data.


Any thoughts on how this might be possible?  Off hand, it seems like  
manually filling an un-inverted field cache might be a place to start  
looking.  Perhaps using a multi-searcher and keeping two indexes --  
that seems like a lot of work.


thanks
ryan



RE: Re: WebLogic 10 Compatibility Issue - StackOverflowError

2009-01-30 Thread Feak, Todd
Are the issues you ran into due to non-standard code in Solr, or is there
some WebLogic inconsistency?

-Todd Feak

-Original Message-
From: news [mailto:n...@ger.gmane.org] On Behalf Of Ilan Rabinovitch
Sent: Friday, January 30, 2009 1:11 AM
To: solr-user@lucene.apache.org
Subject: Re: WebLogic 10 Compatibility Issue - StackOverflowError

I created a wiki page shortly after posting to the list:

http://wiki.apache.org/solr/SolrWeblogic

From what we could tell, Solr itself was fully functional; only
the admin tools were failing.

Regards,
Ilan Rabinovitch

---
SCALE 7x: 2009 Southern California Linux Expo
Los Angeles, CA
http://www.socallinuxexpo.org


On 1/29/09 4:34 AM, Mark Miller wrote:
> We should get this on the wiki.
>
> - Mark
>
>
> Ilan Rabinovitch wrote:
>>
>> We were able to deploy Solr 1.3 on Weblogic 10.0 earlier today. Doing
>> so required two changes:
>>
>> 1) Creating a weblogic.xml file in solr.war's WEB-INF directory. The
>> weblogic.xml file is required to disable Solr's filter on FORWARD.
>>
>> The contents of weblogic.xml should be:
>>
>> 
>> <?xml version="1.0" encoding="UTF-8"?>
>> <weblogic-web-app xmlns="http://www.bea.com/ns/weblogic/90"
>>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>     xsi:schemaLocation="http://www.bea.com/ns/weblogic/90
>>     http://www.bea.com/ns/weblogic/90/weblogic-web-app.xsd">
>>   <container-descriptor>
>>     <filter-dispatched-requests-enabled>false</filter-dispatched-requests-enabled>
>>   </container-descriptor>
>> </weblogic-web-app>
>>
>>
>> 2) Remove the pageEncoding attribute from line 1 of
solr/admin/header.jsp
>>
>>
>>
>>
>> On 1/17/09 2:02 PM, KSY wrote:
>>> I hit a major roadblock while trying to get Solr 1.3 running on
WebLogic
>>> 10.0.
>>>
>>> A similar message was posted before - (
>>>
http://www.nabble.com/Solr-1.3-stack-overflow-when-accessing-solr-admin-page-td20157873.html
>>>
>>> ) - but it seems like it hasn't been resolved yet, so I'm re-posting
>>> here.
>>>
>>> I am sure I configured everything correctly because it's working
fine on
>>> Resin.
>>>
>>> Has anyone successfully run Solr 1.3 on WebLogic 10.0 or higher?
Thanks.
>>>
>>>
>>> SUMMARY:
>>>
>>> When accessing /solr/admin page, StackOverflowError occurs due to an
>>> infinite recursion in SolrDispatchFilter
>>>
>>>
>>> ENVIRONMENT SETTING:
>>>
>>> Solr 1.3.0
>>> WebLogic 10.0
>>> JRockit JVM 1.5
>>>
>>>
>>> ERROR MESSAGE:
>>>
>>> SEVERE: javax.servlet.ServletException: java.lang.StackOverflowError
>>>     at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:276)
>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
>>>     at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
>>>     at weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
>>>     at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
>>>     at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
>>>     at weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
>>>     at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
>>>     at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
>>>     at weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
>>>     at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
>>>     ...
>>
>>
>
>






Re: Optimizing & Improving results based on user feedback

2009-01-30 Thread Ryan McKinley

yes, applying a boost would be a good addition.

patches are always welcome ;)


On Jan 30, 2009, at 10:56 AM, Matthew Runo wrote:

I've thought about patching the QueryElevationComponent to apply  
boosts rather than a specific sort. Then the file might look like:

<elevate>
  <query text="...">
    <doc id="..." boost="..."/>
  </query>
</elevate>

And I could write a script that looks at click data once a day to  
fill out this file.

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Jan 30, 2009, at 6:37 AM, Ryan McKinley wrote:

It may not be as fine-grained as you want, but also check the  
QueryElevationComponent.  This takes a preconfigured list of what  
the top results should be for a given query and makes those
documents the top results.


Presumably, you could use click logs to determine what the top  
result should be.



On Jan 29, 2009, at 7:45 PM, Walter Underwood wrote:


"A Decision Theoretic Framework for Ranking using Implicit Feedback"
uses clicks, but the best part of that paper is all the side comments
about difficulties in evaluation. For example, if someone clicks on
three results, is that three times as good or two failures and a
success? We have to know the information need to decide. That paper
is in the LR4IR 2008 proceedings.

Both Radlinski and Joachims seem to be focusing on click data.

I'm thinking of something much simpler, like taking the first
N hits and reordering those before returning. Brute force, but
would get most of the benefit. Usually, you only have reliable
click data for a small number of documents on each query, so
it is a waste of time to rerank the whole list. Besides, if you
need to move something up 100 places on the list, you should
probably be tuning your regular scoring rather than patching
it with click data.

wunder

On 1/29/09 3:43 PM, "Matthew Runo"  wrote:


Agreed, it seems that a lot of the algorithms in these papers would
almost be a whole new RequestHandler ala Dismax. Luckily a lot of them
seem to be built on Lucene (at least the ones that I looked at that
had code samples).

Which papers did you see that actually talked about using clicks? I
don't see those, beyond "Addressing Malicious Noise in Clickthrough
Data" by Filip Radlinski and also his "Query Chains: Learning to  
Rank

from Implicit Feedback" - but neither is really on topic.

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Jan 29, 2009, at 11:36 AM, Walter Underwood wrote:


Thanks, I didn't know there was so much research in this area.
Most of the papers at those workshops are about tuning the
entire ranking algorithm with machine learning techniques.

I am interested in adding one more feature, click data, to an
existing ranking algorithm. In my case, I have enough data to
use query-specific boosts instead of global document boosts.
We get about 2M search clicks per day from logged in users
(little or no click spam).

I'm checking out some papers from Thorsten Joachims and from
Microsoft Research that are specifically about clickthrough
feedback.

wunder

On 1/27/09 11:15 PM, "Neal Richter"  wrote:

OK I've implemented this before, written academic papers and patents
related to this task.

Here are some hints:
- you're on the right track with the editorial boosting elevators
- http://wiki.apache.org/solr/UserTagDesign
- be darn careful about assuming that one click is enough evidence
  to boost a long 'distance'
- first page effects in search will skew the learning badly if you
  don't compensate. 95% of users never go past the first page of
  results, 1% go past the second page. So perfectly good results
  on the second page get permanently locked out
- consider forgetting what you learn under some condition

In fact this whole area is called 'learning to rank' and is a hot
research topic in IR.
http://web.mit.edu/shivani/www/Ranking-NIPS-05/
http://research.microsoft.com/en-us/um/people/lr4ir-2007/
https://research.microsoft.com/en-us/um/people/lr4ir-2008/

- Neal Richter


On Tue, Jan 27, 2009 at 2:06 PM, Matthew Runo 
wrote:

Hello folks!

We've been thinking about ways to improve organic search results
for a while
(really, who hasn't?) and I'd like to get some ideas on ways to
implement a
feedback system that uses user behavior as input. Basically, it'd
work on
the premise that what the user actually clicked on is probably a
really good
match for their search, and should be boosted up in the results
for that
search.

For example, if I search for "rain boots", and really love the
10th result
down (and show it by clicking on it), then we'd like to capture
this and use
the data to boost up that result //for that search//. We've
thought about
using index time boosts for the documents, but that'd boost it
regardless of
the search terms, which isn't what we want. We've thought about
using the
Elevator handler, but we don't really want to force a product to
the top -
we'd prefer it slowly rises over time as more and more people
click it from the same search terms.

1.3 <-> 1.4 patch for onError handling

2009-01-30 Thread Jon Baer
Hi,

I've just had a bump in the night where some feeds have disappeared. Since
I'm running the base 1.3 copy, I'm wondering: would patching it with

https://issues.apache.org/jira/browse/SOLR-842

break anything?  Has anyone done this yet?

Thanks.

- Jon


Re: got background_merge_hit_exception during optimization

2009-01-30 Thread Qingdi


We are on solr 1.3, and we use the default jetty server, which is included
in the solr 1.3 download package.

The java version is:
java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_12-b04, mixed mode)

I checked the log files under logs and solr/logs, but don't see any error. 
Would you please let me know how to get the stack trace from the solr logs?

Appreciate your help.

Qingdi


Yonik Seeley-2 wrote:
> 
> What system and JVM was this using?
> Also, could you get the stack trace directly from the Solr logs and post
> it?
> 
> -Yonik
> 
> On Thu, Jan 29, 2009 at 4:06 PM, Qingdi  wrote:
>>
>> We got the following background_merge_hit_exception during optimization:
>> exception:
>> background merge hit exception: _4zsg:C136887658 _50nf:C995992 _51i9:C995977
>> _52d5:C995968 _537y:C995999 _54xm:C1892345 _54xl:C99593 into _54xn [optimize]
>> java.io.IOException: background merge hit exception
>>   at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
>>   at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280)
>>   at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
>>   at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:77)
>>   at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:104)
>>   at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:113)
>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>>   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>>   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>>   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>>   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>>   at org.mortbay.jetty.Server.handle(Server.java:285)
>>   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>>   ...
>>
>> Does anyone know what could be the cause of the exception? what should we
>> do
>> to prevent this type of exception?
>>
>> Some posts in the Lucene forum say the exception is usually related with
>> disk space issue. But there should be enough disk space in our system.
>> Our
>> index size was about 56G. And before optimization, the disk had about
>> 360G
>> free space.
>>
>> After the above background_merge_hit_exception raised, solr kept
>> generating
>> new segment files, which ate up all the CPU time and the disk space, so
>> we
>> had to kill the solr server.
>>
>> Thanks for your help.
>>
>> Qingdi
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/got-background_merge_hit_exception-during-optimization-tp21735847p21735847.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/got-background_merge_hit_exception-during-optimization-tp21735847p21751938.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: query with stemming, prefix and fuzzy?

2009-01-30 Thread Mark Miller

Gert Brinkmann wrote:

> Thanks, Mark, for your answer,
>
> Mark Miller wrote:
>> Truncation queries and stemming are difficult partners. You likely have
>> to accept compromise. You can try using multiple fields like you are,
>
> I already have multiple fields, one per language, to be able to use
> different stemmers. Wouldn't this become too much?

Possibly. Especially if you are using norms with all of those fields.
Depends on your index though.

>> you can try indexing the full term at the same position as the stemmed
>> term,
>
> what does this mean "at the same position" and how could I do this?

Write a custom filter. Normally, for every term, its position is
incremented by 1 as the terms are broken out in tokenization. You can
change this and index terms at the same position using your own filter.
There are ramifications, because you are adding more terms to your
index, but it allows you to index multiple forms of a term at the same
position (so that phrase queries still work as expected).
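A rough sketch of such a filter (not from this thread), using the Lucene
2.x Token API that Solr 1.3 ships with; the stem() hook is a placeholder
for a real stemmer, and the factory/schema plumbing is omitted:

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  public class KeepOriginalStemFilter extends TokenFilter {
      private Token pending; // original form waiting to be emitted

      public KeepOriginalStemFilter(TokenStream input) {
          super(input);
      }

      public Token next() throws IOException {
          if (pending != null) {
              Token original = pending;
              pending = null;
              return original; // positionIncrement == 0: same position
          }
          Token t = input.next();
          if (t == null) return null;
          String text = t.termText();
          String stemmed = stem(text);
          if (!stemmed.equals(text)) {
              pending = (Token) t.clone();      // keep the unstemmed form
              pending.setPositionIncrement(0);  // stack it on the same position
              t.setTermText(stemmed);           // emit the stemmed form first
          }
          return t;
      }

      private String stem(String s) {
          return s; // placeholder: plug in a German stemmer here
      }
  }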
  

>> or you can accept the weirdness that comes from matching on a
>> stemmed form (potentially very confusing for a user).
>
> Currently I am thinking about dropping the stemming and only using
> prefix-search. But as highlighting does not work with a prefix "house*"
> this is a problem for me. The hint to use "house?*" instead does not
> work here.

That's because wildcard queries are also not highlightable now. I
actually have somewhat of a solution to this that I'll work on soon
(I've gotten the ground work for it in or ready to be in Lucene). No
guarantee on when or if it will be accepted in Solr though.

>> In any case though, a queryparser that supports fuzzyquery should not be
>> analyzing it. What parser are you using? If it is analyzing the fuzzy
>> syntax, it doesn't likely support it.
>
> I am using the following definitions (testing it with and without stemming):
> [the text_de_de fieldType definition from the earlier message, quoted in full]
>
> and, well, the parser? Where is the parser specified? Do you mean the
> request handler "qt" (that will be "standard", as I do not set it yet)?

That's odd. I'll have to look at this closer to be of help.


  

>> The prefix length determines how many terms are enumerated - with the
>
> Can the prefix length be set in Solr? I could not find such an option.

I don't think there is an option in Solr. Patches welcome of course. It
would be a nice one - using the default of 0 is *very* not scalable.
  

>> The latest trunk build on Lucene will let us switch fuzzy query to use a
>> constant score mode - this will eliminate the booleanquery and should
>> perform much better on a large index. Solr already uses a constant score
>> mode for Prefix and Wildcard queries.
>
> much better performance is always good. When will this feature be
> available in Solr?

Soon I hope. Since wildcard and prefix are already constant score, it
only makes sense to make fuzzy query that way as well.
  

>> How big is your index? If it's not that big, it may be odd that you're
>> seeing things that slow (number of unique terms in the index will play a
>> large role).
>
> Well, the index currently contains about 5000 documents. These are
> HTML pages, some of them concatenated with PDF/DOC downloads (linked
> from the HTML page) converted to text. The index data is about
> 11MB (optimized). So I think this is just a small index.

Yeah, sounds small. It's odd you would see such slow performance. It
depends though. You may still have a *lot* of unique terms in there.





Re: query with stemming, prefix and fuzzy?

2009-01-30 Thread Gert Brinkmann
Mark Miller wrote:
> Yeah, sounds small. Its odd you would see such slow performance. It
> depends though. You may still have a *lot* of unique terms in there.

Is there a way to retrieve the list of terms in the index?

Gert


Re: query with stemming, prefix and fuzzy?

2009-01-30 Thread Mark Miller

Gert Brinkmann wrote:

> Mark Miller wrote:
>> Yeah, sounds small. It's odd you would see such slow performance. It
>> depends though. You may still have a *lot* of unique terms in there.
>
> Is there a way to retrieve the list of terms in the index?
>
> Gert

Try hitting /solr/admin/luke and see what it says.
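For example (assuming the default example port), a request like

  http://localhost:8983/solr/admin/luke?fl=text_de_de&numTerms=20

returns per-field statistics, including the top terms for that field.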

- Mark


solr booosting

2009-01-30 Thread Marc Sturlese

Hey there,
I am trying to tune the boost of the results obtained using
DisMaxQueryParser.
As I understand Lucene's boost, if you search for "John Le Carre" it will
give a better score to results that contain just the searched string than
to results that have, for example, 50 words with the search terms
contained among them.

In Solr, my goal is to give more score to docs that contain both words
but that have more words in the field.

I have tried 2 options:
1.- At index time, I check the length of the fields and if they are bigger
than 'x' chars I give more boost to that doc (I am adding 3.0 extra boost
using addBoost).

2.- On the other hand, I have been playing with tie and pf but I think they
are not helping with my issue.
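As a sketch of option 1 using the SolrJ client (the field names and the
100-char threshold are made up; note that index-time boosts only take
effect on fields indexed with norms):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BoostOnLength {
      public static void main(String[] args) throws Exception {
          SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
          SolrInputDocument doc = new SolrInputDocument();
          String body = "...the field text being indexed...";
          doc.addField("id", "1");
          doc.addField("body", body);
          if (body.length() > 100) {    // longer fields get extra boost
              doc.setDocumentBoost(3.0f);
          }
          solr.add(doc);
          solr.commit();
      }
  }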

Before using Solr (with my own Lucene searcher and indexer) the first option
used to work quite well; in Solr my extra boost seems to affect much less. Is
this normal because I am using DismaxQueryParser, or should it be the same?

Any advice is more than welcome!

Thanks in advance
 
-- 
View this message in context: 
http://www.nabble.com/solr-booosting-tp21753617p21753617.html
Sent from the Solr - User mailing list archive at Nabble.com.



exceeded limit of maxWarmingSearchers

2009-01-30 Thread Jon Drukman

I am getting hit by a storm of these once a day or so:

SEVERE: org.apache.solr.common.SolrException: Error opening new 
searcher. exceeded limit of maxWarmingSearchers=16, try again later.


I keep bumping up maxWarmingSearchers.  It's at 32 now.  Is there any 
way to figure out what the "right" value is besides trial and error? 
Our site gets extremely minimal traffic so I'm really puzzled why the 
out-of-the-box settings are insufficient.


The index has about 61000 documents, very small, and we do less than one 
query per second.


-jsd-



Re: exceeded limit of maxWarmingSearchers

2009-01-30 Thread Yonik Seeley
I'd advise setting it to a very low limit (like 2) and committing less
often.  Once you get too many overlapping searchers, things will slow
to a crawl and that will just cause more to pile up.

The root cause is simply too many commits in conjunction with warming
too long.  If you are using a dev version of Solr 1.4, you might try
commitWithin instead of explicit commits. (see SOLR-793)  Depending
how long warming takes, you may want to lower autowarm counts.
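For reference, the knobs involved live in solrconfig.xml; a sketch with
illustrative values only:

  <maxWarmingSearchers>2</maxWarmingSearchers>
  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>

Lower autowarmCount values shorten warming, and a low maxWarmingSearchers
surfaces an excessive commit rate instead of letting searchers pile up.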

-Yonik


On Fri, Jan 30, 2009 at 2:14 PM, Jon Drukman  wrote:
> I am getting hit by a storm of these once a day or so:
>
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=16, try again later.
>
> I keep bumping up maxWarmingSearchers.  It's at 32 now.  Is there any way to
> figure out what the "right" value is besides trial and error? Our site gets
> extremely minimal traffic so I'm really puzzled why the out-of-the-box
> settings are insufficient.
>
> The index has about 61000 documents, very small, and we do less than one
> query per second.
>
> -jsd-
>
>


Re: query with stemming, prefix and fuzzy?

2009-01-30 Thread Gert Brinkmann
Mark Miller wrote:

> Try hitting /solr/admin/luke and see what it says.

Oh, interesting. I think I have to check the stopword list. Is there a
way to filter out single-character terms like the "h"?


Field text_de_de: type text_de_de, schema ITS--, index ITS--,
docs 2340, distinct terms 57971.
Top term frequencies: 1454 (the single character "h"), 1016, 1008, 980,
927, 924, 895, 843, 730, 730.

Thank you for the information.
Gert


Re: solr as the data store

2009-01-30 Thread Ian Connor
The other option was actually couchdb. It was very nice but the benefits
were not compelling compared to the pure simplicity of just having solr.

With replication now so simple to set up, it really does seem to
solve all the problems we are looking for in a redundant distributed storage
solution.

On Thu, Jan 29, 2009 at 12:50 AM, Neal Richter  wrote:

> You might examine what the Apache CouchDB people have done.
>
> It's a document oriented DB that is able to use JSON structured
> documents combined with Lucene indexing of the documents with a
> RESTful HTTP interface.
>
> It's a stretch, and written in Erlang.. but perhaps there is some
> inspiration to be had for 'solr as the data store'.
>
> - Neal Richter
>



-- 
Regards,

Ian Connor


Re: query with stemming, prefix and fuzzy?

2009-01-30 Thread Mark Miller

Gert Brinkmann wrote:

> distinct terms: 57971

It's a lot for a small index. The fuzzy query will enumerate all of those
terms and calculate an edit distance. It's not an insane amount of work,
but it jibes with the slowness you see. Doing that 60,000 times for a
query is not that fast.


Unfortunately, without the prefix setting, FuzzyQueries are slow, slow
with that many uniques. Solr should def allow the prefix to be set.
There was talk a couple years back about changing the default prefix
value in Lucene because it's so slow - but it didn't happen. The
developers decided that you could tweak it yourself if you needed to be
able to scale (if you add a prefix length, up to that length won't be
fuzzy). Unfortunately, Solr hasn't yet given this option to my knowledge.
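For reference, this is what the prefix looks like in raw Lucene (a sketch,
not a Solr option; here only terms sharing the first two characters are
edit-distance-checked):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.FuzzyQuery;

  public class FuzzyPrefixExample {
      public static void main(String[] args) {
          Term term = new Term("text_de_de", "house");
          // minimumSimilarity 0.5, prefixLength 2
          FuzzyQuery q = new FuzzyQuery(term, 0.5f, 2);
          System.out.println(q);
      }
  }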


- Mark


Re: query with stemming, prefix and fuzzy?

2009-01-30 Thread Shalin Shekhar Mangar
On Fri, Jan 30, 2009 at 11:37 PM, Mark Miller  wrote:

>
>>
>>> you can try indexing the full term at the same position as the stemmed
>>> term,
>>>
>>>
>>
>> what does this mean "at the same position" and how could I do this?
>>
>>
> Write a custom filter. Normally, for every term, its position is
> incremented by 1 as the terms are broken out in tokenization. You can change
> this and index terms at the same position using your own filter. There are
> ramifications, because you are adding more terms to your index, but it
> allows you to index multiple forms of a term at the same position (so that
> phrase queries still work as expected).


Can SOLR-763 help here? It is in trunk now.

https://issues.apache.org/jira/browse/SOLR-763

-- 
Regards,
Shalin Shekhar Mangar.


problems on solr search patterns and sorting rules

2009-01-30 Thread fei dong
Hi buddy, I work on an audio search based on the Solr engine. I want to
implement lyric search and sort by relevance. Here is my confusion.
My schema.xml is like this:
   <field name="id" type="string" indexed="true" stored="true" required="true"/>
   <field name="mp3" type="text" indexed="true" stored="true"/>
   <field name="artist" type="text" indexed="true" stored="true"/>
   <field name="album" type="text" indexed="true" stored="true"/>
   <field name="lyric" type="text" indexed="true" stored="true"/>
   <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

 <defaultSearchField>text</defaultSearchField>

   <copyField source="mp3" dest="text"/>
   <copyField source="artist" dest="text"/>
   <copyField source="album" dest="text"/>
   <copyField source="lyric" dest="text"/>
...
http://localhost:8983/solr/select/?q=lyric:(tear the house down)&fl=*,score&version=2.2&start=0&rows=10&indent=on
has results

http://localhost:8983/solr/select/?q=tear the house down&fl=*,score&version=2.2&start=0&rows=10&indent=on
has no results

http://localhost:8983/solr/select/?q=tear the house down&fl=*,score&qf=lyric&version=2.2&start=0&rows=10&indent=on
has no results

Q1: why do the latter links not work when I have added lyric to copyField?

Q2: I want to set the priority of song name higher, then artist name and
album, so I try like this:
http://localhost:8983/solr/select/?q=sweet&fl=*,score&qf=mp3^5 artist album^0.4&version=2.2&start=0&rows=10&indent=on
I find the scores are totally the same as without the qf argument:
http://localhost:8983/solr/select/?q=sweet&fl=*,score&version=2.2&start=0&rows=10&indent=on

How could I modify the sorting?

Q3: I would like to achieve an effect like:
http://mp3.baidu.com/m?f=ms&rn=&tn=baidump3&ct=134217728&word=tear+the+house+down&lm=-1
which highlights the matching fragment. Can Solr give me the smallest range
that contains all of the keywords in a text like a lyric?

Thank you for attention!


Separate error logs

2009-01-30 Thread James Brady
Hi all,What's the best way for me to split Solr/Lucene error message off to
a separate log?

Thanks
James


Re: exceeded limit of maxWarmingSearchers

2009-01-30 Thread Jon Drukman

Yonik Seeley wrote:

I'd advise setting it to a very low limit (like 2) and committing less
often.  Once you get too many overlapping searchers, things will slow
to a crawl and that will just cause more to pile up.

The root cause is simply too many commits in conjunction with warming
too long.  If you are using a dev version of Solr 1.4, you might try
commitWithin instead of explicit commits. (see SOLR-793)  Depending
how long warming takes, you may want to lower autowarm counts.


right now we commit on every update, but that's probably not more than
once every few minutes.  Should I back it off?


-jsd-



Re: solr as the data store

2009-01-30 Thread Paul Libbrecht
We've been using a Lucene index as the main data-store for ActiveMath,  
the indexing process of which takes the XML fragments apart and stores  
them in an organized way, including storage of the relationships both  
ways.


The difference between SQL and Lucene in this case? Pure java was the  
major reason back then. The performance of Lucene stayed top as well  
(compared to XML databases).


As of now, because of 2.0, we had to split out the storage of the
fragments themselves, keeping the rest in Lucene, because the
functionality to reliably read and write fields and never have them be
loaded as single strings has been missing for us. Maybe it's back in 2.3...


Our fragments' size vary from 20 byte to 2 MBytes... about 25k of them  
is normal.


I'm looking forward to, one day, recycle it all to solr which would  
finally take care of it all in terms of index update and read  
management, adding a Luke-like web-access.


Scalability of Lucene has always been top.
Joins are not there... I could get along without them.
Summaries are also not really there... but again, we could get along  
without them.


paul


Le 28-janv.-09 à 21:37, Ian Connor a écrit :


Hi All,

Is anyone using Solr (and thus the lucene index) as their database
store?


Up to now, we have been using a database to build Solr from.  
However, given
that lucene already keeps the stored data intact, and that  
rebuilding from
solr to solr can be very fast, the need for the separate database  
does not

seem so necessary.

It seems totally possible to maintain just the solr shards and treat  
them as
the database (backups, redundancy, etc are already built right in).  
The idea
that we would need to rebuild from scratch seems unlikely and the  
speed
boost by using solr shards for data massaging and reindexing seems  
very

appealing.

Has anyone else thought about this or done this and ran into  
problems that
caused them to go back to a seperate database model? Is there a  
critical

need you can think is missing?

--
Regards,

Ian Connor






Re: Separate error logs

2009-01-30 Thread Ryan McKinley

check:
http://wiki.apache.org/solr/SolrLogging

You configure whatever flavor logger to write errors to a separate log.
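For example, with the default JDK logging binding, a logging.properties
along these lines (the file name and levels are illustrative) routes
warnings and errors to their own file:

  handlers = java.util.logging.FileHandler, java.util.logging.ConsoleHandler
  .level = INFO
  java.util.logging.ConsoleHandler.level = INFO
  java.util.logging.FileHandler.level = WARNING
  java.util.logging.FileHandler.pattern = logs/solr-errors-%u.log
  java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

It is picked up by starting the JVM with
-Djava.util.logging.config.file=logging.properties.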


On Jan 30, 2009, at 4:36 PM, James Brady wrote:

Hi all,What's the best way for me to split Solr/Lucene error message  
off to

a separate log?

Thanks
James




Re: exceeded limit of maxWarmingSearchers

2009-01-30 Thread Otis Gospodnetic
That should be fine (but apparently isn't), as long as you don't have some very
slow machine or caches that are large and configured to copy a lot of
data on commit.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Jon Drukman 
> To: solr-user@lucene.apache.org
> Sent: Friday, January 30, 2009 4:54:06 PM
> Subject: Re: exceeded limit of maxWarmingSearchers
> 
> Yonik Seeley wrote:
> > I'd advise setting it to a very low limit (like 2) and committing less
> > often.  Once you get too many overlapping searchers, things will slow
> > to a crawl and that will just cause more to pile up.
> > 
> > The root cause is simply too many commits in conjunction with warming
> > too long.  If you are using a dev version of Solr 1.4, you might try
> > commitWithin instead of explicit commits. (see SOLR-793)  Depending
> > how long warming takes, you may want to lower autowarm counts.
> 
> right now we commit on every update, but that's probably not more than once 
> every few minutes.  should i back it off?
> 
> -jsd-



Re: Separate error logs

2009-01-30 Thread James Brady
Oh... I should really have found that myself :/
Thank you!

2009/1/30 Ryan McKinley 

> check:
> http://wiki.apache.org/solr/SolrLogging
>
> You configure whatever flavor logger to write error to a separate log
>
>
>
> On Jan 30, 2009, at 4:36 PM, James Brady wrote:
>
>  Hi all,What's the best way for me to split Solr/Lucene error message off
>> to
>> a separate log?
>>
>> Thanks
>> James
>>
>
>


Re: problems on solr search patterns and sorting rules

2009-01-30 Thread Koji Sekiguchi

fei dong wrote:

Hi buddy, I work on an audio search based on the Solr engine. I want to
implement lyric search and sort by relevance. Here is my confusion.
My schema.xml is like this:
   <field name="id" type="string" indexed="true" stored="true" required="true"/>
   <field name="mp3" type="text" indexed="true" stored="true"/>
   <field name="artist" type="text" indexed="true" stored="true"/>
   <field name="album" type="text" indexed="true" stored="true"/>
   <field name="lyric" type="text" indexed="true" stored="true"/>
   <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

 <defaultSearchField>text</defaultSearchField>

   <copyField source="mp3" dest="text"/>
   <copyField source="artist" dest="text"/>
   <copyField source="album" dest="text"/>
   <copyField source="lyric" dest="text"/>
...
http://localhost:8983/solr/select/?q=lyric:(tear the house down)&fl=*,score&version=2.2&start=0&rows=10&indent=on
has results

http://localhost:8983/solr/select/?q=tear the house down&fl=*,score&version=2.2&start=0&rows=10&indent=on
has no results

http://localhost:8983/solr/select/?q=tear the house down&fl=*,score&qf=lyric&version=2.2&start=0&rows=10&indent=on
has no results

Q1: why do the latter links not work when I have added lyric to copyField?

  

Did you re-index after adding lyric to copyField?


Q2: I want to set the priority of song name higher, then artist name and
album, so I try like this:
http://localhost:8983/solr/select/?q=sweet&fl=*,score&qf=mp3^5 artist album^0.4&version=2.2&start=0&rows=10&indent=on
I find the scores are totally the same as without the qf argument:
http://localhost:8983/solr/select/?q=sweet&fl=*,score&version=2.2&start=0&rows=10&indent=on

How could I modify the sorting?

  

Try indexing-time boost:
http://wiki.apache.org/solr/UpdateXmlMessages#head-8315b8028923d028950ff750a57ee22cbf7977c6
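For example, a boost can be attached in the XML update message itself
(the values here are illustrative; field-level boosts only take effect on
fields indexed with norms):

  <add>
    <doc boost="2.0">
      <field name="mp3" boost="5.0">some song title</field>
      <field name="artist">some artist</field>
      <field name="lyric">...</field>
    </doc>
  </add>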


Q3: I would like to achieve an effect like:
http://mp3.baidu.com/m?f=ms&rn=&tn=baidump3&ct=134217728&word=tear+the+house+down&lm=-1
which highlights the matching fragment. Can Solr give me the smallest range
that contains all of the keywords in a text like a lyric?

  
I couldn't understand your requirement, but can you try 
&hl=on&hl.fl=lyric and see what you get?



Thank you for attention!

  




RE: Performance "dead-zone" due to garbage collection

2009-01-30 Thread wojtekpia

I profiled our application, and GC is definitely the problem. The IBM JVM
didn't change much. I'm currently looking into ways of reducing my memory
footprint. 

-- 
View this message in context: 
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21758001.html
Sent from the Solr - User mailing list archive at Nabble.com.



Solr on Sun Java Real-Time System

2009-01-30 Thread wojtekpia

Has anyone tried Solr on the Sun Java Real-Time JVM
(http://java.sun.com/javase/technologies/realtime/index.jsp)? I've read that
it includes better control over the garbage collector.

Thanks.

Wojtek
-- 
View this message in context: 
http://www.nabble.com/Solr-on-Sun-Java-Real-Time-System-tp21758035p21758035.html
Sent from the Solr - User mailing list archive at Nabble.com.



Range search question

2009-01-30 Thread Jim Adams
I have a string field in my schema that actually contains numeric data.  If I try a
range search:

fieldInQuestion:[ 100 TO 150 ]

I fetch back a lot of data that is NOT in this range, such as 11, etc.

Any idea why this happens?  Is it because this is a string?

Thanks.


Re: Range search question

2009-01-30 Thread Koji Sekiguchi

Jim Adams wrote:

I have a string field in my schema that actually contains numeric data.  If I try a
range search:

fieldInQuestion:[ 100 TO 150 ]

I fetch back a lot of data that is NOT in this range, such as 11, etc.

Any idea why this happens?  Is it because this is a string?

Thanks.

  


Yep, on a string field the range match is lexicographic - "11" sorts
between "100" and "150". Try the sint field type instead.
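For reference, the example schema that ships with Solr defines sint like
this (switching the field's type requires a full re-index; the field name
below is from Jim's mail):

  <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
  <field name="fieldInQuestion" type="sint" indexed="true" stored="true"/>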

Koji



Re: Range search question

2009-01-30 Thread Jim Adams
True, which is what I'll probably do, but is there any way to do this using
'string'?  Actually I have even seen this with date fields, which seems very
odd (more data being returned than I expected).

On Fri, Jan 30, 2009 at 7:04 PM, Koji Sekiguchi  wrote:

> Jim Adams wrote:
>
>> I have a string field in my schema that actually contains numeric data.  If I try a
>> range search:
>>
>> fieldInQuestion:[ 100 TO 150 ]
>>
>> I fetch back a lot of data that is NOT in this range, such as 11, etc.
>>
>> Any idea why this happens?  Is it because this is a string?
>>
>> Thanks.
>>
>>
>>
>
> Yep, try sint field type instead.
>
> Koji
>
>