Re: query with stemming, prefix and fuzzy?

2009-01-29 Thread Gert Brinkmann

Shalin Shekhar Mangar wrote:

Quite the opposite, you are actually working with some advanced stuff :)


Thank you for the response.


Please have some patience, someone is


Ok, I will have (what else could I do? ;) ). Meanwhile I will try some
things and continue to search the web.


Greetings,
Gert


Re: Highlighting does not work?

2009-01-29 Thread Jarek Zgoda
Added an appropriate amendment to the FAQ, but I'd consider reorganizing
information in the whole wiki, like creating a section titled "Common
Tasks". A bit of redundancy does not hurt when it comes to documentation.


On 2009-01-28, at 20:01, Mike Klaas wrote:

Well, both pages I listed are in the search results :).  But I agree  
that it isn't obvious to find, and that it should be improved.  (The  
Wiki is a community-created site which anyone can contribute to,  
incidentally.)


cheers,
-Mike

On 28-Jan-09, at 1:11 AM, Jarek Zgoda wrote:

I swear I was looking for this information in the Solr wiki. See for
yourself whether this is accessible at all:


http://wiki.apache.org/solr/?action=fullsearch&context=180&value=highlight&fullsearch=Text

On 2009-01-28, at 00:58, Mike Klaas wrote:


They are documented in http://wiki.apache.org/solr/FieldOptionsByUseCase
and in the FAQ, but I agree that it could be more readily
accessible.


-Mike

On 27-Jan-09, at 5:26 AM, Jarek Zgoda wrote:

Finally found that the fields have to have an analyzer to be  
highlighted. Neat.


Can I ask somebody to document all these requirements?

On 2009-01-27, at 13:49, Jarek Zgoda wrote:


I turned these fields to indexed + stored but the results are  
exactly the same, no matter if I search in these fields or  
elsewhere.


On 2009-01-27, at 13:09, Jarek Zgoda wrote:



Solr 1.3

I'm trying to get highlighting working, with no luck so far.

Query with params  
q=cyrus&fl=*,score&qt=standard&hl=true&hl.fl=title+description  
finds 182 documents in my index. All of the top 10 hits contain  
the word "cyrus", but the highlights list is empty. The fields  
"title" and "description" are stored but not indexed. If I  
specify "*" as hl.fl value I get the same results.


Do I need to add some special configuration to enable the
highlighting feature?


--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zg...@redefine.pl



Re: newbie question --- multiple schemas

2009-01-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
Have two different cores; you can have a separate schema for each.
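
e.g. a minimal multicore solr.xml (sketch only - core names and
instanceDirs are made up; each instanceDir carries its own conf/schema.xml):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="typeA" instanceDir="typeA"/>
      <core name="typeB" instanceDir="typeB"/>
    </cores>
  </solr>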

On Thu, Jan 29, 2009 at 1:20 PM, Cheng Zhang  wrote:
> Hello,
>
> Is it possible to define more than one schema? I'm reading the example 
> schema.xml. It seems that we can only define one schema? What about if I want 
> to define one schema for document type A and another schema for document type 
> B?
>
> Thanks a lot,
> Kevin
>
>



-- 
--Noble Paul


Re: WebLogic 10 Compatibility Issue - StackOverflowError

2009-01-29 Thread Ilan Rabinovitch


We were able to deploy Solr 1.3 on Weblogic 10.0 earlier today.  Doing 
so required two changes:


1) Creating a weblogic.xml file in solr.war's  WEB-INF directory.  The 
weblogic.xml file is required to disable Solr's filter on FORWARD.


The contents of weblogic.xml should be:


<weblogic-web-app xmlns="http://www.bea.com/ns/weblogic/90"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.bea.com/ns/weblogic/90
    http://www.bea.com/ns/weblogic/90/weblogic-web-app.xsd">

  <container-descriptor>
    <filter-dispatched-requests-enabled>false</filter-dispatched-requests-enabled>
  </container-descriptor>

</weblogic-web-app>

2)  Remove the pageEncoding attribute from line 1 of solr/admin/header.jsp




On 1/17/09 2:02 PM, KSY wrote:

I hit a major roadblock while trying to get Solr 1.3 running on WebLogic
10.0.

A similar message was posted before - (
http://www.nabble.com/Solr-1.3-stack-overflow-when-accessing-solr-admin-page-td20157873.html
http://www.nabble.com/Solr-1.3-stack-overflow-when-accessing-solr-admin-page-td20157873.html
) - but it seems like it hasn't been resolved yet, so I'm re-posting here.

I am sure I configured everything correctly because it's working fine on
Resin.

Has anyone successfully run Solr 1.3 on WebLogic 10.0 or higher?   Thanks.


SUMMARY:

When accessing /solr/admin page, StackOverflowError occurs due to an
infinite recursion in SolrDispatchFilter


ENVIRONMENT SETTING:

Solr 1.3.0
WebLogic 10.0
JRockit JVM 1.5


ERROR MESSAGE:

SEVERE: javax.servlet.ServletException: java.lang.StackOverflowError
at
weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:276)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
at
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at
weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
at
weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
at
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at
weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
at
weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
at
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at
weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
at
weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)




--
Ilan Rabinovitch
i...@fonz.net

---
SCALE 7x: 2009 Southern California Linux Expo
Los Angeles, CA
http://www.socallinuxexpo.org



Registration for ApacheCon Europe 2009 is now open!

2009-01-29 Thread Erik Hatcher
Cross-posting this announcement.  There are several relevant Lucene/ 
Solr talks including:


Trainings
  - Lucene Boot Camp (Grant Ingersoll)
  - Solr Boot Camp (Erik Hatcher)

Sessions
  - Introducing Apache Mahout (Grant)
  - Lucene Case Studies (Erik)
  - Advanced Indexing Techniques with Apache Lucene (Michael Busch)

And a whole slew of Hadoop/cloud coverage.

Erik




--

ApacheCon EU 2009 registration is now open!
23-27 March -- Mövenpick Hotel, Amsterdam, Netherlands
http://www.eu.apachecon.com/


Registration for ApacheCon Europe 2009 is now open - act before early
bird prices expire 6 February.  Remember to book a room at the Mövenpick
and use the Registration Code: Special package attendees for the
conference registration, and get 150 Euros off your full conference
registration.

Lower Costs - Thanks to new VAT laws, our prices this year are 19%
lower than last year in Europe!  We've also negotiated a Mövenpick rate
of a maximum of 155 Euros per night for attendees in our room block.

Quick Links:

  http://xrl.us/aceu09sp  See the schedule
  http://xrl.us/aceu09hp  Get your hotel room
  http://xrl.us/aceu09rp  Register for the conference

Other important notes:

- Geeks for Geeks is a new mini-track where we can feature advanced
technical content from project committers.  And our Hackathon on Monday
and Tuesday is open to all attendees - be sure to check it off in your
registration.

- The Call for Papers for ApacheCon US 2009, held 2-6 November
2009 in Oakland, CA, is open through 28 February, so get your
submissions in now.  This ApacheCon will feature special events with
some of the ASF's original founders in celebration of the 10th
anniversary of The Apache Software Foundation.

  http://www.us.apachecon.com/c/acus2009/

- Interested in sponsoring the ApacheCon conferences?  There are plenty
of sponsor packages available - please contact Delia Frees at
de...@apachecon.com for further information.

==
ApacheCon EU 2009: A week of Open Source at its best!

Hackathon - open to all! | Geeks for Geeks | Lunchtime Sessions
In-Depth Trainings | Multi-Track Sessions | BOFs | Business Panel
Lightning Talks | Receptions | Fast Feather Track | Expo... and more!

- Shane Curcuru, on behalf of
 Noirin Shirley, Conference Lead,
 and the whole ApacheCon Europe 2009 Team
 http://www.eu.apachecon.com/  23-27 March -- Amsterdam, Netherlands




Re: query with stemming, prefix and fuzzy?

2009-01-29 Thread Gert Brinkmann
Gert Brinkmann wrote:

>> A) fuzzy search
>>
>> What can I do to speed up the fuzzy query? 

Setting ramBufferSizeMB to a higher value seems to speed up the query
slightly. I have to continue with tuning though.

>> B) combine stemming, prefix and fuzzy search
>>
>> Is there a way to combine all this three query types in one query?
> (Or at least stemming and prefix search?)

I am a little bit confused now. Doing a fuzzy search seems to work now
on a normally analyzed field. (Hmm, did I change something?) So the
analyzers are not breaking up such queries at the moment.

Also the prefix search does work (sometimes). I have started to test
this with a German umlaut: "lehmhü*" to find "lehmhütte". This does not
work. But searching for "lehmhu*" does find the words. (I am using the
German2 snowball stemmer.) Another problem is that the prefix search
does not return highlight snippets. Is this an issue or did I forget a
configuration detail?

Thanks,
Gert


Re: Pagination by facet?

2009-01-29 Thread Bruno Aranda
Further investigation leads me to think that I could achieve this by using
the parameters facet.offset and facet.limit. I wonder how to do this with
solrj, as I can see the SolrQuery.setFacetLimit() method but not a method to
specify the facet offset. I guess I can extend the class and add the offset
method myself. Am I missing something?
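
Something like this looks like it should work (untested sketch - since
SolrQuery extends ModifiableSolrParams, the raw parameter can be set
directly without subclassing; "family_name" is my assumed field):

  SolrQuery query = new SolrQuery("*:*");
  query.addFacetField("family_name");
  query.setFacetLimit(3);            // 3 families per page
  query.set("facet.offset", 6);      // skip the first two pages' families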

Thanks,

Bruno

2009/1/28 Bruno Aranda 

> Hi, bear with me as I am new to Solr.
>
> I have a requirement in an application where I need to show a list of
> results by groups.
>
> For instance, each document in my index corresponds to a person and they
> have a family name. I have hundreds of thousands of records (persons). What
> I would like to do is to show the records in a table, grouped by families. I
> could do this with sorting, no problem.
> However, now I need my table to show a fixed number of families (e.g. 3)
> per page and the pages are family-based and not person-based. To do this, I
> would need to do queries limiting the results to the number of families
> found. Something like this:
>
>
>
> Family name Person  Age
> --  -
> Smith   John  31
>Kate   32
>Peter   3
> --
> Baker  Charles   55
> --
> Taylor  Richard   67
>Anne  64
> --
>
> And I would show always the same number of families per page (number of
> persons can be different), and I would paginate per family.
>
> I am using Java, so solrj would be good. Is there an easy way to achieve
> this?
>
> Thanks!
>
> Bruno
>


Re: WebLogic 10 Compatibility Issue - StackOverflowError

2009-01-29 Thread Mark Miller

We should get this on the wiki.

- Mark


Ilan Rabinovitch wrote:


We were able to deploy Solr 1.3 on Weblogic 10.0 earlier today.  Doing 
so required two changes:


1) Creating a weblogic.xml file in solr.war's  WEB-INF directory.  The 
weblogic.xml file is required to disable Solr's filter on FORWARD.


The contents of weblogic.xml should be:


<weblogic-web-app xmlns="http://www.bea.com/ns/weblogic/90"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.bea.com/ns/weblogic/90
    http://www.bea.com/ns/weblogic/90/weblogic-web-app.xsd">

  <container-descriptor>
    <filter-dispatched-requests-enabled>false</filter-dispatched-requests-enabled>
  </container-descriptor>

</weblogic-web-app>


2)  Remove the pageEncoding attribute from line 1 of 
solr/admin/header.jsp





On 1/17/09 2:02 PM, KSY wrote:

I hit a major roadblock while trying to get Solr 1.3 running on WebLogic
10.0.

A similar message was posted before - (
http://www.nabble.com/Solr-1.3-stack-overflow-when-accessing-solr-admin-page-td20157873.html 

http://www.nabble.com/Solr-1.3-stack-overflow-when-accessing-solr-admin-page-td20157873.html 

) - but it seems like it hasn't been resolved yet, so I'm re-posting 
here.


I am sure I configured everything correctly because it's working fine on
Resin.

Has anyone successfully run Solr 1.3 on WebLogic 10.0 or higher?   
Thanks.



SUMMARY:

When accessing /solr/admin page, StackOverflowError occurs due to an
infinite recursion in SolrDispatchFilter


ENVIRONMENT SETTING:

Solr 1.3.0
WebLogic 10.0
JRockit JVM 1.5


ERROR MESSAGE:

SEVERE: javax.servlet.ServletException: java.lang.StackOverflowError
at
weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:276)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
at
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at
weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
at
weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
at
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at
weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
at
weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
at
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at
weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
at
weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)










Re: WebLogic 10 Compatibility Issue - StackOverflowError

2009-01-29 Thread Alexander Ramos Jardim
Ilan,

I had the same problem some months ago and had to remove the quoted line
from the JSP. But I never hit the other problem you mentioned with 1.3 on
WebLogic.

2009/1/29 Ilan Rabinovitch 

>
> We were able to deploy Solr 1.3 on Weblogic 10.0 earlier today.  Doing so
> required two changes:
>
> 1) Creating a weblogic.xml file in solr.war's  WEB-INF directory.  The
> weblogic.xml file is required to disable Solr's filter on FORWARD.
>
> The contents of weblogic.xml should be:
>
> <weblogic-web-app xmlns="http://www.bea.com/ns/weblogic/90"
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>     xsi:schemaLocation="http://www.bea.com/ns/weblogic/90
>     http://www.bea.com/ns/weblogic/90/weblogic-web-app.xsd">
>
>   <container-descriptor>
>     <filter-dispatched-requests-enabled>false</filter-dispatched-requests-enabled>
>   </container-descriptor>
>
> </weblogic-web-app>
>
>
> 2)  Remove the pageEncoding attribute from line 1 of solr/admin/header.jsp
>
>
>
>
>
> On 1/17/09 2:02 PM, KSY wrote:
>
>> I hit a major roadblock while trying to get Solr 1.3 running on WebLogic
>> 10.0.
>>
>> A similar message was posted before - (
>>
>> http://www.nabble.com/Solr-1.3-stack-overflow-when-accessing-solr-admin-page-td20157873.html
>>
>> http://www.nabble.com/Solr-1.3-stack-overflow-when-accessing-solr-admin-page-td20157873.html
>> ) - but it seems like it hasn't been resolved yet, so I'm re-posting here.
>>
>> I am sure I configured everything correctly because it's working fine on
>> Resin.
>>
>> Has anyone successfully run Solr 1.3 on WebLogic 10.0 or higher?   Thanks.
>>
>>
>> SUMMARY:
>>
>> When accessing /solr/admin page, StackOverflowError occurs due to an
>> infinite recursion in SolrDispatchFilter
>>
>>
>> ENVIRONMENT SETTING:
>>
>> Solr 1.3.0
>> WebLogic 10.0
>> JRockit JVM 1.5
>>
>>
>> ERROR MESSAGE:
>>
>> SEVERE: javax.servlet.ServletException: java.lang.StackOverflowError
>>at
>>
>> weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:276)
>>at
>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
>>at
>>
>> weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
>>at
>>
>> weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
>>at
>>
>> weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
>>at
>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
>>at
>>
>> weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
>>at
>>
>> weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
>>at
>>
>> weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
>>at
>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
>>at
>>
>> weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
>>at
>>
>> weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
>>at
>>
>> weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
>>at
>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
>>
>>
>
> --
> Ilan Rabinovitch
> i...@fonz.net
>
> ---
> SCALE 7x: 2009 Southern California Linux Expo
> Los Angeles, CA
> http://www.socallinuxexpo.org
>
>


-- 
Alexander Ramos Jardim


RE: DIH handling of missing files

2009-01-29 Thread Nathan Adams
I'm running the example from the DIH wiki page:
 
http://wiki.apache.org/solr-data/attachments/DataImportHandler/attachments/example-solr-home.jar
 
-Nathan

 


From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:noble.p...@gmail.com]
Sent: Wed 01/28/2009 11:32 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH handling of missing files



onError="continue" must help.

which version of DIH are you using? onError is a Solr 1.4 feature
--Noble

On Thu, Jan 29, 2009 at 5:04 AM, Nathan Adams  wrote:
> I am constructing documents from a JDBC datasource and a HTTP datasource
> (see data-config file below.)  My problem is that I cannot know if a
> particular HTTP URL is available at index time, so I need DIH to
> continue processing even if the HTTP location returns a 404.
> onError="continue" does not appear to help in this case.  Should it?
>
> <dataConfig>
>   <dataSource driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@?"
>       user="???" password="???"/>
>   <dataSource type="HttpDataSource" name="http"/>
>   <document>
>     <entity name="metadata" query="select * from ..." onError="continue">
>       <entity url="http://???.com/${metadata.RESOURCEID}.xml" forEach="/content"
>           dataSource="http" processor="XPathEntityProcessor" onError="continue">
>         ...
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> Thanks,
> Nathan
>



--
--Noble Paul




RE: DIH handling of missing files

2009-01-29 Thread Nathan Adams
Which appears to be v1.3, which explains the problem.  Thanks!



From: Nathan Adams [mailto:na...@umich.edu]
Sent: Thu 01/29/2009 8:28 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH handling of missing files



I'm running the example from the DIH wiki page:

http://wiki.apache.org/solr-data/attachments/DataImportHandler/attachments/example-solr-home.jar

-Nathan




From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:noble.p...@gmail.com]
Sent: Wed 01/28/2009 11:32 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH handling of missing files



onError="continue" must help.

which version of DIH are you using? onError is a Solr 1.4 feature
--Noble

On Thu, Jan 29, 2009 at 5:04 AM, Nathan Adams  wrote:
> I am constructing documents from a JDBC datasource and a HTTP datasource
> (see data-config file below.)  My problem is that I cannot know if a
> particular HTTP URL is available at index time, so I need DIH to
> continue processing even if the HTTP location returns a 404.
> onError="continue" does not appear to help in this case.  Should it?
>
> <dataConfig>
>   <dataSource driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@?"
>       user="???" password="???"/>
>   <dataSource type="HttpDataSource" name="http"/>
>   <document>
>     <entity name="metadata" query="select * from ..." onError="continue">
>       <entity url="http://???.com/${metadata.RESOURCEID}.xml" forEach="/content"
>           dataSource="http" processor="XPathEntityProcessor" onError="continue">
>         ...
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> Thanks,
> Nathan
>



--
--Noble Paul






fuzzy search and uppercased word. finds moo~ not Moo~

2009-01-29 Thread Julian Davchev
Hi,
I am doing fuzzy search, and it works correctly. For some reason, though,
it has problems with uppercase words:
e.g. if I search moo~ I get results, but if I do Moo~ I don't.
I see in the analyzer that LowerCaseFilterFactory is hitting, but I guess
with fuzzy it's getting messy.
Any clue, someone?

Cheers


Data Directory Sync.

2009-01-29 Thread Kalidoss MM
Hi,

   I have a requirement like this: there is a running Solr with around
10K records indexed in it. Now I have to index another set of 30K records.

   The 10K data is already live, and I don't have the option to insert
those 30K records into the live instance.

   Is there any way to run Solr on a local system, get the 30K
records into its data directory, and update/upgrade the local Solr data
directory INTO the live data directory?

   Are there any tools available? Or is there any other method to
sync/combine 2 different data directories into 1 data directory?

Thanks,
Kalidoss.m,


Re: multilanguage + howto search in all languages?

2009-01-29 Thread Julian Davchev
Thank you both for the pointers. For now I am handling it with fuzzy
search. Let's hope this will do for some time :)


Walter Underwood wrote:
> I've done this. There are five cases for the tokens in the search
> index:
>
> 1. Tokens that are unique after stemming (this is good).
> 2. Tokens that are common after stemming (usually trademarks,
>like LaserJet).
> 3. Tokens with collisions after stemming:
>German "mit", "MIT" the university
>German "Boot" (boat), English "boot" (a heavy shoe)
> 4. Tokens with collisions in the surface form:
>Dutch "mobile" (plural of furniture), English "mobile"
>German "die" (stemmed to "das"), English "die"
>
> You cannot fix every spurious match, but you can do OK with
> stemmed fields for each language and a raw (unstemmed surface
> token) field.
>
> I won't recommend weights, but you could have fields for
> text_en, text_de, and text_raw, for example.
>
> You really cannot automatically determine the language of a
> query, mostly because of proper nouns, especially trademarks.
> Identify the language of these queries:
>
> * Google
> * LaserJet
> * Obama
> * Las Vegas
> * Paris
>
> HTTP supports an Accept-Language header, but I have no idea
> how often that is sent. We honored that in Ultraseek, mostly
> because it was standard.
>
> Finally, if you are working with localization, please take the
> time to understand the difference between ISO language codes
> and ISO country codes.
>
> wunder
>
> On 1/28/09 4:47 PM, "Erick Erickson"  wrote:
>
>   
>> I'm not entirely sure about the fine points, but consider the
>> filters that are available that fold all the diacritics into their
>> low-ascii equivalents. Perhaps using that filter at *both* index
>> and search time on the English index would do the trick.
>>
>> In your example, both would be 'munchen'. Straight English
>> would be unaffected by the filter, but any German words with
>> diacritics that crept in would be folded into their low-ascii
>> "equivalents". This would also work at index time, just in case
>> you indexed English text that had some German words.
>>
>> NOTE: My experience is more on the Lucene side than the SOLR
>> side, but I'm sure the filters are available.
>>
>> Best
>> Erick
>>
>> On Wed, Jan 28, 2009 at 5:21 PM, Julian Davchev  wrote:
>>
>> 
>>> Hi,
>>> I currently have two indexes with solr. One for english version and one
>>> with german version. They use respectively english/german2 snowball
>>> factory.
>>> Right now depending on which language is website currently I query
>>> corresponding index.
>>> There is requirement though that stuff is found regardless in which
>>> language is found.
>>> So for example, searching for muenchen (which will be caught correctly
>>> by the German snowball factory as münchen) in the English index should
>>> find it. Right now it does not, as I suppose the English factory
>>> doesn't really care about umlauts.
>>>
>>> Any pointers are more than welcome. I am considering synonyms  but this
>>> will be kinda to heavy to follow/create.
>>> Cheers,
>>> JD
>>>
>>>   
>
>   
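
(For reference: the folding filter Erick describes is available in Solr
as solr.ISOLatin1AccentFilterFactory. A minimal field type sketch,
untested - the type name is made up:

  <fieldType name="text_folded" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
    </analyzer>
  </fieldType>

Applied at both index and query time, "münchen" and "munchen" should both
end up as "munchen".)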



Re: Data Directory Sync.

2009-01-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Thu, Jan 29, 2009 at 7:27 PM, Kalidoss MM  wrote:
> Hi,
>
>   I have a requirement like, There is a running solr and having around
> 10K records indexed in it. Now i have to index another set of 30K records?
>
>   The 10K data already in live, And i dont have an option to insert
> that 30K records in live,
you can index the 30K data into the live Solr.
>
>   Is there any way to run the solr in local system and get the 30K
> records in data directory, and Update/Upgrade the local solr data directoy
> INTO live data directory?
>
>   Is there any tools available? Or is there any other method to
> Sync/combine 2 different data directory and make it to 1 data directory.
>
> Thanks,
> Kalidoss.m,
>



-- 
--Noble Paul


check snapshoot/snapinstaller

2009-01-29 Thread sunnyfr

Hi,

I would like to know how I can properly check whether a snapshot has been
taken, other than by looking in the data directory.

Is the snapshooter.log file updated if snapshooter is executed
automatically after a commit?

Is there a way to check the last snapshooter or snapinstaller activity via
stats.jsp? I don't think so, but just to check :))

THANKS A LOT GUYS,
WISH YOU A NICE EVENING/AFTERNOON !!

Sunny !!
-- 
View this message in context: 
http://www.nabble.com/check-snapshoot-snapinstaller-tp21730105p21730105.html
Sent from the Solr - User mailing list archive at Nabble.com.



MASTER / SLAVES numdoc

2009-01-29 Thread sunnyfr

Hi,

I have one master server and several slaves, and I would like to know: if
I go to host.name/solr/admin/stats.jsp, is there a way to see the
difference in numDocs per server?

Thanks a lot
-- 
View this message in context: 
http://www.nabble.com/MASTER---SLAVES-numdoc-tp21730748p21730748.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: check snapshoot/snapinstaller

2009-01-29 Thread Bill Au
snapshooter.log logs all invocations of the snapshooter, including the
automatic ones triggered by a commit/optimize.

There are log files in the logs directory on various status/stats:

http://wiki.apache.org/solr/SolrCollectionDistributionStatusStats

These status/stats can be displayed in the admin page by clicking the
[DISTRIBUTION] link on the admin page.

Bill


On Thu, Jan 29, 2009 at 11:16 AM, sunnyfr  wrote:

>
> Hi,
>
> I would like to know how can I check properly if a snapshot has been done
> except by checking in the data directory.
>
> Does snapshooter.log file is updated if snapshoot is executed automaticly
> after a commit?
>
> Is there a way to check snapshoot or snapinstaller last activity by
> stats.jsp, don't think so but just to check :))
>
> THANKS A LOT GUYS,
> WISH YOU A NICE EVENING/AFTERNOON !!
>
> Sunny !!
> --
> View this message in context:
> http://www.nabble.com/check-snapshoot-snapinstaller-tp21730105p21730105.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


warmupTime : 0

2009-01-29 Thread sunnyfr

Hi,

Do you think it's normal to have warmupTime : 0 ??

searcher  
class:  org.apache.solr.search.SolrIndexSearcher  
version:1.0  
description:index searcher  
stats:  searcherName : searc...@6f7cf6b6 main
caching : true
numDocs : 8207035
maxDoc : 8239991
readerImpl : ReadOnlyMultiSegmentReader
readerDir : org.apache.lucene.store.FSDirectory@/data/solr/video/data/index
indexVersion : 1228743257996
openedAt : Thu Jan 29 17:42:08 CET 2009
registeredAt : Thu Jan 29 17:42:09 CET 2009
warmupTime : 0 

I've around 12M of data.

thanks a lot,

-- 
View this message in context: 
http://www.nabble.com/warmupTime-%3A-0-tp21731301p21731301.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: fuzzy search and uppercased word. finds moo~ not Moo~

2009-01-29 Thread Mark Miller

Julian Davchev wrote:

Hi,
I am doing fuzzy search. And works correctly. For some reason though it
has problems with uppercase words.
e.g if I search moo~I get results but if I do Moo~ I don't.
I see in analyzer that LowerCaseFilterFactory is hitting but I gess with
fuzzy it's getting messy.
Any clue someone?

Cheers
  
There is a setting on the Lucene QueryParser that controls whether 
expanded terms get lowercased (they generally don't hit an analyzer). It 
looks like the SolrQueryParser, which extends the Lucene QueryParser, 
hardcodes the setting to false.
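
At the Lucene level, the knob looks like this (a sketch against the
Lucene 2.x QueryParser; the field name and analyzer are placeholders):

  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;

  QueryParser parser = new QueryParser("text", new WhitespaceAnalyzer());
  // lowercase the terms of expanded queries (fuzzy/prefix/wildcard/range)
  // so Moo~ matches the lowercased index terms just like moo~ does
  parser.setLowercaseExpandedTerms(true);
  Query q = parser.parse("Moo~");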


- Mark


Solr Gaze and Multicore?

2009-01-29 Thread Jacob Singh
Sorry if this is the wrong place to ask since Solr Gaze is Lucid's
project, but I was trying to install it in a multicore environment,
and it doesn't seem to be working.

It says to add the plugin to solr.home/lib.

Which solr.home? I go to /gaze and of course, it doesn't know where to
look.

Thanks,
Jacob

P.S. Congrats on the new company!

-- 

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


ranged query on multivalued field doesnt seem to work

2009-01-29 Thread zqzuk

Hi all,

in my schema I have two multivalued fields, roughly:

<field name="start_year" type="sfloat" multiValued="true" ... />
<field name="end_year" type="sfloat" multiValued="true" ... />

and I issued a query as: start_year:[400 TO *]. The result seems to be
incorrect, because I got some records with start year = -3000... and also
start year = -2147483647 (Integer.MIN_VALUE). Also, when I combine
start_year with end_year, it produces wrong results...

what could be wrong? Is it because I used the wrong field type "sfloat",
which should be integer?

Any hints would be very much appreciated!

many thanks!
-- 
View this message in context: 
http://www.nabble.com/ranged-query-on-multivalued-field-doesnt-seem-to-work-tp21731778p21731778.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: query with stemming, prefix and fuzzy?

2009-01-29 Thread Mark Miller
Truncation queries and stemming are difficult partners. You likely have
to accept a compromise. You can try using multiple fields like you are,
you can try indexing the full term at the same position as the stemmed
term, or you can accept the weirdness that comes from matching on a
stemmed form (potentially very confusing for a user).


In any case, though, a query parser that supports fuzzy queries should
not be analyzing them. What parser are you using? If it is analyzing the
fuzzy syntax, it likely doesn't support it.


Fuzzy queries are slow - especially if they match a lot of terms. A
BooleanQuery is created with a clause for each term, and then an edit
distance is calculated to filter out what doesn't match.


The prefix length determines how many terms are enumerated - with the
default of 0, every term is enumerated, I think. And an edit distance is
calculated to filter them out. That's really slow - a longer prefix will
significantly cut down the number of terms that need to be enumerated.


Think of mark~0.6 - with a 0 prefix I will enumerate every term and 
check the edit distance. With a 2 prefix I will only enumerate the terms 
that start with ma, and calculate an edit distance. One might be just a 
bit faster.


The latest trunk build of Lucene will let us switch fuzzy query to use a
constant score mode - this will eliminate the BooleanQuery and should
perform much better on a large index. Solr already uses a constant score
mode for Prefix and Wildcard queries.


How big is your index? If it's not that big, it may be odd that you're
seeing things that slow (the number of unique terms in the index will
play a large role).


- Mark

Gert Brinkmann wrote:

Hello,

I am trying to get Solr to properly work. I have set up a Solr test
server (using jetty as mentioned in the tutorial). Also I had to modify
the schema.xml so that I have different fields for different languages
(with their own stemmers) that occur in the content management system
that I am indexing. So far everything does work fine including snippet
highlighting.

But now I am having some problems with two things:

A) fuzzy search

When trying to do a fuzzy search the analyzers seem to break up a search
string like "house~0.6" into "house", "0" and "6" so that e.g. a single
"6" is highlighted, too. So I tried to use an additional raw-field
without any stemming and just a lower case and white space analyzer.
This seems to work fine. But fuzzy query is very slow and takes 100% CPU
for several seconds with only one query at a time.

What can I do to speed up the fuzzy query? I e.g. have found a Lucene
parameter prefixLength but no according Solr option. Does this exist?
Are there some other options to pay attention to?


B) combine stemming, prefix and fuzzy search

Is there a way to combine all this three query types in one query?
Especially stemming and prefixing? I think it would be problematic as a
"house*" would be analyzed to "house" with the usual analyzers that are
required for stemming?

Do I need different query type fields and combine them with an boolean
OR in the query? Something like

  data:house OR data_fuzzy:house~0.6 OR data_prefix:house*

This feels to be a little bit circuitous. Is there a way to use
"house*~.6" including correct stemming?

Thank you,
Gert
  




Re: I get SEVERE: Lock obtain timed out

2009-01-29 Thread Jon Drukman

Julian Davchev wrote:

Hi,
Any documents or something I can read on how locks work and how I can
controll it. When do they occur etc.
Cause only way I got out of this mess was restarting tomcat

SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
timed out: SingleInstanceLock: write.lock


Cheers,



Julian, have you had any luck figuring this out?  My production instance 
just started having this problem.  It seems to crop up after solr's been 
running for several hours.  Our usage is very light (maybe one query 
every few seconds).  I saw someone else mention an out of memory error - 
this machine has 8GB of RAM and is running 64bit Linux so it's all 
available to solr.  Our index is very small - under 40MB.  the solr 
process is using around 615MB of RAM according to top.




Re: Solr Gaze and Multicore?

2009-01-29 Thread Mark Miller

Jacob Singh wrote:

Sorry if this is wrong place to ask since Solr Gaze is Lucid's
proejct, but I was trying to install this in a multicore environment,
and it doesn't seem to be working.

It says to add the plugin to solr.home/lib.

Which solr.home?  I got to /gaze and of course, it doesn't know where to look.

Thanks,
Jacob

P.S. Congrats on the new company!

  

Hey Jacob,

Gaze currently only supports one core in a multi-core setup. So add the
plugin jar to the lib folder in the core you want to monitor. Then when
you are asked to set up the Solr URL, use "http://.../solr/corename".
I'll make that more clear in the setup info.


- Mark


permanently setting log level?

2009-01-29 Thread Jon Drukman
if i go to /solr/admin/logging, i can set the "root" log level to 
WARNING, which is what i want.  however, every time solr restarts, it is 
set back to INFO.  Is there a way to get the WARNING level to stick 
permanently?


-jsd-



Question about rating documents

2009-01-29 Thread Reece
Currently I'm using SOLR 1.2 to index a few million documents.  It's
been requested that a way for users to rate the documents be done so
that something rated higher would show up higher in search results and
vice versa.

I've been thinking about it, but can't come up with a good way to do
this and still have the "best match" ranking of the results according
to search terms entered by the users.

I was hoping someone had done something similar or would have some
insight on it.

Thanks in advance!

-Reece


Re: permanently setting log level?

2009-01-29 Thread Vannia Rajan
On Thu, Jan 29, 2009 at 11:55 PM, Jon Drukman  wrote:

> if i go to /solr/admin/logging, i can set the "root" log level to WARNING,
> which is what i want.  however, every time solr restarts, it is set back to
> INFO.  Is there a way to get the WARNING level to stick permanently?
>
>
Hi,
You can set permanent logging-level by changing parameters in
$CATALINA_HOME/conf/logging.properties

Change all INFO to WARNING in the logging.properties

where, $CATALINA_HOME is the path of your apache-tomcat.

-- 
With Regards,
K.Vanniarajan


Re: Question about rating documents

2009-01-29 Thread Matthew Runo
You could use a boost function to gently boost up items which were  
marked as more popular.


You would send the function query in the "bf" parameter with your  
query, and you can find out more about syntax here: http://wiki.apache.org/solr/FunctionQuery
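
For example (an untested sketch - "rating" is an assumed numeric field
holding the user score):

  http://localhost:8983/solr/select?qt=dismax&q=rain+boots&bf=linear(rating,0.5,1.0)

linear(rating,0.5,1.0) computes 0.5*rating + 1.0 and adds that to the
relevancy score, so well-rated documents rise without swamping the text
match.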


Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Jan 29, 2009, at 10:27 AM, Reece wrote:


Currently I'm using SOLR 1.2 to index a few million documents.  It's
been requested that a way for users to rate the documents be done so
that something rated higher would show up higher in search results and
vice versa.

I've been thinking about it, but can't come up with a good way to do
this and still have the "best match" ranking of the results according
to search terms entered by the users.

I was hoping someone had done something similar or would have some
insight on it.

Thanks in advance!

-Reece





Re: How to handle database replication delay when using DataImportHandler?

2009-01-29 Thread Gregg Donovan
Noble,

Thanks for the suggestion. The unfortunate thing is that we really don't
know ahead of time what sort of replication delay we're going to encounter
-- it could be one millisecond or it could be one hour. So, we end up
needing to do something like:

For delta-import run N:
1. query DB slave for "seconds_behind_master", use this to calculate
Date(N).
2. query DB slave for records updated since Date(N - 1)

I see there are plugin points for EventListener classes (onImportStart,
onImportEnd). Would those be the right spot to calculate these dates so that
I could expose them to my custom function at query time?

Thanks.

--Gregg

On Wed, Jan 28, 2009 at 11:20 PM, Noble Paul നോബിള്‍ नोब्ळ् <
noble.p...@gmail.com> wrote:

> The problem you are trying to solve is that you cannot use
> ${dataimporter.last_index_time} as is. you may need something like
> ${dataimporter.last_index_time} - 3secs
>
> am I right?
>
> There are no straight ways to do this .
> 1) you may write your own function, say 'lastIndexMinus3Secs', and add
> it. Functions can be plugged in to DIH using a <function
> name="lastIndexMinus3Secs" class="foo.Foo"/> under the <dataConfig>
> tag. And you can use it as
> ${dataimporter.functions.lastIndexMinus3Secs()}
> this will add to the existing in-built functions
>
> http://wiki.apache.org/solr/DataImportHandler#head-5675e913396a42eb7c6c5d3c894ada5dadbb62d7
>
> the class must extend org.apache.solr.handler.dataimport.Evaluator
>
> we may add a standard function for this too . you can raise an issue
> --Noble
>
>
>
> On Thu, Jan 29, 2009 at 6:26 AM, Gregg  wrote:
> > I'd like to use the DataImportHandler running against a slave database
> that,
> > at any given time, may be significantly behind the master DB. This can
> cause
> > updates to be missed if you use the clock-time as the "last_index_time."
> > E.g., if the slave catches up to the master between two delta-imports.
> >
> > Has anyone run into this? In our non-DIH indexing system we get around
> this
> > by either using the slave DB's seconds-behind-master or the max last
> update
> > time of the records returned.
> >
> > Thanks.
> >
> > Gregg
> >
>
>
>
> --
> --Noble Paul
>
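
A rough sketch of what such an Evaluator might look like (untested; the
class and function names are hypothetical, and it assumes the Solr 1.4
DIH API where Evaluator declares evaluate(String, Context)):

  import java.text.ParseException;
  import java.text.SimpleDateFormat;
  import java.util.Date;
  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.Evaluator;

  public class LastIndexMinus3Secs extends Evaluator {
      public String evaluate(String expression, Context context) {
          // assumes last_index_time resolves to the "yyyy-MM-dd HH:mm:ss"
          // string that DIH writes to dataimport.properties
          SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
          Object last = context.getVariableResolver()
                               .resolve("dataimporter.last_index_time");
          try {
              Date d = fmt.parse(last.toString());
              // shift back 3 seconds to absorb slave replication delay
              return fmt.format(new Date(d.getTime() - 3000L));
          } catch (ParseException e) {
              throw new RuntimeException(e);
          }
      }
  }

Registered as <function name="lastIndexMinus3Secs" class="LastIndexMinus3Secs"/>
and used as ${dataimporter.functions.lastIndexMinus3Secs()}.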


Re: I get SEVERE: Lock obtain timed out

2009-01-29 Thread Yonik Seeley
On Thu, Jan 29, 2009 at 1:16 PM, Jon Drukman  wrote:
> Julian, have you had any luck figuring this out?  My production instance
> just started having this problem.  It seems to crop up after solr's been
> running for several hours.  Our usage is very light (maybe one query every
> few seconds).  I saw someone else mention an out of memory error - this
> machine has 8GB of RAM and is running 64bit Linux so it's all available to
> solr.  Our index is very small - under 40MB.  the solr process is using
> around 615MB of RAM according to top.

I've only seen failure to remove the lock file either when an OOM
exception occured, or the JVM died or was killed.

-Yonik


Re: Solr Gaze and Multicore?

2009-01-29 Thread Jacob Singh
Hi Mark,

Thanks, I've got it working now.  Still waiting for the stats to update...

This is really cool! I've also been working pretty hard at an
automated benchmark suite using jmeter, rightscale and amazon web
services.  Next time I'm in Boston (March I think), it would be great
to show you.

Best,
Jacob

On Thu, Jan 29, 2009 at 1:16 PM, Mark Miller  wrote:
> Jacob Singh wrote:
>>
>> Sorry if this is wrong place to ask since Solr Gaze is Lucid's
>> proejct, but I was trying to install this in a multicore environment,
>> and it doesn't seem to be working.
>>
>> It says to add the plugin to solr.home/lib.
>>
>> Which solr.home?  I got to /gaze and of course, it doesn't know where to
>> look.
>>
>> Thanks,
>> Jacob
>>
>> P.S. Congrats on the new company!
>>
>>
>
> Hey Jacob,
>
> Gaze currently only supports one core in a multi-core setup. So add the
> plugin jar to the lib folder in the core you want to monitor. Then when you
> are asked to set up the Solr URL, use "http://.../solr/corename". I'll make
> that more clear in the setup info.
>
> - Mark
>



-- 

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


Re: permanently setting log level?

2009-01-29 Thread Jon Drukman

Vannia Rajan wrote:

On Thu, Jan 29, 2009 at 11:55 PM, Jon Drukman  wrote:


if i go to /solr/admin/logging, i can set the "root" log level to WARNING,
which is what i want.  however, every time solr restarts, it is set back to
INFO.  Is there a way to get the WARNING level to stick permanently?



Hi,
You can set permanent logging-level by changing parameters in
$CATALINA_HOME/conf/logging.properties

Change all INFO to WARNING in the logging.properties

where, $CATALINA_HOME is the path of your apache-tomcat.



i'm not using tomcat, i'm using the default jetty setup that comes with 
solr.  i grepped through the entire solr installation for 'INFO' but i 
don't see it.


i don't really know anything about jetty other than i have to run java 
-jar start.jar to get it to run solr.
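
(for what it's worth, solr 1.3 logs through java.util.logging, so with
the bundled jetty you can point the JVM at a JUL properties file - the
file name and path here are arbitrary:

  # logging.properties - set the root logger to WARNING
  .level = WARNING
  handlers = java.util.logging.ConsoleHandler
  java.util.logging.ConsoleHandler.level = WARNING

  java -Djava.util.logging.config.file=logging.properties -jar start.jar
)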




Solr 1.3 and spellcheck.onlyMorePopular=true

2009-01-29 Thread Nicholas Piasecki
Hello All,

I'm new to Solr, so forgive me if I'm overlooking something obvious. My
observation is that the spellcheck.onlyMorePopular property of the
SpellCheckComponent seems to not do what I expect.

If I send the query "calvin klien" to my data store, then the spell
checker correctly suggests "klein" for "klien," and running the new
"calvin klein" query returns the expected many product results.

However, when sending the correct query of "calvin klein," the spell
checker will suggest "cin2" (another brand name in our data store) for
"klein," and running that new "calvin cin2" collated query obviously
returns zero results.

It would seem to me that the "onlyMorePopular" property, when set to
true, only performs its calculation of popularity on the particular
misspelled word alone, and not the query as a whole. Since there are
indeed more C-IN2 brand products in our database, it returns "cin2" as
a spelling correction for "klein," seeing that the "cin2" token alone
returns many results but not bothering to check that "calvin cin2"
returns none. 

A less astonishing behavior would be for it to suggest "cin2", test to
see how many hits "calvin cin2" returns, see that it returns less than
"calvin klein", and then exclude that suggestion because it is not more
popular in the context of the original query.

So:

1 - Is my analysis correct? Is this really how it works?

2 - Is there a configuration setting that I can do to make the spell
checker use the desired behavior? Or, should I just immediately submit a
request with its collated suggestion with zero rows and do a
comparison on the results, effectively performing the "onlyMorePopular"
calculation myself?

Many thanks; so far, Solr is proving to be an excellent product!

V/R,
Nicholas Piasecki

Software Developer
Skiviez, Inc.
1-800-628-1693 x6003
n...@skiviez.com




Re: I get SEVERE: Lock obtain timed out

2009-01-29 Thread Jon Drukman

Yonik Seeley wrote:

On Thu, Jan 29, 2009 at 1:16 PM, Jon Drukman  wrote:

Julian, have you had any luck figuring this out?  My production instance
just started having this problem.  It seems to crop up after solr's been
running for several hours.  Our usage is very light (maybe one query every
few seconds).  I saw someone else mention an out of memory error - this
machine has 8GB of RAM and is running 64bit Linux so it's all available to
solr.  Our index is very small - under 40MB.  the solr process is using
around 615MB of RAM according to top.


I've only seen failure to remove the lock file either when an OOM
exception occured, or the JVM died or was killed.


i guess it's possible that we hit an out of memory error and the 
followup lock errors just bumped it out of the log file rotation.  i was 
running with multilog's default settings so my log files were getting 
thrown out very quickly.  i just bumped up the JVM's max heap size and 
told multilog to keep way more log files so if this happens again 
hopefully i will be able to get more info on what happened.


-jsd-



Re: Question about rating documents

2009-01-29 Thread Reece
Hmm, I already boost certain fields, but from what I know about it you
would need to know the boost value ahead of time, which is not possible
as it would be a different boost for each document depending on how it
was rated.

I did think of one thing though.  If I had a field that had a value of
1-5 for each document, and took that and used it to then add a boost
to the fields I was actually searching on (or the final score), that
would probably work. Is that possible?

-Reece



On Thu, Jan 29, 2009 at 1:51 PM, Matthew Runo  wrote:
> You could use a boost function to gently boost up items which were marked as
> more popular.
>
> You would send the function query in the "bf" parameter with your query, and
> you can find out more about syntax here:
> http://wiki.apache.org/solr/FunctionQuery
>
> Thanks for your time!
>
> Matthew Runo
> Software Engineer, Zappos.com
> mr...@zappos.com - 702-943-7833
>
> On Jan 29, 2009, at 10:27 AM, Reece wrote:
>
>> Currently I'm using SOLR 1.2 to index a few million documents.  It's
>> been requested that a way for users to rate the documents be done so
>> that something rated higher would show up higher in search results and
>> vice versa.
>>
>> I've been thinking about it, but can't come up with a good way to do
>> this and still have the "best match" ranking of the results according
>> to search terms entered by the users.
>>
>> I was hoping someone had done something similar or would have some
>> insight on it.
>>
>> Thanks in advance!
>>
>> -Reece
>>
>
>


Re: Optimizing & Improving results based on user feedback

2009-01-29 Thread Walter Underwood
Thanks, I didn't know there was so much research in this area.
Most of the papers at those workshops are about tuning the
entire ranking algorithm with machine learning techniques.

I am interested in adding one more feature, click data, to an
existing ranking algorithm. In my case, I have enough data to
use query-specific boosts instead of global document boosts.
We get about 2M search clicks per day from logged in users
(little or no click spam).

I'm checking out some papers from Thorsten Joachims and from
Microsoft Research that are specifically about clickthrough
feedback.

wunder

On 1/27/09 11:15 PM, "Neal Richter"  wrote:

> OK I've implemented this before, written academic papers and patents
> related to this task.
> 
> Here are some hints:
>- you're on the right track with the editorial boosting elevators
>- http://wiki.apache.org/solr/UserTagDesign
>- be darn careful about assuming that one click is enough evidence
> to boost a long 'distance'
>- first page effects in search will skew the learning badly if you
> don't compensate.
> 95% of users never go past the first page of results, 1% go
> past the second
> page.  So perfectly good results on the second page get
> permanently locked out
>- consider forgetting what you learn under some condition
> 
> In fact this whole area is called 'learning to rank' and is a hot
> research topic in IR.
> http://web.mit.edu/shivani/www/Ranking-NIPS-05/
> http://research.microsoft.com/en-us/um/people/lr4ir-2007/
> https://research.microsoft.com/en-us/um/people/lr4ir-2008/
> 
> - Neal Richter
> 
> 
> On Tue, Jan 27, 2009 at 2:06 PM, Matthew Runo  wrote:
>> Hello folks!
>> 
>> We've been thinking about ways to improve organic search results for a while
>> (really, who hasn't?) and I'd like to get some ideas on ways to implement a
>> feedback system that uses user behavior as input. Basically, it'd work on
>> the premise that what the user actually clicked on is probably a really good
>> match for their search, and should be boosted up in the results for that
>> search.
>> 
>> For example, if I search for "rain boots", and really love the 10th result
>> down (and show it by clicking on it), then we'd like to capture this and use
>> the data to boost up that result //for that search//. We've thought about
>> using index time boosts for the documents, but that'd boost it regardless of
>> the search terms, which isn't what we want. We've thought about using the
>> Elevator handler, but we don't really want to force a product to the top -
>> we'd prefer it slowly rises over time as more and more people click it from
>> the same search terms. Another way might be to stuff the keyword into the
>> document, the more times it's in the document the higher it'd score - but
>> there's gotta be a better way than that.
>> 
>> Obviously this can't be done 100% in solr - but if anyone had some clever
>> ideas about how this might be possible it'd be interesting to hear them.
>> 
>> Thanks for your time!
>> 
>> Matthew Runo
>> Software Engineer, Zappos.com
>> mr...@zappos.com - 702-943-7833
>> 
>> 



Re: Question about rating documents

2009-01-29 Thread Erick Erickson
This may not be practical, as it would involve re-indexing
all your documents periodically, but here goes anyway...

You could think about *index-time* boosts. Somewhere
you keep a record of the recommendations, then re-index
your corpus adding some suitable boost to each field in
your document based upon those recommendations.

From an old post on the Lucene list by Hoss:

<<<...index time field boosts are a way to express things
like "this documents title is worth twice as much as the title
of most documents...">>>

Which seems like what you're after.
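
In Solr's XML update format that looks something like this (boost
values and field names made up):

  <add>
    <doc boost="2.0">                    <!-- whole-document boost -->
      <field name="id">doc42</field>
      <field name="title" boost="1.4">rain boots</field>  <!-- per-field -->
    </doc>
  </add>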

But it may not be practical to re-index your corpus,
and the other interesting issue would be how you keep
track of documents, since the Lucene doc ID is probably
useless; you'd have to have your own unique, persistent
field.

Best
Erick

On Thu, Jan 29, 2009 at 2:27 PM, Reece  wrote:

> Hmm, I already boost certain fields, but from what I know about it you
> would need to know the boost value ahead of time which is not possible
> as it would be a different boost for each document depending on how it
> was rated..
>
> I did think of one thing though.  If I had a field that had a value of
> 1-5 for each document, and took that and used it to then add a boost
> to the fields I was actually searching on (or the final score) that
> would probably work, is that possible?
>
> -Reece
>
>
>
> On Thu, Jan 29, 2009 at 1:51 PM, Matthew Runo  wrote:
> > You could use a boost function to gently boost up items which were marked
> as
> > more popular.
> >
> > You would send the function query in the "bf" parameter with your query,
> and
> > you can find out more about syntax here:
> > http://wiki.apache.org/solr/FunctionQuery
> >
> > Thanks for your time!
> >
> > Matthew Runo
> > Software Engineer, Zappos.com
> > mr...@zappos.com - 702-943-7833
> >
> > On Jan 29, 2009, at 10:27 AM, Reece wrote:
> >
> >> Currently I'm using SOLR 1.2 to index a few million documents.  It's
> >> been requested that a way for users to rate the documents be done so
> >> that something rated higher would show up higher in search results and
> >> vice versa.
> >>
> >> I've been thinking about it, but can't come up with a good way to do
> >> this and still have the "best match" ranking of the results according
> >> to search terms entered by the users.
> >>
> >> I was hoping someone had done something similar or would have some
> >> insight on it.
> >>
> >> Thanks in advance!
> >>
> >> -Reece
> >>
> >
> >
>


Re: Solr 1.3 and spellcheck.onlyMorePopular=true

2009-01-29 Thread Mark Miller
I am not super familiar with the lucene/solr spell checking 
implementations, but here is my take:


By saying to only allow more popular, you are restricting suggestions to
only those that have a higher instance frequency in the index. The score
is still by edit distance, but only terms with a higher frequency than
the term passed will be suggested. I agree this is odd - it means you
should only pass in words that you know are misspelled. You can't count
on the spellchecker to kind of do that for you, as it does without the
more popular setting on.


So that is leaving you with a nasty suggestion. But it looks like the
edit distance for that suggestion is larger. What you might try is
adjusting the threshold (the min edit distance) to be a bit higher. That
may restrict that suggestion. It's not a great solution though - it's
likely to suggest something else :) Ideally, the spell checker should
probably be better at not suggesting when you have chosen a good word.
It doesn't care that you have a good word already - it sees another word
with greater frequency and within the edit distance allowed.
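
In the Solr config that threshold is, if I remember right, the "accuracy"
float on the spellchecker (it defaults to 0.5) - an untested sketch, with
the field name assumed:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <float name="accuracy">0.7</float>
    </lst>
  </searchComponent>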


If you don't set the more popular setting, upon finding a word in the
index, the spell checker returns the word passed in. With the more
popular setting on, you get the results you see - it still suggests,
but it specifically will not suggest the word you passed in itself (the
comment says 'that would be silly'). So you will likely see bad
suggestions for correct words with this setting.


- Mark

Nicholas Piasecki wrote:

Hello All,

I'm new to Solr, so forgive me if I'm overlooking something obvious. My
observation is that the spellcheck.onlyMorePopular property of the
SpellCheckComponent seems to not do what I expect.

If I send the query "calvin klien" to my data store, then the spell
checker correctly suggests "klein" for "klien," and running the new
"calvin klein" query returns the expected many product results.

However, when sending the correct query of "calvin klein," the spell
checker will suggest "cin2" (another brand name in our data store) for
"klein," and running that new "calvin cin2" collated query obviously
returns zero results.

It would seem to me that the "onlyMorePopular" property, when set to
true, only performs its calculation of popularity on the particular
misspelled word alone, and not the query as a whole. Since there are
indeed more C-IN2 brand products in our database, it returns "cin2" as
a spelling correction for "klein," seeing that the "cin2" token alone
returns many results but not bothering to check that "calvin cin2"
returns none. 


A less astonishing behavior would be for it to suggest "cin2", test to
see how many hits "calvin cin2" returns, see that it returns less than
"calvin klein", and then exclude that suggestion because it is not more
popular in the context of the original query.

So:

1 - Is my analysis correct? Is this really how it works?

2 - Is there a configuration setting that I can do to make the spell
checker use the desired behavior? Or, should I just immediately submit a
request with its collated suggestion with zero rows and do a
comparison on the results, effectively performing the "onlyMorePopular"
calculation myself?

Many thanks; so far, Solr is proving to be an excellent product!

V/R,
Nicholas Piasecki

Software Developer
Skiviez, Inc.
1-800-628-1693 x6003
n...@skiviez.com


  




Re: Solr 1.3 and spellcheck.onlyMorePopular=true

2009-01-29 Thread Mark Miller

Let me try that again. I think my email client is going nuts:

I am not super familiar with the lucene/solr spell checking 
implementations, but here is my take:


By saying to only allow more popular, you are restricting suggestions to 
only those that have a higher instance frequency in the index. The score 
is still by edit distance, but only terms with a higher frequency than 
the term passed will be suggested. I agree this is odd - it means you 
should only pass in words that you know are misspelled. You can't count 
on the spellchecker to do that for you, as it does when the more-popular 
setting is off.


So that leaves you with a nasty suggestion. But it looks like the 
edit distance for that suggestion is larger. What you might try is 
adjusting the threshold (the min edit distance) to be a bit higher. That 
may restrict that suggestion. It's not a great solution though - it's 
likely to suggest something else :) Ideally, the spell checker should 
probably be better at not suggesting when you have chosen a good word. 
It doesn't care that you already have a good word - it sees another word 
with greater frequency within the allowed edit distance.


If you don't set the more-popular setting, upon finding a word in the 
index, the spell checker returns the word passed in. With the more-popular 
setting on, you get the results you see - it still suggests, 
but it specifically will not suggest the word you passed in itself (the 
comment says, 'that would be silly'). So you will likely see bad 
suggestions for correct words with this setting.


- Mark




Nicholas Piasecki wrote:

Hello All,

I'm new to Solr, so forgive me if I'm overlooking something obvious. My
observation is that the spellcheck.onlyMorePopular property of the
SpellCheckComponent seems to not do what I expect.

If I send the query "calvin klien" to my data store, then the spell
checker correctly suggests "klein" for "klien," and running the new
"calvin klein" query returns the expected many product results.

However, when sending the correct query of "calvin klein," the spell
checker will suggest "cin2" (another brand name in our data store) for
"klein," and running that new "calvin cin2" collated query obviously
returns zero results.

It would seem to me that the "onlyMorePopular" property, when set to
true, only performs its calculation of popularity on the particular
misspelled word alone, and not the query as a whole. Since there are
indeed more C-IN2 brand products in our database, it returns "cin2" as
a spelling correction for "klein," seeing that the "cin2" token alone
returns many results but not bothering to check that "calvin cin2"
returns none. 


A less astonishing behavior would be for it to suggest "cin2", test to
see how many hits "calvin cin2" returns, see that it returns less than
"calvin klein", and then exclude that suggestion because it is not more
popular in the context of the original query.

So:

1 - Is my analysis correct? Is this really how it works?

2 - Is there a configuration setting that I can do to make the spell
checker use the desired behavior? Or, should I just immediately submit a
request with its correlated suggestion with zero rows and do a
comparison on the results, effectively performing the "onlyMorePopular"
calculation myself?

Many thanks; so far, Solr is proving to be an excellent product!

V/R,
Nicholas Piasecki

Software Developer
Skiviez, Inc.
1-800-628-1693 x6003
n...@skiviez.com


  




Re: Question about rating documents

2009-01-29 Thread Reece
Re-indexing so much would be a pretty big pain.   I do have a unique
ID for each document though that I use for updating them every day as
they change.

-Reece



On Thu, Jan 29, 2009 at 2:40 PM, Erick Erickson  wrote:
> This may not be practical, as it would involve re-indexing
> all your documents periodically, but here goes anyway...
>
> You could think about *index-time* boosts. Somewhere
> you keep a record of the recommendations, then re-index
> your corpus adding some suitable boost to each field in
> your document based upon those recommendations.
>
> From an old post on the Lucene list by Hoss:
>
> <<<...index time field boosts are a way to express things
> like "this documents title is worth twice as much as the title
> of most documents...">>>
>
> Which seems like what you're after.
>
> But it may not be practical to re-index your corpus,
> and the other interesting issue would be how you keep
> track of documents since the Lucene doc ID is probably
> useless, you'd have to have your own unique, persistent
> field.
>
> Best
> Erick
>
> On Thu, Jan 29, 2009 at 2:27 PM, Reece  wrote:
>
>> Hmm, I already boost certain fields, but from what I know about it you
>> would need to know the boost value ahead of time which is not possible
>> as it would be a different boost for each document depending on how it
>> was rated..
>>
>> I did think of one thing though.  If I had a field that had a value of
>> 1-5 for each document, and took that and used it to then add a boost
>> to the fields I was actually searching on (or the final score) that
>> would probably work, is that possible?
>>
>> -Reece
>>
>>
>>
>> On Thu, Jan 29, 2009 at 1:51 PM, Matthew Runo  wrote:
>> > You could use a boost function to gently boost up items which were marked
>> as
>> > more popular.
>> >
>> > You would send the function query in the "bf" parameter with your query,
>> and
>> > you can find out more about syntax here:
>> > http://wiki.apache.org/solr/FunctionQuery
>> >
>> > Thanks for your time!
>> >
>> > Matthew Runo
>> > Software Engineer, Zappos.com
>> > mr...@zappos.com - 702-943-7833
>> >
>> > On Jan 29, 2009, at 10:27 AM, Reece wrote:
>> >
>> >> Currently I'm using SOLR 1.2 to index a few million documents.  It's
>> >> been requested that a way for users to rate the documents be done so
>> >> that something rated higher would show up higher in search results and
>> >> vice versa.
>> >>
>> >> I've been thinking about it, but can't come up with a good way to do
>> >> this and still have the "best match" ranking of the results according
>> >> to search terms entered by the users.
>> >>
>> >> I was hoping someone had done something similar or would have some
>> >> insight on it.
>> >>
>> >> Thanks in advance!
>> >>
>> >> -Reece
>> >>
>> >
>> >
>>
>
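A concrete sketch of the boost-function approach Matthew suggests above,
assuming a dismax handler and that a numeric "rating" field exists in the
schema (the field name and the weight here are hypothetical):

   /solr/select?qt=dismax&q=something&bf=product(rating,0.5)

With dismax, the value of the bf function query is added to each document's
relevancy score, so highly rated documents drift upward without overriding
text relevance.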


RE: Solr 1.3 and spellcheck.onlyMorePopular=true

2009-01-29 Thread Nicholas Piasecki
Thanks for this lucid explanation. 

Indeed, turning the option off seems to give more intelligent results. I
think that this was more of an example of me seeing "onlyMorePopular"
and thinking "hmm, that must be good!" without fully understanding the
consequences of the setting.

The key point in your explanation is that with the "onlyMorePopular"
setting, you are *intentionally* sending it misspelled words. Now that
you've explained it, this makes sense to me now.

Thanks again.

V/R,
Nicholas Piasecki

Software Developer
Skiviez, Inc.
1-800-628-1693 x6003
n...@skiviez.com


-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Thursday, January 29, 2009 2:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 1.3 and spellcheck.onlyMorePopular=true

Let me try that again. I think my email client is going nuts:

I am not super familiar with the lucene/solr spell checking 
implementations, but here is my take:

By saying to only allow more popular, you are restricting suggestions to 
only those that have a higher instance frequency in the index. The score 
is still by edit distance, but only terms with a higher frequency than 
the term passed will be suggested. I agree this is odd - it means you 
should only pass in words that you know are misspelled. You can't count 
on the spellchecker to do that for you, as it does when the more-popular 
setting is off.

So that leaves you with a nasty suggestion. But it looks like the 
edit distance for that suggestion is larger. What you might try is 
adjusting the threshold (the min edit distance) to be a bit higher. That 
may restrict that suggestion. It's not a great solution though - it's 
likely to suggest something else :) Ideally, the spell checker should 
probably be better at not suggesting when you have chosen a good word. 
It doesn't care that you already have a good word - it sees another word 
with greater frequency within the allowed edit distance.

If you don't set the more-popular setting, upon finding a word in the 
index, the spell checker returns the word passed in. With the more-popular 
setting on, you get the results you see - it still suggests, 
but it specifically will not suggest the word you passed in itself (the 
comment says, 'that would be silly'). So you will likely see bad 
suggestions for correct words with this setting.

- Mark




Nicholas Piasecki wrote:
> Hello All,
>
> I'm new to Solr, so forgive me if I'm overlooking something obvious. My
> observation is that the spellcheck.onlyMorePopular property of the
> SpellCheckComponent seems to not do what I expect.
>
> If I send the query "calvin klien" to my data store, then the spell
> checker correctly suggests "klein" for "klien," and running the new
> "calvin klein" query returns the expected many product results.
>
> However, when sending the correct query of "calvin klein," the spell
> checker will suggest "cin2" (another brand name in our data store) for
> "klein," and running that new "calvin cin2" collated query obviously
> returns zero results.
>
> It would seem to me that the "onlyMorePopular" property, when set to
> true, only performs its calculation of popularity on the particular
> misspelled word alone, and not the query as a whole. Since there are
> indeed more C-IN2 brand products in our database, it returns "cin2" as
> a spelling correction for "klein," seeing that the "cin2" token alone
> returns many results but not bothering to check that "calvin cin2"
> returns none. 
>
> A less astonishing behavior would be for it to suggest "cin2", test to
> see how many hits "calvin cin2" returns, see that it returns less than
> "calvin klein", and then exclude that suggestion because it is not
more
> popular in the context of the original query.
>
> So:
>
> 1 - Is my analysis correct? Is this really how it works?
>
> 2 - Is there a configuration setting that I can do to make the spell
> checker use the desired behavior? Or, should I just immediately submit a
> request with its correlated suggestion with zero rows and do a
> comparison on the results, effectively performing the "onlyMorePopular"
> calculation myself?
>
> Many thanks; so far, Solr is proving to be an excellent product!
>
> V/R,
> Nicholas Piasecki
>
> Software Developer
> Skiviez, Inc.
> 1-800-628-1693 x6003
> n...@skiviez.com
>
>
>   
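For reference, the difference in behavior discussed in this thread can be
seen by toggling the parameter on an otherwise identical request; a sketch,
assuming a request handler with the SpellCheckComponent registered at /spell
(adjust the path and parameters to your configuration):

   /solr/spell?q=calvin+klein&spellcheck=true&spellcheck.onlyMorePopular=true
   /solr/spell?q=calvin+klein&spellcheck=true&spellcheck.onlyMorePopular=false

With onlyMorePopular=true, any in-range term with a higher document frequency
than "klein" may be suggested even though "klein" is spelled correctly; with
it off, a term that already exists in the index is returned as-is.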



Re: permanently setting log level?

2009-01-29 Thread Vannia Rajan
>
> I'm not using Tomcat, I'm using the default Jetty setup that comes with
> Solr.  I grepped through the entire Solr installation for 'INFO' but I don't
> see it.
>
> I don't really know anything about Jetty other than that I have to run
> java -jar start.jar to get it to run Solr.
>
>
If you are not using Tomcat, then you must find your solution here:
http://wiki.apache.org/solr/LoggingInDefaultJettySetup

-- 
With Regards,
K.Vanniarajan
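
The short version of that wiki page: the default Jetty setup uses JDK
logging, so the level can be pinned with a logging.properties file passed at
startup. A minimal sketch (the file name and location are your choice):

   # logging.properties - suppress INFO-level chatter
   .level=WARNING
   handlers=java.util.logging.ConsoleHandler
   java.util.logging.ConsoleHandler.level=WARNING

and then start Jetty with:

   java -Djava.util.logging.config.file=logging.properties -jar start.jar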


Re: Question about rating documents

2009-01-29 Thread Reece
Okay, so what if I added a "rating" field that users could update, say
from 1-5, and then did something like this:

/solr/select?indent=on&debugQuery=on&rows=99&q=body:+something AND
type:I _val_:product(score, rating); _val_ desc, id desc

Would that sort the result set by the product of the score and the rating?

-Reece

On Thu, Jan 29, 2009 at 2:47 PM, Reece  wrote:
> Re-indexing so much would be a pretty big pain.   I do have a unique
> ID for each document though that I use for updating them every day as
> they change.
>
> -Reece
>
>
>
> On Thu, Jan 29, 2009 at 2:40 PM, Erick Erickson  
> wrote:
>> This may not be practical, as it would involve re-indexing
>> all your documents periodically, but here goes anyway...
>>
>> You could think about *index-time* boosts. Somewhere
>> you keep a record of the recommendations, then re-index
>> your corpus adding some suitable boost to each field in
>> your document based upon those recommendations.
>>
>> From an old post on the Lucene list by Hoss:
>>
>> <<<...index time field boosts are a way to express things
>> like "this documents title is worth twice as much as the title
>> of most documents...">>>
>>
>> Which seems like what you're after.
>>
>> But it may not be practical to re-index your corpus,
>> and the other interesting issue would be how you keep
>> track of documents since the Lucene doc ID is probably
>> useless, you'd have to have your own unique, persistent
>> field.
>>
>> Best
>> Erick
>>
>> On Thu, Jan 29, 2009 at 2:27 PM, Reece  wrote:
>>
>>> Hmm, I already boost certain fields, but from what I know about it you
>>> would need to know the boost value ahead of time which is not possible
>>> as it would be a different boost for each document depending on how it
>>> was rated..
>>>
>>> I did think of one thing though.  If I had a field that had a value of
>>> 1-5 for each document, and took that and used it to then add a boost
>>> to the fields I was actually searching on (or the final score) that
>>> would probably work, is that possible?
>>>
>>> -Reece
>>>
>>>
>>>
>>> On Thu, Jan 29, 2009 at 1:51 PM, Matthew Runo  wrote:
>>> > You could use a boost function to gently boost up items which were marked
>>> as
>>> > more popular.
>>> >
>>> > You would send the function query in the "bf" parameter with your query,
>>> and
>>> > you can find out more about syntax here:
>>> > http://wiki.apache.org/solr/FunctionQuery
>>> >
>>> > Thanks for your time!
>>> >
>>> > Matthew Runo
>>> > Software Engineer, Zappos.com
>>> > mr...@zappos.com - 702-943-7833
>>> >
>>> > On Jan 29, 2009, at 10:27 AM, Reece wrote:
>>> >
>>> >> Currently I'm using SOLR 1.2 to index a few million documents.  It's
>>> >> been requested that a way for users to rate the documents be done so
>>> >> that something rated higher would show up higher in search results and
>>> >> vice versa.
>>> >>
>>> >> I've been thinking about it, but can't come up with a good way to do
>>> >> this and still have the "best match" ranking of the results according
>>> >> to search terms entered by the users.
>>> >>
>>> >> I was hoping someone had done something similar or would have some
>>> >> insight on it.
>>> >>
>>> >> Thanks in advance!
>>> >>
>>> >> -Reece
>>> >>
>>> >
>>> >
>>>
>>
>


got background_merge_hit_exception during optimization

2009-01-29 Thread Qingdi

We got the following background_merge_hit_exception during optimization:
exception:
background merge hit exception: _4zsg:C136887658 _50nf:C995992 _51i9:C995977 _52d5:C995968 _537y:C995999 _54xm:C1892345 _54xl:C99593 into _54xn [optimize]
java.io.IOException: background merge hit exception: _4zsg:C136887658 _50nf:C995992 _51i9:C995977 _52d5:C995968 _537y:C995999 _54xm:C1892345 _54xl:C99593 into _54xn [optimize]
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
        at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:77)
        at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:104)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:113)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at ...

Does anyone know what could be the cause of the exception? What should we do
to prevent this type of exception?

Some posts in the Lucene forum say the exception is usually related to a
disk space issue, but there should be enough disk space on our system: our
index size was about 56G, and before optimization the disk had about 360G
of free space.

After the above background merge exception was raised, Solr kept generating
new segment files, which ate up all the CPU time and disk space, so we
had to kill the Solr server.

Thanks for your help.

Qingdi





Re: warmupTime : 0

2009-01-29 Thread Yonik Seeley
On Thu, Jan 29, 2009 at 12:12 PM, sunnyfr  wrote:
> Do you think it's normal to have warmupTime : 0 ??

Sure, if the caches were empty or almost empty (say on startup).

-Yonik


RE: warmupTime : 0

2009-01-29 Thread Feak, Todd
This usually represents anything less than 8ms if you are on a Windows
system. The granularity of timing on Windows systems is around 16ms.

-Todd feak

-Original Message-
From: sunnyfr [mailto:johanna...@gmail.com] 
Sent: Thursday, January 29, 2009 9:13 AM
To: solr-user@lucene.apache.org
Subject: warmupTime : 0


Hi,

Do you think it's normal to have warmupTime : 0 ??

searcher  
class:  org.apache.solr.search.SolrIndexSearcher  
version:1.0  
description:index searcher  
stats:  searcherName : searc...@6f7cf6b6 main
caching : true
numDocs : 8207035
maxDoc : 8239991
readerImpl : ReadOnlyMultiSegmentReader
readerDir :
org.apache.lucene.store.FSDirectory@/data/solr/video/data/index
indexVersion : 1228743257996
openedAt : Thu Jan 29 17:42:08 CET 2009
registeredAt : Thu Jan 29 17:42:09 CET 2009
warmupTime : 0 

I have around 12M of data.

Thanks a lot,





Re: got background_merge_hit_exception during optimization

2009-01-29 Thread Otis Gospodnetic
Hi,

I didn't look into this deeply, but you didn't say which version of Solr you 
are using (looks like it might be 1.3).  If using a nightly build is an option, 
you might try that instead - Yonik updated the Lucene jars recently and that 
might be enough to solve this problem.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Qingdi 
> To: solr-user@lucene.apache.org
> Sent: Thursday, January 29, 2009 4:06:10 PM
> Subject: got background_merge_hit_exception during optimization
> 
> 
> We got the following background_merge_hit_exception during optimization:
> exception:
> background merge hit exception: _4zsg:C136887658 _50nf:C995992 _51i9:C995977 _52d5:C995968 _537y:C995999 _54xm:C1892345 _54xl:C99593 into _54xn [optimize]
> java.io.IOException: background merge hit exception [full stack trace as in the original message above, snipped]
> 
> Does anyone know what could be the cause of the exception? What should we do
> to prevent this type of exception?
> 
> Some posts in the Lucene forum say the exception is usually related to a
> disk space issue, but there should be enough disk space on our system: our
> index size was about 56G, and before optimization the disk had about 360G
> of free space.
> 
> After the above background merge exception was raised, Solr kept generating
> new segment files, which ate up all the CPU time and disk space, so we
> had to kill the Solr server.
> 
> Thanks for your help.
> 
> Qingdi
> 
> 



Re: Question about rating documents

2009-01-29 Thread Otis Gospodnetic
Reece,

Solr does have the ability to read custom field values from an external file.  
This is suitable for cases where these values change a lot.  You might want to 
consider that instead of updating the index.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Reece 
> To: solr-user@lucene.apache.org
> Sent: Thursday, January 29, 2009 3:31:22 PM
> Subject: Re: Question about rating documents
> 
> Okay, so what if I added a "rating" field that users could update, say
> from 1-5, and then did something like this:
> 
> /solr/select?indent=on&debugQuery=on&rows=99&q=body:+something AND
> type:I _val_:product(score, rating); _val_ desc, id desc
> 
> Would that sort the result set by the product of the score and the rating?
> 
> -Reece
> 
> On Thu, Jan 29, 2009 at 2:47 PM, Reece wrote:
> > Re-indexing so much would be a pretty big pain.   I do have a unique
> > ID for each document though that I use for updating them every day as
> > they change.
> >
> > -Reece
> >
> >
> >
> > On Thu, Jan 29, 2009 at 2:40 PM, Erick Erickson 
> wrote:
> >> This may not be practical, as it would involve re-indexing
> >> all your documents periodically, but here goes anyway...
> >>
> >> You could think about *index-time* boosts. Somewhere
> >> you keep a record of the recommendations, then re-index
> >> your corpus adding some suitable boost to each field in
> >> your document based upon those recommendations.
> >>
> >> From an old post on the Lucene list by Hoss:
> >>
> >> <<<...index time field boosts are a way to express things
> >> like "this documents title is worth twice as much as the title
> >> of most documents...">>>
> >>
> >> Which seems like what you're after.
> >>
> >> But it may not be practical to re-index your corpus,
> >> and the other interesting issue would be how you keep
> >> track of documents since the Lucene doc ID is probably
> >> useless, you'd have to have your own unique, persistent
> >> field.
> >>
> >> Best
> >> Erick
> >>
> >> On Thu, Jan 29, 2009 at 2:27 PM, Reece wrote:
> >>
> >>> Hmm, I already boost certain fields, but from what I know about it you
> >>> would need to know the boost value ahead of time which is not possible
> >>> as it would be a different boost for each document depending on how it
> >>> was rated..
> >>>
> >>> I did think of one thing though.  If I had a field that had a value of
> >>> 1-5 for each document, and took that and used it to then add a boost
> >>> to the fields I was actually searching on (or the final score) that
> >>> would probably work, is that possible?
> >>>
> >>> -Reece
> >>>
> >>>
> >>>
> >>> On Thu, Jan 29, 2009 at 1:51 PM, Matthew Runo wrote:
> >>> > You could use a boost function to gently boost up items which were 
> >>> > marked
> >>> as
> >>> > more popular.
> >>> >
> >>> > You would send the function query in the "bf" parameter with your query,
> >>> and
> >>> > you can find out more about syntax here:
> >>> > http://wiki.apache.org/solr/FunctionQuery
> >>> >
> >>> > Thanks for your time!
> >>> >
> >>> > Matthew Runo
> >>> > Software Engineer, Zappos.com
> >>> > mr...@zappos.com - 702-943-7833
> >>> >
> >>> > On Jan 29, 2009, at 10:27 AM, Reece wrote:
> >>> >
> >>> >> Currently I'm using SOLR 1.2 to index a few million documents.  It's
> >>> >> been requested that a way for users to rate the documents be done so
> >>> >> that something rated higher would show up higher in search results and
> >>> >> vice versa.
> >>> >>
> >>> >> I've been thinking about it, but can't come up with a good way to do
> >>> >> this and still have the "best match" ranking of the results according
> >>> >> to search terms entered by the users.
> >>> >>
> >>> >> I was hoping someone had done something similar or would have some
> >>> >> insight on it.
> >>> >>
> >>> >> Thanks in advance!
> >>> >>
> >>> >> -Reece
> >>> >>
> >>> >
> >>> >
> >>>
> >>
> >
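
The mechanism Otis mentions is ExternalFileField. A minimal sketch of how
it might be wired up in schema.xml, assuming "id" is the unique key (the
type and field names here are hypothetical):

   <fieldType name="ratingFile" class="solr.ExternalFileField"
              keyField="id" defVal="0" stored="false" indexed="false"
              valType="float"/>
   <field name="rating" type="ratingFile"/>

The values then live in a plain-text file named external_rating in the
index directory, one key=value pair per line:

   doc1=4.5
   doc2=3.0

The field can be used in function queries (e.g. bf=rating with dismax), and
the file can be regenerated as often as needed; the new values are picked up
when a new searcher is opened (e.g. on commit), without re-indexing any
documents.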



Re: Solr Gaze and Multicore?

2009-01-29 Thread Mark Miller

Jacob Singh wrote:

This is really cool! I've also been working pretty hard at an
automated benchmark suite using JMeter, RightScale and Amazon Web
Services.  Next time I'm in Boston (March I think), it would be great
to show you.
  
That sounds excellent! One problem with benchmarking Solr's efficiency is 
that it's hard to put a meaningful query load on a Solr server with only a 
processor core or two and a single connection. That's not even getting 
into a distributed setup. An auto-scaling EC2 JMeter benchmark tool 
would be extremely useful for people that expect a lot of traffic. That's 
a great idea!


- Mark


Re: Highlighting does not work?

2009-01-29 Thread Mike Klaas

Thanks, Jarek.

-Mike

On 29-Jan-09, at 12:20 AM, Jarek Zgoda wrote:

Added an appropriate amendment to the FAQ, but I'd consider reorganizing  
the information in the whole wiki, like creating a section titled  
"Common Tasks". A bit of redundancy does not hurt when it comes to  
documentation.


Message written on 2009-01-28, at 20:01, by Mike Klaas:

Well, both pages I listed are in the search results :).  But I  
agree that it isn't obvious to find, and that it should be  
improved.  (The Wiki is a community-created site which anyone can  
contribute to, incidentally.)


cheers,
-Mike

On 28-Jan-09, at 1:11 AM, Jarek Zgoda wrote:

I swear I was looking for this information in the Solr wiki. See for  
yourself whether it is accessible at all:


http://wiki.apache.org/solr/?action=fullsearch&context=180&value=highlight&fullsearch=Text

Message written on 2009-01-28, at 00:58, by Mike Klaas:


They are documented in http://wiki.apache.org/solr/FieldOptionsByUseCase 
and in the FAQ, but I agree that it could be more readily  
accessible.


-Mike

On 27-Jan-09, at 5:26 AM, Jarek Zgoda wrote:

Finally found that the fields have to have an analyzer to be  
highlighted. Neat.


Can I ask somebody to document all these requirements?

Message written on 2009-01-27, at 13:49, by Jarek Zgoda:


I turned these fields to indexed + stored, but the results are  
exactly the same, no matter whether I search in these fields or  
elsewhere.


Message written on 2009-01-27, at 13:09, by Jarek Zgoda:



Solr 1.3

I'm trying to get highlighting working, with no luck so far.

Query with params  
q=cyrus&fl=*,score&qt=standard&hl=true&hl.fl=title+description  
finds 182 documents in my index. All of the top 10 hits  
contain the word "cyrus", but the highlights list is empty.  
The fields "title" and "description" are stored but not  
indexed. If I specify "*" as the hl.fl value, I get the same results.


Do I need to add some special configuration to enable the  
highlighting feature?


--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zg...@redefine.pl



--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zg...@redefine.pl



--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zg...@redefine.pl





--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zg...@redefine.pl





--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zg...@redefine.pl
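
The requirement Jarek ran into boils down to declaring each field to be
highlighted with an analyzed field type, both indexed and stored, e.g. in
schema.xml (using the "text" type from the example schema):

   <field name="title" type="text" indexed="true" stored="true"/>
   <field name="description" type="text" indexed="true" stored="true"/>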





Re: Optimizing & Improving results based on user feedback

2009-01-29 Thread Matthew Runo
Agreed, it seems that a lot of the algorithms in these papers would  
almost be a whole new RequestHandler à la Dismax. Luckily a lot of them  
seem to be built on Lucene (at least the ones that I looked at that  
had code samples).


Which papers did you see that actually talked about using clicks? I  
don't see those, beyond "Addressing Malicious Noise in Clickthrough  
Data" by Filip Radlinski and also his "Query Chains: Learning to Rank  
from Implicit Feedback" - but neither is really on topic.


Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Jan 29, 2009, at 11:36 AM, Walter Underwood wrote:


Thanks, I didn't know there was so much research in this area.
Most of the papers at those workshops are about tuning the
entire ranking algorithm with machine learning techniques.

I am interested in adding one more feature, click data, to an
existing ranking algorithm. In my case, I have enough data to
use query-specific boosts instead of global document boosts.
We get about 2M search clicks per day from logged in users
(little or no click spam).

I'm checking out some papers from Thorsten Joachims and from
Microsoft Research that are specifically about clickthrough
feedback.

wunder

On 1/27/09 11:15 PM, "Neal Richter"  wrote:


OK I've implemented this before, written academic papers and patents
related to this task.

Here are some hints:
   - you're on the right track with the editorial boosting elevators
   - http://wiki.apache.org/solr/UserTagDesign
   - be darn careful about assuming that one click is enough evidence
     to boost a long 'distance'
   - first-page effects in search will skew the learning badly if you
     don't compensate. 95% of users never go past the first page of
     results, 1% go past the second page, so perfectly good results on
     the second page can get permanently locked out
   - consider forgetting what you learn under some condition

In fact this whole area is called 'learning to rank' and is a hot
research topic in IR.
http://web.mit.edu/shivani/www/Ranking-NIPS-05/
http://research.microsoft.com/en-us/um/people/lr4ir-2007/
https://research.microsoft.com/en-us/um/people/lr4ir-2008/

- Neal Richter


On Tue, Jan 27, 2009 at 2:06 PM, Matthew Runo   
wrote:

Hello folks!

We've been thinking about ways to improve organic search results  
for a while
(really, who hasn't?) and I'd like to get some ideas on ways to  
implement a
feedback system that uses user behavior as input. Basically, it'd  
work on
the premise that what the user actually clicked on is probably a  
really good
match for their search, and should be boosted up in the results  
for that

search.

For example, if I search for "rain boots", and really love the  
10th result
down (and show it by clicking on it), then we'd like to capture  
this and use
the data to boost up that result //for that search//. We've  
thought about
using index time boosts for the documents, but that'd boost it  
regardless of
the search terms, which isn't what we want. We've thought about  
using the
Elevator handler, but we don't really want to force a product to  
the top -
we'd prefer it slowly rises over time as more and more people  
click it from
the same search terms. Another way might be to stuff the keyword  
into the
document, the more times it's in the document the higher it'd  
score - but

there's gotta be a better way than that.

Obviously this can't be done 100% in solr - but if anyone had some  
clever
ideas about how this might be possible it'd be interesting to hear  
them.


Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833








Re: Optimizing & Improving results based on user feedback

2009-01-29 Thread Walter Underwood
"A Decision Theoretic Framework for Ranking using Implicit Feedback"
uses clicks, but the best part of that paper is all the side comments
about difficulties in evaluation. For example, if someone clicks on
three results, is that three times as good or two failures and a
success? We have to know the information need to decide. That paper
is in the LR4IR 2008 proceedings.

Both Radlinski and Joachims seem to be focusing on click data.

I'm thinking of something much simpler, like taking the first
N hits and reordering those before returning. Brute force, but
would get most of the benefit. Usually, you only have reliable
click data for a small number of documents on each query, so
it is a waste of time to rerank the whole list. Besides, if you
need to move something up 100 places on the list, you should
probably be tuning your regular scoring rather than patching
it with click data.

wunder

On 1/29/09 3:43 PM, "Matthew Runo"  wrote:

> Agreed, it seems that a lot of the algorithms in these papers would
> almost be a whole new RequestHandler à la Dismax. Luckily a lot of them
> seem to be built on Lucene (at least the ones that I looked at that
> had code samples).
> 
> Which papers did you see that actually talked about using clicks? I
> don't see those, beyond "Addressing Malicious Noise in Clickthrough
> Data" by Filip Radlinski and also his "Query Chains: Learning to Rank
> from Implicit Feedback" - but neither is really on topic.
> 
> Thanks for your time!
> 
> Matthew Runo
> Software Engineer, Zappos.com
> mr...@zappos.com - 702-943-7833
> 
> On Jan 29, 2009, at 11:36 AM, Walter Underwood wrote:
> 
>> Thanks, I didn't know there was so much research in this area.
>> Most of the papers at those workshops are about tuning the
>> entire ranking algorithm with machine learning techniques.
>> 
>> I am interested in adding one more feature, click data, to an
>> existing ranking algorithm. In my case, I have enough data to
>> use query-specific boosts instead of global document boosts.
>> We get about 2M search clicks per day from logged in users
>> (little or no click spam).
>> 
>> I'm checking out some papers from Thorsten Joachims and from
>> Microsoft Research that are specifically about clickthrough
>> feedback.
>> 
>> wunder
>> 
>> On 1/27/09 11:15 PM, "Neal Richter"  wrote:
>> 
>>> OK I've implemented this before, written academic papers and patents
>>> related to this task.
>>> 
>>> Here are some hints:
>>>   - you're on the right track with the editorial boosting elevators
>>>   - http://wiki.apache.org/solr/UserTagDesign
>>>   - be darn careful about assuming that one click is enough evidence
>>>     to boost a long 'distance'
>>>   - first-page effects in search will skew the learning badly if you
>>>     don't compensate. 95% of users never go past the first page of
>>>     results, 1% go past the second page, so perfectly good results on
>>>     the second page can get permanently locked out
>>>   - consider forgetting what you learn under some condition
>>> 
>>> In fact this whole area is called 'learning to rank' and is a hot
>>> research topic in IR.
>>> http://web.mit.edu/shivani/www/Ranking-NIPS-05/
>>> http://research.microsoft.com/en-us/um/people/lr4ir-2007/
>>> https://research.microsoft.com/en-us/um/people/lr4ir-2008/
>>> 
>>> - Neal Richter
>>> 
>>> 
>>> On Tue, Jan 27, 2009 at 2:06 PM, Matthew Runo 
>>> wrote:
 Hello folks!
 
 We've been thinking about ways to improve organic search results
 for a while
 (really, who hasn't?) and I'd like to get some ideas on ways to
 implement a
 feedback system that uses user behavior as input. Basically, it'd
 work on
 the premise that what the user actually clicked on is probably a
 really good
 match for their search, and should be boosted up in the results
 for that
 search.
 
 For example, if I search for "rain boots", and really love the
 10th result
 down (and show it by clicking on it), then we'd like to capture
 this and use
 the data to boost up that result //for that search//. We've
 thought about
 using index time boosts for the documents, but that'd boost it
 regardless of
 the search terms, which isn't what we want. We've thought about
 using the
 Elevator handler, but we don't really want to force a product to
 the top -
 we'd prefer it slowly rises over time as more and more people
 click it from
 the same search terms. Another way might be to stuff the keyword
 into the
 document, the more times it's in the document the higher it'd
 score - but
 there's gotta be a better way than that.
 
 Obviously this can't be done 100% in solr - but if anyone had some
 clever
 ideas about how this might be possible it'd be interesting to hear
 them.
 
 Thanks for your time!
 
 Matthew Runo
 Software Engineer, Zappos.com
 mr...@zappos.com - 702-943-7833
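
A back-of-the-envelope sketch of the reordering Walter describes, done
client-side after the results come back (every name here is hypothetical;
this is not a Solr API):

   import java.util.*;

   /** Blend the engine score with per-query click feedback for the
    *  first n results only; documents beyond n keep their order. */
   public class ClickRerank {
       static class Result {
           final String id;
           final double score;
           Result(String id, double score) { this.id = id; this.score = score; }
       }

       static void rerankTopN(List<Result> results, Map<String, Double> clickScores,
                              String query, int n, double alpha) {
           int limit = Math.min(n, results.size());
           List<Result> top = new ArrayList<>(results.subList(0, limit));
           // combined score: engine score plus a weighted click score for this query
           top.sort(Comparator.comparingDouble((Result r) ->
                   r.score + alpha * clickScores.getOrDefault(query + "|" + r.id, 0.0))
                   .reversed());
           for (int i = 0; i < limit; i++) {
               results.set(i, top.get(i));
           }
       }
   }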
 
 

Re: How to handle database replication delay when using DataImportHandler?

2009-01-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
Yeah that is an option.

On Fri, Jan 30, 2009 at 12:27 AM, Gregg Donovan  wrote:
> Noble,
>
> Thanks for the suggestion. The unfortunate thing is that we really don't
> know ahead of time what sort of replication delay we're going to encounter
> -- it could be one millisecond or it could be one hour. So, we end up
> needing to do something like:
>
> For delta-import run N:
> 1. query DB slave for "seconds_behind_master", use this to calculate
> Date(N).
> 2. query DB slave for records updated since Date(N - 1)
>
> I see there are plugin points for EventListener classes (onImportStart,
> onImportEnd). Would those be the right spot to calculate these dates so that
> I could expose them to my custom function at query time?
>
> Thanks.
>
> --Gregg
>
> On Wed, Jan 28, 2009 at 11:20 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> noble.p...@gmail.com> wrote:
>
>> The problem you are trying to solve is that you cannot use
>> ${dataimporter.last_index_time} as is. You may need something like
>> ${dataimporter.last_index_time} - 3secs
>>
>> am I right?
>>
>> There is no straightforward way to do this.
>> 1) You may write your own function, say 'lastIndexMinus3Secs', and add
>> it. Functions can be plugged in to DIH using a <function
>> name="lastIndexMinus3Secs" class="foo.Foo"/> declaration under the
>> <dataConfig> tag. And you can use it as
>> ${dataimporter.functions.lastIndexMinus3Secs()}.
>> This will add to the existing in-built functions:
>>
>> http://wiki.apache.org/solr/DataImportHandler#head-5675e913396a42eb7c6c5d3c894ada5dadbb62d7
>>
>> the class must extend org.apache.solr.handler.dataimport.Evaluator
>>
>> We may add a standard function for this too. You can raise an issue.
>> --Noble
>>
>>
>>
>> On Thu, Jan 29, 2009 at 6:26 AM, Gregg  wrote:
>> > I'd like to use the DataImportHandler running against a slave database
>> that,
>> > at any given time, may be significantly behind the master DB. This can
>> cause
>> > updates to be missed if you use the clock-time as the "last_index_time."
>> > E.g., if the slave catches up to the master between two delta-imports.
>> >
>> > Has anyone run into this? In our non-DIH indexing system we get around
>> this
>> > by either using the slave DB's seconds-behind-master or the max last
>> update
>> > time of the records returned.
>> >
>> > Thanks.
>> >
>> > Gregg
>> >
>>
>>
>>
>> --
>> --Noble Paul
>>
>



-- 
--Noble Paul
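
For anyone following Noble's suggestion above, a rough sketch of such an
evaluator (the class, package, and slack value are hypothetical, and the
Evaluator signature should be checked against your Solr version):

   package foo;

   import java.text.ParseException;
   import java.text.SimpleDateFormat;
   import java.util.Date;

   import org.apache.solr.handler.dataimport.Context;
   import org.apache.solr.handler.dataimport.Evaluator;

   /** Returns dataimporter.last_index_time shifted back by a fixed slack,
    *  to compensate for DB replication delay. */
   public class LastIndexMinusSlack extends Evaluator {
       private static final long SLACK_MILLIS = 3000; // assumed 3-second slack

       public String evaluate(String expression, Context context) {
           // DIH formats last_index_time as "yyyy-MM-dd HH:mm:ss"
           SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
           Object last = context.getVariableResolver()
                   .resolve("dataimporter.last_index_time");
           try {
               Date d = fmt.parse(String.valueOf(last));
               return fmt.format(new Date(d.getTime() - SLACK_MILLIS));
           } catch (ParseException e) {
               return String.valueOf(last); // fall back to the unshifted value
           }
       }
   }

It would be declared as <function name="lastIndexMinusSlack"
class="foo.LastIndexMinusSlack"/> under <dataConfig> and used in a
deltaQuery as ${dataimporter.functions.lastIndexMinusSlack()}.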