Multi-word synonyms + highlighting

2010-06-04 Thread Xavier Schepler

Hi,

Here's a field type using synonyms (the archive stripped the XML tags; the
element names below are reconstructed from the surviving attributes, so treat
this as a sketch):

<!-- element names reconstructed; only the attributes survived in the archive -->
<fieldType name="textSynonymsFR" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="french-synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>


Here are the contents of 'french-synonyms.txt' that I used for testing:

PC,parti communiste
PS,parti socialiste

When I query a field for the words "parti communiste", these strings are 
highlighted:

"parti communiste"
"parti socialiste"
"parti"
"PC"
"PS"
"communiste"

Having "parti socialiste" highlighted is a problem.
I expected only "parti communiste", "parti", "communiste" and "PC" 
highlighted.


Is there a way to make this work as I expected?

Here is the query I use :

wt=json
&q=qAndMSFR%3A%28parti%20communiste%29
&q.op=AND
&start=0
&rows=5
&fl=id,studyId,questionFR,modalitiesFR,variableLabelFR,variableName,nesstarVariableId,lang,studyTitle,nesstarStudyId,CevipofConcept,studyQuestionCount,questionPosition,preQuestionText,
&sort=score%20desc
&facet=true
&facet.field=CevipofConceptCode
&facet.field=studyDateAndId
&facet.sort=lex
&spellcheck=true
&spellcheck.collate=on
&spellcheck.count=10
&hl=on
&hl.fl=questionSMFR,modalitiesSMFR,variableLabelSMFR
&hl.fragsize=1
&hl.snippets=100
&hl.usePhraseHighlighter=true
&hl.highlightMultiTerm=true
&hl.simple.pre=%3Cb%3E
&hl.simple.post=%3C%2Fb%3E



Re: exclude docs with null field

2010-06-04 Thread Ahmet Arslan
> say my search query is "new york", and i am searching
> field1 and field2
> for it, how do i specify that i want to exclude docs where
> field3 doesn't
> exist?


http://search-lucene.com/m/1o5mEk8DjX1/




Re: exclude docs with null field

2010-06-04 Thread Geert-Jan Brits
field1:"new york"+field2:"new york"+field3:[* TO *]

2010/6/4 bluestar 

> hi there,
>
> say my search query is "new york", and i am searching field1 and field2
> for it, how do i specify that i want to exclude docs where field3 doesn't
> exist?
>
> thanks
>
>


Re: exclude docs with null field

2010-06-04 Thread bluestar
I could be wrong, but it seems this way has a performance hit?

Or am I missing something?

> field1:"new york"+field2:"new york"+field3:[* TO *]
>
> 2010/6/4 bluestar 
>
>> hi there,
>>
>> say my search query is "new york", and i am searching field1 and field2
>> for it, how do i specify that i want to exclude docs where field3 doesn't
>> exist?
>>
>> thanks
>>
>>
>




Re: exclude docs with null field

2010-06-04 Thread Ahmet Arslan

> i could be wrong but it seems this
> way has a performance hit?
> 
> or i am missing something?

Did you read Chris's message at http://search-lucene.com/m/1o5mEk8DjX1/ ?
He proposes an alternative (more efficient) way than [* TO *].




Re: Logs for Java Replication in Solr

2010-06-04 Thread Peter Karich
Hoss,

thanks a lot! (We are using Tomcat, so the logging properties file is fine.)
Do you know what the reason for the mentioned exception could be?
It seems to me that when this exception occurs, even the replication
for that index does not work.
If I then remove the data directory + reload + poll a replication, all is
fine. But sometimes it occurs again :-/

Regards,
Peter.

> : 
> : where can I find more information about a failure of a Java replication
> : in Solr 1.4?
> : (Dashboard does not seem to be the best place!?)
>
> All the log messages are written using the JDK Logging framework, so it 
> really depends on your servlet container, and where it's configured to 
> write the logs...
>
>   http://wiki.apache.org/solr/SolrLogging
>
>
>
> -Hoss



Re: exclude docs with null field

2010-06-04 Thread bluestar
nice one! thanks.

>
>> i could be wrong but it seems this
>> way has a performance hit?
>>
>> or i am missing something?
>
> Did you read Chris's message in http://search-lucene.com/m/1o5mEk8DjX1/
> He proposes alternative (more efficient) way other than [* TO *]
>
>
>
>




Re: exclude docs with null field

2010-06-04 Thread Geert-Jan Brits
Additionally, I should have mentioned that you can instead do:
fq=field3:[* TO *], which uses the filterCache.

The method presented by Chris will probably outperform the above method, but
only on the first request; from then on the filterCache takes over.
From a performance standpoint it's probably not worth going the 'default
value for null' approach, imho.
It IS useful, however, if you want to be able to query on docs with a
null value (instead of excluding them).
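
For example, a sketch using the field names from the original question
(URL-encode in practice):

q=field1:"new york" AND field2:"new york"&fq=field3:[* TO *]

The fq clause is cached independently of the main query, so repeated searches
only pay the [* TO *] cost once.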


2010/6/4 bluestar 

> nice one! thanks.
>
> >
> >> i could be wrong but it seems this
> >> way has a performance hit?
> >>
> >> or i am missing something?
> >
> > Did you read Chris's message in http://search-lucene.com/m/1o5mEk8DjX1/
> > He proposes alternative (more efficient) way other than [* TO *]
> >
> >
> >
> >
>
>
>


MultiValue Exclusion

2010-06-04 Thread homerlex

How would you model this?

We have a table of news items that people can view in their news stream and
comment on.  Users have the ability to "mute" an item so they never see it in
their feed or search results.

From what I can see there are a couple of ways to accomplish this.

1 - Post-process the results and do not render any muted news items.  The
downside is that pagination becomes problematic.  It's possible we may forgo
pagination because of this, but for now assume that pagination is a
requirement.

2 - Whenever we query for a given user, we append a clause that excludes all
muted items.  I assume in Solr we'd need to do something like -item_id:(1 AND
2 AND 3).  Obviously this doesn't scale very well.

3 - Have a multi-valued property in the index that contains the ids of all
users who have muted the item.  Being new to Solr, I don't even know how (or
if it's possible) to run a query that says "user id not in this multivalued
property".  Can this even be done (sample query please)?  Again, I know this
doesn't scale very well.

Any other suggestions?

Thanks in advance for the help.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/MultiValue-Exclusion-tp870173p870173.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MultiValue Exclusion

2010-06-04 Thread Geert-Jan Brits
I guess the following works.

A. similar to your option 2, but using the filtercache
fq=-item_id:001 -item_id:002

B. similar to your option 3, but using the filtercache
fq=-users_excluded_field:<current_user_id>

the advantage being that the filter is cached independently from the rest of
the query, so it can be reused efficiently.

adv A over B: the 'muted news items' can be queried dynamically, i.e. they
aren't set in stone at index time.
B will probably perform a little bit better the first time (when not
cached), but I'm not sure.

hope that helps,
Geert-Jan
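
P.S. A concrete sketch of option B, assuming the multivalued field is named
users_excluded_field and the current user's id is 42:

q=<the user's search>&fq=-users_excluded_field:42

Only items that user 42 has not muted come back, and the filter is reused
from the filterCache on subsequent requests.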


2010/6/4 homerlex 

>
> How would you model this?
>
> We have a table of news items that people can view in their news stream and
> comment on.  Users have the ability to "mute" item so they never see them
> in
> their feed or search results.
>
> From what I can see there are a couple ways to accomplish this.
>
> 1 - Post process the results and do not render any muted news items.  The
> downside of the pagination become problematic.  Its possible we may forgo
> pagination because of this but for now assume that pagination is a
> requirement.
>
> 2 - Whenever we query for a given user we append a clause that excludes all
> muted items.  I assume in Solr we'd need to do something like -item_id(1
> AND
> 2 AND 3).  Obviously this doesn't scale very well.
>
> 3 - Have a multi-valued property in the index that contains all ids of
> users
> who have muted the item.  Being new to Solr I don't even know how (or if
> its
> possible) to run a query that says "user id not this multivalued property".
> Can this even be done (sample query please)?  Again, I know this doesn't
> scale very well.
>
> Any other suggestions?
>
> Thanks in advance for the help.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/MultiValue-Exclusion-tp870173p870173.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Faceted Search Slows Down as index gets larger

2010-06-04 Thread Furkan Kuru
Hello,

I have been dealing with real-time data.

As the number of total indexed documents gets larger (now 5M),

a faceted search on a text field, limited by creation time, which we use
to find the most-used words in all these text fields, slows down.


query string: created_time:[NOW-1HOUR TO NOW] facet.field=text
facet.mincount=1

The document count matching the query is around 9000.


It takes around 80 seconds on a decent computer with 4GB RAM and a quad-core CPU.

I do not know the internal details of term indexing and term counting for
faceting.

Any suggestion for speeding up this query is appreciated.

Thanks in advance.

-- 
Furkan Kuru


Re: Faceted Search Slows Down as index gets larger

2010-06-04 Thread Yonik Seeley
Faceting on a full-text field is hard.
What version of Solr are you using?

If it's 1.4 or later, try setting
facet.method=enum

And to use the filterCache less, try
facet.enum.cache.minDf=100
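
(A minimal combined request, assuming the default /select handler and the
field names from your message; URL-encode the spaces and brackets in practice:

/select?q=created_time:[NOW-1HOUR TO NOW]&facet=true&facet.field=text&facet.mincount=1&facet.method=enum&facet.enum.cache.minDf=100
)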

-Yonik
http://www.lucidimagination.com

On Fri, Jun 4, 2010 at 10:31 AM, Furkan Kuru  wrote:
> Hello,
>
> I have been dealing with real-time data.
>
> As the number of total indexed documents gets larger (now 5M),
>
> a faceted search on a text field, limited by creation time, which we use
> to find the most-used words in all these text fields, slows down.
>
>
> query string: created_time:[NOW-1HOUR TO NOW] facet.field=text
> facet.mincount=1
>
> the document count matching the query is around 9000.
>
>
> It takes around 80 seconds in a decent computer with 4GB ram, quad core cpu
>
> I do not know the internal details of term indexing and their counts for
> faceting.
>
> Any suggestion for speeding up this query is appreciated.
>
> Thanks in advance.
>
> --
> Furkan Kuru
>


Re: OverlappingFileLockException when using startup

2010-06-04 Thread rabahb

Hi Guys,

I'm experiencing the same issue with a single war. I'm using a brand new
Solr war built from yesterday's version of the trunk.

I've got one master with 2 cores and one slave with a single core. I'm using
one core from master as the master of the second core (which is configured
as a repeater), so that the slave's core can poll the repeater for index
changes.

(I was using Solr 1.4, but experienced some issues with replication. While
rebuilding the index on the one master core, the new index was not
replicated successfully to the other master core. Files were copied over but
the final commit failed on the snappuller. But sometimes, after restarting
the master, the replication would work fine between master cores, then no
replication would be successful from master to slave core. I had the same
issue as described here: https://issues.apache.org/jira/browse/SOLR-1769,
which seems to be fixed in the trunk.)

So I moved on to the trunk version of Solr in order to test the fix. This
seems to work better, as master-core replication works fine. But I've got a
weird behavior on the slave: the index replication is successful only the
second time the slave tries to get it, even though for each replication
attempt the slave spits out the following exception (see below).

There seems to be a concurrency issue, but I don't quite understand where the
concurrency is really happening. Can you please help with that issue?

org.apache.solr.common.SolrException: Index fetch failed :
at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.nio.channels.OverlappingFileLockException
at
sun.nio.ch.FileChannelImpl$SharedFileLockTable.checkList(FileChannelImpl.java:1170)
at
sun.nio.ch.FileChannelImpl$SharedFileLockTable.add(FileChannelImpl.java:1072)
at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:878)
at java.nio.channels.FileChannel.tryLock(FileChannel.java:962)
at
org.apache.lucene.store.NativeFSLock.obtain(NativeFSLockFactory.java:260)
at org.apache.lucene.store.Lock.obtain(Lock.java:72)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1061)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:950)
at
org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:192)
at
org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
at
org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
at
org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
... 11 more
 




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/OverlappingFileLockException-when-using-str-name-replicateAfter-startup-str-tp488686p870589.html
Sent from the Solr - User mailing list archive at Nabble.com.


String Sort Not Working

2010-06-04 Thread Patrick Wilson
All,
I am trying to sort on a text field and can't get it to work. I try sorting on 
"sortTitle" and get no errors; it just doesn't appear to sort. The pertinent 
parts of my schema:


<!-- XML stripped by the archive; reconstructed sketch -- names other than
     "sortTitle" are guesses; note the lowercase "copyfield" on the last line -->
<fieldType name="textSort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    ... lots of filters that do work ...
  </analyzer>
</fieldType>
...
<field name="title" type="text" indexed="true" stored="true"/>
<field name="sortTitle" type="textSort" indexed="true" stored="true"/>
...
<copyfield source="title" dest="sortTitle"/>


I set stored="true" on the sort field so I could see if anything was getting 
copied there, and it would appear that this is not the case. I don't see any 
"top 10" summaries like I do for other fiends, including another field 
populated by copyField. Is this just because of the filters I am using?

I'm sure this horse (or similar horses) has been beaten to death before, but 
I'm new to this mailing list, so sorry about that. Any help is greatly 
appreciated!

Thanks,
Patrick


Re: Faceted Search Slows Down as index gets larger

2010-06-04 Thread Furkan Kuru
I am using 1.4 version.

I have tried your suggestion,

it takes around 25-30 seconds now.

Thank you,


On Fri, Jun 4, 2010 at 5:54 PM, Yonik Seeley wrote:

> Faceting on a full-text field is hard.
> What version of Solr are you using?
>
> If it's 1.4 or later, try setting
> facet.method=enum
>
> And to use the filterCache less, try
> facet.enum.cache.minDf=100
>
> -Yonik
> http://www.lucidimagination.com
>
> On Fri, Jun 4, 2010 at 10:31 AM, Furkan Kuru  wrote:
> > Hello,
> >
> > I have been dealing with real-time data.
> >
> > As the number of total indexed documents gets larger (now 5 M)
> >
> > a faceted search on a text field, limited by creation time, which we
> use
> > to find the most-used words in all these text fields, slows down.
> >
> >
> > query string: created_time:[NOW-1HOUR TO NOW] facet.field=text
> > facet.mincount=1
> >
> > the document count matching the query is around 9000.
> >
> >
> > It takes around 80 seconds in a decent computer with 4GB ram, quad core
> cpu
> >
> > I do not know the internal details of term indexing and their counts for
> > faceting.
> >
> > Any suggestion for speeding up this query is appreciated.
> >
> > Thanks in advance.
> >
> > --
> > Furkan Kuru
> >
>



-- 
Furkan Kuru


RE: index growing with updates

2010-06-04 Thread Nagelberg, Kallin
Ok so I think that Solr (Lucene) will only remove deleted/updated documents 
from the disk after an optimize or after an 'expungeDeletes' request. Is there 
a way to trigger the expunsion (new word) across the entire index? I tried:

final UpdateRequest request = new UpdateRequest();
request.setParam("expungeDeletes", "true");
request.add(someOfMyDocs);   // a collection of SolrInputDocuments
request.process(server);     // note: expungeDeletes only takes effect as part of a commit


But that didn't seem to do the trick, as I know I have about 7 gigs of documents 
that should be removed from the disk and the index size hasn't really budged.

Any ideas?

Thanks,
Kallin Nagelberg





-Original Message-
From: Nagelberg, Kallin 
Sent: Thursday, June 03, 2010 1:36 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: index growing with updates

Is there a way to trigger a purge, or under what conditions does it occur?

-Kallin Nagelberg

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, June 03, 2010 12:40 PM
To: solr-user@lucene.apache.org
Subject: Re: index growing with updates

Assuming your config is set up to replace unique keys, you're really
doing a delete and an add (under the covers). It could very well be that
the deleted version of the document is still in your index taking up
space and will be until it is purged.

HTH
Erick

On Thu, Jun 3, 2010 at 10:22 AM, Nagelberg, Kallin <
knagelb...@globeandmail.com> wrote:

> Hey,
>
> If I add a document to the index that already exists (same uniquekey) what
> is the expected behavior? I would imagine that if the document is the same
> then the index should not grow, but mine appears to be growing. Any ideas?
>
> Thanks,
> -Kallin Nagelberg
>
>


Re: Highlighting a field with a certain value

2010-06-04 Thread Koji Sekiguchi

(10/05/25 0:31), n...@frameweld.com wrote:

Hello,

How am I able to highlight a field that contains a specific value? If I have a field 
called type, how am I able to highlight the rows whose values contain something like 
"title"?

http://localhost:8983/solr/select?q=title&hl=on&hl.fl=type

Koji

--
http://www.rondhuit.com/en/



Re: String Sort Not Working

2010-06-04 Thread Ahmet Arslan
> <copyfield source="title" dest="sortTitle"/>  <!-- quoted line reconstructed; the archive stripped the tag -->

The simple lowercase f is causing this. It should be <copyField ... /> with a 
capital F; Solr silently ignores elements it doesn't recognize, so the copy 
never happens.

RE: String Sort Not Working

2010-06-04 Thread Patrick Wilson
That did it. Thank you =)

P.S. Might it be helpful for Solr to complain about invalid XML during startup? 
Does it do this and I'm just not noticing?

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Friday, June 04, 2010 12:18 PM
To: solr-user@lucene.apache.org
Subject: Re: String Sort Not Working

> <copyfield source="title" dest="sortTitle"/>

The simple lowercase f is causing this. It should be <copyField ... />.

Need help to install Solr on JBoss

2010-06-04 Thread Bondiga, Murali
I installed Solr on my local machine and it works fine with Jetty. I am trying 
to install it on JBoss, which is running on a Sun Solaris box, and I have the 
following questions:


 1.  Do I need to copy the entire example folder from my local machine to Solr 
home on Sun Solaris box?
 2.  How can I have multiple cores on the Sun Solaris box?

Any help is appreciated.

Thanks,
Murali


RE: String Sort Not Working

2010-06-04 Thread Ahmet Arslan
> P.S. Might it be helpful for Solr to complain about invalid
> XML during startup? Does it do this and I'm just not
> noticing?

Chris's explanation about a similar topic:
http://search-lucene.com/m/11JWX1hxL4u/




RE: String Sort Not Working

2010-06-04 Thread Patrick Wilson
Very informative - thank you!

I think it might be useful to have this feature - maybe have an interface for 
plugins to register an XSD or otherwise declare their expected XML elements and 
attributes. I'm not sure if there's enough demand for this to justify the time 
it would take to make the change, though. Just a thought.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Friday, June 04, 2010 1:41 PM
To: solr-user@lucene.apache.org
Subject: RE: String Sort Not Working

> P.S. Might it be helpful for Solr to complain about invalid
> XML during startup? Does it do this and I'm just not
> noticing?

Chris's explanation about a similar topic:
http://search-lucene.com/m/11JWX1hxL4u/





conditional Document Boost

2010-06-04 Thread MitchK

Hello out there,

I am searching for a solution for conditional document boosting.
While analyzing the fields of a document, I want to compute a document boost
based on some metrics.
There are three approaches:
First: I preprocess the data. The main problem with this is that I need to
take care of the preprocessing part myself and can't do it out of the box
(implementing an analyzer, computing the boost value, and afterwards storing
those values or sending them to Solr).

Second: Using the UpdateRequestProcessor (does it work with DIH?). However,
this would also mean custom work, plus making sure the used params are
up to date. (A sketch of this approach follows below.)

Third: Setting the document boost while the analysis process is running, with
the help of a TokenFilter (is this possible?).

What would you do?
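
A minimal sketch of the UpdateRequestProcessor approach (the second option
above), assuming a hypothetical numeric "popularity" field as the metric; the
class name and boost formula are made up:

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class MetricBoostProcessor extends UpdateRequestProcessor {

  public MetricBoostProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    // Hypothetical metric: scale the boost by a numeric "popularity" field.
    Object popularity = doc.getFieldValue("popularity");
    if (popularity != null) {
      float boost = 1.0f + Float.parseFloat(popularity.toString()) / 100f;
      doc.setDocumentBoost(boost); // index-time document boost
    }
    super.processAdd(cmd); // hand the document to the rest of the chain
  }
}

The processor would be wired into an updateRequestProcessorChain in
solrconfig.xml via a matching UpdateRequestProcessorFactory.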


I think what I want to do is quite similar to working with Mahout and Solr.
I have never worked with Mahout - but how can I use it to improve the user's
search experience?
Where can I use Mahout in Solr if I want to influence documents' boosts?
And where in general (i.e. for classification)?

References, ideas and whatever could be useful are welcome :-).

Thank you.

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/conditional-Document-Boost-tp871108p871108.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: TikaEntityProcessor not working?

2010-06-04 Thread Brad Greenlee
You are my hero. I replaced the Tika 0.8 snapshots that were included with Solr 
with 0.6 and it works now. Thank you!

Brad

On Jun 3, 2010, at 6:22 AM, David George wrote:

> 
> Which version of Tika do you have? There was a problem introduced somewhere
> between Tika 0.6 and Tika 0.7 whereby the TikaConfig method
> config.getParsers() was returning an empty parser list due to class loader
> scope issues with Solr running under an application server.
> 
> There is a fix in the Tika 0.8 branch, and I note that a 0.8 snapshot of Tika
> is included in the Solr trunk. I've not tried to get this to work and am
> not sure what config is needed to make it work. I simply installed Tika
> 0.6, which can be downloaded from the Apache Tika website.
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-tp856965p867572.html
> Sent from the Solr - User mailing list archive at Nabble.com.



RE: index growing with updates

2010-06-04 Thread Chris Hostetter

: Ok so I think that Solr (lucene) will only remove deleted/updated 
: documents from the disk after an optimize or after an 'expungeDeletes' 
: request. Is there a way to trigger the expunsion (new word) across the 
: entire index? I tried :

deletes are removed when segments are merged -- an optimize merges all 
segments, so it forcibly removes all deleted docs, but regular merges as 
documents are added/updated will clean things up periodically -- so if you 
have a fixed set of documents that you keep updating over and over, your 
index size will not grow without bounds -- it will oscillate between a min 
(completely optimized) and a max (lots of segments with lots of deletions 
just about to be merged)
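
(for completeness, a sketch of forcing the cleanup from the URL -- host and 
core path assumed:

curl 'http://localhost:8983/solr/update?commit=true&expungeDeletes=true'

though, as above, normal merging usually makes this unnecessary)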



-Hoss



Range query on long value

2010-06-04 Thread David

Hi,

I have an issue with range queries on a long value in our dataset (the 
dataset is fairly large, but I believe the problem still exists for 
smaller datasets).  When I query the index with a range, such as id:[1 
TO 2000], I get values back that are well outside that range.  It's as if 
the range query is ignoring the values and doing something like id:[* TO 
*]. We are running Solr 1.3.  The value is set as the unique key for the 
index.


Our schema is similar to this (the archive stripped the XML tags; the lines
below are reconstructed, with "..." where the specifics were lost):

<field name="id" type="long" indexed="true" stored="true" required="true" />
<field name="..." type="..." indexed="true" stored="false" required="false" />
<field name="..." type="..." indexed="true" stored="false" required="false" />
.
.
.
<field name="..." type="..." indexed="true" stored="false" required="false" />

<uniqueKey>id</uniqueKey>

Has anyone else had this problem?  If so, how did you correct it?  
Thanks in advance.


Need help with document format

2010-06-04 Thread Moazzam Khan
Hi guys,


I have a list of consultants, and the users (people who work for the
company) are supposed to be able to search for consultants based on
the time frame they worked for a company. For example, I should
be able to search for all consultants who worked for Bear Stearns in
the month of July. What is the best way of accomplishing this?

I was thinking of formatting the document like this (the archive stripped the
XML tags; the element names are reconstructed guesses):

<employer>
   <name>Bear Stearns</name>
   <from>2000-01-01</from>
   <to>present</to>
</employer>
<employer>
   <name>AIG</name>
   <from>1999-01-01</from>
   <to>2000-01-01</to>
</employer>

Is this possible?

Thanks,

Moazzam


Re: Need help to install Solr on JBoss

2010-06-04 Thread Juan Pedro

Check the wiki


1.  Do I need to copy the entire example folder from my local machine to Solr 
home on Sun Solaris box?


http://wiki.apache.org/solr/SolrJBoss


2.  How can I have multiple cores on the Sun Solaris box?


http://wiki.apache.org/solr/CoreAdmin


Regards

Juan

www.linebee.com



Bondiga, Murali wrote:

I installed Solr on my local machine and it works fine with Jetty. I am trying 
to install on JBoss which is running on a Sun Solaris box and I have the 
following questions:


 1.  Do I need to copy the entire example folder from my local machine to Solr 
home on Sun Solaris box?
 2.  How can I have multiple cores on the Sun Solaris box?

Any help is appreciated.

Thanks,
Murali





Index-time vs. search-time boosting performance

2010-06-04 Thread Asif Rahman
Hi,

What are the performance ramifications of using a function-based boost at
search time (through bf in the dismax parser) versus an index-time boost?
Currently I'm using boost functions on a 15GB index of ~14M documents.  Our
queries generally match many thousands of documents.  I'm wondering if I
would see a performance improvement by switching over to index-time
boosting.

Thanks,

Asif

-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Range query on long value

2010-06-04 Thread Ahmet Arslan

> I have an issue with range queries on a long value in our
> dataset (the dataset is fairly large, but i believe the
> problem still exists for smaller datasets).  When i
> query the index with a range, as such: id:[1 TO 2000], I get
> values back that are well outside that range.  Its as
> if the range query is ignoring the values and doing
> something like id:[* TO *]. We are running Solr 1.3. 
> The value is set as the unique key for the index.
> 
> Our schema is similar to this:
> 
> <field name="id" ... stored="true" required="true" />
> <field ... stored="false" required="true" />
> <field ... stored="false" required="false" />
> <field ... stored="false" required="false" />

You need to use the sortable long type (type="slong") in Solr 1.3.0 for range 
queries to work correctly. The default schema.xml has an explanation of the 
sortable types (sint, slong, etc.).
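
A minimal sketch of the change (the slong type definition below matches the
default 1.3 schema; the field line assumes your "id" field):

<fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
<field name="id" type="slong" indexed="true" stored="true" required="true" />

Reindexing is required after changing the field type.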





Re: Range query on long value

2010-06-04 Thread David

On 10-06-04 05:11 PM, Ahmet Arslan wrote:
>> I have an issue with range queries on a long value in our
>> dataset (the dataset is fairly large, but i believe the
>> problem still exists for smaller datasets).  When i
>> query the index with a range, as such: id:[1 TO 2000], I get
>> values back that are well outside that range.  Its as
>> if the range query is ignoring the values and doing
>> something like id:[* TO *]. We are running Solr 1.3.
>> The value is set as the unique key for the index.
>>
>> Our schema is similar to this:
>> [...]
>
> You need to use the sortable long type (type="slong") in Solr 1.3.0 for range
> queries to work correctly. The default schema.xml has an explanation of the
> sortable types (sint, slong, etc.).

Thanks for the fast response, Ahmet.  This fixed my issue, but I have a 
question as to whether there is a performance hit if I change other 
fields to a sortable type, even if I'm not sure they will ever be used 
for range searches.


Re: general debugging techniques?

2010-06-04 Thread Chris Hostetter

: to format the data from my sources.  I can read through the catalina
: log, but this seems to just log requests; not much info is given about
: errors or when the service hangs.  Here are some examples:

if you are only seeing one log line per request, then you are just looking 
at the "request" log ... there should be more logs with messages from all 
over the code base with various levels of severity -- and using standard 
java log level controls you can turn these up/down for various components.

: Although I am keeping document size under 5MB, I regularly see
: "SEVERE: java.lang.OutOfMemoryError: Java heap space" errors.  How can
: I find what component had this problem?

that's one of java's most annoying problems -- even if you have the full 
stack trace of the OOM, that just tells you which code path was the straw 
that broke the camel's back -- it doesn't tell you where all your memory 
was being used.  for that you really need to use a java profiler, or turn 
on heap dumps and use a heap dump analyzer after the OOM occurs.
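
(a sketch of the heap-dump route, with standard HotSpot flags -- the path is 
just an example:

java -Xmx128M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp ...

the resulting .hprof file can be opened in a heap analyzer such as jhat or 
Eclipse MAT)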

: After the above error, I often see this followup error on the next
: document: "SEVERE: org.apache.lucene.store.LockObtainFailedException:
: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/
: index/lucene-d6f7b3bf6fe64f362b4d45bfd4924f54-write.lock" .  This has
: a backtrace, so I could dive directly into the code.  Is this the best
: way to track down the problem, or are there debugging settings that
: could help show why the lock is being held elsewhere?

probably not -- most java apps are just screwed in general after an OOM 
(or any other low level error).

: I attempted to turn on index logging with the line
: 
: <infoStream file="...">true</infoStream>
: 
: but I can't seem to find this file in either the tomcat or the index 
: directory.

it will probably be in whatever the Current Working Directory (CWD) is -- 
assuming the file permissions allow writing to it.  the top of the Solr 
admin screen tells you what the CWD is in case it's not clear from how 
your servlet container is run.


-Hoss



RE: general debugging techniques?

2010-06-04 Thread Chris Hostetter

: That is still really small for 5MB documents. I think the default solr 
: document cache is 512 items, so you would need at least 3 GB of memory 
: if you didn't change that and the cache filled up.

that assumes that the text tika extracts from each document is 
the same size as the original raw files *and* that he's configured that 
content field to be "stored" ... in practice if you only stored=true the 
summary fields (title, author, short summary, etc...) the document cache 
isn't going to be nearly that big (and even if you do store the entire 
content field, the plain text is usually *much* smaller than the binary 
source file)

: -Xmx128M - my understanding is that this bumps heap size to 128M.

FWIW: depending on how many docs you are indexing, and wether you want to 
support things like faceting that rely on building in memory caches to be 
fast, 128MB is really, really, really small for a typical Solr instance.

Even on a box that is only doing indexing (no queries) i would imagine 
Tika likes to have a lot of ram when doing extraction (most doc types are 
going to require the raw binary data be entirely in the heap, plus all the 
extracted Strings, plus all of the connecting objects to build the DOM, 
etc).  And that's before you even start thinking about Solr & Lucene and 
the index itself.

-Hoss



Help with Shingled queries

2010-06-04 Thread Greg Bowyer
Hi all

Interesting and, by the looks of things, a very solid project you have here with 
SOLR; however...

I have an index that contains a large number of "phrases" that I need to search 
over; each of these phrases is fairly small, on average about 4 words long.

The search terms that I am given to search these phrases with are very long and 
quite arbitrary; sometimes the search terms will be up to 25 words long.

As such, the performance of my index when built naively is sporadic: sometimes 
searches are very fast; on average they are somewhat slower.

I have attempted to improve this situation by using shingling for the phrases 
and the related search queries. In my schema I have the following (tags 
reconstructed from the surviving attributes; the archive stripped the XML):

<fieldType name="shingledPhrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigramIfNoNgram="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigramIfNoNgram="true"/>
  </analyzer>
</fieldType>


In the indexes, as seen with Luke, I do indeed have a large range of shingled 
terms.

When I run the analyser for either query or index terms I also see the breakdown 
with the shingled terms correctly displayed.

However, when I attempt to use this in a query I do not see the terms applied in 
the debug output. For example, with the term "short red evil fox" I would expect 
to see the shingles
'short_red' 'red_evil' 'evil_fox'

but instead I get the following

"debug":{
  "rawquerystring":"short red evil fox",
  "querystring":"short red evil fox",
  "parsedquery":"+() ()",
  "parsedquery_toString":"+() ()",
  "explain":{},
  "QParser":"DisMaxQParser",
  "altquerystring":null,
  "boostfuncs":null,
  "filter_queries":["atomId:(8235 10914 10911 )"],
  "parsed_filter_queries":["atomId:8235 atomId:10914 atomId:10911"],
  "timing":{ ..

Does anyone know what I could be doing wrong here? Is it a bug in the debug 
output, a stupid mistake, misconception or piece of idiocy on my part, or 
something else?


Many thanks

-- Greg Bowyer




Re: Index-time vs. search-time boosting performance

2010-06-04 Thread Erick Erickson
Index-time boosting is different from search-time boosting, so
asking about their relative performance is somewhat beside the point.

Paraphrasing Hossman from years ago on the Lucene list (from
memory).

...index-time boosting is a way of saying "this document's
title is more important than other documents' titles". Search-time
boosting is a way of saying "I care about documents
whose titles contain this term more than other documents
whose titles may match other parts of this query".

HTH
Erick

On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman  wrote:

> Hi,
>
> What are the performance ramifications for using a function-based boost at
> search time (through bf in dismax parser) versus an index-time boost?
> Currently I'm using boost functions on a 15GB index of ~14mm documents.
>  Our
> queries generally match many thousands of documents.  I'm wondering if I
> would see a performance improvement by switching over to index-time
> boosting.
>
> Thanks,
>
> Asif
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> a...@newscred.com
> http://platform.newscred.com
>


Re: Faceted Search Slows Down as index gets larger

2010-06-04 Thread Andy
Yonik,

Just curious: why does using enum improve the facet performance?

Furkan was faceting on a text field with each word being a facet value. I'd 
imagine that means there's a large number of facet values. According to the 
documentation (http://wiki.apache.org/solr/SimpleFacetParameters#facet.method), 
facet.method=fc is faster when a field has many unique terms. So how come enum, 
not fc, is faster in this case?

Also, why use the filterCache less?

Thanks
Andy

--- On Fri, 6/4/10, Furkan Kuru  wrote:

> From: Furkan Kuru 
> Subject: Re: Faceted Search Slows Down as index gets larger
> To: solr-user@lucene.apache.org, yo...@lucidimagination.com
> Date: Friday, June 4, 2010, 11:25 AM
> I am using 1.4 version.
> 
> I have tried your suggestion,
> 
> it takes around 25-30 seconds now.
> 
> Thank you,
> 
> 
> On Fri, Jun 4, 2010 at 5:54 PM, Yonik Seeley 
> wrote:
> 
> > Faceting on a full-text field is hard.
> > What version of Solr are you using?
> >
> > If it's 1.4 or later, try setting
> > facet.method=enum
> >
> > And to use the filterCache less, try
> > facet.enum.cache.minDf=100
> >
> > -Yonik
> > http://www.lucidimagination.com
> >
> > On Fri, Jun 4, 2010 at 10:31 AM, Furkan Kuru 
> wrote:
> > > Hello,
> > >
> > > I have been dealing with real-time data.
> > >
> > > As the number of total indexed documents gets
> larger (now 5 M)
> > >
> > > a faceted search on a text field limited by the
> creation time, which we
> > use
> > > to find the most used word in all these text
> fields, gets slow down.
> > >
> > >
> > > query string: created_time:[NOW-1HOUR TO NOW]
> facet.field=text
> > > facet.mincount=1
> > >
> > > the document count matching the query is around
> 9000.
> > >
> > >
> > > It takes around 80 seconds in a decent computer
> with 4GB ram, quad core
> > cpu
> > >
> > > I do not know the internal details of term
> indexing and their counts for
> > > faceting.
> > >
> > > Any suggestion for speeding up this query is
> appreciated.
> > >
> > > Thanks in advance.
> > >
> > > --
> > > Furkan Kuru
> > >
> >
> 
> 
> 
> -- 
> Furkan Kuru
> 




Re: Index-time vs. search-time boosting performance

2010-06-04 Thread Asif Rahman
Perhaps I should have been more specific in my initial post.  I'm doing
date-based boosting on the documents in my index, so as to assign a higher
score to more recent documents.  Currently I'm using a boost function to
achieve this.  I'm wondering if there would be a performance improvement if,
instead of using the boost function at search time, I indexed the documents
with a date-based boost.

On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson wrote:

> Index time boosting is different than search time boosting, so
> asking about performance is irrelevant.
>
> Paraphrasing Hossman from years ago on the Lucene list (from
> memory).
>
> ...index time boosting is a way of saying this documents'
> title is more important than other documents' titles. Search
> time boosting is a way of saying "I care about documents
> whose titles contain this term more than other documents
> whose titles may match other parts of this query"
>
> HTH
> Erick
>
> On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman  wrote:
>
> > Hi,
> >
> > What are the performance ramifications for using a function-based boost
> at
> > search time (through bf in dismax parser) versus an index-time boost?
> > Currently I'm using boost functions on a 15GB index of ~14mm documents.
> >  Our
> > queries generally match many thousands of documents.  I'm wondering if I
> > would see a performance improvement by switching over to index-time
> > boosting.
> >
> > Thanks,
> >
> > Asif
> >
> > --
> > Asif Rahman
> > Lead Engineer - NewsCred
> > a...@newscred.com
> > http://platform.newscred.com
> >
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Help with Shingled queries

2010-06-04 Thread Robert Muir
the queryparser first splits on whitespace.

so each individual word of your query: short,red,evil,fox gets its own
tokenstream, and therefore isn't shingled.
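
(One way to check that the shingles are usable at all: a quoted phrase is
passed to the analyzer as a single string, so something like

q=yourShingledField:"short red evil fox"

-- field name assumed -- should produce the shingled terms. Getting dismax to
shingle an unquoted multi-word query would need the whole string to reach the
analyzer in one piece.)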

On Fri, Jun 4, 2010 at 6:21 PM, Greg Bowyer  wrote:

> Hi all
>
> Interesting and by the looks of things very solid project you have here
> with
> SOLR, however ..
>
> I have an index that contains a large number of "phrases" that I need to
> search
> for over, each of these phrases is fairly small being on average about 4
> words
> long.
>
> The search terms that I am given to search these phrases are very long, and
> quite arbitrary, sometimes the search terms will be up to 25 words long.
>
> As such the performance of my index when built naively is sporadic
> sometimes
> searches are very fast on average they are somewhat slower.
>
> I have attempted to improve this situation by using shingling for the
> phrases
> and the related search queries, in my schema I have the following
>
>
> positionIncrementGap="100">
>  
>
>
> outputUnigramIfNoNgram="true" />
>  
>  
>
>
> outputUnigramIfNoNgram="true" />
>  
>
>
> In the indexes, as seen with luke I do indeed have a large range of
> shingled
> terms.
>
> When I run the analyser for either query or index terms I also see the
> breakdown
> with the shingled terms correctly displayed.
>
> However when I attempt to use this in a query I do not see the terms
> applied in
> the debug output, for example with the term "short red evil fox" I would
> expect
> to see the shingles
> 'short_red' 'red_evil' 'evil_fox'
>
> but instead I get the following
>
> "debug":{
>  "rawquerystring":"short red evil fox",
>  "querystring":"short red evil fox",
>  "parsedquery":"+() ()",
>  "parsedquery_toString":"+() ()",
>  "explain":{},
>  "QParser":"DisMaxQParser",
>  "altquerystring":null,
>  "boostfuncs":null,
>  "filter_queries":["atomId:(8235 10914 10911 )"],
>  "parsed_filter_queries":["atomId:8235 atomId:10914 atomId:10911"],
>  "timing":{ ..
>
> Does anyone know what I could be doing wrong here, is it a bug in the debug
> output, a stupid mistake misconception or piece of idiocy on my part or
> something else.
>
>
> Many thanks
>
> -- Greg Bowyer
>
>
>


-- 
Robert Muir
rcm...@gmail.com


Re: Index-time vs. search-time boosting performance

2010-06-04 Thread Jay Hill
I've done a lot of recency boosting to documents, and I'm wondering why you
would want to do that at index time. If you are continuously indexing new
documents, what was "recent" when it was indexed becomes, over time "less
recent". Are you unsatisfied with your current performance with the boost
function? Query-time recency boosting is a fairly common thing to do, and,
if done correctly, shouldn't be a performance concern.
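
(For reference, the common query-time form from the SolrRelevancyFAQ is a
reciprocal of document age, e.g.

bf=recip(ms(NOW,created_time),3.16e-11,1,1)

-- the created_time field name is assumed to match Asif's schema.)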

-Jay
http://lucidimagination.com


On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman  wrote:

> Perhaps I should have been more specific in my initial post.  I'm doing
> date-based boosting on the documents in my index, so as to assign a higher
> score to more recent documents.  Currently I'm using a boost function to
> achieve this.  I'm wondering if there would be a performance improvement if
> instead of using the boost function at search time, I indexed the documents
> with a date-based boost.
>
> On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson  >wrote:
>
> > Index time boosting is different than search time boosting, so
> > asking about performance is irrelevant.
> >
> > Paraphrasing Hossman from years ago on the Lucene list (from
> > memory).
> >
> > ...index time boosting is a way of saying this documents'
> > title is more important than other documents' titles. Search
> > time boosting is a way of saying "I care about documents
> > whose titles contain this term more than other documents
> > whose titles may match other parts of this query"
> >
> > HTH
> > Erick
> >
> > On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman  wrote:
> >
> > > Hi,
> > >
> > > What are the performance ramifications for using a function-based boost
> > at
> > > search time (through bf in dismax parser) versus an index-time boost?
> > > Currently I'm using boost functions on a 15GB index of ~14mm documents.
> > >  Our
> > > queries generally match many thousands of documents.  I'm wondering if
> I
> > > would see a performance improvement by switching over to index-time
> > > boosting.
> > >
> > > Thanks,
> > >
> > > Asif
> > >
> > > --
> > > Asif Rahman
> > > Lead Engineer - NewsCred
> > > a...@newscred.com
> > > http://platform.newscred.com
> > >
> >
>
>
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> a...@newscred.com
> http://platform.newscred.com
>


Re: Faceted Search Slows Down as index gets larger

2010-06-04 Thread Yonik Seeley
On Fri, Jun 4, 2010 at 7:33 PM, Andy  wrote:
> Yonik,
>
> Just curious why does using enum improve the facet performance.
>
> Furkan was faceting on a text field with each word being a facet value. I'd 
> imagine that'd mean there's a large number of facet values. According to the 
> documentation 
> (http://wiki.apache.org/solr/SimpleFacetParameters#facet.method) 
> facet.method=fc is faster when a field has many unique terms. So how come 
> enum, not fc, is faster in this case?

facet.method=fc is faster when there are many unique terms, and
relatively few terms per document.  A full-text field doesn't fit that
bill.

> Also why use filterCache less?

Takes up a lot of memory.

-Yonik
http://www.lucidimagination.com


Re: Does SolrJ support nested annotated beans?

2010-06-04 Thread Thomas J. Buhr
+1

Good question, my use of Solr would benefit from nested annotated beans as well.

Awaiting the reply,

Thom


On 2010-06-03, at 1:35 PM, Peter Hanning wrote:

> 
> When modeling documents with a lot of fields (hundreds) the bean class used
> with SolrJ to interact with the Solr index tends to get really big and
> unwieldy. I was hoping that it would be possible to extract groups of
> properties into nested beans and move the @Field annotations along.
> 
> Basically, I want to refactor something like the following:
> 
>  // Imports have been omitted for this example.
>  public class TheBigOne
>  {
>@Field("UniqueKey")
>private String uniqueKey;
>@Field("Name_en")
>private String name_en;
>@Field("Name_es")
>private String name_es;
>@Field("Name_fr")
>private String name_fr;
>@Field("Category")
>private String category;
>@Field("Color")
>private String color;
>// Additional properties, getters and setters have been omitted for this
> example.
>  }
> 
> into something like the following:
> 
>  // Imports have been omitted for this example.
>  public class TheBigOne
>  {
>@Field("UniqueKey")
>private String uniqueKey;
>private Names names = new Names();
>private Classification classification = new Classification();
>// Additional properties, getters and setters have been omitted for this
> example.
>  }
> 
>  // Imports have been omitted for this example.
>  public class Names
>  {
>@Field("Name_en")
>private String name_en;
>@Field("Name_es")
>private String name_es;
>@Field("Name_fr")
>private String name_fr;
>// Additional properties, getters and setters have been omitted for this
> example.
>  }
> 
>  // Imports have been omitted for this example.
>  public class Classification
>  {
>@Field("Category")
>private String category;
>@Field("Color")
>private String color;
>// Additional properties, getters and setters have been omitted for this
> example.
>  }
> 
> This did not work however as the DocumentObjectBinder does not seem to walk
> the nested object graph. Am I doing something wrong, or is this not
> supported?
> 
> I see JIRA tickets 1129 and 1357 could alleviate this issue somewhat for the
> Name* fields once 1.5 comes out. Still, it would be great to be able to nest
> beans without using dynamic names in the field annotations like in the
> Classification example above.
> 
> 
> As a quick and naive test I tried to change the DocumentObjectBinder's
> collectInfo method to something like the following:
> 
>  private List<DocField> collectInfo(Class clazz) {
>    List<DocField> fields = new ArrayList<DocField>();
>    Class superClazz = clazz;
>    ArrayList<AccessibleObject> members = new ArrayList<AccessibleObject>();
>    while (superClazz != null && superClazz != Object.class) {
>      members.addAll(Arrays.asList(superClazz.getDeclaredFields()));
>      members.addAll(Arrays.asList(superClazz.getDeclaredMethods()));
>      superClazz = superClazz.getSuperclass();
>    }
>    for (AccessibleObject member : members) {
>      if (member.isAnnotationPresent(Field.class)) {
>        member.setAccessible(true);
>        fields.add(new DocField(member));
>      } // BEGIN changes
>      else { // A quick test supporting only Field, not Method and others
>        if (member instanceof java.lang.reflect.Field) {
>          java.lang.reflect.Field field = (java.lang.reflect.Field) member;
>          fields.addAll(collectInfo(field.getType()));
>        }
>      } // END changes
>    }
>    return fields;
>  }
> 
> This worked in that SolrJ started walking down into nested beans, checking
> for and handling @Field annotations in the nested beans. However, when
> trying to retrieve the values of the fields in the nested beans, SolrJ still
> tried to look for them in the main bean as far as I can tell.
> 
> ERROR 2010-06-02 09:28:35,326 (main) () (SolrIndexer.java:335 main) -
> Exception encountered:
> java.lang.RuntimeException: Exception while getting value: private
> java.lang.String Names.Name_en
>at
> org.apache.solr.client.solrj.beans.DocumentObjectBinder$DocField.get(DocumentObjectBinder.java:377)
>at
> org.apache.solr.client.solrj.beans.DocumentObjectBinder.toSolrInputDocument(DocumentObjectBinder.java:71)
>at
> org.apache.solr.client.solrj.SolrServer.addBeans(SolrServer.java:56)
> ...
> Caused by: java.lang.IllegalArgumentException: Can not set java.lang.String
> field Names.Name_en to TheBigOne
>at
> sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:146)
>at
> sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:150)
>at
> sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:37)
>at
> sun.reflect.UnsafeObjectFieldAccessorImpl.get(UnsafeObjectFieldAccessorImpl.java:18)
>at java.lang.reflect.Field.get(Field.java:358)
>at
> org.apache.solr.client.solrj.beans.Docu

Re: Index-time vs. search-time boosting performance

2010-06-04 Thread Asif Rahman
It seems like it would be far more efficient to calculate the boost factor
once and store it rather than calculating it for each request in real time.
Some of our queries match tens of thousands, if not hundreds of thousands, of
documents in a 15GB index.  However, I'm not well-versed in lucene internals,
so I may be misunderstanding what is going on here.


On Fri, Jun 4, 2010 at 8:31 PM, Jay Hill  wrote:

> I've done a lot of recency boosting to documents, and I'm wondering why you
> would want to do that at index time. If you are continuously indexing new
> documents, what was "recent" when it was indexed becomes, over time "less
> recent". Are you unsatisfied with your current performance with the boost
> function? Query-time recency boosting is a fairly common thing to do, and,
> if done correctly, shouldn't be a performance concern.
>
> -Jay
> http://lucidimagination.com
>
>
> On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman  wrote:
>
> > Perhaps I should have been more specific in my initial post.  I'm doing
> > date-based boosting on the documents in my index, so as to assign a
> higher
> > score to more recent documents.  Currently I'm using a boost function to
> > achieve this.  I'm wondering if there would be a performance improvement
> if
> > instead of using the boost function at search time, I indexed the
> documents
> > with a date-based boost.
> >
> > On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson  > >wrote:
> >
> > > Index time boosting is different than search time boosting, so
> > > asking about performance is irrelevant.
> > >
> > > Paraphrasing Hossman from years ago on the Lucene list (from
> > > memory).
> > >
> > > ...index time boosting is a way of saying this documents'
> > > title is more important than other documents' titles. Search
> > > time boosting is a way of saying "I care about documents
> > > whose titles contain this term more than other documents
> > > whose titles may match other parts of this query"
> > >
> > > HTH
> > > Erick
> > >
> > > On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman  wrote:
> > >
> > > > Hi,
> > > >
> > > > What are the performance ramifications for using a function-based
> boost
> > > at
> > > > search time (through bf in dismax parser) versus an index-time boost?
> > > > Currently I'm using boost functions on a 15GB index of ~14mm
> documents.
> > > >  Our
> > > > queries generally match many thousands of documents.  I'm wondering
> if
> > I
> > > > would see a performance improvement by switching over to index-time
> > > > boosting.
> > > >
> > > > Thanks,
> > > >
> > > > Asif
> > > >
> > > > --
> > > > Asif Rahman
> > > > Lead Engineer - NewsCred
> > > > a...@newscred.com
> > > > http://platform.newscred.com
> > > >
> > >
> >
> >
> >
> > --
> > Asif Rahman
> > Lead Engineer - NewsCred
> > a...@newscred.com
> > http://platform.newscred.com
> >
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


RE: Index-time vs. search-time boosting performance

2010-06-04 Thread Jonathan Rochkind
The SolrRelevancyFAQ does suggest that both index-time and search-time boosting 
can be used to boost the score of newer documents, but doesn't suggest for what 
reasons/contexts one might choose one vs. the other.  It only provides an 
example of a search-time boost, though, so it doesn't answer the question of how 
to do an index-time boost, if that was a question.

http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents

Sorry, this doesn't answer your question, but it does contribute the fact that 
some author of the FAQ at some point considered an index-time boost not 
necessarily unreasonable.

From: Asif Rahman [a...@newscred.com]
Sent: Friday, June 04, 2010 11:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Index-time vs. search-time boosting performance

It seems like it would be far more efficient to calculate the boost factor
once and store it rather than calculating it for each request in real-time.
Some of our queries match tens of thousands if not hundreds of thousands of
documents in a 15GB index.  However, I'm not well-versed in lucene internals
so I may be misunderstanding what is going on here.


On Fri, Jun 4, 2010 at 8:31 PM, Jay Hill  wrote:

> I've done a lot of recency boosting to documents, and I'm wondering why you
> would want to do that at index time. If you are continuously indexing new
> documents, what was "recent" when it was indexed becomes, over time "less
> recent". Are you unsatisfied with your current performance with the boost
> function? Query-time recency boosting is a fairly common thing to do, and,
> if done correctly, shouldn't be a performance concern.
>
> -Jay
> http://lucidimagination.com
>
>
> On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman  wrote:
>
> > Perhaps I should have been more specific in my initial post.  I'm doing
> > date-based boosting on the documents in my index, so as to assign a
> higher
> > score to more recent documents.  Currently I'm using a boost function to
> > achieve this.  I'm wondering if there would be a performance improvement
> if
> > instead of using the boost function at search time, I indexed the
> documents
> > with a date-based boost.
> >
> > On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson  > >wrote:
> >
> > > Index time boosting is different than search time boosting, so
> > > asking about performance is irrelevant.
> > >
> > > Paraphrasing Hossman from years ago on the Lucene list (from
> > > memory).
> > >
> > > ...index time boosting is a way of saying this documents'
> > > title is more important than other documents' titles. Search
> > > time boosting is a way of saying "I care about documents
> > > whose titles contain this term more than other documents
> > > whose titles may match other parts of this query"
> > >
> > > HTH
> > > Erick
> > >
> > > On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman  wrote:
> > >
> > > > Hi,
> > > >
> > > > What are the performance ramifications for using a function-based
> boost
> > > at
> > > > search time (through bf in dismax parser) versus an index-time boost?
> > > > Currently I'm using boost functions on a 15GB index of ~14mm
> documents.
> > > >  Our
> > > > queries generally match many thousands of documents.  I'm wondering
> if
> > I
> > > > would see a performance improvement by switching over to index-time
> > > > boosting.
> > > >
> > > > Thanks,
> > > >
> > > > Asif
> > > >
> > > > --
> > > > Asif Rahman
> > > > Lead Engineer - NewsCred
> > > > a...@newscred.com
> > > > http://platform.newscred.com
> > > >
> > >
> >
> >
> >
> > --
> > Asif Rahman
> > Lead Engineer - NewsCred
> > a...@newscred.com
> > http://platform.newscred.com
> >
>



--
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com