Question re snapinstaller

2007-02-13 Thread Ken Krugler

Hi all,

In looking at the snapinstaller script, it seems to do the following:

1. Copy a new index directory from the master to the slave's Solr
data directory, giving it the name "index.tmp".


2. Delete the current index directory ("index").

3. Rename the temp index directory to be "index".

Then the commit script will send a <commit/> POST to the
.../solr/update service, and the new index gets swapped into use.


I feel like I must be missing something, because it seems like any 
request that's in the middle of being processed between step #2 and 
the end of a successful swap could fail due to the index changing 
underneath. Any insights here?


Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


Re: Question re snapinstaller

2007-02-13 Thread Yonik Seeley

On 2/13/07, Ken Krugler <[EMAIL PROTECTED]> wrote:

Hi all,

In looking at the snapinstaller script, it seems to do the following:

1. Copy a new index directory from the master to the slave's Solr
data directory, giving it the name "index.tmp".

2. Delete the current index directory ("index").

3. Rename the temp index directory to be "index".

Then the commit script will send a <commit/> POST to the
.../solr/update service, and the new index gets swapped into use.

I feel like I must be missing something, because it seems like any
request that's in the middle of being processed between step #2 and
the end of a successful swap could fail due to the index changing
underneath. Any insights here?


A Lucene IndexReader opens all the index files it needs when it is instantiated.
Changes to a Lucene index via IndexWriter never modify an existing
file... new files are always created.
Put the two together, and an IndexWriter (or anything else,
like snapinstaller) can change the index in the background without
affecting open readers.
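
[Editor's note: a minimal sketch of that point-in-time behavior (hypothetical
index path, Lucene 2.x-era API): a reader opened before a snapshot install
keeps answering from the files it already holds open.]

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class PointInTimeReader {
    public static void main(String[] args) throws Exception {
        // Open a reader against the live index directory (hypothetical path).
        IndexReader reader = IndexReader.open("/solr/data/index");
        IndexSearcher searcher = new IndexSearcher(reader);

        // ... snapinstaller may swap /solr/data/index underneath us here ...

        // The reader still searches the segment files it opened at
        // construction time; files added or removed since are not seen.
        Hits hits = searcher.search(new TermQuery(new Term("body", "solr")));
        System.out.println(hits.length() + " hits from the original snapshot");

        searcher.close();
        reader.close();
    }
}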

-Yonik


question about highlighting

2007-02-13 Thread nick19701

I can't locate any concrete examples of using highlighting.
After checking out the following wiki, 
http://wiki.apache.org/solr/HighlightingParameters

I sent my solr server the following request:

select?indent=on&version=2.2&q=dell&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=pageContent

pageContent is the default search field and the field I need highlighted. But
the xml response I got doesn't include any highlighting info.
What have I done wrong?



Re: question about highlighting

2007-02-13 Thread Andre Halama

nick19701 wrote:

Hi Nick,

> select?indent=on&version=2.2&q=dell&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=pageContent

try hl=true...

Hth,

Andre
--
Andre Halama
hbz, Gruppe Portale, Projekt vascoda (TB1), Softwareentwicklung
- DigiBib, Online-Fernleihe, Projekt vascoda -
Jülicher Str. 6, 50674 Köln, Deutschland
Telefon +49-221-40075-229, Fax +49-221-40075-190
[EMAIL PROTECTED] - www.vascoda.de - www.hbz-nrw.de


RE: Tagging

2007-02-13 Thread Binkley, Peter
I still wonder if there's a good way of storing the tags outside the
Lucene index and using them via facets whose bitsets are manipulated
directly rather than being populated from the index. In my project,
reindexing a document whenever a user adds a tag is very, very bad,
since we're indexing potentially hundreds of pages of full text in the
body field of the document. A solution that gets the tag into the system
immediately without forcing a reindexing of the document is essential.

Peter

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 12, 2007 7:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Tagging

There's no good solution yet.  There has been discussion of possible
approaches:

http://www.nabble.com/convert-custom-facets-to-Solr-facets...-tf3163183.html#a8790179

http://wiki.apache.org/solr/UserTagDesign



On 2/12/07, Gmail Account <[EMAIL PROTECTED]> wrote:
> I know that I've seen this topic before.. Is there a guideline on the
> best way to create tagging in solr?  For example, keeping track of
> what user tagged what item in solr. And faceting based on tags?
>
> Thanks,
> Mike
>
>


Re: question about highlighting

2007-02-13 Thread nick19701

Hi, Andre,
I tried hl=true. But it still doesn't work.

Here is my request:

select?indent=on&version=2.2&q=pageContent%3Adell&start=0&rows=10&fl=pageContent&qt=standard&wt=standard&explainOther=&hl=true&hl.fl=pageContent

This is part of the response:


<lst name="params">
 <str name="qt">standard</str>
 <str name="rows">10</str>
 <str name="explainOther"/>
 <str name="start">0</str>
 <str name="hl.fl">pageContent</str>
 <str name="indent">on</str>
 <str name="fl">pageContent</str>
 <str name="hl">true</str>
 <str name="q">pageContent:dell</str>
 <str name="wt">standard</str>
 <str name="version">2.2</str>
</lst>





Dell Business has their Dell W2306C 23" LCD HD-Ready TV for $649 - $150 with
coupon code  $$6FHFKBB6BSBJ  [Exp 2/6, 1900 uses] =  $499  with free
shipping. Features 1366 x 768 resolution, 550:1 contrast ratio, 16 ms
response time.  
Dell W2607C 26" LCD HDTV  $899 - $200 code  WT4035HNC7MGB2  =  $799 
Dell Dell W5001C 50" Plasma HDTV  $2599 - $400 code  BXMHNRV49$D744  = 
$2199 
Click Here 
dell has got to make these alot cheaper if they want to stay competetive. 
Holy overpriced TV Batman. 





question about synonyms

2007-02-13 Thread nick19701

Hi,
I put this line in my synonyms.txt

bestbuy,bb,best buy

I expect that when bb is searched, all results
including "bestbuy", "bb" or "best buy" will be returned.
But in my test I only got back the results which include "bestbuy"
or "best buy". The results which include "bb" are not returned.

What am I missing here?
-- 
View this message in context: 
http://www.nabble.com/question-about-synonyms-tf3222067.html#a8948902
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Question re snapinstaller

2007-02-13 Thread Ken Krugler

On 2/13/07, Ken Krugler <[EMAIL PROTECTED]> wrote:

Hi all,

In looking at the snapinstaller script, it seems to do the following:

1. Copy a new index directory from the master to the slave's Solr
data directory, giving it the name "index.tmp".

2. Delete the current index directory ("index").

3. Rename the temp index directory to be "index".

Then the commit script will send a <commit/> POST to the
.../solr/update service, and the new index gets swapped into use.

I feel like I must be missing something, because it seems like any
request that's in the middle of being processed between step #2 and
the end of a successful swap could fail due to the index changing
underneath. Any insights here?


A Lucene IndexReader opens all the index files it needs when it is instantiated.
Changes to a Lucene index via IndexWriter never modify an existing
file... new files are always created.
Put the two together, and an IndexWriter (or anything else,
like snapinstaller) can change the index in the background without
affecting open readers.


Right, but from looking at the snapinstaller script it seemed as 
though the deletion of the "index" directory would be deleting files 
out from under the IndexReader.


Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


Re: Question re snapinstaller

2007-02-13 Thread Mike Klaas

On 2/13/07, Ken Krugler <[EMAIL PROTECTED]> wrote:


>A Lucene IndexReader opens all the index files it needs when it is instantiated.
>Changes to a Lucene index via IndexWriter never modify an existing
>file... new files are always created.
>Put the two together, and an IndexWriter (or anything else,
>like snapinstaller) can change the index in the background without
>affecting open readers.

Right, but from looking at the snapinstaller script it seemed as
though the deletion of the "index" directory would be deleting files
out from under the IndexReader.


Files aren't truly deleted until no process has an open file descriptor:

http://users.actcom.co.il/~choo/lupg/tutorials/handling-files/handling-files.html#sys_file_unlink
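
[Editor's note: a tiny Java illustration of that POSIX behavior (hypothetical
temp-file path): data stays readable through an already-open stream even
after the name is unlinked.]

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;

public class UnlinkDemo {
    public static void main(String[] args) throws IOException {
        File f = new File("/tmp/unlink-demo.txt");  // hypothetical path
        FileWriter w = new FileWriter(f);
        w.write("still here");
        w.close();

        FileInputStream in = new FileInputStream(f); // open a descriptor
        f.delete();                                  // unlink the name

        // On POSIX filesystems the inode survives until the last open
        // descriptor closes, so this still prints "still here".
        BufferedReader r = new BufferedReader(new InputStreamReader(in));
        System.out.println(r.readLine());
        r.close();
    }
}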

-Mike


Re: question about synonyms

2007-02-13 Thread Yonik Seeley

On 2/13/07, nick19701 <[EMAIL PROTECTED]> wrote:


Hi,
I put this line in my synonyms.txt

bestbuy,bb,best buy

I expect that when bb is searched, all results
including "bestbuy", "bb" or "best buy" will be returned.
But in my test I only got back the results which include "bestbuy"
or "best buy". The results which include "bb" are not returned.


Are you using the synonyms at index time, query time, or both?
Did you reindex if you made changes to an "index" analyzer?
It would help if you post the fieldtype for the field you are searching.

-Yonik


Incremental replication...

2007-02-13 Thread escher2k

I was wondering if the scripts provided in Solr do incremental replication.
Looking at the script for snapshooter, it seems like the whole index
directory is copied over. Is that correct? If so, isn't performance a
problem over the long run? Thanks in advance for the clarification (I hope
I am wrong!).
-- 
View this message in context: 
http://www.nabble.com/Incremental-replication...-tf3222946.html#a8951862
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Incremental replication...

2007-02-13 Thread Graham Stead
We have used replication for a few weeks now and it generally works well.

I believe you'll find that commit operations cause only new segments to be
transferred, whereas optimize operations cause the entire index to be
transferred. Therefore, the amount of data transferred really depends on how
frequently you index new data and how often you call <commit/> and
<optimize/>.

Hope this helps,
-Graham




RE: Incremental replication...

2007-02-13 Thread escher2k


Graham Stead-2 wrote:
> 
> We have used replication for a few weeks now and it generally works well.
> 
> I believe you'll find that commit operations cause only new segments to be
> transferred, whereas optimize operations cause the entire index to be
> transferred. Therefore, the amount of data transferred really depends on how
> frequently you index new data and how often you call <commit/> and
> <optimize/>.
> 
> Hope this helps,
> -Graham
> 
> 
> 
> 

Thanks Graham. At least from looking at the snapshooter script, it doesn't
seem to be doing anything incremental.  The following is a fragment from the
script:

snap_name=snapshot.`date +"%Y%m%d%H%M%S"`
name=${data_dir}/${snap_name}
temp=${data_dir}/temp-${snap_name}

if [[ -d ${name} ]]
then
    logMessage snapshot directory ${name} already exists
    logExit aborted 1
fi

if [[ -d ${temp} ]]
then
    logMessage snapshoting of ${name} in progress
    logExit aborted 1
fi

# clean up after INT/TERM
trap 'echo cleaning up, please wait ...;/bin/rm -rf ${name} ${temp};logExit aborted 13' INT TERM

logMessage taking snapshot ${name}

# take a snapshot using hard links into temporary location
# then move it into place atomically
cp -lr ${data_dir}/index ${temp}
mv ${temp} ${name}



Re: Incremental replication...

2007-02-13 Thread Bertrand Delacretaz

On 2/13/07, escher2k <[EMAIL PROTECTED]> wrote:


...At least from looking at the snapshooter script, it doesn't
seem to be doing anything incremental...


The snapshooter script only makes an "instant snapshot" of the index
directory using cp -lr. This does not involve any copying of index
data.

The actual replication is done using rsync in the other scripts, by
copying the index snapshot elsewhere.

Rsync only copies what has changed since the last copy, and few files
change in a Lucene index when adding documents, so incremental
replication uses little bandwidth when adding documents.

Index optimization, OTOH, causes much larger changes in the index
directory, so after an optimization rsync will usually have much more
data to transfer.

-Bertrand


Re: question about synonyms

2007-02-13 Thread nick19701


Yonik Seeley wrote:
> 
> Are you using the synonyms at index time, query time, or both?
> Did you reindex if you made changes to an "index" analyzer?
> It would help if you post the fieldtype for the field you are searching.
> 

I am using the synonyms only at query time.
Below is the field analysis.
It seems like the culprit is the space in the phrase "best buy" in
synonyms.txt.
What should I do about it? Put quotes around it?

BTW, the default operator is "AND":
<solrQueryParser defaultOperator="AND"/>


Index Analyzer (each stage passes the single term "bb" through unchanged:
position 1, type word, source start,end 0,2):

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
org.apache.solr.analysis.WordDelimiterFilterFactory {catenateWords=1,
catenateNumbers=1, catenateAll=0, generateNumberParts=1, generateWordParts=1}
org.apache.solr.analysis.LowerCaseFilterFactory {}
org.apache.solr.analysis.EnglishPorterFilterFactory {protected=protwords.txt}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}

Query Analyzer:

org.apache.solr.analysis.WhitespaceTokenizerFactory {} produces "bb"
(position 1, type word, source start,end 0,2).

org.apache.solr.analysis.SynonymFilterFactory {expand=true, ignoreCase=true,
synonyms=synonyms.txt} expands that to four terms, which every later stage
passes through unchanged:

    position 1: bestbuy, bb, best   (type word, source start,end 0,2)
    position 2: buy                 (type word, source start,end 0,2)

The later stages are:

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
org.apache.solr.analysis.WordDelimiterFilterFactory {catenateWords=0,
catenateNumbers=0, catenateAll=0, generateNumberParts=1, generateWordParts=1}
org.apache.solr.analysis.LowerCaseFilterFactory {}
org.apache.solr.analysis.EnglishPorterFilterFactory {protected=protwords.txt}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}


Re: Gentoo: problem with xml-apis.jar/Apache Tomcat Native Library

2007-02-13 Thread Chris Hostetter

Solr isn't really doing anything particularly special when this
RuntimeException occurs; the line is simply...

  static final XPathFactory xpathFactory = XPathFactory.newInstance();

According to the 1.5 javadocs for this method...

Get a new XPathFactory instance using the default object model,
DEFAULT_OBJECT_MODEL_URI, the W3C DOM.
This method is functionally equivalent to:
   newInstance(DEFAULT_OBJECT_MODEL_URI)
Since the implementation for the W3C DOM is always available, this
method will never fail.

I suspect you could create a non-Solr Tomcat/Gentoo test case for this
with a simple HelloWorldServlet that had the same line, and then perhaps
the Tomcat or gentoo user communities would be able to shed some more
light on the issue.
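
[Editor's note: a minimal sketch of such a test case (hypothetical class
name, standard Servlet API), whose only job is to run the same failing line
outside Solr.]

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.xml.xpath.XPathFactory;

public class HelloXPathServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // The same call Solr makes; this throws the RuntimeException
        // quoted below if the container's JAXP setup is broken.
        XPathFactory factory = XPathFactory.newInstance();
        resp.getWriter().println("XPathFactory: " + factory.getClass().getName());
    }
}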

(I honestly have no idea what '"The Apache Tomcat Native library"
extensions in Gentoo' are).


: Caused by: java.lang.RuntimeException: XPathFactory#newInstance() failed to
: create an XPathFactory for the default object model:
: http://java.sun.com/jaxp/xpath/dom with the
: XPathFactoryConfigurationException:
: javax.xml.xpath.XPathFactoryConfigurationException: No XPathFctory
: implementation found for the object model:
: http://java.sun.com/jaxp/xpath/dom
:
: at javax.xml.xpath.XPathFactory.newInstance(Unknown Source)
:
: at org.apache.solr.core.Config.(Config.java:49)



-Hoss



Re: question about highlighting

2007-02-13 Thread Chris Hostetter

: This is part of the response:

what's the rest of the response?

highlighting info comes in a separate block, after the <result> section.
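
[Editor's note: a rough sketch of what that block looks like in the standard
XML response; the document id and snippet values here are hypothetical.]

<lst name="highlighting">
 <lst name="SOME_DOC_ID">
  <arr name="pageContent">
   <str>... <em>Dell</em> Business has their <em>Dell</em> W2306C ...</str>
  </arr>
 </lst>
</lst>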


(for the record, "hl=on" should work fine too)


-Hoss



Re: question about synonyms

2007-02-13 Thread Chris Hostetter

: I am using the synonyms only at query time.
: Below is the field analysis.

FYI: I think what Yonik meant was the section of your schema.xml that
defines the fieldtype.

: It seems like the culprit is the space in the phrase "best buy" in
: synonyms.txt.

because of some limitations in the way Analyzers can indicate that
multiple tokens occupy the same space, multiword synonyms are inherently
tricky ... there is extensive discussion of this in the wiki:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46

...in a nutshell: there is no clean way to do query-time multiword
synonyms.


-Hoss



Re: Question re snapinstaller

2007-02-13 Thread Bill Au

Solr snapshots are created using hard links.  A file is not deleted as
long as there are one or more links to it.

Bill




Re: question about synonyms

2007-02-13 Thread Yonik Seeley

On 2/13/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: I am using the synonyms only at query time.
: Below is the field analysis.

FYI: I think what Yonik meant was the section of your schema.xml that
defines the fieldtype.

: It seems like the culprit is the space in the phrase "best buy" in
: synonyms.txt.

because of some limitations in the way Analyzers can indicate that
multiple tokens occupy the same space, multiword synonyms are inherently
tricky ... there is extensive discussion of this in the wiki:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46

...in a nutshell: there is no clean way to do query-time multiword
synonyms.


To be clear, no clean way to do *expansion* as opposed to reduction at
query time, when the alternatives are of different lengths.

You could use index-time expansion, a combination of index-time and
query-time reduction on the same synonym dictionary, or only handle
the multi-token alternatives during indexing with expansion, and do
query-time synonym expansion on the remaining alternatives.
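
[Editor's note: for example (assuming the plain-list and arrow syntaxes of
Solr's SynonymFilterFactory), an index-time analyzer with expand=true and the
entry

    bestbuy, bb, best buy

indexes all three variants wherever any one of them occurs, while query-time
reduction entries such as

    best buy => bestbuy
    bb => bestbuy

map the alternatives down to the single indexed token instead (though, as the
follow-up below notes, the query parser's whitespace splitting keeps the
multiword reduction from ever firing).]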

-Yonik


Re: Question re snapinstaller

2007-02-13 Thread Yonik Seeley

On 2/13/07, Bill Au <[EMAIL PROTECTED]> wrote:

Solr snapshots are created using hard links.  A file is not deleted as
long as there are one or more links to it.


Or a process that holds it open.  It would work even if there were no
links in the filesystem because the IndexReader would still be holding
them open.

-Yonik


Re: Incremental replication...

2007-02-13 Thread Bill Au

FYI, additional information on replication is available in the Solr TWiki:

http://wiki.apache.org/solr/CollectionDistribution

Bill




Re: question about synonyms

2007-02-13 Thread Chris Hostetter

: To be clear, no clean way to do *expansion* as opposed to reduction at
: query time, when the alternatives are of different lengths.

Reduction at query time doesn't work either ... when the query parser sees the
string:
my best buy
...it analyzes each whitespace-separated string separately, so a synonym
reduction of "best buy" => bestbuy won't ever be triggered.  As I said, this
is all covered in the wiki. (It's probably the topic with the most complete
coverage in the wiki: multiword synonyms kicked my ass up and down the
street about a year ago.)



-Hoss



Re: convert custom facets to Solr facets...

2007-02-13 Thread Yonik Seeley

On 2/12/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:

On Feb 12, 2007, at 9:10 PM, Gmail Account wrote:
> This would be great!  I can't help with the solution but I am very
> interested in using it if one of you guys can figure it out.
>
> I can't wait to see if this works out.

And just for the record, Solr drives Collex @ NINES, which implements
tagging along with faceted and full-text search.  I've recently hacked
our system such that the bulk of our custom caches are only refreshed
when a new batch of data is loaded, and only the "collectable cache" is
updated on a <commit/>.  This reduced our new index searcher visibility
time from 45 seconds down to only a few seconds or less.


Wow!
This (the warming time) is a major benefit to having separate
documents that reference "main" documents that are updated less
frequently.
The hard part would be to generalize something like that.
A separate index for each document type sounds like it would help.


As part of Flare, I will experiment with the tagging design Yonik has
posted to the wiki but for now our "legacy" application is running
fine with my early hacks.


My design so far was to show how we could get very far with what we
have now (or almost have now, with updateable documents).  However, it
doesn't take into account things like warming time.

It might be a while before we could come up with any kind of generic
mechanism that could perform as well as your "hacks" w.r.t warming
speed :-)

-Yonik


Re: question about synonyms

2007-02-13 Thread Yonik Seeley

On 2/13/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: To be clear, no clean way to do *expansion* as opposed to reduction at
: query time, when the alternatives are of different lengths.

Reduction at query time doesn't work either ... when the query parser sees the
string:
my best buy
...it analyzes each whitespace-separated string separately


Unless you put it in a phrase query.  But yes, that's not as flexible
and would probably cause pain with the dismax handler.

-Yonik


Re: Tagging

2007-02-13 Thread Yonik Seeley

On 2/13/07, Binkley, Peter <[EMAIL PROTECTED]> wrote:

I still wonder if there's a good way of storing the tags outside the
Lucene index and using them via facets whose bitsets are manipulated
directly rather than being populated from the index. In my project,
reindexing a document whenever a user adds a tag is very, very bad,
since we're indexing potentially hundreds of pages of full text in the
body field of the document. A solution that gets the tag into the system
immediately without forcing a reindexing of the document is essential.


Interesting... what are you indexing that is that large, the book contents?
You could build a custom request handler and store tag info outside
the index.  You could also store it inside the index in separate
documents as Erik does with Collex.

For a more general solution, I'm thinking a separate lucene index
might be ideal.

-Yonik


non-relative scoring

2007-02-13 Thread solr
Is it possible to generate a non-relative score for each result from Solr?

I would like to be able to generate a web page that shows the first 3
results' scores as 87%, 73%, and 72%.  If the range of solr document match
scores were between 0 and 1, it would be easy.  But I never know what my
MaxScore is going to be. Sometimes it's around 2.5 but sometimes reaches
6.



Re: non-relative scoring

2007-02-13 Thread Walter Underwood
You can declare the top result to be 100% and scale from there.
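
[Editor's note: a minimal sketch of that scaling, with hypothetical score
values; maxScore comes back in the response when fl includes score.]

public class RelativeScore {
    public static void main(String[] args) {
        double maxScore = 2.5;                 // maxScore from the response
        double[] scores = {2.5, 2.17, 1.8};    // hypothetical per-doc scores
        for (double s : scores) {
            // prints 100%, 87%, 72%
            System.out.printf("%.0f%%%n", 100.0 * s / maxScore);
        }
    }
}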

"Percent relevant" is not a concept that really holds together.
What does it mean to be 100% relevant? I'm not even sure what
"twice as relevant" means.

A tf.idf engine, like Lucene, might not have a maximum score.
What if a document contains the word a thousand times?
A million times?

An engine with a probabilistic model does have absolute scores,
but those have other problems.

wunder

On 2/13/07 3:58 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Is it possible to generate a non-relative score for each result, from solr?
> 
> I would like to be able to generate a web page that shows the first 3
> results' scores as 87%, 73%, and 72%.  If the range of solr document match
> scores were between 0 and 1, it would be easy.  But I never know what my
> MaxScore is going to be. Sometimes it's around 2.5 but sometimes reaches
> 6.
> 



Help with tuning solr

2007-02-13 Thread Ian Meyer

All,

I'm having some performance issues with solr. I will give some
background on our setup and implementation of solr. I'm completely
open to reworking everything if the way we are currently doing things
are not optimal. I'll try to be as verbose as I can in explaining all
of this, but feel free to ask more questions if something doesn't make
sense.

Firstly, we have three messageboards of varying traffic, totaling
about 225K hits per day. Search is used maybe 500 times a day. Each
board has its own two instances of solr, with Tomcat as the container,
loaded via JNDI. One instance is for topics, one instance for the
posts themselves. I feel as though this may not be optimal, but I
can't think of a better way to handle this. After reading the schema,
maybe someone will have some better ideas. We use php to interface
with solr, and we do some sorting on relevance and on the date, and my
thought was that could be causing solr to run out of memory.

The boards are bco, vlv and wbc. I'll list the number of docs for each
below along with how many added per day.

bco (topics): 180,530 (~200 added daily)
bco (posts): 3,961,053 (~5,000 added daily)
vlv (topics): 3,817 (~200 added daily)
vlv (posts): 84,005 (~7,000 added daily)
wbc (topics): 29,603 (~50 added daily)
wbc (posts):  739,660 (~1000 added daily)

total: ~5 million total docs, with ~13.5K added per day.

we add docs at :00 for bco, :20 for wbc, :40 for vlv. we feel an hour
is a good enough amount of time to where results aren't lagged too
much.  the add process is fast, as well as the commit and i'm more
than impressed with solr's ability to handle the load it does.

The server hardware is 4GB memory, 1 dual-core 2GHz Opteron, RAID 10
SATA. The machine runs PostgreSQL, PHP and Apache. I feel that this
isn't optimal either, but the cost to buy another server to separate
either the solr or Postgres component is too great right now. Most of
the errors I see are the jvm running out of heap space. The jvm is set
to use the default max heap size (256m I think?). I can't increase
it too much, because Postgres needs as much memory as it can get so the
databases will still reside in memory.

My first implementation of search for these sites was with pyLucene,
and while that was fast, there was some sort of bug where docs I added
to the index wouldn't show up until I optimized the index. The optimize
ate up too much cpu and hosed the server while it ran, eventually
taking upwards of 2 hours at 99% cpu, and that's just no good. :)

When I set up solr, I had cache warming enabled and that also caused
the server to choke way too soon.  So I turned that off and that
seemed to hold things off for a while.

I've attached the schemas and configs to this email so you can see how
we have things set up. Every site is the same (config-wise) so just
the names are different. It's relatively simple and I feel like the
jvm shouldn't be choking so soon, but, who knows. :)

One thought we had was having two instances of solr, with a board_id
field and the id field as the unique id, but I wasn't sure if solr
supported compound unique ids.. if not, that would make that solution
moot.

Hopefully this makes sense, but if not, ask me for clarification on
whatever is unclear.

Thanks in advance for your help and suggestions!
Ian

[schema.xml and solrconfig.xml were attached here, but the archive has
stripped their XML markup. The surviving fragments include: uniqueKey "id",
defaultSearchField "body", dataDir /opt/db/solr/bco_posts, index settings of
useCompoundFile=false, mergeFactor=10, maxBufferedDocs=1000,
maxMergeDocs=2147483647, and the admin pingQuery
qt=dismax&q=solr&start=3&fq=id:[* TO *]&fq=cat:[* TO *]]


Re: Tagging

2007-02-13 Thread Erik Hatcher
There is also the possibility of keeping tags with the original
documents and having them individually updated without having to
resend the original full text as well.

And yeah, Peter is a solr4lib kinda guy, doing some way cool stuff
with Lucene and Solr already.

With separate indexes we're back to a relational model, which adds a
lot of complexity.  For example, I cannot use MoreLikeThis with tags
to allow commonly tagged objects to be considered similar.  I'm sure
there are other ways to implement that sort of thing, though I've not
thought it through.


Erik






Re: Tagging

2007-02-13 Thread Yonik Seeley

On 2/13/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:

There is also the possibility of keeping tags with the original
documents and having them individually updated without having to
resend the original full text as well.


But it does require having all original fields stored, and does
re-analyze and re-index.
Then there's the caching issue... you've changed the index and
internal docids, and all the filters needed for efficient faceting
need to be re-generated (document ones too, not just the tag related
ones, since they are on the same documents).

I do agree that tags-on-docs is desirable and simpler, if the
performance is acceptable from both a re-indexing perspective, and a
time-to-viewable perspective.  The latter will probably be a bigger
problem than the former unless you have a really popular site.


And yeah, Peter is a solr4lib kinda guy, doing some way cool stuff
with Lucene and Solr already.


FYI, your mailer is always breaking your links... I always have to
cut-n-paste them back together again.

-Yonik


Re: Tagging

2007-02-13 Thread Erik Hatcher


On Feb 13, 2007, at 9:01 PM, Yonik Seeley wrote:

And yeah, Peter is a solr4lib kinda guy, doing some way cool stuff
with Lucene and Solr already.


FYI, your mailer is always breaking your links... I always have to
cut-n-paste them back together again.


The links are completely intact when viewing my own messages (and
others with long links that are surrounded by angle brackets) in that
same mailer (Mail.app on Mac OS X).  *shrugs*


Erik



Re: Tagging

2007-02-13 Thread Yonik Seeley

On 2/13/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:


On Feb 13, 2007, at 9:01 PM, Yonik Seeley wrote:
>> And yeah, Peter is a solr4lib kinda guy, doing some way cool stuff
>> with Lucene and Solr already: > search/?
>> search=raw&pageNumber=1&index=peelbib&field=body&rawQuery=dog&digstat
>> us=
>> on>
>
> FYI, your mailer is always breaking your links... I always have to
> cut-n-paste them back together again.

The links are completely intact when viewing my own messages (and
others with long links that are surrounded by angle brackets) in that
same mailer (Mail.app on Mac OS X).  *shrugs*


Nabble thinks they're broken too:
http://www.nabble.com/Re%3A-Tagging-p8957261.html
vs
http://www.nabble.com/Re%3A-question-about-synonyms-p8954457.html

-Yonik


Re: Help with tuning solr

2007-02-13 Thread Yonik Seeley

Yes, sorting by fields does take up memory (the fieldcache).
256M is pretty small for a 5M doc index.
If you have any more memory slots, spring for some more memory (a
little over $100 for 1GB).

Lucene also likes to have free memory left over available for OS cache
-  otherwise searches start to be limited by disk bandwidth... not a
good thing.

To try and lessen the memory used by the Lucene FieldCache, you might
try lowering the mergeFactor of the index (see solrconfig.xml).  This
will cause more merges, slowing indexing, but it will squeeze out
deleted documents faster.  Also, try to optimize as often as possible
(nightly?) for the same reasons.
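
[Editor's note: for example (assuming the stock solrconfig.xml layout), that
means changing the value in the <mainIndex> section from the example default
of

    <mergeFactor>10</mergeFactor>

to something smaller like 4.]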

-Yonik



Re: Tagging

2007-02-13 Thread Erik Hatcher


On Feb 13, 2007, at 9:23 PM, Yonik Seeley wrote:


On 2/13/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:


On Feb 13, 2007, at 9:01 PM, Yonik Seeley wrote:
>> And yeah, Peter is a solr4lib kinda guy, doing some way cool stuff
>> with Lucene and Solr already: > search/?
>>  
search=raw&pageNumber=1&index=peelbib&field=body&rawQuery=dog&digstat

>> us=
>> on>
>
> FYI, your mailer is always breaking your links... I always have to
> cut-n-paste them back together again.

The links are completely intact when viewing my own messages (and
others with long links that are surrounded by angle brackets) in that
same mailer (Mail.app on Mac OS X).  *shrugs*


Nabble thinks they're broken too:
http://www.nabble.com/Re%3A-Tagging-p8957261.html
vs
http://www.nabble.com/Re%3A-question-about-synonyms-p8954457.html


(sending this message as rich text instead of plain text - there are
no wrapping options in Mail.app that I've found).

Sorry if I'm sending things mangled somehow - and if anyone has
suggestions on correcting it, I'm all ears.


There is some precedent for putting angle brackets around URLs in
e-mails: this mechanism was documented in Tim Berners-Lee's original
URL format specification, RFC 1738:


APPENDIX: Recommendations for URLs in Context

   URIs, including URLs, are intended to be transmitted through
   protocols which provide a context for their interpretation.

   In some cases, it will be necessary to distinguish URLs from other
   possible data structures in a syntactic structure. In this case, is
   recommended that URLs be preceeded with a prefix consisting of the
   characters "URL:". For example, this prefix may be used to
   distinguish URLs from other kinds of URIs.

   In addition, there are many occasions when URLs are included in other
   kinds of text; examples include electronic mail, USENET news
   messages, or printed on paper. In such cases, it is convenient to
   have a separate syntactic wrapper that delimits the URL and separates
   it from the rest of the text, and in particular from punctuation
   marks that might be mistaken for part of the URL. For this purpose,
   is recommended that angle brackets ("<" and ">"), along with the
   prefix "URL:", be used to delimit the boundaries of the URL.  This
   wrapper does not form part of the URL and should not be used in
   contexts in which delimiters are already specified.

   In the case where a fragment/anchor identifier is associated with a
   URL (following a "#"), the identifier would be placed within the
   brackets as well.

   In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may
   need to be added to break long URLs across lines.  The whitespace
   should be ignored when extracting the URL.

   No whitespace should be introduced after a hyphen ("-") character.
   Because some typesetters and printers may (erroneously) introduce a
   hyphen at the end of line when breaking a line, the interpreter of a
   URL containing a line break immediately after a hyphen should ignore
   all unencoded whitespace around the line break, and should be aware
   that the hyphen may or may not actually be part of the URL.

-


Re: Help with tuning solr

2007-02-13 Thread Ian Meyer

On 2/13/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

Yes, sorting by fields does take up memory (the fieldcache).
256M is pretty small for a 5M doc index.
If you have any more memory slots, spring for some more memory (a
little over $100 for 1GB).


Yeah, I'll see if I can give solr a bit more.



Lucene also likes to have free memory left over available for OS cache
-  otherwise searches start to be limited by disk bandwidth... not a
good thing.



To try and lessen the memory used by the Lucene FieldCache, you might
try lowering the mergeFactor of the index (see solrconfig.xml).  This
will cause more merges, slowing indexing, but it will squeeze out
deleted documents faster.  Also, try to optimize as often as possible
(nightly?) for the same reasons.


Ah, I don't know if I mentioned, but we're optimizing nightly when
impressions are at their lowest. So, I will lower the mergeFactor and
re-load all of the docs to see if that helps us out. I believe I left
it high when we were tuning for the initial loading of ~4M docs, before
we realized batching them into groups of 1000 before doing a commit
(instead of add, commit, add, commit, etc.) was a more efficient way of
doing it. As it stands, loading ~600 docs takes about 2 seconds, so if
it takes 15 seconds, I won't complain. :)

Thanks for the tips.

- Ian






Re: Help with tuning solr

2007-02-13 Thread Mike Klaas

On 2/13/07, Ian Meyer <[EMAIL PROTECTED]> wrote:

On 2/13/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Yes, sorting by fields does take up memory (the fieldcache).
> 256M is pretty small for a 5M doc index.
> If you have any more memory slots, spring for some more memory (a
> little over $100 for 1GB).

Yeah, I'll see if I can give solr a bit more.


I'll second that--it is the cheapest way of improving Solr performance
(_way_ cheaper than dev time)


> To try and lessen the memory used by the Lucene FieldCache, you might
> try lowering the mergeFactor of the index (see solrconfig.xml).  This
> will cause more merges, slowing indexing, but it will squeeze out
> deleted documents faster.  Also, try to optimize as often as possible
> (nightly?) for the same reasons.

Ah, I don't know if I mentioned, but we're optimizing nightly when
impressions are at their lowest. So, I will lower the mergeFactor and
re-load all of the docs to see if that helps us out.. I believe I left


Once you've optimized, the merge factor is irrelevant--the index is
already as tight as possible.  At that point, it should only affect
your incremental updates, and then only if you add dups (i.e., do
deletions).

-Mike


Re: Tagging

2007-02-13 Thread Mike Klaas

On 2/13/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:


Sorry if I'm sending things mangled somehow - and if anyone has
suggestions on correcting I'm all ears.


Unfortunately, no.


There is some precedent for putting angle brackets around URLs in
e-mails: this mechanism was documented in Tim Berners-Lee's original
URL format specification, RFC 1738:


Absolutely.  The problem seems to be that Mail.app does not recognize
this specification, not that you are following it.

-Mike


Re: Tagging

2007-02-13 Thread Yonik Seeley

On 2/13/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:


On Feb 13, 2007, at 9:01 PM, Yonik Seeley wrote:
>> And yeah, Peter is a solr4lib kinda guy, doing some way cool stuff
>> with Lucene and Solr already: > search/?
>> search=raw&pageNumber=1&index=peelbib&field=body&rawQuery=dog&digstat
>> us=
>> on>
>
> FYI, your mailer is always breaking your links... I always have to
> cut-n-paste them back together again.

The links are completely intact when viewing my own messages (and
others with long links that are surrounded by angle brackets) in that
same mailer (Mail.app on Mac OS X).  *shrugs*


I think it's the spaces at the ends of your lines that mess up most
other clients trying to put the URL back together again.  I
cut'n'pasted your message to myself, and gmail put it all back
together, except the last "on>", presumably because the previous line
ended with a "=".

I guess when your mail client wraps your lines, it breaks on a space,
but instead of replacing the space with a newline, it adds the newline
after the space.

-Yonik