Re: Index-time Boosting

2006-12-04 Thread Tracey Jaquith

Hi all,

[initially I replied to a thread which went to Mike Klaas' email, so after
his helpful reply, I'm trying to merge this back into the list discussion]

Quick intro.  Server Engineer at Internet Archive.
I just spent a mere 3 days porting nearly our entire site to use your 
*wonderful* project!


I, too, am looking for a kind of "boosting".
If I understand your reply here, if i reindex *all* my documents with
   <field name="title" boost="100">i'm super, thanks for asking!</field>
and make sure that any subsequent incremental (re)indexing of documents
use that same extra ' boost="100" ', then I should be making the relevance
of the title in our documents 100x (or whatever that translates to) "heavier"
than other non-title fields, correct?

I know this prolly isn't the relevant place to otherwise gush,
but THANK YOU for this fantastic (and maintained!) code
and we look forward to using this in the near future on our site!
Go opensource!

--tracey jaquith

[We are most interested in always having "title", "description", and a few other
fields boosted.  We have both user queries of phrases/words as well as
"field-specific" queries (eg: "mediatype:movies AND collection:prelinger")
so my thought is std might be better than dismax.
I've tried some experiments, adjusting the boosts at index time and running
the std handler to see the ordering of the results change for "fieldless queries"
(eg: "q=tracey+pooh").  I have 33 fields using <copyField source="..." dest="text"/>
(where "text" is our default field to query)
to allow for checking across most of our std XML fields.  I gather that a boost
applied to "title" on indexing a document must somehow "propagate" to the
"text" field?   Otherwise, I'm not sure how playing with boosts to fields
not named "text" would cause any change on the ranking of results for queries
like "q=tracey+pooh".  Am I starting to catch on?]




Re: Index-time Boosting

2006-12-05 Thread Tracey Jaquith

Hi Yonik!

Yonik Seeley wrote:

On 12/5/06, Tracey Jaquith <[EMAIL PROTECTED]> wrote:

Quick intro.  Server Engineer at Internet Archive.
I just spent a mere 3 days porting nearly our entire site to use your
*wonderful* project!

I, too, am looking for a kind of "boosting".
If I understand your reply here, if i reindex *all* my documents with
   <field name="title" boost="100">i'm super, thanks for asking!</field>
and make sure that any subsequent incremental (re)indexing of documents
use that same extra ' boost="100" ', then I should be making the relevance
of the title in our documents 100x (or whatever that translates to) "heavier"
than other non-title fields, correct?

I know this prolly isn't the relevant place to otherwise gush,
but THANK YOU for this fantastic (and maintained!) code
and we look forward to using this in the near future on our site!
Go opensource!


Welcome aboard!

From a "fresh" user perspective, what was your hardest or most
confusing part of starting to use Solr?

Thanks!  Well, we presently have a (very badly) homegrown version of an
SE that has lucene + jetty under the hood.  It locks up a lot (badly
threaded), hangs on updates, and generally has "persona non grata"
status with developers here where no one wants to touch it.  So the
*easiest* thing about Solr was the fact that it uses lucene query syntax
(like ours).  The hardest parts were:
1) I tried to make ant run from the included ant.jar (w/o getting the
latest ant from apache) and spent an hour or so before just getting the latest ant.
2) Our SE starts responses with document "1".  Initially (totally my
oversight, from going a little too fast) I just directly "translated"
that concept, so I was crushed to find a lot of my documents weren't
coming back like they should.  Once I figured out I needed to make
"start=0", not "start=1", everything was great.
3) boosts!  I spent just about 2 days porting our entire site over (we have
a nice PHP toggle "define('SOLR', 1);" now in a single place to cut over
to it); it took only 2 hours (clock time) to index our site (about 450K
documents).  But now I've spent about 1-1/2 days experimenting and not
quite getting the boosts right 8-)



[We are most interested in always having "title", "description", and a
few other fields boosted.  We have both user queries of phrases/words as well as
"field-specific" queries (eg: "mediatype:movies AND collection:prelinger")
so my thought is std might be better than dismax.


Yes, for the example above you want the standard request handler
because you are searching for different things in different fields
rather than the same thing in different fields.

However, there are multiple ways of doing everything...
It looks like at least some of your clauses are restrictions rather
than full-text queries, and can be more efficiently modeled as
filters.  Since filters are cached separately, this can lead to a
large increase in performance.

So in either the standard or dismax handlers, you could do
q="foo bar"&fq=mediatype:movies&fq=collection:prelinger

OK, great to know.  I'll prolly stick with our current "pass through" of
our queries in lucene syntax, and in the future, for speedups,
start moving some of the filters to "&fq=".
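(A full request in that style would look roughly like the following; the parameter
names are the standard ones Yonik mentions, and the field values are just ones that
appear elsewhere in this thread:

   http://localhost:8983/solr/select?q=%22grateful+dead%22&fq=mediatype:movies&fq=collection:prelinger&start=0&rows=10

where the q clause is parsed as a full-text query and each fq becomes a separately
cached filter.)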
 I've tried some experiments, adjusting the boosts at index time and running
 the std handler to see the ordering of the results change for "fieldless queries"
 (eg: "q=tracey+pooh").  I have 33 fields using <copyField source="..." dest="text"/>
 (where "text" is our default field to query)
 to allow for checking across most of our std XML fields.  I gather that a boost
 applied to "title" on indexing a document must somehow "propagate" to the
 "text" field?


Background: for an indexed field name there is a single boost value
per document.  This is true even if the field is multi-valued... all
values for that document "share" the same boost.  This is a Lucene
restriction so we can't fix it in Solr in any way.
ok, that's no problem for us -- our main two fields to boost are "singletons"
anyway.  the other two fields we boost can have multiple values, but
most of the time, in practice, they won't matter.  of course, great to know.

Solr *does* propagate the index-time boost when doing copyField, but
this just ends up being multiplied into all the other boosts for
values for that document.   Matches on the resulting text field will
*always* score higher, regardless of which "part" matched.  Does that
make sense?

OK, that *mostly* is making sense.  Let me see if I'm understanding it
mostly.  I'm thinking (after still thrashing around a bit) that the way that
seems to be getting the results I *expect* (or at least, that we are likely
use

Re: Index-time Boosting

2006-12-05 Thread Tracey Jaquith




wow, that makes sense now.  my bad.
OK, great.  further testing shows "you mean what you say" -- not
only verbatim, but case sensitive.

so for my dwindling number of remaining "string" types, in my XSL
transform (on the input to index the doc) i'll lowercase them all, too
8-)

thanks!!
--t

Yonik Seeley wrote:
On 12/5/06, Tracey Jaquith <[EMAIL PROTECTED]> wrote:

  Now I have one new mystery that's popped up for me.

  With std req handler, this simple query

      q=title:commute

  is *not* returning me all documents that have the word "commute" in the title.

  There must be some other filter/clause or something happening that I'm not
  aware of?

  (For example, I do "indent=on&fl=title&q=commute" in a wget and grep the
  results for  and then grep -i for commute, there are 23 hits.  But doing
  "&q=title:commute" only returns one of those hits..)

title in your schema is of type "string" which indexes the whole value
verbatim.

There is only one document with title:commute

Most likely you want to change the type of that field to "text" or
some other analyzed type that at least breaks apart words by
whitespace.

-Yonik


-- 
--Tracey Jaquith - http://www.archive.org/~tracey --




Re: Index-time Boosting

2006-12-05 Thread Tracey Jaquith

Hi Mike,

OK, I guess my "problem" is more of a partially still coming up
to speed / partially wanting to be lazy.

If I make a dismax handler called "dissed", I'd like it to "work"
whether i pass in "commute" or "title:commute" to the query.
(Now I *do* realize those are two completely different kinds of
queries -- the 1st would be all docs with "commute" in the default field
(which is most of our fields copied into it, so essentially the whole document)
and the 2nd would be all docs with "commute" in the "title" field
(now that I've redone the field "type" to be "text" and not "string"
as Yonik pointed out)).

So this returns no documents because, I gather, you can't feed
the "field:value" lucene syntax directly in to a dismax handler
(although you can to a standard handler):
  indent=on&fl=identifier&q=title:commute&qt=dissed

So I think simply my breaking up the queries from our search bar
(example "raw" formats:
  grateful dead
  "grateful dead"
  mediatype:movies AND collection:prelinger )
into an expanded query of:
 description:"[clause]"^10 text:"[clause]"^1
for each clause will work the best for us.

Is there any lucene or solr class / method that can break up
a string into clauses (eg: split on AND, OR, NOT, ()s, etc.)?

--tracey


Mike Klaas wrote:

On 12/5/06, Tracey Jaquith <[EMAIL PROTECTED]> wrote:


Now I have one new mystery that's popped up for me.
With std req handler, this simple query
q=title:commute
is *not* returning me all documents that have the word "commute" in the
title.
There must be some other filter/clause or something happening that 
I'm not

aware of?
(For example, I do "indent=on&fl=title&q=commute" in a wget and grep the
results
 for  and then grep -i for commute, there are 23 hits.  But doing
 "&q=title:commute" only returns one of those hits..)


Indeed--those are different queries.  The "fl" parameter controls
the stored fields returned by Solr; it does not affect which documents
are returned.  The first query asks for the titles of all documents
containing the word "commute", the second for all documents with
"commute" in their title.

see http://wiki.apache.org/solr/CommonQueryParameters

I'm not sure what problems you are experiencing with dismax, but it is
important to note that you cannot specify a raw lucene query in the
"q" parameter of a dismax handler.  If you want to search for a word
across fields, you can specify the qf (query fields) parameter.

eg.
q=commute
qf=title^10 body
(see 
http://incubator.apache.org/solr/docs/api/org/apache/solr/request/DisMaxRequestHandler.html) 



Turning on debugQuery=true is invaluable for determining what factors
are influencing scoring.

Was your previous solution QueryParser-based?  If so, you should be
able to use the exact same queries as before, passed to
StandardRequestHandler (assuming the fields are also set up
identically).

cheers,
-Mike


--
*--Tracey Jaquith - http://www.archive.org/~tracey --*
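(Since the thread mentions a dismax handler named "dissed", a rough sketch of how
such a handler could be registered in solrconfig.xml follows; the handler name and
the qf boosts come from this thread, while the rest of the stanza is just the stock
DisMaxRequestHandler configuration shape, assumed here rather than posted by anyone:

   <requestHandler name="dissed" class="solr.DisMaxRequestHandler">
     <lst name="defaults">
       <str name="qf">title^100 description^15 text^1</str>
     </lst>
   </requestHandler>

With that in place, q=commute&qt=dissed searches those fields with those weights,
but q=title:commute&qt=dissed still won't be parsed as a fielded lucene query, as
Mike notes.)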


Re: Index-time Boosting

2006-12-05 Thread Tracey Jaquith




oh, and yes, i've always understood, thankfully, that queries
of
   "q=commute&fl=title"
and
   "q=title:commute&fl=title"
are *quite* different
(but that is probably mostly due to my prior experience with
 lucene in our current broken SE 8-)

-t

Mike Klaas wrote:
On 12/5/06, Tracey Jaquith <[EMAIL PROTECTED]> wrote:

  Now I have one new mystery that's popped up for me.

  With std req handler, this simple query

      q=title:commute

  is *not* returning me all documents that have the word "commute" in the title.

  There must be some other filter/clause or something happening that I'm not
  aware of?

  (For example, I do "indent=on&fl=title&q=commute" in a wget and grep the
  results for  and then grep -i for commute, there are 23 hits.  But doing
  "&q=title:commute" only returns one of those hits..)

Indeed--those are different queries.  The "fl" parameter controls
the stored fields returned by Solr; it does not affect which documents
are returned.  The first query asks for the titles of all documents
containing the word "commute", the second for all documents with
"commute" in their title.

see http://wiki.apache.org/solr/CommonQueryParameters

I'm not sure what problems you are experiencing with dismax, but it is
important to note that you cannot specify a raw lucene query in the
"q" parameter of a dismax handler.  If you want to search for a word
across fields, you can specify the qf (query fields) parameter.

eg.
q=commute
qf=title^10 body
(see
http://incubator.apache.org/solr/docs/api/org/apache/solr/request/DisMaxRequestHandler.html)

Turning on debugQuery=true is invaluable for determining what factors
are influencing scoring.

Was your previous solution QueryParser-based?  If so, you should be
able to use the exact same queries as before, passed to
StandardRequestHandler (assuming the fields are also set up
identically).

cheers,
-Mike


-- 
--Tracey Jaquith - http://www.archive.org/~tracey --





Re: Index-time Boosting

2006-12-05 Thread Tracey Jaquith




ok, great to know -- all this is invaluable.
i'm stashing away "ideas" like this for the future (because..)

i think for now i'll stick with XSL transforming the fields to lowercase
because we already need this small XSLT from our item XML to
XML that solr can index.

-t

Chris Hostetter wrote:

  : so for my dwindling number of remaining "string" types, in my XSL
: transform (on the input to index the doc) i'll lowercase them all, too 8-)

I don't believe that is strictly necessary, these two field types should
be functionally equivalent...

   [fieldtype definitions lost in the archive conversion]

...so i'm pretty sure you could just use...

   [fieldtype definition lost in the archive conversion]


-Hoss

  


-- 
  
--Tracey Jaquith - http://www.archive.org/~tracey
--
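(The kind of field type Hoss describes -- one that keeps the whole value as a single
token but lowercases it at index and query time -- would look roughly like this in a
schema.xml of that era; the type name is illustrative and this is a sketch, not the
XML Hoss actually posted, which was lost in the archive:

   <fieldtype name="string_lc" class="solr.TextField" sortMissingLast="true" omitNorms="true">
     <analyzer>
       <tokenizer class="solr.KeywordTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldtype>

Fields of this type match case-insensitively without any external XSL lowercasing.)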





Re: Index-time Boosting

2006-12-05 Thread Tracey Jaquith

ahh, after rereading this about 20 times today 8-)
i think i finally "get it" (your final question below).

if i do index-time boosts, and search only "text" (default field)
the boosts will propagate into "text", but only insofar as the
document will score higher when a phrase is found in the "text"
field (regardless of whether that "hit" really was due to something
copyField-ed in with boost 1, boost 100, etc.)

so that solution would have the effect of making certain documents
have higher scores in the "text" field, not the effect we'd like.

[example documentA]
  [description] i like to commute
  [title] commuting thoughts
copyField'ed into text:
  [text] i like to commute commuting thoughts

we, the Archive, want query hits in title to boost ^100.
if we do q=commute (which searches "text")
with index-time boosting, solr/lucene won't know
the hit due to "title" should effect a much higher ranking
compared to documents with commute in "text" but
not in "title".   however, the above document *will* have a higher
score, in general, because the "title" portion was nearly
half of the "text" field.  Yet A will have a
higher ranking even for matches like "q=like"
compared to documentB like:
  [description] i like bread
  [text] i like bread
(when in reality, we'd like them to have near equal weighting).
So index boosts won't do for us.  I'm learning!

--tracey


 I've tried some experiments, adjusting the boosts at index time and running
 the std handler to see the ordering of the results change for "fieldless queries"
 (eg: "q=tracey+pooh").  I have 33 fields using <copyField source="..." dest="text"/>
 (where "text" is our default field to query)
 to allow for checking across most of our std XML fields.  I gather that a boost
 applied to "title" on indexing a document must somehow "propagate" to the
 "text" field?


Background: for an indexed field name there is a single boost value
per document.  This is true even if the field is multi-valued... all
values for that document "share" the same boost.  This is a Lucene
restriction so we can't fix it in Solr in any way.

Solr *does* propagate the index-time boost when doing copyField, but
this just ends up being multiplied into all the other boosts for
values for that document.   Matches on the resulting text field will
*always* score higher, regardless of which "part" matched.  Does that
make sense?


*--Tracey Jaquith - http://www.archive.org/~tracey --*


Re: Result: numFound inaccuracies

2006-12-09 Thread Tracey Jaquith




hey, this bit me last week, too ;-)
it had me completely miserable, thinking "oh no, solr doesn't work for
us!"
when i was installing it, and took me a few hours to figure it out!


while "on the phone" now, I'm happy to announce from Internet Archive
some results.

We indexed 523K documents in about 2 hours, yielding an index of a mere 0.9 GB.
I slipped it into friday night's live servers for about 90 minutes to
watch performance.

It was a *CHAMP*!!
It easily laughed at load queries of 3 req/sec, using minuscule amounts
of disk I/O, no swapping/paging, and only minor CPU bursts
(on one dual-core 4GB intel linux box).

I'll report more, but that's enough of a "happy holidays" for us at IA!
(Compare this to our current search engine embarrassment/disaster --
   5 boxes (replication -- 4 readers all with 4GB dual-core intel
   + 1 writer 8GB quad intel)
   handles about 3 query req/sec and often has CPU at 100% and mem at 50%.
   index for slightly *smaller* docset is a "WTF?" 23GB)

--tracey


Andrew Nagy wrote:

  - Original Message -
  From: Yonik Seeley <[EMAIL PROTECTED]>
  Date: Friday, December 8, 2006 6:01 pm
  Subject: Re: Result: numFound inaccuracies
  To: solr-user@lucene.apache.org

    start is 0 based :-)

  Man do I feel dumb!

  Andrew


-- 
  
--Tracey Jaquith - http://www.archive.org/~tracey
--





Re: Strange Sorting results on a Text Field

2006-12-13 Thread Tracey Jaquith

Despite considerations of stemming and such for "text" type fields,
is it the case that, if we have a single-value "text" type field,
sorting will work?

--tracey

On 9/11/06, Tom Weber <[EMAIL PROTECTED]> wrote:

  Thanks also for the "multiValued" explanation, this is useful for
my current application. But then, if I use this field and I ask for
sorting, how will the sorting be done, alphanumeric on the first
entry for this field ? Until now, I entered more than one entry by
separating them with a space in the same field, like text1 text2 text3.
 


Sorting is currently only supported when there is at most one value
(or token) per document.  This is a lucene restriction.

-Yonik
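(A common pattern for this -- a sketch, not something prescribed in the thread -- is
to keep the analyzed "text"-type field for searching and copy its value into a
separate single-token field used only for sorting; the "title_sort" name is
illustrative:

   <field name="title" type="text" indexed="true" stored="true"/>
   <field name="title_sort" type="string" indexed="true" stored="false"/>
   <copyField source="title" dest="title_sort"/>

Queries still hit "title" (or the copied "text" field), while sorting on title_sort
satisfies the one-value-one-token-per-document restriction Yonik describes.)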




listing/enumerating field information

2007-01-10 Thread Tracey Jaquith





The Internet Archive is getting close to going live with Solr.
I have two remaining classes of problems.

1) across the entire index, enumerate all the unique values for a given
field.
2) we use unrestricted dynamicField additions from documents.  (that is
our users are free to add any named field they like to their document's
data (which is metadata for their item)).  we want to list all the
unique field names in the index.

Eg:

  <doc>
    ...
    <field name="mediatype">audio</field>
  </doc>

  <doc>
    ...
    <field name="mediatype">movies</field>
    <field name="collection">prelinger</field>
  </doc>

1) would yield a list of audio and movies if the field passed in was mediatype
2) would yield a list of mediatype and collection


From our prior implementation of a java + lucene search engine, we already
ran into queries that our SE could not handle.  So we nightly build a cache
structure to handle those other queries.  We *could* solve 1) and 2) in
this nightly cache, but ideally we'd like to use Solr if possible.

thanks!
--tracey


-- 
      
--Tracey Jaquith - http://www.archive.org/~tracey
--
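(For problem 1, the simple faceting Solr shipped around this time can enumerate the
indexed values of a field; a sketch of such a request -- the field name is from the
message above, the parameters are the standard faceting ones, and using faceting
here is a suggestion rather than something from the thread:

   http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=mediatype

The facet counts in the response list the distinct indexed values of mediatype.
Problem 2, listing the field names themselves, is what the follow-up message below
gets at with an extra request handler.)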





Re: listing/enumerating field information

2007-01-11 Thread Tracey Jaquith




interesting!  

Code-searching for relevant lucene classes led me to try adding a new
requestHandler entry to my solrconfig.xml.

This allowed me to try this request...
   http://localhost:8983/solr/select?rows=0&qt=test&q=fields
which I think gets me (2) below.

--tracey


Tracey Jaquith wrote:

  
  
  
The Internet Archive is getting close to going live with Solr.
I have two remaining classes of problems.
  
1) across the entire index, enumerate all the unique values for a given
field.
2) we use unrestricted dynamicField additions from documents.  (that is
our users are free to add any named field they like to their document's
data (which is metadata for their item)).  we want to list all the
unique field names in the index.
  
Eg:

  ... 
 audio


  ... 
  movies
  prelinger

  
1) would yield a list of audio and movies if the field passed in was
mediatype
2) would yield a list of  mediatype and collection
  
  
>From our prior implementation of a java + lucene search engine, we
already
ran in to queries that our SE could not handle.  So we nightly build a
cache
structure to handle those other queries.  We *could* solve 1) and 2) in
this nightly cache, but ideally we'd like to use Solr if possible.
  
thanks!
--tracey
  
  
  -- 
    
  --Tracey Jaquith -
  http://www.archive.org/~tracey
--
  


-- 
  
--Tracey Jaquith - http://www.archive.org/~tracey
--





INTERNET ARCHIVE goes SOLR!

2007-01-27 Thread Tracey Jaquith

 Internet Archive on Monday afternoon switched over to SOLR!

 We converted from a badly deteriorating "home grown" server that
 was made up of java + jetty ( + rsync for replication) + an older
 version of lucene.
 I make some comparisons of SOLR vs. "prior" using "[]" notes below.

 I parsed 2 days worth of SOLR logs to determine:
   Max queries/sec: 8.8
   Avg queries/sec: 5.4
   Number (re)indexed / day: 3372

 Index size: 1.1gb [vs. 26gb]
 Number of document fields searched on a quoted unqualified query:
   5 [vs. 677] *

 Horsepower:
   one 4gb RAM dual core cpu 
   [vs. three 4gb RAM dual core cpu (readers) and one 8gb RAM 2 dual 
core cpus (writer)]


 Solr hardly touches our disks, load avg stays around 0.5, typically.
 "sar" shows we average 85% idle!
 Solr seems quicker to respond, overall, and much more stable.
 We can reindex our entire set of 575K items in about 2 hours
 (where we are limited more by the "crawling" of our 190 servers for XML than Solr).

 With our current configuration, we can show index changes on our live site in < 15 minutes
 (compared to our last SE which could take 4+ hours).
 Related to above point, we commit every 15 minutes; we optimize once/day late at night.


 * To be fair, Michael StAck (our greatest help for prior SE "life support")
 has smartly pointed out that by making a smarter schema and strategy, I could
 reduce the number of fields searched from 677 to 5, with the same overall
 functionality.  677 fields searched on most queries was surely part of the
 bucket of nails in the coffin of our prior SE.


 [Some information and configuration]

 We've done essentially no optimizing outside of focusing on a "smart" schema.
 We do query-time boosting (more on that follows).
 We (presently) do not use replication.
 We do (server-side) XSLT of output into our prior SE's XML format.
 We don't use DisMax and (as of now) do not use faceting.
 We override defaultOperator of "OR" to "AND".
 We increased our commitLockTimeout to 5 minutes, and unlockOnStartup.
 We useCompoundFile (for the index).
 External to Solr, we use XSLT to transform our item XML into a post-able form for Solr to (re)index.

 And finally, the hardest part to convert to Solr.
 I had to write a PHP front-end custom converter to take our query strings,
 parse the clauses and lucene syntax into pieces, and "expand" clauses where
 they were not searching a specific field into our query-time boosting.
 Eg: if someone were to look for "tracey pooh" on our site, we expand it to:
 (title:"tracey pooh"^100 OR description:"tracey pooh"^15 OR collection:"tracey pooh"^10 OR language:"tracey pooh"^10 OR text:"tracey pooh"^1)
 (but 'creator:"tracey pooh"' would pass to SOLR as is).

 Lastly, a feelgood.  All of Internet Archive's written code is opensource,
 as is *all* the third-party code we use!
 So go SOLR and thank you SO much for keeping it open, keeping it real, and for *saving our site*!
 Thanks for the great mail list and all the continual work, updating, and thinking
 the Solr team continues to do.  We have all been greatly impressed by this project
 and it has worked out better than we had hoped!

--
*--Tracey Jaquith - http://www.archive.org/~tracey --*
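(A rough sketch of where the settings mentioned above live, using the values from
this message; the element names are the stock schema.xml/solrconfig.xml ones of the
era, and the snippet is an illustration rather than the Archive's actual config:

   <!-- schema.xml -->
   <solrQueryParser defaultOperator="AND"/>

   <!-- solrconfig.xml, main index settings -->
   <useCompoundFile>true</useCompoundFile>
   <commitLockTimeout>300000</commitLockTimeout>
   <unlockOnStartup>true</unlockOnStartup>
)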


Re: INTERNET ARCHIVE goes SOLR!

2007-02-01 Thread Tracey Jaquith




Yes, any of our search bars on our site will use Solr.
So your example is using Solr.  8-)


Otis Gospodnetic wrote:

  Hi Tracey,

Thanks for sharing.  Which search exactly is powered by Solr now?
http://www.archive.org/search.php?query=middlebury for example?

Thanks,
Otis

- Original Message 
From: Tracey Jaquith <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Sunday, January 28, 2007 5:12:44 AM
Subject: INTERNET ARCHIVE goes SOLR!


Internet Archive on Monday afternoon switched over to SOLR!

  We converted from a badly deteriorating "home grown" server that
  was made up of java + jetty ( + rsync for replication) + an older
  version of lucene.
  I make some comparisons of SOLR vs. "prior" using "[]" notes below.

  I parsed 2 days worth of SOLR logs to determine:
Max queries/sec: 8.8
Avg queries/sec: 5.4
Number (re)indexed / day: 3372

  Index size: 1.1gb [vs. 26gb]
  Number of document fields searched on a quoted unqualified query:
5 [vs. 677] *

  Horsepower:
one 4gb RAM dual core cpu 
[vs. three 4gb RAM dual core cpu (readers) and one 8gb RAM 2 dual 
core cpus (writer)]

  Solr hardly touches our disks, load avg stays around 0.5, typically.
  "sar" shows we average 85% idle!
  Solr seems quicker to respond, overall, and much more stable.
  We can reindex our entire set of 575K items in about 2 hours
 (where we are limited more by the "crawling" of our 190 servers for 
XML than Solr).

  With our current configuration, we can show index changes on our live 
site in < 15 minutes
  (compared to our last SE which could take 4+ hours)
  Related to above point, we commit every 15 minutes; we optimize 
once/day late at night.

  * To be fair, Michael StAck (our greatest help for prior SE "life 
support")
  has smartly pointed out that by making a smarter schema and strategy, 
I could
  reduce the number of fields searched from 677 to 5, with the same overall
  functionality.  677 fields search on most queries was surely part of 
bucket
  of nails in the coffin of our prior SE.


  [Some information and configuration]

  We've done essentially no optimizing outside of focusing on a "smart" 
schema.
  We do query-time boosting (more on that follows).
  We (presently) do not use replication.
  We do (server-side) XSLT of output into our prior SE's XML format.
  We don't use DisMax and (as of now) do not use faceting.
  We override defaultOperator of "OR" to "AND".
  We increased our commitLockTimeout to 5 minutes, and unlockOnStartup.
  We useCompoundFile (for the index).
  External to Solr, we use XSLT to transform our item XML into a 
post-able form for Solr to (re)index.

  And finally, the hardest part to convert to Solr.
  I had to write a PHP front-end custom converter to take our query strings,
  parse the clauses and lucene syntax into pieces, and "expand" clauses 
where
  they were not searching a specific field to expand it to our 
query-time boosting.
  Eg: if someone were to look for "tracey pooh" on our site, we expand 
it to:
  (title:"tracey pooh"^100 OR description:"tracey pooh"^15 OR 
collection:"tracey pooh"^10 OR language:"tracey pooh"^10 OR text:"tracey 
pooh"^1)
  (but 'creator:"tracey pooh"' would pass to SOLR as is).

  Lastly, a feelgood. All of Internet Archive's written code is opensource,
  as is *all* the third-party code we use!
  So go SOLR and thank you SO much for keeping it open, keeping it real, 
and for *saving our site*!
  Thanks for the great mail list and all the continual work, updating, 
and thinking the Solr
  team continues to do.  We have all been greatly impressed by this 
project and it has worked out
  better than we had hoped!

  


-- 
  
--Tracey Jaquith - http://www.archive.org/~tracey
--





Re: INTERNET ARCHIVE goes SOLR!

2007-02-01 Thread Tracey Jaquith




oh, tee hee, as if our eternal admiration and gratitude isn't
obvious...  8-)

i concur, the amount one *can* customize simply from the XML configuration
and schema is fantastically impressive!  almost all of the configuration
setup is quick to do, in our experience, too

--tracey

Walter Underwood wrote:

  On 1/27/07 1:12 PM, "Tracey Jaquith" <[EMAIL PROTECTED]> wrote:

    * To be fair, Michael StAck (our greatest help for prior SE "life support")
    has smartly pointed out that by making a smarter schema and strategy,
    I could reduce the number of fields searched from 677 to 5, with the
    same overall functionality.  677 fields searched on most queries was
    surely part of the bucket of nails in the coffin of our prior SE.

  Solr makes it so easy to understand and change fields that I would
  be inclined to still give some credit to Solr for making the faster
  config easier to achieve.

  wunder


-- 
  
--Tracey Jaquith - http://www.archive.org/~tracey
--