Re: boosting certain terms within one field?

2008-11-30 Thread Grant Ingersoll

Hi Peter,

What are the downsides to your last alternative approach below?  That
seems like the simplest approach and should work as long as the terms
within those fields do not need to be boosted separately.

If you want to go the boosting terms route, this is handled via a thing
called Payloads in Lucene.  Payloads are an array of bytes that are
added during indexing at the term level through the analysis process.
To do this in Solr, you would need to write your own TokenFilter that
adds payloads as needed.  Then, during search, you can take advantage
of these payloads by using the BoostingTermQuery from Lucene.  The
downside to all of this is Solr doesn't currently support it, so you
would be coding it up yourself.  I'm sure, though, that if you were to
start a patch on it, there would be others who are interested.

Note, on the payloads.  The biggest sticking point, I think, is coming
up w/ an efficient way of encoding the byte array and putting it into
the XML format, such that one can send in payloads when indexing.
It's not particularly hard, but no one has done it yet.
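
To make the indexing half concrete, here is a minimal, hypothetical
sketch of such a filter against the Lucene 2.4-era token API (the class
name and the float-to-bytes encoding are my illustration, not existing
Solr code):

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.index.Payload;

  // Attaches the same boost payload to every token it sees; a real
  // filter would decide per token (e.g. based on markup) which terms
  // to boost.
  public class BoostPayloadFilter extends TokenFilter {
    private final Payload payload;

    public BoostPayloadFilter(TokenStream input, float boost) {
      super(input);
      // Encode the boost as 4 big-endian bytes so a custom Similarity
      // can decode it again in scorePayload() at query time.
      int bits = Float.floatToIntBits(boost);
      this.payload = new Payload(new byte[] {
          (byte) (bits >>> 24), (byte) (bits >>> 16),
          (byte) (bits >>> 8), (byte) bits });
    }

    public Token next(Token reusableToken) throws IOException {
      Token token = input.next(reusableToken);
      if (token != null) {
        token.setPayload(payload);
      }
      return token;
    }
  }

On the query side, BoostingTermQuery hands each matching term's payload
to Similarity.scorePayload(), so you would also supply a Similarity that
decodes those four bytes back into a float.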


-Grant


On Nov 29, 2008, at 10:45 PM, Peter Wolanin wrote:


I've recently started working on the Drupal integration module for
SOLR, and we are looking for suggestions for how to address this
question:  how do we boost the importance of a subset of terms within
a field?

For example, we are using the standard request handler for queries,
and the default field for keyword searches is a concatenation of the
title, body, taxonomy terms, etc.

One "hackish" way I can imagine is that terms we want to boost (for
example the title, or text inside h2 tags) could be concatenated
multiple times.  Would this be effective and reasonable?

It seems like the alternative is to try to switch to using the dismax
handler, storing the terms that we desire to have different boosts
into different fields, all of which are in the list of query fields?

Thanks in advance for your suggestions.

-Peter

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
[EMAIL PROTECTED]


--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: range queries on string field with millions of values

2008-11-30 Thread Yonik Seeley
On Sun, Nov 30, 2008 at 2:04 AM, Naomi Dushay <[EMAIL PROTECTED]> wrote:
> The terms component approach, if i understand it correctly, will be
> problematic.  I need to present not only the next X call numbers in
> sequence, but other fields in those documents (e.g. title, author).

You can still use the method Hoss suggested of doing 2 requests to
satisfy this type of search:

>> But as Yonik said: the new TermsComponent may actually be a better option
>> for you -- doing two requests for every page (the first to get the N Terms
>> in your id field starting with your input, the second to do a query for
>> docs matching any of those N ids) might actually be faster even though
>> there won't likely even be any cache hits.

So TermsComponent gets the next 10 IDs, then you do a standard query
with those 10 IDs.
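
As a concrete sketch of the two requests (the field name is made up; the
/terms handler and terms.* parameter names assume the TermsComponent as
it exists in trunk - check your version):

  # request 1: the next 10 terms in the call-number field
  /solr/terms?terms.fl=callnum&terms.lower=ML1500&terms.limit=10

  # request 2: fetch the documents for those terms, with display fields
  /solr/select?q=callnum:(ML1500 OR ML1501 OR ...)&fl=callnum,title,author

The second request is an ordinary boolean OR over the handful of terms
returned by the first.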

-Yonik


> I assume the Terms Component approach will only give me the next X call number
> values, not the documents.
>
> It sounds like Glen Newton's suggestion of mapping the call numbers to a
> float number is the most likely solution.
>
> I know it sounds ridiculous to do all this for a "call number browse" but
> our faculty have explicitly asked for this.  For humanities scholars
> especially, they know the call numbers that are of interest to them, and
> they browse the stacks that way (ML 1500s are opera, V35 is verdi ...).
> They are using the research methods that have been successful for their
> entire careers.  Plus, library materials are going to off-site, high-density
> storage, so the only way for them to browse all materials, regardless of
> location, via call number is online.   I doubt they'll find this feature as
> useful as they expect, but it behooves us to give the users what they ask
> for.
>
> So yeah, our user needs are perhaps a little outside of your expectations.
>  :-)
>
> - Naomi
>
>
> On Nov 29, 2008, at 2:58 PM, Chris Hostetter wrote:
>
>>
>> : The results are correct.  But the response time sucks.
>> :
>> : Reading the docs about caches, I thought I could populate the query
>> : result cache with an autowarming query and the response time would be
>> : okay.  But that hasn't worked.  (See excerpts from my solrConfig file
>> : below.)
>> :
>> : A repeated query is very fast, implying caching happens for a particular
>> : starting point ("42" above).
>> :
>> : Is there a way to populate the cache with the ENTIRE sorted list of
>> : values for the field, so any arbitrary starting point will get results
>> : from the cache, rather than grabbing all results from (x) to the end,
>> : then sorting all these results, then returning the first 10?
>>
>> there are two "caches" that come into play for something like this...
>>
>> the first cache is a low-level Lucene cache called the "FieldCache" that
>> is completely hidden from you (and for the most part: from Solr).
>> anytime you sort on a field, it gets built, and reused for all sorts on
>> that field.  My original concern was that it wasn't getting warmed on
>> "newSearcher" (because you have to be explicit about that).
>>
>> the second cache is the queryResultsCache which caches a "window" of an
>> ordered list of documents based on a query, and a sort.  you can see this
>> cache in your Solr stats, and yes: these two requests result in different
>> cache keys for the queryResultsCache...
>>
>>   q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
>>   q=yourField:[52+TO+*]&sort=yourField+asc&rows=10
>>
>> ...BUT! ... the two queries below will result in the same cache key, and
>> the second will be a cache hit, provided a sufficient value for
>> the "queryResultWindowSize" ...
>>
>>   q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
>>   q=yourField:[42+TO+*]&sort=yourField+asc&rows=10&start=10
>>
>> so perhaps the key to your problem is to just make sure that once the user
>> gives you an id to start with, you "scroll" by increasing the start param
>> (not altering the id) ... the first query might be "slow" but every query
>> after that should be a cache hit (depending on your page size, and how far
>> you expect people to scroll, you should consider increasing
>> queryResultWindowSize)
>>
>> But as Yonik said: the new TermsComponent may actually be a better option
>> for you -- doing two requests for every page (the first to get the N Terms
>> in your id field starting with your input, the second to do a query for
>> docs matching any of those N ids) might actually be faster even though
>> there won't likely even be any cache hits.
>>
>>
>> My opinion:  Your use case sounds like a waste of effort.  I can't imagine
>> anyone using a library catalog system ever wanting to look up a call number,
>> and then scroll through all possible books with similar call numbers -- it
>> seems much more likely that i'd want to look at other books with similar
>> authors, or keywords, or tags ... all things that are actually *easier* to
>> do with Solr.  (but then again: i don't work in a library.  i trust that
>> y

Re: boosting certain terms within one field?

2008-11-30 Thread Peter Wolanin
Hi Grant,

Thanks for your feedback.  The major short-term downside to switching
to dismax with multiple fields would be the required rewriting of our
current PHP code - especially our code to handle the addition of facet
fields to the q parameter.  From reading about dismax, it seems we would
need to instead use fq to limit the search results to those matching a
specific facet value.
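
For example (field names purely illustrative), a faceted dismax request
would presumably look something like:

  q=opera&qt=dismax&qf=title^5.0+h2_text^2.0+body^1.0&fq=taxonomy:music

with the per-field boosts in qf, and the facet constraint moved out of q
and into fq.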

Best,

Peter


On Sun, Nov 30, 2008 at 8:43 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Hi Peter,
>
> What are the downsides to your last alternative approach below?  That seems
> like the simplest approach and should work as long as the terms within those
> fields do not need to be boosted separately.
>
> If you want to go the boosting terms route, this is handled via a thing
> called Payloads in Lucene.  Payloads are an array of bytes that are added
> during indexing at the term level through the analysis process.  To do this
> in Solr, you would need to write your own TokenFilter that adds payloads as
> needed.  Then, during search, you can take advantage of these payloads by
> using the BoostingTermQuery from Lucene.  The downside to all of this is
> Solr doesn't currently support it, so you would be coding it up yourself.
>  I'm sure, though, that if you were to start a patch on it, there would be
> others who are interested.
>
> Note, on the payloads.  The biggest sticking point, I think, is coming up w/
> an efficient way of encoding the byte array and putting it into the XML
> format, such that one can send in payloads when indexing.  It's not
> particularly hard, but no one has done it yet.
>
> -Grant
>
>
> On Nov 29, 2008, at 10:45 PM, Peter Wolanin wrote:
>
>> I've recently started working on the Drupal integration module for
>> SOLR, and we are looking for suggestions for how to address this
>> question:  how do we boost the importance of a subset of terms within
>> a field?
>>
>> For example, we are using the standard request handler for queries,
>> and the default field for keyword searches is a concatenation of the
>> title, body, taxonomy terms, etc.
>>
>> One "hackish" way I can imagine is that terms we want to boost (for
>> example the title, or text inside h2 tags) could be concatenated
>> multiple times.  Would this be effective and reasonable?
>>
>> It seems like the alternative is to try to switch to using the dismax
>> handler, storing the terms that we desire to have different boosts
>> into different fields, all of which are in the list of query fields?
>>
>> Thanks in advance for your suggestions.
>>
>> -Peter
>>
>> --
>> Peter M. Wolanin, Ph.D.
>> Momentum Specialist, Acquia, Inc.
>> [EMAIL PROTECTED]
>
> --
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ



--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
[EMAIL PROTECTED]


Re: boosting certain terms within one field?

2008-11-30 Thread Erik Hatcher
Adding constraints obtained from facets is best done using fq anyway,
so it's worth making that switch in your client code regardless.
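
For example (hypothetical field and value), instead of folding the
constraint into the main query:

  q=opera+AND+category:music

you would send it as a separate filter query, which Solr also caches
independently of q in the filterCache:

  q=opera&fq=category:music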


Erik

On Nov 30, 2008, at 10:43 AM, Peter Wolanin wrote:


Hi Grant,

Thanks for your feedback.  The major short-term downside to switching
to dismax with multiple fields would be the required rewriting of our
current PHP code - especially our code to handle the addition of facet
fields to the q parameter.  From reading about dismax, it seems we would
need to instead use fq to limit the search results to those matching a
specific facet value.

Best,

Peter


On Sun, Nov 30, 2008 at 8:43 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Hi Peter,

What are the downsides to your last alternative approach below?  That
seems like the simplest approach and should work as long as the terms
within those fields do not need to be boosted separately.

If you want to go the boosting terms route, this is handled via a thing
called Payloads in Lucene.  Payloads are an array of bytes that are
added during indexing at the term level through the analysis process.
To do this in Solr, you would need to write your own TokenFilter that
adds payloads as needed.  Then, during search, you can take advantage
of these payloads by using the BoostingTermQuery from Lucene.  The
downside to all of this is Solr doesn't currently support it, so you
would be coding it up yourself.  I'm sure, though, that if you were to
start a patch on it, there would be others who are interested.

Note, on the payloads.  The biggest sticking point, I think, is coming
up w/ an efficient way of encoding the byte array and putting it into
the XML format, such that one can send in payloads when indexing.
It's not particularly hard, but no one has done it yet.

-Grant


On Nov 29, 2008, at 10:45 PM, Peter Wolanin wrote:

I've recently started working on the Drupal integration module for
SOLR, and we are looking for suggestions for how to address this
question:  how do we boost the importance of a subset of terms within
a field?

For example, we are using the standard request handler for queries,
and the default field for keyword searches is a concatenation of the
title, body, taxonomy terms, etc.

One "hackish" way I can imagine is that terms we want to boost (for
example the title, or text inside h2 tags) could be concatenated
multiple times.  Would this be effective and reasonable?

It seems like the alternative is to try to switch to using the dismax
handler, storing the terms that we desire to have different boosts
into different fields, all of which are in the list of query fields?

Thanks in advance for your suggestions.

-Peter

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
[EMAIL PROTECTED]

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
[EMAIL PROTECTED]




What are the scenarios when a new Searcher is created ?

2008-11-30 Thread souravm
Hi All,

Say I have started a new Solr server instance using start.jar via the java
command. For this Solr server instance, when would a new Searcher be created?

I am aware of the following scenarios:

1. When the instance is started, a new Searcher is created for autowarming. But
I am not sure whether this searcher will continue to be alive or will die after
the autowarming is over.
2. When I do the first search in this server instance through select, a new
searcher is created, and from then on the same searcher is used for all selects
to this instance. Even if I run multiple search requests concurrently, I see
that the same Searcher is used to service those requests.
3. When I add documents to the index on this instance through an update
statement, a new searcher is created.

Please let me know if there are any other situations in which a new Searcher is
created.

Regards,
Sourav





Re: NIO not working yet

2008-11-30 Thread Yonik Seeley
OK, the development version of Solr should now be fixed (i.e. NIO
should be the default for non-Windows platforms).  The next nightly
build (Dec-01-2008) should have the changes.

-Yonik

On Wed, Nov 12, 2008 at 2:59 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> NIO support in the latest Solr development versions does not work yet
> (I previously advised that some people with possible lock contention
> problems try it out).  We'll let you know when it's fixed, but in the
> meantime you can always set the system property
> "org.apache.lucene.FSDirectory.class" to
> "org.apache.lucene.store.NIOFSDirectory" to try it out.
>
> for example:
>
> java -Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.NIOFSDirectory -jar start.jar
>
> -Yonik


Solr with Network File Server

2008-11-30 Thread souravm

Hi,

I have huge index files to query. On a first-cut calculation, it looks like I
would need around 3 boxes (each box holding no more than 125M records, about
12.5GB) for around 25 apps - so 75 boxes all together.

However, the number of concurrent users would be lower - probably not more
than 20 at a time, or 25 at most.

So I am thinking of an option where I use around 20-25 servers, each with a
2GB heap size, and load all indexes from a network file server. I know this
would impact performance (especially for first-time queries) but I am not sure
how big the impact would be.

If anybody has already tried this type of solution, please let me know what
the performance impact was.

Regards,
Sourav



Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!

2008-11-30 Thread Grant Ingersoll

Hi Fergie,

Haven't forgotten about you, but I've been traveling and then into some
US holidays here.

To confirm I am understanding: you are seeing a slowdown between a
1.3-dev from April and one from September, right?

Can you produce an MD5 hash of the WAR files or something, such that I
can know I have the exact bits?  Better yet, perhaps you can put those
files up somewhere where they can be downloaded.
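
For example, something along these lines (the file name is just a
placeholder for whichever WAR you tested):

  md5sum apache-solr-1.3.0.war      # GNU/Linux
  md5 apache-solr-1.3.0.war         # Mac OS X / BSD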


Thanks,
Grant

On Nov 26, 2008, at 10:54 AM, Fergus McMenemie wrote:


Hello Grant,

Not much good with Java profilers (yet!) so I thought I
would send a script!

Details... details! Having decided to produce a script to
replicate the 1.2 vs. 1.3 speed problem, the required rigor
revealed a lot more.

1) The faster version I have previously referred to as 1.2
   was actually a "1.3-dev" I had downloaded as part of the
   solr bootcamp class at ApacheCon Europe 2008. The ID
   string in the CHANGES.txt document is:
   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $

2) I did actually download and speed-test a version of 1.2
   from the internet. Its CHANGES.txt id is:
   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
   Speed-wise it was about the same as 1.3 at 64min. It also
   had lots of charset issues and is ignored from now on.

3) The version I was planning to use, till I found this
   speed issue, was the "latest" official version:
   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
   I also verified the behavior with a nightly build:
   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $

Anyway, the following script indexes the content in 22min
for the 1.3-dev version and takes 68min for the newer releases
of 1.3. I took the conf directory from the 1.3-dev (bootcamp)
release and used it to replace the conf directory from the
official 1.3 release. The 3x slowdown was still there; it is
not a configuration issue!
=============================================================
#! /bin/bash

# This script assumes a /usr/local/tomcat link to whatever version
# of tomcat you have installed. I have "apache-tomcat-5.5.20". Also
# /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
# All the following was done as root.


# I have a directory /usr/local/ts which contains four versions of solr:
# the "official" 1.2, along with two 1.3 releases and a version of 1.2 or
# a 1.3 beta I got while attending a solr bootcamp. I indexed the same
# content using the different versions of solr as follows:
cd /usr/local/ts
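# NOTE: the test below is always false as written ('[ "" ]' tests an
# empty string); put any non-empty string between the quotes to re-run
# the one-time setup in this block.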
if [ "" ]
then
  echo "Starting from a-fresh"
  sleep 5 # allow time for me to interrupt!
  cp -Rp apache-solr-bc/example/solr  ./solrbc  #bc = bootcamp
  cp -Rp apache-solr-nightly/example/solr ./solrnightly
  cp -Rp apache-solr-1.3.0/example/solr   ./solr13

   # the gaz is regularly updated and its name keeps changing :-) The page
   # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the
   # latest version.
   curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip

  unzip -q geonames.zip
  # delete corrupt blips!
  perl -i -n -e 'print unless
  ($. > 2128495 and $. < 2128505) or
  ($. > 5944254 and $. < 5944260)
  ;' geonames_dd_dms_date_20081118.txt
   #following was used to detect bad short records
   #perl -a -F\\t -n -e 'print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt


   # my set of fields and copyfields for the schema.xml
   # (the <field .../> and <copyField .../> XML definitions were stripped
   # from the archived message; only attribute fragments such as
   #   stored="true" required="true" />
   # survive, so they are not reproduced here)
   fields='
   ...
   '
   copyfields='
   ...
   '

   # add in my fields and copyfields (the XML in these substitution
   # patterns was likewise stripped from the archived message)
   perl -i -p -e "print qq($fields) if s///;" solr*/conf/schema.xml
   perl -i -p -e "print qq($copyfields) if s[][];" solr*/conf/schema.xml

   # change the unique key and mark the "id" field as not required
   perl -i -p -e "s/id/UNI/i;" solr*/conf/schema.xml
   perl -i -p -e 's/required="true"//i if m//;' solr*/conf/schema.xml

   # enable remote streaming in solrconfig file
   perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml

   fi

# some constants to keep the curl command shorter
skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"

file=`pwd`"/geonames.txt"

export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"


echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'

/usr/local/tomcat/bin/shutdown.sh
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
  then
  echo "Tomcat would not shutdown"
  exit
  fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr

Re: NIO not working yet

2008-11-30 Thread Jon Baer
Sorry, missed that (and probably a dumb question): does that -D flag
work for setting a RAMDirectory as well?


- Jon

On Nov 30, 2008, at 8:42 PM, Yonik Seeley wrote:


OK, the development version of Solr should now be fixed (i.e. NIO
should be the default for non-Windows platforms).  The next nightly
build (Dec-01-2008) should have the changes.

-Yonik

On Wed, Nov 12, 2008 at 2:59 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

NIO support in the latest Solr development versions does not work yet
(I previously advised that some people with possible lock contention
problems try it out).  We'll let you know when it's fixed, but in the
meantime you can always set the system property
"org.apache.lucene.FSDirectory.class" to
"org.apache.lucene.store.NIOFSDirectory" to try it out.

for example:

java -Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.NIOFSDirectory -jar start.jar

-Yonik




Re: What are the scenarios when a new Searcher is created ?

2008-11-30 Thread Aleksander M. Stensby
When adding documents to Solr, the searcher will not be replaced, but once
you do a commit (depending on settings), a new searcher will be opened and
warmed up while the old searcher is still open and used for searching.
Once the new searcher has finished its warmup procedure, the old searcher
will be replaced with the new warmed searcher, which will then allow you
to search the newest documents added to the index.
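
As a sketch of where that warming is configured (the query values here
are only illustrative), solrconfig.xml can register queries to run
against the new searcher before it is swapped in:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">yourField:[* TO *]</str>
        <str name="sort">yourField asc</str>
        <str name="rows">10</str>
      </lst>
    </arr>
  </listener>

A similar "firstSearcher" listener covers the very first searcher opened
at startup, and the autowarmCount settings on the individual caches
control how many entries are copied over from the old searcher's caches.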


- Aleks

On Mon, 01 Dec 2008 01:32:05 +0100, souravm <[EMAIL PROTECTED]> wrote:


Hi All,

Say I have started a new Solr server instance using start.jar via the
java command. For this Solr server instance, when would a new Searcher
be created?

I am aware of the following scenarios:

1. When the instance is started, a new Searcher is created for
autowarming. But I am not sure whether this searcher will continue to
be alive or will die after the autowarming is over.
2. When I do the first search in this server instance through select, a
new searcher is created, and from then on the same searcher is used for
all selects to this instance. Even if I run multiple search requests
concurrently, I see that the same Searcher is used to service those
requests.
3. When I add documents to the index on this instance through an update
statement, a new searcher is created.

Please let me know if there are any other situations in which a new
Searcher is created.

Regards,
Sourav








--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no