Re: boosting certain terms within one field?
Hi Peter,

What are the downsides to your last alternative approach below? That seems like the simplest approach, and it should work as long as the terms within those fields do not need to be boosted separately.

If you want to go the boosting-terms route, this is handled via a thing called Payloads in Lucene. Payloads are arrays of bytes that are added during indexing at the term level through the analysis process. To do this in Solr, you would need to write your own TokenFilter that adds payloads as needed. Then, during search, you can take advantage of these payloads by using the BoostingTermQuery from Lucene. The downside to all of this is that Solr doesn't currently support it, so you would be coding it up yourself. I'm sure, though, that if you were to start a patch on it, there would be others who are interested.

A note on the payloads: the biggest sticking point, I think, is coming up w/ an efficient way of encoding the byte array and putting it into the XML format, such that one can send in payloads when indexing. It's not particularly hard, but no one has done it yet.

-Grant

On Nov 29, 2008, at 10:45 PM, Peter Wolanin wrote:

I've recently started working on the Drupal integration module for SOLR, and we are looking for suggestions for how to address this question: how do we boost the importance of a subset of terms within a field?

For example, we are using the standard request handler for queries, and the default field for keyword searches is a concatenation of the title, body, taxonomy terms, etc.

One "hackish" way I can imagine is that terms we want to boost (for example the title, or text inside h2 tags) could be concatenated multiple times. Would this be effective and reasonable?

It seems like the alternative is to try to switch to using the dismax handler, storing the terms that we desire to have different boosts into different fields, all of which are in the list of query fields?

Thanks in advance for your suggestions.

-Peter

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
[EMAIL PROTECTED]

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
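To make the payload route concrete, here is a minimal, hypothetical sketch against the Lucene 2.4-era analysis API. The filter name, the single-byte encoding, and boosting every token the filter sees are all invented for illustration; a real filter would decide per token (e.g., only for title text):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

public class BoostPayloadFilter extends TokenFilter {
  private final byte boost; // application-defined boost byte

  public BoostPayloadFilter(TokenStream input, byte boost) {
    super(input);
    this.boost = boost;
  }

  public Token next(Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token != null) {
      // Stored alongside the term position; BoostingTermQuery can read it
      // back at search time through Similarity.scorePayload().
      token.setPayload(new Payload(new byte[] { boost }));
    }
    return token;
  }
}

At search time you would pair this with Lucene's BoostingTermQuery and a custom Similarity whose scorePayload() decodes the byte back into a score multiplier.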
Re: range queries on string field with millions of values
On Sun, Nov 30, 2008 at 2:04 AM, Naomi Dushay <[EMAIL PROTECTED]> wrote:
> The terms component approach, if I understand it correctly, will be
> problematic. I need to present not only the next X call numbers in
> sequence, but other fields in those documents (e.g. title, author).

You can still use the method Hoss suggested of doing 2 requests to satisfy this type of search:

>> But as Yonik said: the new TermsComponent may actually be a better option
>> for you -- doing two requests for every page (the first to get the N terms
>> in your id field starting with your input, the second to do a query for
>> docs matching any of those N ids) might actually be faster even though
>> there won't likely even be any cache hits.

So TermsComponent gets the next 10 IDs, then you do a standard query with those 10 IDs.

-Yonik

> I assume the Terms Component approach will only give me the next X call
> number values, not the documents.
>
> It sounds like Glen Newton's suggestion of mapping the call numbers to a
> float number is the most likely solution.
>
> I know it sounds ridiculous to do all this for a "call number browse", but
> our faculty have explicitly asked for this. Humanities scholars especially
> know the call numbers that are of interest to them, and they browse the
> stacks that way (ML 1500s are opera, V35 is Verdi ...). They are using the
> research methods that have been successful for their entire careers. Plus,
> library materials are going to off-site, high-density storage, so the only
> way for them to browse all materials, regardless of location, via call
> number is online. I doubt they'll find this feature as useful as they
> expect, but it behooves us to give the users what they ask for.
>
> So yeah, our user needs are perhaps a little outside of your expectations.
> :-)
>
> - Naomi
>
> On Nov 29, 2008, at 2:58 PM, Chris Hostetter wrote:
>
>> : The results are correct. But the response time sucks.
>> :
>> : Reading the docs about caches, I thought I could populate the query
>> : result cache with an autowarming query and the response time would be
>> : okay. But that hasn't worked. (See excerpts from my solrConfig file
>> : below.)
>> :
>> : A repeated query is very fast, implying caching happens for a
>> : particular starting point ("42" above).
>> :
>> : Is there a way to populate the cache with the ENTIRE sorted list of
>> : values for the field, so any arbitrary starting point will get results
>> : from the cache, rather than grabbing all results from (x) to the end,
>> : then sorting all these results, then returning the first 10?
>>
>> there's two "caches" that come into play for something like this...
>>
>> the first cache is a low-level Lucene cache called the "FieldCache" that
>> is completely hidden from you (and for the most part: from Solr).
>> anytime you sort on a field, it gets built, and reused for all sorts on
>> that field. my original concern was that it wasn't getting warmed on
>> "newSearcher" (because you have to be explicit about that).
>>
>> the second cache is the queryResultsCache, which caches a "window" of an
>> ordered list of documents based on a query and a sort. you can see this
>> cache in your Solr stats, and yes: these two requests result in different
>> cache keys for the queryResultsCache...
>>
>> q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
>> q=yourField:[52+TO+*]&sort=yourField+asc&rows=10
>>
>> ...BUT! ... the two queries below will result in the same cache key, and
>> the second will be a cache hit, provided a sufficient value for
>> "queryResultWindowSize" ...
>>
>> q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
>> q=yourField:[42+TO+*]&sort=yourField+asc&rows=10&start=10
>>
>> so perhaps the key to your problem is to just make sure that once the
>> user gives you an id to start with, you "scroll" by increasing the start
>> param (not altering the id) ... the first query might be "slow", but
>> every query after that should be a cache hit (depending on your page size
>> and how far you expect people to scroll, you should consider increasing
>> queryResultWindowSize).
>>
>> But as Yonik said: the new TermsComponent may actually be a better option
>> for you -- doing two requests for every page (the first to get the N
>> terms in your id field starting with your input, the second to do a query
>> for docs matching any of those N ids) might actually be faster even
>> though there won't likely even be any cache hits.
>>
>> My opinion: Your use case sounds like a waste of effort. I can't imagine
>> anyone using a library catalog system ever wanting to look up a call
>> number, and then scroll through all possible books with similar call
>> numbers -- it seems much more likely that i'd want to look at other books
>> with similar authors, or keywords, or tags ... all things that are
>> actually *easier* to do with Solr. (but then again: i don't work in a
>> library. i trust that y
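For the archives, the two-request pattern Yonik and Hoss describe would look roughly like this. This is a sketch only: the callnumber field is invented, and the handler path and parameter names follow the in-development TermsComponent, so check the wiki for the current syntax:

# 1) ask TermsComponent for the next 10 terms at or after the user's input
http://localhost:8983/solr/terms?terms.fl=callnumber&terms.lower=ML1500&terms.lower.incl=true&terms.limit=10

# 2) query for the documents carrying exactly those terms, with their stored fields
http://localhost:8983/solr/select?q=callnumber:(ML1500 OR ML1501 OR ...)&fl=title,author,callnumber&rows=10

The first request touches only the term dictionary, so it stays fast even with millions of distinct values; the second is an ordinary boolean query over at most N ids.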
Re: boosting certain terms within one field?
Hi Grant,

Thanks for your feedback. The major short-term downside to switching to dismax with multiple fields would be the required rewriting of our current PHP code, especially our code that handles adding facet fields to the q parameter. From reading about dismax, it seems we would instead need to use fq to limit the search results to those matching a specific facet value.

Best,

Peter

On Sun, Nov 30, 2008 at 8:43 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Hi Peter,
>
> What are the downsides to your last alternative approach below? That
> seems like the simplest approach and should work as long as the terms
> within those fields do not need to be boosted separately.
>
> If you want to go the boosting-terms route, this is handled via a thing
> called Payloads in Lucene. Payloads are arrays of bytes that are added
> during indexing at the term level through the analysis process. To do
> this in Solr, you would need to write your own TokenFilter that adds
> payloads as needed. Then, during search, you can take advantage of these
> payloads by using the BoostingTermQuery from Lucene. The downside to all
> of this is that Solr doesn't currently support it, so you would be coding
> it up yourself. I'm sure, though, that if you were to start a patch on
> it, there would be others who are interested.
>
> A note on the payloads: the biggest sticking point, I think, is coming up
> w/ an efficient way of encoding the byte array and putting it into the
> XML format, such that one can send in payloads when indexing. It's not
> particularly hard, but no one has done it yet.
>
> -Grant
>
> On Nov 29, 2008, at 10:45 PM, Peter Wolanin wrote:
>
>> I've recently started working on the Drupal integration module for
>> SOLR, and we are looking for suggestions for how to address this
>> question: how do we boost the importance of a subset of terms within
>> a field.
>>
>> For example, we are using the standard request handler for queries,
>> and the default field for keyword searches is a concatenation of the
>> title, body, taxonomy terms, etc.
>>
>> One "hackish" way I can imagine is that terms we want to boost (for
>> example the title, or text inside h2 tags) could be concatenated
>> multiple times. Would this be effective and reasonable?
>>
>> It seems like the alternative is to try to switch to using the dismax
>> handler, storing the terms that we desire to have different boosts
>> into different fields, all of which are in the list of query fields?
>>
>> Thanks in advance for your suggestions.
>>
>> -Peter
>>
>> --
>> Peter M. Wolanin, Ph.D.
>> Momentum Specialist, Acquia. Inc.
>> [EMAIL PROTECTED]
>
> --
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
[EMAIL PROTECTED]
Re: boosting certain terms within one field?
Adding constraints obtained from facets is best done using fq anyway, so it's worth making that switch in your client code.

	Erik

On Nov 30, 2008, at 10:43 AM, Peter Wolanin wrote:

> Hi Grant,
>
> Thanks for your feedback. The major short-term downside to switching to
> dismax with multiple fields would be the required rewriting of our
> current PHP code, especially our code that handles adding facet fields
> to the q parameter. From reading about dismax, it seems we would instead
> need to use fq to limit the search results to those matching a specific
> facet value.
>
> Best,
>
> Peter
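Concretely, the switch Erik describes moves the facet constraint out of the user query and into a filter query. A hedged example (field and values invented for illustration):

before: q=ipod +category:electronics
after:  q=ipod&fq=category:electronics&facet=true&facet.field=category

Besides fitting dismax (which does not parse arbitrary boolean syntax in q), each fq is cached independently in the filterCache, so a facet constraint that many users apply is computed once and reused.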
What are the scenarios when a new Searcher is created?
Hi All,

Say I have started a new Solr server instance using java -jar start.jar. For this Solr server instance, when would a new Searcher be created?

I am aware of the following scenarios -

1. When the instance is started, a new Searcher is created for autowarming. But I am not sure whether this Searcher will continue to be alive or will die after the autowarming is over.

2. When I do the first search on this server instance through select, a new Searcher is created, and from then on the same Searcher is used for all selects to this instance. Even if I run multiple search requests concurrently, I see that the same Searcher is used to service those requests.

3. When I add to the index on this instance through an update statement, a new Searcher is created.

Please let me know if there are any other situations in which a new Searcher is created.

Regards,
Sourav
Re: NIO not working yet
OK, the development version of Solr should now be fixed (i.e., NIO should be the default for non-Windows platforms). The next nightly build (Dec-01-2008) should have the changes.

-Yonik

On Wed, Nov 12, 2008 at 2:59 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> NIO support in the latest Solr development versions does not work yet
> (I previously advised that some people with possible lock contention
> problems try it out). We'll let you know when it's fixed, but in the
> meantime you can always set the system property
> "org.apache.lucene.FSDirectory.class" to
> "org.apache.lucene.store.NIOFSDirectory" to try it out.
>
> for example:
>
> java -Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.NIOFSDirectory -jar start.jar
>
> -Yonik
Solr with Network File Server
Hi,

I have huge index files to query. A first-cut calculation suggests I would need around 3 boxes per app (each box holding no more than 125M records, about 12.5GB) for around 25 apps, so 75 boxes all together. However, the number of concurrent users would be smaller: probably not more than 20 at a time, 25 at most. So I am thinking of an option where I use around 20-25 servers, each with a 2GB heap, with all the indexes stored on a network file server. I know this would hurt performance (especially for the first query), but I am not sure by how much. If anybody has already tried this type of solution, please let me know what the performance impact was.

Regards,
Sourav
Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!
Hi Fergie,

I haven't forgotten about you, but I've been traveling and then into some US holidays here. To confirm I am understanding: you are seeing a slowdown between a 1.3-dev from April and one from September, right? Can you produce an MD5 hash of the WAR files or something, so that I can know I have the exact bits? Better yet, perhaps you can put those files up somewhere where they can be downloaded.

Thanks,
Grant

On Nov 26, 2008, at 10:54 AM, Fergus McMenemie wrote:

Hello Grant,

Not much good with Java profilers (yet!), so I thought I would send a script! Details... details! Having decided to produce a script to replicate the 1.2 vs. 1.3 speed problem, the required rigor revealed a lot more.

1) The faster version I have previously referred to as 1.2 was actually a "1.3-dev" I had downloaded as part of the solr bootcamp class at ApacheCon Europe 2008. The ID string in the CHANGES.txt document is:

   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $

2) I did actually download and speed test a version of 1.2 from the internet. Its CHANGES.txt id is:

   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $

   Speed-wise it was about the same as 1.3, at 64min. It also had lots of charset issues and is ignored from now on.

3) The version I was planning to use, till I found this speed issue, was the "latest" official version:

   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $

   I also verified the behavior with a nightly build:

   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $

Anyway, the following script indexes the content in 22min for the 1.3-dev version and takes 68min for the newer releases of 1.3. I took the conf directory from the 1.3-dev (bootcamp) release and used it to replace the conf directory from the official 1.3 release. The 3x slowdown was still there; it is not a configuration issue!

=====

#! /bin/bash
# This script assumes a /usr/local/tomcat link to whatever version
# of tomcat you have installed. I have "apache-tomcat-5.5.20". Also
# /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
# All the following was done as root.

# I have a directory /usr/local/ts which contains four versions of solr: the
# "official" 1.2 along with two 1.3 releases and a version of 1.2 or a 1.3
# beta I got while attending a solr bootcamp. I indexed the same content
# using the different versions of solr as follows:
cd /usr/local/ts
if [ "" ]
then
   echo "Starting from a-fresh"
   sleep 5   # allow time for me to interrupt!
   cp -Rp apache-solr-bc/example/solr ./solrbc   # bc = bootcamp
   cp -Rp apache-solr-nightly/example/solr ./solrnightly
   cp -Rp apache-solr-1.3.0/example/solr ./solr13

   # the gaz is regularly updated and its name keeps changing :-) The page
   # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest
   # version.
   curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip
   unzip -q geonames.zip

   # delete corrupt blips!
   perl -i -n -e 'print unless ($. > 2128495 and $. < 2128505) or ($. > 5944254 and $. < 5944260);' geonames_dd_dms_date_20081118.txt

   # following was used to detect bad short records
   #perl -a -F\\t -n -e 'print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt

   # my set of fields and copyfields for the schema.xml
   fields='
      stored="true" required="true" />
      stored="true"/>
      stored="true"/>
      stored="true"/>
      stored="true"/>
      stored="true"/>
      stored="true"/>
      stored="true"/>
      stored="true"/>
      stored="true"/>
   '
   copyfields='
   '

   # add in my fields and copyfields
   perl -i -p -e "print qq($fields) if s///;" solr*/conf/schema.xml
   perl -i -p -e "print qq($copyfields) if s[][];" solr*/conf/schema.xml

   # change the unique key and mark the "id" field as not required
   perl -i -p -e "s/id/UNI/i;" solr*/conf/schema.xml
   perl -i -p -e 's/required="true"//i if m/conf/schema.xml

   # enable remote streaming in solrconfig file
   perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
fi

# some constants to keep the curl command shorter
skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"
file=`pwd`"/geonames.txt"
export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"

echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
/usr/local/tomcat/bin/shutdown.sh
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
then
   echo "Tomcat would not shutdown"
   exit
fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr
Re: NIO not working yet
Sorry, missed that (and probably a dumb question): does that -D flag work for setting a RAMDirectory as well?

- Jon

On Nov 30, 2008, at 8:42 PM, Yonik Seeley wrote:

> OK, the development version of Solr should now be fixed (i.e., NIO
> should be the default for non-Windows platforms). The next nightly
> build (Dec-01-2008) should have the changes.
>
> -Yonik
Re: What are the scenarios when a new Searcher is created?
When adding documents to Solr, the searcher will not be replaced. But once you do a commit, a new searcher will (depending on settings) be opened and warmed up while the old searcher is still open and serving searches. Once the new searcher has finished its warmup procedure, the old searcher is replaced with the new, warmed one, which lets you search the newest documents added to the index.

- Aleks

On Mon, 01 Dec 2008 01:32:05 +0100, souravm <[EMAIL PROTECTED]> wrote:

> Say I have started a new Solr server instance using java -jar start.jar.
> For this Solr server instance, when would a new Searcher be created?
> [...]
> Please let me know if there are any other situations in which a new
> Searcher is created.

--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
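For completeness, the warming Aleksander describes is configured through event listeners in solrconfig.xml. A minimal sketch based on the stock example config; the queries themselves are placeholders you would replace with your own common sorts and filters:

<!-- run against the new searcher after every commit, before it is registered -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
  </arr>
</listener>

<!-- run once, when the very first searcher is created at startup -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">fast_warm</str><str name="sort">price asc</str></lst>
  </arr>
</listener>

These go inside the <query> section of solrconfig.xml, alongside the autowarmCount settings on the individual caches.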