Filtering results
Hello All,

I'm looking for a way to filter results by some ranking mechanism. For example, suppose you have 30 docs in an index, in three groups of 10:

A,1  A,2  ...  A,10
B,1  B,2  ...  B,10
C,1  C,2  ...  C,10

I would like to get 3 records back, such that I get a single "best" result from each logical group. So, if I searched with a term that matched all the docs in the index, I could be certain to get one doc with A in it, one with B in it, and one with C in it.

At the moment, I have a Solr index that has a category field, and the index will have between 1 and 2 million docs when we are done indexing.

I'm going to spend some time today researching this. If anyone can send me some advice, I would be grateful. I've considered post-processing the results, but I'm not sure if this is the wisest plan, and I don't know how I would get accurate result counts for pagination.

cheers
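The post-processing idea mentioned above can be sketched as one cheap rows=1 query per logical group. This is only an illustration, not something from the thread: the Solr URL and the `category` field name are assumptions.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # assumed Solr endpoint

def top_doc_queries(terms, categories):
    """One query per group: same q, plus a filter query restricting to the
    group, and rows=1 so Solr returns only that group's best-ranked doc."""
    urls = []
    for cat in categories:
        params = urlencode({
            "q": terms,
            "fq": "category:%s" % cat,  # 'category' field name is assumed
            "rows": 1,
        })
        urls.append("%s?%s" % (SOLR_SELECT, params))
    return urls

urls = top_doc_queries("dog", ["A", "B", "C"])
```

Pagination stays honest because each per-group query also reports its own numFound. The cost is one query per group, which is why the thread steers toward field collapsing instead.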
Re: Filtering results
thanks. very interesting. The plot thickens. And, yes, I think field collapsing is exactly what I'm after.

I'm considering trying this patch now. I have a Solr 1.2 instance on Jetty; it looks like I need to install the patch. Does anyone use that patch? Recommend it? The wiki page (http://wiki.apache.org/solr/FieldCollapsing) says "This patch is not complete, but it will be useful to keep this page updated while the interface evolves." And the page was last updated over a year ago, so I'm not sure if that is a good sign. I'm trying to read through all the comments now.

I'm also considering creating a second index of just the categories, which contains all the content from the main index collapsed down into the corresponding categories - basically a completely collapsed index. Initial searches would be done against this collapsed category index, and then the first 10 results would be used to do 10 field queries against the main index to get the "top" record to return with each category.

Haven't decided which path to take yet.

cheers
gene

On Wed, Sep 17, 2008 at 9:42 AM, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : 1. Identify all records that would match search terms. (Suppose I
> : search for 'dog', and get 450,000 matches)
> : 2. Of those records, find the distinct list of groups over all the
> : matches. (Suppose there are 300.)
> : 3. Now get the top ranked record from each group, as if you search
> : just for docs in the group.
>
> this sounds similar to "Field Collapsing" although i don't really
> understand it or your specific use case enough to be certain that it's
> the same thing. You may find the patch, and/or the discussions about
> the patch, useful starting points...
>
> https://issues.apache.org/jira/browse/SOLR-236
> http://wiki.apache.org/solr/FieldCollapsing
>
> -Hoss
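A note for readers finding this thread later: the SOLR-236 line of work eventually landed in Solr proper as Result Grouping (Solr 3.3 and later), which does exactly what is asked here without any patch. A sketch of the parameters, reusing the category field from this thread:

```python
from urllib.parse import urlencode

# Result Grouping (built into Solr 3.3+): a single query returns the
# top-ranked doc for each distinct value of the grouping field.
params = urlencode({
    "q": "dog",
    "group": "true",
    "group.field": "category",  # the thread's logical-group field
    "group.limit": 1,           # "best" doc per group
})
query = "/solr/select?" + params
```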
How to copy a solr index to another index with a different schema collapsing stored data?
Is it possible to copy stored index data from one index to another, concatenating it as you go?

Suppose two categories, A and B, each with 20 docs, for a total of 40 docs in the index. The index has a stored field for the content of the docs.

I want a new index with only two docs in it, one for A and one for B, each with a stored field that is the concatenation of all the stored data for the 20 docs of A and of B respectively. A query on this index would then give me a relevant list of categories.

Perhaps there's a Solr query to get that data out, and then I can handle concatenating it and indexing it into the new index? I'm hoping I don't have to reindex all this data from scratch - it has taken weeks!

thanks
gene
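If the content field really is stored and can be paged out with ordinary queries, the collapse itself is simple client-side work. A sketch under those assumptions (the stored `category` and `content` field names are made up for illustration):

```python
from collections import defaultdict

def collapse_by_category(docs):
    """Group docs by category and concatenate their stored content into
    one 'collapsed' doc per category, ready to index into the new core."""
    merged = defaultdict(list)
    for doc in docs:
        merged[doc["category"]].append(doc["content"])
    return [{"id": cat, "content": " ".join(parts)}
            for cat, parts in sorted(merged.items())]

sample = [{"category": "A", "content": "red fox"},
          {"category": "B", "content": "blue bird"},
          {"category": "A", "content": "brown dog"}]
collapsed = collapse_by_category(sample)
```

The catch, as the reply below this message notes, is that Lucene offers no index-level copy+merge+flatten, so every doc has to pass through a query-out/index-in round trip.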
Re: How to copy a solr index to another index with a different schema collapsing stored data?
Is it possible to query out the stored data as, uh, tokens I suppose, and then index those tokens in the next index?

thanks
gene

On Wed, Sep 17, 2008 at 1:14 PM, Gene Campbell <[EMAIL PROTECTED]> wrote:
> I was pretty sure you'd say that. But it means a lot that you take the
> time to confirm it. Thanks Otis.
>
> I don't want to give details, but we crawl for our data, and we don't
> save it in a DB or on disk. It goes from download to index. Was a
> good idea at the time, when we thought our designs were done evolving.
> :)
>
> cheers
> gene
>
> On Wed, Sep 17, 2008 at 12:51 PM, Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
>> You can't copy+merge+flatten indices like that. Reindexing would be the
>> easiest. Indexing taking weeks sounds suspicious. How much data are you
>> reindexing and how big are your indices?
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> [original quoted message trimmed]
How to set term frequency given a term and a value stating the frequency?
Hello,

I'm looking through the wiki, so if it's there, I'll find it and you can ignore this post. If this isn't documented, can anyone explain how to achieve it?

Suppose I have two docs, A and B, that I want to index so that A has the equivalent of 100 copies of 'Banana' and B has the equivalent of 20 copies of 'Banana', so that searches for Banana will rank A before B due to term frequency. When indexing, I would supply something like: A, Banana, 100; B, Banana, 20.

Will I have to repeat 'Banana' 100 times in a string that I send to the index, and likewise 20 times for B? Or is there a better way to accomplish this?

thanks
gene
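Repeating the term is indeed the blunt way to do this with stock Solr 1.x (index-time boosts are the usual alternative, but they weight the whole field or document rather than one term). Two caveats worth knowing: Lucene's default similarity dampens tf with a square root, so 100 vs. 20 copies yields roughly a 2.2x score contribution, not 5x; and the repeated field bloats the index. A sketch of the workaround:

```python
def weighted_field(term, count):
    """Fake a raw term frequency by literal repetition: after analysis the
    field really does contain `count` occurrences of the term."""
    return " ".join([term] * count)

# Field name 'keywords' is an assumption for illustration.
doc_a = {"id": "A", "keywords": weighted_field("Banana", 100)}
doc_b = {"id": "B", "keywords": weighted_field("Banana", 20)}
```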
how to find terms on a page?
Hello,

I haven't heard of or found a way to get the number of times a term appears on a page. Lucene uses this in scoring, I believe (Solr scoring: http://tinyurl.com/4tb55r).

Basically, for a given page, I would like a list of the terms on the page and the number of times each term appears.

thanks
gene
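Lucene stores exactly these counts in term vectors when the field is indexed with them, but Solr 1.2 has no handler that exposes them (Solr 1.4's TermVectorComponent later did). Client-side, you can approximate the list by re-tokenizing the stored page text. A sketch with a deliberately naive analyzer, which will not match Solr's analysis chain exactly:

```python
import re
from collections import Counter

def term_frequencies(text):
    """Very naive analysis: lowercase, split on non-alphanumerics. To match
    the index exactly you would have to mirror the field's analyzer
    (tokenizer, stopwords, stemming)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

freqs = term_frequencies("The dog chased the other dog.")
```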
Re: Filtering results
Otis,

Would it be reasonable to run a query like this

http://localhost:8280/solr/select/?q=terms_x&version=2.2&start=0&rows=0&indent=on

10 times, one for each result from an initial category query on a different index? So it's still 1+10 queries, but I'm not returning values. This would give me the number of pages that would match, and I can display that number. Not ideal, but better than nothing, and hopefully not a problem with scaling.

cheers
gene

On Wed, Sep 17, 2008 at 1:21 PM, Gene Campbell <[EMAIL PROTECTED]> wrote:
> OK thanks Otis. Any gut feeling on the best approach to get this
> collapsed data? I hate to ask you to do my homework, but I'm coming to
> the end of my Solr/Lucene knowledge. I don't code Java too well - used
> to, but switched to Python a while back.
>
> gene
>
> On Wed, Sep 17, 2008 at 12:47 PM, Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
>> Gene,
>>
>> The latest patch from Bojan for SOLR-236 works with whatever revision
>> of Solr he used when he made the patch.
>>
>> I didn't follow this thread to know your original requirements, but
>> running 1+10 queries doesn't sound good to me from a
>> scalability/performance point of view.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> [earlier quoted messages trimmed]
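The rows=0 trick above works because Solr computes numFound for the full result set regardless of how many docs it returns, so the responses stay tiny. A sketch of building such a count query and reading the count out of a JSON response (wt=json is assumed; the `category` field name is illustrative):

```python
import json
from urllib.parse import urlencode

def count_query(base_url, terms, category):
    """rows=0: Solr still reports numFound but ships no documents."""
    return base_url + "?" + urlencode({
        "q": terms,
        "fq": "category:%s" % category,  # 'category' field is assumed
        "rows": 0,
        "wt": "json",
    })

def num_found(body):
    """Pull the total hit count out of a Solr JSON response body."""
    return json.loads(body)["response"]["numFound"]

url = count_query("http://localhost:8280/solr/select", "dog", "A")
total = num_found('{"response": {"numFound": 450000, "start": 0, "docs": []}}')
```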
Re: Filtering results
Thanks Otis for the reply! Always appreciated! That is indeed what we are looking at implementing. But I'm running out of time to prototype or experiment for this release. I'm going to run the two-index thing for now, unless I find something saying it is really easy and sensible to run one index and collapse on a field.

thanks
gene

On Fri, Sep 19, 2008 at 3:24 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Gene,
> I haven't looked at Field Collapsing for a while, but if you have a
> single index and collapse hits on your category field, then won't the
> first 10 hits be the items you are looking for - the top item for each
> category x 10, using a single query?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> [earlier quoted messages trimmed]
Are facet searches slower on large indexes?
Hello,

I'm doing a facet search like the following. The content field schema is [...]

/solr/select?q=dirt field:www.example.com&facet=true&facet.field=content&facet.limit=-1&facet.mincount=1

If I run this on a server with a total of 1,000 pages, including pages for www.example.com, it returns in about 1 second and gives me 37 docs and quite a few facet values. If I run the same search on a server with over 1,000,000 pages in total, including the pages from the first example, it returns in about 2 minutes - still giving me 37 docs and the same facet values!

It seems to me the search should have been constrained to field:www.example.com in both cases, so the execution time shouldn't be much different. Is there any more information on facet searching that will explain what's going on?

thanks
gene
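A likely explanation (my reading, not from the thread): field faceting walks the unique terms of the faceted field across the whole index and intersects each with the matching doc set. Faceting on a tokenized full-text field like `content` means touching millions of unique terms on the million-page server, so the time tracks index vocabulary, not the 37 hits. Moving the site restriction into fq at least lets Solr cache that filter between requests, though the facet cost itself will remain. A sketch of the rewritten parameters (values taken from the message above):

```python
from urllib.parse import urlencode

# Same search, but the site restriction moves from q into fq, where the
# resulting filter is cached and reused across requests. The facet cost
# is driven by the number of unique terms in `content` index-wide, so
# on a big index expect it to stay heavy regardless.
params = urlencode({
    "q": "dirt",
    "fq": "field:www.example.com",
    "facet": "true",
    "facet.field": "content",
    "facet.limit": -1,
    "facet.mincount": 1,
})
url = "/solr/select?" + params
```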
Re: solr on ubuntu 8.04
I had absolutely no luck with the jetty-solr package on Ubuntu 8.04. I haven't tried Tomcat for Solr. I do have Solr running on Ubuntu though. Here's what I did - hope this helps. Don't do this unless you understand the steps: when I say things like 'remove contents', I don't know what you have in there. But if you don't have Jetty or Solr yet, you'll probably be safe.

1) wget http://www.trieuvan.com/apache/lucene/solr/1.3.0/apache-solr-1.3.0.tgz
2) wget http://dist.codehaus.org/jetty/jetty-6.1.11/jetty-6.1.11.zip (possibly something newer will work, but this does)
3) Get Java 1.6
4) unzip jetty-6.1.11.zip into /opt
5) ln -s /opt/jetty-6.1.11 jetty
6) remove the contents of /opt/jetty/contexts
7) mkdir /opt/jetty/contexts/solr
8) tar -zxvf apache-solr-1.3.0.tgz into /opt/jetty/contexts/solr
9) create a file called /opt/jetty/contexts/jetty-solr.xml with the following contents, changing the solr/home path to your own (my example shows /var/solr/home):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Configure PUBLIC "-//Mortbay//DTD Configure//EN"
  "http://jetty.mortbay.org/configure.dtd">
<Configure class="org.mortbay.jetty.webapp.WebAppContext">
  <Set name="configurationClasses">
    <Array type="java.lang.String">
      <Item>org.mortbay.jetty.webapp.WebInfConfiguration</Item>
      <Item>org.mortbay.jetty.plus.webapp.EnvConfiguration</Item>
      <Item>org.mortbay.jetty.plus.webapp.Configuration</Item>
      <Item>org.mortbay.jetty.webapp.JettyWebXmlConfiguration</Item>
      <Item>org.mortbay.jetty.webapp.TagLibConfiguration</Item>
    </Array>
  </Set>
  <Set name="contextPath">/solr</Set>
  <Set name="resourceBase">/opt/jetty/contexts/solr/</Set>
  <Set name="extractWAR">false</Set>
  <Set name="parentLoaderPriority">false</Set>
  <Set name="overrideDescriptor">/opt/jetty/contexts/solr/WEB-INF/web.xml</Set>
  <New class="org.mortbay.jetty.plus.naming.EnvEntry">
    <Arg>solr/home</Arg>
    <Arg type="java.lang.String">/var/solr/home</Arg>
  </New>
</Configure>

10) start Jetty:
a) java -jar start.jar etc/jetty.xml while developing
b) nohup java -jar start.jar etc/jetty.xml > /dev/null 2>&1 & to make it quiet and headless

I know this doesn't help you get it going on Tomcat. But perhaps you should consider Jetty; it works fine under Ubuntu. (Note this is a development setup. I don't have a production-tested setup yet. I hope it's not too different, but I thought I should mention that.)
Hope this helps someone

cheers
gene

On Fri, Oct 3, 2008 at 4:48 AM, Tricia Williams <[EMAIL PROTECTED]> wrote:
> I haven't tried installing the ubuntu package, but the releases from
> apache.org come with an example that contains a directory called "solr",
> which contains a directory called "conf" where schema.xml and
> solrconfig.xml are important. Is it possible these files do not exist in
> the path?
>
> Tricia
>
> Jack Bates wrote:
>> No sweat - did you install the Ubuntu solr package or the solr.war from
>> http://lucene.apache.org/solr/?
>>
>> When you say it doesn't work, what exactly do you mean?
>>
>> On Thu, 2008-10-02 at 07:43 -0700, [EMAIL PROTECTED] wrote:
>>> Hi Jack,
>>> Really I would love it if you could help me with this... and tell me
>>> what you have in
>>> ./var/lib/tomcat5.5/webapps
>>> ./usr/share/tomcat5.5/webapps
>>>
>>> It doesn't work, I don't know why :(
>>> Thanks a lot
>>> Johanna
>>>
>>> Jack Bates-2 wrote:
>>>> Thanks for your suggestions. I have now tried installing Solr on two
>>>> different machines. On one machine I installed the Ubuntu
>>>> solr-tomcat5.5 package, and on the other I simply dropped "solr.war"
>>>> into /var/lib/tomcat5.5/webapps
>>>>
>>>> Both machines are running Tomcat 5.5
>>>>
>>>> I get the same error message on both machines:
>>>>
>>>> SEVERE: Exception starting filter SolrRequestFilter
>>>> java.lang.NoClassDefFoundError: Could not initialize class
>>>> org.apache.solr.core.SolrConfig
>>>>
>>>> The full error message is attached. I can confirm that the
>>>> /usr/share/solr/WEB-INF/lib/apache-solr-1.2.0.jar jar file contains
>>>> org/apache/solr/core/SolrConfig.class - however I do not know why
>>>> Tomcat does not find it.
>>>>
>>>> Thanks again, Jack
>>>>
>>>>> Hardy has solr packages already. You might want to look at how they
>>>>> packaged solr if you cannot move to that version.
>>>>> Did you just drop the war file? Or did you use JNDI? You probably
>>>>> need to configure solr/home, and maybe fiddle with securitymanager
>>>>> stuff.
>>>>>
>>>>> Albert
>>>>>
>>>>> On Thu, May 1, 2008 at 6:46 PM, Jack Bates freezone.co.uk> wrote:
>>>>>> I am trying to evaluate Solr for an open source records management
>>>>>> project to which I contribute: http://code.google.com/p/qubit-toolkit/
>>>>>>
>>>>>> I installed the Ubuntu solr-tomcat5.5 package:
>>>>>> http://packages.ubuntu.com/hardy/solr-tomcat5.5
>>>>>>
>>>>>> - and pointed my browser at: http://localhost:8180/solr/admin (the
>>>>>> Ubuntu and Debian Tomcat packages run on port 8180)
>>>>>>
>>>>>> However, in response I get a Tomcat 404: The requested resource
>>>>>> (/solr/admin) is not available.
>>>>>>
>>>>>> This differs from the response I get accessing a random URL:
>>>>>> http://localhost:8180/foo/bar
>>>>>> - which displays a blank page.
>>>>>>
>>>>>> From this I gather that the solr-tomcat5.5 package installed
>>>>>> *something*, but that it's misconfigured
Re: solr on ubuntu 8.04
> SEVERE: Exception starting filter SolrRequestFilter
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.solr.core.SolrConfig

btw, this looks like you are using current 1.3 or head versions of features in schema.xml or solrconfig.xml, but running on a 1.2 version of Solr. Perhaps if you look up the output a bit you'll see it finding and loading these files, and then blowing up on one of them. You basically need jar files that match the features you're using; the easiest way I know is to get 1.3. I'm still quite a novice with Solr admin, but this one I've seen before. Hope this helps.

gene

On Fri, Oct 3, 2008 at 10:14 AM, ristretto.rb <[EMAIL PROTECTED]> wrote:
> I had absolutely no luck with the jetty-solr package on Ubuntu 8.04.
> I haven't tried Tomcat for Solr.
>
> [rest of quoted message trimmed - see the previous post in this thread]
How are multivalued fields used?
How does one use this field type? Forums, the wiki, Lucene in Action - all coming up empty. If there's a doc somewhere, please point me to it.

I use pysolr to index, but that's not a requirement. I'm not sure how one adds multiple values to a document. And once added, if you want to remove one value, how do you specify it?

http://wiki.apache.org/solr/FieldOptionsByUseCase says to use it to "add multiple values, maintaining order". Is the order for indexing/searching or for storing/returning?

thanks
gene
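For the add side, Solr's XML update format answers this directly: a multivalued field is simply the same <field> element repeated once per value, and stored values come back in the order they were sent (which is what the wiki's "maintaining order" chiefly refers to). There is no per-value delete: updates replace whole documents, so to remove one value you re-send the document with that value omitted. A sketch building such a document (the `tag` field name is made up):

```python
import xml.etree.ElementTree as ET

def add_doc(doc_id, tags):
    """Build a Solr <add> message. A multivalued field is the same
    <field name=...> element repeated once per value, in order."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    for tag in tags:  # repeated element = multiple values
        ET.SubElement(doc, "field", name="tag").text = tag
    return ET.tostring(add, encoding="unicode")

msg = add_doc("42", ["python", "solr"])
# To "remove" the value "solr", re-send the whole doc with tags=["python"];
# an <add> with the same unique id replaces the old document.
```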