Highlighting - field criteria highlights in other fields
Hi all, we have a situation in which we have documents that have an introduction (text), a body (text) and some metadata fields (mostly integers). When we create a query like this:

q=( +(body_nl:( brussel) ) AND ( (+publicationid:("3430" OR "3451")) )&fq= +publishdateAsString:[20070520 TO 20080520]&start=0&rows=11&hl=on&hl.fl=body_nl&hl.snippets=3&hl.fragsize=320&hl.simple.pre=<strong>&hl.simple.post=</strong>&sort=publishdateAsString desc,publicationname desc&fl=id,score,introduction

we get nice highlighting from the body_nl field, but Solr also highlights 3430 and 3451 if such a "word" occurs in body_nl, while we were expecting highlighting only for the word "brussel" in body_nl. So it seems that all possible criteria terms are highlighted in any of the given highlighting fields. Is it possible to disable this (with some kind of parameter or something) and only let each hl.fl field be highlighted by the criteria for its own field? greetings, Tim Please see our disclaimer, http://www.infosupport.be/Pages/Disclaimer.aspx
Re: [SPAM] [poll] Change logging to SLF4J?
me too... [ ] Keep solr logging as it is. (JDK Logging) [X] Use SLF4J. On Mon, May 19, 2008 at 10:32 PM, Matthew Runo <[EMAIL PROTECTED]> wrote: > I just read through the dev list's thread.. and I'm voting for SLF4J as > well. > > Thanks! > > Matthew Runo > Software Developer > Zappos.com > 702.943.7833 > > On May 6, 2008, at 7:40 AM, Ryan McKinley wrote: >> >> Hello- >> >> There has been a long-running thread on solr-dev proposing switching >> the logging system to use something other than JDK logging. >> http://www.nabble.com/Solr-Logging-td16836646.html >> http://www.nabble.com/logging-through-log4j-td13747253.html >> >> We are considering using http://www.slf4j.org/. Check: >> https://issues.apache.org/jira/browse/SOLR-560 >> >> The "pro" argument is that: >> * SLF4J allows more flexibility for people using solr outside the >> canned .war to configure logging without touching JDK logging. >> >> The "con" argument goes something like: >> * JDK logging is already the standard logging framework. >> * JDK logging is already in use. >> * SLF4J adds another dependency (for something that already works) >> >> On the dev lists there are strong opinions on either side, but we >> would like to get a larger sampling of opinion and validation before >> making this change. >> >> [ ] Keep solr logging as it is. (JDK Logging) >> [ ] Use SLF4J. >> >> As a bonus question (this time fill in the blank): >> I have tried SOLR-560 with my logging system and >> ___. >> >> thanks >> ryan >> > >
Re: Searching "inside of words"
Thanks a million! That totally did the trick. It is now working at least 95% like I want it to. Gotta tweak it a little more but it seems like the hard part is over. Thanks once again to everybody who helped out. //Daniel Chris Hostetter wrote: : You are doing the right thing. If you are creating n-grams at index : time, you have to match that at query time. If the query is "monitor", : you need to pass that through the n-gram tokenizer, too. n-grams of length : 18 look a little weird you don't *have* to use ngrams at query time ... his goal is "partial" word matching, so he wants to create various-sized ngrams so that input like "onit" matches "monitor" but does not match "on it" Daniel: the options for NGramTokenizerFactory are minGramSize and maxGramSize ... not minGram and maxGram ... you are getting the defaults (which are 1 and 2 I think) it confused me too until I tried your schema changes, and then looked at the analysis.jsp link and saw only 1 and 2 gram tokens being created .. then I checked the class. -Hoss
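For anyone hitting the same confusion later, a minimal schema.xml sketch of the kind of field type Hoss describes, with the correct attribute names (the type name, gram sizes and query-side analyzer below are illustrative, not taken from Daniel's actual schema):

<fieldType name="text_partial" class="solr.TextField">
  <analyzer type="index">
    <!-- index time: emit every 3..18 character gram of each token -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="18"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- query time: no ngrams needed; the whole query term is matched against the indexed grams -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With this setup a query term like "onit" matches the 4-gram that the index-time analyzer produced from "monitor".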
Re: the time factor
Hi Otis, I tried this. It doesn't seem to solve my problem, though. I think it's best used to make small adjustments when relevance scores are similar. In my case, if I want to rank the most recent documents first (because it's about news), I have to use a very large boost, which will end up getting the docs that are not so relevant to the top. I haven't been able to get the desired result of showing only recent documents with decent relevance scores. Ideally, I think it can be solved by doing a query for the past 24 hours and keeping the docs with the best relevance scores, then another query for the previous 24 hours ... but this really isn't very efficient. Maybe OK for news because I may need to serve for up to 7 days. Still, 7 solr queries for a front-end query doesn't sound ideal. So I'm still in search of a better way ... Thanks, Jack On Tue, May 13, 2008 at 9:06 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > The answer is: function queries! :) > You can easily use function queries with DisMaxRequestHandler. For example, > this is what you can add to the dismax config section in solrconfig.xml: > > <str name="bf">recip(rord(addDate),1,1000,1000)^2.5</str> > > Assuming you have an addDate field, this will give fresher documents some boost. Look for this on the Wiki, it's all there. > > Otis
Re: Minion, anyone?
I've been reading Stephen Green's posts, as you can tell by looking at the comments and I, too, wondered if one could put Solr on top of it. After this initial thought I quickly "decided" it would probably be very messy with various Solr configuration settings being very Lucene-specific. I didn't really spend a lot of time thinking about it, though. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: "Binkley, Peter" <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Tuesday, May 20, 2008 6:06:27 PM > Subject: Minion, anyone? > > Has anyone in the Solr community started looking at Sun's Minion (now > released under GPL 2.0)? > > https://minion.dev.java.net/ > > And (dare I say it) might it be possible to wrap Minion into Solr as an > alternative to Lucene? The Search Guy (Stephen Green) has been writing a > series of postings comparing Minion and Lucene: > > http://blogs.sun.com/searchguy/tags/lucene > > The differences aren't huge (as he says, "In an alternate world where > Sun opened up a bit earlier, I would have been working on Lucene from > the get-go, rather than starting from scratch."), but Minion has some > functionality that might be useful to Solr users in some circumstances. > > Peter > > Peter Binkley > Digital Initiatives Technology Librarian > Information Technology Services > 4-30 Cameron Library > University of Alberta Libraries > Edmonton, Alberta > Canada T6G 2J8 > Phone: (780) 492-3743 > Fax: (780) 492-9243 > e-mail: [EMAIL PROTECTED] > > ~ The code is willing, but the data is weak. ~
Re: the time factor
Another possible way to get this done is by assigning weights to field values (e.g. pubDate field should have N% weight and relevancy score should have 100-N% weight) and using their weighted values along with Lucene-provided relevancy score to compute a weighted score. I haven't tried this, it may or may not work, or it may produce similar results as the function I suggested below. If you try this, it would be great to hear if this works. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Jack <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Tuesday, May 20, 2008 3:59:06 PM > Subject: Re: the time factor > > Hi Otis, > > I tried this. It doesn't seem to solve my problem, though. I think > it's best used to make small adjustment when relevance scores are > similar. In my case, if I want to rank the most recent documents first > (because it's about news), I have to use very large boost, which will > end up getting the docs that are not so relevant to the top. I haven't > been able to get desired results of showing only recent documents with > decent relevance scores. > > Ideally, I think it can be solved by doing a query for the past 24 > hours and keeping the docs with best relevance scores, then another > query for the previous 24 hours ... but this really isn't very > efficient. Maybe OK for news because I may need to serve for up to 7 > days. Still, 7 solr queries for a front-end query doesn't sound ideal. > So I'm still in search for a better way ... > > Thanks, > Jack > > On Tue, May 13, 2008 at 9:06 PM, Otis Gospodnetic > wrote: > > The answer is: function queries! :) > > You can easily use function queries with DisMaxRequestHandler. For > > example, > this is what you can add to the dismax config section in solrconfig.xml: > > > > > >recip(rord(addDate),1,1000,1000)^2.5 > > > > > > Assuming you have an addDate field, this will give fresher document some > boost. Look for this on the Wiki, it's all there. > > > > Otis
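In other words, roughly (nothing below exists as Solr configuration; it is just the combination being described, with N as the illustrative percentage weight):

finalScore = (N/100) * recencyScore(pubDate) + ((100-N)/100) * relevanceScore

e.g. with N = 30, a document's final rank would be 70% Lucene relevance and 30% freshness; as noted later in the thread, this would have to be implemented in custom code rather than in schema.xml or solrconfig.xml.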
Re: DocSet to BitSet
: I have a custom query object that extends ConstantScoreQuery. I give it : a key which pulls some documents out of a cache. Thinking to make it : more efficient, I used DocSet, backed by OpenBitSet or OpenHashSet. : However, I need to set the BitSet object for the Lucene filter. Any idea : on how to best do this from DocSet? It seems like this is a problem that : people have encountered before. I've never really encountered this particular problem ... typically any "sets" I'm dealing with can be passed as "filters" directly to the SolrIndexSearcher method -- so I use DocSets. if I *had* to use a ConstantScoreQuery, I'd probably skip DocSet initially and use a BitSet from the get-go (the BitSets could still be cached using a custom cache). but you could also just create a new custom constant-scoring Query class that used a Scorer that referenced your DocSet directly. if you look at the source of ConstantScoreQuery it should be fairly obvious how to make something similar backed by a DocSet instead of a Filter. (in future versions of Lucene this will all be moot, as the Filter API will no longer require a BitSet and can instead return a "DocIdSet" which is essentially just an iterator that Solr's DocSet can implement trivially. ... if you look at the trunk version of ConstantScoreQuery it already does this ... that class may serve as an even better example of implementing a Query that scores based on an o.a.s.search.DocIterator) -Hoss
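For reference, a rough, untested sketch of the DocSet-backed Scorer Hoss describes, against the Lucene 2.3-era Scorer API (the class name is made up, and the surrounding Query/Weight plumbing is omitted; mirror ConstantScoreQuery for that part):

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Similarity;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSet;

// Sketch: a constant-scoring Scorer that walks a Solr DocSet via its DocIterator.
class DocSetScorer extends Scorer {
  private final DocIterator iter;
  private final float constScore;
  private int current = -1;

  DocSetScorer(Similarity similarity, DocSet docs, float constScore) {
    super(similarity);
    this.iter = docs.iterator();
    this.constScore = constScore;
  }

  public boolean next() {
    if (!iter.hasNext()) return false;
    current = iter.nextDoc();
    return true;
  }

  public int doc() { return current; }

  public float score() { return constScore; }

  public boolean skipTo(int target) {
    // DocIterator has no native skip, so just advance until we reach the target
    while (next()) {
      if (current >= target) return true;
    }
    return false;
  }

  public Explanation explain(int doc) {
    return new Explanation(constScore, "constant score for doc in cached DocSet");
  }
}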
Re: Minion, anyone?
On 20-May-08, at 9:06 AM, Binkley, Peter wrote: Has anyone in the Solr community started looking at Sun's Minion (now released under GPL 2.0)? https://minion.dev.java.net/ And (dare I say it) might it be possible to wrap Minion into Solr as an alternative to Lucene? The Search Guy (Stephen Green) has been writing a series of postings comparing Minion and Lucene: http://blogs.sun.com/searchguy/tags/lucene The differences aren't huge (as he says, "In an alternate world where Sun opened up a bit earlier, I would have been working on Lucene from the get-go, rather than starting from scratch."), but Minion has some functionality that might be useful to Solr users in some circumstances. Does it? I read all his posts, and it seems that the benefits are indeed quite trivial (and the cons potentially substantial, though these weren't really discussed). The license is also an issue. -Mike
Re[2]: the time factor
Hello Otis, Could you be a bit more specific or point me to some documentation pages? Can this be done through modifying schema and solrconfig or does it involve some coding? This sounds like a generic problem to me so I'm hoping to find a generic solution. Thanks, Jack Tuesday, May 20, 2008, 9:28:34 AM, you wrote: > Another possible way to get this done is by assigning weights to > field values (e.g. pubDate field should have N% weight and relevancy > score should have 100-N% weight) and using their weighted values > along with Lucene-provided relevancy score to compute a weighted > score. I haven't tried this, it may or may not work, or it may > produce similar results as the function I suggested below. > If you try this, it would be great to hear if this works. > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > - Original Message >> From: Jack <[EMAIL PROTECTED]> >> To: solr-user@lucene.apache.org >> Sent: Tuesday, May 20, 2008 3:59:06 PM >> Subject: Re: the time factor >> >> Hi Otis, >> >> I tried this. It doesn't seem to solve my problem, though. I think >> it's best used to make small adjustment when relevance scores are >> similar. In my case, if I want to rank the most recent documents first >> (because it's about news), I have to use very large boost, which will >> end up getting the docs that are not so relevant to the top. I haven't >> been able to get desired results of showing only recent documents with >> decent relevance scores. >> >> Ideally, I think it can be solved by doing a query for the past 24 >> hours and keeping the docs with best relevance scores, then another >> query for the previous 24 hours ... but this really isn't very >> efficient. Maybe OK for news because I may need to serve for up to 7 >> days. Still, 7 solr queries for a front-end query doesn't sound ideal. >> So I'm still in search for a better way ... >> >> Thanks, >> Jack >> >> On Tue, May 13, 2008 at 9:06 PM, Otis Gospodnetic >> wrote: >> > The answer is: function queries! :) >> > You can easily use function queries with >> DisMaxRequestHandler. For example, >> this is what you can add to the dismax config section in solrconfig.xml: >> > >> > >> >recip(rord(addDate),1,1000,1000)^2.5 >> > >> > >> > Assuming you have an addDate field, this will give fresher document some >> boost. Look for this on the Wiki, it's all there.
query for number of field entries in a multivalued field?
Any way to query how many items are in a multivalued field? (Or use a function query against that count, or anything?)
Re: Re[2]: the time factor
Hi Jack, This was just an idea, nothing like that currently exists, as far as I know. Thus, it's not a config thing (yet), it's something you'd have to develop yourself and figure out how to plug into Solr/Lucene. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: JLIST <[EMAIL PROTECTED]> > To: Otis Gospodnetic > Sent: Tuesday, May 20, 2008 4:32:05 PM > Subject: Re[2]: the time factor > > Hello Otis, > > Could you be a bit more specific or point me to some documentation > pages? Can this be done through modifying schema and solrconfig or > does it involve some coding? This sounds like a generic problem to me > so I'm hoping to find a generic solution. > > Thanks, > Jack > > Tuesday, May 20, 2008, 9:28:34 AM, you wrote: > > > Another possible way to get this done is by assigning weights to > > field values (e.g. pubDate field should have N% weight and relevancy > > score should have 100-N% weight) and using their weighted values > > along with Lucene-provided relevancy score to compute a weighted > > score. I haven't tried this, it may or may not work, or it may > > produce similar results as the function I suggested below. > > > If you try this, it would be great to hear if this works. > > > Otis > > -- > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message > >> From: Jack > >> To: solr-user@lucene.apache.org > >> Sent: Tuesday, May 20, 2008 3:59:06 PM > >> Subject: Re: the time factor > >> > >> Hi Otis, > >> > >> I tried this. It doesn't seem to solve my problem, though. I think > >> it's best used to make small adjustment when relevance scores are > >> similar. In my case, if I want to rank the most recent documents first > >> (because it's about news), I have to use very large boost, which will > >> end up getting the docs that are not so relevant to the top. I haven't > >> been able to get desired results of showing only recent documents with > >> decent relevance scores. > >> > >> Ideally, I think it can be solved by doing a query for the past 24 > >> hours and keeping the docs with best relevance scores, then another > >> query for the previous 24 hours ... but this really isn't very > >> efficient. Maybe OK for news because I may need to serve for up to 7 > >> days. Still, 7 solr queries for a front-end query doesn't sound ideal. > >> So I'm still in search for a better way ... > >> > >> Thanks, > >> Jack > >> > >> On Tue, May 13, 2008 at 9:06 PM, Otis Gospodnetic > >> wrote: > >> > The answer is: function queries! :) > >> > You can easily use function queries with > >> DisMaxRequestHandler. For example, > >> this is what you can add to the dismax config section in solrconfig.xml: > >> > > >> > <str name="bf">recip(rord(addDate),1,1000,1000)^2.5</str> > >> > > >> > Assuming you have an addDate field, this will give fresher document some > >> boost. Look for this on the Wiki, it's all there.
Re: Minion, anyone?
Yes, I have not yet seen any major benefits. I did see a "native NOT" mention (a query that is just a negation), which is nice. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Mike Klaas <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Tuesday, May 20, 2008 3:25:57 PM > Subject: Re: Minion, anyone? > > > On 20-May-08, at 9:06 AM, Binkley, Peter wrote: > > > Has anyone in the Solr community started looking at Sun's Minion (now > > released under GPL 2.0)? > > > > https://minion.dev.java.net/ > > > > And (dare I say it) might it be possible to wrap Minion into Solr as > > an > > alternative to Lucene? The Search Guy (Stephen Green) has been > > writing a > > series of postings comparing Minion and Lucene: > > > > http://blogs.sun.com/searchguy/tags/lucene > > > > The differences aren't huge (as he says, "In an alternate world where > > Sun opened up a bit earlier, I would have been working on Lucene from > > the get-go, rather than starting from scratch."), but Minion has some > > functionality that might be useful to Solr users in some > > circumstances. > > Does it? I read all his posts, and it seems that the benefits are > indeed quite trivial (and the cons potentially substantial, though > these weren't really discussed). The license is also an issue. > > -Mike
Re: Highlighting - field criteria highlights in other fields
On 20-May-08, at 12:31 AM, Tim Mahy wrote: Hi all, we have a situation in which we have documents that have an introduction (text), a body (text) and some metadata fields (mostly integers). When we create a query like this: q=( +(body_nl:( brussel) ) AND ( (+publicationid:("3430" OR "3451")) )&fq= +publishdateAsString:[20070520 TO 20080520]&start=0&rows=11&hl=on&hl.fl=body_nl&hl.snippets=3&hl.fragsize=320&hl.simple.pre=<strong>&hl.simple.post=</strong>&sort=publishdateAsString desc,publicationname desc&fl=id,score,introduction we get nice highlighting from the body_nl field, but Solr also highlights 3430 and 3451 if such a "word" occurs in body_nl, while we were expecting highlighting only for the word "brussel" in body_nl. So it seems that all possible criteria terms are highlighted in any of the given highlighting fields. Is it possible to disable this (with some kind of parameter or something) and only let each hl.fl field be highlighted by the criteria for its own field? see http://wiki.apache.org/solr/HighlightingParameters You can also avoid this by making the publication id restriction clause a filter (fq param). regards, -Mike
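Concretely, that second suggestion turns the original request into something like the following (untested sketch; only the publicationid clause moves into its own fq, everything else stays as before). Highlighting is driven by the main query (q), so terms that only appear in fq clauses never reach the highlighter:

q=(+body_nl:(brussel))
&fq=+publicationid:("3430" OR "3451")
&fq=+publishdateAsString:[20070520 TO 20080520]
&start=0&rows=11
&hl=on&hl.fl=body_nl&hl.snippets=3&hl.fragsize=320
&hl.simple.pre=<strong>&hl.simple.post=</strong>
&sort=publishdateAsString desc,publicationname desc
&fl=id,score,introduction

The HighlightingParameters wiki page above also lists an hl.requireFieldMatch option; if your Solr version supports it, setting it to true makes the highlighter mark only terms that were queried against the field being highlighted.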
Re: the time factor
: similar. In my case, if I want to rank the most recent documents first : (because it's about news), I have to use a very large boost, which will : end up getting the docs that are not so relevant to the top. I haven't it sounds like you only attempted tweaking the boost value, and not tweaking the function params ... you can change the curve so that really new things get a large score increase, but older things get less of an increase. : Ideally, I think it can be solved by doing a query for the past 24 : hours and keeping the docs with the best relevance scores, then another : query for the previous 24 hours ... but this really isn't very : efficient. Maybe OK for news because I may need to serve for up to 7 : days. Still, 7 solr queries for a front-end query doesn't sound ideal. : So I'm still in search of a better way ... if you have discrete chunks of time in which you consider stories "more relevant" (i.e. today, yesterday, this week) you can always use regular range queries as boost queries to bump their scores up... bq=pubDate:[NOW/DAY TO *]^10 pubDate:[NOW/DAY-1DAY TO *]^5 pubDate:[NOW/DAY-7DAY TO *] with large enough boost factors on those queries, you can essentially force them to be the most significant part of the score, so you get everything from today sorted by relevancy, followed by everything from yesterday, etc... although at a certain point, if you know this is what you want, you might just want a date field indexed after rounding down to the nearest day, and sort by it, then sort by score. -Hoss
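A sketch of that last suggestion, using an illustrative field name (pubDay) and assuming the indexing code rounds the publication date down to the day:

<field name="pubDay" type="date" indexed="true" stored="false"/>

and then, on the request, sort=pubDay desc,score desc, so documents from the same day stay grouped together and are ordered by relevance within each day.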
Re: DocSet to BitSet
One of the primary reasons I was doing it this way is that I am sending several filters: one is a big DocSet and the others are BooleanQuery objects (products in stock, etc.). Since the interface for SolrIndexSearcher.getDocListAndSet supports only (Query, DocSet,...) or (Query, List,...), I was going to give it a list of filters. I haven't investigated further to see if patching the Solr code to allow both methods (Query, List, DocSet) would cause any problems. My guess is that it was done this way for a reason. Barring that solution, I will probably use the Query, DocSet method. I have my bit-based filters in a single DocSet. And then I can take my previous list of filter queries and add them onto the main Query object that was created by the front-end. I'm not sure what this will do to cache performance, though, since now each variation in the filter queries will become an entirely different query for the cache. - Original Message From: Chris Hostetter <[EMAIL PROTECTED]> To: Solr Sent: Tuesday, May 20, 2008 12:00:59 PM Subject: Re: DocSet to BitSet : I have a custom query object that extends ConstantScoreQuery. I give it : a key which pulls some documents out of a cache. Thinking to make it : more efficient, I used DocSet, backed by OpenBitSet or OpenHashSet. : However, I need to set the BitSet object for the Lucene filter. Any idea : on how to best do this from DocSet? It seems like this is a problem that : people have encountered before. I've never really encountered this particular problem ... typically any "sets" I'm dealing with can be passed as "filters" directly to the SolrIndexSearcher method -- so I use DocSets. if I *had* to use a ConstantScoreQuery, I'd probably skip DocSet initially and use a BitSet from the get-go (the BitSets could still be cached using a custom cache). but you could also just create a new custom constant-scoring Query class that used a Scorer that referenced your DocSet directly. if you look at the source of ConstantScoreQuery it should be fairly obvious how to make something similar backed by a DocSet instead of a Filter. (in future versions of Lucene this will all be moot, as the Filter API will no longer require a BitSet and can instead return a "DocIdSet" which is essentially just an iterator that Solr's DocSet can implement trivially. ... if you look at the trunk version of ConstantScoreQuery it already does this ... that class may serve as an even better example of implementing a Query that scores based on an o.a.s.search.DocIterator) -Hoss
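One way to express that combination without patching Solr, sketched (untested) against the SolrIndexSearcher API; the variable names are illustrative:

// Turn the plain filter queries (in stock, etc.) into a DocSet via the filter cache,
// intersect with the cached bit-based DocSet, and pass the result as the single DocSet filter.
DocSet queryFilters = searcher.getDocSet(filterQueryList);            // List<Query>
DocSet combined = queryFilters.intersection(cachedBitDocSet);
DocListAndSet results = searcher.getDocListAndSet(mainQuery, combined, sort, offset, rows);

Since getDocSet() should go through the filter cache for each query, the individual filter queries stay cacheable even though the intersected DocSet is recomputed per request.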
expression in an fq parameter fails
We are trying to use the fq parameter to limit our result set. We specify the fq in the solrconfig.xml file within a DisMax handler ... storeAvailableDate:[* TO NOW] storeExpirationDate:[NOW TO *] This works perfectly. The only trouble is that the two date fields may actually be empty, in which case this filters out such records, and we want to include them. I've been unable to figure out how to do this. We've tried: storeAvailableDate:[* TO NOW] OR -storeAvailableDate:[* TO *] storeExpirationDate:[NOW TO *] OR -storeExpirationDate:[* TO *] which is what I'd imagine should work based on http://wiki.apache.org/solr/SolrQuerySyntax and http://lucene.apache.org/java/docs/queryparsersyntax.html but no dice. Is the "OR" even allowed in this place? And also, as a long shot: storeAvailableDate:([* TO NOW] OR -[* TO *]) storeExpirationDate:([NOW TO *] OR -[* TO *]) And, no surprise, that didn't work. But I don't understand why the first thing we tried didn't work. Any help? Thanks, Ezra E.
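The usual workaround (untested against this exact config) is that a purely negative clause cannot stand on its own inside an OR; anchoring it against all documents with *:* normally does the trick:

storeAvailableDate:[* TO NOW] OR (*:* -storeAvailableDate:[* TO *])
storeExpirationDate:[NOW TO *] OR (*:* -storeExpirationDate:[* TO *])

The (*:* -field:[* TO *]) part matches exactly those documents that have no value in the field, so records with an empty date are no longer filtered out.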
Exception on the use of dataimport.jar in Full Import Example
I wanted to learn how to index data that I have in my DB. I followed the instructions on the wiki page for the Data Import Handler (Full Import Example - example-solr-home.jar). I got an exception running it as is (see below). Anyway, I have a couple of questions before addressing the source of the exception. 1) What is the purpose of the dataimport jar file? Is it used to populate the DB? 2) The dataimport.jar file is NOT included in the war file. Why is that? It was recommended that it be added if I already have an existing installation of my war file. 3) At what point in time is the hsqldb DB supposed to be populated? Executing the URL http://localhost:2455/solr/db/dataimport I get a response, so to this degree it is correctly configured. The exception happens when I attempt to execute the full-import command. Exception follows: SEVERE: The query failed 'select * from item' org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select * from item at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:166) .. Caused by: java.sql.SQLException: Table not found in statement [select * from item] at org.hsqldb.jdbc.Util.sqlException(Unknown Source) Julio Castillo Edgenuity Inc.
What are stopwords and protwords ???
Hi, I am a beginner to Solr, and I have successfully indexed my DB in Solr. I want to know what stopwords and protwords are, and how much effect they have on my search results. Thanks in advance. -- Akeel
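Briefly: both are plain word lists referenced from the analyzers in the example schema.xml. Words listed in stopwords.txt (very common words such as "the", "a", "of") are dropped from the token stream at index and query time, so they never match on their own; words listed in protwords.txt are protected from stemming and kept in their original form. In the example schema of this era they are wired in roughly like this (illustrative; check which filters your own field types actually use):

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

If your fields use analyzers that do not reference these files, editing them has no effect on your search results.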