Re: spellcheck.onlyMorePopular
On Sun, Feb 15, 2009 at 8:56 AM, Mark Miller wrote:

> I think that's the problem with it. People do think of it this way, and it
> ends up being very confusing.
>
> If you don't use onlyMorePopular, and you ask for suggestions for a word
> that happens to be in the index, you get the word back.
>
> So if I ask for corrections to Lucene, and it's in the index, it suggests
> Lucene. This is nice for multi-term suggestions, because for "mrk lucene"
> it might suggest "mark lucene".
>
> Now say I want to toggle onlyMorePopular to add frequency into the mix -
> my expectation is that perhaps now I will get the suggestion "mork lucene"
> if mork has a higher freq than mark.
>
> But I will get maybe "mork luke" instead, because I am guaranteed not to
> get Lucene as a suggestion if onlyMorePopular is on.

onlyMorePopular=true considers tokens of frequency greater than or equal to
the frequency of the original token. So you may still get Lucene as a
suggestion.

> Personally I think it all ends up being pretty counter-intuitive,
> especially when asking for suggestions for multiple terms. You start
> getting suggestions for alternate spellings no matter what - Lucene could
> be in the index a billion times, it will still suggest something else. But
> with onlyMorePopular off, it will throw back Lucene. You can deal with it
> if you know what's up, but as we have seen from all the questions on this,
> it's not easy to understand why things change like that.

I agree that it is confusing. Do you have any suggestions on ways to fix
this? More/better documentation, changes in behavior, changing the
'onlyMorePopular' parameter's name, etc.?

-- 
Regards,
Shalin Shekhar Mangar.
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote:

> On Sun, Feb 15, 2009 at 8:56 AM, Mark Miller wrote:
>> [...] But I will get maybe "mork luke" instead, because I am guaranteed
>> not to get Lucene as a suggestion if onlyMorePopular is on.
>
> onlyMorePopular=true considers tokens of frequency greater than or equal
> to the frequency of the original token. So you may still get Lucene as a
> suggestion.

Is that the only difference? When I look at the code (I'm new to this area
of the code, so I certainly could be wrong - wouldn't be the first time, or
even the 100,000th, probably), I see:

  // if the word exists in the real index and we don't care for word
  // frequency, return the word itself
  if (!morePopular && freq > 0) {
    return new String[] { word };
  }

So with onlyMorePopular=false, you will get Lucene back if it's in the
index. But if we make it past that line (onlyMorePopular=true), later there
is:

  // don't suggest a word for itself, that would be silly
  if (sugWord.string.equals(word)) {
    continue;
  }

So you end up getting all of the suggestions *but* Lucene, right? You had to
already know the word was misspelled, and now you're asking for a better
one. With onlyMorePopular=false, you only get a correction if the word is
misspelled.
It seems to me that if you are trying to use the suggested query that's
built up, you change the behavior beyond just:

> onlyMorePopular=true considers tokens of frequency greater than or equal
> to the frequency of the original token.

- Mark
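The two code paths being discussed can be condensed into a small sketch. This is not the actual Lucene SpellChecker source - the method name, signatures, and frequency map are hypothetical - but it mirrors the two branches quoted above: with onlyMorePopular=false an in-index word is returned as-is, while with onlyMorePopular=true the original word is always excluded, even when its frequency qualifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SuggestSketch {
    // Simplified model of the behavior described in the thread.
    // freqs maps each token to its document frequency in the index.
    static List<String> suggest(String word, Map<String, Integer> freqs,
                                List<String> candidates, boolean onlyMorePopular) {
        int freq = freqs.getOrDefault(word, 0);
        // if the word exists in the real index and we don't care about
        // frequency, return the word itself (first quoted branch)
        if (!onlyMorePopular && freq > 0) {
            return List.of(word);
        }
        List<String> out = new ArrayList<>();
        for (String cand : candidates) {
            // don't suggest a word for itself (second quoted branch)
            if (cand.equals(word)) continue;
            // onlyMorePopular keeps tokens with freq >= the original's freq
            if (onlyMorePopular && freqs.getOrDefault(cand, 0) < freq) continue;
            out.add(cand);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = Map.of("lucene", 1000, "luke", 2000, "lucerne", 1500);
        List<String> cands = List.of("lucene", "luke", "lucerne");
        System.out.println(suggest("lucene", freqs, cands, false)); // [lucene]
        System.out.println(suggest("lucene", freqs, cands, true));  // [luke, lucerne]
    }
}
```

Note how "lucene" is its own suggestion in the first call but can never appear in the second, which is exactly the surprise users keep reporting.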
Re: spellcheck.onlyMorePopular
On Sun, Feb 15, 2009 at 10:00 PM, Mark Miller wrote:

> But if we make it past that line (onlyMorePopular=true), later there is:
>
>   // don't suggest a word for itself, that would be silly
>   if (sugWord.string.equals(word)) {
>     continue;
>   }
>
> So you end up getting all of the suggestions *but* Lucene, right? You had
> to already know the word was misspelled, and now you're asking for a
> better one. With onlyMorePopular=false, you only get a correction if the
> word is misspelled.

Yes, of course, you are right - one would never get Lucene back if
onlyMorePopular=true.

> It seems to me that if you are trying to use the suggested query that's
> built up, you change the behavior beyond just:
>
> onlyMorePopular=true considers tokens of frequency greater than or equal
> to the frequency of the original token.

We definitely need better documentation for this option.

-- 
Regards,
Shalin Shekhar Mangar.
Re: facet count on partial results
On Sat, Feb 14, 2009 at 6:45 AM, karl wettin wrote:

> Also, as my threshold is based on the distance in score from the first
> result, it sounds like using a result start position greater than 0 is
> something I have to look out for. Or?

Hmmm - this isn't that easy in general, as it requires knowledge of the max
score, right? That essentially requires two passes over the data (two
queries)... one to find the max score, and the other to filter out anything
that falls too far below that max score.

An additional query component right after the current query component might
be the easiest way... it could modify the DocSet used in faceting.

-Yonik
http://www.lucidimagination.com
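The two-pass idea can be sketched in a few lines. This is a hedged illustration only - the class, the ratio-based threshold, and the in-memory list stand in for what would really be two queries (or a custom query component rewriting the DocSet): pass one finds the max score, pass two drops everything too far below it.

```java
import java.util.ArrayList;
import java.util.List;

public class ScoreThresholdFilter {
    // Hypothetical stand-in for a scored hit from the first query.
    record ScoredDoc(int id, float score) {}

    // Keep docs whose score is at least maxScore * minRatio
    // (e.g. minRatio = 0.5f keeps docs within half of the top score).
    static List<ScoredDoc> filterByRelativeScore(List<ScoredDoc> hits, float minRatio) {
        float max = 0f;
        for (ScoredDoc d : hits) {          // pass 1: find the max score
            max = Math.max(max, d.score());
        }
        List<ScoredDoc> kept = new ArrayList<>();
        for (ScoredDoc d : hits) {          // pass 2: filter relative to max
            if (d.score() >= max * minRatio) kept.add(d);
        }
        return kept;
    }
}
```

In real Solr terms, the second pass is what a custom component could apply to the DocSet before faceting runs; the sketch only demonstrates why a single pass is not enough.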
suggestion queries
Hi,

What's the best way to set up a suggestion box with Solr? I mean, if I type
one letter, it would request all the "categories" beginning with that
letter, and so on as the user adds letters.

thanks

-- 
Yves Hougardy
http://www.clever-age.com
Clever Age - technical architecture consulting
Tél: +33 1 53 34 66 10
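One approach that is often suggested for this (not stated in this thread, so treat it as an assumption) is to facet on the category field with facet.prefix set to whatever the user has typed so far; the host, port, and field name below are placeholders:

```
http://localhost:8983/solr/select?q=*:*&rows=0&facet=true
    &facet.field=category&facet.prefix=a&facet.limit=10
```

Each keystroke re-issues the request with a longer facet.prefix value, and the returned facet counts double as a popularity ranking for the suggestions.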
Word Locations & Search Components
Hi there,

I was told before that I'd need to create a custom search component to do
what I want to do, but I'm thinking it might actually be a custom analyzer.

Basically, I'm indexing e-mail in XML in Solr and searching the 'content'
field, which is parsed as 'text'. I want to ignore certain elements of the
e-mail (i.e. corporate banners), but also identify the actual content of
those e-mails, including corporate information. To identify the banners I
need something a little more developed than a stop word list: I need to
evaluate the frequency of certain words around words like 'privileged' and
'corporate' within a word window of about 100-ish words to determine whether
they're banners, and then remove them from being indexed. I need to do the
opposite at the same time to identify, in a similar manner, which e-mails
include corporate information in their actual content.

I suppose if I'm doing this, I don't want what's processed to be what's
returned in a search, because then presumably it won't be the full e-mail -
so do I need to store some kind of copy field that keeps the full e-mail and
is returned instead? Can what I'm suggesting be done, and can anyone direct
me to a guide?

On another note, is there an easy way to destroy an index... any custom
code?

Thanks for any help!

-- 
View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html
Sent from the Solr - User mailing list archive at Nabble.com.
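The windowed-frequency idea described above can be prototyped independently of Solr. This is only a sketch under assumptions - the class name, marker words, window size, and threshold are all invented for illustration: around each occurrence of a trigger word like "privileged", count how many other banner-typical words fall inside an N-word window, and flag the span as a banner if the count crosses a threshold.

```java
import java.util.List;
import java.util.Set;

public class BannerDetector {
    // Returns true if, within `window` tokens on either side of position
    // `pos` (the trigger word), at least `minHits` marker words occur.
    static boolean looksLikeBanner(List<String> tokens, int pos,
                                   Set<String> markers, int window, int minHits) {
        int hits = 0;
        int from = Math.max(0, pos - window);
        int to = Math.min(tokens.size(), pos + window + 1);
        for (int i = from; i < to; i++) {
            if (i != pos && markers.contains(tokens.get(i).toLowerCase())) {
                hits++;
            }
        }
        return hits >= minHits;
    }
}
```

If this heuristic works on your corpus, the natural place to wire it in would be a custom TokenFilter (dropping flagged spans before indexing) combined with a stored-but-differently-analyzed copy of the field, as the mail suggests - but that integration is beyond this sketch.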
debug distributed performance
Is there any debug setting to see where the time is taken during a
distributed search? I suspect some of the time is spent in network overhead
between the shards while consolidating the results, but I don't have a good
way to pin this down. Sometimes the results come back very quickly - so I
know it is not all network related - and I want to know if there is a way to
see this from within a distributed request. Turning on debugQuery=on does
not seem to report distributed performance statistics.

When I query all shards together, I get:

http://host:8880/solr/select/?shards=host1:8881/solr,host2:8882/solr,host3:8883/solr,host4:8884/solr,host5:8885/solr,host6:8886/solr,host7:8887/solr&q=cancer
428, then 287

If I isolate each shard like this:

http://host:8880/solr/select/?shards=host1:8881/solr&q=cancer
195, 146, 844, 230, 51, 48, 43

Then going directly gets this:

http://host1:8881/solr/select/?q=cancer
0, 1, 0, 1, 1, 1, 1

I can see that taking a few sample responses is not conclusive enough to say
one shard is slower or faster. However, the query time directly is orders of
magnitude faster than through shards. My only guess is that this is network
based and involved in passing the results around in order to reduce them. Is
there any debug setting or other way to confirm and investigate this
further?

-- 
Regards,
Ian Connor
Release of solr 1.4 & autosuggest
Hi All,

I am interested in the TermsComponent addition in Solr 1.4
(http://wiki.apache.org/solr/TermsComponent). When should we expect Solr 1.4
to be available for use? Also, can this TermsComponent be made available as
a plugin for Solr 1.3? Kindly reply if you have any idea.

Regards,
Pooja
Multilanguage
Hi,

I have a scenario where I need to convert PDF content to text and then index
it at run time. I do not know in advance what language the PDF will be in.
In this case, what is the best solution with respect to the content field
type in the schema where the text content would be indexed? That is, can I
use the default tokenizer for all languages? Since I would not know the
language, I would not be able to stem the tokens - how would this impact
search? Is there any other solution for this?

Rgds
Outofmemory error for large files
I am trying to index an around 150 MB text file with a 1024 MB max heap, but
I get an OutOfMemoryError in the SolrJ code:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
    at java.lang.StringBuffer.append(StringBuffer.java:320)
    at java.io.StringWriter.write(StringWriter.java:60)
    at org.apache.solr.common.util.XML.escape(XML.java:206)
    at org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
    at org.apache.solr.common.util.XML.writeXML(XML.java:149)
    at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:115)
    at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:200)
    at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:178)
    at org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(UpdateRequest.java:173)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:136)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:243)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)

I modified the UpdateRequest class to initialize the StringWriter object in
UpdateRequest.getXML with an initial size, and cleared the SolrInputDocument
that was holding the reference to the file text.
Then I get an OOM as below:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.lang.StringCoding.safeTrim(StringCoding.java:64)
    at java.lang.StringCoding.access$300(StringCoding.java:34)
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:251)
    at java.lang.StringCoding.encode(StringCoding.java:272)
    at java.lang.String.getBytes(String.java:947)
    at org.apache.solr.common.util.ContentStreamBase$StringStream.getStream(ContentStreamBase.java:142)
    at org.apache.solr.common.util.ContentStreamBase$StringStream.getReader(ContentStreamBase.java:154)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:61)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)

After I increase the heap size up to 1250 MB, I get an OOM as:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:216)
    at java.lang.StringBuffer.toString(StringBuffer.java:585)
    at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:403)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:276)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)

So it looks like I won't be able to get out of these OOMs. Is there any way
to avoid them? One option I see is to break the file into chunks, but with
this, I won't be able to search with multiple words if they are distributed
across different documents. Also, can somebody tell me the minimum heap size
required relative to file size so that a document gets indexed successfully?

Thanks,
Siddharth
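On the chunking concern: one mitigation sometimes used (an assumption on my part, not something established in this thread) is to overlap consecutive chunks, so that any phrase shorter than the overlap still co-occurs inside at least one document. A minimal sketch - chunk and overlap sizes here are arbitrary example values, measured in words:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Chunker {
    // Splits `words` into chunks of `chunkSize` words, each chunk sharing
    // its last `overlap` words with the start of the next chunk.
    static List<String> chunk(String[] words, int chunkSize, int overlap) {
        int step = chunkSize - overlap;
        if (step <= 0) {
            throw new IllegalArgumentException("overlap must be smaller than chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(words.length, start + chunkSize);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) break;
        }
        return chunks;
    }
}
```

Each chunk would then be sent as its own (much smaller) SolrInputDocument, keeping peak memory bounded by the chunk size rather than the full 150 MB file; multi-word queries spanning a boundary are only guaranteed to match if the phrase fits within the overlap.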