Re: spellcheck.onlyMorePopular

2009-02-15 Thread Shalin Shekhar Mangar
On Sun, Feb 15, 2009 at 8:56 AM, Mark Miller  wrote:

> I think that's the problem with it. People do think of it this way, and it
> ends up being very confusing.
>
> If you don't use onlyMorePopular, and you ask for suggestions for a word
> that happens to be in the index, you get the word back.
>
> So if I ask for corrections to Lucene, and it's in the index, it suggests
> Lucene. This is nice for multi term suggestions, because for "mrk lucene" it
> might suggest "mark lucene".
>
> Now say I want to toggle onlyMorePopular to add frequency into the mix - my
> expectation is that, perhaps now I will get the suggestion "mork lucene" if
> mork has a higher freq than mark.
>
> But I will get maybe "mork luke" instead, because I am guaranteed not to
> get Lucene as a suggestion if onlyMorePopular is on.


onlyMorePopular=true considers tokens with frequency greater than or equal to
the frequency of the original token. So you may still get Lucene as a suggestion.


> Personally I think it all ends up being pretty counterintuitive,
> especially when asking for suggestions for multiple terms. You start getting
> suggestions for alternate spellings no matter what - Lucene could be in the
> index a billion times, it will still suggest something else. But with
> onlyMorePopular off, it will throw back Lucene. You can deal with it if you
> know what's up, but as we have seen from all the questions on this, it's not
> easy to understand why things change like that.


I agree that it is confusing. Do you have any suggestions on ways to fix
this? More/better documentation, changes in behavior, change
'onlyMorePopular' parameter's name, etc.?
-- 
Regards,
Shalin Shekhar Mangar.


Re: spellcheck.onlyMorePopular

2009-02-15 Thread Mark Miller

Shalin Shekhar Mangar wrote:

On Sun, Feb 15, 2009 at 8:56 AM, Mark Miller  wrote:

I think that's the problem with it. People do think of it this way, and it
ends up being very confusing.

If you don't use onlyMorePopular, and you ask for suggestions for a word
that happens to be in the index, you get the word back.

So if I ask for corrections to Lucene, and it's in the index, it suggests
Lucene. This is nice for multi term suggestions, because for "mrk lucene" it
might suggest "mark lucene".

Now say I want to toggle onlyMorePopular to add frequency into the mix - my
expectation is that, perhaps now I will get the suggestion "mork lucene" if
mork has a higher freq than mark.

But I will get maybe "mork luke" instead, because I am guaranteed not to
get Lucene as a suggestion if onlyMorePopular is on.




onlyMorePopular=true considers tokens with frequency greater than or equal to
the frequency of the original token. So you may still get Lucene as a suggestion.

Is that the only difference? When I look at the code (I'm new to this
area of the code, so I certainly could be wrong; wouldn't be the first
time, or even the 100,000th, probably), I see:


   // if the word exists in the real index and we don't care for word
   // frequency, return the word itself
   if (!morePopular && freq > 0) {
     return new String[] { word };
   }

So with onlyMorePopular=false, a query for Lucene will get Lucene back if
it's in the index. But if we make it past that line (onlyMorePopular=true),
later there is:


   // don't suggest a word for itself, that would be silly
   if (sugWord.string.equals(word)) {
     continue;
   }

So you end up getting all of the suggestions *but* Lucene, right?
You had to already know the word was misspelled, and now you're asking for
a better one. With onlyMorePopular=false, you only get a correction
if the word is misspelled.


It seems to me, if you are trying to use the suggested query that's built 
up, you change the behavior beyond just:


onlyMorePopular=true considers tokens with frequency greater than or equal to
the frequency of the original token.
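The combined effect of those two branches can be seen in a small standalone sketch. This is a toy model with made-up frequencies and a plain map in place of the index; it is not the actual Lucene SpellChecker code, only an illustration of its branching:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OnlyMorePopularDemo {

    // Toy model of the two branches quoted above; illustrative only.
    static List<String> suggest(String word, Map<String, Integer> index,
                                boolean onlyMorePopular) {
        Integer f = index.get(word);
        int freq = (f == null) ? 0 : f;
        // Branch 1: the word exists and we don't care about frequency,
        // so the word itself comes straight back
        if (!onlyMorePopular && freq > 0) {
            return Collections.singletonList(word);
        }
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : index.entrySet()) {
            // Branch 2: never suggest a word for itself
            if (e.getKey().equals(word)) {
                continue;
            }
            // onlyMorePopular: require frequency >= the original's
            if (onlyMorePopular && e.getValue() < freq) {
                continue;
            }
            out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = new LinkedHashMap<String, Integer>();
        index.put("lucene", 1000);
        index.put("lucent", 5);

        // onlyMorePopular=false: the indexed word comes back as-is
        System.out.println(suggest("lucene", index, false)); // [lucene]

        // onlyMorePopular=true: "lucene" is excluded even though it is
        // in the index a thousand times, and nothing else is popular
        // enough, so nothing is suggested
        System.out.println(suggest("lucene", index, true));  // []
    }
}
```

The second call is exactly the confusing case from the thread: the correctly spelled, very frequent word can never be its own suggestion once onlyMorePopular is on.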

- Mark





Re: spellcheck.onlyMorePopular

2009-02-15 Thread Shalin Shekhar Mangar
On Sun, Feb 15, 2009 at 10:00 PM, Mark Miller  wrote:

> But if we make it past that line (onlyMorePopular=true), later there is:
>
> // don't suggest a word for itself, that would be silly
> if (sugWord.string.equals(word)) {
>   continue;
> }
>
> So you end up only getting all of the suggestions *but* Lucene, right? You
> had to already know the word was misspelled, and now you're asking for a
> better one. With onlyMorePopular=false, you only get a correction if the
> word is misspelled.


Yes, of course, you are right: one would never get Lucene back if
onlyMorePopular=true.


>
>
> It seems to me, if you are trying to use the suggested query that's built
> up, you change the behavior beyond just:
>
>
> onlyMorePopular=true considers tokens with frequency greater than or equal to
> the frequency of the original token.
>

We definitely need better documentation for this option.

-- 
Regards,
Shalin Shekhar Mangar.


Re: facet count on partial results

2009-02-15 Thread Yonik Seeley
On Sat, Feb 14, 2009 at 6:45 AM, karl wettin  wrote:
> Also, as my threshold is based on the score distance from the first
> result, it sounds like using a result start position greater than
> 0 is something I have to look out for. Or?

Hmmm - this isn't that easy in general as it requires knowledge of the
max score, right?
That essentially requires two passes over the data (two queries)...
one to find the max score, and the other to filter out anything scoring too
far below that max.

An additional query component right after the current query component
might be the easiest way... it could modify the DocSet used in
faceting.
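Before wiring anything into Solr, the filtering step itself could be prototyped in plain Java. Everything here is hypothetical (names, the map-of-scores shape, the 0.8 fraction); a real version would be a SearchComponent operating on the DocSet that the facet component consumes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScoreCutoffSketch {

    // Two-pass idea from the reply above: pass 1 finds the max score,
    // pass 2 keeps only docs whose score is within `fraction` of it.
    static List<Integer> keepNearTop(Map<Integer, Float> docScores,
                                     float fraction) {
        float max = Float.NEGATIVE_INFINITY;
        for (float s : docScores.values()) {          // pass 1: max score
            if (s > max) max = s;
        }
        float cutoff = max * fraction;
        List<Integer> kept = new ArrayList<Integer>();
        for (Map.Entry<Integer, Float> e : docScores.entrySet()) {
            if (e.getValue() >= cutoff) {             // pass 2: filter
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Integer, Float> scores = new LinkedHashMap<Integer, Float>();
        scores.put(1, 2.0f);
        scores.put(2, 1.9f);
        scores.put(3, 0.2f);
        // keep docs scoring within 80% of the top score: docs 1 and 2
        System.out.println(keepNearTop(scores, 0.8f)); // [1, 2]
    }
}
```

In Solr the kept set would replace the DocSet handed to faceting, so facet counts reflect only the documents that survive the threshold.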

-Yonik
http://www.lucidimagination.com


suggestion queries

2009-02-15 Thread Yves Hougardy
Hi,
What's the best way to set up a suggestion box with Solr?
I mean, if I type one letter, it would request all the "categories"
beginning with that letter, and so on as the user adds letters.

thanks



-- 
Yves Hougardy
http://www.clever-age.com
Clever Age - conseil en architecture technique
Tél: +33 1 53 34 66 10


Word Locations & Search Components

2009-02-15 Thread Johnny X

Hi there,


I was told before that I'd need to create a custom search component to do
what I want to do, but I'm thinking it might actually be a custom analyzer.

Basically, I'm indexing e-mail in XML in Solr and searching the 'content'
field which is parsed as 'text'.

I want to ignore certain elements of the e-mail (i.e. corporate banners),
but also identify the actual content of those e-mails including corporate
information.

To identify the banners I need something a little more developed than a
stop-word list: I need to evaluate the frequency of certain words around
words like 'privileged' and 'corporate', within a window of roughly 100
words, to determine whether they're banners, and then exclude them from
being indexed.

I need to do the opposite during the same time to identify, in a similar
manner, which e-mails include corporate information in their actual content.

I suppose if I'm doing this, I don't want the processed text to be what's
returned in a search, because then presumably it won't be the full e-mail.
So do I need to store some kind of copy field that keeps the full e-mail
and is returned instead?
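If it helps, the stored-copy idea in the paragraph above might look roughly like this in schema.xml. The field names and types here are hypothetical, a sketch only: one field gets the custom banner-stripping analysis and is searched, while a verbatim copy is kept stored for display.

```xml
<!-- Hypothetical field names; sketch only.
     "content" is analyzed (banner stripping etc.) and searched;
     "content_raw" keeps the full e-mail verbatim for display. -->
<field name="content" type="text" indexed="true" stored="false"/>
<field name="content_raw" type="string" indexed="false" stored="true"/>
<copyField source="content" dest="content_raw"/>
```

copyField copies the raw input before analysis, so the stored copy is the full original e-mail even though the indexed field has been transformed.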

Can what I'm suggesting be done and can anyone direct me to a guide?


On another note, is there an easy way to destroy an index...any custom code?


Thanks for any help!



-- 
View this message in context: 
http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html
Sent from the Solr - User mailing list archive at Nabble.com.



debug distributed performance

2009-02-15 Thread Ian Connor
Are there any debug settings to see where the time is taken during a
distributed search?

I suspect some of the time is spent in network overhead between the shards
consolidating the results but I don't have a good way to pin this down.
Sometimes, the results come back very quickly - so I know it is not all
network related and want to know if there is a way to see this from within a
distributed request.

Turning on...
debugQuery=on

does not seem to report distributed performance statistics.

When I query all shards together, I get:
http://host:8880/solr/select/?shards=host1:8881/solr,host2:8882/solr,host3:8883/solr,host4:8884/solr,host5:8885/solr,host6:8886/solr,host7:8887/solr&q=cancer
428 then 287

If I isolate each shard like this:
http://host:8880/solr/select/?shards=host1:8881/solr&q=cancer
195,146,844,230,51,48,43

Then going directly gets this:
http://host1:8881/solr/select/?q=cancer
0,1,0,1,1,1,1

I can see that taking a few sample responses is not conclusive enough to say
one shard is slower or faster. However, the query time directly is orders of
magnitude faster than going through the shards parameter.

My only guess is that this is network-based: the cost of passing the results
around in order to merge them.

Is there any debug option, or another way, to confirm and investigate this further?
-- 
Regards,

Ian Connor


Release of solr 1.4 & autosuggest

2009-02-15 Thread Pooja Verlani
Hi All,
I am interested in the TermsComponent addition in Solr 1.4
(http://wiki.apache.org/solr/TermsComponent). When should we expect Solr 1.4
to be available for use? Also, can the TermsComponent be made available as a
plugin for Solr 1.3?

Kindly reply if you have any idea.

Regards,
Pooja


Multilanguage

2009-02-15 Thread revathy arun
Hi,
I have a scenario where I need to convert PDF content to text and then
index it at run time. I do not know in advance what language the PDF will
be in. In this case, what is the best solution with respect to the content
field type in the schema where the text content would be indexed?

That is, can I use the default tokenizer for all languages? Since I would
not know the language, and hence would not be able to stem the tokens, how
would this impact search? Is there any other solution for the same?

Rgds


Outofmemory error for large files

2009-02-15 Thread Gargate, Siddharth

I am trying to index an approximately 150 MB text file with a 1024 MB max
heap, but I get an OutOfMemoryError in the SolrJ code.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
    at java.lang.StringBuffer.append(StringBuffer.java:320)
    at java.io.StringWriter.write(StringWriter.java:60)
    at org.apache.solr.common.util.XML.escape(XML.java:206)
    at org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
    at org.apache.solr.common.util.XML.writeXML(XML.java:149)
    at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:115)
    at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:200)
    at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:178)
    at org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(UpdateRequest.java:173)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:136)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:243)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)


I modified the UpdateRequest class to initialize the StringWriter object
in UpdateRequest.getXML with an initial size, and cleared the
SolrInputDocument that was holding the reference to the file text. Then I
get an OOM as below:


Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.lang.StringCoding.safeTrim(StringCoding.java:64)
    at java.lang.StringCoding.access$300(StringCoding.java:34)
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:251)
    at java.lang.StringCoding.encode(StringCoding.java:272)
    at java.lang.String.getBytes(String.java:947)
    at org.apache.solr.common.util.ContentStreamBase$StringStream.getStream(ContentStreamBase.java:142)
    at org.apache.solr.common.util.ContentStreamBase$StringStream.getReader(ContentStreamBase.java:154)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:61)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)


After I increase the heap size up to 1250 MB, I get an OOM as:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:216)
    at java.lang.StringBuffer.toString(StringBuffer.java:585)
    at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:403)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:276)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)


So it looks like I won't be able to get out of these OOMs.
Is there any way to avoid them? One option I see is to break the
file into chunks, but then I won't be able to search for multiple
words if they are distributed across different documents.
Also, can somebody tell me the minimum heap size required relative to
file size so that a document gets indexed successfully?
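The stack traces above suggest a rough, back-of-the-envelope answer to the heap-size question: several copies of the text are alive at once during a single update. The multipliers below are estimates read off the traces, not measured numbers:

```java
public class HeapEstimate {

    // Rough simultaneous copies during one SolrJ update (estimates):
    //   ~2x: the content as a Java String (UTF-16, ~2 bytes per char)
    //   ~2x: the escaped XML built via XML.escape into a StringWriter
    //   ~1x: the byte[] from String.getBytes() in StringStream
    //   ~2x: character buffers while Woodstox/XMLLoader re-parse the XML
    static long transientMb(long fileMb) {
        return fileMb * 2 + fileMb * 2 + fileMb + fileMb * 2;
    }

    public static void main(String[] args) {
        // ~1050 MB of transient heap for a 150 MB file
        System.out.println("~" + transientMb(150)
                + " MB transient for a 150 MB file");
    }
}
```

By this estimate a heap of roughly seven times the file size is needed just to push the document through the in-memory XML path, which would be consistent with the failures seen at both 1024 and 1250 MB; streaming the content instead of buffering it as one String would be the structural fix.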

Thanks,
Siddharth