Re: [VOTE] Community Logo Preferences

2008-11-25 Thread Marcus Stratmann

https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png
https://issues.apache.org/jira/secure/attachment/12393936/logo_remake.jpg
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg


Differences in output of spell checkers

2009-02-04 Thread Marcus Stratmann

Hello,

I'm trying to learn how to use the spell checkers of solr (1.3). I found 
out that FileBasedSpellChecker and IndexBasedSpellChecker produce 
different outputs.


IndexBasedSpellChecker says

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="gane">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <int name="frequency">85</int>
          <str name="word">game</str>
        </lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
  </lst>
</lst>

whereas FileBasedSpellChecker returns

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="gane">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion">
        <str>game</str>
      </arr>
    </lst>
  </lst>
</lst>

The differences are the use of <lst> versus <str> elements to mark up the
suggestions, the missing frequencies, and the missing "correctlySpelled" in
the FileBasedSpellChecker output. Is that a bug or a feature? Or are there
simply no universal rules for the format of the output? The differences make
parsing more difficult if you use both IndexBasedSpellChecker and
FileBasedSpellChecker.


Thanks,
Marcus


Re: Differences in output of spell checkers

2009-02-05 Thread Marcus Stratmann

Hello,

Are you sending in the same query to both?  Frequency and word only get 
printed when extendedResults == true.  correctlySpelled only gets 
printed when there is Index frequency information.  For the 
FileBasedSpellChecker, there is no Frequency information, so it isn't 
returned.


Yes, I am using this request in both cases:
spellcheck?spellcheck=true&spellcheck.dictionary=title&spellcheck.q=gane&q=gane&spellcheck.extendedResults=true

Concerning FileBasedSpellChecker I wasn't able to find any online
documentation, is there any? To start with I was using trial and error.
I'm still wondering which format the input file needs to have.
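
What I am trying at the moment, purely by guesswork, is a plain text file
with one term per line plus a component configuration roughly like this
(the file name and index directory are just placeholders of mine):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <!-- guessed file-based dictionary; spellings.txt holds one term per line -->
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
</searchComponent>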


You write that there is no frequency information for
FileBasedSpellChecker. Does that mean that every word in the index has
the same "weight" (besides the distance from the word being spell
checked)? Then how does spell checking work? Is every word in the index
that is close enough (by distance) to the original considered, and the
one with the smallest distance returned? What effect does
spellcheck.onlyMorePopular have when there are no frequencies?


Sorry if this is answered somewhere in the docs, a link would be enough 
for me in this case.


The logic for constructing this is all handled in the 
SpellCheckComponent.toNamedList() method and is completely separated 
from the individual SpellChecker implementations.


If I understand you correctly, this means that the output is just an 
"image" of the used data structures? From the developer's view this is 
very natural, but from the user's view it is annoying to have different 
output depending on the handler used. Anyway, this is actually no big 
problem for me, I was just wondering why my parser (used for 
IndexBasedSpellChecker) didn't work for FileBasedSpellChecker.


Thanks,
Marcus


spellcheck.onlyMorePopular

2009-02-12 Thread Marcus Stratmann

Hello,

I have another question concerning the spell checking mechanism.
Setting onlyMorePopular=true and using the parameters

spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=true

I get the result

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="gran">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <int name="origFreq">13</int>
      <arr name="suggestion">
        <lst>
          <int name="frequency">32</int>
          <str name="word">grand</str>
        </lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">true</bool>
  </lst>
</lst>

which is okay.
But when I turn off onlyMorePopular

spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=false

the output is

<lst name="spellcheck">
  <lst name="suggestions"/>
</lst>

I was expecting to get *more* results when I turn off onlyMorePopular,
i.e. everything contained in the onlyMorePopular result ("grand") plus
some more. Instead I get no spell check results at all. Why is that?


Thanks,
Marcus


Re: Differences in output of spell checkers

2009-02-12 Thread Marcus Stratmann

Hi Grant,

thanks for your help.

I have just one more question:

BTW, one workaround is to simply create an index from your file and then 
use the IndexBasedSpellChecker.  Each line equals one document.  You 
could even assign weights that way.

In the solrconfig.xml there is a line like
<str name="field">...</str>
Can I use a field from a different index for that (and how)? Or does the
workaround mean that I have to make two queries, one to get the search
results and one to get spell checking results?
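
For reference, the relevant part of my solrconfig.xml currently looks
roughly like this (the dictionary name and index directory are from my
own setup, so treat them as placeholders):

<lst name="spellchecker">
  <str name="name">title</str>
  <str name="classname">solr.IndexBasedSpellChecker</str>
  <!-- the field of the main index that the dictionary is built from -->
  <str name="field">title</str>
  <str name="spellcheckIndexDir">./spellcheckerTitle</str>
  <str name="buildOnCommit">true</str>
</lst>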


Thanks,
Marcus


Re: spellcheck.onlyMorePopular

2009-02-13 Thread Marcus Stratmann

Grant Ingersoll wrote:
I believe the reason is b/c when onlyMP is false, if the word itself is 
already in the index, it short circuits out.  When onlyMP is true, it 
checks to see if there are more frequently occurring variations.
This would mean that onlyMorePopular=false isn't useful at all. If the
word is in the index it would not find less frequent words, and if it is
not in the index onlyMorePopular=false isn't useful either since there
are no less popular words.

So if you are right this is a bug, isn't it?

Thanks,
Marcus


Re: spellcheck.onlyMorePopular

2009-02-13 Thread Marcus Stratmann

Shalin Shekhar Mangar wrote:

The end goal is to give spelling suggestions. Even if it gave less
frequently occurring spelling suggestions, what would you do with it?

To give you an example:
We have an index for computer games. One title is "gran turismo". The 
word "gran" is less frequent in the index than "grand". So if someone 
searches for "grand turismo" there will be no suggestion "gran".


And to come back to my last question: There seems to be no case in which 
"onlyMorePopular=false" makes sense (provided Grant's assumption is 
correct). Do you see one?


Thanks,
Marcus


Re: spellcheck.onlyMorePopular

2009-02-13 Thread Marcus Stratmann

Shalin Shekhar Mangar wrote:

And to come back to my last question: There seems to be no case in which
"onlyMorePopular=false" makes sense (provided Grant's assumption is
correct). Do you see one?


Here's a use-case -- you provide a mis-spelled word and you want the closest
suggestion by edit distance (frequency does not matter).


Hm, when I try searching for "grand" using onlyMorePopular=false I do 
not get any results. Same when trying "gran". It seems that there will 
be no results at all when using onlyMorePopular=false. Without 
onlyMorePopular there are suggestions for both terms, so there are 
suggestions close enough to the original word(s). Have you tested your 
example case?


Anyway, if you look at it from the user's point of view: the wiki says
"spellcheck.onlyMorePopular -- Only return suggestions that result in
more hits for the query than the existing query." This implies that with
onlyMorePopular=false I would also get results with fewer hits. So when
I'm checking "grand" I would expect to get the suggestion "gran", which
is less frequent in the index. But it seems this is not the case.


But even if just the documentation is wrong or unclear:
1) I could not find a case in which onlyMorePopular=false works at all.
2) It would be nice if one could get suggestions with a lower frequency
than the checked word (which is, to me, what onlyMorePopular=false implies).


Thanks,
Marcus



Re: spellcheck.onlyMorePopular

2009-02-13 Thread Marcus Stratmann

Shalin Shekhar Mangar wrote:

If onlyMorePopular=true, then the algorithm finds tokens which have greater
frequency than the searched term. Among these terms, the one which is
closest (by edit distance) is returned.


Okay, this is a bit weird, but I think I got it now. Let me try to
explain it using my example. When I search for "gran" (frequency 10) I
get the suggestion "grand" (frequency 17) when using
onlyMorePopular=true. When I use onlyMorePopular=false there are no
suggestions at all. This is because there are some (rare) terms which
are closer to "gran" than "grand", but none of them are considered
because their frequency is below 10. Is that correct?
But then, why isn't "grand" promoted to first place and returned as a
valid suggestion?




I think I now understand the source of the confusion. onlyMorePopular=true
is a special behavior which uses *only* those tokens which have higher
frequency than the searched term. onlyMorePopular=false just switches off
this special behavior. It does *not* limit suggestions to tokens which have
lesser frequency than the searched term. In fact, onlyMorePopular=false does
not use frequency of tokens at all. We should document this clearly to avoid
such confusions in the future.


I'm still missing the two parameters accuracy and spellcheck.count. Let 
me try to explain how I (now) think the algorithm works:


1) Take all terms from the index as a basic set.
2) If onlyMorePopular=true, remove all terms from the basic set which
have a frequency below the frequency of the search term.
3) Sort the basic set by distance to the search term and keep the
spellcheck.count terms with the smallest distance which are
"within accuracy".
4) In the case onlyMorePopular=false, remove terms which have a lower
frequency than the search term.
5) Return the remaining terms as suggestions.

Point 3 would explain why I do not get any suggestions for "gran" with
onlyMorePopular=false. Nevertheless I think this is a bug, since point 3
should take the frequency into account as well and promote suggestions
with a high enough frequency when suggestions with low frequency are dropped.


But this is just my assumption about how the algorithm works, which would
explain why there are no suggestions using onlyMorePopular=false. Maybe I
am wrong, but somewhere in the process "grand" is dropped from the result set.




2) It would be nice if one could get suggestion with lower frequency than
the checked word (which is, to me, what onlyMorePopular=false implies).


We could enhance spell checker to do that. But can you please explain your
use-case for limiting suggestions to tokens which have lesser frequency? The
goal of spell checker is to give suggestions of wrongly spelled words. It
was neither designed nor intended to give any other sort of query
suggestions.


An example would be the mentioned "grand turismo" (note that in the
example above I was searching for "gran", whereas now I am searching for
"grand"). "gran" would not be returned as a suggestion because "grand"
is more frequent in the index. And yes, I know, returning a suggestion
in this case will only be useful if there is more than one word in the
search term. You proposed to use KeywordTokenizer for this case, but a) I
(again) was not able to find any documentation for this, and b) we are
working on a different solution for this case using stored search
queries. If you are interested, it works like this: for every word in
the query, get some spell checking suggestions. Combine these and find
out if any of these combinations has been searched for (successfully)
before. Propose the one with the highest (search) frequency. It looks
promising so far, but the "gran turismo" example won't work, since there
are too many "grand"s in the index.


Thanks,
Marcus


Re: spellcheck.onlyMorePopular

2009-02-16 Thread Marcus Stratmann

Shalin Shekhar Mangar wrote:

The implementation is a bit more complicated.

1. Read all tokens from the specified field in the solr index.
2. Create n-grams of the terms read in #1 and index them into a separate
Lucene index (spellcheck index).
3. When asked for suggestions, create n-grams of the query terms, search the
spellcheck index and collects the top (by lucene score) 10*spellcheck.count
results.
4. If onlyMorePopular=true, determine frequency of each result in the solr
index and remove terms which have lesser frequency.
5. Compute the edit distance between the result and the query token.
6. Return the top spellcheck.count results (sorted by edit distance
descending) which are greater than specified accuracy.


Thanks, I think this makes things clear(er) now. I do agree that the 
documentation needs improvement on this point, as you said later in this 
thread. :)




Your primary use-case is not spellcheck at all but this might work with some
hacking. Fuzzy queries may be a better solution, as Walter said. Storing all
successful search queries may be hard to scale.


This is certainly true.

The drawback of fuzzy searching is that you get back exact and fuzzy 
hits together in one result set (correct me if I'm wrong). One could 
filter out the exact/fuzzy hits but this would make paging impossible.
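
(For example, as far as I understand the syntax, a fuzzy query like
title:grand~0.7 would return the exact match "grand" mixed in with the
fuzzy matches, with no way to page through only one kind of hit.)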


The approach using KeywordTokenizer that you suggested before seems
more promising to me. Unfortunately there seems to be no documentation
for this (at least in conjunction with spell checking). If I understand
it correctly, the tokenizer must be applied to the field in the search
index (not the spell checking index). Is that correct?
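
My guess is something along these lines in schema.xml, with a copyField
feeding the spellcheck source field (the field and type names are just
placeholders of mine):

<fieldType name="textSpellPhrase" class="solr.TextField">
  <analyzer>
    <!-- keep the whole title as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_spell" type="textSpellPhrase" indexed="true" stored="false"/>
<copyField source="title" dest="title_spell"/>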


Thanks,
Marcus


Re: Solr 1.1 HTTP server stops responding

2007-07-30 Thread Marcus Stratmann

Hi David,


We're running Solr 1.1 and we're seeing intermittent cases where
Solr stops responding to HTTP requests.  It seems like the listener
on port 8983 just doesn't respond.


When we started using solr we encountered the same problem. We are 
currently running solr 1.0 (!) with tomcat 5.5 on two servers. Our index 
has 16 million documents and is updated about 10 times per day 
(depending on the incoming data).

We found out three factors which may be responsible for the problems:

1) Memory. Our two servers running solr have 8 GB each and we have set
the option -Xmx2560M for tomcat. We got rid of most of the problems by
increasing the memory. We had no success trying to get solr running with
just 4GB in the machine.
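
(For the record, we simply export the heap setting in Tomcat's
environment before starting it, with something like
export JAVA_OPTS="-Xmx2560M"
in the startup environment.)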


2) Disk activity. This is strange. We found out that using rsync on the
machine sometimes makes solr stop responding. We could avoid this by
setting an upper limit on the bandwidth rsync uses. Just recently we
found out that even copying big files on the machine stops solr. So it
seems that high disk activity might cause problems for solr.
(We have a MySQL database running on the same servers. Normal operation
seems to be no problem, even if the servers have high load.)
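
(The limit is set with rsync's --bwlimit option, roughly like
rsync -av --bwlimit=10000 master:/path/to/index/ /path/to/index/
where the paths and the KB/s value are placeholders for whatever fits
your setup.)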


3) Reading and writing at the same time. We had no chance updating an 
index while querying it at the same time. So when the index on our 
master server is updated all queries will go to the second server.


I think that some of the problems are solved in newer versions of solr. 
We are going to test that in the near future.


Marcus


Re: sort problem

2007-09-03 Thread Marcus Stratmann

If you could live with a cap of 2B on message id, switching to type
"int" would decrease the memory usage to 4 bytes per doc (presumably
you don't need range queries?)


I haven't found exact definitions of the fieldTypes anywhere. Does
"integer" span the common range from -2^31 to 2^31-1?

And there seems to be no "unsigned" int, am I right?
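
(I assume the switch would just be a schema change along these lines,
with "msgid" being a hypothetical field name:
<field name="msgid" type="integer" indexed="true" stored="true"/>
)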

Thanks
Marcus



Re: Distribution without SSH?

2007-11-30 Thread Marcus Stratmann

Justin Knoll wrote:
We plan to attempt to rewrite the snappuller (and possibly other
distribution scripts, as required) to eliminate this dependency on SSH.
I thought I'd ask the list in case anyone has experience with this same
situation or any insights into the reasoning behind requiring SSH access
to the master instance.

We use our database to store the master's state. Both master and
slave(s) have access to the database and can exchange "messages" using a
field in a table where we store miscellaneous information about our
system. After an update of the master's index a flag in that field
signals that a new index is available. The slaves regularly read this
field and pull the new index on demand.


Marcus


Re: solr setup

2006-03-28 Thread Marcus Stratmann
Hi,

I have a tomcat5 running under linux (debian). I think that
my configuration may be wrong, because I don't get solr running.

Yonik Seeley wrote:
>the layout should look something like this:
>
>tomcat/webapps/solr.war
>tomcat/solrconf/solrconfig.xml, schema.xml, etc
>tomcat/bin/startup.sh
>
>then start tomcat by executing
>./bin/startup.sh
>from the tomcat directory

Well, there's no bin in my tomcat5 directory. I start tomcat
using "/etc/init.d/tomcat5 start". May this be a problem?

This is what I've done:
I compiled the source and copied solr-1.0.war to
/var/lib/tomcat5/webapps/solr.war. Then I copied the solrconf dir
from example to /var/lib/tomcat5/ and started tomcat.
tomcat then builds a solr dir in /var/lib/tomcat5/webapps. Fine so
far, but when I call http://localhost:8180/solr/admin/ for the first
time, I get
this:

javax.servlet.ServletException

org.apache.jasper.runtime.PageContextImpl.doHandlePageException(PageContextImpl.java:846)

org.apache.jasper.runtime.PageContextImpl.access$11(PageContextImpl.java:784)

org.apache.jasper.runtime.PageContextImpl$12.run(PageContextImpl.java:766)
java.security.AccessController.doPrivileged(Native
Method)

org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:764)
org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:262)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:585)
org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
java.security.AccessController.doPrivileged(Native
Method)
javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)

org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)

root
cause

java.lang.ExceptionInInitializerError
org.apache.solr.core.SolrConfig.(SolrConfig.java:33)
org.apache.solr.update.SolrIndexConfig.(SolrIndexConfig.java:34)
org.apache.solr.core.SolrCore.(SolrCore.java:71)
org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:67)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:585)
org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
java.security.AccessController.doPrivileged(Native
Method)
javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)

org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)



Calling the page again slightly changes the "root cause"
to:

java.lang.NoClassDefFoundError
org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:67)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:585)
org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
   

Re: solr setup

2006-03-28 Thread Marcus Stratmann
> Solr looks in the current working directory for the solrconf
> directory, so it depends where that ends up when tomcat is started.
Meanwhile I found out that tomcat is located in /usr/share/tomcat5 and
that there is a bin-directory in it, which I was searching for. A
handful of links are pointing to /var/lib/tomcat5, which I found
first. So this time I started tomcat using ./bin/startup.sh as
recommended (had to set some environment variables first) but still
got an error message. However, this time a different one:

javax.servlet.ServletException

org.apache.jasper.runtime.PageContextImpl.doHandlePageException(PageContextImpl.java:846)

org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:779)
org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:262)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

root cause

java.lang.NoClassDefFoundError
org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:67)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)


At this point I gave up and tried a new approach. I changed configDir
in Config.java to "/var/lib/tomcat5/solrconf/" (this is where I placed
the configuration) and compiled the whole thing. I'm not sure if this
could really work (could it?), and in fact it didn't. But I think that
the problem is not the location of the configuration files, but
something different. What do these Security and Privilege messages
showing up in the error message mean?

java.lang.ExceptionInInitializerError
org.apache.solr.update.SolrIndexConfig.(SolrIndexConfig.java:34)
org.apache.solr.core.SolrCore.(SolrCore.java:71)
org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:67)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:585)
org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)

org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)


> It might be easier to download a recent Tomcat 5.5 distribution and
> get it working with that first... then try with the bundled version of
> Tomcat once you understand how everything works.
Thanks Yonik, maybe I should try that, though I now think that the
configuration is not the main problem.

Btw, I don't like the way the config files are handled. Searching for
them in the webapps dir is not very elegant, I think. Instead they should
be in /etc/solr or something (for linux; sorry, I don't know if there's a
common place where configs are placed under windows or other OSs).

This will become a problem for me anyway because I'm planning to have
three independent indexes which should be operated by three servlets. If
my approach of changing the variable configDir worked, this would really
be fine. This way I could create three war files containing different
locations for the config (and yes, this wouldn't work with my proposed
"elegant" way of putting everything into /etc/solr). Is this approach
correct or do I have to make changes to the code elsewhere?

Thanks,
Marcus




Deleting documents

2006-04-11 Thread Marcus Stratmann
Hello,

I have a problem deleting documents from the index.

In the tutorial <delete><id>SP2514N</id></delete> is used as an
example for deleting. I was wondering if "<id>" is some kind of
keyword or the name of a field (in the example, a unique field
named "id" is used). In my config I have the line
 <uniqueKey>bookID</uniqueKey>
making bookID (type slong as defined in the example config) my
unique id.

So I tried
 <delete><bookID>113976235</bookID></delete>
which resulted in
 unexpected XML tag /delete/bookID

Okay, so "id" seems to be a keyword rather than a field name.
With my next try
 <delete><id>113976235</id></delete>
the query worked fine:
 <result status="0"/>
But after a <commit/> I found the number of documents unchanged
in the stats. Furthermore the value of deletesById was 0.
Oddly enough cumulative_deletesById was 1 (what does this value
actually mean?).
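
For completeness, I am posting the commands with curl roughly like this
(the URL is the one from the example installation):

curl http://localhost:8983/solr/update -H 'Content-type: text/xml; charset=utf-8' --data-binary '<delete><id>113976235</id></delete>'
curl http://localhost:8983/solr/update -H 'Content-type: text/xml; charset=utf-8' --data-binary '<commit/>'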

Any ideas what's going wrong?

Thanks,
Marcus



Re: Deleting documents

2006-04-12 Thread Marcus Stratmann
> Yes, I believe the Wiki has an example like this (a uniqueKey field
> not named "id")
Right, I should have looked there, too.

> > But after a  I found the number of documents unchanged
> > in the stats.
> What stat?  maxDoc may be unchanged since it doesn't reflect deleted
> documents that haven't been squeezed out of the index (it's a lucene
> thing).  numDocs should reflect the deletion.
Yep, but numDocs is unchanged after a commit.

I tried it again this morning step by step.
Starting with a newly created index, the stats say
numDocs : 9882062
maxDoc : 9882062
commits : 0
optimizes : 0
docsPending : 0
deletesPending : 0
adds : 0
deletesById : 0
deletesByQuery : 0
errors : 0
cumulative_adds : 0
cumulative_deletesById : 0
cumulative_deletesByQuery : 0
cumulative_errors : 0
docsDeleted : 0

After giving the delete-command this changed to
numDocs : 9882062
maxDoc : 9882062
commits : 0
optimizes : 0
docsPending : 0
deletesPending : 1
adds : 0
deletesById : 1
deletesByQuery : 0
errors : 0
cumulative_adds : 0
cumulative_deletesById : 1
cumulative_deletesByQuery : 0
cumulative_errors : 0
docsDeleted : 0

And yes, I am absolutely sure the id I used for deletion
existed in the index. I tried it later with a delete by
query and it worked. (I just mention this because I found
out that the stats look like that regardless of which id
you use, an existing or non-existing one.)

Finally after a commit I got:
numDocs : 9882062
maxDoc : 9882062
commits : 1
optimizes : 0
docsPending : 0
deletesPending : 0
adds : 0
deletesById : 0
deletesByQuery : 0
errors : 0
cumulative_adds : 0
cumulative_deletesById : 1
cumulative_deletesByQuery : 0
cumulative_errors : 0
docsDeleted : 0

Apparently no change in the number of documents (and
the document can still be found in the index).

Could the problem be that my unique key field
is of type slong (as defined in the tutorial)?

Thanks,
Marcus



Re: Deleting documents

2006-04-15 Thread Marcus Stratmann
Yonik Seeley wrote:
> OK, I think I fixed this bug.  Haven't added a test case yet...
In our test case everything works properly now.
Thanks for the quick bugfix!

Marcus




Synchronizing commit and optimize

2006-04-24 Thread Marcus Stratmann
Hello,

when doing a commit or optimize, the operation takes quite a long time (in
my test case at least some minutes). When I submit the command via curl, I
get the response "curl: (52) Empty reply from server" though solr is
still working (as I can see from the process list and the admin
interface). I tried the options "--connect-timeout" and "--max-time"
but curl still returns after some seconds though the request is not
fully processed. The same thing happens when I submit the commands from
a PHP script (ensuring that it waits for a server response).
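
The commit itself is submitted roughly like this (the URL is the one
from the example installation):

curl http://localhost:8983/solr/update --max-time 3600 -H 'Content-type: text/xml; charset=utf-8' --data-binary '<commit/>'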

I'm not sure if I'm doing something wrong, but I can imagine three
causes for this.
1) curl (or my script) simply doesn't wait long enough to get a
response from the server. Well, I think I've ensured that this is not
the case, see above.
2) Jetty (I'm using the standard installation from the example) doesn't
wait long enough to get a response from Solr and thus returns an empty
response.
3) Solr itself is the problem.

For me point 2 sounds reasonable but I have no idea how to test this.

I'm also getting empty responses when adding documents to the index.
This happens every time a multiple of one million documents has
been added to the index. I guess the reason is that I have a merge
factor of 10 and that the operation of adding a document takes longer
when a multiple of 10^6 documents is reached.

Is there any way to synchronize a commit or optimize with other
commands (for example in a shell script)? The example in the script
"commit" in src/scripts doesn't use any special arguments with curl and
returns some seconds after submitting the request, so this doesn't seem
to work.

Thanks in advance,
Marcus



Re: Synchronizing commit and optimize

2006-04-28 Thread Marcus Stratmann
Yonik Seeley wrote:
>I think you are probably right about Jetty timing out the request.
>Solr doesn't implement timeouts for requests, and I haven't seen this
>behavior with Solr running on Resin.
>
>You could try another app server like Tomcat, or perhaps figure out if
>the Jetty timeout is configurable.

You were right, it's a Jetty issue.
In Jetty's configuration in jetty.xml I changed the parameter
maxIdleTime, which seems to be in milliseconds (I wasn't able to
find documentation for this anywhere). Increasing this value to
3600000 (1 hour) did the trick for me. The line is
<Set name="maxIdleTime">3600000</Set>

The default value in the example installation is much lower. Maybe
it would be a good idea to increase this, too.

Thanks,
Marcus



Re: Java heap space

2006-04-28 Thread Marcus Stratmann

Chris Hostetter wrote:

How big is your physical index directory on disk?

It's about 2.9G now.
Is there a direct connection between the size of the index and the usage of RAM?


Your best bet is to allocate as much ram to the server as you can.
Depending on how full your caches are, and what hitratios you are getting
(the "STATISTICS" link from the Admin screen will tell you) you might want
to make some of them smaller to reduce the amount of RAM Solr uses for
them.

Hm, after disabling all caches I still get OutOfMemoryErrors.
All I do currently while testing is to delete documents. No searching or 
inserting. Typically after deleting about 20,000 documents the server 
throws the first error message.



From an actual index standpoint, if you don't care about doc/field boosts
or lengthNorms, then the omitNorms="true" option on your fields (or
fieldtypes) will help save one byte per document per field you use it on.

That is something I could test, though I think this won't significantly
change the size of the index.
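
For example, I assume it would be a per-field change along these lines
(the field name is just a placeholder from my schema):

<field name="title" type="string" indexed="true" stored="true" omitNorms="true"/>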


One thing that appears suspicious to me is that everything went fine as 
long as the number of documents was below 10 million. Problems started 
when this limit was exceeded. But maybe this is just a coincidence.


Marcus


Re: Java heap space

2006-04-29 Thread Marcus Stratmann
Chris Hostetter wrote:
> interesting .. are you getting the OutOfMemory on an actual delete
> operation or when doing a commit after executing some deletes?

Yes, on a delete operation. I'm not doing any commits until the end of
all delete operations.
After reading this I was curious if using commits during deleting would
have any effect. So I tested doing a commit after 10,000 deletes at a
time (which, I know, is not recommended). But that simply didn't change
anything.

Meanwhile I found out that I can delete 10,000 more documents
(before getting an OOM) by increasing the heap space by 500M.
Unfortunately we need to delete about 200,000 documents on each update,
which would mean adding 10G to the heap space. Not to mention
the same number of inserts.


> part of the problem may be that under the covers, any delete involves
> doing a query (even if you are deleting by uniqueKey, that's implemented
> as a delete by Term, which requires iterating over a TermEnum to find the
> relevant document), and if your index is big enough, loading that TermEnum
> may be the cause of your OOM.

Yes, I thought so, too. And in fact I get OOM even if I just submit search
queries.


> Maybe, maybe not ... what options are you using in your solrconfig.xml's
> indexDefaults and mainIndex blocks?

I adopted the default values from the example installation which looked
quite reasonable to me.


> ... 10 million documents could be the
> magic point at which your mergeFactor triggers the merging of several
> large segments into one uber segment -- which may be big enough to cause
> an OOM when the IndexReader tries to open it.

Yes, I'm using the default mergeFactor of 10, and as 10 million is 10^7
this is what appeared suspicious to me.
Is it right that the mergeFactor cannot be changed once the index has
been built?

Marcus




Re: Java heap space

2006-05-01 Thread Marcus Stratmann

Yonik Seeley wrote:

Yes, on a delete operation. I'm not doing any commits until the end of
all delete operations.

I assume this is a delete-by-id and not a delete-by-query?  They work
very differently.


Yes, all queries are delete-by-id.



If you are first deleting so you can re-add a newer version of the
document, you don't need too... overwriting older documents based on
the uniqueKeyField is something Solr does for you!


Yes, I know. But the articles in our (SQL) database get new IDs when
they are changed, so they need to be deleted and re-inserted into the index.




Is it possible to use a profiler to see where all the memory is going?
It sounds like you may have uncovered a memory leak somewhere.


I'm not that experienced with Java, but if you give me some
advice I'm glad to help. So far I have had a quick look at JMP once but
that's all.

Don't hesitate to write me a PM on that subject.



Also what OS, what JVM, what appserver are you using?

OS: Linux (Debian GNU/Linux i686)
JVM: Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode) of 
Sun's JDK 5.0.
Currently I'm using the Jetty installation from the solr nightly builds 
for test purposes.


Marcus


Re: Java heap space

2006-05-01 Thread Marcus Stratmann
Chris Hostetter wrote:
> this is off the subject of the heap space issue ... but if the id changes,
> then maybe it shouldn't be the uniqueId of your index? .. your code must
> have some way of recognizing that article B with id 222 is a changed
> version of article A with id 111 (otherwise how would you know to delete
> 111 when you insert 222?) ..whatever that mechanism is, perhaps it should
> determine your uniqueKey?

No, there is no "key" or something that reveals a relation between new
article B and old article A. After B is inserted and A is deleted, all
of A's existence is gone and we do not even know that B is A's
"successor". Changes are simply kept in a table which tells the system
which IDs to delete and which new (or changed) articles to insert,
automatically giving them new IDs. I know this may not be (or at least
sound) perfect and it is not the way things are handled normally. But
this works fine for our needs. We gather information about changes to
our data during the day and apply them on a nightly update (which, I
know, does not imply that IDs have to change).
So, yes, I'm sure I got the right uniqueKey. ;-)

Marcus





Re: Java heap space

2006-05-03 Thread Marcus Stratmann

Hello,

deleting or updating documents is still not possible for me, so now I
tried to build a completely new index. Unfortunately this didn't work
either. Now I'm getting an OOM after inserting slightly more than 20,000
documents into the new index.


To me this looks as if a bug has been introduced since the revision of
about the 13th of April. To check this, I looked for old builds but
there seem to be nightly builds of the last four days only. Okay, so the
next thing I tried was to get the code via svn. Unfortunately the code
does not compile ("package junit.framework does not exist"). I found out
that the last version I was able to compile was revision 393080
(2006-04-10). So I was neither able to get back the last (for me)
working revision nor to find out which revision this actually was.
Sorry, I would really like to help, but at the moment it seems Murphy is
striking.


Thanks,
Marcus


Re: Java heap space

2006-05-03 Thread Marcus Stratmann

Yonik Seeley wrote:

Is your problem reproducible with a test case you can share?

Well, you can get the configuration files. If you ask for the data, this
could be a problem, since this is "real" data from our production
database. The amount of data needed could be another problem.


You could also try a different app-server like Tomcat to see if that
makes a difference.

This is another part of the problem because currently Tomcat won't work
with solr in my environment (a debian linux installation).


What type is your id field defined to be?


It is an slong, defined as in the sample schema.xml.

Marcus



Re: Java heap space

2006-05-04 Thread Marcus Stratmann

Chris Hostetter wrote:

This is because building a full Solr distribution from scratch requires
that you have JUnit. But it is not required to run Solr.

Ah, I see. That was a very valuable hint for me.
I was now able to compile an older revision (393957). Testing this
revision I was able to delete more than 600,000 documents without problems.
From my point of view it looks like this: revision 393957 works while
the latest revision causes problems. I don't know what part of the
distribution causes the problems but I will try to find out. I think a
good start would be to find out which was the first revision not working
for me. Maybe this would be enough information for you to find out what
had been changed at this point and what causes the problems.
I will also try just changing the solr.war to check if maybe Jetty is
responsible for the OOM.


I'll post a report when I have some results.

Marcus


Re: solr setup

2006-05-05 Thread Marcus Stratmann
Yonik Seeley wrote:
> If you start from a normal tomcat distribution, we will be able to
> eliminate that difference.

Yes, I finally got Solr working with Tomcat.
But there are still two minor problems.
The first appears when I try to get the statistics page.
I'm getting this error message:

org.apache.jasper.JasperException: Unable to compile class for JSP
An error occurred at line: 18 in the jsp file: /admin/stats.jsp
Generated servlet error:
/var/lib/tomcat5/work/Catalina/localhost/solr/org/apache/jsp/admin/stats_jsp.java:106:
 for-each loops are not supported in -source 1.3
(try -source 1.5 to enable for-each loops)
for (SolrInfoMBean.Category cat : SolrInfoMBean.Category.values()) {

I guess it's a Tomcat problem, but I don't know where it comes
from and what I can do. I'm using Tomcat 5.0.30 (from debian
testing) with the latest solr.war.


The second problem arises when I call the function "Set Level" in the
"Logging" menu. The error message is

exception

org.apache.jasper.JasperException

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:372)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:860)

root cause

java.lang.NullPointerException
java.io.File.(File.java:194)
org.apache.jsp.admin.action_jsp._jspService(action_jsp.java:132)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:860)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:860)


Well, I don't really need this function, so just take it as an error
report.

Marcus




Re: Java heap space

2006-05-15 Thread Marcus Stratmann
On 5/4/06, I wrote:
> From my point of view it looks like this: Revision 393957 works while
> the latest revision cause problems. I don't know what part of the 
> distribution causes the problems but I will try to find out. I think a
> good start would be to find out which was the first revision not working
> for me. Maybe this would be enough information for you to find out what
> had been changed at this point and what causes the problems.
(As a reminder, this was a problem with Jetty.)
Unfortunately I was not able to figure out what was going on. I
compiled some newer revisions from May but my problem with deleting a
huge amount of documents did not appear again. Maybe this is because I
changed the configuration a bit, adding "omitNorms=true" for some
fields.

Meanwhile I switched over to tomcat 5.5 as the application server and
things seem to go fine now. The only situation in which I get OutOfMemory
errors is after an optimize, when the server performs auto-warming
of the caches:
SEVERE: Error during auto-warming of key:[EMAIL 
PROTECTED]:java.lang.OutOfMemoryError: Java heap space
(from the tomcat log)
But nevertheless the server seems to run stable now with nearly 11
million documents.

Thanks to all the friendly people helping me so far!
Marcus




Re: Separate config and index per webapp

2006-05-17 Thread Marcus Stratmann

Yonik Seeley wrote:

I am hoping I can change the default location for each webapp.  Thanks!

It's not yet possible, but see this thread:
http://www.mail-archive.com/solr-dev@lucene.apache.org/msg00298.html
If I see it right, if I just rename the webapp to, say, "solrfoo", then
it still uses the system property solr.solr.home to search for the
configuration, *not* solrfoo.solr.home, right?
I'm looking for a way to have multiple webapps with different
configurations, too. I would really appreciate it if that could be made
possible. (And sorry, I'd really like to do it myself, but my Java
knowledge does not suffice for that.)


Another thing I would like to see is a complete detachment of the solr
configuration from that of the servlet container. Currently I have to
change the path to the configuration files by setting solr.solr.home or
(even worse!) by starting Tomcat (which I use) from its base home dir.
A while ago I proposed to put solr's config into /etc/solr (for linux).
It was easily done (even for me) to add this directory to the places
being searched in Config.java. I think if this is added *additionally*
it should be no problem even for those people who just want to try out
solr and have no root privileges.


Marcus


Re: Separate config and index per webapp

2006-05-17 Thread Marcus Stratmann

Chris Hostetter wrote:

correct .. we thought we could implement something that looked at the war
file name easily ... but then we were set straight -- there is no portable
way to do that, hence we came up with the current JNDI plan which isn't
quite as "out of the box" as we had hoped, but it has the advantage of
being possible.

Yes, I observed the discussion on the developer mailing list for a while
and was surprised to read that there isn't an easy solution for this problem.



I don't know that we'll ever be able to make configuring Solr completely
detached from configuring the servlet container -- other than the
simplest method of putting your. Personally I don't think that should be
a major goal: a well tuned Solr installation is going to require that you
consider/configure your servlet container's heap size to meet your needs
anyway.

Good point. Currently I'm using the solr.solr.home system property and
besides the heap size it is the only Solr-specific configuration I have
to do with Tomcat. So I can live with that.



just to clarify: if you only want one instance of Solr on the port, you
don't *have* to start tomcat from its base directory

I know, I just wanted to point out that somehow Tomcat is involved in
the Solr configuration.



... you just have to
make sure the "solr" directory is in whatever the current working
directory is when you do start it.

But what if another webapp needs the server to be started from
/some/directory/it/likes?



If the JNDI approach gets implimented, then it should make it easy for you
to specify /etc/solr (or any other directory) as your config directory
with a one line change to your tomcat configuration.

I'm looking forward to that. :-)

Thanks,
Marcus


Re: One big XML file vs. many HTTP requests

2006-05-21 Thread Marcus Stratmann

Erik Hatcher wrote:
I believe that Solr indexes one document at a time; each document  
requires a separate HTTP POST.

Actually, adding multiple documents per POST is possible.

But deleting multiple documents with just one POST is not possible,
right? Is there a special reason for that or is it just because nobody
has asked for it yet? If so: I'd like to have it! ;-)


Thanks to Erik for the hint!

Marcus


Re: solrconfig environment variable

2006-05-24 Thread Marcus Stratmann
Talking about configuration and system properties: is it possible to set 
the log level of Solr's logger from a system property? Or is there any 
other way to change this level during the start of the servlet container?


Thanks,
Marcus


OutOfMemory error while sorting

2006-06-14 Thread Marcus Stratmann

Hello,

I have a new problem with OutOfMemory errors.
As I reported before, we have an index with more than 10 million
documents and 23 fields. Recently I added a new field which we will only
use for sorting purposes (by "adding" I mean building a new index). But
it turned out that every query using this field for sorting ends in an
OutOfMemory error. Even sorting result sets containing just one
document does not work. The field is of type solr.StrField and, strangely
enough, there are some other fields in the index of the same type which
do not cause these problems (but not all of them; our uniqueKey field
has the same problems with sorting).
Now I am wondering why sorting works with some of the fields but not
with others. Could it be that this depends on the content?


Thanks,
Marcus


Re: OutOfMemory error while sorting

2006-06-19 Thread Marcus Stratmann

Hi,

Chris Hostetter wrote:

This is a fairly typical Lucene issue (ie: not specific to Solr)...

Ah, I see. I should really pay more attention to Lucene. But when
working with Solr I sometimes forget about the underlying technology.



Sorting on a field requires building a FieldCache for every document --
regardless of how many documents match your query.  This cache is reused
for all searches that sort on that field.

This makes things clear to me now. I always observed that Solr is slow
after a commit or optimize. When I put a newly created or updated index
into service the server always seemed to hang up. The CPU usage went to
nearly 100 percent and no queries were answered. I found out that
"warming" the server with serial queries, not parallel ones, bypassed
this problem (not to be confused with warming the caches!). So after a
commit I sent a few hundred queries from our log to the server and this
worked fine. But now I know I only need a few specific queries to do the
job.
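
I assume such a warming query could also go into the newSearcher
listener in solrconfig.xml, along these lines (I have not verified the
exact syntax for this version; the query and sort field are placeholders):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- hypothetical warming query; "myfield" stands in for the sort field -->
    <lst> <str name="q">solr;myfield asc</str> <str name="start">0</str> <str name="rows">10</str> </lst>
  </arr>
</listener>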


Thanks Chris for the great support! The Solr team is doing a very good 
job. With your help I finally got Solr running. Our system is live now 
and I will now switch over to the "Who uses Solr" thread to give you 
some feedback.


Again, thank you very much!

Marcus


Re: who uses Solr?

2006-06-19 Thread Marcus Stratmann

Our Solr system has been up for a few days now. You can find it at
http://www.booklooker.de/
I'm sorry we have a German user interface only, but maybe if you want to
try out our system you can just fill out some fields in our search form
and press "suchen" on the right side. We are "book brokers" and maybe
it's not too hard to find out that "Autor" means "author" and "Titel" is
"title". "Stichwort" may be interesting because this means "keyword" and
will perform a search on a "multiValued" field in Solr. One important
notice: there are two checkboxes labeled "gebraucht" (used) and "neu"
(new). Do not check "neu" because this will search an external
database which is much slower than ours. ;-)


For the more technically interested, here are some parameters. We have
now about 10.5 million documents in our index, each consisting of 24
fields (you can see why when you click "SUCHEN" on the left side, which
will present you a detailed search form). The index is 2.6G on disk.
We have two Solr servers running (actually Tomcat servers), but normally
just one is active. Our users submit about 200,000 queries per day, which
is about 2.3 queries per second. Typically this varies from 1.5 to 4.5
queries per second over the day. Additionally we have about 100,000 "search
tasks" in our database which are processed in the morning hours
(increasing the number of queries per second to 11). The index is
updated once per day on our main server and then copied to our second
server.

If you have any question I'm glad to give you further information.

Thanks to the Solr community for helping us setting up this system!

Marcus