Re: [VOTE] Community Logo Preferences
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png
https://issues.apache.org/jira/secure/attachment/12393936/logo_remake.jpg
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg
Differences in output of spell checkers
Hello, I'm trying to learn how to use the spell checkers of Solr (1.3). I found out that FileBasedSpellChecker and IndexBasedSpellChecker produce different outputs. IndexBasedSpellChecker says 1 0 4 0 85 game false whereas FileBasedSpellChecker returns 1 0 4 game The differences are the markup used for the suggestions, the missing frequencies and the missing "correctlySpelled" in FileBasedSpellChecker. Is that a bug or a feature? Or are there simply no universal rules for the format of the output? The differences make parsing more difficult if you use both IndexBasedSpellChecker and FileBasedSpellChecker. Thanks, Marcus
Re: Differences in output of spell checkers
Hello, Are you sending in the same query to both? Frequency and word only get printed when extendedResults == true. correctlySpelled only gets printed when there is Index frequency information. For the FileBasedSpellChecker, there is no Frequency information, so it isn't returned. Yes, I am using this request in both cases: spellcheck?spellcheck=true&spellcheck.dictionary=title&spellcheck.q=gane&q=gane&spellcheck.extendedResults=true Concerning FileBasedSpellChecker I wasn't able to find any online documentation, is there any? To start with I was using "trial and error". I'm still wondering which format the input file needs to have. You write that there is no frequency information for FileBasedSpellChecker. Does that mean that every word in the index has the same "weight" (besides the distance from the word being spell checked)? Then how does spelling work? Every word in the index that is close enough (distance) to the original is considered and the one with the smallest distance is returned? What effect does spellcheck.onlyMorePopular have when there are no frequencies? Sorry if this is answered somewhere in the docs, a link would be enough for me in this case. The logic for constructing this is all handled in the SpellCheckComponent.toNamedList() method and is completely separated from the individual SpellChecker implementations. If I understand you correctly, this means that the output is just an "image" of the used data structures? From the developer's view this is very natural, but from the user's view it is annoying to have different output depending on the handler used. Anyway, this is actually no big problem for me, I was just wondering why my parser (used for IndexBasedSpellChecker) didn't work for FileBasedSpellChecker. Thanks, Marcus
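For reference, a FileBasedSpellChecker is configured roughly like this in solrconfig.xml; the dictionary itself is just a plain text file with one term per line. Take this only as a sketch: the file name, paths and the dictionary name "file" below are placeholders, not the exact setup discussed above.

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">file</str>
      <str name="classname">solr.FileBasedSpellChecker</str>
      <!-- plain text file, one term per line -->
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <str name="spellcheckIndexDir">./spellcheckerFile</str>
    </lst>
  </searchComponent>

The request then stays the same as above, except that spellcheck.dictionary points at the name given here (e.g. "file").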
spellcheck.onlyMorePopular
Hello, I have another question concerning the spell checking mechanism. Setting onlyMorePopular=true and using the parameters spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=true I get the result 1 0 4 13 32 grand true which is okay. But when I turn off onlyMorePopular spellcheck=true&spellcheck.q=gran&q=gran&spellcheck.onlyMorePopular=false the output is empty. I was expecting to get *more* results when I turn off onlyMorePopular, i.e. all of the results from the onlyMorePopular case ("grand") plus some more. Instead I get no spell check results at all. Why is that? Thanks, Marcus
Re: Differences in output of spell checkers
Hi Grant, thanks for your help. I have just one more question: BTW, one workaround is to simply create an index from your file and then use the IndexBasedSpellChecker. Each line equals one document. You could even assign weights that way. In the solrconfig.xml there is a line specifying the field the spell checking index is built from. Can I use a field from a different index for that (and how)? Or does the workaround mean that I have to make two queries, one for getting the search results and one to get spell checking results? Thanks, Marcus
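For reference, the "field" line mentioned above belongs to the spellchecker definition in solrconfig.xml and names the field of the Solr index the dictionary is built from. A rough sketch; the field name "title" matches the dictionary used in the earlier requests, the other names are placeholders:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">title</str>
      <str name="classname">solr.IndexBasedSpellChecker</str>
      <!-- field of the current index used as the dictionary source -->
      <str name="field">title</str>
      <str name="spellcheckIndexDir">./spellcheckerTitle</str>
    </lst>
  </searchComponent>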
Re: spellcheck.onlyMorePopular
Grant Ingersoll wrote: I believe the reason is b/c when onlyMP is false, if the word itself is already in the index, it short circuits out. When onlyMP is true, it checks to see if there are more frequently occurring variations. This would mean that onlyMorePopular=false isn't useful at all. If the word is in the index it would not find less frequent words, and if it is not in the index onlyMorePopular=false isn't useful since there are no less popular words. So if you are right this is a bug, isn't it? Thanks, Marcus
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote: The end goal is to give spelling suggestions. Even if it gave less frequently occurring spelling suggestions, what would you do with it? To give you an example: We have an index for computer games. One title is "gran turismo". The word "gran" is less frequent in the index than "grand". So if someone searches for "grand turismo" there will be no suggestion "gran". And to come back to my last question: There seems to be no case in which "onlyMorePopular=false" makes sense (provided Grant's assumption is correct). Do you see one? Thanks, Marcus
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote: And to come back to my last question: There seems to be no case in which "onlyMorePopular=false" makes sense (provided Grant's assumption is correct). Do you see one? Here's a use-case -- you provide a mis-spelled word and you want the closest suggestion by edit distance (frequency does not matter). Hm, when I try searching for "grand" using onlyMorePopular=false I do not get any results. Same when trying "gran". It seems that there will be no results at all when using onlyMorePopular=false. Without onlyMorePopular there are suggestions for both terms, so there are suggestions close enough to the original word(s). Have you tested your example case? Anyway, if you look at it from the user's point of view: The wiki says "spellcheck.onlyMorePopular -- Only return suggestions that result in more hits for the query than the existing query." This implies that with onlyMorePopular=false I will even get results with fewer hits. So when I'm checking "grand" I would expect to get the suggestion "gran" which is less frequent in the index. But it seems this is not the case. But even if just the documentation is wrong or unclear: 1) I could not find a case in which onlyMorePopular=false works at all. 2) It would be nice if one could get suggestions with lower frequency than the checked word (which is, to me, what onlyMorePopular=false implies). Thanks, Marcus
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote: If onlyMorePopular=true, then the algorithm finds tokens which have greater frequency than the searched term. Among these terms, the one which is closest (by edit distance) is returned. Okay, this is a bit weird, but I think I got it now. Let me try to explain it using my example. When I search for "gran" (frequency 10) I get the suggestion "grand" (frequency 17) when using onlyMorePopular=true. When I use onlyMorePopular=false there are no suggestions at all. This is because there are some (rare) terms which are closer to "gran" than "grand", but none of them are considered because their frequency is below 10. Is that correct? But then, why isn't "grand" promoted to first place and returned as a valid suggestion? I think I now understand the source of the confusion. onlyMorePopular=true is a special behavior which uses *only* those tokens which have higher frequency than the searched term. onlyMorePopular=false just switches off this special behavior. It does *not* limit suggestions to tokens which have lesser frequency than the searched term. In fact, onlyMorePopular=false does not use frequency of tokens at all. We should document this clearly to avoid such confusions in the future. I'm still missing the two parameters accuracy and spellcheck.count. Let me try to explain how I (now) think the algorithm works: 1) Take all terms from the index as a basic set. 2) If onlyMorePopular=true remove all terms from the basic set which have a frequency below the frequency of the search term. 3) Sort the basic set with respect to distance to the search term and keep the terms with the smallest distance which are "within accuracy". 4) Remove terms which have a lower frequency than the search term in the case onlyMorePopular=false. 5) Return the remaining terms as suggestions. Point 3 would explain why I do not get any suggestions for "gran" with onlyMorePopular=false. Nevertheless I think this is a bug since point 3 should take the frequency into account as well and promote suggestions with high enough frequency if suggestions with low frequency are removed. But this is just my assumption on how the algorithm works which explains why there are no suggestions using onlyMorePopular=false. Maybe I am wrong, but somewhere in the process "grand" is deleted from the result set. 2) It would be nice if one could get suggestions with lower frequency than the checked word (which is, to me, what onlyMorePopular=false implies). We could enhance spell checker to do that. But can you please explain your use-case for limiting suggestions to tokens which have lesser frequency? The goal of spell checker is to give suggestions of wrongly spelled words. It was neither designed nor intended to give any other sort of query suggestions. An example would be the mentioned "grand turismo" (note that in the example above I was searching for "gran" whereas now I am searching for "grand"). "gran" would not be returned as a suggestion because "grand" is more frequent in the index. And yes, I know, returning a suggestion in this case will only be useful if there is more than one word in the search term. You proposed to use KeywordTokenizer for this case but a) I (again) was not able to find any documentation for this and b) we are working on a different solution for this case using stored search queries. If you are interested, it works like this: For every word in the query get some spell checking suggestions. 
Combine these and find out if any of these combinations has been searched for (successfully) before. Propose the one with the highest (search) frequency. Looks promising so far, but the "gran turismo" example won't work, since there are too many "grand"s in the index. Thanks, Marcus
Re: spellcheck.onlyMorePopular
Shalin Shekhar Mangar wrote: The implementation is a bit more complicated. 1. Read all tokens from the specified field in the solr index. 2. Create n-grams of the terms read in #1 and index them into a separate Lucene index (spellcheck index). 3. When asked for suggestions, create n-grams of the query terms, search the spellcheck index and collect the top (by lucene score) 10*spellcheck.count results. 4. If onlyMorePopular=true, determine frequency of each result in the solr index and remove terms which have lesser frequency. 5. Compute the edit distance between the result and the query token. 6. Return the top spellcheck.count results (sorted by edit distance descending) which are greater than specified accuracy. Thanks, I think this makes things clear(er) now. I do agree that the documentation needs improvement on this point, as you said later in this thread. :) Your primary use-case is not spellcheck at all but this might work with some hacking. Fuzzy queries may be a better solution as Walter said. Storing all successful search queries may be hard to scale. This is certainly true. The drawback of fuzzy searching is that you get back exact and fuzzy hits together in one result set (correct me if I'm wrong). One could filter out the exact/fuzzy hits but this would make paging impossible. The approach using KeywordTokenizer as you suggested before seems more promising to me. Unfortunately there seems to be no documentation for this (at least in conjunction with spell checking). If I understand this correctly, the tokenizer must be applied to the field in the search index (not the spell checking index). Is that correct? Thanks, Marcus
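In case it is useful to someone else reading this thread, the KeywordTokenizer idea would look roughly like this in the schema.xml of the search index (the field and type names here are made up). The whole field value is kept as a single token, so multi-word titles like "gran turismo" survive as one term:

  <fieldType name="spell_phrase" class="solr.TextField">
    <analyzer>
      <!-- keep the complete field value as one token -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="title_phrase" type="spell_phrase" indexed="true" stored="false"/>
  <copyField source="title" dest="title_phrase"/>

The spellchecker would then be built from title_phrase instead of title.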
Re: Solr 1.1 HTTP server stops responding
Hi David, We're running Solr 1.1 and we're seeing intermittent cases where Solr stops responding to HTTP requests. It seems like the listener on port 8983 just doesn't respond. When we started using Solr we encountered the same problem. We are currently running Solr 1.0 (!) with Tomcat 5.5 on two servers. Our index has 16 million documents and is updated about 10 times per day (depending on the incoming data). We found three factors that may be responsible for the problems: 1) Memory. Our two servers running Solr have 8 GB each and we have set the option -Xmx2560M for Tomcat. We got rid of most problems by increasing the memory. We had no success trying to get Solr running with just 4GB in the machine. 2) Disk activity. This is strange. We found out that using rsync on the machine sometimes makes Solr stop responding. We could avoid this by setting an upper limit on the bandwidth rsync uses. Just recently we found out that even copying big files on the machine stops Solr. So it seems that high disk activity might cause problems for Solr. (We have a MySQL database running on the same servers. Normal operation seems to be no problem, even if the servers have high load.) 3) Reading and writing at the same time. We had no luck updating an index while querying it at the same time. So when the index on our master server is updated, all queries go to the second server. I think that some of the problems are solved in newer versions of Solr. We are going to test that in the near future. Marcus
Re: sort problem
If you could live with a cap of 2B on message id, switching to type "int" would decrease the memory usage to 4 bytes per doc (presumably you don't need range queries?) I haven't found exact definitions of the fieldTypes anywhere. Does "integer" span the common range from -2^31 to 2^31-1? And there seems to be no "unsigned" int, am I right? Thanks, Marcus
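For reference, the plain numeric types in the sample schema.xml are thin wrappers around Java's primitive types, roughly like this (quoted from memory, the exact attributes may differ):

  <!-- signed 32-bit, i.e. -2^31 to 2^31-1 -->
  <fieldtype name="integer" class="solr.IntField" omitNorms="true"/>
  <!-- signed 64-bit -->
  <fieldtype name="long" class="solr.LongField" omitNorms="true"/>

So "integer" is a signed 32-bit int and there is no unsigned variant.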
Re: Distribution without SSH?
Justin Knoll wrote: We plan to attempt to rewrite the snappuller (and possibly other distribution scripts, as required) to eliminate this dependency on SSH. I thought I'd ask the list in case anyone has experience with this same situation or any insights into the reasoning behind requiring SSH access to the master instance. We use our database to store the master's state. Both master and slave(s) have access to the database and can exchange "messages" using a field in a table where we store miscellaneous information about our system. After an update of the master's index a flag in that field signals that a new index is available. The slaves regularly read this field and pull the new index on demand. Marcus
Re: solr setup
Hi, I have a tomcat5 running under linux (debian). I think that my configuration may be wrong, because I don't get solr running. Yonik Seeley wrote: >the layout should look something like this: > >tomcat/webapps/solr.war >tomcat/solrconf/solrconfig.xml, schema.xml, etc >tomcat/bin/startup.sh > >then start tomcat by executing >./bin/startup.sh >from the tomcat directory Well, there's no bin in my tomcat5 directory. I start tomcat using "/etc/init.d/tomcat5 start". May this be a problem? This is what I've done: I compiled the source and copied solr-1.0.war to /var/lib/tomcat5/webapps/solr.war. Then I copied the solrconf dir from example to /var/lib/tomcat5/ and started tomcat. tomcat then builds a solr dir in /var/lib/tomcat5/webapps. Fine so far, but when I call http://localhost:8180/solr/admin/ for the first time, I get this: javax.servlet.ServletException org.apache.jasper.runtime.PageContextImpl.doHandlePageException(PageContextImpl.java:846) org.apache.jasper.runtime.PageContextImpl.access$11(PageContextImpl.java:784) org.apache.jasper.runtime.PageContextImpl$12.run(PageContextImpl.java:766) java.security.AccessController.doPrivileged(Native Method) org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:764) org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:262) org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) java.lang.reflect.Method.invoke(Method.java:585) org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAsPrivileged(Subject.java:517) org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272) org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161) root cause java.lang.ExceptionInInitializerError org.apache.solr.core.SolrConfig.(SolrConfig.java:33) org.apache.solr.update.SolrIndexConfig.(SolrIndexConfig.java:34) org.apache.solr.core.SolrCore.(SolrCore.java:71) org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:67) org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) java.lang.reflect.Method.invoke(Method.java:585) org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAsPrivileged(Subject.java:517) 
org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272) org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161) Calling the page again slightly changes the "root cause" to: java.lang.NoClassDefFoundError org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:67) org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) java.lang.reflect.Method.invoke(Method.java:585) org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
Re: solr setup
> Solr looks in the current working directory for the solrconf > directory, so it depends where that ends up when tomcat is started. Meanwhile I found out that tomcat is located in /usr/share/tomcat5 and that there is a bin-directory in it, which I was searching for. A handfull of links are pointing to /var/lib/tomcat5 which I found first. So this time I started tomcat using the ./bin/startup.sh as recommended (had to set some environment variables first) but still got an error messages. However, this time a different one: javax.servlet.ServletException org.apache.jasper.runtime.PageContextImpl.doHandlePageException(PageContextImpl.java:846) org.apache.jasper.runtime.PageContextImpl.handlePageException(PageContextImpl.java:779) org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:262) org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) root cause java.lang.NoClassDefFoundError org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:67) org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) At this point I gave up and tried a new approach. I changed configDir in Config.java to "/var/lib/tomcat5/solrconf/" (this is where I placed the configuration) and compiled the whole thing. I'm not sure, if this really could work (could it?) and in fact it didn't. But I think that the problem is not the location of the configuration files, but something different. What do these Security and Privilege messages mean showing up in the error message? java.lang.ExceptionInInitializerError org.apache.solr.update.SolrIndexConfig.(SolrIndexConfig.java:34) org.apache.solr.core.SolrCore.(SolrCore.java:71) org.apache.jsp.admin.index_jsp._jspService(index_jsp.java:67) org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) javax.servlet.http.HttpServlet.service(HttpServlet.java:802) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) java.lang.reflect.Method.invoke(Method.java:585) org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAsPrivileged(Subject.java:517) org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272) org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161) > It might be easier to download a recent Tomcat 5.5 distribution and > get it working with that first... 
then try with the bundled version of > Tomcat once you understand how everything works. Thanks Yonik, maybe I should try that, though I now think that the configuration is not the main problem. Btw, I don't like the way the config-files are handled. Searching for them in the webapps-dir is not very elegant, I think. Instead they should be in /etc/solr or something (for Linux; sorry, I don't know if there's a common place where configs are placed under Windows or other OSs). This will become a problem for me anyway because I'm planning to have three independent indexes which should be operated by three servlets. If my approach of changing the variable configDir worked, this would really be fine. This way I could create three war-files containing different locations for the config (and yes, this wouldn't work with my proposed "elegant" way of putting everything into /etc/solr). Is this approach correct or do I have to make changes to the code elsewhere? Thanks, Marcus
Deleting documents
Hello, I have a problem deleting documents from the index. In the tutorial <delete><id>SP2514N</id></delete> is used as an example for deleting. I was wondering if "id" is some kind of keyword or the name of a field (in the example, a unique field named "id" is used). In my config I have the line <uniqueKey>bookID</uniqueKey> making bookID (type slong as defined in the example config) my unique id. So I tried <delete><bookID>113976235</bookID></delete> which resulted in "unexpected XML tag /delete/bookID". Okay, so "id" seems to be a keyword, rather than a field name. With my next try <delete><id>113976235</id></delete> the query worked fine. But after a <commit/> I found the number of documents unchanged in the stats. Furthermore the value of deletesById was 0. Oddly enough cumulative_deletesById was 1 (what does this value actually mean?). Any ideas what's going wrong? Thanks, Marcus
Re: Deleting documents
> Yes, I believe the Wiki has an example like this (a uniqueKey field > not named "id") Right, I should have looked there, too. > > But after a commit I found the number of documents unchanged > > in the stats. > What stat? maxDoc may be unchanged since it doesn't reflect deleted > documents that haven't been squeezed out of the index (it's a lucene > thing). numDocs should reflect the deletion. Yep, but numDocs is unchanged after a commit. I tried it again this morning step by step. Starting with a newly created index, the stats say
numDocs : 9882062
maxDoc : 9882062
commits : 0
optimizes : 0
docsPending : 0
deletesPending : 0
adds : 0
deletesById : 0
deletesByQuery : 0
errors : 0
cumulative_adds : 0
cumulative_deletesById : 0
cumulative_deletesByQuery : 0
cumulative_errors : 0
docsDeleted : 0
After giving the delete command this changed to
numDocs : 9882062
maxDoc : 9882062
commits : 0
optimizes : 0
docsPending : 0
deletesPending : 1
adds : 0
deletesById : 1
deletesByQuery : 0
errors : 0
cumulative_adds : 0
cumulative_deletesById : 1
cumulative_deletesByQuery : 0
cumulative_errors : 0
docsDeleted : 0
And yes, I am absolutely sure the id I used for deletion existed in the index. I tried it later with a delete by query and it worked. (I just mention this because I found out that the stats look like that regardless of which id you use, an existing or non-existing one.) Finally after a commit I got:
numDocs : 9882062
maxDoc : 9882062
commits : 1
optimizes : 0
docsPending : 0
deletesPending : 0
adds : 0
deletesById : 0
deletesByQuery : 0
errors : 0
cumulative_adds : 0
cumulative_deletesById : 1
cumulative_deletesByQuery : 0
cumulative_errors : 0
docsDeleted : 0
Apparently there is no change in the number of documents (and the document can still be found in the index). Could the problem be that my unique key field is of type slong (as defined in the tutorial)? Thanks, Marcus
Re: Deleting documents
Yonik Seeley wrote: > OK, I think I fixed this bug. Haven't added a test case yet... In our test case everything works properly now. Thanks for the quick bugfix! Marcus
Synchronizing commit and optimize
Hello, when doing a commit or optimize the operation takes quite a long time (in my test case at least some minutes). When I submit the command via curl, I get the response "curl: (52) Empty reply from server" though Solr is still working (as I can see from the process list and the admin interface). I tried the options "--connect-timeout" and "--max-time" but still curl returns after some seconds though the request is not fully processed. The same thing happens when I submit the commands from a PHP script (ensuring that it waits for a server response). I'm not sure if I'm doing something wrong, but I could imagine three causes for this. 1) curl (or my script) simply doesn't wait long enough to get a response from the server. Well, I think I've ensured that this is not the case, see above. 2) Jetty (I'm using the standard installation from the example) doesn't wait long enough to get a response from Solr and thus returns an empty response. 3) Solr itself is the problem. For me point 2 sounds reasonable but I have no idea how to test this. I'm also getting empty responses when adding documents to the index. This happens every time a multiple of one million documents has been added to the index. I guess the reason is that I have a merge factor of 10 and that the operation of adding a document takes longer when a multiple of 10^6 documents is reached. Is there any way to synchronize a commit or optimize with other commands (for example in a shell script)? The example in the script "commit" in src/scripts doesn't use any special arguments with curl and returns some seconds after submitting the request, so this doesn't seem to work. Thanks in advance, Marcus
Re: Synchronizing commit and optimize
Yonik Seeley wrote: >I think you are probably right about Jetty timing out the request. >Solr doesn't implement timeouts for requests, and I haven't seen this >behavior with Solr running on Resin. > >You could try another app server like Tomcat, or perhaps figure out if >the Jetty timeout is configurable. You were right, it's a Jetty issue. In Jetty's configuration in jetty.xml I changed the parameter maxIdleTime, which seems to be in milliseconds (I wasn't able to find documentation for this anywhere). Increasing this value to 3600000 (1 hour) did the trick for me. The line is <Set name="maxIdleTime">3600000</Set> The default value in the example installation is much lower. Maybe it would be a good idea to increase this, too. Thanks, Marcus
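For reference, the line sits inside the connector/listener definition in jetty.xml and the value is in milliseconds; the surrounding elements depend on the Jetty version bundled with the example, so take this only as a sketch:

  <!-- inside the connector/listener definition in jetty.xml -->
  <Set name="maxIdleTime">3600000</Set> <!-- one hour, in milliseconds -->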
Re: Java heap space
Chris Hostetter wrote: How big is your physical index directory on disk? It's about 2.9G now. Is there a direct connection between size of index and usage of RAM? Your best bet is to allocate as much RAM to the server as you can. Depending on how full your caches are, and what hitratios you are getting (the "STATISTICS" link from the Admin screen will tell you) you might want to make some of them smaller to reduce the amount of RAM Solr uses for them. Hm, after disabling all caches I still get OutOfMemoryErrors. All I do currently while testing is to delete documents. No searching or inserting. Typically after deleting about 20,000 documents the server throws the first error message. From an actual index standpoint, if you don't care about doc/field boosts or lengthNorms, then the omitNorms="true" option on your fields (or fieldtypes) will help save one byte per document per field you use it on. That is something I could test, though I think this won't significantly change the size of the index. One thing that appears suspicious to me is that everything went fine as long as the number of documents was below 10 million. Problems started when this limit was exceeded. But maybe this is just a coincidence. Marcus
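For reference, the option Chris mentions goes directly on the field (or field type) definition in schema.xml; a sketch with a made-up field name:

  <field name="description" type="text" indexed="true" stored="true" omitNorms="true"/>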
Re: Java heap space
Chris Hostetter wrote: > interesting .. are you getting the OutOfMemory on an actual delete > operation or when doing a commit after executing some deletes? Yes, on a delete operation. I'm not doing any commits until the end of all delete operations. After reading this I was curious whether using commits during deleting would have any effect. So I tested doing a commit after 10,000 deletes at a time (which, I know, is not recommended). But that simply didn't change anything. Meanwhile I found out that I can gain 10,000 more documents to delete (before getting an OOM) by increasing the heap space by 500M. Unfortunately we need to delete about 200,000 documents on each update, which would need 10G to be added to the heap space. Not to speak of the same number of inserts. > part of the problem may be that under the covers, any delete involves > doing a query (even if you are deleting by uniqueKey, that's implemented > as a delete by Term, which requires iterating over a TermEnum to find the > relevant document, and if your index is big enough, loading that TermEnum > may be the cause of your OOM. Yes, I thought so, too. And in fact I get OOM even if I just submit search queries. > Maybe, maybe not ... what options are you using in your solrconfig.xml's > indexDefaults and mainIndex blocks? I adopted the default values from the example installation which looked quite reasonable to me. > ... 10 million documents could be the > magic point at which your mergeFactor triggers the merging of several > large segments into one uber segment -- which may be big enough to cause > an OOM when the IndexReader tries to open it. Yes, I'm using the default mergeFactor of 10 and as 10 million is 10^7 this is what appeared suspicious to me. Is it right that the mergeFactor cannot be changed once the index has been built? Marcus
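For reference, the indexDefaults/mainIndex blocks Chris asks about look roughly like this in the example solrconfig.xml (quoted from memory, so the exact values may differ):

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
  </mainIndex>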
Re: Java heap space
Yonik Seeley wrote: Yes, on a delete operation. I'm not doing any commits until the end of all delete operations. I assume this is a delete-by-id and not a delete-by-query? They work very differently. Yes, all queries are delete-by-id. If you are first deleting so you can re-add a newer version of the document, you don't need to... overwriting older documents based on the uniqueKeyField is something Solr does for you! Yes, I know. But the articles in our (SQL) database get new IDs when they are changed, so they need to be deleted and re-inserted into the index. Is it possible to use a profiler to see where all the memory is going? It sounds like you may have uncovered a memory leak somewhere. I'm not that experienced with Java, but if you give me some advice I'd be glad to help. So far I had a quick look at JMP once but that's all. Don't hesitate to write me a PM on that subject. Also what OS, what JVM, what appserver are you using? OS: Linux (Debian GNU/Linux i686) JVM: Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode) of Sun's JDK 5.0. Currently I'm using the Jetty installation from the Solr nightly builds for test purposes. Marcus
Re: Java heap space
Chris Hostetter wrote: > this is off the subject of the heap space issue ... but if the id changes, > then maybe it shouldn't be the uniqueId of your index? .. your code must > have some way of recognizing that article B with id 222 is a changed > version of article A with id 111 (otherwise how would you know to delete > 111 when you insert 222?) ..whatever that mechanism is, perhaps it should > determine your uniqueKey? No, there is no "key" or something that reveals a relation between new article B and old article A. After B is inserted and A is deleted, all of A's existence is gone and we do not even know that B is A's "successor". Changes are simply kept in a table which tells the system which IDs to delete and which new (or changed) articles to insert, automatically giving them new IDs. I know this may not be (or at least sound) perfect and it is not the way things are handled normally. But this works fine for our needs. We gather information about changes to our data during the day and apply them on a nightly update (which, I know, does not imply that IDs have to change). So, yes, I'm sure I got the right uniqueKey. ;-) Marcus
Re: Java heap space
Hello, deleting or updating documents is still not possible for me, so now I tried to build a completely new index. Unfortunately this didn't work either. Now I'm getting OOM after inserting slightly more than 20,000 documents into the new index. To me this looks as if a bug has been introduced since the revision of about April 13th. To check this out, I looked for old builds but there seem to be nightly builds of the last four days only. Okay, so the next thing I tried was to get the code via svn. Unfortunately the code does not compile ("package junit.framework does not exist"). I found out that the last version I was able to compile was revision 393080 (2006-04-10). So I was neither able to get back the last (for me) working revision nor to find out which revision this actually was. Sorry, I would really like to help, but at the moment it seems Murphy is striking. Thanks, Marcus
Re: Java heap space
Yonik Seeley wrote: Is your problem reproducible with a test case you can share? Well, you can get the configuration files. If you ask for the data, this could be a problem, since this is "real" data from our production database. The amount of data needed could be another problem. You could also try a different app-server like Tomcat to see if that makes a difference. This is another part of the problem because currently Tomcat won't work with Solr in my environment (a Debian Linux installation). What type is your id field defined to be? It is slong, defined as in the sample schema.xml. Marcus
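For reference, the slong type in the sample schema.xml is defined roughly like this (quoted from memory, exact attributes may differ):

  <fieldtype name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>

with the bookID field of that type declared as the uniqueKey.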
Re: Java heap space
Chris Hostetter wrote: This is because building a full Solr distribution from scratch requires that you have JUnit. But it is not required to run Solr. Ah, I see. That was a very valuable hint for me. I was now able to compile an older revision (393957). Testing this revision I was able to delete more than 600,000 documents without problems. From my point of view it looks like this: Revision 393957 works while the latest revision causes problems. I don't know what part of the distribution causes the problems but I will try to find out. I think a good start would be to find out which was the first revision not working for me. Maybe this would be enough information for you to find out what had been changed at this point and what causes the problems. I will also try just changing the solr.war to check if maybe Jetty is responsible for the OOM. I'll post a report when I have some results. Marcus
Re: solr setup
Yonik Seeley wrote: > If you start from a normal tomcat distribution, we will be able to > eliminate that difference. Yes, I finally got Solr working with Tomcat. But there are still two minor problems. The first appears when I try to get the statistics page. I'm getting this error message: org.apache.jasper.JasperException: Unable to compile class for JSP An error occurred at line: 18 in the jsp file: /admin/stats.jsp Generated servlet error: /var/lib/tomcat5/work/Catalina/localhost/solr/org/apache/jsp/admin/stats_jsp.java:106: for-each loops are not supported in -source 1.3 (try -source 1.5 to enable for-each loops) for (SolrInfoMBean.Category cat : SolrInfoMBean.Category.values()) { I guess it's a Tomcat problem, but I don't know where it comes from and what I can do. I'm using Tomcat 5.0.30 (from debian testing) with the latest solr.war. The second problem arises when I call the function "Set Level" in the "Logging" menu. The error message is exception org.apache.jasper.JasperException org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:372) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) javax.servlet.http.HttpServlet.service(HttpServlet.java:860) root cause java.lang.NullPointerException java.io.File.(File.java:194) org.apache.jsp.admin.action_jsp._jspService(action_jsp.java:132) org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94) javax.servlet.http.HttpServlet.service(HttpServlet.java:860) org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324) org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) javax.servlet.http.HttpServlet.service(HttpServlet.java:860) Well, I don't really need this function, so just take it as an error report. Marcus
Re: Java heap space
On 5/4/06, I wrote: > From my point of view it looks like this: Revision 393957 works while > the latest revision causes problems. I don't know what part of the > distribution causes the problems but I will try to find out. I think a > good start would be to find out which was the first revision not working > for me. Maybe this would be enough information for you to find out what > had been changed at this point and what causes the problems. (As a reminder, this was a problem with Jetty.) Unfortunately I was not able to figure out what was going on. I compiled some newer revisions from May but my problem with deleting a huge amount of documents did not appear again. Maybe this is because I changed the configuration a bit, adding "omitNorms=true" for some fields. Meanwhile I switched over to Tomcat 5.5 as application server and things seem to go fine now. The only situation in which I get OutOfMemory errors is after an optimize, when the server performs auto-warming of the caches: SEVERE: Error during auto-warming of key:[EMAIL PROTECTED]:java.lang.OutOfMemoryError: Java heap space (from the Tomcat log) But nevertheless the server seems to run stably now with nearly 11 million documents. Thanks to all the friendly people helping me so far! Marcus
Re: Separate config and index per webapp
Yonik Seeley wrote: I am hoping I can change the default location for each webapp. Thanks! It's not yet possible, but see this thread: http://www.mail-archive.com/solr-dev@lucene.apache.org/msg00298.html If I see it right, if I just rename the webapp to, say, "solrfoo" then it still uses the system property solr.solr.home to search for the configuration, *not* solrfoo.solr.home, right? I'm searching for a way to have multiple webapps with different configurations, too. I would really appreciate it if that could be made possible. (And sorry, I'd really like to do it myself, but my Java knowledge does not suffice for that.) Another thing I would like to see is a complete detachment of the Solr configuration from that of the servlet container. Currently I have to change the path to the configuration files by setting solr.solr.home or (even worse!) by starting Tomcat (which I use) from its base home dir. A while ago I proposed to put Solr's config into /etc/solr (for Linux). It was easily done (even for me) to add this directory to the places being searched in Config.java. I think if this is put in *additionally* it should be no problem even for those people who just want to try out Solr and have no root privileges. Marcus
Re: Separate config and index per webapp
Chris Hostetter wrote: correct .. we thought we could implement something that looked at the war file name easily ... but then we were set straight -- there is no portable way to do that, hence we came up with the current JNDI plan which isn't quite as "out of the box" as we had hoped, but it has the advantage of being possible. Yes, I observed the discussion on the developer mailing list for a while and was surprised to read that there isn't an easy solution for this problem. I don't know that we'll ever be able to make configuring Solr completely detached from configuring the servlet container -- other than the simplest method of putting your solr home in the current working directory. Personally I don't think that should be a major goal: a well tuned Solr installation is going to require that you consider/configure your servlet container's heap size to meet your needs anyway. Good point. Currently I'm using the solr.solr.home system property and besides the heap size it is the only Solr specific configuration I have to do with Tomcat. So I can live with that. just to clarify: if you only want one instance of Solr on the port, you don't *have* to start Tomcat from its base directory I know, I just wanted to point out that somehow Tomcat is involved in the Solr configuration. ... you just have to make sure the "solr" directory is in whatever the current working directory is when you do start it. But what if another webapp needs the server to be started from /some/directory/it/likes? If the JNDI approach gets implemented, then it should make it easy for you to specify /etc/solr (or any other directory) as your config directory with a one line change to your Tomcat configuration. I'm looking forward to that. :-) Thanks, Marcus
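To make the discussion a bit more concrete, I imagine the JNDI variant would end up as a small context fragment on the Tomcat side, something like this sketch (the paths and the "solr/home" name are assumptions on my part, not a settled interface):

  <Context docBase="/var/lib/tomcat5/webapps/solr.war" debug="0" crossContext="true">
    <!-- per-webapp Solr home, resolved via JNDI -->
    <Environment name="solr/home" type="java.lang.String" value="/etc/solr" override="true"/>
  </Context>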
Re: One big XML file vs. many HTTP requests
Erik Hatcher wrote: I believe that Solr indexes one document at a time; each document requires a separate HTTP POST. Actually adding multiple documents per POST is possible But deleting multiple documents with just one POST is not possible, right? Is there a special reason for that or is it because nobody asked for that yet? If so: I'd like to have it! ;-) Thanks to Erik for the hint! Marcus
Re: solrconfig environment variable
Talking about configuration and system properties: is it possible to set the log level of Solr's logger from a system property? Or is there any other way to change this level during the start of the servlet container? Thanks, Marcus
OutOfMemory error while sorting
Hello, I have a new problem with OutOfMemory errors. As I reported before, we have an index with more than 10 million documents and 23 fields. Recently I added a new field which we will use only for sorting purposes (by "adding" I mean building a new index). But it turned out that every query using this field for sorting ends in an out of memory error. Even sorting result sets containing just one document does not work. The field is of type solr.StrField and strangely enough there are some other fields in the index of the same type which do not cause these problems (but not all of them; our uniqueKey field has the same problems with sorting). Now I am wondering why sorting works with some of the fields but not with others. Could it be that this depends on the content? Thanks, Marcus
Re: OutOfMemory error while sorting
Hi, Chris Hostetter wrote: This is a fairly typical Lucene issue (ie: not specific to Solr)... Ah, I see. I should really pay more attention to Lucene. But when working with Solr I sometimes forget about the underlying technology. Sorting on a field requires building a FieldCache for every document -- regardless of how many documents match your query. This cache is reused for all searches that sort on that field. This makes things clear to me now. I always observed that Solr is slow after a commit or optimize. When I put a newly created or updated index into service the server always seemed to hang up. The CPU usage went to nearly 100 percent and no queries were answered. I found out that "warming" the server with serial queries, not parallel ones, bypassed this problem (not to be confused with warming the caches!). So after a commit I sent some hundred queries from our log to the server and this worked fine. But now I know I only need a few specific queries to do the job. Thanks, Chris, for the great support! The Solr team is doing a very good job. With your help I finally got Solr running. Our system is live now and I will now switch over to the "Who uses Solr" thread to give you some feedback. Again, thank you very much! Marcus
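A follow-up for the archives: instead of firing the warming queries by hand after every commit, they can be configured as listener queries in solrconfig.xml so the new searcher is warmed before it is put into service. A sketch; the sort field name is made up and any query that sorts on the field should do:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">solr</str>
        <!-- sorting on the field forces the FieldCache to be built -->
        <str name="sort">mySortField asc</str>
        <str name="rows">1</str>
      </lst>
    </arr>
  </listener>

The same block with event="firstSearcher" covers the start-up case.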
Re: who uses Solr?
Our Solr system has been up for a few days now. You can find it at http://www.booklooker.de/ I'm sorry we have a German user interface only, but if you want to try out our system you can just fill out some fields in our search form and press "suchen" on the right side. We are "book brokers" and maybe it's not too hard to find out that "Autor" means "author" and "Titel" is "title". "Stichwort" may be interesting because this means "keyword" and will perform a search in a "multiValued" field in Solr. One important notice: there are two checkboxes labeled "gebraucht" (used) and "neu" (new). Do not check "neu" because this will search an external database which is much slower than ours. ;-) For the more technically interested, here are some parameters. We now have about 10.5 million documents in our index, each consisting of 24 fields (you can see why when you click "SUCHEN" on the left side, which will present you a detailed search form). The index is 2.6G on disk. We have two Solr servers running (actually Tomcat servers), but normally just one is active. Our users submit about 200,000 queries per day, which is 2.3 queries per second. Typically this varies from 1.5 to 4.5 queries per second over the day. Additionally we have about 100,000 "search tasks" in our database which are processed in the morning hours (increasing the number of queries per second to 11). The index is updated once per day on our main server and then copied to our second server. If you have any questions I'm glad to give you further information. Thanks to the Solr community for helping us set up this system! Marcus