Doc add limit problem, old issue
Old issue (see http://www.mail-archive.com/solr-user@lucene.apache.org/msg00651.html), but I'm experiencing exactly the same thing on Windows XP, latest Tomcat. I noticed that the Tomcat process gobbles memory (10 megs a second, maybe) and then jams at 125 megs. Can't find a fix yet. I'm using a PHP interface and curl to post my XML, one document at a time, with a commit every 100 documents. Indexing 30 000 docs, it hangs at maybe 5000. Anyone got an idea on this one? It would be helpful. I may try to switch to Jetty tomorrow if nothing works :( -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
Got it working! And some questions
First of all, in reference to http://www.mail-archive.com/solr-user@lucene.apache.org/msg00808.html, I got it working! The problems were coming from solPHP; the implementation in the wiki isn't really working, to be honest, at least for me. I had to modify it significantly in multiple places to get it working. Tomcat 5.5, WAMP and Windows XP. The main problem was that addIndex was sending one doc at a time to Solr; it would cause a problem after a few thousand docs because I was running out of resources. I modified solr_update.php to handle batch queries, and I'm now sending batches of 1000 docs at a time. Great indexing speed. I had a slight problem with the curl function of solr_update.php; the custom HTTP header wasn't recognized. I now use curl_setopt($ch, CURLOPT_POST, 1); curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string); - much simpler, and now everything works!

So far I have indexed 15,000,000 documents (my whole collection, basically) and the performance I'm getting is INCREDIBLE (sub-100 ms query times without warmup and no optimization at all on a 7 GB index - and with the cache, it gets stupid fast)! Seriously, Solr amazes me every time I use it. I increased the HashDocSet maxSize to 75000 and will continue to optimize this value - it helped a great deal. I will try the DisMax handler soon too; right now the standard one is great. And I will index with a better stopword file; the default one could really use improvements.

Some questions (couldn't find the answers in the docs):
- Is the Solr PHP in the wiki working out of the box for anyone? If not, we could fix the wiki...
- What is the loadFactor variable of HashDocSet? Should I optimize it too?
- What are the units of the size value of the caches? Megs, number of queries, kilobytes? It's not described anywhere.
- Any way to programmatically change the OR/AND preference of the query parser? I set it to AND by default for user queries, but I'd like to set it to OR for some server-side queries I must do (find related articles, ordered by score).
- What's the difference between the two commit types, blocking and non-blocking? I didn't see any difference at all; I tried both.
- Every time I do an <optimize> command, I get the following in my Catalina logs - should I do anything about it? 9-Sep-2006 2:24:40 PM org.apache.solr.core.SolrException log SEVERE: Exception during commit/optimize:java.io.EOFException: no more data available - expected end tag to close start tag from line 1, parser stopped on START_TAG seen ... @1:10
- Any benefit to setting the allowed memory for Tomcat higher? Right now I'm allocating 384 megs.

Can't wait to try the new faceted queries... seriously, Solr is really, really awesome so far. Thanks for all your work, and sorry for all the questions! -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
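P.S. For anyone hitting the same wall, a minimal sketch of the batched posting approach (hypothetical Solr URL; simplified from my actual solr_update.php changes):

    <?php
    // Wrap a whole batch of documents in one <add> envelope and POST it
    // to Solr in a single curl request, instead of one request per doc.
    function solr_add_batch($docs, $url = 'http://localhost:8080/solr/update') {
        $xml = '<add>';
        foreach ($docs as $doc) {
            $xml .= '<doc>';
            foreach ($doc as $name => $value) {
                $xml .= '<field name="' . htmlspecialchars($name) . '">'
                      . htmlspecialchars($value) . '</field>';
            }
            $xml .= '</doc>';
        }
        $xml .= '</add>';
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_POST, 1);            // plain POST, no custom headers
        curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $response = curl_exec($ch);
        curl_close($ch);
        return $response;
    }
    ?>

I accumulate documents in an array and call this every 1000 docs, with a commit every few batches.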
Re: Doc add limit problem, old issue
Fixed my problem: the implementation of solPHP was faulty. It was sending one doc at a time (one curl call per doc) and the system quickly ran out of resources. I've now modified it to send in batches (1000 at a time) and everything is #1! Michael Imbeault wrote: Old issue (see http://www.mail-archive.com/solr-user@lucene.apache.org/msg00651.html), but I'm experiencing exactly the same thing on Windows XP, latest Tomcat. I noticed that the Tomcat process gobbles memory (10 megs a second, maybe) and then jams at 125 megs. Can't find a fix yet. I'm using a PHP interface and curl to post my XML, one document at a time, with a commit every 100 documents. Indexing 30 000 docs, it hangs at maybe 5000. Anyone got an idea on this one? It would be helpful. I may try to switch to Jetty tomorrow if nothing works :( -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
Re: Got it working! And some questions
First of all, it seems the mailing list is having some trouble? Some of my posts end up in the wrong thread (even new threads I post), I don't receive them in my mail, and they're present only in the 'date' archive of http://www.mail-archive.com, and not in the 'thread' one? I don't receive some of the other people's posts in my mail either; the problems started last week, I think.

Secondly, Chris, thanks for all the useful answers; everything is much clearer now. This info should be added to the wiki, I think; should I do it? I'm still a little disappointed that I can't change the OR/AND parsing by just changing some parameter (like I can do for the number of results returned, for example); adding an OR between each word of the text I want to compare sounds suboptimal, but I'll probably do it that way. It's a very minor nitpick; Solr is awesome, as I said before.

@ Brian Lucas: Don't worry, solrPHP was still 99.9% functional, great work; part of it sending one doc at a time was my fault, as I was following the exact sequence (add to array, submit) displayed in the docs. The only thing that could be added is a big "//TODO: change this code" before the sections you have to change to make it work for a particular schema. I'm pretty sure the custom-header curl submit works for everyone other than me; I'm on a Windows test box with WAMP on it, so it may be caused by that. I'll send you the changes I made to the code tomorrow anyway; as I said, nothing major.

Chris Hostetter wrote:
: - What is the loadFactor variable of HashDocSet? Should I optimize it too?
this is the same as the loadFactor in a HashMap constructor -- but i don't think it has much effect on performance since the HashDocSets never "grow". I personally have never tuned the loadFactor :)
: - What are the units of the size value of the caches? Megs, number of queries, kilobytes? Not described anywhere.
"entries" ... the number of items allowed in the cache.
: - Any way to programmatically change the OR/AND preference of the query parser? I set it to AND by default for user queries, but i'd like to set it to OR for some server-side queries I must do (find related articles, order by score).
you mean using StandardRequestHandler? ... not that i can think of off the top of my head, but typically it makes sense to just configure what you want for your "users" in the schema, and then make any machine-generated queries be explicit.
: - What's the difference between the 2 commit types? Blocking and non-blocking. Didn't see any differences at all, tried both.
do you mean the waitFlush and waitSearcher options? if either of those is true, you shouldn't get a response back from the server until they have finished. if they are false, then the server should respond instantly even if it takes several seconds (or maybe even minutes) to complete the operation (optimizes can take a while in some cases -- as can opening newSearchers if you have a lot of cache warming configured)
: - Every time I do an <optimize> command, I get the following in my catalina logs - should I do anything about it?
the optimize command needs to be well-formed XML; try "<optimize/>" instead of just "<optimize>"
: - Any benefits of setting the allowed memory for Tomcat higher? Right now im allocating 384 megs.
the more memory you've got, the more caching you can support .. but if your index changes so frequently compared to the rate of *unique* queries you get that your caches never fill up, it may not matter.
-Hoss
-- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
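P.S. For completeness, here's how I now send the well-formed commands from PHP (a sketch; assumed Solr URL):

    <?php
    // POST a single XML command to Solr's update handler.
    function solr_command($xml, $url = 'http://localhost:8080/solr/update') {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $response = curl_exec($ch);
        curl_close($ch);
        return $response;
    }

    // Self-closing tags are well-formed XML and avoid the EOFException above.
    // waitFlush/waitSearcher control the blocking behaviour Hoss describes.
    solr_command('<commit waitFlush="false" waitSearcher="false"/>');
    solr_command('<optimize/>');
    ?>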
Re: Got it working! And some questions
Hello Erik, thanks for adding that feature! "do" is fine with me, if "op" is already used (not sure about this one). Erik Hatcher wrote: On Sep 10, 2006, at 10:47 PM, Michael Imbeault wrote: I'm still a little disappointed that I can't change the OR/AND parsing by just changing some parameter (like I can do for the number of results returned, for example); adding an OR between each word of the text I want to compare sounds suboptimal, but I'll probably do it that way; it's a very minor nitpick, Solr is awesome, as I said before. I'm the one that added support for controlling the default operator of Solr's query parser, and I hadn't considered the use case of controlling that setting from a request parameter. It should be easy enough to add. I'll take a look at adding that support and commit it once I have it working. What parameter name should be used for this? do=[AND|OR] (for default operator)? We have df for default field. Erik -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
MoreLikeThis class in Lucene within Solr?
Ok, so hopefully I resolved my problems posting to this mailing list and this will show up as a new topic, not in some other thread! Is it possible in any way to use the MoreLikeThis class with Solr (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html)? Right now I'm determining similar docs by just querying for the whole body with OR between words, and it's not very efficient performance-wise. I've never coded in Java, so I really don't know where I should start... Thanks, -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
Re: MoreLikeThis class in Lucene within Solr?
Thanks for that, Erik; it looks like a very good implementation of the class. If you ever find time to add it to the query handlers in Solr, I'm sure it would be wonderful for tons of users (Solr has tons of users, right? It definitely should!). I haven't looked at the specifics of how MoreLikeThis determines which items are similar; I'm mainly wondering about performance here. Yesterday I tried to code myself a poor man's similarity class (which was nothing more than doing a search with OR between words and sorting by score), and the performance was abysmal (well, I kind of expected it: 1000+ word queries on a 15 million doc collection, you don't expect miracles). At first glance I think it searches for the most 'relevant' words, am I right? What kind of performance are you getting with it? Thanks a lot, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Erik Hatcher wrote: I use MoreLikeThis in a custom request handler for Collex, for example the three items shown at the bottom left here: <http://svn.sourceforge.net/viewvc/patacriticism/collex/trunk/src/solr/org/nines/TermQueryRequestHandler.java?revision=391&view=markup> I would like to get MoreLikeThis hooked into the StandardRequestHandler just like highlighting and facets are now. One of these days I'll carve out time to do that if no one beats me to it. It would not be difficult to do; it would just take some time to iron out how to parameterize it cleanly for general-purpose use. Erik
Re: MoreLikeThis class in Lucene within Solr?
Thanks for the answer; and try to enjoy your vacation / travel! Can't wait to be able to interface with MoreLikeThis within Solr! Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Erik Hatcher wrote: On Sep 12, 2006, at 3:41 PM, Michael Imbeault wrote: I haven't looked at the specifics of how MoreLikeThis determines which items are similar; I'm mainly wondering about performance here. Yesterday I tried to code myself a poor man's similarity class (which was nothing more than doing a search with OR between words and sorting by score), and the performance was abysmal (well, I kind of expected it: 1000+ word queries on a 15 million doc collection, you don't expect miracles). At first glance I think it searches for the most 'relevant' words, am I right? What kind of performance are you getting with it? Performance with MoreLikeThis is not an issue. It has many parameters to tune how many terms are used in the query it builds, and it pulls these terms in an extremely efficient manner from the Lucene index. I'm doing some traveling soon, which is always a good time to hack on something tractable like adding MoreLikeThis to Solr. So your wish may be granted in a week :) Erik
Facet performance with heterogeneous 'facets'?
I've been playing around with the new 'facet search' and it works very well, but it's really slow for some particular applications. I've been trying to use it to display the most frequent authors of articles; this is from a huge (15 million articles) database, and names of authors are rare and heterogeneous. A query that takes (without facets) 0.1 seconds jumps to ~20 seconds with just 1% of the documents indexed (I've been getting java.lang.OutOfMemoryError with the full index), and ~40 seconds for a faceted search on 2 (string) fields. Range queries on a slong field are more acceptable (even with a dozen of them, query time is still in the sub-second range). Am I trying to do something faceted search wasn't made for? It would be understandable; after all, I guess the facet engine has to check every doc in the index and sort... which wouldn't yield good performance no matter what, sadly. Is there any other way I could achieve what I'm trying to do? Just a list of the most frequent (top 5) authors present in the results of a query. Thanks, -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
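P.S. For reference, this is the kind of request I'm making, sketched in PHP (hypothetical host and field names):

    <?php
    // Ask Solr for the top 5 authors within the result set of a query.
    $params = array(
        'q'           => 'hiv',
        'rows'        => 0,              // only the facet counts are needed
        'facet'       => 'true',
        'facet.field' => 'last_author',
        'facet.limit' => 5,
    );
    $url = 'http://localhost:8080/solr/select?' . http_build_query($params);
    $response = file_get_contents($url);
    ?>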
Re: Facet performance with heterogeneous 'facets'?
Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - whether there's one document in the result set, or I do a query that returns all 130,000 documents. It seems something isn't right... it looks like Solr is doing the faceted search on the whole index, regardless of the result set, when faceting on a string field. I must be doing something wrong? Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Michael Imbeault wrote: I've been playing around with the new 'facet search' and it works very well, but it's really slow for some particular applications. I've been trying to use it to display the most frequent authors of articles; this is from a huge (15 million articles) database, and names of authors are rare and heterogeneous. A query that takes (without facets) 0.1 seconds jumps to ~20 seconds with just 1% of the documents indexed (I've been getting java.lang.OutOfMemoryError with the full index), and ~40 seconds for a faceted search on 2 (string) fields. Range queries on a slong field are more acceptable (even with a dozen of them, query time is still in the sub-second range). Am I trying to do something faceted search wasn't made for? It would be understandable; after all, I guess the facet engine has to check every doc in the index and sort... which wouldn't yield good performance no matter what, sadly. Is there any other way I could achieve what I'm trying to do? Just a list of the most frequent (top 5) authors present in the results of a query. Thanks,
Re: Facet performance with heterogeneous 'facets'?
Yonik Seeley wrote: I noticed this too, and have been thinking about ways to fix it. The root of the problem is that Lucene, like all full-text search engines, uses inverted indices. It's fast and easy to get all documents for a particular term, but getting all terms for a document is either not possible, or not fast (assuming many documents match a query).

Yeah, that's what I've been thinking; the index isn't built to handle such searches, sadly :( It would be very nice to be able to rapidly search by most frequent author, journal, etc.

For cases like "author", if there is only one value per document, then a possible fix is to use the field cache. If there can be multiple occurrences, there doesn't seem to be a good way that preserves exact counts, except maybe if the number of documents matching a query is low.

I have one value per document (I have fields for authors, last_author and first_author, and I'm doing faceted search on the first and last author fields). How would I use the field cache to fix my problem? Also, would it be better to store a unique number (for each possible author) in an int field along with the string, and do the faceted searching on the int field? Would this be faster / require less memory? I guess yes, and I'll test that when I have the time.

Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - whether there's one document in the result set, or I do a query that returns all 130,000 documents.

Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filterCache, the effective cache hit rate will be 0% -Yonik

So more memory would fix the problem? Also, I was under the impression that it was only searching / sorting authors that it knows are in the result set... in the case of only one document (1 result), it seems strange that it takes the same time as for 130,000 results. It should just check the results, see that there's only one author, and return that? And in the case of 2 documents, just sort 2 authors (or 1 if they're the same)? I understand your answer (it does intersections), but I wonder why it's intersecting from the whole document set at first, and not docs_matching_query like you said. Thanks for the support, Michael
Re: Facet performance with heterogeneous 'facets'?
Another follow-up: I bumped all the caches in solrconfig.xml to size="1600384" initialSize="400096" autowarmCount="400096". That seemed to fix the problem on a very small index (facets on the last and first author fields, + 12 date range facets, sub-0.3 second queries). I'll check on the full index tomorrow (it's indexing right now, 400 docs/sec!). However, I still have no idea what these values represent, or how I should estimate what to set them to. Originally I thought it was the size of the cache in KB, and someone on the list told me it was a number of items, but I don't quite get it. Better documentation on that would be welcome :) Also, are there any plans to add an option not to run a facet search if the result set is too big? To avoid 40-second queries if the docset is too large... Thanks, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - whether there's one document in the result set, or I do a query that returns all 130,000 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filterCache, the effective cache hit rate will be 0% -Yonik
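P.S. For reference, the entry I bumped looks like this in solrconfig.xml:

    <!-- size / initialSize / autowarmCount are numbers of items,
         per the earlier answer on this list - not bytes -->
    <filterCache
      class="solr.LRUCache"
      size="1600384"
      initialSize="400096"
      autowarmCount="400096"/>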
Re: Facet performance with heterogeneous 'facets'?
Thanks for all the great answers.

Quick question: did you say you are faceting on the first name field separately from the last name field? ... why?

You misunderstood. I'm doing faceting on the first author and the last author of the list. Life science papers have author lists, and the first one is usually the guy who did most of the work, while the last one is usually the boss of the lab. I already have untokenized author fields for that, using copyField.

Second: you mentioned increasing the size of your filterCache significantly, but we don't really know how heterogeneous your index is ... once you made that change, did your filterCache hit rate increase? .. do you have any evictions (you can check on the "Statistics" page)?

It was at the default (16000) and it hit the ceiling, so to speak. I set maxSize=1600384 (for testing purposes) and now size: 17038 and 0 evictions. For a single facet field (journal name) with a limit of 5, plus 12 faceted query fields (ranges on publication date), I now have 0.5 second searches, which is not too bad. The filterCache size is pretty much constant no matter how many queries I do. However, if I try to add another facet field (such as first_author), something strange happens: 99% CPU, the filter cache fills up really fast, the hit ratio goes to hell, no disk activity, and it can stay that way for at least 30 minutes (didn't test longer, no point really). It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so I might only do journal_name facets. Any reason why faceting tries to preload every term in the field?

I have noticed that facet counts are not cached: facets off, a cached query takes 0.01 seconds; facets on, uncached and cached queries both take 0.7 seconds. Any plans for a facet cache? I know faceting is still a very early feature, but it's already awesome; my application is maybe unrealistic. Thanks, Michael
Re: Facet performance with heterogeneous 'facets'?
Dude, stop being so awesome (and the whole Solr team). Seriously! Every problem / request (the MoreLikeThis class, changing the AND/OR preference programmatically, etc.) I've submitted to this mailing list has received a quick, more-than-I-ever-expected answer. I'll subscribe to the dev list (I've been reading it off and on), but I'm afraid I couldn't code my way out of a paper bag in Java. I'll contribute to the Solr wiki (the SolrPHP part in particular) as soon as I can. That's the least I can do! By the way, any plans for a facet cache? Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so I might only do journal_name facets. Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html -Yonik
Re: Facet performance with heterogeneous 'facets'?
I upgraded to the most recent Solr build (9-22) and sadly it's still really slow: an 800-second query with a single facet on first_author, 15 million documents total, and the query returns 180 results. Maybe I'm doing something wrong? Also, this is on my personal desktop, not a server. Still, I'm getting 0.1 second queries without facets, so I don't think that's the cause. In the admin panel I can still see the filterCache doing millions of lookups (and tons of evictions once it hits the maxSize). Here's the field I'm using in schema.xml (a plain single-valued "string" field): <field name="first_author" type="string" indexed="true"/> This is the query: q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false I'll do more testing on the weekend, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): OK, the optimization has been checked in. You can check out from svn and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT). I'd be interested in hearing your results with it. The first facet request on a field will take longer than subsequent ones because the FieldCache entry is loaded on demand. You can use a firstSearcher/newSearcher hook in solrconfig.xml to send a facet request so that a real user would never see this slower query. -Yonik
Re: Facet performance with heterogeneous 'facets'?
Excellent news; as you guessed, my schema was (for some reason) set to version 1.0. This also caused some of the problems I had with the original SolrPHP (parsing the wrong response). But better yet, the 800-second query now runs in 0.5-2 seconds! Amazing optimization! I can now do faceting on journal title (17 000 different titles) and last author (> 400 000 authors), plus 12 date range queries, in a very reasonable time (considering I'm on a Windows test desktop box and not a server). The only problem is that if I add first author, I get a java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will go away on a server with more than the current 500 megs I can allocate to Tomcat. Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: I upgraded to the most recent Solr build (9-22) and sadly it's still really slow: an 800-second query with a single facet on first_author, 15 million documents total, and the query returns 180 results. Maybe I'm doing something wrong? Also, this is on my personal desktop, not a server. Still, I'm getting 0.1 second queries without facets, so I don't think that's the cause. In the admin panel I can still see the filterCache doing millions of lookups (and tons of evictions once it hits the maxSize). The fact that you see all the filterCache usage means that the optimization didn't kick in for some reason. Here's the field I'm using in schema.xml: That looks fine... This is the query: q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false That looks OK too. I assume that you didn't change the fieldtype definition for "string", and that the schema has version="1.1"? Before 1.1, all fields were assumed to be multiValued (there was no checking or enforcement). -Yonik
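P.S. For the record, a sketch of the warming hook Yonik mentions, for solrconfig.xml (query values are from my setup and purely illustrative):

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">hiv</str>
          <str name="facet">true</str>
          <str name="facet.field">first_author</str>
          <str name="facet.limit">5</str>
        </lst>
      </arr>
    </listener>

The same listener can be registered for the newSearcher event so the FieldCache entry is reloaded after every commit, before real users hit the new searcher.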
Spellchecker in Solr?
Hello everyone, Has anybody successfully implemented a Lucene spellchecker within Solr? If so, could you give details on how one would achieve this? If not, is it planned to make it standard within Solr? It's a feature almost every Solr application would want to use, so I think it would be a nice idea. Sadly, I'm no Java developer, so I fear I won't be the one coding it :( Thanks, -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
Re: Spellchecker in Solr?
I had the very same article in mind - how would it be simpler in Solr than in Lucene? A spellchecker is pretty much standard in every major search engine nowadays - with one, Solr would be the best, hands down (even if it already is :P). Are your plans to build this anything concrete, or is it just at the "I might do this in the future" stage? Thanks, -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Kevin Lewandowski wrote: I have not done one but have been planning to do it based on this article: http://today.java.net/pub/a/today/2005/08/09/didyoumean.html With Solr it would be much simpler than the Java examples they give. On 10/30/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Hello everyone, Has anybody successfully implemented a Lucene spellchecker within Solr? If so, could you give details on how one would achieve this? If not, is it planned to make it standard within Solr? It's a feature almost every Solr application would want to use, so I think it would be a nice idea. Sadly, I'm no Java developer, so I fear I won't be the one coding it :( Thanks, -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
Re: Spellchecker in Solr?
I had #1 in mind. Everything in my mainIndex is supposed to be correctly spelled, so I just want to use that as a source of spelling suggestions. I'd check for suggestions on low numbers of results (no results, or very few for a one-word query). #2 would be even better but, as you said, it's a lot trickier. For my needs, just a spelling suggester would be perfect. Would it require Java programming, or could I get away with it with the current Solr (adding n-gram fields and querying on them)? Thanks, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Chris Hostetter wrote: : Has anybody successfully implemented a Lucene spellchecker within Solr? : If so, could you give details on how one would achieve this? There's really two ways to interpret that question ... 1) build a spell correction suggestion application powered by Solr, where you manually feed it the data as documents and the mainIndex is the source of suggestion data. 2) Embed spell correction suggestion in Solr, so that request handlers can return suggested alternatives along with the results from your mainIndex. #1 would probably be pretty easy, as people have mentioned. #2 would be a lot trickier... request handlers can certainly keep state, and could even write to files if they wanted to preserve state across JVM instances to maintain a permanent dictionary store ... and I suppose you could use a newSearcher Listener to know when documents have been added so you can scan them for new words to update your dictionary ... but off the top of my head it sounds like it would get pretty complicated. -Hoss
Sentence level searching
Hello everyone, I'm trying to do some sentence-level searching with Solr; basically, I want to find whether two words are in the same sentence. As I read on the Lucene mailing list, there are many ways to do this, including but not limited to:
- inserting special boundary terms to denote the start and end of a sentence. It is unclear to me what kind of query should be used to fetch results from within one sentence (something like: start_sentence_token word1 word2 end_sentence_token)?
- increasing the token position at a sentence boundary by a large factor (1000?) so that "x y"~500 (or more) won't match across sentence boundaries.
Is there an existing filter class that I could use to do this, or should I first parse my text fields with PHP and some NLP tool, and index the result (for the first case)? For the second case (incrementing the token position), how would I do this within Solr? Are there any plans to implement such functionality as standard? Thanks for the help, -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
Index & search questions; special cases
Hello again,
- Let's say I index "HIV-1" with <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which after parsing by the above filter would yield HIV1 or HIV 1) also find documents which have HIV and the number "1" somewhere in the document, but not directly after HIV? If so, how should I fix this? I could boost score by proximity, but I'm sorting on date anyway, so I guess it would be pointless to do so.
- Somewhat related: let's say I index "Polymyxin B". If I stopword single letters, would a phrase search ("Polymyxin B") still find the right documents (I don't think so, but still)? If not, I'll have to index single letters; how do I prevent the same problem as in the first question (i.e., a search on Polymyxin B yielding documents with Polymyxin and B, but not close to one another)? My thought is to parse the user query and rephrase it to do phrase searches on nearby terms containing single letters / numbers; see the sketch below. If a user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND HIV). Is that a sensible solution? Thanks, -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
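P.S. A sketch of the rewriting I have in mind (pure illustration; naive whitespace tokenization, and it only handles the first single-character token it finds):

    <?php
    // 'HIV 1 hepatitis' => ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND HIV)
    function rewrite_short_tokens($query) {
        $words = preg_split('/\s+/', trim($query));
        foreach ($words as $i => $w) {
            if (strlen($w) == 1) {
                $clauses = array();
                if ($i > 0) {  // phrase with the preceding word
                    $rest = array_diff_key($words, array($i - 1 => 0, $i => 0));
                    $clauses[] = '("' . $words[$i - 1] . ' ' . $w . '"'
                               . ($rest ? ' AND ' . implode(' AND ', $rest) : '') . ')';
                }
                if ($i < count($words) - 1) {  // phrase with the following word
                    $rest = array_diff_key($words, array($i => 0, $i + 1 => 0));
                    $clauses[] = '("' . $w . ' ' . $words[$i + 1] . '"'
                               . ($rest ? ' AND ' . implode(' AND ', $rest) : '') . ')';
                }
                return implode(' OR ', $clauses);
            }
        }
        return $query;  // no single-character token: leave the query alone
    }
    ?>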
Re: Sentence level searching
Hello everyone,

Solr puts a configurable gap between values of the same field, so you could index every sentence as a separate value of a multi-valued field.

Thanks for the answer, Yonik; I forgot about multivalued fields! I'm not exactly sure how to add multiple values to a single field (aside from copyField). The code I'm thinking of using - PHP building the XML, assuming the abstract is pre-split into a $sentences array:

    foreach ($sentences as $sentence) {
        $abstract_element = $dom->createElement('field');
        $abstract_element->setAttribute('name', 'abstract');
        $abstract_text = $dom->createTextNode($sentence);
        $abstract_element->appendChild($abstract_text);
        // append inside the loop, so each sentence gets its own <field>
        $doc->appendChild($abstract_element);
    }

Field in schema.xml:

    <field name="abstract" type="text" indexed="true" stored="false" multivalued="true" />

Where am I supposed to configure the value of the gap? positionIncrementGap in the fieldtype definition is my guess, but I'm not sure. Also, am I supposed to put multivalued in the fieldtype definition? Alternatively, could I put positionIncrementGap in the <field> that I posted just above? Thanks for the help, Michael
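The resulting update message would then look something like this (hypothetical id value):

    <add>
      <doc>
        <field name="pmid">12345</field>
        <field name="abstract">First sentence of the abstract.</field>
        <field name="abstract">Second sentence of the abstract.</field>
      </doc>
    </add>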
Re: Index & search questions; special cases
Chris Hostetter wrote: A couple of things make your question really hard to answer ... first off, you can specify different analyzer chains for index time and query time -- when dealing with the WordDelim filter (or the synonym filter) this is frequently necessary -- so the answers to your questions really depend on whether you use WordDelim at both index time and query time (or if you do use it in both cases, but configure it differently).

For clarification, I'm using the filter both at index and query time.

Have you by any chance played with the "Analysis" page on your Solr index? http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on ...it makes it really easy to see exactly how your various fields will get parsed at index time and query time. I would also suggest you use the "debugQuery=on" option when doing some searches -- even if there aren't any documents in your index, that will help you see how your query is getting parsed and what Query structure QueryParser is building based on the tokens it gets from each of the Analyzers.

Will try that; I played with it in the past, but not for this particular problem. Good idea :)

: My thought is to parse the user query and rephrase it to do phrase : searches on nearby terms containing single letters / numbers. If a user : searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR : ("1 hepatitis" AND hiv). Is that a sensible solution? that's kind of a strange behavior for a search application to have ... you might just want to trust that your users will be smart, and if they find that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near "HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or 'HIV "1 hepatitis"' if that's what they meant.)

Sadly, I can't rely on users' smartness for this :) I have concerns that for stuff like Hepatitis A, it will match just about every document containing hepatitis and the very common 'a' word, anywhere in the document. I can't stopword single letters, because then there would be no way to find documents about 'hepatitis c' and not about 'hepatitis b', for example. I will test my solution and report; if you have any other ideas, just tell me. And thanks for the help! :)
Re: Sentence level searching
So basically it's just as I thought; thanks for the help :) I had checked the wiki before asking, but it lacks details and is often vague, or presupposes knowledge of some specific terms without explaining them. It's all clear now, thanks to you ;) Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Chris Hostetter wrote: : Thanks for the answer, Yonik; I forgot about multivalued fields! I'm not : exactly sure how to add multiple values to a single field (aside from : copyField). The code I'm thinking of using : If you look at the exampledocs, "features" and "cat" are both multivalued fields... you just list multiple <field>s with the same name in your <doc> : Field in schema.xml : : <field name="abstract" ... multivalued="true" /> : Where am I supposed to configure the value of the gap? : positionIncrementGap in the fieldtype definition is my guess, but I'm correct. : not sure. Also, am I supposed to put multivalued in the fieldtype : definition? Alternatively, could I put positionIncrementGap in the : <field> that I posted just above? I *think* positionIncrementGap has to be set on the fieldtype ... but i'm not 100% certain of that. multiValued and the other field attributes (indexed, stored, compressed, omitNorms) can be set on the field or inherited from the fieldtype. More info can be found in the comments of the example schema.xml, as well as these wiki pages... http://wiki.apache.org/solr/SchemaXml http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters -Hoss
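P.S. Putting the pieces together, a sketch of what I'll try in schema.xml (hypothetical type name and analyzer; the gap just has to exceed any phrase slop I'll ever use, so phrase queries can't match across sentences):

    <fieldtype name="text_sentence" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

    <field name="abstract" type="text_sentence" indexed="true" stored="false" multivalued="true"/>

With a gap of 100, a query like "word1 word2"~50 can only match within a single sentence value.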
Re: Index & search questions; special cases
Hello everyone, Thanks for all your answers; synonym-based approaches won't work because the medical / research field is evolving way too fast; the list would become unmaintainable very quickly, and it would be huge. Anyway, I can't rely on score because I'm sorting by date, so I need to completely eliminate the "hiv in one part of the doc and 1 in another part" problem (I want docs that fit HIV-1, or Polymyxin B, or hepatitis A - I don't want docs that fit "A patient was cured of hepatitis C" when I search for "hepatitis a").

: Nutch has phrase pre-filtering which helps with this. It indexes the : phrase fragments as separate terms and uses that set of matches to : filter the set of matching documents.

Is this a filter that I could implement easily in Solr? I've never done Java, but it can't be that complicated, I guess. Any help would be appreciated.

That reminds me ... i seem to remember someone saying once that Nutch also builds word-based n-grams out of its stop words, so searches on "the" or "on" won't match anything because those words are never indexed as single tokens, but if a document contains "the dog in the house" it would match a search on "in the" because the Analyzer would treat that as a single token "in_the".

This looks like exactly what I'm looking for. Is it related to the Nutch pre-filtering above? This way, if I stopword single letters and numbers, it would still index "hepatitis_a" as a single token, and match a search on "hepatitis a" (a non-phrase search) without hitting "a patient has hepatitis"? I guess I'd have to apply the filter to the query too, so it turns the query into hepatitis_a? Basically, it's another way to do what I proposed as a solution - rewrite the query to include phrase queries when you find a stopword, if you index them anyway. Still, this solution looks better, as the size of the index would probably be smaller than if I didn't stopword single letters at all? For reference, what I proposed was: parse the user query and rephrase it to do phrase searches on nearby terms containing single letters / numbers; if a user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND hiv). Is that a sensible solution? Any chance at all this kind of filter gets implemented in Solr? If not, indications on how to do it myself would be appreciated - I can't say I have a clue right now (I've never done Java; the only Lucene programming I've done was via a PHP bridge). Thanks for the help, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
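P.S. To make the common-grams idea concrete (my understanding, assuming single letters are on the common-words list):

    query "hepatitis a"                        -> single token: hepatitis_a
    doc: "... hepatitis A infection ..."       -> contains hepatitis_a          => match
    doc: "a patient was cured of hepatitis C"  -> contains a_patient,
                                                  hepatitis_c, no hepatitis_a   => no match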
Re: Index & search questions; special cases
CommonGrams itself seems to have some other dependencies on Nutch because of other utilities in the same class, but based on a quick skim, what you really want is the nested "private static class Filter extends TokenFilter" which doesn't really have any external dependencies. If you extract that class into some more specifically named "CommonGramsFilter", all you need after that to use it in Solr is a simple little "FilterFactory" so you can reference it in your schema.xml ... you can use the StopFilterFactory as a template since you'll need exactly the same initialization (get the name of a word list file from the init params, parse it, and build a word set out of it)...

Chris, thanks for the tips (or should I say, detailed explanation!). I actually got it working! It was a pain at first (I'd never done any Java, and all this ant, junit, war, jar, .class business is confusing!). I had some compile errors that I cleaned up. Playing around with the filter in the admin panel analyzer yields the expected results; I can't thank you enough for your help. My analyzer chain is now roughly:

    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords-complete.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords-complete.txt" ignoreCase="true"/>

And it works perfectly. If Solr is interested in the filter, just tell me (and how I should go about contributing it). Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/analysis/StopFilterFactory.java?view=markup ...all you really need to change is that the "create" method should return a new "CommonGramsFilter" instead of a StopFilter. Incidentally: most of the code in CommonGrams.Filter seems to be dealing with the buffering of tokens ... it may be easier to reimplement the logic with Solr's BufferedTokenStream as a base class.
Re: Solr and Oracle
I index documents I have in a MySQL database via XML. You can build your XML documents on the fly with the data from your database and index that, no problem at all. Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Nicolas St-Laurent wrote: Hi, Does someone use Solr to index/search a database instead of XML documents? I searched for information about this and didn't find any. Right now I index huge Oracle tables with Lucene, with a custom-made indexer/search engine, but I would prefer to use Solr instead. If someone can give me a hint on how to do this, I would appreciate it. Thanks, Nicolas St-Laurent
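For what it's worth, the whole loop is no more complicated than this sketch (hypothetical table, columns and Solr URL; batching and error handling omitted):

    <?php
    // Pull rows from the database, wrap them in Solr's <add> XML, and POST it.
    $db = mysql_connect('localhost', 'user', 'pass');
    mysql_select_db('articles', $db);
    $result = mysql_query('SELECT id, title, body FROM article');

    $xml = '<add>';
    while ($row = mysql_fetch_assoc($result)) {
        $xml .= '<doc>';
        foreach ($row as $name => $value) {
            // field names here must match fields declared in schema.xml
            $xml .= '<field name="' . $name . '">' . htmlspecialchars($value) . '</field>';
        }
        $xml .= '</doc>';
    }
    $xml .= '</add>';

    $ch = curl_init('http://localhost:8080/solr/update');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
    curl_exec($ch);
    curl_close($ch);
    ?>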
Re: Spellchecker in Solr
I was at the origin of the thread you mentioned; I still haven't made any progress toward integrating a spell suggestion function in Solr; but then again, I'm a Java and Lucene novice (though I'm learning fast thanks to all the help on the mailing list!). By all means, if you think you can do this, share it with the community; to me it's the last 'must have' feature that would make Solr perfect out of the box (it's still awesome without it, mind you!). I think the option you describe is the easiest / best one to implement. Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Otis Gospodnetic wrote: Hi, A month ago, the topic of a spell checker in Solr came up (c.f. http://www.mail-archive.com/solr-user@lucene.apache.org/msg01254.html ). Has anyone made any progress with that? If not, I'll have to do this to scratch my own itch. Because I'm in a hurry with this, I think I will go with the "chop terms into n-grams in the client, and send the term + the n-grams to Solr for indexing" approach, as described here: http://www.mail-archive.com/solr-user@lucene.apache.org/msg01264.html . I will then query this index for alternative spelling suggestions just like I'd query any other Solr instance (the idea being I'd search this index in parallel with the search of the index with the actual data I want to find). I will not, at this time, modify or write any spell checker request handlers that add spelling suggestions to the response. If anyone has any comments not covered in that thread above, I'm all eyes. Otis
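P.S. For the client-side chopping Otis describes, the n-gram part is trivial in PHP (a sketch; the field names are whatever the spelling index's schema uses):

    <?php
    // Chop a term into character n-grams: ngrams('hepatitis', 3)
    // -> hep, epa, pat, ati, tit, iti, tis
    function ngrams($term, $n = 3) {
        $grams = array();
        for ($i = 0; $i + $n <= strlen($term); $i++) {
            $grams[] = substr($term, $i, $n);
        }
        return $grams;
    }

    // Each word becomes a doc carrying the word itself plus its grams;
    // suggestion queries then search the gram field.
    $doc = array(
        'word'  => 'hepatitis',
        'gram3' => implode(' ', ngrams('hepatitis')),
    );
    ?>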
Re: Better highlighting fragmenter
I, for one, would be interested in such a fragmenter, as the default one is lacking and doesn't produce acceptable results for most applications. Michael Mike Klaas wrote: I've written an unpolished custom fragmenter for highlighting which is more expensive than the BasicFragmenter that ships with Lucene, but generates more natural candidate fragments (it will tend to produce beginnings/ends of sentences). Would there be interest in the community in releasing it and/or including it in Solr? -Mike