Re: LSA Implementation
Lance,

It does cover European languages, but pretty much nothing on Asian languages (CJK).

- Eswar

On Nov 28, 2007 1:51 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> WordNet itself is English-only. There are various ontology projects for it.
>
> http://www.globalwordnet.org/ is a separate world-language database project. I found it at the bottom of the WordNet Wikipedia page. Thanks for starting me on the search!
>
> Lance
>
> -----Original Message-----
> From: Eswar K [mailto:[EMAIL PROTECTED]]
> Sent: Monday, November 26, 2007 6:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> The languages also include CJK :) among others.
>
> - Eswar
>
> On Nov 27, 2007 8:16 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> > The WordNet project at Princeton (USA) is a large database of synonyms. If you're only working in English, this might be useful instead of running your own analyses.
> >
> > http://en.wikipedia.org/wiki/WordNet
> > http://wordnet.princeton.edu/
> >
> > Lance
> >
> > -----Original Message-----
> > From: Eswar K [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, November 26, 2007 6:34 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: LSA Implementation
> >
> > In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. The algorithm considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the algorithm doesn't understand anything about what the words *mean*, the patterns it notices can make it seem astonishingly intelligent.
> >
> > When you search such an index, the search engine looks at the similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, this algorithm will often return relevant documents that don't contain the keyword at all, where a plain keyword search would fail without an exact match.
> >
> > - Eswar
> >
> > On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> > > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> > > > We essentially are looking at having an implementation for doing search which can return documents having conceptually similar words without necessarily having the original word searched for.
> > >
> > > Very challenging. Say someone searches for "LSA" and hits an archived version of the mail you sent to this list. "LSA" is a reasonably discriminating term. But so is "Eswar".
> > >
> > > If you knew that the original term was "LSA", then you might look for documents near it in term vector space. But if you don't know the original term, only the content of the document, how do you know whether you should look for docs near "lsa" or "eswar"?
> > >
> > > Marvin Humphrey
> > > Rectangular Research
> > > http://www.rectangular.com/
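[To make the idea in this thread concrete, here is a toy LSA sketch. It is not Solr or Lucene code; it assumes Apache Commons Math 3 on the classpath, and the tiny term-document matrix is made up. It builds the matrix, truncates its SVD to two "concepts", and compares documents by cosine similarity in the reduced space.]

    import org.apache.commons.math3.linear.Array2DRowRealMatrix;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.SingularValueDecomposition;

    public class LsaSketch {
        public static void main(String[] args) {
            // Rows = terms, columns = documents; entries are raw term counts.
            // doc0={ship}, doc1={ship,ocean}, doc2={boat,ocean}, doc3={tree,tree}
            double[][] counts = {
                {1, 1, 0, 0},   // "ship"
                {0, 0, 1, 0},   // "boat"
                {0, 1, 1, 0},   // "ocean"
                {0, 0, 0, 2},   // "tree"
            };
            RealMatrix a = new Array2DRowRealMatrix(counts);
            SingularValueDecomposition svd = new SingularValueDecomposition(a);

            int k = 2;                          // keep the two strongest concepts
            RealMatrix v = svd.getV();          // documents x concepts
            double[] s = svd.getSingularValues();

            // Document vectors in concept space: row d of V, scaled by the
            // singular values of the kept concepts.
            double[][] docs = new double[a.getColumnDimension()][k];
            for (int d = 0; d < docs.length; d++)
                for (int c = 0; c < k; c++)
                    docs[d][c] = v.getEntry(d, c) * s[c];

            // doc0 and doc2 share no term, yet come out strongly similar in
            // concept space (linked via doc1); doc3 stays unrelated.
            System.out.printf("sim(doc0,doc2) = %.3f%n", cosine(docs[0], docs[2]));
            System.out.printf("sim(doc0,doc3) = %.3f%n", cosine(docs[0], docs[3]));
        }

        static double cosine(double[] x, double[] y) {
            double dot = 0, nx = 0, ny = 0;
            for (int i = 0; i < x.length; i++) {
                dot += x[i] * y[i];
                nx += x[i] * x[i];
                ny += y[i] * y[i];
            }
            return dot / Math.sqrt(nx * ny);
        }
    }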
Re: Combining SOLR and JAMon to monitor query execution times from a browser
Hi Norberto,

JAMon is all about aggregating statistical data and displaying the information in a web browser. The main beauty is that it is easy to define what you are monitoring, such as querying domain objects per customer.

Cheers,

Siegfried Goeschl

Norberto Meijome wrote:
> On Tue, 27 Nov 2007 18:18:16 +0100, Siegfried Goeschl <[EMAIL PROTECTED]> wrote:
> > Hi folks,
> >
> > working on a closed-source project for an IP-concerned company is not always fun ... we combined SOLR with JAMon (http://jamonapi.sourceforge.net/) to keep an eye on the query times, and this might be of general interest:
> >
> > +) JAMon comes with a ready-to-use ServletFilter
> > +) we extended this implementation to keep track of queries issued by a customer and the requested domain objects, e.g. "artist", "album", "track"
> > +) this allows us to keep track of the execution times and their distribution, to find long-running queries quickly from a web browser without having access to the access.log
> > +) a small presentation can be found at http://people.apache.org/~sgoeschl/presentations/jamon-20070717.pdf
> > +) if it is of general interest I can rewrite the code as a contribution
>
> Thanks Siegfried,
>
> I am further interested in plugging this information into something like Nagios, Cacti, Zenoss, bigsister, Openview or your monitoring system of choice, but I haven't had much time to look into this yet.
>
> How does JAMon compare to JMX (http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/)?
>
> cheers,
> B
> _
> {Beto|Norberto|Numard} Meijome
>
> There are no stupid questions, but there are a LOT of inquisitive idiots.
>
> I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently
Hi,

I experienced a very unpleasant problem recently, when my search indexing adaptor was changed to add some new fields. The problem is that my schema didn't follow those changes (the new fields were not added to it), and after that SOLR was silently ignoring all documents I sent.

Neither the SOLR Java client nor the SOLR server returned an error code or log message. On the server side, nothing was logged, and the client received a standard success return.

Why didn't my documents get indexed, with only the unknown fields ignored? That is what I think it was supposed to do.

Please let me know your thoughts.

Regards,
Daniel
Re: Memory use with sorting problem
Just wanted to add the solution to this problem, in case someone finds the matching description in the archives (see below). By reducing the granularity of the timestamp field (stored as an slong) from seconds to minutes, the number of unique values was reduced by an order of magnitude (there are about 500,000 minutes in a year), and hence the memory use was also reduced.

Chris

Chris Laux wrote:
> Hi again,
>
> in the meantime I discovered the use of jmap (I'm not a Java programmer) and found that all the memory was being used up by String and char[] objects.
>
> The Lucene docs have the following to say on sorting memory use:
>
>> For String fields, the cache is larger: in addition to the above array, the value of every term in the field is kept in memory. If there are many unique terms in the field, this could be quite large.
>
> (http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Sort.html)
>
> I am sorting on the "slong" schema type, which is of course stored as a string. The above quote seems to indicate that it is possible for a field not to be a string for the purposes of the sort, while I took it from LiA that everything is a string to Lucene.
>
> What can I do to make sure the additional memory is not used by every unique term? i.e. how do I have the slong not be a "String field"?
>
> Cheers,
> Chris
>
> Chris Laux wrote:
>> Hi all,
>>
>> I've been struggling with this problem for over a month now, and although memory issues have been discussed often, I don't seem to be able to find a fitting solution.
>>
>> The index is merely 1.5 GB large, but memory use quickly fills out the heap max of 1 GB on a 2 GB machine. This then works fine until auto-warming starts. Switching the latter off altogether is unattractive, as it leads to response times of up to 30 s. When auto-warming starts, I get this error:
>>
>>> SEVERE: Error during auto-warming of key:org.apache.solr.search.QueryResultKey@e0b93139:java.lang.OutOfMemoryError: Java heap space
>>
>> Now when I reduce the size of the caches (to a fraction of the default settings) and the number of warming Searchers (to 2), memory use is not reduced and the problem stays. Only deactivating auto-warming helps. When I set the heap size limit higher (and go into swap space), all the extra memory seems to be used up right away, independently of auto-warming.
>>
>> This all seems to be closely connected to sorting by a numerical field, as switching this off makes memory use a lot more friendly.
>>
>> Is it normal to need that much memory for such a small index?
>>
>> I suspect the problem is in Lucene; would it be better to post on their list?
>>
>> Does anyone know a better way of getting the sorting done?
>>
>> Thanks in advance for your help,
>>
>> Chris
>>
>> This is the sort field setup in schema.xml:
>>
>>   <field name="created" type="slong" indexed="true" stored="true" multiValued="false" />
>>
>> And this is a sample query:
>>
>>   select/?q=solr&start=0&rows=20&sort=created+desc
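[For anyone hitting the same wall, a minimal sketch of the fix Chris describes: round epoch-seconds timestamps down to the minute before indexing, so the sort field has far fewer unique terms for Lucene's FieldCache to hold. The class and method names here are just for illustration.]

    public class TimestampGranularity {
        /** Round an epoch-seconds timestamp down to minute granularity. */
        static long toMinute(long epochSeconds) {
            return (epochSeconds / 60L) * 60L;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis() / 1000L;
            // Index this value in the "created" slong field instead of the raw
            // value: ~31.5M distinct seconds per year shrink to ~525K minutes.
            System.out.println(now + " -> " + toMinute(now));
        }
    }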
Re: SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently
Yup, I do remember that happening to me before. Is this intentionally so?

Ravish

On Nov 28, 2007 1:41 PM, Daniel Alheiros <[EMAIL PROTECTED]> wrote:
> Hi
>
> I experienced a very unpleasant problem recently, when my search indexing adaptor was changed to add some new fields. The problem is that my schema didn't follow those changes, and after that SOLR was silently ignoring all documents I sent.
>
> Neither the SOLR Java client nor the SOLR server returned an error code or log message. On the server side, nothing was logged, and the client received a standard success return.
> [...]
>
> Regards,
> Daniel
Re: SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently
On Nov 28, 2007, at 8:41 AM, Daniel Alheiros wrote:
> I experienced a very unpleasant problem recently, when my search indexing adaptor was changed to add some new fields. The problem is that my schema didn't follow those changes, and after that SOLR was silently ignoring all documents I sent.

Is your schema perhaps configured to ignore undefined fields?

	Erik
Re: CJK Analyzers for Solr
With Ultraseek, we switched to a dictionary-based segmenter for Chinese because the N-gram highlighting wasn't acceptable to our Chinese customers. I guess it is something to check for each application.

wunder

On 11/27/07 10:46 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> For what it's worth, I worked on indexing and searching a *massive* pile of data, a good portion of which was in CJ and some K. The n-gram approach was used for all 3 languages, and the quality of the search results, including highlighting, was evaluated and okay-ed by native speakers of these languages.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message -----
> From: Walter Underwood <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, November 27, 2007 2:41:38 PM
> Subject: Re: CJK Analyzers for Solr
>
> Dictionaries are surprisingly expensive to build and maintain, and bi-gram is surprisingly effective for Chinese. See this paper:
>
>   http://citeseer.ist.psu.edu/kwok97comparing.html
>
> I expect that n-gram indexing would be less effective for Japanese because it is an inflected language. Korean is even harder. It might work to break Korean into the phonetic subparts and use n-grams on those.
>
> You should not do term highlighting with any of the n-gram methods. The relevance can be very good, but the highlighting just looks dumb.
>
> wunder
>
> On 11/27/07 8:54 AM, "Eswar K" <[EMAIL PROTECTED]> wrote:
>> Is there any specific reason why the CJK analyzers in Solr were chosen to be n-gram based, instead of a morphological analyzer (which is roughly what Google implements), given that the latter is considered more effective than the n-gram ones?
>>
>> Regards,
>> Eswar
>>
>> On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>> thanks james...
>>>
>>> How much time does it take to index 18m docs?
>>>
>>> - Eswar
>>>
>>> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
>>>> i not use HYLANDA analyzer.
>>>>
>>>> i use je-analyzer and indexing at least 18m docs.
>>>>
>>>> i m sorry i only use chinese analyzer.
>>>>
>>>> On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>>>> What is the performance of these CJK analyzers (the one in Lucene and hylanda)? We would potentially be indexing millions of documents.
>>>>>
>>>>> James,
>>>>>
>>>>> We would have a look at hylanda too. What about Japanese and Korean analyzers, any recommendations?
>>>>>
>>>>> - Eswar
>>>>>
>>>>> On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
>>>>>> I don't think NGram is a good method for Chinese.
>>>>>>
>>>>>> CJKAnalyzer of Lucene is 2-Gram.
>>>>>>
>>>>>> Eswar K: if it is a chinese analyzer, i recommend hylanda (www.hylanda.com); it is the best chinese analyzer and it is not free. if u wanna a free chinese analyzer, maybe u can try je-analyzer. it has some problems when using it.
>>>>>>
>>>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>>>>>> Eswar,
>>>>>>>
>>>>>>> We've used the NGram stuff that exists in Lucene's contrib/analyzers instead of CJK. Doesn't that allow you to do everything that the Chinese and CJK analyzers do? It's been a few months since I've looked at the Chinese and CJK Analyzers, so I could be off.
>>>>>>>
>>>>>>> Otis
>>>>>>>
>>>>>>> --
>>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: Eswar K <[EMAIL PROTECTED]>
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
>>>>>>> Subject: CJK Analyzers for Solr
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Does Solr come with language analyzers for CJK? If not, can you please direct me to some good CJK analyzers?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Eswar
>>>>>>
>>>>>> --
>>>>>> regards
>>>>>> jl
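[As a concrete picture of the 2-gram approach discussed above, a toy sketch, not the actual CJKAnalyzer source: emit overlapping character bigrams, which is essentially what Lucene's CJKAnalyzer does for runs of CJK characters.]

    import java.util.ArrayList;
    import java.util.List;

    public class BigramSketch {
        /** Emit overlapping character pairs ("2-grams") from the input. */
        static List<String> bigrams(String text) {
            List<String> out = new ArrayList<String>();
            for (int i = 0; i + 1 < text.length(); i++) {
                out.add(text.substring(i, i + 2));
            }
            return out;
        }

        public static void main(String[] args) {
            // A query need only match any shared bigram, so no dictionary
            // or word segmentation is required at index time.
            System.out.println(bigrams("中文分词"));  // [中文, 文分, 分词]
        }
    }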
query parsing & wildcards
I'm confused by some behavior I'm seeing in Solr (I'm using 1.2.0). I have a field named "description", declared with the following fieldType:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

The problem I'm having is that when I search for description:deck*, I get the results I expect; when I search for description:Deck*, I get nothing. I want both queries to return the same result set. (I'm using the standard request handler.)

Interestingly, when I search for description:Deck from the web interface, the debug output shows that the query term is converted to lowercase:

  <str name="rawquerystring">description:Deck</str>
  <str name="querystring">description:Deck</str>
  <str name="parsedquery">description:deck</str>
  <str name="parsedquery_toString">description:deck</str>

... but when I search for description:Deck*, it shows that it is not:

  <str name="rawquerystring">description:Deck*</str>
  <str name="querystring">description:Deck*</str>
  <str name="parsedquery">description:Deck*</str>
  <str name="parsedquery_toString">description:Deck*</str>

What am I doing wrong here?

Also, when I use the Field Analysis tool for description:Deck*, it shows the following (sorry for the bad copy/paste):

Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position: 1 | term text: Deck* | term type: word | source start,end: 0,5
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
  term position: 1 | term text: Deck* | term type: word | source start,end: 0,5
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  term position: 1 | term text: Deck* | term type: word | source start,end: 0,5
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0, catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
  term position: 1 | term text: Deck | term type: word | source start,end: 0,4
org.apache.solr.analysis.LowerCaseFilterFactory {}
  term position: 1 | term text: deck | term type: word | source start,end: 0,4
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  term position: 1 | term text: deck | term type: word | source start,end: 0,4

Thanks,
Charlie
Re: SOLR / Tomcat JNDI Settings
Thanks a lot Hossman; this solved it for me.

Essential for me was to understand that I had to create a solr.xml file in <tomcat-home>\conf\Catalina\localhost; see the example in the quote below. The docBase should point to the .war file somewhere on my system. The value attribute of the <Environment> element should point to a directory where Tomcat can create the Lucene/Solr index files. That home directory should also contain the conf directory from the example in the Solr distribution. And that was it.

hossman wrote:
>
> <Context docBase="/var/tmp/ac-demo/apache-solr-1.2.0/dist/apache-solr-1.2.0.war"
>          debug="0"
>          crossContext="true">
>   <Environment name="solr/home"
>                value="/var/tmp/ac-demo/books-solr-home/"
>                type="java.lang.String"
>                override="true" />
> </Context>
Re: query parsing & wildcards
I should have Googled better. It seems that my question has been asked and answered already, and not just once:

http://www.nabble.com/Using-wildcard-with-accented-words-tf4673239.html
http://groups.google.com/group/acts_as_solr/browse_thread/thread/42920dc2dcc5fa88

On Nov 28, 2007 9:42 AM, Charles Hornberger <[EMAIL PROTECTED]> wrote:
> I'm confused by some behavior I'm seeing in Solr (I'm using 1.2.0). I have a field named "description", declared with the fieldType above.
>
> The problem I'm having is that when I search for description:deck*, I get the results I expect; when I search for description:Deck*, I get nothing. I want both queries to return the same result set. (I'm using the standard request handler.)
> [...]
>
> Thanks,
> Charlie
Re: SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently
I didn't know that trick. Could you point me to this documentation?

Anyway, don't you think there is something wrong in discarding all documents without any warning? It's returning a 200 return code without any other content in the SolrJ response to updates, and it doesn't log anything on the server side...

Regards,
Daniel

On 28/11/07 15:40, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:
> On Nov 28, 2007, at 8:41 AM, Daniel Alheiros wrote:
>> I experienced a very unpleasant problem recently, when my search indexing adaptor was changed to add some new fields. The problem is that my schema didn't follow those changes, and after that SOLR was silently ignoring all documents I sent.
>
> Is your schema perhaps configured to ignore undefined fields?
>
> 	Erik
Re: SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently
: I didn't know that trick.

erik is referring to this in the example schema.xml...

  <!-- since fields of this type are by default not stored or indexed, any
       data added to them will be ignored outright -->
  <fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField" />

  <!-- uncomment the following to ignore any fields that don't match an
       existing field name or dynamic field -->
  <!-- <dynamicField name="*" type="ignored" /> -->

...but it sounds like you are having some other problem ... you said that when you POST your documents with "extra" fields you get a 200 response but the documents aren't getting indexed at all, correct? that is not supposed to happen; Solr should be generating an error.

can you give us more info on your setup: what does your schema.xml look like, what does your update code look like (you said you were using SolrJ i believe?), what does Solr log when these updates happen, etc...

-Hoss
Re: query parsing & wildcards
: I should have Googled better. It seems that my question has been asked
: and answered already, and not just once:

right, wildcard and prefix queries aren't analyzed by the query parser (there's more on the "why" of this in the Lucene-Java FAQ).

To clarify one other part of your question...

: > Also, when I use the Field Analysis tool for description:Deck*, it
: > shows the following (sorry for the bad copy/paste):

the analysis tool only shows you the "analysis" portion of indexing/querying ... it knows nothing about which query parser you are using, so it doesn't know anything about any special query parser characters (like "*"). The output it gave you shows you what the standard request handler would have done if you'd used it to search for...

	description:"Deck*"

or:

	description:Deck\*

(where the * character is 'escaped')

-Hoss
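[A minimal client-side workaround sketch for this thread: since prefix terms bypass the analyzers, normalize the user's term before building the query string. The helper name is ours, not a Solr API.]

    import java.util.Locale;

    public class PrefixQueryHelper {
        /** Mirror what LowerCaseFilterFactory would have done at index time. */
        static String prefixQuery(String field, String userTerm) {
            return field + ":" + userTerm.toLowerCase(Locale.ENGLISH) + "*";
        }

        public static void main(String[] args) {
            // "Deck" -> "description:deck*", matching the lowercased indexed terms.
            System.out.println(prefixQuery("description", "Deck"));
        }
    }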
RequestHandler shared resources
I have an object that I would like to share between two or more RequestHandlers. One request handler will be responsible for the object, and I would like the other to handle information requests about what the object is doing. Thus, I need to share the object between the handlers. Short of using a static, does anyone have a recommended way of doing this? In a pure servlet, I could use the ServletContext. Or am I missing something?

Thanks,
Grant
Re: RequestHandler shared resources
Grant Ingersoll wrote:
> I have an object that I would like to share between two or more RequestHandlers. One request handler will be responsible for the object, and I would like the other to handle information requests about what the object is doing. Thus, I need to share the object between the handlers. Short of using a static, does anyone have a recommended way of doing this? In a pure servlet, I could use the ServletContext. Or am I missing something?

RequestHandlers can know about each other by asking SolrCore:

  core.getRequestHandler( "myhandler" )

If you are using 1.3-dev, make the RequestHandler implement SolrCoreAware and then inform( SolrCore ) will be called *after* everything is initialized.

is that what you need?

ryan
Re: RequestHandler shared resources
Yeah, I think that would work. Actually, I should be able to get all the request handlers and then look for instances of the req handlers that I need.

Thanks!
-Grant

On Nov 28, 2007, at 4:42 PM, Ryan McKinley wrote:
> RequestHandlers can know about each other by asking SolrCore:
>
>   core.getRequestHandler( "myhandler" )
>
> If you are using 1.3-dev, make the RequestHandler implement SolrCoreAware and then inform( SolrCore ) will be called *after* everything is initialized.
>
> is that what you need?
>
> ryan
Re: RequestHandler shared resources
: Yeah, I think that would work. Actually, I should be able to get all the : request handlers and then look for instances of the req handlers that I need. or configure reqHandler "B" with the name of reqHandler "A" that owns the resource so it knows who to ask. -Hoss
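[A sketch of the 1.3-dev approach Ryan and Hoss describe, assuming Solr's SolrCoreAware plugin interface and RequestHandlerBase as of that era; the "/worker" handler name and the SharesState interface are hypothetical.]

    import org.apache.solr.core.SolrCore;
    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.util.plugin.SolrCoreAware;

    public class StatusHandler extends RequestHandlerBase implements SolrCoreAware {

        /** Hypothetical interface the owning handler would also implement. */
        public interface SharesState {
            String currentStatus();
        }

        private SharesState owner;

        // inform() runs after all handlers are initialized, so looking up the
        // other handler here (rather than in init) is safe.
        public void inform(SolrCore core) {
            owner = (SharesState) core.getRequestHandler("/worker");
        }

        @Override
        public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
            rsp.add("status", owner.currentStatus());
        }

        @Override
        public String getDescription() { return "reports what the worker handler is doing"; }
        @Override
        public String getSourceId() { return "$Id$"; }
        @Override
        public String getSource() { return "$URL$"; }
        @Override
        public String getVersion() { return "1.0"; }
    }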
LowerCaseFilterFactory and spellchecker
think i'm just doing something wrong...

was experimenting with the spellcheck handler with the nightly checkout from 11-28; seems my spellchecking is case-sensitive, even tho i think i'm adding the LowerCaseFilterFactory to both the index and query analyzers.

here's a brief rundown of my testing steps.

from schema.xml:

  <fieldtype name="spell" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>

  <field name="title" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="spelling" type="spell" indexed="true" stored="true" multiValued="true"/>

  <copyField source="title" dest="spelling"/>

from solrconfig.xml:

  <requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
    <lst name="defaults">
      <int name="suggestionCount">1</int>
      <float name="accuracy">0.5</float>
    </lst>
    <str name="spellcheckerIndexDir">spell</str>
    <str name="termSourceField">spelling</str>
  </requestHandler>

adding the doc:

  curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<add><doc><field name="title">Thorne</field></doc></add>'
  curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<commit/>'

building the spellchecker:

  http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker&cmd=rebuild

querying the spellchecker:

results from http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker

  <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst>
    <str name="words">Thorne</str>
    <str name="exist">false</str>
    <arr name="suggestions"><str>thorne</str></arr>
  </response>

results from http://localhost:8983/solr/select/?q=thorne&qt=spellchecker

  <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">2</int></lst>
    <str name="words">thorne</str>
    <str name="exist">true</str>
    <arr name="suggestions"/>
  </response>

any pointers as to what i'm doing wrong, misinterpreting? i suspect i'm just doing something bone-headed in the analyzer sections...

thanks as always,

rob casson
miami university libraries
Re: LowerCaseFilterFactory and spellchecker
lance,

thanks for the quick reply ... looks like 'thorne' is getting added to the dictionary, as it comes up as a suggestion for 'Thorne'.

i could certainly just lowercase in my client, but just confirming that i'm not screwing it up in the first place :)

thanks again,
rc

On Nov 28, 2007 8:11 PM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> There are a few parameters for limiting what words are added to the dictionary. You might be trimming out 'thorne'. See this page:
>
> http://wiki.apache.org/solr/SpellCheckerRequestHandler
> [...]
RE: LowerCaseFilterFactory and spellchecker
Oops, sorry, didn't think that through. The query to the spellchecker is not filtered through the field's query analyzer. You have to do your own lower-case transformation when you do the query. This is a simple thing to resolve.

But, I'm working with international alphabets, and I would like 'protege' and 'protégé' (with both e's accented) to match. The ISOLatin1 filter does this in indexing & querying. But I have to rip off the code and use it in my app to preprocess words for spell-checks.

Lance

-----Original Message-----
From: Rob Casson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 28, 2007 5:16 PM
To: solr-user@lucene.apache.org
Subject: Re: LowerCaseFilterFactory and spellchecker

lance,

thanks for the quick reply ... looks like 'thorne' is getting added to the dictionary, as it comes up as a suggestion for 'Thorne'.

i could certainly just lowercase in my client, but just confirming that i'm not screwing it up in the first place :)

thanks again,
rc
[...]
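[A sketch of that client-side preprocessing: lowercase and accent-fold the word before sending it to the spellchecker. It uses java.text.Normalizer (Java 6+) as a stand-in for Lucene's ISOLatin1 filter logic, rather than copying that filter's code.]

    import java.text.Normalizer;
    import java.util.Locale;

    public class SpellInput {
        /** Lowercase, then strip accents by decomposing and dropping combining marks. */
        static String normalize(String word) {
            String lower = word.toLowerCase(Locale.ENGLISH);
            String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
            return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        }

        public static void main(String[] args) {
            System.out.println(normalize("Thorne"));   // thorne
            System.out.println(normalize("protégé"));  // protege
        }
    }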
RE: LowerCaseFilterFactory and spellchecker
There are a few parameters for limiting what words are added to the dictionary. You might be trimming out 'thorne'. See this page:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

-----Original Message-----
From: Rob Casson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 28, 2007 4:25 PM
To: solr-user@lucene.apache.org
Subject: LowerCaseFilterFactory and spellchecker

think i'm just doing something wrong...

was experimenting with the spellcheck handler with the nightly checkout from 11-28; seems my spellchecking is case-sensitive, even tho i think i'm adding the LowerCaseFilterFactory to both the index and query analyzers.
[...]

thanks as always,

rob casson
miami university libraries
Re: LowerCaseFilterFactory and spellchecker
Rob,

Let's say it worked as you want it to in the first place. If the query is for Thurne, wouldn't you get thorne (lower-case 't') as the suggestion? This may look weird for proper names.

jds
Schema class configuration syntax
Hi -

What is the <filter> element inside an <analyzer> element that will load this class:

  org.apache.lucene.analysis.cn.ChineseFilter

This did not work:

  <filter class="org.apache.lucene.analysis.cn.ChineseFilter"/>

This is in Solr 1.2.

Thanks,

Lance Norskog
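[One way to answer this, sketched under the assumption that Solr 1.2's BaseTokenFilterFactory works as in the nightly builds: Solr's <filter> element expects a TokenFilterFactory, not a raw Lucene TokenFilter, so a thin wrapper class is needed. The package name here is made up.]

    package com.example.solr.analysis;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cn.ChineseFilter;
    import org.apache.solr.analysis.BaseTokenFilterFactory;

    /** Factory wrapper so Solr can instantiate Lucene's ChineseFilter. */
    public class ChineseFilterFactory extends BaseTokenFilterFactory {
        public TokenStream create(TokenStream input) {
            return new ChineseFilter(input);
        }
    }

[With that class on Solr's classpath, the analyzer would then reference the factory rather than the filter itself: <filter class="com.example.solr.analysis.ChineseFilterFactory"/>.]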