Re: How to Sort By a PageRank-Like Complicated Strategy?
Dear Shashi,

Thanks so much for your reply! However, I think the PageRank value is not static; it must be updated on the fly. As far as I know, a Lucene index is not well suited to very frequent updates. If that is the case, how should I deal with it?

Best regards,
Bing

On Sun, Jan 22, 2012 at 12:43 PM, Shashi Kant wrote:
> Lucene has a mechanism to "boost" documents up or down using your custom ranking algorithm. So if you come up with something like PageRank, you might do something like doc.setBoost(myBoost) before writing to the index.
>
> On Sat, Jan 21, 2012 at 5:07 PM, Bing Li wrote:
> > Hi, Kai,
> >
> > Thanks so much for your reply!
> >
> > If retrieval is done on a string field rather than a text field, a complete-matching approach should be used, according to my understanding, right? If so, how does Lucene rank the retrieved data?
> >
> > Best regards,
> > Bing
> >
> > On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu wrote:
> >> Solr handles the retrieval step; you can customize the scoring formula in Lucene. But it shouldn't be too complicated; ideally it can be factored. It also depends on the stored information, such as TF, DF, positions, etc. You can do a second-phase re-rank of the top N results you have retrieved.
> >>
> >> Sent from my iPad
> >>
> >> On Jan 21, 2012, at 1:33 PM, Bing Li wrote:
> >> > Dear all,
> >> >
> >> > I am using SolrJ to implement a system that needs to provide users with searching services. I have some questions about Solr searching as follows.
> >> >
> >> > As far as I know, Lucene retrieves data according to the degree of keyword matching on a text field (partial matching).
> >> >
> >> > But if I search data by a string field (complete matching), how does Lucene sort the retrieved data?
> >> >
> >> > If I want to add new ways of sorting, Solr's function query seems to support this feature.
> >> >
> >> > However, for a complicated ranking strategy, such as PageRank, can Solr provide an interface for me to do that?
> >> >
> >> > My ranking is more complicated than PageRank. Now I have to load all of the matched data from Solr by keyword first and rank it again in my own way before showing it to users. Is that correct?
> >> >
> >> > Thanks so much!
> >> > Bing
> >> >
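Kai's "second-phase re-rank" suggestion is often the practical answer when the ranking signal (a PageRank-like value) changes too often to bake into the index as a boost. A minimal sketch in plain Java, with hypothetical names and arbitrary weights, keeping the authority scores in an external map so they can change without re-indexing:

```java
import java.util.*;

// Hypothetical sketch, not Solr API code: re-rank the top-N hits already
// fetched from Solr using an externally maintained, frequently updated
// PageRank-like score. All names and weights are assumptions.
public class SecondPhaseReRanker {

    // Blend text relevance with the external authority score.
    // The 0.7/0.3 weights are placeholders to tune for your data.
    static double combined(double solrScore, double authority) {
        return 0.7 * solrScore + 0.3 * authority;
    }

    // docs: docId -> Solr score for the top-N hits.
    // authority: docId -> current PageRank-like value (updated on the fly).
    static List<String> reRank(Map<String, Double> docs,
                               Map<String, Double> authority) {
        List<String> ids = new ArrayList<>(docs.keySet());
        ids.sort((a, b) -> Double.compare(
                combined(docs.get(b), authority.getOrDefault(b, 0.0)),
                combined(docs.get(a), authority.getOrDefault(a, 0.0))));
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Double> docs = new LinkedHashMap<>();
        docs.put("doc1", 2.0);
        docs.put("doc2", 1.5);
        Map<String, Double> rank = new HashMap<>();
        rank.put("doc2", 5.0); // doc2 is far more "authoritative"
        System.out.println(reRank(docs, rank)); // doc2 comes first
    }
}
```

The design choice is that Solr handles recall and text relevance for the top N, while the volatile signal is applied only to that small candidate set, so nothing needs to be re-indexed when the PageRank values change.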
Re: Improving Solr Spell Checker Results
James,

I worked out that I actually needed to 'apply' patch SOLR-2585, whoops. So I have done that now, and it seems to return 'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could something have changed in the trunk to make your patch no longer work? I had to manually merge the setup for the test case due to a new 'hyphens' test case. The settings I am using are: explicit 10 false 10 true true true 10 1 5 1 default spell solr.DirectSolrSpellChecker internal 0.5 2 1 5 4 0.01 spellchecker true

With the query: spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5

Cheers,
David

On 22/01/2012 2:03 AM, David Radunz wrote: James, Thanks again for your lengthy and informative response. I updated from SVN trunk again today and was successfully able to run 'ant test'. So I proceeded with trying your suggestions (for question 1 so far):

On 17/01/2012 5:32 AM, Dyer, James wrote: David, The spellchecker normally won't give suggestions for any term in your index. So even if "wever" is misspelled in context, if it exists in the index the spell checker will not try correcting it. There are 3 workarounds:

1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). See https://issues.apache.org/jira/browse/SOLR-2585

I have tried using this with the original test case of 'Signorney Wever'. I didn't notice any difference, although I am a little unclear as to what exactly this patch does. Nor am I really clear what to set either of the options to, so I set them both to '5'. I tried to find the test case it mentions, but it's not present in SpellCheckCollatorTest.java .. Any suggestions?

2. try "onlyMorePopular=true" in your request.
(http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular). But see the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it would.

Trying this did produce 'Signourney Weaver' as you would hope, but I am a little afraid of the downside. I would much prefer a context-sensitive spell check that involves the terms around the correction.

3. If you're building your index on a, you can add a stopword filter that filters out all of the misspelt or rare words from the field the dictionary is based on. This could be an arduous task, and it may or may not work well for your data.

I am currently using a copyField for all terms that are relevant, which is quite a lot, and the dictionary would encompass a huge amount of data. Adding stopword filters would be out of the question, as we presently have more than 30,000 products, and this is just for the initial launch; we intend to have many, many more.

As for your second question, I take it you're using (e)dismax with multiple fields in "qf", right? The only way I know to handle this is to create a that combines all of the fields you search across. Use this combined field to base your dictionary on. Also, specifying "spellcheck.maxCollationTries" with a non-zero value will weed out the nonsense word combinations that are likely to occur when doing this, ensuring that any collations provided will indeed yield hits. The downside to doing this, of course, is that it will make your first problem more acute, in that there will be even more terms in your index that the spellchecker will ignore entirely, even if they're misspelled in context. Once again, SOLR-2585 is designed to tackle this problem, but it is still in its early stages, and thus far it is Trunk-only.

I tried setting spellcheck.maxCollationTries to 5 to see if it would help with the above problem, but it did not. I have now tried using it in the context of question 2.
I tried searching for 'Sigorney Wever' in the series name (which it's not present in, as it's an actor's name): spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5

Suggestions for 'Sigourney Wever' were returned, but no spelling suggestions, nor ones for series names (which I doubt there would be), should have been returned.

You might also be interested in https://issues.apache.org/jira/browse/SOLR-2993 . Although this is unrelated to your two questions, the patch on this issue introduces a new "ConjunctionSolrSpellChecker" which theoretically could be enhanced to do exactly what you want. That is, you could (theoretically) create separate dictionaries for each of the fiel
RE: Tika0.10 language identifier in Solr3.5.0
Hi,

This is exactly what I hope you can elaborate on: an analyzer that detects the language and then analyzes accordingly. How do I do that? Thank you.

Best Regards,
Ni, Bing

> From: ted.dunn...@gmail.com
> Date: Fri, 20 Jan 2012 09:15:30 -0800
> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> To: solr-user@lucene.apache.org
>
> I think you misunderstood what I am suggesting.
>
> I am suggesting an analyzer that detects the language and then "does the right thing" according to the language it finds. As such, it would tokenize and stem English according to English rules, German by German rules, and would probably do a sliding bigram window in Japanese and Chinese.
>
> On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson wrote:
> > bq: Why not have a polyglot analyzer
> >
> > That could work, but it makes some compromises and assumes that your languages are "close enough". I have absolutely no clue how that would work for English and Chinese, say.
> >
> > But it also introduces inconsistencies. Take stemming. Even though you could easily stem in the correct language, throwing all those stems into the same field can produce interesting results at search time, since you run the risk of hitting something produced by one of the other analysis chains.
> >
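Ted's idea can be illustrated with a deliberately simplified, self-contained sketch. A real implementation would wrap actual Lucene analyzers and a real language detector (e.g. Tika's); the detector and tokenizers below are toy stand-ins used only to show the detect-then-dispatch structure:

```java
import java.util.*;

// Toy sketch of "detect the language, then do the right thing".
// The detector and both tokenizers are deliberately simplistic
// placeholders, not production analysis chains.
public class PolyglotTokenizerSketch {

    // Placeholder detector: treat text containing Han codepoints as "cjk".
    static String detect(String text) {
        boolean hasHan = text.codePoints().anyMatch(
                cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
        return hasHan ? "cjk" : "latin";
    }

    static List<String> tokenize(String text) {
        if (detect(text).equals("cjk")) {
            // Sliding bigram window, as Ted describes for Chinese/Japanese.
            List<String> out = new ArrayList<>();
            String s = text.replaceAll("\\s+", "");
            for (int i = 0; i + 1 < s.length(); i++) {
                out.add(s.substring(i, i + 2));
            }
            return out;
        }
        // Whitespace split + lowercasing as a stand-in for per-language
        // tokenization and stemming rules.
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello World")); // [hello, world]
        System.out.println(tokenize("中文分词"));      // [中文, 文分, 分词]
    }
}
```

Note this also illustrates Erick's caution: if both outputs land in one field, tokens produced by the different chains coexist there, which is exactly where the "interesting" cross-language matches come from.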
Re: Phonetic search for portuguese
Could anyone help?

Thanks

2012/1/20, Anderson vasconcelos :
> Hi
>
> Are the phonetic filters (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone) only for the English language, or do they work for other languages too? Is there a phonetic filter for Portuguese? If not, how can I implement one?
>
> Thanks
>
Re: Improving Solr Spell Checker Results
Hey James,

I have played around a bit more with the settings and tried setting spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3. This yields 'Sigourney Weaver' as ONE of the corrections, but it's the second one and not the first. That seems wrong if this is a patch for context sensitivity, because it doesn't really seem to honor any context at all. Unless I am misunderstanding this?

Also, I don't really like maxResultsForSuggest, as it means 'all or nothing'. If you set it to 10 and there are 100 results, then you offer no corrections at all, even if the term is missing from the dictionary entirely. If I set spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3 and choose the collation with the largest 'hits', I get 'Sigourney Weaver' and other 'popular' terms. But say I searched for 'pork and chups': the 'popular' correction is 'park and chips', whereas the first correction was correct: 'pork and chips'.

So really, none of the solutions, either in this patch or in Solr, offer what I would truly call context-sensitive spell checking. That being: in a full-text search engine you find documents based on terms and how close together they are in the document. It makes more than perfect sense to treat the dictionary like this, so that when there are multiple terms it offers suggestions for terms that closely match what's entered surrounding the term. Example: "Sigourney Wever" would never appear in a document, ever. "Sigourney Weaver", however, has many 'hits' in exactly that order of words. So there needs to be a way to boost suggestions based on adjacency... much like the full-text search operates. Thoughts?

David

On 22/01/2012 9:56 PM, David Radunz wrote: James, I worked out that I actually needed to 'apply' patch SOLR-2585, whoops. So I have done that now, and it seems to return 'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could something have changed in the trunk to make your patch no longer work?
I had to manually merge the setup for the test case due to a new 'hyphens' test case. The settings I am using are: explicit 10 false 10 true true true 10 1 5 1 default spell solr.DirectSolrSpellChecker internal 0.5 2 1 5 4 0.01 spellchecker true

With the query: spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5

Cheers,
David

On 22/01/2012 2:03 AM, David Radunz wrote: James, Thanks again for your lengthy and informative response. I updated from SVN trunk again today and was successfully able to run 'ant test'. So I proceeded with trying your suggestions (for question 1 so far):

On 17/01/2012 5:32 AM, Dyer, James wrote: David, The spellchecker normally won't give suggestions for any term in your index. So even if "wever" is misspelled in context, if it exists in the index the spell checker will not try correcting it. There are 3 workarounds:

1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). See https://issues.apache.org/jira/browse/SOLR-2585

I have tried using this with the original test case of 'Signorney Wever'. I didn't notice any difference, although I am a little unclear as to what exactly this patch does. Nor am I really clear what to set either of the options to, so I set them both to '5'. I tried to find the test case it mentions, but it's not present in SpellCheckCollatorTest.java .. Any suggestions?

2. try "onlyMorePopular=true" in your request. (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular). But see the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it would.

Trying this did produce 'Signourney Weaver' as you would hope, but I am a little afraid of the downside.
I would much prefer a context-sensitive spell check that involves the terms around the correction.

3. If you're building your index on a, you can add a stopword filter that filters out all of the misspelt or rare words from the field the dictionary is based on. This could be an arduous task, and it may or may not work well for your data.

I am currently using a copyField for all terms that are relevant, which is quite a lot, and the dictionary would encompass a huge amount of data. Adding stopword filters would be out of the question, as we presently have more than 30,000 products, and this is just for the initial launch; we intend to have many, many more.

As for your second question, I take it you're using (e)dismax with multiple fields in "qf", right? The only way I know to handle this is to create a that combines all of the fields you search across. Use this combined field to base your dic
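David's adjacency idea (prefer suggestions that actually co-occur with the neighboring query terms) can be sketched independently of Solr. The bigram counts below are hypothetical; a real version would derive them from the index, for example by running a phrase query for each candidate collation and comparing hit counts:

```java
import java.util.*;

// Illustrative sketch of adjacency-aware suggestion ranking: among
// candidate corrections for a term, prefer the one that most often
// appears next to the surrounding word in the corpus. The bigram
// counts are hypothetical stand-ins for index statistics.
public class AdjacencyRanker {

    static String best(String neighbor, List<String> candidates,
                       Map<String, Integer> bigramCounts) {
        return candidates.stream()
                .max(Comparator.comparingInt(
                        c -> bigramCounts.getOrDefault(neighbor + " " + c, 0)))
                .orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        Map<String, Integer> bigrams = new HashMap<>();
        bigrams.put("sigourney weaver", 120); // frequent phrase in the corpus
        bigrams.put("sigourney wever", 0);    // never occurs
        String pick = best("sigourney",
                Arrays.asList("wever", "weaver"), bigrams);
        System.out.println(pick); // weaver
    }
}
```

This is essentially what spellcheck.maxCollationTries approximates by test-running whole collations, but scoring per-neighbor adjacency explicitly, as sketched here, is the part Solr did not offer out of the box at the time.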
Re: Phonetic search for portuguese
On Sun, Jan 22, 2012 at 5:47 PM, Anderson vasconcelos wrote:
> Anyone could help?
>
> Thanks
>
> 2012/1/20, Anderson vasconcelos :
>> Hi
>>
>> The phonetic filters (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone) is only for english language or works for other languages? Have some phonetic filter for portuguese? If dont have, how i can implement this?

We did this, in another context, by using the open-source aspell library to handle the spell-checking for us. This has distinct advantages, as aspell is well tested, handles soundslike in a better manner (at least IMHO), and supports a wide variety of languages, including Portuguese.

There are some drawbacks, as aspell only has C/C++ interfaces, and hence we built bindings on top of SWIG. Also, we handled the integration with Solr via a custom filter factory, though there are better ways to do this. Such a project would thus have dependencies on aspell and our custom code. If there is interest in this, we would be happy to open-source the code; given our current schedule, this could take 2-3 weeks.

Regards,
Gora
Re: How to Sort in a Different Way
What kind of new sorting do you want? If you want to change Lucene's score of how relevant the result is, you may play with boosting. If you just want to sort on fields, you can use "sort=fieldname" to sort on string, integer, or date fields.

Yunfei

On Sat, Jan 21, 2012 at 8:39 AM, Bing Li wrote:
> Dear all,
>
> I have a question about sorting data retrieved from Solr. As I know, Lucene retrieves data according to the degree of keyword matching on a text field (partial matching).
>
> If I search data by string field (complete matching), how does Lucene sort the retrieved data?
>
> If I want to add new sorting ways, how do I do that? Now I have to load all of the matched data from Solr and rank it again in my own way before showing it to users. Is that correct?
>
> Thanks so much!
> Bing
>
Re: "index-time" over boosted
Hi,

I got it wrong in the beginning by putting omitNorms in the query URL. Now, following your advice, I merged the schema.xml from Nutch and Solr and made sure omitNorms was set to "true" for the content field, just as you said. Unfortunately the problem remains :-(

On Thursday, January 19, 2012, Jan Høydahl wrote:
> Hi,
>
> The schema you pasted in your mail is NOT Solr 3.5's default example schema. Did you get it from the Nutch project?
>
> And the "omitNorms" parameter is supposed to go in the tag in schema.xml, and the "content" field in the example schema does not have omitNorms="true". Try to change > > > to > > > and try again. Please note that you SHOULD customize your schema; there is really no "default" schema in Solr (or Nutch), it's only an example or starting point. For your search application to work well, you will have to invest some time in designing a schema, working with your queries, perhaps exploring the DisMax query parser, etc.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 19. jan. 2012, at 13:01, remi tassing wrote:
>
>> Hello Jan,
>>
>> My schema wasn't changed from the release 3.5.0. The content can be seen below:
>>
>> >>>sortMissingLast="true" omitNorms="true"/> >>>omitNorms="true"/> >>>omitNorms="true"/> >>>positionIncrementGap="100"> >> >> >>>ignoreCase="true" words="stopwords.txt"/> >>>generateWordParts="1" generateNumberParts="1" >>catenateWords="1" catenateNumbers="1" catenateAll="0" >>splitOnCaseChange="1"/> >> >>>protected="protwords.txt"/> >> >> >> >>>positionIncrementGap="100"> >> >> >> >>>generateWordParts="1" generateNumberParts="1"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >>
Re: Trying to understand SOLR memory requirements
I take it from the overwhelming silence on the list that what I've asked is not possible? It seems like the Suggester component is not well supported or understood, and is limited in functionality. Does anyone have any ideas for how I could implement the functionality I'm looking for?

I'm trying to implement a single location auto-suggestion box that will search across multiple DB tables. It would take several possible inputs: city, state, country; state, country; or country. In addition, there are many aliases for each city, state, and country that map back to the original city/state/country. Once the user selects a suggestion, that suggestion needs to have certain information associated with it. It seems that the Suggester component is not the right tool for this. Anyone have other ideas?

Thanks,
Dave

On Thu, Jan 19, 2012 at 6:09 PM, Dave wrote:
> That was how I originally tried to implement it, but I could not figure out how to get the suggester to return anything but the suggestion. How do you do that?
>
> On Thu, Jan 19, 2012 at 1:13 PM, Robert Muir wrote:
>> I really don't think you should put a huge json document as a search term.
>>
>> Just make "Brooklyn, New York, United States" or whatever you intend the user to actually search on/type in as your search term. Put the rest in different fields (e.g. stored-only, not even indexed if you don't need that) and have solr return it that way.
>> >> On Thu, Jan 19, 2012 at 12:31 PM, Dave wrote: >> > In my original post I included one of my terms: >> > >> > Brooklyn, New York, United States?{ |id|: |2620829|, >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| }, >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|: >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|: >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|: >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, >> United >> > States| } >> > >> > I'm matching on the first part of the term (the part before the ?), and >> > then the rest is being passed via JSON into Javascript, then converted >> to a >> > JSON term itself. Here is my data-config.xml file, in case it sheds any >> > light: >> > >> > >> > > > driver="com.mysql.jdbc.Driver" >> > url="" >> > user="" >> > password="" >> > encoding="UTF-8"/> >> > >> >> >pk="id" >> >query="select p.id as placeid, c.id, c.plainname, c.name, >> > p.timezone from countries c, places p where p.regionid = 1 AND p.cityid >> = 1 >> > AND c.id=p.countryid AND p.settingid=1" >> >transformer="TemplateTransformer"> >> > >> > >> > >> > >> > >> >> template="${countries.plainname}?{ >> > |id|: |${countries.placeid}|, |timezone|:|${countries.timezone}|,|type|: >> > |1|, |country|: { |id| : |${countries.id}|, |plainname|: >> > |${countries.plainname}|, |name|: |${countries.plainname}| }, |region|: >> { >> > |id| : |0| }, |city|: { |id|: |0| }, |hint|: ||, |label|: >> > |${countries.plainname}|, |value|: |${countries.plainname}|, |title|: >> > |${countries.plainname}| }"/> >> > >> >> >pk="id" >> >query="select p.id as placeid, p.countryid as countryid, >> > c.plainname as countryname, p.timezone as timezone, r.id as regionid, >> > r.plainname as regionname, r.population as regionpop from places p, >> regions >> > r, countries c where r.id = p.regionid AND p.settingid = 1 AND >> p.regionid > >> > 1 AND p.countryid=c.id AND 
p.cityid=1 AND r.population > 0" >> >transformer="TemplateTransformer"> >> > >> > >> > >> > >> > >> > >> > >> >> >pk="id" >> >query="select c2.id as cityid, c2.plainname as cityname, >> > c2.population as citypop, p.id as placeid, p.countryid as countryid, >> > c.plainname as countryname, p.timezone as timezone, r.id as regionid, >> > r.plainname as regionname from places p, regions r, countries c, cities >> c2 >> > where c2.id = p.cityid AND p.settingid = 1 AND p.regionid > 1 AND >> > p.countryid=c.id AND r.id=p.regionid" >> >transformer="TemplateTransformer"> >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > On Thu, Jan 19, 2012 at 11:52 AM, Robert Muir wrote: >> > >> >> I don't think the problem is FST, since it sorts offline in your case. >> >> >> >> More importantly, what are you trying to put into the FST? >> >> >> >> it appears you are indexing terms from your term dictionary, but your >> >> term dictionary is over 1GB, why is that? >> >> >> >> what do your terms look like? 1GB for 2,784,937 documents does not make >> >> sense. >> >> for ex
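Robert's advice boils down to: index only the human-readable suggestion text, and carry the metadata in stored-only fields instead of packing JSON into the term itself. A hypothetical schema.xml sketch of that layout (field and type names are illustrative, not taken from the thread's actual schema):

```xml
<!-- Hypothetical sketch: the user types against suggest_text only; -->
<!-- the metadata travels along as stored-only fields in the response. -->
<field name="suggest_text" type="text_general" indexed="true"  stored="true"/>
<field name="place_id"     type="string"       indexed="false" stored="true"/>
<field name="timezone"     type="string"       indexed="false" stored="true"/>
<field name="place_type"   type="string"       indexed="false" stored="true"/>
```

With this shape, a normal search handler (rather than the Suggester component) can return the stored fields alongside each matched suggestion, which is the part the Suggester alone could not do.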
facet pivot and range
Hello,

I can't find anything related to what I would like to do: a facet.pivot with ranges on the second level, something like facet.pivot=cat,price where price is a range facet: facet.range=price&facet.range.start=0&facet.range.end=1000&facet.range.gap=10

Is it doable with Solr 4? How did you do it?

Thanks so much
Re: Phonetic search for portuguese
Hi Gora, thanks for the reply. I'm interested in seeing how you did this. But my time is short, and I need to create a solution for my client soon. If anyone knows some other simple and fast solution, please post it on this thread. Gora, could you talk about how you implemented the custom filter factory and how you used it in Solr?

Thanks

2012/1/22, Gora Mohanty :
> On Sun, Jan 22, 2012 at 5:47 PM, Anderson vasconcelos wrote:
>> Anyone could help?
>>
>> Thanks
>>
>> 2012/1/20, Anderson vasconcelos :
>>> Hi
>>>
>>> The phonetic filters (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone) is only for english language or works for other languages? Have some phonetic filter for portuguese? If dont have, how i can implement this?
>
> We did this, in another context, by using the open-source aspell library to handle the spell-checking for us. This has distinct advantages as aspell is well-tested, handles soundslike in a better manner at least IMHO, and supports a wide variety of languages, including Portuguese.
>
> There are some drawbacks, as aspell only has C/C++ interfaces, and hence we built bindings on top of SWIG. Also, we handled the integration with Solr via a custom filter factory, though there are better ways to do this. Such a project would thus have dependencies on aspell, and our custom code. If there is interest in this, we would be happy to open source this code: Given our current schedule this could take 2-3 weeks.
>
> Regards,
> Gora
>
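While waiting on Gora's aspell-based code, the general shape of a Portuguese "soundslike" normalizer can be sketched in plain Java. The rewrite rules below are a tiny illustrative subset, not a validated phonetic algorithm for Portuguese; a real Solr integration would wrap logic like this in a custom TokenFilter plus TokenFilterFactory:

```java
import java.util.Locale;

// Toy sketch of a Portuguese-oriented phonetic normalizer. The rules
// are a small, hypothetical subset chosen only to illustrate the idea
// of collapsing spellings that sound alike.
public class PortuguesePhoneticSketch {

    static String encode(String word) {
        String w = word.toLowerCase(Locale.ROOT)
                .replace("ç", "s");              // ç sounds like s
        // Strip accents: ã -> a, é -> e, etc.
        w = java.text.Normalizer.normalize(w, java.text.Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        return w.replaceAll("ch", "x")           // "chave" / "xave" sound alike
                .replaceAll("lh", "li")
                .replaceAll("nh", "ni")
                .replaceAll("ss", "s")
                .replaceAll("z\\b", "s");        // final z sounds like s
    }

    public static void main(String[] args) {
        System.out.println(encode("coração")); // corasao
        System.out.println(encode("chave"));   // xave
        System.out.println(encode("xave"));    // xave (same key as "chave")
    }
}
```

The point of the sketch is only the mechanism: two differently spelled words that sound alike should map to the same key, which is what lets a phonetic field match misspellings at query time.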
Re: Sort for Retrieved Data
See below.

On Fri, Jan 20, 2012 at 10:42 AM, Bing Li wrote:
> Dear all,
>
> I have a question about sorting data retrieved from Solr. As I know, Lucene retrieves data according to the degree of keyword matching on a text field (partial matching).
>
> If I search data by string field (complete matching), how does Lucene sort the retrieved data?

If scores match exactly, which may well be the case here, then the tiebreaker is the internal Lucene document id.

> If I add some filters, such as time, what about the sorting way?

It doesn't change. Filters only restrict the result set; they have no influence on sorting.

> If I just need to top ones, is it proper to just add rows?

I don't understand what you're asking. If you want the top 100 rather than the top 10, yes, you can increase the &rows parameter or page (see &start).

> If I want to add new sorting ways, how to do that?

See the &sort parameter. This page comes up as the first Google result for "solr sort": http://lucene.apache.org/solr/tutorial.html

Best
Erick

> Thanks so much!
> Bing
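Erick's points about &sort, &rows, and &start combine into a single request. A sketch, where the collection URL and the field name title_s are assumptions, not taken from Bing's schema:

```text
# Top 100 hits sorted by the string field "title_s", ties broken by score,
# returning results 0-99 (use start=100 for the next page):
http://localhost:8983/solr/select?q=*:*&sort=title_s+asc,score+desc&start=0&rows=100
```

Since filters only restrict the result set, any fq parameters added to this request would shrink the 100 hits but leave their order untouched.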
Re: Getting a word count frequency out of a page field
Faceting won't work at all. Its function is to return the count of the *documents* that a value occurs in, so that's no good for your use case. "I don't know how to issue a proper SOLR query that returns a word count for a paragraph of text such as the term "amplifier" for a field. For some reason it only returns." This is really unclear. Are you asking for the word counts of a paragraph that contains "amplifier"? The number of times "amplifier" appears in a paragraph? In a document? And why do you want this information anyway? It might be an XY problem. Best Erick On Fri, Jan 20, 2012 at 1:06 PM, solr user wrote: > SOLR reports the term occurrence for terms over all the documents. I am > having trouble making a query that returns the term occurrence in a > specific page field called, documentPageId. > > I don't know how to issue a proper SOLR query that returns a word count for > a paragraph of text such as the term "amplifier" for a field. For some > reason it only returns. > > The things I've tried only return a count for 1 occurrence of the term even > though I see the term in the paragraph more than just once. > > I've tried faceting on the field, "contents" > > http://localhost:8983/solr/select?indent=on&q=*:*&wt=standard&facet=on&facet.field=documentPageId&facet.query=amplifier&facet.sort=lex&facet.missing=on&facet.method=count > > > > 21 > > > > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 1 > 0 > > > > > > > > > In schema.xml: > indexed="true" /> > multiValued="false"/> > > In solrconfig.xml: > > filewrapper > caseNumber > pageNumber > documentId > contents > documentId > caseNumber > pageNumber > documentPageId > contents > > Thanks in advance,
Re: Validating solr user query
Good luck with that. If you allow free-form input, bad queries are just going to happen. To prevent them from getting to Solr, you essentially have to reproduce the entire Solr/Lucene parser. So why not just let the parser do it for you and present some pretty message to the user?

The other thing you can do is build your own "advanced query page" that guides the user through adding parentheses, ands, ors, nots, fuzzy, all that jazz, but that's often really painful to do. But other than making a UI that makes it difficult to enter bad queries, or parsing the query yourself, you're pretty much stuck...

Best
Erick

On Fri, Jan 20, 2012 at 2:52 PM, Dipti Srivastava wrote:
> Hi All,
> I am using HTTP/JSON to search my documents in Solr. The client provides the query on which the search is based. What is a good way to validate the query string provided by the user?
>
> On the other hand, if I want the user to build this query using some Solr API instead of preparing a Lucene query string, which API can I use for this? I looked into SolrQuery in SolrJ, but it does not appear to have a way to specify more complex queries with boolean operators and operators such as ~, +, -, etc.
>
> Basically, I am trying to avoid running into bad query strings built by the caller.
>
> Thanks!
> Dipti
>
> This message is private and confidential. If you have received it in error, please notify the sender and remove it from your system.
>
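A third option between full validation and a query-builder UI is to escape the user's raw input so Lucene query syntax characters are treated literally. SolrJ ships a helper for this (ClientUtils.escapeQueryChars); the standalone sketch below mirrors that idea, and the exact character set should be checked against your Solr version:

```java
// Sketch of escaping Lucene query syntax characters in user input so a
// raw string can never be parsed as operators. The character list is the
// commonly cited special-character set; verify it against your version.
public class QueryEscaper {

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&;/".indexOf(c) >= 0
                    || Character.isWhitespace(c)) {
                sb.append('\\'); // prefix each special character
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("foo:(bar)")); // foo\:\(bar\)
    }
}
```

This sidesteps "bad query" errors entirely for simple keyword boxes, at the cost of disabling operators; user input that should support operators still needs the parse-and-report-errors approach Erick describes.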
Re: Getting a word count frequency out of a page field
See comments inline below.

On Sun, Jan 22, 2012 at 8:27 PM, Erick Erickson wrote:
> Faceting won't work at all. Its function is to return the count of the *documents* that a value occurs in, so that's no good for your use case.
>
> "I don't know how to issue a proper SOLR query that returns a word count for a paragraph of text such as the term "amplifier" for a field. For some reason it only returns."
>
> This is really unclear. Are you asking for the word counts of a paragraph that contains "amplifier"? The number of times "amplifier" appears in a paragraph? In a document?

I'm looking for the number of times the word or term appears in a paragraph that I'm indexing as the field named "contents". I'm storing and indexing the field "contents", which contains multiple occurrences of the term/word. However, when I query for that term, it only reports that the word/term appeared once in the field.

> And why do you want this information anyway? It might be an XY problem.

I want to be able to search for word frequency for a page in a document that has many pages, so I can report to the user that the term/word occurred on page 1 "10" times. The user can then click on the result and go right to the page where the word/term appeared most frequently. What do you mean by an XY problem?

> Best
> Erick
>
> On Fri, Jan 20, 2012 at 1:06 PM, solr user wrote:
> > SOLR reports the term occurrence for terms over all the documents. I am having trouble making a query that returns the term occurrence in a specific page field called, documentPageId.
> >
> > I don't know how to issue a proper SOLR query that returns a word count for a paragraph of text such as the term "amplifier" for a field. For some reason it only returns.
> >
> > The things I've tried only return a count for 1 occurrence of the term even though I see the term in the paragraph more than just once.
> > > > I've tried faceting on the field, "contents" > > > > > http://localhost:8983/solr/select?indent=on&q=*:*&wt=standard&facet=on&facet.field=documentPageId&facet.query=amplifier&facet.sort=lex&facet.missing=on&facet.method=count > > > > > > > > 21 > > > > > > > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 1 > > 0 > > > > > > > > > > > > > > > > > > In schema.xml: > > > indexed="true" /> > > > multiValued="false"/> > > > > In solrconfig.xml: > > > > filewrapper > > caseNumber > > pageNumber > > documentId > > contents > > documentId > > caseNumber > > pageNumber > > documentPageId > > contents > > > > Thanks in advance, >
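For per-document (here, per-page) counts of a term, two Solr mechanisms may fit better than faceting: the TermVectorComponent can return term frequencies per document (tv.tf, requires termVectors="true" on the field), and trunk/4.x builds have a termfreq(field,term) function query that can be requested as a pseudo-field. A hypothetical request, assuming the field names from this thread and a build that supports the function:

```text
# Return each matching page id together with how many times "amplifier"
# occurs in its contents field (illustrative; verify termfreq() exists
# in your build before relying on it):
http://localhost:8983/solr/select?q=contents:amplifier&fl=documentPageId,termfreq(contents,'amplifier')&rows=100
```

The client can then sort pages by that per-page count to send the user to the page where the term appears most often, which is the goal described above.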
Re: Failure noticed from new...@zju.edu.cn
I've seen the spam filter be pretty aggressive with HTML formatting etc, what happens when you just send them as "plain text"? Best Erick On Sat, Jan 21, 2012 at 7:24 AM, David Radunz wrote: > Hey, > > Every time I send a reply to the list I get a failure for > new...@zju.edu.cn. Should I just ignore this? I am unsure if the message has > been delivered... > > Cheers, > > David
Re: Tika0.10 language identifier in Solr3.5.0
Would "doing the right thing" include firing the results at different fields based on the language detected? Your answer to Jan seems to indicate not, in which case my original comments stand. The main point is that mixing all the *results* of the analysis chains for multiple languages into a single field will likely result in "interesting" behavior. Not to say it won't be satisfactory in your situation, but there are edge cases.

Best
Erick

On Fri, Jan 20, 2012 at 9:15 AM, Ted Dunning wrote:
> I think you misunderstood what I am suggesting.
>
> I am suggesting an analyzer that detects the language and then "does the right thing" according to the language it finds. As such, it would tokenize and stem English according to English rules, German by German rules, and would probably do a sliding bigram window in Japanese and Chinese.
>
> On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson wrote:
>> bq: Why not have a polyglot analyzer
>>
>> That could work, but it makes some compromises and assumes that your languages are "close enough". I have absolutely no clue how that would work for English and Chinese, say.
>>
>> But it also introduces inconsistencies. Take stemming. Even though you could easily stem in the correct language, throwing all those stems into the same field can produce interesting results at search time, since you run the risk of hitting something produced by one of the other analysis chains.
>>
Re: Improving Solr Spell Checker Results
Hey, I am trying to send this again as 'plain-text' to see if it delivers ok this time. All of the previous messages I sent should be below. Cheers, David On 22/01/2012 11:42 PM, David Radunz wrote: Hey James, I have played around a bit more with the settings and tried setting spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3. This yields 'Sigourney Weaver' as ONE of the corrections, but it's the second one and not the first. Which is wrong if this is a patch for 'context sensitive', because it doesn't really seem to honor any context at all. Unless I am misunderstanding this? Also, I don't really like maxResultsForSuggest as it means 'all or nothing'. If you set it to 10 and there are 100 results, then you offer no corrections at all even if the term is missing in the dictionary entirely. If I set spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3 and choose the collation with the largest 'hits' I get Sigourney Weaver and other 'popular' terms. But say I searched for 'pork and chups', the 'popular' correction is 'park and chips' whereas the first correction was correct: 'pork and chips'. So really, none of the solutions either in this patch or Solr offer what I would truly call context sensitive spell checking. That being, in a full text search engine you find documents based on terms and how close they are together in the document. It makes more than perfect sense to treat the dictionary like this, so that when there are multiple terms it offers suggestions for the terms that match closely to what's entered surrounding the term. Example: "Sigourney Wever" would never appear in a document ever. "Sigourney Weaver" however has many 'hits' in exactly that order of words. So there needs to be a way to boost suggestions based on adjacency... much like how full-text search operates. Thoughts? David On 22/01/2012 9:56 PM, David Radunz wrote: James, I worked out that I actually needed to 'apply' patch SOLR-2585, whoops. 
So I have done that now and it seems to return 'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could something have changed in the trunk to make your patch no longer work? I had to manually merge the setup for the test case due to a new 'hyphens' test case. The settings I am using are: explicit 10 false 10 true true true 10 1 5 1 default spell solr.DirectSolrSpellChecker internal 0.5 2 1 5 4 0.01 spellchecker true With the query: spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5 Cheers, David On 22/01/2012 2:03 AM, David Radunz wrote: James, Thanks again for your lengthy and informative response. I updated from SVN trunk again today and was successfully able to run 'ant test'. So I proceeded with trying your suggestions (for question 1 so far): On 17/01/2012 5:32 AM, Dyer, James wrote: David, The spellchecker normally won't give suggestions for any term in your index. So even if "wever" is misspelled in context, if it exists in the index the spell checker will not try correcting it. There are 3 workarounds: 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). See https://issues.apache.org/jira/browse/SOLR-2585 I have tried using this with the original test case of 'Signorney Wever'. I didn't notice any difference, although I am a little unclear as to what exactly this patch does. Nor am I really clear what to set either of the options to, so I set them both to '5'. I tried to find the test case it mentions, but it's not present in SpellCheckCollatorTest.java. Any suggestions? 2. try "onlyMorePopular=true" in your request. (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular). 
But see the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it would. Trying this did produce 'Sigourney Weaver' as you would hope, but I am a little afraid of the downside. I would much more like a context sensitive spell check that involves the terms around the correction. 3. If you're building your index on a, you can add a stopword filter that filters out all of the misspelt or rare words from the field the dictionary is based on. This could be an arduous task, and it may or may not work well for your data. I am currently using a copyField for all terms that are relevant, which is quite a lot, and the dictionary would encompass a huge amount of data. Adding stopword filters would be out of the question as we presently have more than 30,000 products and this is for the initial launch; we intend to have many, many more. As for your second question, I ta
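David's adjacency idea above — boost suggestions that actually co-occur with the neighboring query terms — could be sketched like this. This is a toy illustration, not how Solr's spellchecker works; the bigram hit counts are invented, and a real implementation would pull them from the index (term positions or a shingle field).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy sketch of context-aware suggestion ranking: prefer candidate
// corrections that frequently form a bigram with the preceding query
// term, so "sigourney wever" corrects to "weaver" rather than "waver".
// The bigram-count source is a hypothetical stand-in for index data.
public class ContextAwareSuggester {
    private final Map<String, Integer> bigramHits;

    ContextAwareSuggester(Map<String, Integer> bigramHits) {
        this.bigramHits = bigramHits;
    }

    // Rank candidate corrections for a term by how often each forms a
    // bigram with the preceding query term (highest count first).
    List<String> rank(String previous, List<String> candidates) {
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingInt(
                (String c) -> bigramHits.getOrDefault(previous + " " + c, 0))
            .reversed());
        return ranked;
    }
}
```

The design point matches David's observation: "Sigourney Wever" never occurs as a phrase in real documents, while "Sigourney Weaver" does, so phrase frequency is a strong context signal.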
Re: Failure noticed from new...@zju.edu.cn
Hey, That seems to have helped, I didn't get a failure notice re-sending the message. I'll have to keep that in mind. Thanks very much, David On 23/01/2012 12:41 PM, Erick Erickson wrote: I've seen the spam filter be pretty aggressive with HTML formatting etc, what happens when you just send them as "plain text"? Best Erick On Sat, Jan 21, 2012 at 7:24 AM, David Radunz wrote: Hey, Every time I send a reply to the list I get a failure for new...@zju.edu.cn. Should I just ignore this? I am unsure if the message has been delivered... Cheers, David
Re: Improving Solr Spell Checker Results
I can't help with your *real* problem, but when looking at patches, if the "resolution" field isn't set to something like "fixed" it means that the patch has NOT been applied to any code lines. There also should be commit revisions specified in the comments. If "Fix Versions" has values, that doesn't mean the patch has been applied either, that's often just a statement of where the patch *should* go. And, between the time someone uploads a patch and it actually gets *committed*, the underlying code line can, indeed, change and the patch doesn't apply cleanly. Since you've already had to do this, could you upload your version that *does* apply cleanly? Best Erick On Sun, Jan 22, 2012 at 2:56 AM, David Radunz wrote: > James, > > I worked out that I actually needed to 'apply' patch SOLR-2585, whoops. > So I have done that now and it seems to return 'correctlySpelled=true' for > 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could > something have changed in the trunk to make your patch no longer work? I had > to manually merge the setup for the test case due to a new 'hyphens' test > case. The settings I am use are: > > > explicit > 10 > > false > 10 > true > true > true > 10 > 1 > > 5 > 1 > > > > > default > spell > solr.DirectSolrSpellChecker > > > internal > > 0.5 > > 2 > > 1 > > 5 > > 4 > > 0.01 > > > > spellchecker > true > > > With the query: > > spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5 > > Cheers, > > David > > > > On 22/01/2012 2:03 AM, David Radunz wrote: >> >> James, >> >> Thanks again for your lengthy and informative response. I updated from >> SVN trunk again today and was successfully able to run 'ant test'. 
So I >> proceeded with trying your suggestions (for question 1 so far): >> >> On 17/01/2012 5:32 AM, Dyer, James wrote: >>> >>> David, >>> >>> The spellchecker normally won't give suggestions for any term in your >>> index. So even if "wever" is misspelled in context, if it exists in the >>> index the spell checker will not try correcting it. There are 3 >>> workarounds: >>> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). >>> See https://issues.apache.org/jira/browse/SOLR-2585 >> >> I have tried using this with the original test case of 'Signorney Wever'. >> I didn't notice any difference, although I am a little unclear as to what >> exactly this patch does. Nor am I really clear what to set either of the >> options to, so I set them both to '5'. I tried to find the test case it >> mentions, but it's not present in SpellCheckCollatorTest.java .. Any >> suggestions? >> >>> 2. try "onlyMorePopular=true" in your request. >>> (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular). >>> But see the September 2, 2011 comment in SOLR-2585 about why this might not >>> do what you'd hope it would. >> >> >> Trying this did produce 'Signourney Weaver' as you would hope, but I am a >> little afraid of the downside. I would much more like a context sensative >> spell check that involves the terms around the correction. >>> >>> >>> 3. If you're building your index on a, you can add a >>> stopword filter that filters out all of the misspelt or rare words from the >>> field that the dictionary is based. This could be an arduous task, and it >>> may or may not work well for your data. >> >> I am currently using a copyField for all terms that are relevant, which is >> quite a lot and the dictionary would encompass a huge amount of data. Adding >> stopword filters would be out of the question as we presently have more than >> 30,000 products and this is for the initial launch, we intend to have many >> many more. 
>>> >>> >>> As for your second question, I take it you're using (e)dismax with >>> multiple fields in "qf", right? The only way I know to handle this is to >>> create a that combines all of the fields you search across. Use >>> this combined field to base your dictionary. Also, specifying >>> "spellcheck.maxCollationTries" with a non-zero value will weed out the >>> nonsense word combinations that are likely to occur when doing this, >>> ensuring that any collations provided will indeed yield hits. The downside >>> to doing this, of course, is it will make your first problem more acute in >>> that there will be even more terms in your index that the spellchecker will >>> ignore entirely, even if they're mispelled in context. Once again, >>> SOLR-2585 is designed to tackle this problem but it is still in its early >>> stages, and thus far it is Trunk-only. >> >> I tried setting spellcheck.maxCollationTries to 5 to see if it would help >> with the above problem, but it did not. >> >> I have now tried using it in the context of question 2. I
Re: Improving Solr Spell Checker Results
Hey Erick, Sure, can you explain the process to create the patch and upload it and i'll do it first thing tomorrow. Thanks again for your help, David On 23/01/2012 12:51 PM, Erick Erickson wrote: I can't help with your *real* problem, but when looking at patches, if the "resolution" field isn't set to something like "fixed" it means that the patch has NOT been applied to any code lines. There also should be commit revisions specified in the comments. If "Fix Versions" has values, that doesn't mean the patch has been applied either, that's often just a statement of where the patch *should* go. And, between the time someone uploads a patch and it actually gets *committed*, the underlying code line can, indeed, change and the patch doesn't apply cleanly. Since you've already had to do this, could you upload your version that *does* apply cleanly? Best Erick On Sun, Jan 22, 2012 at 2:56 AM, David Radunz wrote: James, I worked out that I actually needed to 'apply' patch SOLR-2585, whoops. So I have done that now and it seems to return 'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could something have changed in the trunk to make your patch no longer work? I had to manually merge the setup for the test case due to a new 'hyphens' test case. The settings I am use are: explicit 10 false 10 true true true 10 1 5 1 default spell solr.DirectSolrSpellChecker internal 0.5 2 1 5 4 0.01 spellchecker true With the query: spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5 Cheers, David On 22/01/2012 2:03 AM, David Radunz wrote: James, Thanks again for your lengthy and informative response. I updated from SVN trunk again today and was successfully able to run 'ant test'. 
So I proceeded with trying your suggestions (for question 1 so far): On 17/01/2012 5:32 AM, Dyer, James wrote: David, The spellchecker normally won't give suggestions for any term in your index. So even if "wever" is misspelled in context, if it exists in the index the spell checker will not try correcting it. There are 3 workarounds: 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). See https://issues.apache.org/jira/browse/SOLR-2585 I have tried using this with the original test case of 'Signorney Wever'. I didn't notice any difference, although I am a little unclear as to what exactly this patch does. Nor am I really clear what to set either of the options to, so I set them both to '5'. I tried to find the test case it mentions, but it's not present in SpellCheckCollatorTest.java. Any suggestions? 2. try "onlyMorePopular=true" in your request. (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular). But see the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it would. Trying this did produce 'Sigourney Weaver' as you would hope, but I am a little afraid of the downside. I would much more like a context sensitive spell check that involves the terms around the correction. 3. If you're building your index on a, you can add a stopword filter that filters out all of the misspelt or rare words from the field the dictionary is based on. This could be an arduous task, and it may or may not work well for your data. I am currently using a copyField for all terms that are relevant, which is quite a lot, and the dictionary would encompass a huge amount of data. Adding stopword filters would be out of the question as we presently have more than 30,000 products and this is for the initial launch; we intend to have many, many more. As for your second question, I take it you're using (e)dismax with multiple fields in "qf", right? 
The only way I know to handle this is to create a field that combines all of the fields you search across. Use this combined field to base your dictionary. Also, specifying "spellcheck.maxCollationTries" with a non-zero value will weed out the nonsense word combinations that are likely to occur when doing this, ensuring that any collations provided will indeed yield hits. The downside to doing this, of course, is it will make your first problem more acute in that there will be even more terms in your index that the spellchecker will ignore entirely, even if they're misspelled in context. Once again, SOLR-2585 is designed to tackle this problem but it is still in its early stages, and thus far it is Trunk-only. I tried setting spellcheck.maxCollationTries to 5 to see if it would help with the above problem, but it did not. I have now tried using it in the context of question 2. I tried searching for 'Sigorney Wever' in the series name (which it's not present in, as it's an actor): sp
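Conceptually, what James describes for spellcheck.maxCollationTries could be sketched like this: test candidate collations against the index and keep only those that actually return hits, giving up after a bounded number of attempts. This is a hedged illustration, not Solr's implementation; the hit-count function is a stand-in for a real query against the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Toy sketch of collation verification: each candidate collation is
// "tried" against the index (stubbed by hitCount) and kept only if it
// yields hits, stopping after maxTries attempts or maxCollations keeps.
public class CollationFilter {
    static List<String> viableCollations(List<String> candidates,
                                         ToIntFunction<String> hitCount,
                                         int maxTries, int maxCollations) {
        List<String> kept = new ArrayList<>();
        int tries = 0;
        for (String c : candidates) {
            if (tries++ >= maxTries || kept.size() >= maxCollations) break;
            if (hitCount.applyAsInt(c) > 0) kept.add(c); // only collations that yield hits
        }
        return kept;
    }
}
```

This also shows the trade-off James mentions: verification can only reject collations built from dictionary terms; it cannot rescue a term the spellchecker never questioned because it exists in the index.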
Re: Improving Solr Spell Checker Results
David: There's some good info here: http://wiki.apache.org/solr/HowToContribute#Working_With_Patches But the short form is to go into solr_home and issue this command: 'svn diff > SOLR-2585.patch'. IDE's may also have a "create patch" feature, but I find the straight SVN command more reliable. Note I'm not saying that your patch will necessarily be picked up, but it's a thoughtful gesture to upload a more current patch. In your comments please identify what code line you're working on (4.x? 3.x?). And when you upload, down near the bottom of the dialog box there'll be a radio button about "grant ASF license" which is fairly important to click for legal reasons Thanks Erick On Sun, Jan 22, 2012 at 5:54 PM, David Radunz wrote: > Hey Erick, > > Sure, can you explain the process to create the patch and upload it and > i'll do it first thing tomorrow. > > Thanks again for your help, > > David > > > On 23/01/2012 12:51 PM, Erick Erickson wrote: >> >> I can't help with your *real* problem, but when looking at patches, >> if the "resolution" field isn't set to something like "fixed" it means >> that the patch has NOT been applied to any code lines. There >> also should be commit revisions specified in the comments. >> If "Fix Versions" has values, that doesn't mean the patch has >> been applied either, that's often just a statement of where >> the patch *should* go. >> >> And, between the time someone uploads a patch and it actually >> gets *committed*, the underlying code line can, indeed, change >> and the patch doesn't apply cleanly. Since you've already had >> to do this, could you upload your version that *does* apply >> cleanly? >> >> Best >> Erick >> >> On Sun, Jan 22, 2012 at 2:56 AM, David Radunz wrote: >>> >>> James, >>> >>> I worked out that I actually needed to 'apply' patch SOLR-2585, >>> whoops. >>> So I have done that now and it seems to return 'correctlySpelled=true' >>> for >>> 'Sigorney Wever' (when Sigorney isn't even in the dictionary). 
Could >>> something have changed in the trunk to make your patch no longer work? I >>> had >>> to manually merge the setup for the test case due to a new 'hyphens' test >>> case. The settings I am use are: >>> >>> >>> explicit >>> 10 >>> >>> false >>> 10 >>> true >>> true >>> true >>> 10 >>> 1 >>> >>> 5 >>> 1 >>> >>> >>> >>> >>> default >>> spell >>> solr.DirectSolrSpellChecker >>> >>> >>> internal >>> >>> 0.5 >>> >>> 2 >>> >>> 1 >>> >>> 5 >>> >>> 4 >>> >>> 0.01 >>> >>> >>> >>> spellchecker >>> true >>> >>> >>> With the query: >>> >>> >>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5 >>> >>> Cheers, >>> >>> David >>> >>> >>> >>> On 22/01/2012 2:03 AM, David Radunz wrote: James, Thanks again for your lengthy and informative response. I updated from SVN trunk again today and was successfully able to run 'ant test'. So I proceeded with trying your suggestions (for question 1 so far): On 17/01/2012 5:32 AM, Dyer, James wrote: > > David, > > The spellchecker normally won't give suggestions for any term in your > index. So even if "wever" is misspelled in context, if it exists in > the > index the spell checker will not try correcting it. There are 3 > workarounds: > 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). > See https://issues.apache.org/jira/browse/SOLR-2585 I have tried using this with the original test case of 'Signorney Wever'. I didn't notice any difference, although I am a little unclear as to what exactly this patch does. Nor am I really clear what to set either of the options to, so I set them both to '5'. I tried to find the test case it mentions, but it's not present in SpellCheckCollatorTest.java .. Any suggestions? > 2. try "onlyMorePopular=true" in your request. 
> > (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular). > But see the September 2, 2011 comment in SOLR-2585 about why this > might not > do what you'd hope it would. Trying this did produce 'Signourney Weaver' as you would hope, but I am a little afraid of the downside. I would much more like a context sensative spell check that involves the terms around the correction. > > > 3. If you're building your index on a, you can add a > stopword filter that filters out all of the misspelt or rare words from > the > field that the dictionary is based. This could be an arduous task, and > it > may or may not work well for your data. I am currently using a copyField for all terms that are relevant, which is >>
Re: Phonetic search for portuguese
On Mon, Jan 23, 2012 at 5:58 AM, Anderson vasconcelos wrote: > Hi Gora, thanks for the reply. > > I'm interesting in see how you did this solution. But , my time is not > to long and i need to create some solution for my client early. If > anyone knows some other simple and fast solution, please post on this > thread. What is your time line? I will see if we can expedite the open sourcing of this. > Gora, you could talk how you implemented the Custom Filter Factory and > how used this on SOLR? [...] That part is quite simple, though it is possible that I have not correctly addressed all issues for a custom FilterFactory. Please see: AspellFilterFactory: http://pastebin.com/jTBcfmd1 AspellFilter: http://pastebin.com/jDDKrPiK The latter loads a java_aspell library that is created with SWIG by setting up Java bindings on top of aspell, configured for the language of interest. Next, you will need a library that encapsulates various aspell functionality in Java. I am afraid that this is a little long: Suggest: http://pastebin.com/6NrGCVma Finally, you will have to set up the Solr schema to use this filter factory, e.g., one could create a new Solr TextField, where the solr.DoubleMetaphoneFilterFactory is replaced with com.mimirtech.search.solr.analysis.AspellFilterFactory We can discuss further how to set this up, but should probably take that discussion off-list. Regards, Gora
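To give a feel for what a phonetic filter for Portuguese does, here is a toy phonetic-key function standing in for the aspell-backed filter Gora describes. The rewrite rules below are illustrative only (and far from complete); the real solution delegates to aspell's language data through the SWIG bindings, and a production filter would live inside a Lucene TokenFilter.

```java
import java.text.Normalizer;

// Toy phonetic key for Portuguese: map words that sound alike to the
// same key, so "caça" and "cassa" collide at search time. The rules
// here are invented examples, not aspell's actual phonetic tables.
public class PtPhoneticKey {
    static String key(String word) {
        String k = word.toLowerCase()
            .replace("ç", "s")    // "caça" -> "casa"-like key
            .replace("ch", "x")   // "chuva" and "xuva" sound alike
            .replace("ss", "s")   // double s sounds like single s
            .replace("z", "s");   // final z often sounds like s
        // Strip accents so "avó" and "avo" produce the same key.
        k = Normalizer.normalize(k, Normalizer.Form.NFD)
             .replaceAll("\\p{M}", "");
        return k;
    }
}
```

In a schema, this normalization would sit where solr.DoubleMetaphoneFilterFactory sits for English, which is exactly the substitution Gora suggests with AspellFilterFactory.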
Re: Phonetic search for portuguese
Thanks a lot, Gora. I need to deliver the first release for my client on 25 January. With your explanation, I can better negotiate the delivery date of this feature for next month, because I have other business rules to deliver and this feature is more complex than I thought. I could help you share this solution with the Solr community. Maybe we can create a component on Google Code, or something like that, which any Solr user can use. 2012/1/23, Gora Mohanty : > On Mon, Jan 23, 2012 at 5:58 AM, Anderson vasconcelos > wrote: >> Hi Gora, thanks for the reply. >> >> I'm interesting in see how you did this solution. But , my time is not >> to long and i need to create some solution for my client early. If >> anyone knows some other simple and fast solution, please post on this >> thread. > > What is your time line? I will see if we can expedite the open > sourcing of this. > >> Gora, you could talk how you implemented the Custom Filter Factory and >> how used this on SOLR? > [...] > > That part is quite simple, though it is possible that I have not > correctly addressed all issues for a custom FilterFactory. > Please see: > AspellFilterFactory: http://pastebin.com/jTBcfmd1 > AspellFilter:http://pastebin.com/jDDKrPiK > > The latter loads a java_aspell library that is created by SWIG > by setting up Java bindings on top of SWIG, and configuring > it for the language of interest. > > Next, you will need a library that encapsulates various > aspell functionality in Java. I am afraid that this is a little > long: > Suggest: http://pastebin.com/6NrGCVma > > Finally, you will have to set up the Solr schema to use > this filter factory, e.g., one could create a new Solr > TextField, where the solr.DoubleMetaphoneFilterFactory > is replaced with > com.mimirtech.search.solr.analysis.AspellFilterFactory > > We can discuss further how to set this up, but should > probably take that discussion off-list. > > Regards, > Gora >
Re: Phonetic search for portuguese
On Mon, Jan 23, 2012 at 9:21 AM, Anderson vasconcelos wrote: > Thanks a lot Gora. > I need to delivery the first release for my client on 25 january. > With your explanation, i can negociate better the date to delivery of > this feature for next month, because i have other business rules for > delivery and this features is more complex than i thought. OK. I have ideas on how to improve this solution, but we can take these up at a later stage. We have tested this solution, and I know that it works. I will also be discussing with people here about how soon we can open source this. > I could help you to shared this solution with solr community. Maybe we > can create some component in google code, or something like that, wich > any solr user can use. Yes, I have been meaning to do that forever, but work has been intruding. We will put up something on BitBucket as soon as possible. Regards, Gora
Solr Cores
Hello, We have in production a number of individual Solr instances on a single JVM. As a result, we see that the PermGen space keeps increasing with each additional instance added. I would like to know if we can use Solr cores instead of individual instances. - Is there any limit to the number of cores for a single instance? - Will this decrease the PermGen space, since the libraries are shared? - Would there be any decrease in performance as the number of cores grows? - Anything else I should know before moving to cores? Any help would be appreciated. Regards Sujatha
Re: Search within words
Hi, Thanks for the reply. I am using NGramFilterFactory for this, but it's not working as desired. I have a field article_type that has been indexed using the below-mentioned field type. The field definition for indexing is : now the problem is that the article_type field has values like 'earring' and 'ring', and it's required that when we search for 'ring', 'earring' should also be returned. But it's not happening. What else needs to be done in order to achieve this? Any further help will be appreciated. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Search-within-words-tp3675210p3681044.html Sent from the Solr - User mailing list archive at Nabble.com.
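For reference, the mechanics behind the question above can be shown with a plain-Java illustration of what an n-gram filter emits. NGramFilterFactory indexes every gram between minGramSize and maxGramSize, so the grams of "earring" include "ring" — which is why a query for "ring" can match "earring". This is a self-contained sketch, not the Lucene filter itself:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy illustration of n-gram emission: all substrings of the term with
// lengths from min to max, mirroring what NGramFilterFactory indexes.
public class NGrams {
    static Set<String> grams(String term, int min, int max) {
        Set<String> out = new LinkedHashSet<>();
        for (int n = min; n <= max; n++)
            for (int i = 0; i + n <= term.length(); i++)
                out.add(term.substring(i, i + n));
        return out;
    }
}
```

One common pitfall worth checking here: maxGramSize must be at least the length of the query term (4 for "ring"), and the n-gram filter is usually applied in the index-time analyzer only, so the whole query term "ring" is matched against the stored grams of "earring" rather than being shredded into tiny grams itself.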