Re: Conceptual Question
Hi Yonik,

Sorry to jump on an old post.

> There is a change interface in JIRA, as long as all of the fields
> originally sent are stored.

Do you remember the JIRA issue, or a token to find it? It sounds useful in some cases, for example, when you are working on analysers. That could be real life for me in future.

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique
Re: add CJKTokenizer to solr
I'm jumping into the middle of the thread here.

CJK = Chinese, Japanese, Korean
German = etwas ganz anderes (something completely different)

Why are you trying to use CJKAnalyzer+Tokenizer for German? Have you tried the German Analyzer from Lucene contrib?

Otis
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message -----
From: Xuesong Luo <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, June 22, 2007 8:54:37 AM
Subject: RE: add CJKTokenizer to solr

Thanks, Toru and Chris,

I tried both the CJKTokenizer and CJKAnalyzer. Both return some unexpected highlight results when I tested with German. The field value I searched is "Ein Mann beißt den Hund". The search criterion is beißt.

When using CJKAnalyzer, beißt is treated as 2 single terms (bei and ß); the highlight result is:
Ein Mann beißt den Hund

When using CJKTokenizer, beißt is treated as 3 single terms; the result is:
Ein Mann beißt den Hund

When using the standard tokenizer, beißt is treated as a word; the result is:
Ein Mann beißt den Hund

I understand why the standard tokenizer treats beißt as a word, but I don't know how CJKAnalyzer and CJKTokenizer work. Could anyone explain a little bit?

Thanks
Xuesong

-----Original Message-----
From: Toru Matsuzawa [mailto:[EMAIL PROTECTED]
Sent: Monday, June 18, 2007 10:29 PM
To: solr-user@lucene.apache.org
Subject: Re: add CJKTokenizer to solr

I'm sorry. It was not possible to attach it, so I am sending it again.

> > I got the error below after adding CJKTokenizer to schema.xml. I
> > checked the constructor of CJKTokenizer, it requires a Reader parameter,
> > I guess that's why I get this error, I searched the email archive, it
> > seems working for other users. Does anyone know what is the problem?
>
> CJKTokenizerFactory that I am using is appended.
> --
package org.apache.solr.analysis.ja;

import java.io.Reader;

import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {
  /**
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer(input);
  }
}
--
Toru Matsuzawa
Re: add CJKTokenizer to solr
Hi Hoss.

I've done a few tests using reflection to instantiate a simple object, and the results vary a lot depending on the JVM. Because the JVM optimizes code as it is executed, the results also vary with usage, but I think we have something to consider:

I took 1,000 samples (5 clean runs x a loop of 200), each sample creating 100,000 objects, and the results were:

With reflection:
- Average: 0.0005418
- Worst (first clean execution): 0.0007760

Without reflection:
- Average: 0.469
- Worst (first clean execution): 0.0002140

Comparing these numbers, I can see that using reflection in the average case will cost 10 times more than creating the object without reflection.

But my question is: do we need to create factories so frequently, or are they just created once and re-used (are they thread safe)? The term Factory made me think of a class that is responsible for building instances of other classes, so usually they can be singletons... If they don't need to be created all the time, the reflection cost will not really matter, and the approach will give extra flexibility in terms of incorporating new Tokenizers (it would make it easier to keep Solr and Lucene versions less coupled).

Environment:
java version "1.5.0_07"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing)
Heap size: 256M
Running on a PowerPC - Mac OS X 10.4.9 with 1.5 GB RAM

Regards,
Daniel

On 21/6/07 20:39, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> : Why instead of that we don't create an UbberFactory that takes the Tokenizer
> : class as a parameter and instantiates the proper Tokenizer?
>
> The idea has come up before ... and there's really no reason why it
> wouldn't be okay to include a reflection-based factory like this in Solr
> -- it just hasn't been done yet.
>
> One of the reasons is that there are some performance costs associated
> with the reflection, so we wouldn't want to completely replace the existing
> "configuration via factory name" model with a "configure via class name
> and an uber factory does the reflection quietly in the background" model
> because it's the kind of approach that would really only make sense for
> simple prototypes -- in any system where you are really concerned about
> performance, reflection on every analyzer call would probably be pretty
> expensive. (although i'd love to see benchmarks prove me wrong)
>
> Another question in my mind is "why doesn't solr provide an optional jar
> with factories for every tokenizer/tokenfilter in the lucene contribs?"
> ... the only answer to that is that no one has bothered to crank out a
> patch that does it.
>
> http://www.nabble.com/Re%3A-making-schema.xml-nicer-to-read-use-p5939980.html
> http://www.nabble.com/foo-tf1737025.html#a4720545
>
> -Hoss

http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.
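For readers who want to try the comparison discussed above, here is a minimal sketch in plain Java. It is not Daniel's benchmark and not Solr code: `DemoTokenizer`, `ReflectionSketch`, and their methods are hypothetical names made up for illustration, and the timing loop is deliberately crude (no JIT warm-up control), so the numbers it prints should only be read as a rough indication.

```java
import java.lang.reflect.Constructor;

/** Hypothetical stand-in for a Tokenizer class; it is not a Lucene type. */
class DemoTokenizer {
    final String input;

    public DemoTokenizer(String input) {
        this.input = input;
    }
}

final class ReflectionSketch {
    /** Direct instantiation, which is what a hand-written factory does. */
    static DemoTokenizer createDirect(String input) {
        return new DemoTokenizer(input);
    }

    /** Reflective instantiation by class name, roughly what an "uber factory" would do. */
    static Object createByName(String className, String input) {
        try {
            Constructor<?> ctor = Class.forName(className).getDeclaredConstructor(String.class);
            return ctor.newInstance(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }

    /** Times `count` creations in nanoseconds; a rough measurement only. */
    static long timeNanos(boolean reflective, int count) {
        String name = DemoTokenizer.class.getName();
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            Object o = reflective ? createByName(name, "x") : createDirect("x");
            checksum += o.hashCode() & 1; // keep the object observably used
        }
        long elapsed = System.nanoTime() - start;
        if (checksum < 0) {
            throw new IllegalStateException("unreachable");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println("direct:     " + timeNanos(false, 100000) + " ns");
        System.out.println("reflection: " + timeNanos(true, 100000) + " ns");
    }
}
```

Note that `createByName` looks up the class and constructor on every call, which is the expensive pattern Hoss describes. Caching the `Constructor` once per configured class would likely narrow the gap, though that is an assumption worth measuring rather than a result from this thread.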
Re: add CJKTokenizer to solr
Sorry I've confused things a bit... The thread safeness have to be considered only on the Tokenizers, not on the factories. So are the Tokenizers thread safe?

Regards,
Daniel

On 22/6/07 11:36, "Daniel Alheiros" <[EMAIL PROTECTED]> wrote:

> But my question is: Do we need to create factories so frequently or the are
> just create once and re-used (are they thread safe)? The term Factory made
> me think of a class that is responsible for building others instance, so
> usually they can be singletons...
Re: add CJKTokenizer to solr
Tokenizers are not thread safe (I made a mistake yesterday saying they are - I don't know what I was thinking). This is why:

public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input;   < oops :(
  ...

public abstract class CharTokenizer extends Tokenizer {
  public CharTokenizer(Reader input) {
    super(input);
  }
  ...

Otis
--
Lucene Consulting -- http://lucene-consulting.com/

----- Original Message -----
From: Daniel Alheiros <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, June 22, 2007 12:43:50 PM
Subject: Re: add CJKTokenizer to solr

Sorry I've confused things a bit... The thread safeness have to be considered only on the Tokenizers, not on the factories. So are the Tokenizers thread safe?

Regards,
Daniel
RE: page rank
I have a few more questions based on your kind replies to my first question.

1. My Solr instance has already indexed hundreds of thousands of documents, so how can I update these documents to add the new field "numberField"?

2. At runtime, my application might want to update the value of "numberField" very frequently. How can I achieve that via Solr? Is performance a concern if many documents need to be updated?

3. Even though I have checked the wiki page below for FunctionQuery, these quoted words are still not clear to me:

"
> In terms of score which RequestHandler are you planning to use?
> If using dismax you can define a boost function:
> recip(rord(numberField),1,1000,1000)
"

With it, how do I let Solr take this numberField into consideration (as a kind of popularity factor)? Would it be possible to give me an example, please?

Best Regards,
David

-----Original Message-----
From: Nick Jenkin [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 21, 2007 6:30 AM
To: solr-user@lucene.apache.org
Subject: Re: page rank

Also if you are using the standard request handler you can use the "_val_" hack:

foo:"bar" _val_:"recip(rord(numberField),1,1000,1000)"

You can find more info about this here:
http://wiki.apache.org/solr/FunctionQuery

-Nick

On 6/21/07, Daniel Alheiros <[EMAIL PROTECTED]> wrote:
> Hi David.
>
> Yes you can.
>
> Just define a field as a slong type field:
>
>
>
> It can be used to sort (&sort=numberField desc) or to boost your score (it
> will depend on the RequestHandler you are going to use).
>
> In terms of score which RequestHandler are you planning to use?
> If using dismax you can define a boost function:
> recip(rord(numberField),1,1000,1000)
>
> I hope it helps.
>
> Regards,
> Daniel Alheiros
>
> On 20/6/07 16:47, "David Xiao" <[EMAIL PROTECTED]> wrote:
>
> > Hello folks,
> >
> > I am using solr to index web contents. I want to know is that possible to
> > tell solr about rank information of contents?
> >
> > For example, I give each content an integer number.
> >
> > And I hope solr take this number into consideration when it generates search
> > result. (larger number, more priority)
> >
> > Best Regards,
> >
> > David
RE: add CJKTokenizer to solr
Thanks, otis, I didn't know CJK is only used for Asian language. I'll try the German Analyzer.

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Friday, June 22, 2007 3:18 AM
To: solr-user@lucene.apache.org
Subject: Re: add CJKTokenizer to solr

I'm jumping in the middle of the thread here.

CJK = Chinese, Japanese, Korean
German = etwas ganz anderes

Why are you trying to use CJKAnalyzer+Tokenizer for German? Have you tried German Analyzer from Lucene contrib?

Otis
Re: add CJKTokenizer to solr
: Sorry I've confused things a bit... The thread safeness have to be
: considered only on the Tokenizers, not on the factories. So are the
: Tokenizers thread safe?

nope ... they are constructed using Readers and maintain state about the text they are processing ... the only API is a "next()" method.

: > But my question is: Do we need to create factories so frequently or the are
: > just create once and re-used (are they thread safe)? The term Factory made
: > me think of a class that is responsible for building others instance, so
: > usually they can be singletons... If they don't need to be created all the

just to be clear, the Factories are reused, but if we wanted one "UberFactory" class to be able to return any arbitrary Tokenizer specified in the config, the reflection would have to be for the Tokenizer classes.

the factories aren't singletons, because you might want to use them for multiple fields with different configurations.

-Hoss
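The split Hoss describes (a reusable factory, but one stateful tokenizer per piece of text) can be sketched as follows. This is an illustration in plain Java, not Lucene or Solr code: `DemoWhitespaceTokenizer`, `DemoTokenizerFactory`, and `FactoryReuseSketch` are hypothetical names, and the whitespace splitting is only there to give the tokenizer some state to hold.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokenizer: it wraps one Reader, so one instance serves one text. */
final class DemoWhitespaceTokenizer {
    private final Reader input; // per-instance state, the reason sharing is unsafe

    DemoWhitespaceTokenizer(Reader input) {
        this.input = input;
    }

    /** Returns the next whitespace-delimited token, or null at end of input. */
    String next() throws IOException {
        StringBuilder token = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (token.length() > 0) {
                    break; // end of the current token
                }
            } else {
                token.append((char) c);
            }
        }
        return token.length() > 0 ? token.toString() : null;
    }
}

/** Hypothetical factory: it holds no per-text state, so one instance can be reused. */
final class DemoTokenizerFactory {
    DemoWhitespaceTokenizer create(Reader input) {
        return new DemoWhitespaceTokenizer(input);
    }
}

final class FactoryReuseSketch {
    /** Asks the shared factory for a fresh tokenizer each time a text is analyzed. */
    static List<String> tokenize(DemoTokenizerFactory factory, String text) {
        DemoWhitespaceTokenizer tokenizer = factory.create(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
                tokens.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        DemoTokenizerFactory factory = new DemoTokenizerFactory();
        System.out.println(tokenize(factory, "Ein Mann"));
        System.out.println(tokenize(factory, "den Hund"));
    }
}
```

In this sketch the factory could be shared freely between threads, while each tokenizer is created, consumed, and discarded by a single caller.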
Re: add CJKTokenizer to solr
On 21-Jun-07, at 10:22 PM, Chris Hostetter wrote:

> like i said though: i'm in favor of factories like this ... i just
> don't think we should do anything to hide their use and make referring to
> Tokenizer or TokenFilter class names directly use reflection magically.

What would be the best way to not hide their use?
Re: add CJKTokenizer to solr
: What would be the best way to not hide their use? : : How about just... -Hoss
RE: Multi-language Tokenizers / Filters recommended?
Hi Daniel,

As you know, Chinese and Japanese do not use spaces or any other delimiters to break words. To overcome this problem, CJKTokenizer uses a method called bi-gram, where a run of ideographic (= Chinese) characters is made into tokens of two neighboring characters. So a run of five characters ABCDE will result in four tokens: AB, BC, CD, and DE.

So a search for "BC" will hit this text, even if AB is a word and CD is another word. That is, it increases the noise in the hits. I don't know how much of a real problem this would be for Chinese. But for Japanese, my native language, it is a problem. Because of this, search results for Kyoto will include false hits on documents that include Tokyoto, i.e. Tokyo prefecture.

There is another method called morphological analysis, which uses dictionaries and grammar rules to break text down into real words. You might want to consider this method.

-kuro
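The bi-gram idea kuro describes can be illustrated with a few lines of plain Java. This is only a sketch of the technique, not the actual CJKTokenizer implementation: `BigramSketch` and `bigrams` are hypothetical names, the code assumes the input is already a single run of ideographic characters, and it ignores surrogate pairs and runs shorter than two characters.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {
    /**
     * Splits a run of characters into overlapping two-character tokens.
     * Runs shorter than two characters yield no tokens in this sketch.
     */
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 2 <= run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The five-character run from the message above.
        System.out.println(bigrams("ABCDE"));

        // Tokyo-to written as three ideographs (U+6771 U+4EAC U+90FD).
        // The second bigram (U+4EAC U+90FD) is the spelling of Kyoto,
        // which is how a search for Kyoto can match a Tokyo-to document.
        System.out.println(bigrams("\u6771\u4EAC\u90FD"));
    }
}
```

The second call shows the false-hit case: the three-character run produces two tokens, and the second one is identical to the two-character word for Kyoto.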
Use Windows 1252 encoding...
Is it possible to use Windows 1252 encoding instead of UTF-8 for Solr ? The application runs on Linux/JDK 1.5. We are using PHP for the front end. The problem we are having is that some characters are displayed weirdly owing to the encoding. Thanks. -- View this message in context: http://www.nabble.com/Use-Windows-1252-encoding...-tf3967676.html#a11262259 Sent from the Solr - User mailing list archive at Nabble.com.
Re: page rank
Hi David

1) you will have to re-add the documents, solr does not support an update operation (only add/del)

2) same as above, solr does not support an update operation, you will need to re-add the document with the updated numberField, if its any help I have a popularity field in my index (3 million documents) which gets updated daily with no performance issues.

3) What query handler are you using, dismax or standard?
dismax is when you send keywords and a lucene query is generated
standard is when you create your own lucene query

-Nick

On 6/23/07, David Xiao <[EMAIL PROTECTED]> wrote:

> 1. My solr instance already indexed hundreds of thousands of documents, so
> how can I update these documents to add new field "numberField"
>
> 2. In runtime, my application might want to update value of "numberField"
> very frequency. How to achieve that via solr? Is that performance critical
> if many documents need to be updated?
>
> 3. Even I have check below wiki page for FunctionQuery, it is still not
> clear to me to understand this quoted words:
> recip(rord(numberField),1,1000,1000)
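Since David asked for an example of what recip(rord(numberField),1,1000,1000) does, here is a small worked sketch in plain Java. It does not call Solr. It assumes, based on my reading of the FunctionQuery wiki page, that recip(x,m,a,b) computes a/(m*x+b) and that rord gives 1 to the largest distinct field value, 2 to the next largest, and so on. `RecipSketch` and its methods are hypothetical names.

```java
import java.util.TreeSet;

final class RecipSketch {
    /** recip(x, m, a, b) = a / (m * x + b), per the assumption stated above. */
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    /** Reverse ordinal: 1 for the largest distinct value, 2 for the next, and so on. */
    static int rord(long[] allValues, long value) {
        TreeSet<Long> distinct = new TreeSet<>();
        for (long v : allValues) {
            distinct.add(v);
        }
        // Count the distinct values strictly greater than `value`, then add one.
        return distinct.tailSet(value, false).size() + 1;
    }

    public static void main(String[] args) {
        long[] numberField = {5, 20, 20, 7}; // made-up popularity values for four documents
        for (long v : numberField) {
            int r = rord(numberField, v);
            System.out.println("numberField=" + v + " rord=" + r
                    + " boost=" + recip(r, 1, 1000, 1000));
        }
    }
}
```

With these parameters the most popular value gets a boost just under 1.0, and the boost falls off slowly, reaching 0.5 only at a reverse ordinal of 1000. As far as I understand it, dismax adds the boost function's value to the relevance score, so larger numberField values nudge documents upward rather than overriding text relevance.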
Re: Use Windows 1252 encoding...
Have you tried using the PHP functions utf8_decode/utf8_encode? As far as I understand only UTF8 is supported (but I could be wrong on that!)

-Nick

On 6/23/07, escher2k <[EMAIL PROTECTED]> wrote:

> Is it possible to use Windows 1252 encoding instead of UTF-8 for Solr ? The
> application runs on Linux/JDK 1.5. We are using PHP for the front end. The
> problem we are having is that some characters are displayed weirdly owing
> to the encoding.
>
> Thanks.
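The "displayed weirdly" symptom is often what you get when UTF-8 bytes are read as Windows-1252. The sketch below reproduces that in plain Java, as a way to recognize the problem; it is not PHP and not Solr code, and `EncodingSketch` and its methods are hypothetical names. It assumes the JVM provides the windows-1252 charset, which is common but not guaranteed by the Java specification. Note also that PHP's utf8_decode converts to ISO-8859-1, which differs from Windows-1252 for some characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSketch {
    // Assumption: this charset is available on the running JVM.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Simulates a front end that receives UTF-8 bytes but decodes them as Windows-1252. */
    static String misreadUtf8AsCp1252(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    /** Reverses the mistake: re-encode as Windows-1252, then decode as UTF-8. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "bei\u00DFt"; // the word from the highlighting thread, with sharp s
        String garbled = misreadUtf8AsCp1252(original);
        System.out.println(garbled);         // sharp s shows up as two unrelated characters
        System.out.println(repair(garbled)); // round-trips back to the original word
    }
}
```

If the garbled output in your pages looks like this pattern (one accented character turning into two), the data is probably fine and the fix is to declare or convert the encoding consistently on the PHP side.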
Re: Use Windows 1252 encoding...
: Is it possible to use Windows 1252 encoding instead of UTF-8 for Solr ? The

not at the moment...

https://issues.apache.org/jira/browse/SOLR-96

-Hoss
Re: Conceptual Question
: > There is a change interface in JIRA, as long as all of the fields
: > originally sent are stored.
:
: Do you remember the JIRA issue, or a token to find it ? It sounds useful
: in some cases, for example, when you are working on analysers. That
: could be real life for me in future.

https://issues.apache.org/jira/browse/SOLR-139

-Hoss