Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Zheng Lin Edwin Yeo
Hi Ahmet, OK. Thanks for your advice. Regards, Edwin On 25 November 2017 at 10:23, Ahmet Arslan wrote: > > > Hi Zheng, > > UAX29UET recognizes URLs and e-mails. It does not tokenize them. It keeps > them as a single token. > > StandardTokenizer produces two or more tokens for an entity. > > Please t

Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Zheng Lin Edwin Yeo
Hi Rick, Neither tokenizer splits on the hyphen in an email address like this: solr-user@lucene.apache.org The entire email address remains intact for both of the tokenizers. Regards, Edwin On 24 November 2017 at 20:19, Rick Leir wrote: > Edwin > There is a spec for which characte

Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Ahmet Arslan
Hi Zheng, UAX29UET recognizes URLs and e-mails. It does not tokenize them. It keeps them as a single token. StandardTokenizer produces two or more tokens for an entity. Please try them on the analysis page and use whichever one suits your requirements. Ahmet On Friday, November 24, 2017, 11:46:57 A
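The distinction Ahmet describes, keeping an address as one token versus breaking it into several, can be illustrated with a toy Python sketch. The regular expressions and the function names `split_on_punctuation` and `keep_emails_whole` are assumptions made up for this illustration; they are much cruder than the rules any Lucene tokenizer applies, so the Solr analysis page remains the place to check real behaviour.

```python
import re

# Toy patterns for illustration only; these are NOT the rules Lucene implements.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
WORD = r"[A-Za-z0-9]+"

def split_on_punctuation(text):
    """Break text into alphanumeric runs, so an address falls apart."""
    return re.findall(WORD, text.lower())

def keep_emails_whole(text):
    """Try the e-mail pattern first, then fall back to alphanumeric runs."""
    return re.findall(f"{EMAIL}|{WORD}", text.lower())

sample = "Mail solr-user@lucene.apache.org today"
print(split_on_punctuation(sample))
print(keep_emails_whole(sample))
```

The second function yields the address as a single token, which is the behaviour to look for when comparing tokenizers for an e-mail field.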

Re: docValues

2017-11-24 Thread Kojo
Erick, thanks for explaining the memory aspects. Regarding the end user perspective, our intention is to provide a first layer of filtering, where data will be rolled up in some buckets and be displayed in charts and tables. When I talked about providing access to "full" documents, it was not to displ

Re: Strip out punctuation at the end of token

2017-11-24 Thread Erick Erickson
You need to play with the (many) parameters for WordDelimiterFilterFactory. For instance, you have preserveOriginal set to 1. That's what's generating the token with the dot. You have catenateAll and catenateNumbers set to zero. That means that someone searching for 61149008 won't get a hit. The

Re: docValues

2017-11-24 Thread Erick Erickson
Kojo: bq: My question is, isn't it too expensive in terms of memory consumption to enable docValues on fields that I don't need to facet, search etc? Well, yes and no. The memory consumed is your OS memory space and a small bit of control structures on your Java heap. It's a bit scary that your _in

Re: docValues

2017-11-24 Thread Kojo
I think I found the solution: after the analysis, switch from the /export request handler to the /select request handler in order to obtain the other fields. I will try that. 2017-11-24 15:15 GMT-02:00 Kojo : > Thank you very much for your answer, Shawn. > > That is it, I was looking for another way to in
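A minimal sketch of that switch, assuming a hypothetical local Solr at `http://localhost:8983/solr/my_collection` and made-up field names (`id`, `year`, `title`, `abstract`). It only builds the request URLs and does not contact a server. The `/export` handler needs docValues on the fields it returns and sorts by, while `/select` can also return stored fields without docValues.

```python
from urllib.parse import urlencode

# Hypothetical values: adjust host, collection and field names to your setup.
BASE = "http://localhost:8983/solr/my_collection"

def export_url(query, fields, sort):
    """/export streams whole result sets; its fl and sort fields need docValues."""
    params = {"q": query, "fl": ",".join(fields), "sort": sort}
    return f"{BASE}/export?" + urlencode(params)

def select_url(query, fields, rows=10):
    """/select returns paged results and can include stored, non-docValues fields."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows}
    return f"{BASE}/select?" + urlencode(params)

print(export_url("year:2017", ["id", "year"], "id asc"))
print(select_url("year:2017", ["id", "title", "abstract"]))
```

One possible pattern is to run the analysis through `/export` on a few docValues fields, then fetch the remaining fields for the selected documents through `/select`.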

Re: docValues

2017-11-24 Thread Kojo
Thank you very much for your answer, Shawn. That is it, I was looking for another way to include non-docValues fields in the filtered result documents. I can enable docValues on other fields and reindex everything if necessary. I will tell you about the use case, because I am not sure that I am on the r

Re: Strip out punctuation at the end of token

2017-11-24 Thread Sergio García Maroto
Yes. You are right. I understand now. Let me explain my issue a bit better with the exact problem I have. I have this text "Information number 61149-008." Using the tokenizers and filters described previously, I get this list of tokens: information number 61149-008. 61149 008 Basically last token
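The outcome Sergio is after, the same tokens without the trailing dot, can be shown with a small Python sketch. This only illustrates the intended result on the token list quoted above; it is not the Solr analysis chain, and the helper name `strip_trailing_punctuation` is an assumption made up for the example.

```python
import string

def strip_trailing_punctuation(tokens):
    """Remove punctuation from the end of each token; drop tokens left empty."""
    cleaned = (token.rstrip(string.punctuation) for token in tokens)
    return [token for token in cleaned if token]

tokens = ["information", "number", "61149-008.", "61149", "008"]
print(strip_trailing_punctuation(tokens))
```

Only trailing characters are removed, so the hyphen inside `61149-008` is left in place.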

Re: docValues

2017-11-24 Thread Shawn Heisey
On 11/23/2017 1:51 PM, Kojo wrote: I am working on Solr to develop a tool for analysis. I am using the search function of Streaming Expressions, which requires a field to be indexed with docValues enabled, so I can get it. Suppose that after someone finishes the analysis, and would like to get o

Re: Strip out punctuation at the end of token

2017-11-24 Thread Shawn Heisey
On 11/24/2017 2:32 AM, marotosg wrote: Hi Shawn. Thanks for your reply. Actually my issue is with the last token. It looks like, for the last token of a string, it keeps the dot. In your case Testing. This is a test. Test. Keeps the "Test." Is there any reason I can't see for that behaviour?

Re: Solr7 org.apache.lucene.index.IndexUpgrader

2017-11-24 Thread Shawn Heisey
On 11/23/2017 11:31 PM, Leo Prince wrote: We were using a somewhat older version, Solr 4.10.2, and are upgrading to Solr7. We have about 4 million records in one of the cores, which is of course pretty huge, hence re-sourcing the index is nearly impossible and re-querying from source Solr to Solr7 is also going to b

Fwd: docValues

2017-11-24 Thread Kojo
Hi, yesterday I sent the message below to this list, but just after I sent the message I received an e-mail from the mail server that said that my e-mail bounced. I don't know what that means, and since I received no answer to the question, I don't know whether the message has arrived at the lis

Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Rick Leir
Edwin There is a spec for which characters are acceptable in an email name, and another spec for chars in a domain name. I suspect you will have more success with a tokenizer which is specialized for email, but I have not looked at UAX29URLEmailTokenizerFactory. Does ClassicTokenizerFactory spli

Re: Strip out punctuation at the end of token

2017-11-24 Thread marotosg
Hi Shawn. Thanks for your reply. Actually my issue is with the last token. It looks like, for the last token of a string, it keeps the dot. In your case Testing. This is a test. Test. Keeps the "Test." Is there any reason I can't see for that behaviour? Thanks, Sergio Testing. This is a test.

Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Zheng Lin Edwin Yeo
Hi, I am indexing email addresses into Solr via EML files. Currently, I am using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I found that we can also use UAX29URLEmailTokenizerFactory with LowerCaseFilterFactory. Does anyone have any recommendation on which Tokenizer is bet