Combining under a new subject to reflect the new question. I took a quick look at both LowerCaseFilter and the Java implementation it uses. A perfect hash would be much faster and, since LowerCaseFilter does not consider locale, it would be applicable here.
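To make the idea concrete, here is a minimal sketch of a locale-neutral, table-driven lowercaser in Java. Note the assumptions: this is a plain precomputed lookup table indexed by UTF-16 code unit, not a generated perfect hash, and the class name TableCaseFolder and its methods are hypothetical; nothing here is part of Lucene or Solr. It only illustrates replacing a per-character library call with a single array read, and whether that actually beats LowerCaseFilter would have to be measured.

```java
// Hypothetical helper, not a Lucene/Solr class: locale-neutral lowercasing
// through a precomputed table instead of a per-character library call.
final class TableCaseFolder {
    // One entry per UTF-16 code unit (65,536 chars, about 128 KB), built once.
    private static final char[] LOWER = new char[Character.MAX_VALUE + 1];

    static {
        for (int c = 0; c <= Character.MAX_VALUE; c++) {
            // Character.toLowerCase(char) is locale-independent.
            LOWER[c] = Character.toLowerCase((char) c);
        }
    }

    private TableCaseFolder() {}

    // Lowercases buf[offset, offset + length) in place.
    static void foldInPlace(char[] buf, int offset, int length) {
        final int end = offset + length;
        int i = offset;
        while (i < end) {
            char c = buf[i];
            if (Character.isHighSurrogate(c) && i + 1 < end
                    && Character.isLowSurrogate(buf[i + 1])) {
                // Supplementary code points are rare; fall back to the JDK.
                // Assumes the lowercase form is also supplementary (2 chars).
                int cp = Character.toLowerCase(Character.toCodePoint(c, buf[i + 1]));
                Character.toChars(cp, buf, i);
                i += 2;
            } else {
                // Common case: a single array read per character.
                buf[i] = LOWER[c];
                i++;
            }
        }
    }

    // Convenience wrapper for strings.
    static String fold(String s) {
        char[] buf = s.toCharArray();
        foldInPlace(buf, 0, buf.length);
        return new String(buf);
    }
}
```

A call such as TableCaseFolder.fold("LowerCaseFilter") returns "lowercasefilter". A real perfect hash would trade the 128 KB table for a much smaller one covering only the characters that have a case mapping, at the cost of a hash computation per lookup.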
ICUFoldingFilter is a somewhat different animal. But I take your point: for an indexed/searchable field (vs. stored/returned) that may contain accented characters from a wider variety of locales, it makes a lot of sense. It seems like a single filter would perform the tasks that we use 3 filters to do.

Have you ever looked at the ICU internals? Are they fairly efficient wrt character attributes and folding? I used the ICU C++ libs a while back and they were never a bottleneck, but that doesn't mean they wouldn't be in this context or that the Java libs have the same performance characteristics.

We use the following for US-only content (which may be a common use case):

<fieldType name="text_search" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>

Thus my interest in the question.

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Thursday, July 09, 2015 9:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Do I really need copyField when my app can do the copy?

I don't know what the CPU usage is like compared to LCF, but I use ICUFoldingFilterFactory instead. This does several things in one pass, including lowercasing (which it calls case folding), and it is aware of all the characters in Unicode.

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory

The ICU classes require additional jars to be loaded into Solr before they will work.

Thanks,
Shawn

-----Original Message-----
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
Sent: Thursday, July 09, 2015 9:47 AM
To: solr-user@lucene.apache.org
Subject: RE: LowerCaseFilterFactory burns CPU

That should be fixable.
In a past life, I generated a perfect hash to fold case for Unicode in a locale-neutral manner, and it was very fast. If I remember right, there are only about 2500 Unicode characters that can be case folded at all, so the generated, collision-free hash function was very small and fast, and the lookup table was small too.

I used Bob Jenkins' tool suite for a C application:

http://burtleburtle.net/bob/hash/perfect.html

But there are a number of other open source tools available. Bob Jenkins currently recommends this one by Botelho and Ziviani:

http://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf

-----Original Message-----
From: Nir Barel [mailto:ni...@checkpoint.com]
Sent: Thursday, July 09, 2015 4:35 AM
To: solr-user@lucene.apache.org
Subject: RE: Do I really need copyField when my app can do the copy?

Hi,

I want to add a question regarding copyField and LowerCaseFilterFactory.

We noticed (via profiling) that LowerCaseFilterFactory takes a huge part of the CPU for the text field.

Can we avoid it or improve that implementation (while keeping case-insensitive search)?

Best Regards,
Nir Barel