RE: LowerCaseFilterFactory burns CPU

Reitzel, Charles Thu, 09 Jul 2015 07:39:19 -0700

Combining under new subject to reflect new question.

I took a quick look at both LowerCaseFilter and the Java implementation it 
uses.  A perfect hash would be much faster and, since LowerCaseFilter does not 
consider locale, it would be applicable.


ICUFoldingFilter is a somewhat different animal.   But I take your point, for 
an indexed/searchable field (vs. stored/returned) that may contain accented 
characters from a wider variety of locales, it makes a lot of sense.  It seems 
like a single filter would perform the tasks that we use 3 filters to do.    

Have you ever looked at the ICU internals?   Are they fairly efficient wrt 
character attributes and folding?   I used ICU C++ libs a while back and they 
were never a bottleneck, but that doesn't mean they wouldn't be in this context 
or that the Java libs have the same performance characteristics.

We use the following for US-only content (which may be a common use case):
    <fieldType name="text_search" class="solr.TextField">
        <analyzer>
            <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="solr.EnglishPossessiveFilterFactory"/>
        </analyzer>
    </fieldType>
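
As a rough illustration of what the lowercasing and folding steps in a chain 
like this do to a token, here is a JDK-only sketch.  It is not Solr or Lucene 
code: the class name FoldSketch and the method fold are made up for the 
example, and stripping combining marks after NFD decomposition only 
approximates ASCIIFoldingFilter, which also maps characters that have no 
decomposition.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Solr or Lucene.
public class FoldSketch {

    // Lowercase with locale-neutral rules, decompose accented characters,
    // then drop the combining marks that the decomposition exposes.
    static String fold(String input) {
        String lower = input.toLowerCase(Locale.ROOT);
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Café RÉSUMÉ", written with escapes to avoid source-encoding issues.
        System.out.println(fold("Caf\u00e9 R\u00c9SUM\u00c9")); // prints "cafe resume"
    }
}
```

Doing both steps in one pass over the token, rather than in separate filters, 
is the appeal of a combined folding filter.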

Thus my interest in the question.

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Thursday, July 09, 2015 9:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Do I really need copyField when my app can do the copy?

I don't know what the CPU usage is like compared to LCF, but I use 
ICUFoldingFilterFactory instead.  This does several things in one pass, 
including lowercasing (which it calls case folding), and it is aware of all 
characters in Unicode.

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory

The ICU classes require additional jars to be loaded into Solr before they will 
work.

Thanks,
Shawn

-----Original Message-----
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] 
Sent: Thursday, July 09, 2015 9:47 AM
To: solr-user@lucene.apache.org
Subject: RE: LowerCaseFilterFactory burns CPU

That should be fixable.   In a past life, I generated a perfect hash to fold 
case for Unicode in a locale-neutral manner and it was very fast.   If I 
remember right, there are only about 2500 Unicode characters that can be case 
folded at all.  So the generated, collision-free hash function was very small 
and fast and the lookup table was small.
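
To make the idea concrete, here is a minimal Java sketch of table-driven, 
locale-neutral lowercasing.  Note that it swaps in a plain direct-indexed 
table over the BMP instead of a generated perfect hash: that costs 128 KB 
rather than a few thousand entries, but it shows the same "one lookup per 
character" shape.  The names CaseTableSketch, lowerInPlace and countCased are 
hypothetical, and the table is filled from the JDK's own Character.toLowerCase 
rather than from the Unicode data files.  Supplementary characters (surrogate 
pairs) pass through unchanged in this sketch.

```java
// Hypothetical sketch, not Lucene's LowerCaseFilter.
public class CaseTableSketch {

    // One entry per BMP code unit; built once and read-only afterwards.
    private static final char[] LOWER = new char[Character.MAX_VALUE + 1];

    static {
        for (int c = 0; c <= Character.MAX_VALUE; c++) {
            LOWER[c] = Character.toLowerCase((char) c);
        }
    }

    // Lowercases the first len chars of buf with one array read per char.
    static void lowerInPlace(char[] buf, int len) {
        for (int i = 0; i < len; i++) {
            buf[i] = LOWER[buf[i]];
        }
    }

    // Counts BMP chars whose lower- or uppercase mapping differs from
    // the char itself, as a rough size estimate for a sparse table.
    static int countCased() {
        int n = 0;
        for (int c = 0; c <= Character.MAX_VALUE; c++) {
            char ch = (char) c;
            if (Character.toLowerCase(ch) != ch || Character.toUpperCase(ch) != ch) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        char[] buf = "Hello SOLR".toCharArray();
        lowerInPlace(buf, buf.length);
        System.out.println(new String(buf)); // prints "hello solr"
        System.out.println("BMP chars with a case mapping: " + countCased());
    }
}
```

The count printed by countCased depends on the Unicode version bundled with 
the JDK, so treat it as a sanity check on the estimate above rather than a 
fixed figure.  A perfect hash over just those characters would shrink the 
table to roughly that many entries.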

I used Bob Jenkins' tool suite for a C application.
http://burtleburtle.net/bob/hash/perfect.html

But there are a number of other open source tools available.   Bob Jenkins 
currently recommends this one by Botelho and Ziviani: 
http://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf


-----Original Message-----
From: Nir Barel [mailto:ni...@checkpoint.com] 
Sent: Thursday, July 09, 2015 4:35 AM
To: solr-user@lucene.apache.org
Subject: RE: Do I really need copyField when my app can do the copy?

Hi,

I want to add a question regarding copyField and LowerCaseFilterFactory.  We 
noticed (via profiling) that LowerCaseFilterFactory takes a huge part of the 
CPU for the text field.  Can we avoid it or improve that implementation (while 
keeping case-insensitive search)?

Best Regards,
Nir Barel 

