Combining under new subject to reflect new question.
I took a quick look at both LowerCaseFilter and the Java implementation it uses.
A perfect hash would be much faster and, since LowerCaseFilter does not
consider locale, applicable.
ICUFoldingFilter is a somewhat different animal. But I take your point: for
an indexed/searchable field (vs. stored/returned) that may contain accented
characters from a wider variety of locales, it makes a lot of sense. It seems
like a single filter would perform the tasks that we use 3 filters to do.
Have you ever looked at the ICU internals? Are they fairly efficient wrt
character attributes and folding? I used ICU C++ libs a while back and they
were never a bottleneck, but that doesn't mean they wouldn't be in this context
or that the Java libs have the same performance characteristics.
We use the following for US-only content (which may be a common use case):
<fieldType name="text_search" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
Thus my interest in the question.
-----Original Message-----
From: Shawn Heisey
Sent: Thursday, July 09, 2015 9:55 AM
To: [email protected]
Subject: Re: Do I really need copyField when my app can do the copy?
I don't know how its CPU usage compares to LowerCaseFilter (LCF), but I use
ICUFoldingFilterFactory instead. This does several things in one pass,
including lowercasing (which it calls case folding), and it is aware of all
characters in Unicode.
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
The ICU classes require additional jars to be loaded into Solr before they will
work.
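To illustrate what "several things in one pass" looks like, here is a rough sketch that lowercases and strips accents in a single call. It uses only JDK classes (java.text.Normalizer and Character), not ICU, so it only approximates a small part of what ICUFoldingFilter does. The class name OnePassFold and the method fold are made up for this example.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Solr, Lucene, or ICU.
final class OnePassFold {
    // Decompose to NFD, drop combining marks (accents), lowercase each code point.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                continue; // skip the accent that NFD split off from its base letter
            }
            // Per-code-point lowercasing, independent of the default locale.
            out.appendCodePoint(Character.toLowerCase(cp));
        }
        return out.toString();
    }
}
```

Calling OnePassFold.fold on a token with mixed case and accented letters returns an unaccented, lowercase form. Real ICU folding covers many more cases (width, ligatures, and so on), so treat this only as an illustration of the idea.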
Thanks,
Shawn
-----Original Message-----
From: Reitzel, Charles
Sent: Thursday, July 09, 2015 9:47 AM
To: [email protected]
Subject: RE: LowerCaseFilterFactory burns CPU
That should be fixable. In a past life, I generated a perfect hash to fold
case for Unicode in a locale-neutral manner and it was very fast. If I
remember right, there are only about 2500 Unicode characters that can be case
folded at all. So the generated, collision-free hash function was very small
and fast and the lookup table was small.
I used Bob Jenkins' tool suite for a C application.
http://burtleburtle.net/bob/hash/perfect.html
But there are a number of other open source tools available. Bob Jenkins
currently recommends this one by Botelho and Ziviani:
http://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf
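As a rough Java illustration of the table-lookup idea, here is a minimal sketch. Note that it swaps in a simpler technique: a direct-indexed table over the Basic Multilingual Plane rather than a generated perfect hash, so the table is larger (65,536 entries) than the one described above. It assumes that per-character Character.toLowerCase is the desired locale-neutral mapping, and it leaves supplementary characters (surrogate pairs) unchanged. The class and method names are made up for this example.

```java
// Hypothetical helper, not part of Solr or Lucene.
final class CaseFoldTable {
    // One entry per UTF-16 code unit (U+0000..U+FFFF), filled once at class load.
    private static final char[] LOWER = new char[Character.MAX_VALUE + 1];

    static {
        for (int c = 0; c <= Character.MAX_VALUE; c++) {
            // Locale-neutral mapping; surrogate code units map to themselves.
            LOWER[c] = Character.toLowerCase((char) c);
        }
    }

    // Fold a slice of a term buffer in place: one array lookup per char.
    static void foldInPlace(char[] buf, int offset, int length) {
        for (int i = offset, end = offset + length; i < end; i++) {
            buf[i] = LOWER[buf[i]];
        }
    }

    // Convenience wrapper for whole strings.
    static String fold(String s) {
        char[] buf = s.toCharArray();
        foldInPlace(buf, 0, buf.length);
        return new String(buf);
    }
}
```

A perfect hash would shrink the table to roughly the number of foldable characters at the cost of a hash computation per lookup. Whether either approach beats the existing filter in practice would need profiling.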
-----Original Message-----
From: Nir Barel
Sent: Thursday, July 09, 2015 4:35 AM
To: [email protected]
Subject: RE: Do I really need copyField when my app can do the copy?
Hi,
I want to add a question regarding copyField and LowerCaseFilterFactory.
Profiling shows that LowerCaseFilterFactory takes a huge part of the CPU for
the text field. Can we avoid it or improve that implementation (while keeping
case-insensitive search)?
Best Regards,
Nir Barel