Re: UAX29 URL Email Tokenizer not working as expected

Steve Rowe Tue, 07 May 2019 01:34:49 -0700

Hi Tom,

The documentation is wrong. The sentence you quoted was inherited from the 
Classic Tokenizer's description. UAX29 URL Email Tokenizer is a specialization 
of Standard Tokenizer, and the 7.2 documentation for Standard Tokenizer says 
the following:

    Note that words are split at hyphens.

I've opened an issue to fix the Solr ref guide: 
https://issues.apache.org/jira/browse/SOLR-13448

If you don't need the UAX#29 word break rules or the identification of URLs 
and emails, you could switch to Classic Tokenizer, which handles hyphens the 
way you want.
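
For the Classic Tokenizer route, a minimal field type might look something 
like the following (untested sketch; the name "text_classic" and the lowercase 
filter are placeholders of mine, not anything from your schema):

  <fieldType name="text_classic" class="solr.TextField">
    <analyzer>
      <!-- Classic Tokenizer keeps a hyphenated token whole when it contains a number -->
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <!-- optional; drop this if you need case preserved -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With this, "ABC-123" should come out as one token (lowercased by the filter). 
The Analysis screen in the Solr admin UI is the quickest way to confirm that 
against your own data.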

Alternatively, if you want to continue using UAX29 URL Email Tokenizer, you 
could use a (pre-tokenization) char filter to convert hyphens to something that 
won't trigger a word break, and then a (post-tokenization) token filter to 
convert back to a hyphen, e.g. something like (untested; "_._" is an example of 
a string that is unlikely to occur in your data and which will not trigger a 
word break[1]):

  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(\d[A-Za-z]*)-([A-Za-z]*\d)" replacement="$1_._$2"/>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="_\._" replacement="-"/>

(I'm guessing you'll need more than one PatternReplaceCharFilterFactory 
instance to handle all permutations.)
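
One possible way to avoid stacking several char filters is to match only the 
hyphen itself with lookarounds, so the neighbouring characters are not 
consumed and a run like "A1-B2-C3" gets every hyphen replaced. The sketch 
below is untested and assumes the rule you want is "a hyphen with a digit 
directly on at least one side"; a case like "ABC-DEF1" would still be split. 
Note that "<" has to be written as "&lt;" inside an XML attribute. Whether 
"_._" really passes through the tokenizer unbroken is also something to check 
on the Analysis screen before relying on it:

  <!-- assumption: protect hyphens that touch a digit on either side -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(?&lt;=[A-Za-z0-9])-(?=\d)|(?&lt;=\d)-(?=[A-Za-z0-9])"
              replacement="_._"/>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  <!-- restore the hyphen after tokenization -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="_\._" replacement="-" replace="all"/>

With that pattern, "ABC-123" and "AB12-CD34" would both reach the tokenizer 
with the placeholder in place of the hyphen, while "re-sort" is left alone.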

FYI the following note from UAX#29 explains why the default word break rules 
have hyphens trigger word breaks:

    The correct interpretation of hyphens in the context
    of word boundaries is challenging. It is quite common
    for separate words to be connected with a hyphen:
    “out-of-the-box,” “under-the-table,” “Italian-American,”
    and so on. A significant number are hyphenated names,
    such as “Smith-Hawkins.” When doing a Whole Word Search
    or query, users expect to find the word within those
    hyphens. While there are some cases where they are
    separate words (usually to resolve some ambiguity such
    as “re-sort” as opposed to “resort”), it is better
    overall to keep the hyphen out of the default
    definition. Hyphens include U+002D HYPHEN-MINUS, 
    U+2010 HYPHEN, possibly also U+058A ARMENIAN HYPHEN,
    and U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN.

Steve

[1] To figure out which chars to use to not trigger a word break, look at rules 
WB6, WB7, WB8 & WB9 (https://unicode.org/reports/tr29/#WB6 etc.) - "×" in these 
rules means "do not break".  The MidLetter and MidNumLet character sets are 
your best bet for such chars: https://unicode.org/reports/tr29/#MidNumLet , 
https://unicode.org/reports/tr29/#MidLetter

> On May 6, 2019, at 7:22 AM, Tom Van Cuyck <tom.vancu...@ontoforce.com> wrote:
> 
> Hi,
> 
> The UAX29 URL Email Tokenizer is not working as expected.
> According to the documentation (
> https://lucene.apache.org/solr/guide/7_2/tokenizers.html): "Words are split
> at hyphens, unless there is a number in the word, in which case the token
> is not split and the numbers and hyphen(s) are preserved."
> 
> So I expect "ABC-123" to remain "ABC-123"
> However the term is split in 2 separate tokens "ABC" and "123".
> 
> Same for "AB12-CD34" --> "AB12" and "CD34" etc...
> 
> Is this behavior to be expected? Or is there a way to get the behavior I
> expect?
> 
> Kind regards, Tom
