Hi Tom,

The documentation is wrong. The sentence you quoted was inherited from Classic Tokenizer's description. UAX 29 URL Email Tokenizer is a specialization of Standard Tokenizer, the 7.2 documentation for which says the following:

    Note that words are split at hyphens.

I've made an issue to fix the Solr ref guide: https://issues.apache.org/jira/browse/SOLR-13448

If you don't need the UAX#29 word break rules and identification of URLs and emails, you could switch to Classic Tokenizer, which handles hyphens like you want.

Alternatively, if you want to continue using UAX29 URL Email Tokenizer, you could use a (pre-tokenization) char filter to convert hyphens to something that won't trigger a word break, and then a (post-tokenization) token filter to convert back to a hyphen, e.g. something like (untested; "_._" is an example of a string that is unlikely to occur in your data and which will not trigger a word break[1]):

    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\d[A-Za-z]*)-([A-Za-z]*\d)" replacement="$1_._$2"/>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="_\._" replacement="-"/>

(I'm guessing you'll need more than one PatternReplaceCharFilterFactory instance to handle all permutations.)

FYI the following note from UAX#29 explains why the default word break rules have hyphens trigger word breaks:

    The correct interpretation of hyphens in the context of word boundaries is challenging. It is quite common for separate words to be connected with a hyphen: “out-of-the-box,” “under-the-table,” “Italian-American,” and so on. A significant number are hyphenated names, such as “Smith-Hawkins.” When doing a Whole Word Search or query, users expect to find the word within those hyphens. While there are some cases where they are separate words (usually to resolve some ambiguity such as “re-sort” as opposed to “resort”), it is better overall to keep the hyphen out of the default definition.

    Hyphens include U+002D HYPHEN-MINUS, U+2010 HYPHEN, possibly also U+058A ARMENIAN HYPHEN, and U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN.
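To get a feel for which terms the example char filter pattern would actually rewrite, here is a minimal Python sketch of the two regex steps on their own. It does not run the UAX#29 tokenizer, so it says nothing about where word breaks end up. The function names `protect_hyphens` and `restore_hyphens` are made up for illustration, and the sketch assumes Python's `re` treats this particular pattern the same way Java's regex engine does.

```python
import re

# Same regex as the example charFilter: a digit, optional letters, a hyphen,
# optional letters, a digit.
HYPHEN_BETWEEN_DIGITS = re.compile(r"(\d[A-Za-z]*)-([A-Za-z]*\d)")
PLACEHOLDER = "_._"  # the example placeholder string from the config above


def protect_hyphens(text: str) -> str:
    """Mimic the char filter step: swap matching hyphens for the placeholder."""
    # Python writes group references as \1 and \2 where Java uses $1 and $2.
    return HYPHEN_BETWEEN_DIGITS.sub(r"\1" + PLACEHOLDER + r"\2", text)


def restore_hyphens(token: str) -> str:
    """Mimic the token filter step: turn the placeholder back into a hyphen."""
    return re.sub(r"_\._", "-", token)


if __name__ == "__main__":
    for term in ["AB12-CD34", "ABC-123", "123-ABC"]:
        protected = protect_hyphens(term)
        print(term, "->", protected, "->", restore_hyphens(protected))
```

In this sketch only "AB12-CD34" is rewritten. "ABC-123" and "123-ABC" pass through unchanged, because the pattern requires a digit on both sides of the hyphen. Covering those cases would take additional patterns.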
Steve

[1] To figure out which chars to use to not trigger a word break, look at rules WB6, WB7, WB8 & WB9 (https://unicode.org/reports/tr29/#WB6 etc.) - "×" in these rules means "do not break". The MidLetter and MidNumLet character sets are your best bet for such chars: https://unicode.org/reports/tr29/#MidNumLet , https://unicode.org/reports/tr29/#MidLetter

> On May 6, 2019, at 7:22 AM, Tom Van Cuyck <tom.vancu...@ontoforce.com> wrote:
>
> Hi,
>
> The UAX29 URL Email Tokenizer is not working as expected.
> According to the documentation (https://lucene.apache.org/solr/guide/7_2/tokenizers.html): "Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved."
>
> So I expect "ABC-123" to remain "ABC-123"
> However the term is split in 2 separate tokens "ABC" and "123".
>
> Same for "AB12-CD34" --> "AB12" and "CD34" etc...
>
> Is this behavior to be expected? Or is there a way to get the behavior I expect?
>
> Kind regards, Tom
>
> --
> Tom Van Cuyck
> Software Engineer
> ONTOFORCE