Hmmm, good point on WordDelimiterFilterFactory. You're right, that should work.
Although there'd still be a problem with J. R. R. never matching jrr. But that
wouldn't be solved by Pattern.... either. I'd try to define the problem away <G>...

good catch

Erick

On Mon, Nov 22, 2010 at 12:15 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 11/22/2010 7:40 AM, Erick Erickson wrote:
>
>> As I remember, PatternReplace... isn't in 1.4, so you'd have to move to 3.x
>> or trunk.
>>
>> You could always write a custom class that did what you wanted, it's
>> actually pretty easy.
>
> PatternReplaceCharFilterFactory isn't in 1.4, but
> PatternReplaceFilterFactory is. I'm using it in my 1.4.1 installation. The
> CharFilter version gets applied before tokenization, which caused problems
> for me in my testing of branch_3x. In situations where the order of
> operations isn't important, the CharFilter option would be great.
>
> Based on their description, I'd think what they actually want is
> WordDelimiterFilterFactory with preserveOriginal and catenateWords turned on
> at a minimum. That should match on any likely representation of J.R.R.
> Tolkien. The other options can also be useful.
>
> In my schema, the index analyzer has WordDelimiterFilterFactory with
> everything turned on except catenateAll, and the query analyzer is the same
> except all three catenate options are turned off.
>
> Shawn
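
For reference, the setup Shawn describes might look roughly like the following in schema.xml. This is only a sketch: the field type name, the whitespace tokenizer, the lowercase filter, and the exact set of attributes counted as "everything" are assumptions, not taken from his actual schema.

```xml
<!-- Hypothetical field type; the name, tokenizer and lowercase filter are assumptions. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Index side: options turned on except catenateAll. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Query side: same, but all three catenate options turned off. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a whitespace tokenizer, J.R.R. arrives as a single token, so the index side should produce the parts j, r, r, the catenated jrr, and the original j.r.r. A spaced J. R. R. arrives as three separate tokens, so nothing ever catenates them into jrr, which is the caveat mentioned above.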