The main risk is an explosion in the number of terms, depending on
your input, especially if it contains things that aren't regular
text. Imagine:
partone-1
partone-2
partone-3

parttwo-1
parttwo-2
parttwo-3

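If catenateAll is set to 0, you'd get 5 unique terms here (partone,
parttwo, 1, 2, 3). If it were set to 1, you'd get 11, because each of
the six inputs also adds a catenated term such as partone1.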

That doesn't seem like a lot until you have hundreds of thousands of
patterns like this.
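
To make the arithmetic concrete, here is a rough Python sketch that counts unique terms for the six inputs above. It is only a toy model, not the real WordDelimiterFilter: it assumes the input splits on hyphens only and ignores the filter's other options, and the function names (delimiter_terms, unique_terms) are made up for illustration.

```python
def delimiter_terms(token, catenate_all=False):
    """Toy model: split on '-' and optionally add the fully catenated term."""
    parts = [p for p in token.split("-") if p]
    terms = list(parts)
    # Assumption: catenateAll adds one extra term made of all parts joined.
    if catenate_all and len(parts) > 1:
        terms.append("".join(parts))
    return terms


def unique_terms(tokens, catenate_all=False):
    """Collect the distinct terms produced across all input tokens."""
    seen = set()
    for token in tokens:
        seen.update(delimiter_terms(token, catenate_all))
    return seen


# The six inputs from the example: partone-1 .. parttwo-3
corpus = [f"{prefix}-{n}" for prefix in ("partone", "parttwo") for n in (1, 2, 3)]

print(len(unique_terms(corpus, catenate_all=False)))  # 5
print(len(unique_terms(corpus, catenate_all=True)))   # 11
```

The parts (partone, 1, and so on) are shared across inputs, but every catenated term is distinct, so the catenated terms grow with the number of distinct inputs rather than the number of distinct parts.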

So, give it a whirl and see what pops out with your particular corpus,
but keep an eye on the number of unique terms that end up in the
field.

Best
Erick

On Thu, Nov 17, 2011 at 12:18 PM, Brendan Grainger
<brendan.grain...@gmail.com> wrote:
> Hi,
>
> The default for catenateAll is 0 which we've been using on the 
> WordDelimiterFilter. What would be the possibly negative implications of 
> setting this to 1? So that:
>
> wi-fi-800
>
> would produce the tokens:
>
> wi, fi, wifi, 800, wifi800
>
> for example?
>
> Thanks
