Re: synonym filter and offsets

Chris Hostetter Sat, 24 Apr 2010 18:05:09 -0700

: on indexing these are passed through a synonym filter that has this line
: saturday night live => snl, saturday night live


: i now end up with four tokens
: [saturday, 0, 19], [snl, 0, 19], [night, 0, 19], [live, 0,19]
: 
: what i want is
: [saturday, 0,8], [snl, 0,19], [night, 9, 14], [live, 15,19]


Hmmm ... I don't think there's any way to make SynonymFilterFactory do 
that.

The problem is that while in your case it seems obvious that the 
final "saturday" token should get the same start/end offset info as the 
initial "saturday" token from the tokenizer, in the general case there 
isn't a one-to-one mapping between input and output tokens.  The synonym 
file could just as easily look like...

saturday-night live => snl, saturday night live

...in which case it would have no idea what offsets to apply to "saturday".

That said: I suppose it could make sense to put in explicit logic that if 
a multi-token input produces multi-token output, and some of the output 
tokens have the same term text as tokens from the input, then they should 
get identical offsets ... but I haven't thought it through enough to be sure 
there wouldn't be any nasty gotchas.
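
To make the idea concrete, here is a rough sketch of that matching logic in 
plain Python. It is not Lucene/Solr code and not how SynonymFilter works 
today: the assign_offsets function and the (text, start, end) tuples are 
made up for illustration, and it ignores position increments entirely. An 
output token reuses the offsets of an input token with identical text; 
anything else falls back to the span of the whole match.

```python
from typing import List, Tuple

# Hypothetical token shape for this sketch: (text, start_offset, end_offset)
Token = Tuple[str, int, int]


def assign_offsets(matched: List[Token], outputs: List[str]) -> List[Token]:
    """Give each output term the offsets of an identical input token if one
    exists; otherwise give it the offsets of the whole matched span."""
    span_start = matched[0][1]
    span_end = matched[-1][2]
    remaining = list(matched)  # input tokens not yet claimed by an output
    result = []
    for text in outputs:
        hit = next((tok for tok in remaining if tok[0] == text), None)
        if hit is not None:
            remaining.remove(hit)  # each input token is reused at most once
            result.append(hit)
        else:
            result.append((text, span_start, span_end))
    return result


# Rule: saturday night live => snl, saturday night live
matched = [("saturday", 0, 8), ("night", 9, 14), ("live", 15, 19)]
outputs = ["snl", "saturday", "night", "live"]
tokens = assign_offsets(matched, outputs)

# Rule: saturday-night live => snl, saturday night live
hyphen_matched = [("saturday-night", 0, 14), ("live", 15, 19)]
hyphen_tokens = assign_offsets(hyphen_matched, outputs)
```

With the first rule, "saturday", "night" and "live" keep their tokenizer 
offsets and only "snl" spans 0-19. With the hyphenated rule, only "live" 
has an exact match, so "saturday" and "night" still end up with 0-19, 
which is one of the gotchas: the result depends on how the tokenizer 
happened to split the input. Repeated terms in the input are another case 
this sketch handles only by first-come matching.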

wanna open a feature request in jira?




-Hoss
