: on indexing these are passed through a synonym filter that has this line:
:   saturday night live => snl, saturday night live
:
: i now end up with four tokens:
:   [saturday, 0, 19], [snl, 0, 19], [night, 0, 19], [live, 0, 19]
:
: what i want is:
:   [saturday, 0, 8], [snl, 0, 19], [night, 9, 14], [live, 15, 19]

Hmmm ... I don't think there's any way to make SynonymFilterFactory do that.

The problem is that while in your case it seems obvious that the final "saturday" token should get the same start/end offset info as the initial "saturday" token from the tokenizer, in the general case there isn't a one-to-one mapping. The synonym file could just as easily look like...

    saturday-night live => snl, saturday night live

...in which case it would have no idea what offsets to apply to "saturday".

That said: I suppose it could make sense to put in explicit logic so that if a multi-token input produces multi-token output, and some of the output tokens have the same char buffer as tokens from the input, then they get identical offsets ... but I haven't thought it through enough to be sure there wouldn't be any nasty gotchas.

Wanna open a feature request in Jira?

-Hoss
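
For illustration only, here is a rough sketch of the "same text gets the same offsets" idea described above. It is plain Python, not Solr or Lucene code: the `assign_offsets` helper and the `(text, start, end)` tuple layout are assumptions made up for this example. Output terms whose text matches a matched input token reuse that token's offsets, and everything else falls back to the span of the whole match.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int, int]  # (text, start_offset, end_offset)


def assign_offsets(matched: List[Token], output_terms: List[str]) -> List[Token]:
    """Hypothetical helper (not a Solr API): give offsets to synonym output terms.

    matched      -- the input tokens that matched the left side of a rule
    output_terms -- the terms on the right side of the rule
    """
    # Offsets covering the whole matched input, used as the fallback.
    span = (matched[0][1], matched[-1][2])

    # Remember each input token's offsets by text; the first occurrence wins.
    by_text: Dict[str, Tuple[int, int]] = {}
    for text, start, end in matched:
        by_text.setdefault(text, (start, end))

    result = []
    for term in output_terms:
        start, end = by_text.get(term, span)
        result.append((term, start, end))
    return result


rule_output = ["snl", "saturday", "night", "live"]

# Input "saturday night live": every output term except "snl" has a twin.
spaced = [("saturday", 0, 8), ("night", 9, 14), ("live", 15, 19)]
print(assign_offsets(spaced, rule_output))

# Input "saturday-night live": only "live" has a twin in the input.
hyphenated = [("saturday-night", 0, 14), ("live", 15, 19)]
print(assign_offsets(hyphenated, rule_output))
```

In the second case "saturday" and "night" still end up with the full 0-19 span because nothing in the input matches them, which is the ambiguity described above. A repeated input token (the same text appearing twice in one match) is another case this sketch resolves arbitrarily.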