On Thu, Feb 11, 2010 at 10:49 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> Otherwise, I'd do it via copy fields. Your first field is your main field
> and is analyzed as before. Your second field does the profanity detection
> and simply outputs a single token at the end, safe/unsafe.
>
> How long are your documents? The extra copy field is extra work, but in this
> case it should be fast as you should be able to create a pretty streamlined
> analyzer chain for the second task.
>
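The documents are web page text, so they generally shouldn't be more than
10-20k. Would something like this do the trick?

    // True once the single y/n token has been emitted.
    private boolean done = false;

    @Override
    public boolean incrementToken() throws IOException {
      if (done) {
        return false; // end of stream: the flag token was already emitted
      }
      done = true;
      // Consume the whole input stream, looking for a profane term.
      while (input.incrementToken()) {
        if (profanities.contains(termAtt.termBuffer(), 0, termAtt.termLength())) {
          termAtt.setTermBuffer("y", 0, 1);
          return true; // emit "y" as the only token
        }
      }
      termAtt.setTermBuffer("n", 0, 1);
      return true; // emit "n" as the only token
    }

mike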
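
P.S. For anyone who wants to try the single-token idea without a Lucene setup, here is a rough stand-alone model of it in plain Java. Everything in it is an assumption made for illustration: `SingleFlagStream` is a hypothetical class, not part of Lucene, an `Iterator<String>` stands in for the upstream token stream, and a `Set<String>` stands in for the profanity set. It only mirrors the contract that a consumer sees a token when `incrementToken()` returns true and stops when it returns false.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

/** Hypothetical stand-in for the filter: collapses a token stream into one flag token. */
class SingleFlagStream {
    private final Iterator<String> input;   // stands in for the upstream TokenStream
    private final Set<String> profanities;  // stands in for the profanity set
    private boolean done = false;           // true once the flag token was emitted
    private String term;                    // the current term, "y" or "n"

    SingleFlagStream(Iterator<String> input, Set<String> profanities) {
        this.input = input;
        this.profanities = profanities;
    }

    /** Returns true exactly once (with term set to "y" or "n"), then false. */
    boolean incrementToken() {
        if (done) {
            return false;
        }
        done = true;
        term = "n";
        // Scan the input; stop at the first profane token.
        while (input.hasNext()) {
            if (profanities.contains(input.next())) {
                term = "y";
                break;
            }
        }
        return true;
    }

    String term() {
        return term;
    }
}
```

A caller would loop on `incrementToken()` as usual and read `term()` after each true result; with this model that loop body runs once per document.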