On Thu, Feb 11, 2010 at 10:49 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> Otherwise, I'd do it via copy fields.  Your first field is your main field 
> and is analyzed as before.  Your second field does the profanity detection 
> and simply outputs a single token at the end, safe/unsafe.
>
> How long are your documents?  The extra copy field is extra work, but in this 
> case it should be fast as you should be able to create a pretty streamlined 
> analyzer chain for the second task.
>

The documents are web page text, so they shouldn't be more than 10-20k
generally.  Would something like this do the trick?

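  private boolean done = false;  // reset this in reset() if the stream is reused

  @Override
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;  // the single y/n token has already been emitted
    }
    done = true;
    while (input.incrementToken()) {
      if (profanities.contains(termAtt.termBuffer(), 0, termAtt.termLength())) {
          termAtt.setTermBuffer("y", 0, 1);
          return true;  // true means a token is available to the consumer
      }
    }
    termAtt.setTermBuffer("n", 0, 1);
    return true;
  }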
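
One thing worth checking against the TokenStream contract: incrementToken()
returns true when a token is available and false at end of stream, so a
filter that emits one summary token has to return true exactly once.  Here is
a rough sketch of just that control flow with no Lucene types involved.
ProfanityFlagStream and everything in it are hypothetical stand-ins for
illustration, not Lucene API:

```java
import java.util.Iterator;
import java.util.Set;

// Hypothetical stand-in for the filter: plain Java, only the control flow.
final class ProfanityFlagStream {
  private final Iterator<String> input;   // plays the role of the upstream TokenStream
  private final Set<String> profanities;  // plays the role of the profanity set
  private boolean done = false;           // true once the summary token was emitted
  private String term;                    // plays the role of the term attribute

  ProfanityFlagStream(Iterable<String> tokens, Set<String> profanities) {
    this.input = tokens.iterator();
    this.profanities = profanities;
  }

  // Mirrors incrementToken(): returns true exactly once, with term set to "y" or "n".
  boolean incrementToken() {
    if (done) {
      return false;  // end of stream
    }
    done = true;
    term = "n";
    while (input.hasNext()) {
      if (profanities.contains(input.next())) {
        term = "y";
        break;  // the first hit is enough, skip the rest of the input
      }
    }
    return true;
  }

  String term() {
    return term;
  }
}
```

Calling incrementToken() on a stream over "well", "darn", "it" with "darn" in
the set would yield one token "y" and then false; a stream with no matches
would yield one token "n" and then false.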

mike
