On Jan 28, 2010, at 4:46 PM, Mike Perham wrote:

> We'd like to implement a profanity detector for documents during indexing. That is, given a file of profane words, we'd like to be able to mark a document as safe or not safe if it contains any of those words, so that we can have something similar to Google's safe search.
>
> I'm trying to figure out how best to implement this with Solr 1.4:
>
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe" boolean field, but it requires me to pull out the content, tokenize it, and run each token through my set of profanities, essentially running the analysis pipeline again. That's a lot of overhead AFAIK.
>
> - A TokenFilter would allow me to tap into the existing analysis pipeline, so I get the tokens for free, but I can't access the document.
>
> Any suggestions on how best to implement this?
TeeSinkTokenFilter (Lucene only) would do the trick if you're up for some hardcoding, because it isn't supported all that well in Solr (patch welcome). A one-off solution shouldn't be too hard to wedge in, but it will involve hardcoding some field names in your analyzer, I think.

Otherwise, I'd do it via copy fields. Your first field is your main field and is analyzed as before. Your second field does the profanity detection and simply outputs a single token at the end: safe or unsafe. How long are your documents? The extra copy field is extra work, but in this case it should be fast, since you should be able to create a pretty streamlined analyzer chain for the second task.

Short term, I'd do the copy field approach, while maybe, depending on its importance to you, working on the TeeSinkTokenFilter approach.

-Grant
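For the copy field route, the key piece is a filter at the end of the second field's analyzer chain that drains the whole token stream and emits a single summary token. Here's a minimal sketch, assuming the Lucene 2.9-era TokenStream attribute API that ships with Solr 1.4; the class name, constructor, and word-list handling are just illustrative, not anything that exists in Lucene or Solr:

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical filter: consumes every token from the upstream chain and
// emits exactly one token, "safe" or "unsafe", for the whole field value.
public final class ProfanityFlagFilter extends TokenFilter {
  private final Set<String> profanities;   // lowercased word list, however you load it
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private boolean done = false;

  public ProfanityFlagFilter(TokenStream input, Set<String> profanities) {
    super(input);
    this.profanities = profanities;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;
    }
    boolean unsafe = false;
    // Drain the upstream tokens, checking each one against the word list.
    while (input.incrementToken()) {
      if (profanities.contains(termAtt.term().toLowerCase())) {
        unsafe = true;
      }
    }
    // Emit the single summary token and stop.
    clearAttributes();
    termAtt.setTermBuffer(unsafe ? "unsafe" : "safe");
    done = true;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
  }
}

You'd still need to wrap it in a Solr TokenFilterFactory and wire up a separate fieldType plus a copyField in schema.xml so the flag lands in its own indexed field, but the filter itself stays small, and the upstream chain for that field can be as lean as a tokenizer plus lowercasing.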