You could have a synonym file that, for each dirty word, changes the word into an "impossible word": for example, xyzzy. Then, a search for clean content is:

    (user search) AND NOT xyzzy
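Something like this, assuming the stock SynonymFilterFactory in Solr 1.4 (the word list, field type, and file names below are just placeholders, untested):

    # synonyms.txt - rewrite each dirty word to the impossible token
    badword1 => xyzzy
    badword2 => xyzzy

    <!-- schema.xml: apply the mapping at index time -->
    <fieldType name="text_safe" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldType>

Every document containing a dirty word then gets xyzzy in its indexed terms, and the NOT clause above filters those documents out.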
A synonym filter that included payloads would be cool.

On Thu, Jan 28, 2010 at 2:31 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> How about this crazy idea - a custom TokenFilter that stores the safe flag in
> ThreadLocal?
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
> ----- Original Message ----
>> From: Mike Perham <mper...@onespot.com>
>> To: solr-user@lucene.apache.org
>> Sent: Thu, January 28, 2010 4:46:54 PM
>> Subject: implementing profanity detector
>>
>> We'd like to implement a profanity detector for documents during indexing.
>> That is, given a file of profane words, we'd like to be able to mark a
>> document as safe or not safe if it contains any of those words, so that we
>> can have something similar to Google's safe search.
>>
>> I'm trying to figure out how best to implement this with Solr 1.4:
>>
>> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
>> boolean field, but it requires me to pull out the content, tokenize it, and
>> run each token through my set of profanities, essentially running the
>> analysis pipeline again. That's a lot of overhead, AFAIK.
>>
>> - A TokenFilter would allow me to tap into the existing analysis pipeline,
>> so I get the tokens for free, but I can't access the document.
>>
>> Any suggestions on how best to implement this?
>>
>> Thanks in advance,
>> mike

--
Lance Norskog
goks...@gmail.com
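A rough sketch of the ThreadLocal idea above, against the Lucene 2.9 TokenStream API that ships with Solr 1.4 (the class name and the helper method are made up, untested):

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    /**
     * Passes every token through unchanged, but remembers (per thread)
     * whether any profane term was seen in the stream.
     */
    public final class ProfanityMarkerFilter extends TokenFilter {

      // Per-thread flag: did the current document contain a profane term?
      private static final ThreadLocal<Boolean> UNSAFE =
          new ThreadLocal<Boolean>() {
            @Override protected Boolean initialValue() { return Boolean.FALSE; }
          };

      private final Set<String> profanities; // lowercased profane words
      private final TermAttribute termAtt;

      public ProfanityMarkerFilter(TokenStream input, Set<String> profanities) {
        super(input);
        this.profanities = profanities;
        this.termAtt = addAttribute(TermAttribute.class);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        if (profanities.contains(termAtt.term())) {
          UNSAFE.set(Boolean.TRUE); // mark, but leave the token untouched
        }
        return true;
      }

      /** Read and reset the flag; call once per document, same thread. */
      public static boolean wasUnsafeAndReset() {
        boolean unsafe = UNSAFE.get();
        UNSAFE.set(Boolean.FALSE);
        return unsafe;
      }
    }

The catch is ordering: an UpdateRequestProcessor's processAdd() runs before the IndexWriter analyzes the fields, so the flag is only set after the document has already been handed off - whatever reads it and sets the "safe" field would need to run after analysis, which is what makes this a crazy idea rather than a drop-in solution.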