On Jan 28, 2010, at 4:46 PM, Mike Perham wrote:

> We'd like to implement a profanity detector for documents during indexing. That is, given a file of profane words, we'd like to be able to mark a document as safe or not safe if it contains any of those words, so that we can have something similar to Google's safe search.
>
> I'm trying to figure out how best to implement this with Solr 1.4:
>
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe" boolean field, but it requires me to pull out the content, tokenize it, and run each token through my set of profanities, essentially running the analysis pipeline again. That's a lot of overhead AFAIK.
>
> - A TokenFilter would allow me to tap into the existing analysis pipeline, so I get the tokens for free, but I can't access the document.
>
> Any suggestions on how best to implement this?
TeeSinkTokenFilter (Lucene only) would do the trick if you're up for some hardcoding, because it isn't supported all that well in Solr (patch welcome). A one-off solution shouldn't be too hard to wedge in, but it will involve hardcoding some field names in your analyzer, I think.

Otherwise, I'd do it via copy fields. Your first field is your main field and is analyzed as before. Your second field does the profanity detection and simply outputs a single token at the end: safe or unsafe. How long are your documents? The extra copy field is extra work, but in this case it should be fast, since you should be able to create a pretty streamlined analyzer chain for the second task.

Short term, I'd do the copy field approach, while maybe, depending on its importance to you, working on the TeeSinkTokenFilter approach.

-Grant
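For the copy field route, the key piece is a filter at the end of the second field's analyzer chain that drains the whole token stream and emits a single summary token. Here's a minimal sketch, assuming the Lucene 2.9-era TokenStream attribute API that ships with Solr 1.4; the class name, constructor, and word-list handling are just illustrative, not anything that exists in Lucene or Solr:

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical filter: consumes every token from the upstream chain and
// emits exactly one token, "safe" or "unsafe", for the whole field value.
public final class ProfanityFlagFilter extends TokenFilter {
  private final Set<String> profanities;   // lowercased word list, however you load it
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private boolean done = false;

  public ProfanityFlagFilter(TokenStream input, Set<String> profanities) {
    super(input);
    this.profanities = profanities;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;
    }
    boolean unsafe = false;
    // Drain the upstream tokens, checking each one against the word list.
    while (input.incrementToken()) {
      if (profanities.contains(termAtt.term().toLowerCase())) {
        unsafe = true;
      }
    }
    // Emit the single summary token and stop.
    clearAttributes();
    termAtt.setTermBuffer(unsafe ? "unsafe" : "safe");
    done = true;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
  }
}

You'd still need to wrap it in a Solr TokenFilterFactory and wire up a separate fieldType plus a copyField in schema.xml so the flag lands in its own indexed field, but the filter itself stays small, and the upstream chain for that field can be as lean as a tokenizer plus lowercasing.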