implementing profanity detector

2010-01-28 Thread Mike Perham
We'd like to implement a profanity detector for documents during indexing. That is, given a file of profane words, we'd like to be able to mark a document as safe or not safe if it contains any of those words so that we can have something similar to Google's safe search. I'm trying to figure out

implementing profanity detector

2010-02-10 Thread Mike Perham
on how to implement this efficiently with Lucene/Solr.

mike

On Thu, Jan 28, 2010 at 4:31 PM, Otis Gospodnetic wrote:
>
> How about this crazy idea - a custom TokenFilter that stores the safe flag in
> ThreadLocal?
>
>
>
> ----- Original Message
> > From: M
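
For readers following the thread, the ThreadLocal suggestion might look roughly like the sketch below. It is plain Java and deliberately does not use the Lucene TokenFilter API; the class name SafeFlagFilter, its methods, and the blocked-word set are all hypothetical names made up for illustration. The idea is only that the filter passes tokens through unchanged and, as a side effect, flips a per-thread flag that the indexing code could read once analysis of the document is finished.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a custom token filter; NOT the Lucene API.
final class SafeFlagFilter {
    // One flag per indexing thread; stays true until a blocked word is seen.
    static final ThreadLocal<Boolean> SAFE =
            ThreadLocal.withInitial(() -> Boolean.TRUE);

    // Assumed to contain lowercase words loaded from the profanity file.
    private final Set<String> blocked;

    SafeFlagFilter(Set<String> blocked) {
        this.blocked = blocked;
    }

    // Call before each document so an earlier result does not leak through.
    static void reset() {
        SAFE.set(Boolean.TRUE);
    }

    // Returns the token unchanged; marks the document unsafe on a match.
    String filter(String token) {
        if (blocked.contains(token.toLowerCase(Locale.ROOT))) {
            SAFE.set(Boolean.FALSE);
        }
        return token;
    }
}
```

The caller would invoke reset(), run every token of the document through filter(), and then read SAFE.get() on the same thread. This only works if analysis and the code reading the flag run on the same thread, which is an assumption of the sketch rather than something the thread confirms.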

term frequency vector access?

2010-02-11 Thread Mike Perham
In an UpdateRequestProcessor (processing an AddUpdateCommand), I have a SolrInputDocument with a field 'content' that has termVectors="true" in schema.xml. Is it possible to get access to that field's term vector in the URP?

Re: implementing profanity detector

2010-02-12 Thread Mike Perham
On Thu, Feb 11, 2010 at 10:49 AM, Grant Ingersoll wrote:
>
> Otherwise, I'd do it via copy fields.  Your first field is your main field
> and is analyzed as before.  Your second field does the profanity detection
> and simply outputs a single token at the end, safe/unsafe.
>
> How long are your
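
As a rough illustration of the "second field" idea quoted above, the sketch below reduces a whole field value to a single token, safe or unsafe. It is plain Java rather than a real Solr analyzer; the class name SafetyTokenizer, the method singleToken, and the word-splitting rule are assumptions made for the example, and the blocked set is assumed to hold lowercase words.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; a real setup would do this inside the second
// field's analysis chain instead of a static method.
final class SafetyTokenizer {
    // Collapses the field text to one token: "unsafe" if any word is
    // in the blocked set, otherwise "safe".
    static String singleToken(String text, Set<String> blocked) {
        // Split on anything that is not a letter or digit (an assumption).
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String word : words) {
            if (blocked.contains(word)) {
                return "unsafe";
            }
        }
        return "safe";
    }
}
```

With the copied field indexed as that one token, a safe-search style restriction would become an ordinary filter on the second field, while the main field stays analyzed as before.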