We'd like to implement a profanity detector for documents during indexing.
That is, given a file of profane words, we'd like to mark a
document as safe or not safe depending on whether it contains any of those
words, so that we can offer something similar to Google's safe search.
I'm trying to figure out how to implement this efficiently with Lucene/Solr.
mike
On Thu, Jan 28, 2010 at 4:31 PM, Otis Gospodnetic wrote:
>
> How about this crazy idea - a custom TokenFilter that stores the safe flag in
> a ThreadLocal?
>
>
>
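For what it's worth, here is a rough sketch of the ThreadLocal idea quoted above. It is deliberately dependency-free: a plain Iterator<String> stands in for the token stream, so this is not the Lucene TokenFilter API, and every name in it (ProfanityFlaggingFilter, isSafe, the sample word list) is made up for illustration. It also assumes the word list is stored lowercase.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a token filter: passes tokens through unchanged
// and records in a per-thread flag whether any of them was on the word list.
final class ProfanityFlaggingFilter implements Iterator<String> {
    // One flag per thread, so concurrent indexing threads do not share state.
    private static final ThreadLocal<Boolean> UNSAFE =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    private final Iterator<String> input;
    private final Set<String> profaneWords; // assumed to hold lowercase words

    ProfanityFlaggingFilter(Iterator<String> input, Set<String> profaneWords) {
        this.input = input;
        this.profaneWords = profaneWords;
    }

    // Must be called before each document, on the thread that analyzes it.
    static void reset() { UNSAFE.set(Boolean.FALSE); }

    // Must be read on the same thread that consumed the tokens.
    static boolean sawProfanity() { return UNSAFE.get(); }

    @Override public boolean hasNext() { return input.hasNext(); }

    @Override public String next() {
        String token = input.next();
        if (profaneWords.contains(token.toLowerCase(Locale.ROOT))) {
            UNSAFE.set(Boolean.TRUE);
        }
        return token;
    }

    // Convenience helper: reset, drain the stream, then read the flag.
    static boolean isSafe(List<String> tokens, Set<String> profaneWords) {
        reset();
        Iterator<String> filtered =
                new ProfanityFlaggingFilter(tokens.iterator(), profaneWords);
        while (filtered.hasNext()) {
            filtered.next();
        }
        return !sawProfanity();
    }
}

final class ThreadLocalFlagDemo {
    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("darn", "heck"));
        System.out.println(ProfanityFlaggingFilter.isSafe(
                Arrays.asList("hello", "world"), words));
        System.out.println(ProfanityFlaggingFilter.isSafe(
                Arrays.asList("what", "the", "Heck"), words));
    }
}
```

The catch with this approach is that the flag is only meaningful if it is reset before each document and read on the same thread that ran the analysis.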
> ----- Original Message
> > From: M
In an UpdateRequestProcessor (processing an AddUpdateCommand), I have
a SolrInputDocument with a field 'content' that has termVectors="true"
in schema.xml. Is it possible to get access to that field's term
vector in the URP?
On Thu, Feb 11, 2010 at 10:49 AM, Grant Ingersoll wrote:
>
> Otherwise, I'd do it via copy fields. Your first field is your main field
> and is analyzed as before. Your second field does the profanity detection
> and simply outputs a single token at the end, safe/unsafe.
>
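To make the second-field idea above concrete, here is a minimal sketch of the "consume everything, emit one token" step. It is plain Java rather than a real Lucene analysis chain, and the names (SafeFlagAnalysis, analyze, the sample word list) are hypothetical. It assumes the word list is stored lowercase.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the analysis of the second (copied) field:
// the incoming tokens are consumed and replaced by one token, "safe" or "unsafe".
final class SafeFlagAnalysis {
    static List<String> analyze(Iterable<String> tokens, Set<String> profaneWords) {
        boolean unsafe = false;
        for (String token : tokens) {
            if (profaneWords.contains(token.toLowerCase(Locale.ROOT))) {
                unsafe = true;
                break; // one hit is enough to decide
            }
        }
        // The field ends up holding exactly one token.
        return Collections.singletonList(unsafe ? "unsafe" : "safe");
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("darn", "heck"));
        System.out.println(analyze(Arrays.asList("hello", "world"), words));
        System.out.println(analyze(Arrays.asList("oh", "Darn", "it"), words));
    }
}
```

With the flag indexed as a single token, restricting results to safe documents becomes an ordinary filter on that field, for example fq=safety:safe if the second field were named safety (a made-up name).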
> How long are your