One problem is that your profanity list will never stop growing, and with
each new word you will want to rescrub the index.
We had a thousand-word NOT clause in every query (a filter query would
be true for 99% of the index) until we switched to another
arrangement.
Another small problem was that I k
: Otherwise, I'd do it via copy fields. Your first field is your main
: field and is analyzed as before. Your second field does the profanity
: detection and simply outputs a single token at the end, safe/unsafe.
you don't even need custom code for this ... copyField all your text into
a 'ha
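
For illustration only, here is a minimal Python sketch of the two-field idea quoted above: one field keeps the normally analyzed tokens, the other collapses the same text to a single safe/unsafe token. It does not use Solr or Lucene; the word list, the field names "body" and "safety", and the regex tokenizer are all assumptions made for the example.

```python
import re

# Hypothetical word list; a real deployment would load this from a file.
PROFANE_WORDS = {"darn", "heck"}


def analyze_main(text):
    """Stand-in for the main field's analyzer: lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def analyze_flag(text):
    """Stand-in for the second field: collapse the token stream to one token."""
    tokens = analyze_main(text)
    return ["unsafe" if any(t in PROFANE_WORDS for t in tokens) else "safe"]


def index_document(text):
    # Both fields are derived from the same source text, as a copy field would do.
    # The field names are hypothetical.
    return {"body": analyze_main(text), "safety": analyze_flag(text)}
```

A query for clean documents would then only need to match the single token in the flag field, instead of carrying the whole word list.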
On Thu, Feb 11, 2010 at 10:49 AM, Grant Ingersoll wrote:
>
> Otherwise, I'd do it via copy fields. Your first field is your main field
> and is analyzed as before. Your second field does the profanity detection
> and simply outputs a single token at the end, safe/unsafe.
>
> How long are your
On Jan 28, 2010, at 4:46 PM, Mike Perham wrote:
> We'd like to implement a profanity detector for documents during indexing.
> That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something si
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
https://issues.apache.org/jira/browse/SOLR-1536
On Fri, Jan 29, 2010 at 12:46 AM, Mike Perham wrote:
> We'd like to implement a profanity detector for docume
You could have a synonym file that, for each dirty word, changes the
word into an "impossible word": for example, xyzzy. Then, a search for
clean contents is:
(user search) AND NOT xyzzy
A synonym filter that included payloads would be cool.
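
A rough Python sketch of the synonym trick, assuming a toy in-memory index rather than Solr itself: every dirty word is rewritten to the sentinel xyzzy at index time, and the clean search excludes any document containing the sentinel. The word list and helper names are hypothetical.

```python
import re

SENTINEL = "xyzzy"              # the "impossible word" from the message above
DIRTY_WORDS = {"darn", "heck"}  # hypothetical synonym-file contents


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def apply_synonyms(tokens):
    """Replace each dirty word with the sentinel, leave other tokens alone."""
    return [SENTINEL if t in DIRTY_WORDS else t for t in tokens]


def build_index(docs):
    """docs maps doc id -> text; returns doc id -> set of indexed terms."""
    return {doc_id: set(apply_synonyms(tokenize(text)))
            for doc_id, text in docs.items()}


def search_clean(index, term):
    """Toy equivalent of: (term) AND NOT xyzzy"""
    term = term.lower()
    return sorted(doc_id for doc_id, terms in index.items()
                  if term in terms and SENTINEL not in terms)
```

With this arrangement the query carries one excluded term regardless of how long the word list grows, although adding a word still requires reindexing the affected documents.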
On Thu, Jan 28, 2010 at 2:31 PM, Otis Gospodnetic
wro
How about this crazy idea - a custom TokenFilter that stores the safe flag in
ThreadLocal?
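
To make the idea concrete, a small Python sketch (not Lucene's TokenFilter API, just the same shape): a pass-through filter records in thread-local state whether it saw a profane token, and the indexing code reads the flag after the stream has been consumed. The word list and function names are assumptions for the example.

```python
import threading

_state = threading.local()          # one flag per indexing thread
PROFANE_WORDS = {"darn", "heck"}    # hypothetical word list


def profanity_filter(tokens):
    """Pass tokens through unchanged, noting in thread-local state
    whether any of them was profane."""
    _state.safe = True
    for token in tokens:
        if token.lower() in PROFANE_WORDS:
            _state.safe = False
        yield token


def analyze(text):
    # The flag is only meaningful once the whole stream has been consumed,
    # and only on the thread that consumed it.
    tokens = list(profanity_filter(text.split()))
    return tokens, _state.safe
```

The catch is the one noted in the comment: the code that builds the document has to run on the same thread as the analysis and read the flag only after analysis has finished.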
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
----- Original Message -----
> From: Mike Perham
> To: solr-user@lucene.apache.or