Re: stopwords as privacy measure

2012-01-10 Thread Michael Lissner
It's a bit of a privacy through obscurity measure, unfortunately. The problem is that American courts do a lousy job of removing social security numbers from cases that I put on my site. I do anonymization before sending the cases to Solr, but if you're clever (and the stopwords weren't in plac

Re: stopwords as privacy measure

2012-01-09 Thread Erik Hatcher
Mike - Indeed users won't be able to *search* for things removed by the stop filter at index time (the terms literally aren't in the index then). But be careful with the stored value. Analysis does not affect stored content. Are you anonymizing before sending to Solr (if so, why stop-word blo
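
[Illustration] The distinction drawn above, between what is indexed and what is stored, can be sketched with a toy index in Python. This is not Solr code: the names `STOPWORDS`, `ToyIndex`, `add`, and `search` are hypothetical, and the whitespace tokenizer is a simplifying assumption. The sketch only shows that a stop filter applied during analysis keeps a term out of the searchable index, while the stored text still contains it.

```python
from collections import defaultdict

# Hypothetical stop list; in this thread the blocked terms would be sensitive tokens.
STOPWORDS = {"secret"}


class ToyIndex:
    """Toy stand-in for a search index with separate indexed and stored data."""

    def __init__(self, stopwords):
        self.stopwords = stopwords
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        self.stored = {}                  # doc id -> original text, unanalyzed

    def analyze(self, text):
        # Simplified analysis chain: lowercase, split on whitespace, drop stopwords.
        return [t for t in text.lower().split() if t not in self.stopwords]

    def add(self, doc_id, text):
        self.stored[doc_id] = text        # the stored value bypasses analysis
        for term in self.analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Query-time analysis applies the same stop filter.
        terms = self.analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = ToyIndex(STOPWORDS)
index.add(1, "the secret value")
print(index.search("secret"))  # no hits: the term was never indexed
print(index.search("value"))   # the document is still findable by other terms
print(index.stored[1])         # the stored text still contains the blocked term
```

In this toy model, blocking a term from search does nothing to the text returned with a hit, which is why the stored value needs its own anonymization step.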

Re: stopwords as privacy measure

2012-01-08 Thread Michael Lissner
I've got them configured at index and query time, so sounds like I'm all set. I'm doing anonymization of social security numbers, converting them to xxx-xx-. I don't *think* users can find a way of identifying these docs if the stopwords-based block works. Thank you both for the confirma
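
[Illustration] The anonymization step mentioned above could look roughly like the following Python sketch. It is an assumption, not the author's actual code: the pattern only covers the dashed NNN-NN-NNNN layout, and the function name `mask_ssns` and its default mask string are hypothetical, since the mask in the message above is cut off.

```python
import re

# Matches only the common dashed form NNN-NN-NNNN; other layouts are not handled.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssns(text, mask="xxx-xx-xxxx"):
    """Replace every dashed SSN-like number in text with a fixed mask.

    The default mask is an assumed placeholder, not taken from the thread.
    """
    return SSN_PATTERN.sub(mask, text)


print(mask_ssns("Defendant SSN 123-45-6789 was listed."))
```

Running this before documents are sent to the index means both the indexed terms and the stored text carry the mask rather than the original number, provided the number appears in the layout the pattern expects.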

Re: stopwords as privacy measure

2012-01-08 Thread Gora Mohanty
On Mon, Jan 9, 2012 at 5:03 AM, Michael Lissner wrote:
> I have a unique use case where I have words in my corpus that users
> shouldn't ever be allowed to search for. My theory is that if I add these to
> the stopwords list, that should do the trick.
Yes, that should work. Are you including the

Re: stopwords as privacy measure

2012-01-08 Thread Ted Dunning
On Sun, Jan 8, 2012 at 3:33 PM, Michael Lissner < mliss...@michaeljaylissner.com> wrote:
> I have a unique use case where I have words in my corpus that users
> shouldn't ever be allowed to search for. My theory is that if I add these
> to the stopwords list, that should do the trick.
>
That shou

stopwords as privacy measure

2012-01-08 Thread Michael Lissner
I have a unique use case where I have words in my corpus that users shouldn't ever be allowed to search for. My theory is that if I add these to the stopwords list, that should do the trick. I'm using the edismax parser and it seems to be working in my dev environment. Is there any risk to thi