Re: Multiple passes with WordDelimiterFilterFactory

2010-08-30 Thread Shawn Heisey
On 8/30/2010 9:01 AM, Shawn Heisey wrote: On 8/29/2010 2:17 PM, Erick Erickson wrote: <<>> Try putting this after any instances of, say, WhiteSpaceTokenizerFactory in your analyzer definition, and I believe you'll see that this is not true. At least looking at this in the analysis page from S

Re: Multiple passes with WordDelimiterFilterFactory

2010-08-30 Thread Shawn Heisey
On 8/29/2010 2:17 PM, Erick Erickson wrote: <<>> Try putting this after any instances of, say, WhiteSpaceTokenizerFactory in your analyzer definition, and I believe you'll see that this is not true. At least looking at this in the analysis page from SOLR admin sure doesn't seem to support that

Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Erick Erickson
There's nothing built into SOLR that I know of that'll deal with auto-detecting multiple languages and "doing the right thing". I know there's been discussion of that, searching the users' list might help... You may have to write your own analyzer that tries to do this, but I have no clue how you'd

Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Shawn Heisey
Thank you for taking the time to help. The way I've got the word delimiter index filter set up with only one pass, "wolf-biederman" will result in wolf, biederman, wolfbiederman, and wolf-biederman. With two passes, the last one is not present. One pass changes "gremlin's" to gremlin and gr
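The one-pass versus two-pass difference described above can be reproduced with a toy model. This is only a sketch, not Solr's WordDelimiterFilter: the functions `word_delimiter` and `run_passes` are hypothetical, the model only splits on non-alphanumeric characters and catenates the parts, and it assumes (the message does not say) that the second pass does not preserve the original token.

```python
import re

def word_delimiter(token, preserve_original):
    """Toy word-delimiter pass (hypothetical, NOT Solr's implementation).

    Splits on non-alphanumeric characters, emits each part, then the
    catenated form, then optionally the untouched original token.
    Possessive handling is deliberately left out.
    """
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    out = list(parts)
    if len(parts) > 1:
        out.append("".join(parts))  # catenated form, e.g. "wolfbiederman"
    if preserve_original and token not in out:
        out.append(token)
    return out

def run_passes(token, preserve_flags):
    """Feed the output of each pass into the next, dropping duplicates."""
    tokens = [token]
    for preserve in preserve_flags:
        next_tokens = []
        for t in tokens:
            for emitted in word_delimiter(t, preserve):
                if emitted not in next_tokens:
                    next_tokens.append(emitted)
        tokens = next_tokens
    return tokens

# Assumption: the first pass preserves the original, the second does not.
one_pass = run_passes("wolf-biederman", [True])
two_pass = run_passes("wolf-biederman", [True, False])
print(one_pass)
print(two_pass)
```

In this model the second pass receives "wolf-biederman" as an ordinary input token and splits it again, so the hyphenated form survives only if the last pass also preserves originals.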

Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Erick Erickson
Look at the tokenizer/filter chain that makes up your analyzers, and see: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for other tokenizer/analyzer/filter options. You're on the right track looking at the various choices provided, and I suspect you'll find what you need... Be a l

Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Shawn Heisey
On 8/28/2010 7:59 PM, Shawn Heisey wrote: The only drop in term quality that I noticed was that possessive words (apostrophe-s) no longer have the original preserved. I haven't yet decided whether that's a problem. I finally did notice another drop in term quality from the dual pass - words

Re: Multiple passes with WordDelimiterFilterFactory

2010-08-29 Thread Shawn Heisey
It's metadata for a collection of 45 million documents that is mostly photos, with some videos and text. The data is imported from a MySQL database and split among six large shards (each nearly 13GB) and a small shard with data added in the last week. That works out to between 300,000 and 50

Re: Multiple passes with WordDelimiterFilterFactory

2010-08-28 Thread Shawn Heisey
It's metadata for a collection of 45 million documents that is mostly photos, with some videos and text. The data is imported from a MySQL database and split among six large shards (each nearly 13GB) and a small shard with data added in the last week, which usually works out to between 300,000

Re: Multiple passes with WordDelimiterFilterFactory

2010-08-27 Thread Erick Erickson
I agree with Markus, the usefulness of passing through WDF twice is suspect. You can always do a copyField to a completely different field and do whatever you want there; copyField forks the raw input to the second field, not the analyzed stream... What is it you're really trying to accomplish? Yo
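The point about copyField forking the raw input can be pictured with a small model. This is a sketch under stated assumptions, not Solr code: `index_document`, the field names `caption` and `caption_alt`, and both analyzer functions are hypothetical and exist only for illustration.

```python
def whitespace_lower(text):
    """Hypothetical analyzer: split on whitespace and lowercase."""
    return text.lower().split()

def whitespace_only(text):
    """Hypothetical analyzer: split on whitespace, keep case."""
    return text.split()

def index_document(doc, copy_fields, analyzers):
    """Copy RAW source values to destination fields, then analyze each
    field with its own analyzer. The copy happens before any analysis."""
    raw = dict(doc)
    for source, dest in copy_fields:
        raw[dest] = doc[source]  # raw text, not the source field's tokens
    return {field: analyzers[field](value) for field, value in raw.items()}

indexed = index_document(
    {"caption": "Wolf-Biederman Impact"},
    copy_fields=[("caption", "caption_alt")],
    analyzers={"caption": whitespace_lower, "caption_alt": whitespace_only},
)
print(indexed)
```

Because the destination field sees the original text, its analyzer chain is independent of whatever the source field's chain does.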

Re: Multiple passes with WordDelimiterFilterFactory

2010-08-27 Thread Markus Jelsma
It's just a configured filter, so you should be able to define it twice. Have you tried it? But it might be tricky: the output from the first will be the input of the second, so I doubt the usefulness of this approach. On Thursday 26 August 2010 17:45:45 Shawn Heisey wrote: > Can I pass my dat