It's metadata for a collection of 45 million documents, mostly photos, with some videos and text. The data is imported from a MySQL database and split among six large shards (each nearly 13GB) and one small shard holding data added in the last week, which usually works out to between 300,000 and 500,000 documents.

My goal is to reduce the index size without reducing the functionality. Using copyField would just make it larger.

The biggest issue to solve is making sure that I don't end up with two terms whenever there's a punctuation character at the beginning or end of a word. For instance, one chunk of text that I just analyzed ends up with terms like the following, which are unneeded duplicates:

championship.
championship
'04
04
wisconsin.
wisconsin

Since I was already toying around, I just tested the whole notion with the analysis tool. I configured two filter steps: the first with just generateWordParts and catenateWords enabled, the second with all the options including preserveOriginal enabled. A test analysis of input with 59 whitespace-separated words produced 93 terms with the single filter and 77 with the two-filter chain. The only drop in term quality that I noticed was that possessive words (apostrophe-s) no longer have the original preserved. I haven't yet decided whether that's a problem.
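
In case it helps to see it concretely, here's roughly what that two-step chain would look like as a fieldType in schema.xml. The name text_wdf, the whitespace tokenizer, the lowercase filter, and the exact attribute values are placeholders based on my description above, not the exact config I tested:

  <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- First pass: only generate word parts and catenate words -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1"
              generateNumberParts="0" catenateNumbers="0"
              catenateAll="0" splitOnCaseChange="0"
              preserveOriginal="0"/>
      <!-- Second pass: everything enabled, including preserveOriginal -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1"
              catenateAll="1" splitOnCaseChange="1"
              preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>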

Shawn


On 8/27/2010 11:00 AM, Erick Erickson wrote:
I agree with Marcus, the usefulness of passing through WDF twice
is suspect. You can always do a copyField to a completely different
field and do whatever you want there; copyField forks the raw input
to the second field, not the analyzed stream...

What is it you're really trying to accomplish? Your use-case would
help us help you.

About defining things differently for index and query analysis: sure,
it can make sense. But, especially with WDF, it's tricky. Spend some
significant time in the admin analysis page looking at the effects
of various configurations before you decide.

Best
Erick
