All,

I am at a bit of a loss here, so any help would be greatly appreciated. I am using the DataImportHandler (DIH) to pull data from a database. The field I am most interested in holds anywhere from one word to several paragraphs of free text. What I would really like to do is extract phrases like "Joe's coffee shop" rather than the three individual words. I have tried the KeywordTokenizerFactory, and it mostly does what I want because it keeps the phrase together, but it does not actually tokenize anything: the whole field value comes through as a single token. That means it is not creating the tokens I need for further analysis in tools like Mahout.
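To make the difference concrete, here is a rough Python sketch of what I mean. It is not Solr code, and the function names are made up purely for illustration. It assumes a keyword-style tokenizer passes the whole input through as one token, and compares that with splitting on whitespace and then combining neighbouring words into shingles of up to 4 words, which is roughly the output I am hoping to get.

```python
def keyword_tokenize(text):
    """Mimic a keyword-style tokenizer: the entire input is one token."""
    return [text]


def whitespace_tokenize(text):
    """Split the input into word tokens on whitespace."""
    return text.split()


def shingles(tokens, min_size=2, max_size=4, output_unigrams=True):
    """Combine adjacent tokens into word n-grams (shingles).

    For each start position, emit the single token (if requested) and
    then every shingle of min_size..max_size words that fits.
    """
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


text = "Joe's coffee shop"
print(keyword_tokenize(text))
# ["Joe's coffee shop"]
print(shingles(whitespace_tokenize(text)))
# ["Joe's", "Joe's coffee", "Joe's coffee shop", 'coffee', 'coffee shop', 'shop']
```

With the keyword-style approach there is only ever one token per field value, so there is nothing for a shingle step to combine; with word tokens first, the phrase shows up alongside the individual words.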
We can play with combinations of tokenizers and filters all day long and check the results after a quick reindex. I typically just view them in Solritas as facets, which may be part of the problem too. Does anyone have an example fieldType they can share that shows how to extract phrases, where they exist, from the data I described above? Am I even going about this the right way? I am using today's trunk build of Solr, and here is what I have munged together this morning:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Thanks,
Adam