Erick, I totally understand that, BUT the KeywordTokenizerFactory does a really good job of extracting phrases (or what look like phrases) from my data. I don't know exactly why, but it does. I am going to keep working through it to see if I can figure it out ;-)

Adam

On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> The problem here is that none of the built-in filters or tokenizers
> have a prayer of recognizing what #you# think are phrases, since it'll
> be unique to your situation.
>
> If you have a list of phrases you care about, you could substitute a
> single token for the phrases you care about...
>
> But the overriding question is what determines a phrase you're
> interested in? Is it a list or is there some heuristic you want to apply?
>
> Or could you just recognize them at query time and make them into a
> literal phrase (i.e. with quotation marks)?
>
> Best
> Erick
>
> On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
> <estrada.adam.gro...@gmail.com> wrote:
> > All,
> >
> > I am at a bit of a loss here so any help would be greatly appreciated. I am
> > using the DIH to grab data from a DB. The field that I am most interested in
> > has anywhere from 1 word to several paragraphs' worth of free text. What I
> > would really like to do is pull out phrases like "Joe's coffee shop" rather
> > than the 3 individual words. I have tried the KeywordTokenizerFactory and
> > that does seem to do what I want it to do, but it is not actually tokenizing
> > anything, so it does what I want it to for the most part but it's not
> > creating the tokens that I need for further analysis in apps like Mahout.
> >
> > We can play with the combination of tokenizers and filters all day long and
> > see what the results are after a quick reindex. I typically just view them
> > in Solritas as facets, which may be the problem for me too. Does anyone have
> > an example fieldType they can share with me that shows how to
> > extract phrases, if they are there, from the data I described earlier? Am I
> > even going about this the right way? I am using today's trunk build of Solr
> > and here is what I have munged together this morning.
> >
> > <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
> >            autoGeneratePhraseQueries="true">
> >   <analyzer>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <charFilter class="solr.MappingCharFilterFactory"
> >                 mapping="mapping-ISOLatin1Accent.txt"/>
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
> >             outputUnigrams="true" outputUnigramIfNoNgram="false"/>
> >     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> >     <filter class="solr.EnglishPossessiveFilterFactory"/>
> >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >     <filter class="solr.ASCIIFoldingFilterFactory"/>
> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >     <filter class="solr.TrimFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > Thanks,
> > Adam
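
One possible reading of why the KeywordTokenizerFactory looks like it extracts phrases: it emits the entire field value as a single token, so a short value such as "Joe's coffee shop" shows up whole in a facet list, while the ShingleFilterFactory that follows it has only one token to work with and so has nothing to combine. A rough sketch of a word-tokenized alternative is below. The field type name text_shingle is made up for illustration, and the filter choices are assumptions rather than a tested configuration. Note that shingles are just runs of adjacent words, not phrases recognized by meaning, so Erick's question about what defines a phrase still applies.

```xml
<!-- Hypothetical field type; the name "text_shingle" is not from the original schema. -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Split the text into word tokens so the shingle filter has something to combine. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit runs of 2 to 4 adjacent words; set outputUnigrams="true" to keep single words too. -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

Faceting on a field of this type would list every 2- to 4-word run in the text, which can get large for multi-paragraph values, so it may be worth trying on a small sample first.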