Erick,

I totally understand that, BUT the keyword tokenizer factory does a really
good job of extracting phrases (or what look like phrases) from my data. I
don't know exactly why, but it does. I am going to keep working through it
to see if I can figure it out ;-)

Adam

On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> The problem here is that none of the built-in filters or tokenizers have a
> prayer of recognizing what #you# think are phrases, since it'll be unique
> to your situation.
>
> If you have a list of phrases you care about, you could substitute a single
> token for the phrases you care about...
>
> But the overriding question is what determines a phrase you're interested
> in? Is it a list or is there some heuristic you want to apply?
>
> Or could you just recognize them at query time and make them into a literal
> phrase (i.e. with quotation marks)?
>
> Best
> Erick
>
> On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
> <estrada.adam.gro...@gmail.com> wrote:
> > All,
> >
> > I am at a bit of a loss here, so any help would be greatly appreciated. I
> > am using the DIH to grab data from a DB. The field that I am most
> > interested in has anywhere from 1 word to several paragraphs' worth of
> > free text. What I would really like to do is pull out phrases like "Joe's
> > coffee shop" rather than the 3 individual words. I have tried the
> > KeywordTokenizerFactory, and for the most part it seems to do what I want,
> > but it is not actually tokenizing anything, so it's not creating the
> > tokens that I need for further analysis in apps like Mahout.
> >
> > We can play with the combination of tokenizers and filters all day long
> > and see what the results are after a quick reindex. I typically just view
> > them in Solritas as facets, which may be the problem for me too. Does
> > anyone have an example fieldType they can share with me that shows how to
> > extract phrases, if they are there, from the data I described earlier? Am
> > I even going about this the right way? I am using today's trunk build of
> > Solr, and here is what I have munged together this morning.
> >
> > <fieldType name="text_ws" class="solr.TextField"
> >            positionIncrementGap="100"
> >            autoGeneratePhraseQueries="true">
> >   <analyzer>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <charFilter class="solr.MappingCharFilterFactory"
> >                 mapping="mapping-ISOLatin1Accent.txt"/>
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
> >             outputUnigrams="true" outputUnigramIfNoNgram="false"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"
> >             protected="protwords.txt"/>
> >     <filter class="solr.EnglishPossessiveFilterFactory"/>
> >     <filter class="solr.EnglishMinimalStemFilterFactory"/>
> >     <filter class="solr.ASCIIFoldingFilterFactory"/>
> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >     <filter class="solr.TrimFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > Thanks,
> > Adam
> >
>
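
A note on why the field type quoted above behaves the way it does:
KeywordTokenizerFactory emits the entire field value as a single token, so
short values such as "Joe's coffee shop" come through intact and look like
extracted phrases, while StopFilterFactory and ShingleFilterFactory never see
individual words to work with. Below is a minimal sketch of a shingle-based
alternative, assuming a word-level tokenizer is acceptable for this data. The
field type name text_shingle is hypothetical, and the sketch has not been
tested against the trunk build mentioned in the thread.

```xml
<!-- Hypothetical field type name; an untested sketch, not a drop-in fix. -->
<fieldType name="text_shingle" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Split the text into word tokens first; ShingleFilter can only
         combine tokens, so it needs more than one to work with. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit 2- to 4-word shingles and drop the single words. -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="4" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

With a chain like this, a value such as "Joe's coffee shop" should yield
shingles along the lines of "joe's coffee", "coffee shop", and "joe's coffee
shop". It will also emit every other adjacent word combination in longer
text, so deciding which shingles count as real phrases still needs a list or
a heuristic, as Erick points out.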
