Re: [Free Text] Field Tokenizing

2011-06-09 Thread Erick Erickson
The KeywordTokenizer doesn't do anything to break up the input stream; it just treats the whole input to the field as a single token. So I don't think you'll be able to "extract" anything starting with that tokenizer. Look at the admin/analysis page to see a step-by-step breakdown of what your ana
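
To make the point above concrete, here is a minimal plain-Python sketch of the difference between keyword-style tokenizing (the whole field value becomes one token) and whitespace tokenizing. This is not Solr or Lucene code; the function names `keyword_tokenize` and `whitespace_tokenize` and the sample sentence are made up for illustration only.

```python
def keyword_tokenize(text):
    """Mimic keyword-style tokenizing: the entire field value is one token."""
    return [text]


def whitespace_tokenize(text):
    """For contrast: split the field value on runs of whitespace."""
    return text.split()


if __name__ == "__main__":
    # Hypothetical field value, chosen only for illustration.
    sample = "We met at Joe's coffee shop on Main Street"
    print(keyword_tokenize(sample))     # a list holding one token: the full string
    print(whitespace_tokenize(sample))  # a list holding one token per word
```

With only one token per field value, there is nothing smaller for a later filter to pick out, which is why a phrase cannot be "extracted" from it.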

Re: [Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
Erick, I totally understand that BUT the keyword tokenizer factory does a really good job extracting phrases (or what look like phrases) from my data. I don't know why exactly, but it does. I am going to continue working through it to see if I can't figure it out ;-) Adam On Thu, Jun 9

Re: [Free Text] Field Tokenizing

2011-06-09 Thread Erick Erickson
The problem here is that none of the built-in filters or tokenizers have a prayer of recognizing what #you# think are phrases, since what counts as a phrase is unique to your situation. If you have a list of phrases you care about, you could substitute a single token for each of them... But the overr
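
The "substitute a single token for a known phrase" idea above can be sketched in plain Python as a preprocessing step. This is only an illustration under stated assumptions, not how Solr does it: the function name `substitute_phrases`, the underscore-joining convention, and the sample phrase list are all hypothetical.

```python
import re


def substitute_phrases(text, phrases):
    """Replace each known phrase in text with a single underscore-joined token."""
    # Longest phrases first, so a longer phrase wins over a shorter one it contains.
    for phrase in sorted(phrases, key=len, reverse=True):
        token = phrase.replace(" ", "_")
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        # A function replacement avoids backslash handling in the replacement text.
        text = pattern.sub(lambda match, token=token: token, text)
    return text


if __name__ == "__main__":
    # Hypothetical phrase list and field value, for illustration only.
    known_phrases = ["Joe's coffee shop"]
    field_value = "I met him at Joe's coffee shop downtown"
    print(substitute_phrases(field_value, known_phrases).split())
```

After the substitution, an ordinary whitespace split keeps each listed phrase together as one token, while the rest of the text is still broken into words. The approach only works for phrases that are known in advance.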

[Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
All, I am at a bit of a loss here so any help would be greatly appreciated. I am using the DIH to grab data from a DB. The field that I am most interested in has anywhere from 1 word to several paragraphs' worth of free text. What I would really like to do is pull out phrases like "Joe's coffee sho