On Jul 21, 2009, at 11:57 AM, JCodina wrote:

Let me summarize:

We (well, I think Grant?) change the DPTFF
(DelimitedPayloadTokenFilterFactory) so that it is able to index, at the
same position, different tokens that may have payloads, using two delimiters:
1. token delimiter (#)
2. payload delimiter (|) 
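
To make the two delimiters concrete, here is a minimal sketch in plain Java.
It is not the DPTFF code and uses no Solr classes: StackedTokenParser and
its parse method are hypothetical names, and I am assuming that one
whitespace-free chunk holds all the tokens for one position, with # between
tokens and | between a term and its optional payload.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr or Lucene.
class StackedTokenParser {
    static final char TOKEN_DELIMITER = '#';   // separates tokens at one position
    static final char PAYLOAD_DELIMITER = '|'; // separates a term from its payload

    /** Returns one {term, payloadOrNull} pair per token found in the chunk. */
    static List<String[]> parse(String chunk) {
        List<String[]> tokens = new ArrayList<>();
        int start = 0;
        while (start <= chunk.length()) {
            int end = chunk.indexOf(TOKEN_DELIMITER, start);
            if (end < 0) {
                end = chunk.length();
            }
            String part = chunk.substring(start, end);
            if (!part.isEmpty()) {
                int bar = part.indexOf(PAYLOAD_DELIMITER);
                if (bar < 0) {
                    // No payload delimiter: the token has no payload.
                    tokens.add(new String[] {part, null});
                } else {
                    tokens.add(new String[] {part.substring(0, bar), part.substring(bar + 1)});
                }
            }
            start = end + 1;
        }
        return tokens;
    }
}
```

Under these assumptions, parse("running|VBG#run|LEMMA#jog") yields three
tokens for the same position: running with payload VBG, run with payload
LEMMA, and jog with no payload.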

We (that's me) build SolCAS: a UIMA CAS consumer equivalent to LuCAS, but
one that allows indexing using Solr. SolCAS is able to generate different
tokens at the same position, possibly with payloads, and its output is
ready for the new DPTFF.

We (me again) develop some filtering utilities based on the payload,
something like the stopwords filter, but instead of rejecting the tokens
that are in the stopwords list they would reject those whose payload is in
the "payloads" list.
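
A rough sketch of that filter, in plain Java rather than as a real Lucene
TokenFilter. PayloadStopFilter and Tok are hypothetical names, payloads are
modelled as strings, and I am assuming that tokens without a payload are
always kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not a Lucene TokenFilter.
class PayloadStopFilter {
    static final class Tok {
        final String term;
        final String payload; // may be null when the token has no payload

        Tok(String term, String payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    private final Set<String> rejectedPayloads;

    PayloadStopFilter(String... rejectedPayloads) {
        this.rejectedPayloads = new HashSet<>(Arrays.asList(rejectedPayloads));
    }

    /** Keeps every token whose payload is absent or not in the rejected list. */
    List<Tok> filter(List<Tok> input) {
        List<Tok> kept = new ArrayList<>();
        for (Tok tok : input) {
            if (tok.payload == null || !rejectedPayloads.contains(tok.payload)) {
                kept.add(tok);
            }
        }
        return kept;
    }
}
```

For example, new PayloadStopFilter("DET", "PUNCT") would drop "the" tagged
DET and keep "house" tagged NOUN. A real filter would also have to decide
what to do with position increments when a token is dropped, which this
sketch ignores.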

We will also try to develop an n-gram generator based on the payloads, for
example one that finds the nouns followed by an adjective that is fewer
than 4 positions away.
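
The noun/adjective example could look roughly like this. Again it is plain
Java and only a sketch: PayloadNGrams and Tok are hypothetical names, the
payload values NOUN and ADJ are assumed part-of-speech tags, and each token
is assumed to carry its absolute position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a payload-driven pair generator.
class PayloadNGrams {
    static final class Tok {
        final int position;
        final String term;
        final String payload;

        Tok(int position, String term, String payload) {
            this.position = position;
            this.term = term;
            this.payload = payload;
        }
    }

    /** Emits "noun adjective" for each adjective that follows a noun by fewer than maxDistance positions. */
    static List<String> nounAdjectivePairs(List<Tok> tokens, int maxDistance) {
        List<String> pairs = new ArrayList<>();
        for (Tok noun : tokens) {
            if (!"NOUN".equals(noun.payload)) {
                continue;
            }
            for (Tok adj : tokens) {
                int distance = adj.position - noun.position;
                if ("ADJ".equals(adj.payload) && distance > 0 && distance < maxDistance) {
                    pairs.add(noun.term + " " + adj.term);
                }
            }
        }
        return pairs;
    }
}
```

With casa (NOUN, position 0), muy (ADV, 1) and grande (ADJ, 2), a call with
maxDistance 4 gives the single pair "casa grande". The nested loop is
quadratic, which is fine for a sketch but would need a sliding window on
real token streams.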

For the moment, searches cannot be performed based on payloads, not even as
a filter... but this is a matter of time.

Problems to solve:
Process the N tokens that share the same position nicely, as
tokenizer.Next() will not return them together (which is a pity). Write
some utility that gives the tools that manage multitokens a common
front-end and back-end: the front-end calls Next() several times to put
together all the information at the same position, the treatment is then
performed on a multitoken structure, and the resulting multitoken is sent
to the back-end, which again offers Next() on single tokens...
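
The front-end half of that utility might be sketched as follows, in plain
Java. PositionGrouper and Tok are hypothetical names. I am relying on the
Lucene convention that a token with position increment 0 sits at the same
position as the previous token, and a plain Iterator stands in for the
token stream.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical front-end: turns a stream of single tokens into one list per position.
class PositionGrouper implements Iterator<List<PositionGrouper.Tok>> {
    static final class Tok {
        final String term;
        final int positionIncrement; // 0 means "same position as the previous token"

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    private final Iterator<Tok> input;
    private Tok lookahead; // first token of the next group, or null at end of stream

    PositionGrouper(Iterator<Tok> input) {
        this.input = input;
        this.lookahead = input.hasNext() ? input.next() : null;
    }

    @Override
    public boolean hasNext() {
        return lookahead != null;
    }

    /** Returns all tokens that share the next position. */
    @Override
    public List<Tok> next() {
        if (lookahead == null) {
            throw new NoSuchElementException();
        }
        List<Tok> group = new ArrayList<>();
        group.add(lookahead);
        lookahead = input.hasNext() ? input.next() : null;
        // Keep pulling single tokens while they stack on the same position.
        while (lookahead != null && lookahead.positionIncrement == 0) {
            group.add(lookahead);
            lookahead = input.hasNext() ? input.next() : null;
        }
        return group;
    }
}
```

Fed run (increment 1), jog (increment 0) and fast (increment 1), it returns
the groups [run, jog] and then [fast]. The back-end would do the reverse,
handing the tokens of each processed group out one by one.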

Joan
-- 
View this message in context: 
http://www.nabble.com/Solr-and-UIMA-tp24567504p24639814.html
Sent from the Solr - User mailing list archive at Nabble.com.
