Sounds like a very ambitious project. I'm sure you COULD do it in Solr, but not in very short order.

Check out some discussion of simply searching within sentences:
http://markmail.org/message/aoiq62a4mlo25zzk?q=apache#query:apache+page:1+mid:aoiq62a4mlo25zzk+state:results

First, how do you expect to use/query the corpus? In other words, what are your user requirements? They will determine what structure the Solr index, analysis chains, and custom search components will need.

Also, check out the Solr OpenNLP wiki:
http://wiki.apache.org/solr/OpenNLP

And see "LUCENE-2899: Add OpenNLP Analysis capabilities as a module":
https://issues.apache.org/jira/browse/LUCENE-2899
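For a feel of what that work provides, an analysis chain built on the OpenNLP factories would look roughly like this in schema.xml (a sketch only -- factory names are from the Lucene/Solr OpenNLP integration, and the model file names are placeholders you'd replace with your own bahasa Indonesia models):

<fieldType name="text_opennlp_pos" class="solr.TextField">
  <analyzer>
    <!-- Sentence detection + word tokenization via OpenNLP models -->
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="id-sent.bin"
               tokenizerModel="id-token.bin"/>
    <!-- Tags each token with its part of speech (stored as token type) -->
    <filter class="solr.OpenNLPPOSFilterFactory"
            posTaggerModel="id-pos-maxent.bin"/>
  </analyzer>
</fieldType>

Once the POS tag is carried on each token, a type-based filter (or your own FilteringTokenFilter) can drop tokens by tag.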

-- Jack Krupansky

-----Original Message----- From: Rendy Bambang Junior
Sent: Monday, May 06, 2013 11:41 AM
To: solr-user@lucene.apache.org
Subject: Tokenize Sentence and Set Attribute

Hello,

I am trying to use a part-of-speech tagger for bahasa Indonesia to filter
tokens in Solr.
The tagger takes the word list of a sentence as input and returns an array of tags.

I think the process should be like this:
- tokenize the text into sentences
- tokenize each sentence into words
- pass the words to the tagger
- set an attribute on each token using the tagger output
- pass the tokens through a FilteringTokenFilter implementation
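The accept/reject step at the end of that list can be sketched in plain Java. This is only an illustration of the logic (the tagger interface and method names here are made up, not Lucene or OpenNLP APIs); in real Lucene code the filter would extend FilteringTokenFilter and read the tag from a custom Attribute:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PosFilterSketch {

    // Stand-in for the POS tagger: takes the words of one sentence,
    // returns one tag per word. (Hypothetical interface, not an
    // OpenNLP or Lucene API.)
    interface SentenceTagger {
        String[] tag(String[] words);
    }

    // Keep only the words whose tag is in keepTags -- the same
    // decision a FilteringTokenFilter's accept() would make per token.
    static List<String> filterByPos(String[] words, SentenceTagger tagger,
                                    Set<String> keepTags) {
        String[] tags = tagger.tag(words);
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (keepTags.contains(tags[i])) {
                kept.add(words[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy tagger: tags "saya" as a pronoun, everything else as a noun.
        SentenceTagger toyTagger = words -> {
            String[] tags = new String[words.length];
            for (int i = 0; i < words.length; i++) {
                tags[i] = words[i].equals("saya") ? "PRP" : "NN";
            }
            return tags;
        };
        List<String> kept = filterByPos(
                new String[] {"saya", "makan", "nasi"},
                toyTagger, Set.of("NN"));
        System.out.println(kept);  // [makan, nasi]
    }
}
```

The key point is that the tagger needs the whole sentence at once, which is why sentence boundaries must survive tokenization (as the markmail thread above discusses).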

Is it possible to do this in Solr/Lucene? If it is, how?

I've read about a similar solution for Japanese, but since I lack an
understanding of Japanese, it didn't help much.

--
Regards,
Rendy Bambang Junior
Informatics Engineering '09
Bandung Institute of Technology
