We are starting to use UIMA as a platform for text analysis.
The result of analyzing a document is a UIMA CAS. A CAS is a generic data
structure that can hold different kinds of data.
UIMA processes documents one at a time: it gets each document from a CAS
producer, processes it with a pipe that the user defines, and finally sends
the result to a CAS consumer, which "saves" or "stores" it.
The pipe is a chain of tools that annotate the text with different
information. Several sets of tools are available, each defining its own data
types that are included in the CAS. To build a pipe, the output CAS and input
CAS of the elements being connected need to be compatible.
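
To make the compatibility requirement concrete, here is a toy sketch. It
does not use the UIMA API: the Annotator class, the type names and
check_pipe are all made up for illustration. Each component declares which
annotation types it needs and which it adds, and the check walks the pipe
in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotator:
    """Toy stand-in for a pipeline component (hypothetical, not a UIMA class)."""
    name: str
    requires: frozenset  # annotation types that must already be in the CAS
    produces: frozenset  # annotation types this component adds to the CAS


def check_pipe(annotators, initial_types=frozenset({"DocumentText"})):
    """Return (annotator name, missing types) for every incompatible link."""
    available = set(initial_types)
    problems = []
    for annotator in annotators:
        missing = annotator.requires - available
        if missing:
            problems.append((annotator.name, sorted(missing)))
        available |= annotator.produces
    return problems


tokenizer = Annotator("tokenizer", frozenset({"DocumentText"}), frozenset({"Token"}))
tagger = Annotator("pos_tagger", frozenset({"Token"}), frozenset({"PosTag"}))

print(check_pipe([tokenizer, tagger]))  # []
print(check_pipe([tagger, tokenizer]))  # [('pos_tagger', ['Token'])]
```

The second call fails because the tagger runs before anything has produced
the Token type it depends on.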

There is a CAS consumer that feeds a Lucene index, called LUCAS. I have
looked at it, but I would prefer to connect UIMA to Solr. Why?
A: I know Solr ;-) and I like it.
B: I can configure the fields and their processing in Solr using XML. Once
that is done, I have it ready to use with a set of tools that let me easily
explore the data.
C: It is easier to use Solr as a "web service" that can receive docs from
different UIMA instances (natural language processing is CPU intensive).
D: Break things down. The CAS consumer would only produce XML that Solr can
process. Different tokenizers can then be used to deal with the data from the
CAS. The main point is that the XML has the doc and field labels of Solr.
E: The set of capabilities to process the XML is defined in XML: similar to
LUCAS for defining the output, and in the Solr schema for defining how it is
processed.
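
As a rough illustration of point D, this is the kind of XML I have in mind:
Solr's update message with add, doc and field elements. The sketch builds it
from a plain dict. The field names "id" and "text" and the helper name
to_solr_add_xml are assumptions; the real fields would have to exist in the
Solr schema.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(doc_fields):
    """Serialize {field name: value or list of values} as a Solr <add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, values in doc_fields.items():
        # A multi-valued field becomes several <field> elements with the same name.
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical field names; they must match the Solr schema in use.
xml_message = to_solr_add_xml({"id": "doc-1", "text": "UIMA meets Solr"})
print(xml_message)
```

ElementTree escapes characters such as & and < in the field text, so the
document text can go in as-is.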


I want to use it to index something that is common, but that I can't get
any tool to do with Solr: indexing a word and encoding the syntactic and
semantic information at the same position. I know that in Lucene this
is evolving and it will be possible to include metadata, but for the moment
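
To show what "at the same position" means, here is a small sketch that is
independent of Lucene and Solr. Each word is emitted together with its
annotations as extra terms that share the word's position, which is roughly
what a Lucene token with a position increment of 0 does for synonyms. The
"pos=" and "sem=" prefixes, the tag values and the function name are made up
for illustration.

```python
def stack_annotations(tokens):
    """Yield (term, position) pairs; annotations share the position of their word.

    `tokens` is a list of (word, pos_tag, semantic_label_or_None) tuples.
    The "pos=" and "sem=" prefixes are an invented convention that keeps
    annotation terms from colliding with ordinary words.
    """
    for position, (word, pos_tag, sem) in enumerate(tokens):
        yield (word.lower(), position)
        yield ("pos=" + pos_tag, position)
        if sem is not None:
            yield ("sem=" + sem, position)


# Hypothetical output of an analysis pipe for the text "Paris is big".
analysed = [("Paris", "NNP", "LOCATION"), ("is", "VBZ", None), ("big", "JJ", None)]
for term, position in stack_annotations(analysed):
    print(position, term)
```

With terms stacked like this, a positional query could in principle match
"a LOCATION followed by a verb" without naming the words themselves.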


So, my idea is first to produce a UIMA CAS consumer that POSTs an XML file
containing the plain text of the document to Solr, and then to modify it to
include multiple fields and start encoding the semantic information.
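
The POST step itself is small. A minimal sketch using only the Python
standard library is below; the real consumer would be a UIMA component, so
this only shows the HTTP side. The URL assumes a default local Solr install,
and the field names and function names are hypothetical.

```python
import urllib.request

# Assumption: a default local Solr install; adjust host, port and path.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_message, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request carrying a Solr XML update message."""
    return urllib.request.Request(
        url,
        data=xml_message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_to_solr(xml_message, url=SOLR_UPDATE_URL):
    """Send the message and return the HTTP status code; needs a running Solr."""
    with urllib.request.urlopen(build_update_request(xml_message, url)) as response:
        return response.status


# Hypothetical field names; nothing is sent over the network here.
request = build_update_request(
    '<add><doc><field name="id">doc-1</field>'
    '<field name="text">plain text of the document</field></doc></add>'
)
print(request.get_method(), request.full_url)
```

After the documents are added, a separate <commit/> message posted to the
same URL makes them visible to searches.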

So, before starting, I would like to know your opinions, and whether anyone
is interested in collaborating or has some code that could be integrated into
this.
 
-- 
View this message in context: 
http://www.nabble.com/Solr-and-UIMA-tp24567504p24567504.html
Sent from the Solr - User mailing list archive at Nabble.com.