We are starting to use UIMA as a platform for text analysis. The result of analyzing a document is a UIMA CAS, a generic data structure that can hold different kinds of data. UIMA processes one document at a time: it gets the documents from a CAS producer, processes them through a pipe that the user defines, and finally sends the result to a CAS consumer, which saves or stores it. The pipe is a chain of tools, each of which annotates the text with different information. Several sets of tools are available, each defining its own data types that are included in the CAS. To build a pipe, the output CAS of each element must be compatible with the input CAS of the element it is connected to.
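
To make the compatibility point concrete, here is a toy sketch in plain Java. It does not use the UIMA API at all: ToyCas, ToyAnnotator and the type names "Token" and "PosTag" are made up for illustration. It only shows the idea that each element of the pipe needs certain types to be in the CAS already and adds its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: these are NOT UIMA classes, just an illustration of the idea. */
public class PipeSketch {

    /** Stand-in for a CAS: the document text plus the annotation types added so far. */
    static class ToyCas {
        final String text;
        final Set<String> types = new LinkedHashSet<>();
        ToyCas(String text) { this.text = text; }
    }

    /** Stand-in for a tool in the pipe: the types it needs and the types it adds. */
    static class ToyAnnotator {
        final String name;
        final Set<String> inputs;
        final Set<String> outputs;
        ToyAnnotator(String name, String[] inputs, String[] outputs) {
            this.name = name;
            this.inputs = new LinkedHashSet<>(Arrays.asList(inputs));
            this.outputs = new LinkedHashSet<>(Arrays.asList(outputs));
        }
    }

    /** Runs the pipe over one document and fails if an element's inputs are missing. */
    static ToyCas run(String text, List<ToyAnnotator> pipe) {
        ToyCas cas = new ToyCas(text);
        for (ToyAnnotator a : pipe) {
            if (!cas.types.containsAll(a.inputs)) {
                throw new IllegalStateException(
                    a.name + " needs " + a.inputs + " but the CAS only has " + cas.types);
            }
            // A real annotator would add annotations here; the toy only records the types.
            cas.types.addAll(a.outputs);
        }
        return cas;
    }

    public static void main(String[] args) {
        List<ToyAnnotator> pipe = new ArrayList<>();
        pipe.add(new ToyAnnotator("tokenizer", new String[] {}, new String[] {"Token"}));
        pipe.add(new ToyAnnotator("tagger", new String[] {"Token"}, new String[] {"PosTag"}));
        System.out.println(run("Some plain text.", pipe).types);
    }
}
```

Putting the tagger before the tokenizer makes run() throw, which is the kind of incompatibility I mean.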

There is a CAS consumer that feeds a Lucene index, called LuCas. I have been looking at it, but I would prefer to connect UIMA to Solr. Why?

A: I know Solr ;-) and I like it.
B: I can configure the fields and their processing in Solr using XML. Once that is done, I have it ready to use with a set of tools that let me explore the data easily.
C: It is easier to use Solr as a "web service" that can receive docs from different UIMA instances (natural language processing is CPU intensive).
D: It breaks things down. The CAS consumer would only produce XML that Solr can process. Different tokenizers can then be used to deal with the data in the CAS. The main point is that the XML carries Solr's doc and field labels.
E: Everything is configured in XML: a definition similar to the one LuCas uses describes the output, and the Solr schema defines how that output is processed.

I want to use it to index something that is common, but that I can't get any tool to do with Solr: indexing a word and encoding the syntactic and semantic information at the same position. I know that this is evolving in Lucene and that it will be possible to include metadata, but for the moment it is not possible.

So my idea is first to write a UIMA CAS consumer that POSTs to Solr an XML file containing the plain text of the document, and then to modify it to include multiple fields and start encoding the semantic information.

Before starting, I would like to know your opinions, and whether anyone is interested in collaborating or has some code that could be integrated into this.

--
View this message in context: http://www.nabble.com/Solr-and-UIMA-tp24567504p24567504.html
Sent from the Solr - User mailing list archive at Nabble.com.
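
P.S. To make the first step concrete, here is a minimal sketch of the part that builds the Solr XML and POSTs it, using only the JDK. It leaves out the UIMA side completely: the document text is passed in as a plain String. The field names "id" and "text" and the update URL are assumptions that depend on the Solr schema and installation, and the class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Sketch only: builds a Solr <add> message for one document and POSTs it. */
public class SolrXmlSketch {

    /** Escapes the characters that would break the XML message. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds the update message; "id" and "text" are assumed schema field names. */
    static String buildAddXml(String id, String text) {
        return "<add><doc>"
             + "<field name=\"id\">" + escapeXml(id) + "</field>"
             + "<field name=\"text\">" + escapeXml(text) + "</field>"
             + "</doc></add>";
    }

    /** POSTs an XML message to the given update URL and returns the HTTP status. */
    static int post(String updateUrl, String xml) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(updateUrl).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = con.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return con.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        String xml = buildAddXml("doc1", "Plain text of the document taken from the CAS.");
        System.out.println(xml);
        // Only talks to a server if an update URL is given, for example
        // http://localhost:8983/solr/update (an assumption about the setup).
        if (args.length > 0) {
            System.out.println("add: HTTP " + post(args[0], xml));
            // Documents become searchable only after a commit.
            System.out.println("commit: HTTP " + post(args[0], "<commit/>"));
        }
    }
}
```

In a real CAS consumer the text would come from the CAS instead of a literal, and the later step would add more field elements for the syntactic and semantic information.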