Hello Michael,

I can help you with the UIMA UpdateRequestProcessor [1]. The current implementation executes UIMA pipelines in memory, but I was planning to add support for higher scalability (with UIMA-AS [2]), which may help you as well.

Tommaso

[1] : http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java
[2] : http://uima.apache.org/doc-uimaas-what.html

2011/12/5 Michael Kelleher <mj.kelle...@gmail.com>

> Hello Erik,
>
> I will take a look at both:
>
> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
>
> and
>
> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
>
> and figure out what I need to extend to handle processing in the way I am
> looking for. I am assuming that "component" configuration is handled in a
> standard way such that I can configure my new UpdateProcessor in the same
> way I would configure any other UpdateProcessor "component"?
>
> Thanks for the suggestion.
>
> 1 more question: given that I am probably going to convert the HTML to
> XML so I can use XPath expressions to "extract" my content, do you think
> that this kind of processing will overload Solr? This Solr instance will
> be used solely for indexing, and will only ever have a single ManifoldCF
> crawling job feeding it documents at one time.
>
> --mike
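
On the XPath step mentioned above: once the HTML has been converted to well-formed XML, the extraction itself can be done with the JDK's own javax.xml.xpath API. The following is only a minimal sketch under that assumption. The class name, the sample markup and the XPath expression are hypothetical, and it is not tied to any Solr or UIMA class.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls the text of the first node matching an XPath
// expression out of an already well-formed XML (e.g. XHTML) string.
public class XPathExtractSketch {

    static String extract(String xml, String expression) throws Exception {
        // Parse the XML string into a DOM tree.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Evaluate the expression; the String overload returns the string
        // value of the first matching node, or "" if nothing matches.
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Sample markup, invented for illustration only.
        String xml = "<html><body><div id=\"content\"><p>Hello Solr</p></div></body></html>";
        System.out.println(extract(xml, "//div[@id='content']/p"));
    }
}
```

This sketch parses each document into a full DOM tree in memory, so its cost grows with document size. With a single crawling job feeding documents one at a time that may well be acceptable, but it is worth measuring on your own data.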