> I am currently working on a project to integrate a
> Named-Entity-Recognition (NER) framework into an existing
> search platform based on Solr. The platform uses ManifoldCF
> (MCF) to automatically gather content from various
> repositories. The NER framework creates annotations/metadata
> from given content, which I then want to integrate into the
> search platform as metadata to use for faceting. Since MCF
> handles all content gathering, I need a way to integrate the
> NER framework directly into Solr. The goal is to get all
> annotations per document into a multivalued field. My
> first thought was to create a custom filter, which just
> takes the content and gives back only the annotations.
> But as I understand it, a filter only processes
> predetermined tokens, which is useless for my purpose, since
> the NER framework needs to process the whole content of a
> document. What about a custom Tokenizer? Would it be
> possible to process the whole text and give back only the
> annotations as tokens? A third thought was to manipulate the
> ExtractingRequestHandler (Solr Cell) used by MCF to somehow add
> the annotations as metadata when the content and metadata are
> distributed to the different fields.
>
> I hope my problem description is sufficient. Does anybody
> have any thoughts on that subject?
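An UpdateRequestProcessor is more appropriate in this case. It sees the whole document before any field analysis happens, so it can run the NER framework on the full content and add the annotations to a multivalued field. The UIMA integration works this way: http://wiki.apache.org/solr/SolrUIMA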
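
To make the idea concrete, here is a minimal sketch of the pattern in plain Java. It deliberately does not use the Solr classes: `Doc`, `Processor`, `NerProcessor` and `toyNer` are hypothetical stand-ins so the example compiles on its own, and the field names `content` and `annotations` are assumptions. In a real plugin you would extend Solr's UpdateRequestProcessor, override processAdd(AddUpdateCommand), and call your NER framework where `toyNer` is used here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java stand-in for the update processor pattern.
// None of these types are Solr classes; they only mirror the shape of the idea.
class NerUpdateSketch {

    // Stand-in for an input document: field name -> list of values (multivalued).
    static class Doc {
        final Map<String, List<Object>> fields = new LinkedHashMap<>();

        void addField(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        Object getFirstValue(String name) {
            List<Object> values = fields.get(name);
            return (values == null || values.isEmpty()) ? null : values.get(0);
        }

        List<Object> getValues(String name) {
            List<Object> values = fields.get(name);
            return values == null ? new ArrayList<>() : values;
        }
    }

    // One step in a processing chain; each step may hand the document on.
    interface Processor {
        void processAdd(Doc doc);
    }

    // Reads the full text of one field, runs NER on it, and stores every
    // annotation as a separate value of the target field.
    static class NerProcessor implements Processor {
        private final String sourceField;
        private final String targetField;
        private final Function<String, List<String>> ner; // hypothetical NER hook
        private final Processor next;                     // may be null at chain end

        NerProcessor(String sourceField, String targetField,
                     Function<String, List<String>> ner, Processor next) {
            this.sourceField = sourceField;
            this.targetField = targetField;
            this.ner = ner;
            this.next = next;
        }

        @Override
        public void processAdd(Doc doc) {
            Object content = doc.getFirstValue(sourceField);
            if (content != null) {
                for (String annotation : ner.apply(content.toString())) {
                    doc.addField(targetField, annotation);
                }
            }
            if (next != null) {
                next.processAdd(doc); // pass the enriched document down the chain
            }
        }
    }

    // Toy replacement for a real NER framework: every distinct capitalized word
    // counts as an "entity". Only here so the sketch has something to call.
    static List<String> toyNer(String text) {
        List<String> entities = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()
                    && Character.isUpperCase(token.charAt(0))
                    && !entities.contains(token)) {
                entities.add(token);
            }
        }
        return entities;
    }
}
```

The point of the design is that the processor receives the complete field value rather than a token stream, which is what a filter or tokenizer cannot give you. Since MCF posts through Solr Cell, the processor chain would have to be attached to that request handler in solrconfig.xml so that the extracted content passes through it.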