Hey guys,

I am currently working on a project to integrate a named-entity recognition (NER) framework into an existing search platform based on Solr. The platform uses ManifoldCF (MCF) to automatically gather content from various repositories. The NER framework creates annotations from the given content, and I want to add these to the search platform as metadata to use for faceting. Since MCF handles all content gathering, I need a way to integrate the NER framework directly into Solr. The goal is to get all annotations for each document into a multivalued field.

My first thought was to create a custom filter that takes the content and gives back only the annotations. But as I understand it, a filter only processes tokens that the tokenizer has already produced, which is useless for my purpose, since the NER framework needs to process the whole content of a document.

What about a custom Tokenizer? Would it be possible to process the whole text and give back only the annotations as tokens?

A third thought was to modify the ExtractingRequestHandler (Solr Cell) used by MCF so that it adds the annotations as metadata at the point where the content and metadata are distributed to the different fields.
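To make the goal concrete, here is a minimal plain-Java sketch of the result I am after for each document. It does not use the Solr or ManifoldCF APIs at all. The `NerFramework` interface, the toy `GazetteerNer` class, the `enrich` method, and the field name `annotations` are all placeholders I made up for illustration; the real framework would take the place of the toy lookup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

class AnnotationFieldSketch {

    /** Hypothetical stand-in for the NER framework: it sees the whole text, not single tokens. */
    interface NerFramework {
        List<String> annotate(String fullText);
    }

    /** Toy implementation for illustration only: reports every known entity found in the text. */
    static class GazetteerNer implements NerFramework {
        private final List<String> knownEntities;

        GazetteerNer(List<String> knownEntities) {
            this.knownEntities = knownEntities;
        }

        @Override
        public List<String> annotate(String fullText) {
            List<String> found = new ArrayList<>();
            for (String entity : knownEntities) {
                if (fullText.contains(entity)) {
                    found.add(entity);
                }
            }
            return found;
        }
    }

    /**
     * Builds a simplified "document" (field name -> values) in which the
     * annotations of the full content end up in one multivalued field.
     */
    static Map<String, List<String>> enrich(String id, String content, NerFramework ner) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("id", List.of(id));
        doc.put("content", List.of(content));
        // Deduplicate while keeping the order in which the framework returned the annotations.
        doc.put("annotations", new ArrayList<>(new LinkedHashSet<>(ner.annotate(content))));
        return doc;
    }

    public static void main(String[] args) {
        NerFramework ner = new GazetteerNer(List.of("Solr", "ManifoldCF", "Berlin"));
        Map<String, List<String>> doc = enrich(
                "doc-1", "ManifoldCF crawls the repository and sends the content to Solr.", ner);
        System.out.println(doc.get("annotations"));
    }
}
```

The open question is where in the Solr indexing chain a step like `enrich` could be hooked in so that the `annotations` values become available for faceting.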

I hope my problem description is sufficient. Does anybody have any thoughts on this subject?

Best regards,
Tobias