Hey Guys,

I am currently working on a project to integrate a named-entity recognition
(NER) framework into an existing search platform based on Solr. The platform
uses ManifoldCF (MCF) to automatically gather content from various
repositories. The NER framework creates annotations from the given content,
which I want to add to the search platform as metadata to use for faceting.
Since MCF handles all content gathering, I need a way to integrate the NER
framework directly into Solr. The goal is to get all annotations for a
document into a multivalued field.

My first thought was to create a custom filter that takes the content and
returns only the annotations. But as I understand it, a filter only processes
tokens that the tokenizer has already produced, which is useless for my
purpose, since the NER framework needs to process the whole content of a
document.

What about a custom tokenizer? Would it be possible to process the whole text
and return only the annotations as tokens?

A third thought was to modify the ExtractingRequestHandler (Solr Cell) used by
MCF so that it adds the annotations as metadata when the content and metadata
are distributed to the different fields.
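
To make the goal concrete, here is a minimal Java sketch of the data flow I am
after. It does not use any real Solr or NER API: the names Annotator,
toyAnnotate and enrich, and the field name "annotations", are all made up for
illustration, and the toy annotator simply treats runs of capitalized words as
entities. The only point is that the annotator sees the whole document text
and its results become the values of one multivalued field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NerFieldSketch {

    /** Hypothetical stand-in for the NER framework: whole text in, annotations out. */
    interface Annotator {
        List<String> annotate(String fullText);
    }

    // Toy rule, for illustration only: one or more capitalized words in a row.
    private static final Pattern CAPITALIZED_RUN =
            Pattern.compile("\\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\\b");

    /** Toy annotator (assumption, not a real NER): returns distinct capitalized runs. */
    static List<String> toyAnnotate(String fullText) {
        List<String> annotations = new ArrayList<>();
        Matcher m = CAPITALIZED_RUN.matcher(fullText);
        while (m.find()) {
            if (!annotations.contains(m.group())) {
                annotations.add(m.group());
            }
        }
        return annotations;
    }

    /**
     * Builds a simplified "document" as field name -> list of values.
     * The annotator receives the complete content, not individual tokens,
     * and everything it returns goes into the multivalued "annotations" field.
     */
    static Map<String, List<String>> enrich(String content, Annotator annotator) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("content", Collections.singletonList(content));
        doc.put("annotations", annotator.annotate(content));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc =
                enrich("Tobias works in Berlin with Apache Solr.", NerFieldSketch::toyAnnotate);
        System.out.println(doc);
    }
}
```

Where a step like enrich would have to live, inside Solr or in front of it, is
exactly what I am unsure about.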

I hope my problem description is sufficient. Does anybody have any thoughts on 
that subject?

Best regards,
Tobias
