Well, you have a crawling and extraction pipeline. You can probably inject a classification step somewhere in there, possibly an NLP model trained on a manually labelled seed set. Or just a list of typical words as a start.
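A rough sketch of the word-list variant, in Python. Everything here is an assumption for illustration: the seed terms, the function names and the 0.3 threshold are made up and would need tuning on real pages. It runs on the extracted text before anything is sent to the indexer.

```python
import re

# Hypothetical seed vocabulary; a real list would be built from sample recipe pages.
COOKING_TERMS = {
    "recipe", "ingredients", "bake", "oven", "tablespoon",
    "teaspoon", "simmer", "serves", "flour", "butter",
}

def cooking_score(text):
    """Return the fraction of distinct seed terms that occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & COOKING_TERMS) / len(COOKING_TERMS)

def is_cooking_page(text, threshold=0.3):
    """Keep a page only if enough seed terms appear (threshold is a guess)."""
    return cooking_score(text) >= threshold

recipe = "Recipe: mix flour and butter, bake in the oven for 20 minutes. Serves four."
politics = "The parliament voted on the new travel and fashion tax law."

print(is_cooking_page(recipe))    # page would be passed on for indexing
print(is_cooking_page(politics))  # page would be dropped
```

Pages that fail the check are simply never indexed. If the word list turns out too crude, the same function is the place to swap in a trained classifier.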
This is kind of a pre-Solr stage, though.

Regards,
   Alex

On 4 Jan 2016 7:37 pm, <liviuchrist...@yahoo.com.invalid> wrote:

> Hi everyone,
>
> I'm working on a search engine based on Solr which indexes
> documents from a large variety of websites.
> The engine is focused on cooking recipes. However, one problem is that
> these websites provide not only content related to cooking recipes but
> also content related to fashion, travel, politics, liberty rights, etc.,
> which is not what the user expects to find on a search engine dedicated
> to cooking recipes.
> Is there any way to filter out content which is not related to the core
> business of the search engine?
> Something like parental control software maybe?
>
> Kind regards,
> Christian
>
> Christian Fotache
> Tel: 0728.297.207 Fax: 0351.411.570