Wikipedia Data Cleaning at Solr

Furkan KAMACI Sat, 22 Feb 2014 16:22:20 -0800

Hi;

I want to run an NLP algorithm on Wikipedia data. I used the
DataImportHandler to index the dump and the import itself works fine.
However, some of the indexed text still contains wiki markup, like this:


== Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur fakat taşımalı
eğitimden yararlanılmaktadır.

I think it should look like this instead:

Altyapı bilgileri Köyde, ilköğretim okulu yoktur fakat taşımalı eğitimden
yararlanılmaktadır.

On the other hand, table markup like this should be removed entirely:

{| border="0" cellpadding="5" cellspacing="5" |- bgcolor="#aaaaaa"
|'''Seçim Yılı''' |'''Muhtar''' |- bgcolor="#dddddd" |[[2009]] |kazım
güngör |- bgcolor="#dddddd" | |Ömer Gungor |- bgcolor="#dddddd" | |Fazlı
Uzun |- bgcolor="#dddddd" | |Cemal Özden |- bgcolor="#dddddd" | | |}

Also, keeping section titles such as == Altyapı bilgileri == should be
optional (I think they can be dropped for some purposes).
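
To make the cleanup I am describing concrete, here is a minimal sketch in
Python using plain regular expressions. It is not a Solr analyzer, only an
illustration of the intended transformation. The function name
clean_wikitext and the keep_headings flag are hypothetical, and the sketch
assumes tables are not nested and covers only the markup shown above.

```python
import re

def clean_wikitext(text, keep_headings=True):
    """Hypothetical helper: strip a small subset of MediaWiki markup."""
    # Drop whole tables, from "{|" to the first following "|}".
    # DOTALL lets a table span several lines; nested tables are not handled.
    text = re.sub(r"\{\|.*?\|\}", " ", text, flags=re.DOTALL)
    # Headings: "== Title ==" becomes "Title", or is dropped when
    # keep_headings is False.
    heading = r"={2,}\s*([^=\n]+?)\s*={2,}"
    text = re.sub(heading, r"\1" if keep_headings else " ", text)
    # Links: "[[target|label]]" becomes "label", "[[target]]" becomes "target".
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs such as '''...'''.
    text = re.sub(r"'{2,}", "", text)
    # Collapse runs of whitespace (including line breaks) to single spaces.
    return re.sub(r"\s+", " ", text).strip()

sample = ("== Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur "
          "fakat taşımalı\neğitimden yararlanılmaktadır.")
print(clean_wikitext(sample))
print(clean_wikitext(sample, keep_headings=False))
```

The first call keeps the section title as plain text and the second drops
it, which is the optional behaviour I mean.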

My question is: is there any combination of analyzers (char filters,
tokenizers, token filters) that can clean up Wikipedia data in Solr?

Thanks;
Furkan KAMACI
