On Mar 24, 2010, at 1:53 PM, Andrzej Bialecki wrote:

> On 2010-03-24 16:15, Markus Jelsma wrote:
>> A bit off-topic, but how about Nutch grabbing some content and having it indexed
>> in Solr?
> 
> The problem is not with collecting and submitting the documents; the problem 
> is with parsing the Wikimedia markup embedded in the XML. WikipediaTokenizer 
> from Lucene contrib/ is a quick and perhaps acceptable solution ...

Yeah, the WikipediaTokenizer does a pretty decent job, but it still has a few bugs 
that need fixing.  It handles most of the syntax and assigns each token a type based 
on the kind of markup it came from.  It can also "roll up" tokens such as categories 
into a single token (useful for things like faceting).

-Grant
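
A minimal sketch of what driving the WikipediaTokenizer looks like, assuming the
Lucene 2.9/3.0 contrib API of the time (package and attribute names changed in later
releases, and the sample markup string below is made up for illustration):

import java.io.StringReader;
import java.util.Collections;

import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

public class WikiTokenizerDemo {
  public static void main(String[] args) throws Exception {
    // Made-up snippet of Wikimedia markup for illustration.
    String wikiText = "'''Apache Lucene''' is a [[search engine]] library. "
        + "[[Category:Information retrieval]]";

    // BOTH emits the regular word tokens plus an un-tokenized ("rolled up")
    // token for each type in the untokenized set -- here, categories.
    WikipediaTokenizer tokenizer = new WikipediaTokenizer(
        new StringReader(wikiText),
        WikipediaTokenizer.BOTH,
        Collections.singleton(WikipediaTokenizer.CATEGORY));

    TermAttribute term = tokenizer.addAttribute(TermAttribute.class);
    TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

    // Print each token together with the type the tokenizer assigned to it
    // (e.g. internal link, external link, category, bold, heading).
    while (tokenizer.incrementToken()) {
      System.out.println(term.term() + "\t" + type.type());
    }
    tokenizer.end();
    tokenizer.close();
  }
}

The rolled-up category token ("Information retrieval" emitted as a single term of
the category type) is what makes it usable as a facet value, as opposed to the
individual word tokens used for full-text search.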
