Thanks Erlend.

I've not used SVN before, but I have managed to download and build the latest trunk code.

Now I'm getting an error when trying to access the admin page (via Jetty). My schema.xml specifies HTMLStripStandardTokenizerFactory, but that class no longer appears to be supplied as part of the build, so I get an exception because it can't be found. I've checked CHANGES.txt and found the following in the change list for 1.4.0 (!?):

66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)
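
Going by that entry, I'd expect the replacement to look roughly like the sketch below: a charFilter that strips the markup, placed ahead of an ordinary tokenizer. This is an assumption about the intended usage rather than a confirmed working config. The field type name text_html is made up for illustration, and the lowercase filter is just an optional extra.

```xml
<!-- Hypothetical field type; "text_html" is an illustrative name only. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw text before tokenizing, so this must come first. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Any tokenizer can follow; StandardTokenizerFactory is assumed here. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Optional token filter, not required for HTML stripping. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```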

Unfortunately, I can't seem to get that to work correctly. Does anyone have a working example fieldType stanza (for schema.xml) that strips out HTML?

Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:
On 25.01.11 11.30, Erlend Garåsen wrote:

Tika version 0.8 is not included in the latest release/trunk from SVN.

Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.

And to clarify, by "content" I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk.

Erlend


