Thanks Erlend.
I haven't used SVN before, but I've managed to download and build the
latest trunk code.
Now I'm getting an error when trying to access the admin page (via
Jetty) because I specify HTMLStripStandardTokenizerFactory in my
schema.xml, but this appears to be no longer supplied as part of the
build, so I get an exception because it can't find that class. I've
checked CHANGES.txt and found the following in the change list for
1.4.0 (!?):
66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader,
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags,
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)
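As I read that entry, the char filter goes ahead of the tokenizer inside
the analyzer. A minimal sketch of what that might look like in schema.xml
(the fieldType name "text_html" is made up for illustration, the choice of
tokenizer and lowercase filter is arbitrary, and this is untested against
the trunk build):

```xml
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw text before tokenizing, so the markup is removed first -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Any tokenizer can follow; StandardTokenizerFactory is just one choice -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```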
Unfortunately, I can't seem to get that to work correctly. Does anyone
have an example fieldType stanza (for schema.xml) for stripping out HTML?
Thanks and kind regards,
Gary.
On 25/01/2011 14:17, Erlend Garåsen wrote:
On 25.01.11 11.30, Erlend Garåsen wrote:
Tika version 0.8 is not included in the latest release/trunk from SVN.
Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.
And to clarify, by "content" I mean the main content of a Word file.
Title and other kinds of metadata are successfully extracted by the
old 0.4 version of Tika, but you need a newer Tika version (0.8) in
order to fetch the main content as well. So try the newest Solr
version from trunk.
Erlend