Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Gary Taylor Tue, 25 Jan 2011 07:33:11 -0800

Thanks Erlend.

Not used SVN before, but have managed to download and build latest trunk code.

Now I'm getting an error when trying to access the admin page (via Jetty) because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but this appears to be no longer supplied as part of the build, so I get an exception cos it can't find that class. I've checked the CHANGES.txt and found the following in the change list to 1.4.0 (!?):

66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)

Unfortunately, I can't seem to get that to work correctly. Does anyone have an example fieldType stanza (for schema.xml) for stripping out HTML?
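
For reference, the kind of stanza being asked about would probably look something like the sketch below. It follows the CHANGES.txt entry by putting the char filter in front of an ordinary tokenizer. The fieldType name "text_html" is a made-up placeholder, and the choice of StandardTokenizerFactory plus a lowercase filter is only an assumption for illustration; any tokenizer should be usable in its place.

```xml
<!-- Hypothetical fieldType; the name "text_html" is a placeholder. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer, so the markup is removed first. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Assumed tokenizer; swap in whichever tokenizer the field needs. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that a char filter only affects what gets indexed; the stored value of the field would still contain the original HTML.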


Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:

On 25.01.11 11.30, Erlend Garåsen wrote:
Tika version 0.8 is not included in the latest release/trunk from SVN.
Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.
And to clarify, by "content" I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk.
Erlend
