On 20/07/11 22:32, Simon Willnauer wrote:
On Wed, Jul 20, 2011 at 3:17 PM, raphael812<[email protected]>  wrote:
Hello everyone,

I am quite new to Lucene and I am using the book Lucene in Action to learn.
I need help extracting the body content of an HTML page using Tika. The
implementation from the book only extracts the HTML's metadata, not the main
body content, which is what I need. Is it possible to extract body content
from HTML pages and PDFs, and if so, how?
Thanks for your help.
Hey,
  this seems to be a Tika / extraction-specific question. You should
try asking it on the Tika list; I bet you'll get a quick
response there!

simon
Raphael

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-with-Lucene-tp3185409p3185409.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Hello all,
I tried searching an index I created, but it gives me the following error in NetBeans 6.9:

Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:249)
        at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:73)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:202)
        at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:63)
        at Searcher.search(Searcher.java:66)
        at Searcher.main(Searcher.java:59)

The trouble is that I am able to search that same index from the command line. Does anyone have an idea why this is so? It was working some weeks ago in NetBeans, and now it throws this error.
Thanks for the help.
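
One thing that may be worth checking, offered as an assumption and not something confirmed in this thread: "Unknown format version" is typically what an older Lucene jar reports when it opens an index written by a newer release, so the NetBeans project may be loading a different Lucene jar than the command line does. The sketch below uses only standard Java reflection to print where a class was loaded from. The class name ClasspathProbe and the helper describe are hypothetical names made up for this illustration.

```java
import java.net.URL;
import java.security.CodeSource;

public class ClasspathProbe {

    /** Describes where the named class was loaded from, or reports that it is missing. */
    static String describe(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // CodeSource is null for classes loaded by the bootstrap loader (e.g. java.lang.String)
            URL location = (source == null) ? null : source.getLocation();
            Package pkg = cls.getPackage();
            // The version comes from the jar manifest and may not be declared at all
            String version = (pkg == null) ? null : pkg.getImplementationVersion();
            return className
                    + " loaded from " + (location == null ? "<bootstrap/unknown>" : location.toString())
                    + ", implementation version " + (version == null ? "<not declared>" : version);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " not found on the classpath";
        }
    }

    public static void main(String[] args) {
        // Run this once inside the NetBeans project and once from the command line,
        // then compare the jar locations that are printed.
        System.out.println(describe("org.apache.lucene.index.IndexReader"));
    }
}
```

If the two environments print different jar paths or versions, pointing both at the same Lucene jar that wrote the index would be the first thing to try.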
