On 04/09/2014 07:09, sunilragidi wrote:
Hi, I have a requirement in which I have to index a text file using Lucene.
The text file data is from a PDF file. I have used Tika to extract text from
PDF and put it into the text file.
This may be your mistake - IIRC Tika isn't great at preserving structure
within PDFs. We had a similar requirement a while ago to index large
PDFs by paragraphs, and the paragraph markers were being lost. I suggest
you look at other ways of extracting the plain text - pdftotext may
preserve more of the structure; I think that's what we used. Once you
have the individual sections you can index them as separate documents in
Solr, with metadata to indicate the document they came from.
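As a rough illustration of the splitting step, here is a minimal Python sketch. It assumes the extracted text marks section boundaries with blank lines, and it also treats the form feed that pdftotext writes between pages as a break. The file name report.txt, the function names, and the field names (id, source, section_no, body) are all made up for the example, and your PDFs may need a different rule for what counts as a section.

```python
import re
from pathlib import Path


def split_sections(text):
    """Split extracted text into sections on form feeds and blank lines."""
    # pdftotext separates pages with a form feed; treat it as a break too.
    normalized = text.replace("\r\n", "\n").replace("\f", "\n\n")
    chunks = re.split(r"\n\s*\n", normalized)
    # Rejoin wrapped lines within a section and drop empty chunks.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def to_documents(text, source):
    """Build one record per section, tagged with the file it came from."""
    return [
        {"id": f"{source}#{n}", "source": source, "section_no": n, "body": body}
        for n, body in enumerate(split_sections(text), start=1)
    ]


if __name__ == "__main__":
    # Hypothetical file, e.g. produced by: pdftotext report.pdf report.txt
    path = Path("report.txt")
    if path.exists():
        content = path.read_text(encoding="utf-8", errors="replace")
        for doc in to_documents(content, path.name):
            print(doc["id"], doc["body"][:60])
```

Each record would then become one document in the index, with source and section_no stored alongside the body so a hit can be traced back to the original file.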
HTH
Charlie
I want to index the text file in the following way.
1. I don't want to index the whole text file content.
2. I don't want to index sentence by sentence.
3. Instead, I want to index the text file by sections. (The text file is
huge.)
How can I do this? Any help would be greatly appreciated.
--Sunil
--
View this message in context:
http://lucene.472066.n3.nabble.com/Indexing-Text-File-By-Sections-In-Lucene-tp4156843.html
Sent from the Lucene - General mailing list archive at Nabble.com.
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk