Re: Indexing Text File By Sections In Lucene

Charlie Hull Thu, 04 Sep 2014 00:53:50 -0700

On 04/09/2014 07:09, sunilragidi wrote:

Hi, I have a requirement in which I have to index a text file using Lucene.


The text file data is from a PDF file. I have used Tika to extract text from
the PDF and put it into the text file.

This may be your mistake - IIRC Tika isn't great at preserving structure
within PDFs. We had a similar requirement a while ago to index large
PDFs by paragraphs, and the paragraph markers were being lost. I suggest
you look at other ways of extracting the plain text - pdftotext may
preserve more of the structure, I think that's what we used. Once you
have the individual sections you can index them as separate documents in
Solr, with metadata to indicate the document they came from.
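
As a rough illustration of the "one document per section" idea, here is a minimal Python sketch. It assumes sections in the extracted text are separated by blank lines, which may not hold for every PDF, so the splitting rule would need adjusting to whatever markers the extractor actually preserves. The function names and the field names (id, source, section, text) are made up for this example and are not part of Lucene, Solr, Tika, or pdftotext.

```python
import re


def split_into_sections(text):
    """Split plain text into sections on runs of blank lines.

    Assumption: a blank line marks a section boundary. Swap in another
    rule (headings, form feeds, etc.) if your extracted text differs.
    """
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


def build_section_docs(source_name, text):
    """Build one dict per section, tagged with the file it came from.

    The field names are hypothetical; map them to whatever fields your
    index schema defines.
    """
    return [
        {
            "id": f"{source_name}#{number}",
            "source": source_name,
            "section": number,
            "text": section,
        }
        for number, section in enumerate(split_into_sections(text), start=1)
    ]
```

Each dict could then be sent to the index as its own document, so a search hit points at a single section while the source field still identifies the original file.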


HTH

Charlie


I want to index the text file in the following way.

     1. I don't want to index the whole text file content.
     2. I don't want to index sentence by sentence.
     3. Instead, I want to index the text file by sections. (The text file is
huge.)

How can I do this? Any help would be greatly appreciated.

--Sunil



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Text-File-By-Sections-In-Lucene-tp4156843.html
Sent from the Lucene - General mailing list archive at Nabble.com.



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
