We are working on a search application for large PDFs (~10-100 MB), which have already been indexed correctly.
However, we now want to add a training step to the pipeline, so we are implementing some Spark MLlib algorithms. One new requirement is to split the documents into either paragraphs or pages. The alternatives we have found are splitting via Tika/PDFBox or writing a custom processor that catches the words. In terms of performance, which option is preferred: a custom Tika class that extracts just the paragraphs, or extracting the whole document and then filtering the paragraphs that match our vocabulary? A rough sketch of the kind of split we have in mind is below. Thanks for your advice.
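For reference, this is a minimal sketch of the two splits we are weighing, assuming PDFBox 2.x (the library Tika's PDF parser uses internally). The class name and the blank-line paragraph heuristic are our own assumptions, not existing code:

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    /** Splits a PDF into per-page text, and a page into paragraphs on blank lines. */
    public class PdfSplitter {

        /** One string per page, using PDFTextStripper's page range. */
        public static List<String> pages(File pdf) throws IOException {
            List<String> pages = new ArrayList<>();
            try (PDDocument doc = PDDocument.load(pdf)) {
                PDFTextStripper stripper = new PDFTextStripper();
                for (int p = 1; p <= doc.getNumberOfPages(); p++) {
                    stripper.setStartPage(p);
                    stripper.setEndPage(p);
                    pages.add(stripper.getText(doc));
                }
            }
            return pages;
        }

        /** Splits extracted page text into paragraphs on blank lines. */
        public static List<String> paragraphs(String pageText) {
            List<String> paragraphs = new ArrayList<>();
            // Blank lines usually mark paragraph boundaries in extracted text;
            // layout-heavy PDFs may need a smarter heuristic than this.
            for (String block : pageText.split("\\R\\s*\\R")) {
                String trimmed = block.trim();
                if (!trimmed.isEmpty()) {
                    paragraphs.add(trimmed);
                }
            }
            return paragraphs;
        }
    }

Splitting by page looks like the cheaper and more deterministic of the two, since page boundaries are part of the PDF structure, while paragraph boundaries have to be inferred heuristically from the extracted text.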