We are working on a search application for large PDFs (~10-100 MB), which have already been indexed correctly.
However, we now want to add a training step to the pipeline, so we are implementing some Spark MLlib algorithms. One new requirement is to split the documents into either paragraphs or pages. The alternatives we have found are splitting via Tika/PDFBox or writing a custom processor that catches the words. In terms of performance, which option is preferred: a custom Tika class that extracts just the paragraphs, or extracting the whole document and then filtering the paragraphs that match our vocabulary? A rough sketch of the kind of split we have in mind is below. Thanks for your advice.
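For reference, this is a minimal sketch of the two splits we are weighing, assuming PDFBox 2.x (the library Tika's PDF parser uses internally). The class name and the blank-line paragraph heuristic are our own assumptions, not existing code:

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    /** Splits a PDF into per-page text, and a page into paragraphs on blank lines. */
    public class PdfSplitter {

        /** One string per page, using PDFTextStripper's page range. */
        public static List<String> pages(File pdf) throws IOException {
            List<String> pages = new ArrayList<>();
            try (PDDocument doc = PDDocument.load(pdf)) {
                PDFTextStripper stripper = new PDFTextStripper();
                for (int p = 1; p <= doc.getNumberOfPages(); p++) {
                    stripper.setStartPage(p);
                    stripper.setEndPage(p);
                    pages.add(stripper.getText(doc));
                }
            }
            return pages;
        }

        /** Splits extracted page text into paragraphs on blank lines. */
        public static List<String> paragraphs(String pageText) {
            List<String> paragraphs = new ArrayList<>();
            // Blank lines usually mark paragraph boundaries in extracted text;
            // layout-heavy PDFs may need a smarter heuristic than this.
            for (String block : pageText.split("\\R\\s*\\R")) {
                String trimmed = block.trim();
                if (!trimmed.isEmpty()) {
                    paragraphs.add(trimmed);
                }
            }
            return paragraphs;
        }
    }

Splitting by page looks like the cheaper and more deterministic of the two, since page boundaries are part of the PDF structure, while paragraph boundaries have to be inferred heuristically from the extracted text.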