Hi,

On 11 June 2012 19:42, Pierluca Sangiorgi <[email protected]> wrote:
> As an example: I have a PDF document that contains an invoice. I need to
> extract and index information relating to the recipient, price, sold
> items, item descriptions, and so on.
>
> Is Solr the right choice for this purpose, or do I need to use another
> framework in addition before posting the document to Solr?
>

Solr is a good choice, especially if you want to start leveraging the
power of search, but you will need to do a bit of work beforehand if you
want to split the information out so that you can make best use of it
later.

To achieve this you will first want to update schema.xml [1] to model
your target fields, i.e. the ones you mention above.

You will need to parse the PDF documents to get at the contents, using
something like Apache PDFBox [2], which is good if the documents are
Acrobat Forms as you can get the form field contents, or Apache Tika [3],
if you just want the text as a String. You can then extract the field
values from the content using pattern matching.

The fields can then be added to a document and posted to Solr using
Solrj.

Cheers,
Dave

[1] http://wiki.apache.org/solr/SchemaXml
[2] http://pdfbox.apache.org/
[3] http://tika.apache.org
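
P.S. In case it helps, here is a rough sketch of the pattern-matching
step only. It assumes the PDF text has already been extracted to a String
(by PDFBox or Tika), and it assumes the invoice contains lines such as
"Recipient: ..." and "Total: ...". Those labels, the class name
InvoiceFieldExtractor and the field names "recipient" and "price" are all
made up for illustration; your real invoices and schema will differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceFieldExtractor {

    // Hypothetical invoice labels; adjust the patterns to your real layout.
    private static final Pattern RECIPIENT =
            Pattern.compile("(?m)^Recipient:[ \\t]*(.+?)[ \\t]*$");
    private static final Pattern TOTAL =
            Pattern.compile("(?m)^Total:[ \\t]*([0-9]+(?:\\.[0-9]{2})?)");

    // Returns the fields found in the text, keyed by (assumed) Solr field name.
    public static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        putIfFound(fields, "recipient", RECIPIENT, text);
        putIfFound(fields, "price", TOTAL, text);
        return fields;
    }

    // Stores the first capture group under the given name if the pattern matches.
    private static void putIfFound(Map<String, String> fields, String name,
                                   Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        if (m.find()) {
            fields.put(name, m.group(1));
        }
    }

    public static void main(String[] args) {
        String text = "Invoice 42\nRecipient: ACME Ltd\nTotal: 123.45 EUR\n";
        System.out.println(extract(text));
    }
}
```

Each entry in the returned map could then be copied into a Solrj
SolrInputDocument with addField(name, value) before posting it to Solr.
Fields that are not found are simply left out, so you can decide per
field whether a missing value should be an error.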
