Simple code like this:

    File file = new File("test.pdf");
    InputStream input = new FileInputStream(file);
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    AutoDetectParser parse = new AutoDetectParser();
    parse.parse(input, handler, metadata);
    input.close();

The extracted content is handler.toString().

rgds,
canal

________________________________
From: go canal <goca...@yahoo.com>
To: solr-user@lucene.apache.org
Sent: Sun, June 27, 2010 9:45:57 AM
Subject: Re: How to index rich document with XML payload?

Hi,

I just started using Solr....I am using SolrJ client, but uploading the file directly to Solr. I think we can use Tika in our code first.

Here I send the file directly to Solr which will do the text extraction:

    CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    solr.setRequestWriter(new BinaryRequestWriter());
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    // read a file
    File file = new File("tutorial.pdf");
    up.addFile(file);
    up.setParam("literal.id", "tutorial.pdf");
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solr.request(up);

So what we need to do is to add Tika.

I have a question about up.setParam - am I able to create my own fields ?

rgds,
canal

________________________________
From: Steve Johnson <st...@parisgroup.net>
To: solr-user@lucene.apache.org
Sent: Sun, June 27, 2010 6:50:01 AM
Subject: How to index rich document with XML payload?

Greetings,

I am new to Solr, but have gotten as far as successfully indexing documents both by sending XML describing the document and by sending the document itself using "update/extract". What I want to do now is, in effect, do both of these on each of my documents. I want to be able to have Tika do its magic first, and then I want to add additional fields to my document entries using XML. Is there any way to do this? In general, is there any way to apply multiple update requests to a single document entry?

I do understand that I can put literal values on the "update/extract" URL to do what I'm asking. This is what I'll have to do if I can't figure out another way, but it seems messy to me...I'd much rather send an XML payload. TIA for any help.
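
For reference, the literal-values approach mentioned above could be sketched roughly as follows. This is only an illustration using the plain JDK, not SolrJ: the class name ExtractUrlBuilder and the "author" field are hypothetical, and it assumes every field named this way is already defined in the Solr schema. Each extra field is passed as a literal.<fieldname> parameter on the "update/extract" URL, the same naming used by up.setParam("literal.id", ...) in the SolrJ snippet earlier in the thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds an update/extract URL carrying extra
// fields as literal.<fieldname> parameters. It does not send anything.
public class ExtractUrlBuilder {

    // solrBase is assumed to be the Solr root, e.g. http://localhost:8983/solr
    static String build(String solrBase, Map<String, String> fields, boolean commit) {
        StringBuilder url = new StringBuilder(solrBase).append("/update/extract");
        char sep = '?';
        for (Map.Entry<String, String> e : fields.entrySet()) {
            // Each entry becomes literal.<name>=<value>, URL-encoded
            url.append(sep).append("literal.").append(encode(e.getKey()))
               .append('=').append(encode(e.getValue()));
            sep = '&';
        }
        if (commit) {
            url.append(sep).append("commit=true");
        }
        return url.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "tutorial.pdf");
        fields.put("author", "Steve Johnson"); // hypothetical schema field
        System.out.println(build("http://localhost:8983/solr", fields, true));
    }
}
```

The document itself would still be sent as the request body to the resulting URL. With SolrJ, the equivalent would presumably be one up.setParam("literal.<fieldname>", value) call per extra field.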