This might help: http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/
The key bit here is that you have to have Tika parse your file and then
extract the content to send to Solr...

Best
Erick

On Fri, Apr 20, 2012 at 7:36 PM, vasuj <vasu.j...@live.in> wrote:
> I'm trying to index a few pdf documents using SolrJ as described at
> http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. The code
> is below:
>
> import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
> import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
> import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;
>
> import org.apache.solr.client.solrj.SolrServer;
> import org.apache.solr.client.solrj.SolrServerException;
> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
> import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
> import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
> import org.apache.solr.common.util.NamedList;
> ...
> public static void indexFilesSolrCell(String fileName)
>         throws IOException, SolrServerException {
>
>     String urlString = "http://localhost:8080/solr";
>     SolrServer server = new CommonsHttpSolrServer(urlString);
>
>     ContentStreamUpdateRequest up =
>             new ContentStreamUpdateRequest("/update/extract");
>     up.addFile(new File(fileName));
>     String id = fileName.substring(fileName.lastIndexOf('/') + 1);
>     System.out.println(id);
>
>     up.setParam(LITERALS_PREFIX + "id", id);
>     // this field doesn't exist in schema.xml, it'll be created as attr_location
>     up.setParam(LITERALS_PREFIX + "location", fileName);
>     up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
>     up.setParam(MAP_PREFIX + "content", "attr_content");
>     up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>
>     // NamedList<Object> (not the raw type) so the typed for-each compiles
>     NamedList<Object> request = server.request(up);
>     for (Entry<String, Object> entry : request) {
>         System.out.println(entry.getKey());
>         System.out.println(entry.getValue());
>     }
> }
>
> Unfortunately, when querying for *:* I get the list of indexed documents but
> the content field is empty. How can I change the code above to also extract
> the document's content?
>
> Below is the xml fragment that describes this document:
>
> <doc>
>   <arr name="attr_content">
>     <str> </str>
>   </arr>
>   <arr name="attr_location">
>     <str>/home/alex/Documents/lsp.pdf</str>
>   </arr>
>   <arr name="attr_meta">
>     <str>stream_size</str>
>     <str>31203</str>
>     <str>Content-Type</str>
>     <str>application/pdf</str>
>   </arr>
>   <arr name="attr_stream_size">
>     <str>31203</str>
>   </arr>
>   <arr name="content_type">
>     <str>application/pdf</str>
>   </arr>
>   <str name="id">lsp.pdf</str>
> </doc>
>
> I don't think that this problem is related to an incorrect installation of
> Apache Tika, because previously I had a few ServerExceptions but now I've
> installed the required jars in the correct path. Moreover, I've tried to
> index a txt file using the same class but the attr_content field is always
> empty.
>
> I also tried setting stored="true" on the content field in the schema.xml
> file:
>
> <field name="text" type="textgen" indexed="true" stored="true"
>        required="false" multiValued="true"/>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-index-pdf-s-content-with-SolrJ-tp3927284p3927284.html
> Sent from the Solr - User mailing list archive at Nabble.com.
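
For reference, it may help to see which request parameters the quoted setParam calls end up sending to /update/extract. The sketch below uses only the JDK and spells out the literal values behind the ExtractingParams constants (LITERALS_PREFIX is "literal.", UNKNOWN_FIELD_PREFIX is "uprefix", MAP_PREFIX is "fmap."). The class and method names are hypothetical, made up for illustration. It only builds the query string and does not talk to Solr; the waitFlush/waitSearcher flags that setAction also sets are left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of SolrJ.
public class ExtractParamsSketch {

    // Mirrors the setParam calls from the quoted code as plain name/value pairs.
    static Map<String, String> buildParams(String fileName) {
        // Same id derivation as in the quoted code: text after the last '/'.
        String id = fileName.substring(fileName.lastIndexOf('/') + 1);

        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("literal.id", id);                // LITERALS_PREFIX + "id"
        params.put("literal.location", fileName);    // LITERALS_PREFIX + "location"
        params.put("uprefix", "attr_");              // UNKNOWN_FIELD_PREFIX
        params.put("fmap.content", "attr_content");  // MAP_PREFIX + "content"
        params.put("commit", "true");                // commit part of setAction(COMMIT, ...)
        return params;
    }

    // Joins the pairs into a URL-encoded query string, in insertion order.
    static String toQueryString(Map<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"));
            sb.append('=');
            sb.append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> params = buildParams("/home/alex/Documents/lsp.pdf");
        System.out.println(toQueryString(params));
    }
}
```

Read this way, fmap.content=attr_content asks the extracting handler to put the text Tika extracts into attr_content, and uprefix=attr_ prefixes any field that is not in schema.xml, which matches the attr_location and attr_meta fields in the quoted <doc>.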