I'm trying to index a few PDF documents using SolrJ, as described at http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. Here is the code:
import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;

import java.io.File;
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

...

public static void indexFilesSolrCell(String fileName) throws IOException, SolrServerException {
    String urlString = "http://localhost:8080/solr";
    SolrServer server = new CommonsHttpSolrServer(urlString);

    // Send the file to the extracting handler (Solr Cell).
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.addFile(new File(fileName));

    // Use the file name (without its directory) as the document id.
    String id = fileName.substring(fileName.lastIndexOf('/') + 1);
    System.out.println(id);

    up.setParam(LITERALS_PREFIX + "id", id);
    // This field doesn't exist in schema.xml, so it'll be created as attr_location.
    up.setParam(LITERALS_PREFIX + "location", fileName);
    up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
    up.setParam(MAP_PREFIX + "content", "attr_content");
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    // Print every key/value pair of the response.
    NamedList<Object> response = server.request(up);
    for (Entry<String, Object> entry : response) {
        System.out.println(entry.getKey());
        System.out.println(entry.getValue());
    }
}

Unfortunately, when I query for *:* I get the list of indexed documents, but the content field is empty. How can I change the code above so that it also extracts the document's content?
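To make this easier to compare with a plain HTTP request, here is a minimal sketch of the parameter names I believe the code above ends up sending. It assumes the ExtractingParams constants resolve to "literal.", "uprefix" and "fmap." (the names listed on the ExtractingRequestHandler wiki page); I haven't checked this against the jar I'm running. The class and method names are made up for illustration, and the sketch never contacts Solr:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: mirrors the setParam calls above without contacting Solr.
class ExtractParamsSketch {
    // Assumed values of the ExtractingParams constants (not read from Solr).
    static final String LITERALS_PREFIX = "literal.";
    static final String UNKNOWN_FIELD_PREFIX = "uprefix";
    static final String MAP_PREFIX = "fmap.";

    // Builds the same name/value pairs that indexFilesSolrCell passes to setParam.
    static Map<String, String> buildParams(String fileName) {
        // Same id derivation as in indexFilesSolrCell: the text after the last '/'.
        String id = fileName.substring(fileName.lastIndexOf('/') + 1);
        Map<String, String> params = new LinkedHashMap<>();
        params.put(LITERALS_PREFIX + "id", id);
        params.put(LITERALS_PREFIX + "location", fileName);
        params.put(UNKNOWN_FIELD_PREFIX, "attr_");
        params.put(MAP_PREFIX + "content", "attr_content");
        return params;
    }

    public static void main(String[] args) {
        // Print one name=value pair per line, in insertion order.
        buildParams("/home/alex/Documents/lsp.pdf")
            .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

If those assumed names are right, the mapping of Tika's "content" field onto attr_content is sent as fmap.content=attr_content, which is the parameter I would expect to see in the Solr request log.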
Here is the XML fragment that Solr returns for one of the indexed documents:

<doc>
  <arr name="attr_content">
    <str> </str>
  </arr>
  <arr name="attr_location">
    <str>/home/alex/Documents/lsp.pdf</str>
  </arr>
  <arr name="attr_meta">
    <str>stream_size</str>
    <str>31203</str>
    <str>Content-Type</str>
    <str>application/pdf</str>
  </arr>
  <arr name="attr_stream_size">
    <str>31203</str>
  </arr>
  <arr name="content_type">
    <str>application/pdf</str>
  </arr>
  <str name="id">lsp.pdf</str>
</doc>

I don't think this problem is caused by an incorrect installation of Apache Tika: earlier I got a few ServerException errors, but since then I've installed the required jars in the correct path. Moreover, I've tried indexing a txt file with the same class, and the attr_content field is still empty.

I also tried setting stored="true" on the content field in schema.xml:

<field name="text" type="textgen" indexed="true" stored="true" required="false" multiValued="true"/>

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-index-pdf-s-content-with-SolrJ-tp3927284p3927284.html
Sent from the Solr - User mailing list archive at Nabble.com.