How to index pdf's content with SolrJ?

vasuj Fri, 20 Apr 2012 16:45:54 -0700

I'm trying to index a few PDF documents using SolrJ, as described at
http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. The code is
below:


import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;

import java.io.File;
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
...
public static void indexFilesSolrCell(String fileName)
    throws IOException, SolrServerException {

  String urlString = "http://localhost:8080/solr";
  SolrServer server = new CommonsHttpSolrServer(urlString);

  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  up.addFile(new File(fileName));
  String id = fileName.substring(fileName.lastIndexOf('/')+1);
  System.out.println(id);

  up.setParam(LITERALS_PREFIX + "id", id);
  // This field doesn't exist in schema.xml; it'll be created as attr_location.
  up.setParam(LITERALS_PREFIX + "location", fileName);
  up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
  up.setParam(MAP_PREFIX + "content", "attr_content");
  up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

  // Print every key/value pair of the server's response.
  NamedList<Object> response = server.request(up);
  for (Entry<String, Object> entry : response) {
    System.out.println(entry.getKey());
    System.out.println(entry.getValue());
  }
}

Unfortunately, when I query for *:* I get the list of indexed documents, but
the content field is empty. How can I change the code above so that it also
extracts the document's content?

Below is the XML fragment that describes this document:

<doc>
  <arr name="attr_content">
    <str>            </str>
  </arr>
  <arr name="attr_location">
    <str>/home/alex/Documents/lsp.pdf</str>
  </arr>
  <arr name="attr_meta">
    <str>stream_size</str>
    <str>31203</str>
    <str>Content-Type</str>
    <str>application/pdf</str>
  </arr>
  <arr name="attr_stream_size">
    <str>31203</str>
  </arr>
  <arr name="content_type">
    <str>application/pdf</str>
  </arr>
  <str name="id">lsp.pdf</str>
</doc>
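
The symptom in the fragment above (an attr_content array whose only value is whitespace) can also be checked programmatically. The following is only a minimal sketch using the JDK's built-in XML parser; the class name ContentCheck and the helper hasText are hypothetical names introduced for illustration, and the sample XML is a shortened copy of the document above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of SolrJ: inspects one <doc> element
// from a Solr XML response.
final class ContentCheck {

    // Returns true if the <arr> field with the given name holds at least
    // one <str> value that is not blank.
    static boolean hasText(String docXml, String fieldName) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(docXml)));
        NodeList arrays = dom.getElementsByTagName("arr");
        for (int i = 0; i < arrays.getLength(); i++) {
            Element arr = (Element) arrays.item(i);
            if (!fieldName.equals(arr.getAttribute("name"))) {
                continue;
            }
            NodeList values = arr.getElementsByTagName("str");
            for (int j = 0; j < values.getLength(); j++) {
                if (!values.item(j).getTextContent().trim().isEmpty()) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Shortened copy of the document shown above.
        String xml = "<doc>"
                + "<arr name=\"attr_content\"><str>            </str></arr>"
                + "<arr name=\"content_type\"><str>application/pdf</str></arr>"
                + "<str name=\"id\">lsp.pdf</str>"
                + "</doc>";
        System.out.println("attr_content has text: " + hasText(xml, "attr_content"));
        System.out.println("content_type has text: " + hasText(xml, "content_type"));
    }
}
```

Run against the fragment above, this reports no text for attr_content but text for content_type, which matches what the query shows: the metadata was stored, while the extracted body came back blank.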

I don't think this problem is caused by an incorrect installation of
Apache Tika: I was getting a few ServerExceptions earlier, but now I've
installed the required jars in the correct path. Moreover, I've tried to
index a txt file using the same class, and the attr_content field is still
empty.

I also tried setting stored="true" on the content field in schema.xml:

<field name="text" type="textgen" indexed="true" stored="true" required="false" multiValued="true"/>

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-index-pdf-s-content-with-SolrJ-tp3927284p3927284.html
Sent from the Solr - User mailing list archive at Nabble.com.
