Re: Problem with SolrJ and indexing PDF files

Jörn Franke Sun, 19 May 2019 05:07:51 -0700

You can use the Tika library to parse the PDFs on the client side and then 
post the extracted text to the Solr server.
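
A side note on the error quoted below, separate from the Tika route: "Bad 
contentType for search handler" suggests the request reached a search handler 
rather than the extract handler. As far as I know, SolrJ's HTTP client falls 
back to its default path, /select, when a request path does not start with 
"/", and the quoted code passes "update/extract" without the leading slash. 
The following is a minimal sketch of that fallback in plain Java so the effect 
is easy to see. PathFallbackSketch and resolvePath are illustrative names, not 
part of SolrJ, and the rule itself is an assumption about the client's 
behaviour.

```java
// Illustrative only: mimics the assumed SolrJ rule that a request path
// without a leading "/" is replaced by the default search path.
class PathFallbackSketch {

    // Assumed default handler path used by the client.
    static final String DEFAULT_PATH = "/select";

    // Returns the path the request would effectively be sent to.
    static String resolvePath(String requestPath) {
        if (requestPath == null || !requestPath.startsWith("/")) {
            return DEFAULT_PATH;
        }
        return requestPath;
    }

    public static void main(String[] args) {
        // Path as written in the quoted code: no leading slash.
        System.out.println(resolvePath("update/extract"));
        // Same path with the leading slash added.
        System.out.println(resolvePath("/update/extract"));
    }
}
```

If that is what happens here, constructing the request with "/update/extract" 
should route it to the extract handler, provided that handler is configured 
for the core.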


> On 19.05.2019 at 11:02, Mareike Glock 
> <mareike.gl...@student.htw-berlin.de> wrote:
> 
> Dear Solr Team,
> 
> I am trying to index Word and PDF documents with Solr using SolrJ, but most 
> of the examples I found on the internet use the SolrServer class which I 
> guess is deprecated. 
> The connection to Solr itself is working, because I can add 
> SolrInputDocuments to the index but it does not work for rich documents 
> because I get an exception.
> 
> 
> public static void main(String[] args) throws IOException, SolrServerException {
>         String urlString = "http://localhost:8983/solr/localDocs16";
>         HttpSolrClient solr = new HttpSolrClient.Builder(urlString).build();
> 
>         //is working
>         for(int i=0;i<1000;++i) {
>             SolrInputDocument doc = new SolrInputDocument();
>             doc.addField("cat", "book");
>             doc.addField("id", "book-" + i);
>             doc.addField("name", "The Legend of the Hobbit part " + i);
>             solr.add(doc);
>             if(i%100==0) solr.commit();  // periodically flush
>         }
> 
>         //is not working
>         File file = new File("path\\testfile.pdf");
> 
>         ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("update/extract");
> 
>         req.addFile(file, "application/pdf");
>         req.setParam("literal.id", "doc1");
>         req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>         try{
>             solr.request(req);
>         }
>         catch(IOException e){
>             PrintWriter out = new PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
>             e.printStackTrace(out);
>             out.close();
>             System.out.println("IO message: " + e.getMessage());
>         } catch(SolrServerException e){
>             PrintWriter out = new PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
>             e.printStackTrace(out);
>             out.close();
>             System.out.println("SolrServer message: " + e.getMessage());
>         } catch(Exception e){
>             PrintWriter out = new PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
>             e.printStackTrace(out);
>             out.close();
>             System.out.println("UnknownException message: " + e.getMessage());
>         }finally{
>             solr.commit();
>         }
> }
> 
> 
> I am using Maven (pom.xml attached) and created a JAR file, which I then 
> tried to execute from the command line, and this is the output I get:
> 
>     SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>     SLF4J: Defaulting to no-operation (NOP) logger implementation
>     SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>     SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
>     SLF4J: Defaulting to no-operation MDCAdapter implementation.
>     SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
>     message: UnknownException message: Error from server at http://localhost:8983/solr/localDocs17: Bad contentType for search handler :application/pdf request={wt=javabin&version=2}
> 
> 
> 
> 
> 
> I hope you may be able to help me with this. I also posted this issue on 
> GitHub.
> 
> Cheers,
> Mareike Glock
> 
> <pom.xml>
