Re: Problem with SolrJ and indexing PDF files

Erick Erickson Sun, 19 May 2019 09:17:21 -0700

Here’s a skeletal program to get you started using Tika directly in a SolrJ 
client, with a long explication of why using Solr’s extracting request handler 
is probably not what you want to do in production:


https://lucidworks.com/2012/02/14/indexing-with-solrj/
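
For orientation, here is a minimal sketch of that approach: Tika runs in the client, and only the extracted text is sent to Solr. It is not the code from the article. The class name TikaIndexer, the helper names, and the "content" field are assumptions made up for illustration; the core URL is the one from the original post. It assumes tika-parsers and solr-solrj are on the classpath and that the schema accepts a "content" field.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TikaIndexer {

    // Parse the file on the client side and return its plain text.
    public static String extractText(Path file)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables BodyContentHandler's default character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }

    // Build the Solr document; "content" is a hypothetical field name.
    public static SolrInputDocument buildDoc(String id, String text) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("content", text);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Core URL taken from the original post; adjust to your setup.
        String urlString = "http://localhost:8983/solr/localDocs16";
        Path file = Paths.get(args[0]);
        try (HttpSolrClient solr = new HttpSolrClient.Builder(urlString).build()) {
            // Use the file name as the document id (an arbitrary choice).
            solr.add(buildDoc(file.getFileName().toString(), extractText(file)));
            solr.commit();
        }
    }
}
```

Run it with the path to a PDF or Word file as the only argument. Because parsing happens in the client, a malformed document can only take down the indexing program, not the Solr node.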

SolrServer was renamed SolrClient 4 1/2 years ago. One of my pet peeves is that 
lots of pages don’t have dates attached. The link above was published in 2012 
and updated after the rename, but even so you’ll find some methods in it that 
have since been deprecated.

If you’re using SolrCloud, you should be using CloudSolrClient rather than 
HttpSolrClient.
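
A minimal sketch of what that could look like with a 2019-era SolrJ (7.x or 8.x). The ZooKeeper address localhost:9983, the class name CloudIndexer, and the helper sampleDoc are assumptions for illustration; the collection name and fields are taken from the original post.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {

    // Same sample fields as in the original post.
    public static SolrInputDocument sampleDoc(int i) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("cat", "book");
        doc.addField("id", "book-" + i);
        doc.addField("name", "The Legend of the Hobbit part " + i);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper address; no chroot is used here.
        List<String> zkHosts = Collections.singletonList("localhost:9983");
        String collection = "localDocs16";
        try (CloudSolrClient solr =
                 new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
            // Name the collection per request instead of putting it in a URL.
            solr.add(collection, sampleDoc(1));
            solr.commit(collection);
        }
    }
}
```

The client reads the cluster state from ZooKeeper and routes each document to the right shard, so no single Solr node URL is hard-coded.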

Best,
Erick

> On May 19, 2019, at 5:07 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> You can use the Tika library to parse the PDFs and then post the text to the 
> Solr servers
> 
>> On 19.05.2019 at 11:02, Mareike Glock 
>> <mareike.gl...@student.htw-berlin.de> wrote:
>> 
>> Dear Solr Team,
>> 
>> I am trying to index Word and PDF documents with Solr using SolrJ, but most 
>> of the examples I found on the internet use the SolrServer class which I 
>> guess is deprecated. 
>> The connection to Solr itself is working, because I can add 
>> SolrInputDocuments to the index but it does not work for rich documents 
>> because I get an exception.
>> 
>> 
>> public static void main(String[] args) throws IOException, 
>> SolrServerException {
>>        String urlString = "http://localhost:8983/solr/localDocs16";
>>        HttpSolrClient solr = new HttpSolrClient.Builder(urlString).build();
>> 
>>        //is working
>>        for(int i=0;i<1000;++i) {
>>            SolrInputDocument doc = new SolrInputDocument();
>>            doc.addField("cat", "book");
>>            doc.addField("id", "book-" + i);
>>            doc.addField("name", "The Legend of the Hobbit part " + i);
>>            solr.add(doc);
>>            if(i%100==0) solr.commit();  // periodically flush
>>        }
>> 
>>        //is not working
>>        File file = new File("path\\testfile.pdf");
>> 
>>        ContentStreamUpdateRequest req = new 
>> ContentStreamUpdateRequest("update/extract");
>> 
>>        req.addFile(file, "application/pdf");
>>        req.setParam("literal.id", "doc1");
>>        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>        try{
>>            solr.request(req);
>>        }
>>        catch(IOException e){
>>            PrintWriter out = new 
>> PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
>>            e.printStackTrace(out);
>>            out.close();
>>            System.out.println("IO message: " + e.getMessage());
>>        } catch(SolrServerException e){
>>            PrintWriter out = new 
>> PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
>>            e.printStackTrace(out);
>>            out.close();
>>            System.out.println("SolrServer message: " + e.getMessage());
>>        } catch(Exception e){
>>            PrintWriter out = new 
>> PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
>>            e.printStackTrace(out);
>>            out.close();
>>            System.out.println("UnknownException message: " + e.getMessage());
>>        }finally{
>>            solr.commit();
>>        }
>> }
>> 
>> 
>> I am using Maven (pom.xml attached) and created a JAR file, which I then 
>> tried to execute from the command line, and this is the output I get:
>> 
>>    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>    SLF4J: Defaulting to no-operation (NOP) logger implementation
>>    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
>> details.
>>    SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
>>    SLF4J: Defaulting to no-operation MDCAdapter implementation.
>>    SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for 
>> further details.
>>    message: UnknownException message: Error from server at 
>> http://localhost:8983/solr/localDocs17: Bad contentType for search handler 
>> :application/pdf request={wt=javabin&version=2}
>> 
>> 
>> 
>> 
>> 
>> I hope you may be able to help me with this. I also posted this issue on 
>> Github.
>> 
>> Cheers,
>> Mareike Glock
>> 
>> <pom.xml>
