Re: Excessive Wire logging while indexing.

2017-03-02 Thread Erick Erickson
Glad to hear it's working. The trick (as you've probably discovered) is to properly map the meta-data to Solr fields. The extracting request handler does this, but the real underlying issue is that there's no real standard. Word docs might have "last_editor", PDFs might have just "author". And on a
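A minimal sketch of the mapping idea described above, assuming the metadata has already been extracted into a plain key/value map. The class name, the method name, and the target field names ("author", "content_type") are hypothetical, chosen only for illustration. It folds differently named source keys into one Solr field and drops keys it does not recognise:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: folds inconsistent document metadata keys into a
// fixed set of Solr field names. Not part of Tika or SolrJ.
final class MetadataFieldMapper {

    // Raw metadata key -> Solr field name. Insertion order is the priority
    // order: when two raw keys map to the same field, the earlier one wins.
    private static final Map<String, String> KEY_TO_FIELD = new LinkedHashMap<>();
    static {
        KEY_TO_FIELD.put("author", "author");            // e.g. PDFs
        KEY_TO_FIELD.put("last_editor", "author");       // e.g. Word docs
        KEY_TO_FIELD.put("contentType", "content_type"); // assumed field name
    }

    // Returns only the fields we know how to map; unknown keys are ignored.
    static Map<String, String> toSolrFields(Map<String, String> rawMetadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : KEY_TO_FIELD.entrySet()) {
            String value = rawMetadata.get(mapping.getKey());
            if (value != null && !value.isEmpty()) {
                // Keep the first non-empty value seen for each Solr field.
                fields.putIfAbsent(mapping.getValue(), value);
            }
        }
        return fields;
    }
}
```

With SolrJ, each entry of the returned map could then be copied into a SolrInputDocument with addField before the document is sent to Solr.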

RE: Excessive Wire logging while indexing.

2017-03-02 Thread Phil Scadden
Got it all working with Tika and SolrJ. (Got the correct artifacts). Much faster now too which is good. Thanks very much for your help.

Re: Excessive Wire logging while indexing.

2017-03-02 Thread Shawn Heisey
On 3/1/2017 6:59 PM, Phil Scadden wrote:
> Exceptions never triggered, but metadata was essentially empty except for contentType, and content was always an empty string. I don’t know what the parser was doing, but I gave up and went with the extractHandler route instead, which did at least build a full

RE: Excessive Wire logging while indexing. Blank output from tika parser

2017-03-01 Thread Phil Scadden
Belay that. I found out why the parser was just returning empty data: I didn’t have the right artefact in Maven. In case anyone else trips on this, the dependencies (groupId:artifactId:version) are: org.apache.tika:tika-core:1.12 and org.apache.tika:tika-parsers

RE: Excessive Wire logging while indexing.

2017-03-01 Thread Phil Scadden
> Another side issue: Using the extracting handler for handling rich documents is discouraged. Tika (which is what is used by the extracting handler) is pretty amazing software, but it has a habit of crashing or consuming all the heap memory when it encounters a document that it doesn't k

RE: Excessive Wire logging while indexing.

2017-03-01 Thread Phil Scadden
The logging is coming from the application, which is running in Tomcat. Solr itself is running in the embedded Jetty. And yes, on another look at the log4j configuration, I see that the rootLogger is set to DEBUG. I've changed that.
> On the Solr server side, the 6.4.x versions have a bug that causes extremely high

Re: Excessive Wire logging while indexing.

2017-03-01 Thread Shawn Heisey
On 3/1/2017 4:41 PM, Phil Scadden wrote:
> Using Solr 6.4.1 on Windows. Installed and trial POST on my directories worked okay. However, now trying to create an index from code running on Tomcat on the same machine as the Solr server with my own schema. Indexing of PDF is very slow. Investigat