Glad to hear it's working. The trick (as you've probably discovered) is to
properly map the meta-data to Solr fields. The extracting request handler
does this, but the real underlying issue is that there's no real standard.
Word docs might have "last_editor", PDFs might have just "author". And on a
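Since each format emits different metadata keys, one way to deal with it client-side is a small normalization step before sending documents to Solr. A minimal sketch, assuming you collect metadata into a plain map first; the candidate key names and the canonical "author" field here are illustrative, not Tika's actual complete key set:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataNormalizer {
    // Candidate source keys, in priority order, for the canonical "author"
    // field. These names are examples; real metadata keys vary by parser.
    private static final String[] AUTHOR_KEYS =
            {"author", "last_editor", "meta:author", "dc:creator"};

    public static Map<String, String> normalize(Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>(raw);
        for (String key : AUTHOR_KEYS) {
            String value = raw.get(key);
            if (value != null && !value.isEmpty()) {
                // Map whichever key is present onto one Solr field.
                out.put("author", value);
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> pdfMeta = new LinkedHashMap<>();
        pdfMeta.put("dc:creator", "P. Scadden");
        System.out.println(normalize(pdfMeta).get("author")); // prints P. Scadden
    }
}
```

The same idea is what the extracting handler's fmap.* parameters do server-side; doing it in your own code just gives you per-format control.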
Got it all working with Tika and SolrJ (got the correct artifacts). Much
faster now too, which is good. Thanks very much for your help.
On 3/1/2017 6:59 PM, Phil Scadden wrote:
> Exceptions never triggered, but metadata was essentially empty except
> for contentType, and content was always an empty string. I don’t know
> what the parser was doing, but I gave up and went with the extractHandler
> route instead, which did at least build a full
Belay that. I found out why the parser was just returning empty data - I didn’t
have the right artefact in maven. In case anyone else trips on this, the pom
needs both:

   <dependency>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-core</artifactId>
     <version>1.12</version>
   </dependency>
   <dependency>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-parsers</artifactId>
   </dependency>
>Another side issue: Using the extracting handler for handling rich documents
>is discouraged. Tika (which is what is used by the extracting
>handler) is pretty amazing software, but it has a habit of crashing or
>consuming all the heap memory when it encounters a document that it doesn't
>know how to handle.
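That failure mode is why running Tika in your own client (or a separate process) is the usual advice: a bad document then only takes down your indexer, not Solr. A common defensive pattern, sketched below with a stand-in task rather than real Tika/SolrJ calls, is to run each extraction under a hard timeout. Running extraction in a wholly separate JVM is safer still against OutOfMemoryError, since an in-process timeout cannot reclaim heap a parser has already eaten.

```java
import java.util.concurrent.*;

public class SafeExtract {
    // Run a potentially hanging extraction with a hard timeout.
    // parseTask stands in for a real Tika parse call (an assumption,
    // not actual Tika API).
    static String extractWithTimeout(Callable<String> parseTask, long timeoutSeconds)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(parseTask);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);  // interrupt the stuck parse
            return "";            // index the document without body text
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A well-behaved "parse" returns normally...
        System.out.println(extractWithTimeout(() -> "some text", 5));
        // ...a hung one is cut off and yields an empty body.
        System.out.println(extractWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 1));
    }
}
```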
The logging is coming from the application, which is running in Tomcat. Solr
itself is running in the embedded Jetty.
And yes, another look at the log4j config and I see that rootLogger is set to
DEBUG. I've changed that.
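For anyone tripping over the same thing, it's a one-line change, assuming log4j 1.x properties syntax (the appender name and the example package are placeholders for whatever your config uses):

```properties
# Quieten the root logger; DEBUG at the root makes indexing very chatty and slow.
log4j.rootLogger=WARN, CONSOLE

# Optionally keep more detail for your own code only (package name is an example).
log4j.logger.com.example.indexer=INFO
```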
>On the Solr server side, the 6.4.x versions have a bug that causes extremely
>high CPU usage.
On 3/1/2017 4:41 PM, Phil Scadden wrote:
> Using Solr 6.4.1 on Windows. Installed, and a trial POST on my directories
> worked okay. However, now trying to create an index from code running on
> Tomcat on the same machine as the SOLR server, with my own schema. Indexing
> of PDF is very slow. Investigating