I am working on Windows 7
--
View this message in context:
http://lucene.472066.n3.nabble.com/using-extract-handler-data-not-extracted-tp4110850p4110993.html
Sent from the Solr - User mailing list archive at Nabble.com.
Not really sure... the issue seems related to text extraction, so the first
suspect is Tika; Solr is playing a secondary role here. If Tika is extracting
the text correctly, there should be an error or a warning on the Solr side (an
exception, a content-field-too-long warning, or something like that).
What about th
Sorry for the mistake.
I'm using Solr 4.2, which has tika-1.3.
So now, java -jar tika-app-1.3.jar -v C:\Coding.pdf parses the PDF document
without any error or message.
Also, java -jar tika-app-1.3.jar -t C:\Coding.pdf shows the entire
document.
Which means there is no problem in Tika, right?
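A way to repeat this check from a script, and to look at the metadata as well as the text, is sketched below. It assumes `java` is on the PATH and that the Tika jar sits in the working directory; the helper names `tika_command` and `run_tika` are made up for illustration. `-t` prints the extracted text and `-m` prints only the metadata.

```python
import subprocess

# Tika CLI output modes used here: -t (plain text) and -m (metadata only).
TIKA_FLAGS = {"text": "-t", "metadata": "-m"}


def tika_command(jar, mode, document):
    """Build the argument list for one tika-app invocation."""
    return ["java", "-jar", jar, TIKA_FLAGS[mode], document]


def run_tika(jar, mode, document):
    """Run tika-app and return whatever it wrote to standard output."""
    result = subprocess.run(
        tika_command(jar, mode, document),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `run_tika("tika-app-1.3.jar", "text", r"C:\Coding.pdf")` and then the same with `"metadata"` shows both what text Tika extracts and which metadata keys it reports for the file.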
Sorry for the mistake.
I'm using Solr 4.2, which has tika-1.3.
So now, java -jar tika-app-1.3.jar -v C:\Coding.pdf parses the PDF document
without any error or message.
Also, java -jar tika-app-1.4.jar -t C:\Cloud.docx shows the entire
document.
Which means there is no problem in Tika, right?
Please stay on (or clarify) your issue: in the first example you told us
the problem is with the "Coding.pdf" file. What is that Cloud.docx? Why don't
you try with Coding.pdf? And what is the result of the extraction from the
command line with Coding.pdf and the same Tika version that is in your SOLR?
I w
Through the command line (java -jar tika-app-1.4.jar -v C:\Cloud.docx) Apache
Tika is able to parse .docx files, so can I use this tika-app-1.4.jar in
Solr? How do I do that?
A premise: as Erik explained, most probably this issue has nothing to do
with SOLR.
So, these are the options that, in my mind, you have:
*OPTION #1: Using Tika as a command line tool*
a) Download Tika. Make sure it is the same version as the one in your SOLR
b) Read here: http://tika.apache.org/1.4/gettingstarted.h
Yes, all 3 points are right.
Let me solve the first one first: there is some error in indexing at the Tika
level, and for that I need to debug at the Tika level, right?
But how do I do that? The Solr admin does not show package-wise logging.
Wait, don't confuse things... they should be three different issues:
1. With curl, indexing happens but leaves the content field empty, so
probably something occurs at the Tika level during the text extraction. That's
the reason why I told you about the Tika logging
2. With SolrJ, indexing doesn't happen
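For the first issue in the list, one way to see what Solr itself receives from Tika is to send the file to the extract handler with `extractOnly=true`, which returns the extracted content instead of indexing it. The sketch below assumes Solr listens on localhost:8983 and that the core is named `document` (taken from the `[document]` prefix in the logs quoted in this thread); both are assumptions, as are the helper names.

```python
import urllib.parse
import urllib.request


def extract_only_url(base_url, core):
    """URL of the extract handler with extractOnly=true (nothing gets indexed)."""
    params = urllib.parse.urlencode({"extractOnly": "true", "wt": "json"})
    return "{}/{}/update/extract?{}".format(base_url.rstrip("/"), core, params)


def extract_only(base_url, core, path, content_type):
    """POST the raw file body and return Solr's response as text."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        extract_only_url(base_url, core),
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, `extract_only("http://localhost:8983/solr", "document", r"C:\Coding.pdf", "application/pdf")`. If the response contains the document text, that suggests extraction works inside Solr and the empty field is a mapping question; if it comes back empty, the Tika level is the more likely culprit.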
This is the output I get when indexing through SolrJ; I followed the link
you suggested.
I tried indexing a .doc file.
400
17
org.apache.solr.search.SyntaxError: Cannot parse
'id:C:\solr\document\src\new_index_doc\document_1.doc': Encountered " ":" ":
"" at line 1, column 4. Was expecting one o
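The parse error above is typical of a query whose value contains unescaped `:` and `\` characters, so the parser reads the Windows path as query syntax. A minimal sketch of escaping such a value before building a query follows. The character set is meant to mirror what SolrJ's `ClientUtils.escapeQueryChars` escapes; treat the exact list and the helper name `escape_query_value` as assumptions.

```python
# Characters assumed to be special to the Lucene/Solr query parser.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_value(value):
    """Backslash-escape query-parser metacharacters and whitespace."""
    out = []
    for ch in value:
        if ch in SPECIAL_CHARS or ch.isspace():
            out.append("\\")
        out.append(ch)
    return "".join(out)


doc_id = r"C:\solr\document\src\new_index_doc\document_1.doc"
query = "id:" + escape_query_value(doc_id)
print(query)
```

Only the value is escaped; the `id:` field prefix keeps its colon so the parser still sees a field query. In SolrJ itself the equivalent step would be to pass the id through `ClientUtils.escapeQueryChars` before concatenating it into the query string.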
You know, what I'd do is one of two things:
1> Set up a remote debugging session for your server and debug it. It's
actually quite simple. Get the source code (see
http://wiki.apache.org/solr/HowToContribute). I'll give you
http://wiki.apache.org/solr/HowToContribute. The sections near the
bottom w
The logging screen does not show the Tika package. I also searched the net, and it
requires log4j and slf4j jars; is that true? Do I need to do the
configuration for package-level logging?
On the admin console you should be able to tune the log at package level
On 11 Jan 2014 17:31, "sweety" wrote:
> How do I set FINEST for the Tika package?
How do I set FINEST for the Tika package?
Set the Tika packages to FINEST too.
On 11 Jan 2014 15:25, "sweety" wrote:
> I set the level of extract handler to finest, now the logs are :
> INFO: [document] webapp=/solr path=/update/extract
> params={commit=true&literal.id=12&debug=true} {add=[12
> (1456944038966984704)],commit=} 0 2631
> Jan 11,
I set the level of the extract handler to FINEST; now the logs are:
INFO: [document] webapp=/solr path=/update/extract
params={commit=true&literal.id=12&debug=true} {add=[12
(1456944038966984704)],commit=} 0 2631
Jan 11, 2014 7:51:57 PM org.apache.solr.servlet.SolrDispatchFilter
handleAdminRequest
INF
Try setting the extract request handler and the Tika packages to FINEST / DEBUG
level, and post the relevant log lines.
On 11 Jan 2014 14:38, "sweety" wrote:
> Sorry, that my question was not clear.
> Initially when indexed pdf files it showed the data within this pdf in the
> contents field.as follows:(t
Sorry that my question was not clear.
Initially, when I indexed PDF files, the data within the PDF showed up in the
contents field, as follows (this is the output for the initially indexed documents):
Cloud ctured As tale in size as well as complexity. We need a cloud based
system that will solve this problem
> Why is it so??
I'm reading your post on my mobile so probably I didn't get the point:
other than the date_modified field, what is the problem? Fields with the
"ignored" prefix? That is perfectly right according to your configuration.
The other fields you declared aren't there because they are not
Are you sure date_modified is a meta-data field in the PDF document
you're extracting?
Best,
Erick
On Sat, Jan 11, 2014 at 3:00 AM, sweety wrote:
> I need to index rich text documents; this is the solrconfig.xml for the extract
> handler:
> class="solr.extraction.ExtractingRequestHandler" >
>
>
> tr