Hi Chris, thanks for looking at this.
I'm using Solr 1.4.0 including the Tika that's in the tgz file which
means Tika 0.4.
I've now discovered that only two letters are required. A single line
with XE will crash it.
This fails:
r...@gamma:/home/ross# hexdump -C test.txt
58 45 0a
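For anyone who wants to try this without an editor, here is a minimal Python sketch that writes the same three bytes shown in the hexdump above. It is not part of the original report. The function name `write_repro_file` is made up for illustration, and the file name `test.txt` is taken from the hexdump command.

```python
from pathlib import Path

# Bytes taken from the hexdump above: "X", "E", newline.
REPRO_BYTES = bytes([0x58, 0x45, 0x0A])


def write_repro_file(path="test.txt"):
    """Write the three-byte test file and return its contents as a hex string."""
    target = Path(path)
    target.write_bytes(REPRO_BYTES)
    # Read the file back so the caller can compare against the hexdump.
    return " ".join(f"{b:02x}" for b in target.read_bytes())


if __name__ == "__main__":
    print(write_repro_file())
```

Writing the bytes directly rules out the editor adding anything extra, such as a byte order mark or a carriage return.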
: Yes, please report this to the Tika project.
Except that when I run "tika-app-0.6.jar" on a text file like the one Ross
describes, I don't get the error he describes, which means it may be
something off in how Solr is using Tika.
Ross: I can't reproduce this error on the trunk using the exam
Yes, please report this to the Tika project.
Erik
On Mar 31, 2010, at 9:31 PM, Ross wrote:
Does anyone have any thoughts or suggestions on this? I guess it's
really a Tika problem. Should I try to report it to the Tika project?
I wonder if someone could try it to see if it's a general problem or
just me. I can reproduce it by firing up the nano editor, creating a
file with XXBLE on one
I thought you might ask that :-)
It's because the pdf files are scanned from paper documents and OCR'd
to produce text. They still contain the image so are huge. The smaller
files are about 40 MB and cause a Java out of heap memory error. The
larger files are getting close to 500 MB. I didn't have
Why not feed the original PDF files in instead? Just curious if
pdftotext is doing a better job than Tika's PDFBox stuff.
Erik
On Mar 22, 2010, at 9:30 AM, Ross wrote:
Thanks Georg
I don't think it's that because it crashes on a one word test file I
create using the nano editor. I don't think nano is adding anything
extra.
My real files are created by a Windows utility called pdftotext. I
solved the problem by getting pdftotext to generate html files rather
tha
Hi,
I had a problem indexing documents some months ago as well. I found
that there were XML control characters in the documents and that Solr
did not handle them. Maybe that is the case for you as well.
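A minimal sketch of how such characters could be filtered out before a document is sent for indexing. It assumes the text is already decoded to a Python string. The helper name `strip_invalid_xml_chars` is hypothetical and is not a Solr or Tika API. The character ranges are the ones XML 1.0 allows. Whether this applies to the crash reported in this thread is not established.

```python
import re

# Matches every character that XML 1.0 does NOT allow in a document.
# Allowed: tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that are illegal in XML 1.0 removed."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # A form feed (0x0c) is a typical leftover from PDF-to-text conversion.
    print(repr(strip_invalid_xml_chars("XE\x0c\n")))
```

Dropping the characters is the simplest choice. Replacing them with a space would instead keep words on either side of a page break from running together.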
Regards,
Georg
On Sun, Mar 21, 2010 at 5:58 PM, Ross wrote:
Hi all
I'm trying to import some text files. I'm mostly following Avi
Rappoport's tutorial. Some of my files cause Solr to crash while
indexing. I've narrowed it down to a very simple example.
I have a file named test.txt with one line. That line is the word
XXBLE and nothing else.
This is the c