Re: Solr crashing while extracting from very simple text file

2010-04-01 Thread Ross
Hi Chris, thanks for looking at this. I'm using Solr 1.4.0, including the Tika that's in the tgz file, which means Tika 0.4. I've now discovered that only two letters are required: a single line with XE will crash it. This fails: r...@gamma:/home/ross# hexdump -C test.txt 58 45 0a
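For anyone trying to reproduce this, the three bytes shown in the hexdump (58 45 0a, i.e. "XE" plus a newline) can be written exactly, without relying on an editor. This is only a sketch for creating the input file; the function name `write_repro_file` is made up for illustration, and it does not post anything to Solr.

```python
from pathlib import Path


def write_repro_file(path: str = "test.txt") -> bytes:
    """Write the three bytes from the hexdump (58 45 0a) and read them back."""
    data = bytes([0x58, 0x45, 0x0A])  # "X", "E", line feed
    target = Path(path)
    target.write_bytes(data)
    # Reading back confirms nothing extra (BOM, CRLF) ended up in the file.
    return target.read_bytes()


if __name__ == "__main__":
    content = write_repro_file()
    # Space-separated hex, comparable with the hexdump -C byte column.
    print(content.hex(" "))
```

Writing in binary mode avoids any newline translation, so the file should match the hexdump on any platform.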

Re: Solr crashing while extracting from very simple text file

2010-04-01 Thread Chris Hostetter
: Yes, please report this to the Tika project. Except that when I run "tika-app-0.6.jar" on a text file like the one Ross describes, I don't get the error he describes, which means it may be something off in how Solr is using Tika. Ross: I can't reproduce this error on the trunk using the exam

Re: Solr crashing while extracting from very simple text file

2010-04-01 Thread Erik Hatcher
Yes, please report this to the Tika project. Erik On Mar 31, 2010, at 9:31 PM, Ross wrote: Does anyone have any thoughts or suggestions on this? I guess it's really a Tika problem. Should I try to report it to the Tika project? I wonder if someone could try it to see if it's a genera

Re: Solr crashing while extracting from very simple text file

2010-03-31 Thread Ross
Does anyone have any thoughts or suggestions on this? I guess it's really a Tika problem. Should I try to report it to the Tika project? I wonder if someone could try it to see if it's a general problem or just me. I can reproduce it by firing up the nano editor, creating a file with XXBLE on one

Re: Solr crashing while extracting from very simple text file

2010-03-22 Thread Ross
I thought you might ask that :-) It's because the PDF files are scanned from paper documents and OCR'd to produce text. They still contain the image, so they are huge. The smaller files are about 40 MB and cause a Java out-of-heap-memory error. The larger files are getting close to 500 MB. I didn't have

Re: Solr crashing while extracting from very simple text file

2010-03-22 Thread Erik Hatcher
Why not feed the original PDF files in instead? Just curious if pdftotext is doing a better job than Tika's PDFBox stuff. Erik On Mar 22, 2010, at 9:30 AM, Ross wrote: Thanks Georg I don't think it's that because it crashes on a one word test file I create using the nano editor. I

Re: Solr crashing while extracting from very simple text file

2010-03-22 Thread Ross
Thanks Georg. I don't think it's that, because it crashes on a one-word test file I create using the nano editor. I don't think nano is adding anything extra. My real files are created by a Windows utility called pdftotext. I solved the problem by getting pdftotext to generate HTML files rather tha

Solr crashing while extracting from very simple text file

2010-03-22 Thread György Frivolt
Hi, I had a problem with indexing documents some months ago as well. I found that there were XML control characters in the documents and these were not handled by Solr. Maybe that is the case for you as well. Regards, Georg On Sun, Mar 21, 2010 at 5:58 PM, Ross wrote: > Hi all > > I'm tr
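Georg does not say which characters were involved or how he worked around them, so the following is only an illustrative sketch of one common approach: dropping code points that XML 1.0 does not allow before sending text to an XML-based indexer. The helper name `strip_invalid_xml_chars` is an assumption, not something from the thread, and (as Ross notes in his reply) this is probably not the cause of the XXBLE crash.

```python
import re

# Code points permitted by the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
# Everything else (other C0 controls, lone surrogates, U+FFFE/U+FFFF) is removed.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that are illegal in XML 1.0 removed."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    dirty = "XX\x00BLE\x0b\n"  # NUL and vertical tab are not valid XML 1.0 chars
    print(repr(strip_invalid_xml_chars(dirty)))
```

Removing the characters silently is a design choice; replacing them with a space or logging them may be preferable when the positions matter.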

Solr crashing while extracting from very simple text file

2010-03-21 Thread Ross
Hi all, I'm trying to import some text files. I'm mostly following Avi Rappoport's tutorial. Some of my files cause Solr to crash while indexing. I've narrowed it down to a very simple example. I have a file named test.txt with one line. That line is the word XXBLE and nothing else. This is the c