RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread David Thibault
ectory but forgot to update the war inside the example/webapps directory (that is inside Jetty). Hope this helps. Tommaso 2010/7/27 David Thibault > Alessandro & all, > > I was having the same issue with Tika crashing on certain PDFs. I also > noticed the bug where no content was

RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread David Thibault
Patch is more Stable. Until 4.0 will be released .... 2010/7/28 David Thibault > Yesterday I did get this working with version 4.0 from trunk. I haven't > fully tested it yet, but the content doesn't come through blank anymore, so > that's good. Would it be more stable to st

RE: Solr 3.1 and ExtractingRequestHandler resulting in blank content

2010-07-28 Thread David Thibault
uments and bring back the file name. Your app has to then use the file name. Solr/Lucene is not intended as a general-purpose content store, only an index. The ERH wiki page doesn't quite say this. It describes what the ERH does rather than what it does not do :) On Mon, Jul 26, 2010 at 12:0

RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread David Thibault
Hope this helps. Tommaso 2010/7/27 David Thibault > Alessandro & all, > > I was having the same issue with Tika crashing on certain PDFs. I also > noticed the bug where no content was extracted after upgrading Tika. > > When I went to the SOLR issue you link to below, I applied a

RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-27 Thread David Thibault
Alessandro & all, I was having the same issue with Tika crashing on certain PDFs. I also noticed the bug where no content was extracted after upgrading Tika. When I went to the SOLR issue you link to below, I applied all the patches, downloaded the Tika 0.8 jars, restarted tomcat, posted a f

Solr 3.1 and ExtractingRequestHandler resulting in blank content

2010-07-26 Thread David Thibault
Hello all, I’m working on a project with Solr. I had 1.4.1 working OK using ExtractingRequestHandler except that it was crashing on some PDFs. I noticed that Tika bundled with 1.4.1 was 0.4, which was kind of old. I decided to try updating to 0.7 as per the directions here: http://wiki.apac

Re: Indexing very large files.

2008-02-23 Thread David Thibault
er granularity of retrieval - for example a dictionary, thesaurus, > and > Encyclopedia probably have what you want, but how to get it quickly? > - post-processing - like high-lighting, can be a performance killer, as > the > search/replace scans the entire large file for matc

Re: Indexing very large files.

2008-02-21 Thread David Thibault
All, A while back I was running into an issue with a Java heap out of memory error while indexing large files. I figured out that was my own error due to a misconfiguration of my Netbeans memory settings. However, now that is fixed and I have stumbled upon a new error. When trying to upload file

Logging in Solr

2008-01-16 Thread David Thibault
All, I'm new to Solr and Tomcat and I'm trying to track down some odd errors. How do I set up Tomcat to do fine-grained Solr-specific logging? I have looked around enough to know that it should be possible to do per-webapp logging in Tomcat 5.5, but the details are hard to follow for a newbie. A

Re: Indexing very large files.

2008-01-16 Thread David Thibault
irection. > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Original Message > From: David Thibault <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Wednesday, January 16, 2008 1:31:23 PM > Subject: Re: Index

Re: Indexing very large files.

2008-01-16 Thread David Thibault
recommended way to talk to Solr. > > -Yonik > > On Jan 16, 2008 4:29 PM, David Thibault <[EMAIL PROTECTED]> > wrote: > > OK, I have now bumped my tomcat JVM up to 1024MB min and 1500MB > max. For > > some reason Walter's suggestion helped me get past the 8MB file up

Re: Indexing very large files.

2008-01-16 Thread David Thibault
.java:97) at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile(UploaderTest.java:95) Any more thoughts on possible causes? Best, Dave On 1/16/08, David Thibault <[EMAIL PROTECTED]> wrote: > > Walter and all, > > I had been bumping up the heap for my Java app (running outside of Tomcat)

Re: Indexing very large files.

2008-01-16 Thread David Thibault
M has run out of heap space. Increase the > heap space. That is an option on the "java" command. I set my heap to > 200 Meg and do it this way with Tomcat 6: > > JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh > > wunder > > On 1/16/08 8:33 AM, "David Thib

Re: Indexing very large files.

2008-01-16 Thread David Thibault
what > process gets the increased memory relative to the server. > > [EMAIL PROTECTED] > > > On Jan 16, 2008 11:33 AM, David Thibault <[EMAIL PROTECTED]> > wrote: > > > I tried raising the 1 under > > as well as and still no luck. I'm trying to

Re: Indexing very large files.

2008-01-16 Thread David Thibault
at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766) On 1/16/08, David Thibault <[EMAIL PROTECTED]> wrote: > > I think your PS might do the trick. My JVM doesn'

Re: Indexing very large files.

2008-01-16 Thread David Thibault
rather your Java > > JVM doesn't start with enough memory. i.e. -Xmx. > > > > In raw Lucene, I've indexed 240M files > > > > Best > > Erick > > > > > > On Jan 16, 2008 10:12 AM, David Thibault <[EMAIL PROTECTED]> >

Re: Indexing very large files.

2008-01-16 Thread David Thibault
All, I just found a thread about this on the mailing list archives because I'm troubleshooting the same problem. The kicker is that it doesn't take such large files to kill the StringBuilder. I have discovered the following: By using a text file made up of 3,443,464 bytes or less, I get no erro