Hi All,

I'm experiencing a similar problem to the others' in this thread.
I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to
apache-solr-4.0-2011-10-14_08-56-59.war, and then to
apache-solr-4.0-2011-10-30_09-00-00.war, to index ~5300 PDFs of various sizes
using the TikaEntityProcessor. Under the June build my indexing ran to
completion and was completely successful; the only problem was the readability
of the fulltext in highlighting, which was fixed in Tika 0.10 (TIKA-611). I
chose the October 14 build of Solr because Tika 0.10 had recently been included
(SOLR-2372).

On the same machine, without changing any memory settings, my first problem was
a PermGen error. Fine, I increased the PermGen space. I've also set the
"onError" parameter to "skip" for the TikaEntityProcessor (my data-config is
sketched at the bottom of this mail). Now I get several (6) pairs of:

  SEVERE: Exception thrown while getting data
  java.net.SocketTimeoutException: Read timed out
  SEVERE: Exception in entity : tika:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url <url removed> # 2975

And after ~3881 documents, with auto commit set unreasonably frequently, I
consistently get an Out of Memory error:

  SEVERE: Exception while processing: f document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space

The stack trace points to
org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
and
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).
The October 30 build performs identically. The funny thing is that monitoring
via JConsole doesn't reveal any memory issues. Because the Out of Memory error
did not occur with the June build, this leads me to believe that a bug has been
introduced into the code since then. Should I open an issue in JIRA?

Thanks,
Tricia

On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs <jacob...@gmail.com> wrote:
> Hi Erick,
>
> I am using Solr 3.3.0, but with 1.4.1 I have the same problems.
> The connector is a homemade program in the C# programming language and is
> posting via HTTP remote streaming (i.e.
> http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1)
> I'm using Tika to extract the content (it comes with Solr Cell).
>
> A possible problem is that the file stream needs to be closed by the client
> application after extracting, but it seems that something goes wrong when a
> Tika exception is thrown: the stream never leaves the memory. At least that
> is my assumption.
>
> What is the common way to extract content from office files (pdf, doc, rtf,
> xls etc.) and index them? To write a content extractor / validator yourself?
> Or is it possible to do this with Solr Cell without getting a huge memory
> consumption? Please let me know. Thanks in advance.
>
> Marc
>
> 2011/8/30 Erick Erickson <erickerick...@gmail.com>
>
> > What version of Solr are you using, and how are you indexing?
> > DIH? SolrJ?
> >
> > I'm guessing you're using Tika, but how?
> >
> > Best
> > Erick
> >
> > On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs <jacob...@gmail.com> wrote:
> > > Hi all,
> > >
> > > Currently I'm testing Solr's indexing performance, but unfortunately
> > > I'm running into memory problems.
> > > It looks like Solr is not closing the file stream after an exception,
> > > but I'm not really sure.
> > >
> > > The current system I'm using has 150GB of memory, and while I'm indexing
> > > the memory consumption is growing and growing (eventually more than 50GB).
> > > In the attached graph I indexed about 70k office documents (pdf, doc,
> > > xls etc.) and between 1 and 2 percent of them throw an exception.
> > > The commits are after 64MB, after 60 seconds, or after a job (there are
> > > 6 evenly divided jobs).
> > >
> > > After indexing, the memory consumption isn't dropping. Even after an
> > > optimize command it's still there.
> > > What am I doing wrong? I can't imagine I'm the only one with this
> > > problem. Thanks in advance!
> > >
> > > Kind regards,
> > >
> > > Marc
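
P.S. For context, my data-config is along these lines. This is only a
simplified sketch: the data source type, the url, and the field mappings shown
here are placeholders rather than my actual values; the parts that matter for
this report are the TikaEntityProcessor, the entity name "tika" (which is what
appears in the log messages above), and onError="skip".

  <dataConfig>
    <!-- Binary data source used to fetch the PDFs over HTTP
         (placeholder name; in my real config the URLs come from an
         outer entity that iterates over the document list) -->
    <dataSource type="BinURLDataSource" name="bin"/>
    <document>
      <!-- onError="skip" tells DIH to skip documents that Tika
           fails to parse instead of aborting the import -->
      <entity name="tika" processor="TikaEntityProcessor"
              dataSource="bin" format="text" onError="skip"
              url="http://example.com/docs/some-document.pdf">
        <!-- "text" is the extracted body; meta="true" pulls Tika metadata -->
        <field column="text" name="fulltext"/>
        <field column="title" name="title" meta="true"/>
        <field column="Author" name="author" meta="true"/>
      </entity>
    </document>
  </dataConfig>

Note that onError="skip" only covers extraction exceptions inside the entity;
it doesn't help once the OutOfMemoryError in PDFBox kills the whole import.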