Nutch + Solr - Indexer causes java.lang.OutOfMemoryError: Java heap space
Hello everyone,

I have configured my two servers to run in distributed mode (with Hadoop). My crawling setup is Nutch 2.2.1 with HBase as storage, and Solr running under Tomcat. The problem occurs every time I reach the last step, i.e. indexing the data from HBase into Solr: the job then fails with the error below *[1]*.

I tried adding CATALINA_OPTS (or JAVA_OPTS) to Tomcat's catalina.sh script and starting the server with that script, but it didn't help:

CATALINA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -Xms1g -Xmx6000m -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"

I also added the properties below *[2]* to my nutch-site.xml file, but it ended with OutOfMemory again. Can you help me, please?

*[1]*
2014-09-06 22:52:50,683 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587)
    at java.lang.StringBuffer.append(StringBuffer.java:332)
    at java.io.StringWriter.write(StringWriter.java:77)
    at org.apache.solr.common.util.XML.escape(XML.java:204)
    at org.apache.solr.common.util.XML.escapeCharData(XML.java:77)
    at org.apache.solr.common.util.XML.writeXML(XML.java:147)
    at org.apache.solr.client.solrj.util.ClientUtils.writeVal(ClientUtils.java:161)
    at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:129)
    at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:355)
    at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:271)
    at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:66)
    at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:94)
    at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:104)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:247)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:96)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
    at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1793)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)

*[2]*
http.content.limit = 15000
    The length limit for downloaded content using the http protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting. For our purposes it is twice the default, for parsing big pages: 128 * 1024.

indexer.max.tokens = 10

http.timeout = 5
    The default network timeout, in milliseconds.

solr.commit.size = 100
    Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory. NOTE: It does not explicitly trigger a server side commit.
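For readability, here is a minimal sketch of how the *[2]* properties above would sit in nutch-site.xml. The values are simply the ones listed in the post, not recommendations; the <configuration>/<property> layout is the standard Nutch (Hadoop-style) configuration form.

<configuration>
  <!-- sketch only: values copied from the list above, not tuned recommendations -->
  <property>
    <name>http.content.limit</name>
    <value>15000</value>
  </property>
  <property>
    <name>indexer.max.tokens</name>
    <value>10</value>
  </property>
  <property>
    <name>http.timeout</name>
    <value>5</value>
  </property>
  <property>
    <name>solr.commit.size</name>
    <value>100</value>
  </property>
</configuration>

Of the four, solr.commit.size is the one whose description ties most directly to the indexer's memory use, since it controls how many documents Nutch buffers into a single Solr update batch.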
Apache Solr 4 - after 1st commit the index does not grow
I have written my own plugin for Apache Nutch 2.2.1 to crawl images, videos and podcasts from selected sites (I have 180 URLs in my seed). I put this metadata into an HBase store and now I want to save it to the index (Solr). I have a lot of metadata to save (webpages + images + videos + podcasts). I am using the Nutch script bin/crawl for the whole process (inject, generate, fetch, parse... and finally solrindex and dedup), but I have one problem.

When I run this script for the first time, approximately 6000 documents are stored in the index (let's say 3700 docs for images, 1700 for webpages, and the rest for videos and podcasts). That is OK... but when I run the script a second time, a third time, and so on, the number of documents in the index does not increase (there are still 6000 documents), while the number of rows stored in the HBase table keeps growing (there are 97383 rows now).

Do you know where the problem is, please? I have been fighting with this problem for a really long time and I don't know...

If it helps, this is my solrconfig.xml: http://pastebin.com/uxMW2nuq and this is my nutch-site.xml: http://pastebin.com/4bj1wdmT
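For context, a custom indexing plugin like the one described above only takes effect if it is listed in the plugin.includes property of nutch-site.xml. A minimal sketch, with the made-up id "my-media-plugin" standing in for the real plugin name (the rest of the value is roughly the Nutch default):

<property>
  <name>plugin.includes</name>
  <!-- "my-media-plugin" is a placeholder for the custom plugin's id -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|my-media-plugin|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>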
Re: Apache Solr 4 - after 1st commit the index does not grow
When I look into the log, there is:

SEVERE: auto commit error...: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2668)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2834)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2814)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:529)
    at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
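The trace comes from Solr's auto commit thread (CommitTracker), which is driven by the <autoCommit> block in solrconfig.xml. For orientation only, a typical Solr 4 configuration of that block looks roughly like this; the values are illustrative, not the ones from this setup:

<!-- illustrative only: a typical Solr 4 autoCommit block, values vary per setup -->
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

That is the mechanism firing in the CommitTracker.run frame above; the commit itself fails only because the IndexWriter has already hit an OutOfMemoryError earlier, as the exception message says.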
Re: Apache Solr 4 - after 1st commit the index does not grow
OK, I have removed the OutOfMemory problem by increasing the JVM parameters... and now I have another problem. My index had been growing since yesterday evening; the number of documents was increasing (I run the bin/crawl script every 3 hours and I have 27040 documents now), but the last increase was 6 hours ago. Why has it stopped growing again? You can look at my Solr here: http://ir-dev.lmcloud.vse.cz:8082/solr/#/~logging

The log says:

java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0x at char #2800441, byte #3096524)

What is it? How can I solve it? Does anyone have any idea?
Re: Apache Solr 4 - after 1st commit the index does not grow
As far as I can see, this is the same problem as in an older post - http://lucene.472066.n3.nabble.com/strange-utf-8-problem-td3094473.html ...but that one never got any response.